Unleashing the Power of Deequ for Efficient Spark Data Analysis

Photo from Unsplash: https://unsplash.com/fr/photos/LqKhnDzSF-8

In the big data world, ensuring data quality is especially important because of the large volume and variety of data being generated and collected. Poor data quality can lead to incorrect insights, reduced efficiency, increased costs, and a negative impact on decision-making. On the other hand, high-quality data enables organizations to make informed decisions, improve processes, and gain a competitive advantage.

However, ensuring data quality in big data environments can be challenging due to the sheer scale and complexity of the data. Organizations need to implement automated data quality checks and data profiling processes to identify and address data quality issues in a timely and efficient manner, and they must establish data quality policies, standards, and procedures to ensure the accuracy, completeness, and consistency of their data.

They also need to invest in the right technology and tools to support their data quality efforts, such as data quality management software, data integration tools, and data governance frameworks. Moreover, companies can create procedures from scratch or use a ready-to-use solution… like Deequ.

What is Deequ?

Deequ is a library built on top of Apache Spark for defining “unit tests for data” which measure data quality in large datasets. It offers a complete testing toolbox covering:

  • DataFrame-level properties (such as its size).
  • Integrity constraints (completeness, uniqueness, and nullability).
  • Checks over values (values within a specific range, values drawn from a distinct set, etc.).

After you define the checks to run on your data and integrate them into a test suite, they’ll run and generate individual reports. Every report will have a:

  • Status: whether the check succeeded or failed.
  • Message: the description of your check and, when it fails, the reason for the failure.
  • Rule violation rate: the percentage of rows that don’t respect the predefined constraint.

The check results can be saved to a logging or metrics system that feeds a dashboard, or simply used as an early detection mechanism that blocks the dataframe write when the data doesn’t fulfill the required constraints.
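
For instance, Deequ ships with a metrics repository that can persist the metrics of every run for later dashboarding. Here is a minimal sketch of that idea, assuming a SparkSession named spark and the products dataframe built later in this article; the file path, the tags, and the check itself are only illustrative:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.fs.FileSystemMetricsRepository

// persist the metrics of each run to a JSON file that a dashboard can read later
val repository = FileSystemMetricsRepository(spark, "/tmp/deequ-metrics.json")
val resultKey = ResultKey(System.currentTimeMillis(), Map("dataset" -> "products"))

val resultWithMetrics = VerificationSuite()
  .onData(dataFrame)
  .useRepository(repository)
  .saveOrAppendResult(resultKey)
  .addCheck(Check(CheckLevel.Error, "example check").isComplete("name"))
  .run()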

Example

This example will be written in Scala and built with SBT. For Python users, PyDeequ is your friend. It uses the same function names, and the PyDeequ documentation is quite extensive.

Add Deequ as a dependency

The available versions can be found on Maven Central.

// build.sbt
...
libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1"
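
For reference, a minimal build.sbt sketch could look as follows; the Scala and Spark versions below are only illustrative and should match your cluster, with Spark marked as Provided since the runtime usually supplies it:

// build.sbt (illustrative versions)
scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.1.2" % Provided,
  "com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1"
)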

Create a product class and specify integrity constraints

case class Product(
  name: String,
  price: Double,
  availability: String,
  url: String
)

The business requirements are the following:

  1. As we’re a small firm, the number of products should always be between 4 and 10.
  2. The product name should never be empty, should contain at least four characters, and should act as a unique identifier.
  3. The product price should never be null and should contain strictly positive values.
  4. The availability should either be true or false, but never be null.
  5. At least half of the rows should contain a hyperlink, and the url column should be unique.

⚠️ If any of those rules are violated, the data quality checks should fail and the new rows shouldn’t be inserted.

Create a SparkSession and run our tests

As we said before, Deequ runs on top of Spark. It takes as input the dataframe we want to test and the constraints. Let’s create a sample dataframe based on the class we’ve just created.

import org.apache.spark.sql.{DataFrame, SparkSession}

object DeequExample {

  def main(args: Array[String]): Unit = {
    // create a SparkSession
    val spark = SparkSession.builder().appName("DeequExample").getOrCreate()

    // create a sample dataframe
    val data = Seq(
      Product("Table", 199.99, "true", "my_brand.com/table"),
      Product("Candle", 12.78, "false", ""),
      Product("Chair", 49.00, "true", null),
      Product("Sofa", 1499.99, null, "my_brand.com/sofa"),
      Product("Carpet", -149.99, "true", "my_brand.com/sofa")
    )
    val dataFrame: DataFrame = spark.createDataFrame(data)

    /*
      Testing code
    */

    /*
      Controlling test results
    */

    // stop the SparkSession
    spark.stop()
  }
}

Let’s keep in mind the constraints we provided and write our test suite based on them.

import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel}
import org.apache.spark.sql.{DataFrame, SparkSession}

object DeequExample {

  def main(args: Array[String]): Unit = {
    // create a SparkSession
    val spark = SparkSession.builder().appName("DeequExample").getOrCreate()

    // create a sample dataframe
    val data = Seq(
      Product("Table", 199.99, "true", "my_brand.com/table"),
      Product("Candle", 12.78, "false", ""),
      Product("Chair", 49.00, "true", null),
      Product("Sofa", 1499.99, null, "my_brand.com/sofa"),
      Product("Carpet", -149.99, "true", "my_brand.com/sofa")
    )
    val dataFrame: DataFrame = spark.createDataFrame(data)

    // create a verification suite
    val verificationResult: VerificationResult = VerificationSuite()
      .onData(dataFrame)
      .addCheck(
        Check(CheckLevel.Error, "dataframe level checks")
          .hasSize(_ >= 4)
          .hasSize(_ <= 10)
      )
      .addCheck(
        Check(CheckLevel.Error, "product name check")
          .hasMinLength("name", _ >= 4.0)
          .isComplete("name")
          .isUnique("name")
      )
      .addCheck(
        Check(CheckLevel.Error, "product price check")
          .isPositive("price")
          .isComplete("price")
      )
      .addCheck(
        Check(CheckLevel.Error, "product availability check")
          .isContainedIn("availability", Array(true.toString, false.toString))
          .isComplete("availability")
      )
      .addCheck(
        Check(CheckLevel.Warning, "product url check")
          .containsURL("url", _ >= 0.5)
          .isUnique("url")
      )
      .run()

    /*
      Controlling test results
    */

    // stop the SparkSession
    spark.stop()
  }
}

It’s possible to create fewer checks than we did, or to group all the tests into a single check, but for better readability we create one check per business constraint specified above. It’s good practice to get detailed insights for each check.

Each addCheck call takes a Check and returns a new VerificationRunBuilder, so the calls can be chained. The run function executes all the checks and returns a VerificationResult object.

This object will contain all the test results. We can iterate over it to get the result for each test we added.

import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.constraints.ConstraintStatus
import org.apache.log4j.Logger
import org.apache.spark.sql.{DataFrame, SparkSession}

object DeequExample {

  // creating the logger
  val logger: Logger = Logger.getLogger(this.getClass)

  def main(args: Array[String]): Unit = {
    // create a SparkSession
    val spark = SparkSession.builder().appName("DeequExample").getOrCreate()

    // create a sample dataframe
    val data = Seq(
      Product("Table", 199.99, "true", "my_brand.com/table"),
      Product("Candle", 12.78, "false", ""),
      Product("Chair", 49.00, "true", null),
      Product("Sofa", 1499.99, null, "my_brand.com/sofa"),
      Product("Carpet", -149.99, "true", "my_brand.com/sofa")
    )
    val dataFrame: DataFrame = spark.createDataFrame(data)

    // create a verification suite
    val verificationResult: VerificationResult = VerificationSuite()
      .onData(dataFrame)
      .addCheck(
        Check(CheckLevel.Error, "dataframe level checks")
          .hasSize(_ >= 4)
          .hasSize(_ <= 10)
      )
      .addCheck(
        Check(CheckLevel.Error, "product name check")
          .hasMinLength("name", _ >= 4.0)
          .isComplete("name")
          .isUnique("name")
      )
      .addCheck(
        Check(CheckLevel.Error, "product price check")
          .isPositive("price")
          .isComplete("price")
      )
      .addCheck(
        Check(CheckLevel.Error, "product availability check")
          .isContainedIn("availability", Array(true.toString, false.toString))
          .isComplete("availability")
      )
      .addCheck(
        Check(CheckLevel.Warning, "product url check")
          .containsURL("url", _ >= 0.5)
          .isUnique("url")
      )
      .run()

    // collect the individual constraint results of all the checks
    val resultsForAllConstraints = verificationResult.checkResults
      .flatMap { case (_, checkResult) => checkResult.constraintResults }

    // build a readable report of the violated constraints
    val violatedConstraints = resultsForAllConstraints
      .filter(_.status == ConstraintStatus.Failure)
      .foldLeft("Constraint violated ->") { (accum, elem) =>
        accum +
          s"""
             | Constraint -> ${elem.constraint}:
             | Status -> ${elem.status}
             | Message -> ${elem.message.getOrElse("")}
             |""".stripMargin
      }

    // log the violations and block the write if any check did not pass
    if (verificationResult.status != CheckStatus.Success) {
      logger.error(violatedConstraints)
    }
    require(verificationResult.status == CheckStatus.Success, violatedConstraints)

    dataFrame.write.mode("Overwrite").format("parquet").save("./tables/products")

    // stop the SparkSession
    spark.stop()
  }
}

The assertion we added is responsible for blocking the write operation if any of the constraints we added are violated.

Data from tests that violated constraints

The custom log we added printed the violated constraints. Let’s analyze each of them:

  • The availability constraints force a value of either true or false and forbid nulls: only 80% of rows satisfy them instead of the required 100%, because the sofa row has a null availability.
  • The url constraints require a hyperlink in at least half of the rows (only 40% of rows contain one, below the 50% threshold) and unique values (the sofa and carpet rows share the same url, so only 50% of the non-null values are unique instead of the required 100%).
  • The price constraint requires a strictly positive price: only 80% of rows satisfy it instead of the required 100%, because the carpet has a negative price.

Once we clean the data with a few transformation rules, the tests pass.

val data = Seq(
  Product("Table", 199.99, "true", "my_brand.com/table"),
  Product("Candle", 12.78, "false", ""),
  Product("Chair", 49.00, "true", "my_brand.com/chair"),
  Product("Sofa", 1499.99, "true", "my_brand.com/sofa"),
  Product("Carpet", 149.99, "true", "my_brand.com/carpet")
)

Test results containing valid data

The constraint definitions can also be made generic per table. For instance, a generic Table class describing each table’s uniqueness keys and nullability conditions could be used as input to Deequ, so that the same code handles those conditions automatically for many tables.
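
As an illustration of that idea, here is a minimal sketch; the TableSpec class and its fields are hypothetical, not part of Deequ, and only show how a generic description could be folded into a Deequ Check:

import com.amazon.deequ.checks.{Check, CheckLevel}

// hypothetical, project-specific description of a table's integrity rules
case class TableSpec(name: String, uniqueKeys: Seq[String], nonNullableColumns: Seq[String])

// build the same kind of Deequ check for any table described by a TableSpec
def integrityCheck(spec: TableSpec): Check = {
  val base = Check(CheckLevel.Error, s"${spec.name} integrity checks")
  val withUniqueness = spec.uniqueKeys.foldLeft(base)((check, column) => check.isUnique(column))
  spec.nonNullableColumns.foldLeft(withUniqueness)((check, column) => check.isComplete(column))
}

// the same code then handles the products table and any other table
val productsSpec = TableSpec(
  name = "products",
  uniqueKeys = Seq("name", "url"),
  nonNullableColumns = Seq("name", "price", "availability")
)
val productsCheck = integrityCheck(productsSpec)

The resulting check can then be passed to the verification suite with addCheck, exactly like the hand-written checks above.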

Data profiling

Beyond constraint verification, Deequ can also compute statistics over the data we’re dealing with. Value ranges, percentiles, standard deviations, and other statistics can be calculated to give maximum value to the business (to detect outliers, for instance). Deequ’s data profiling module is rich in descriptive statistics.
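
For example, a minimal profiling sketch on the products dataframe could look like this, assuming the spark and dataFrame values from the example above; the printed format is only illustrative:

import com.amazon.deequ.profiles.{ColumnProfilerRunner, NumericColumnProfile}

// compute per-column statistics: completeness, approximate distinct count, inferred type, ...
val profilingResult = ColumnProfilerRunner()
  .onData(dataFrame)
  .run()

profilingResult.profiles.foreach { case (columnName, profile) =>
  println(s"Column '$columnName': completeness ${profile.completeness}, " +
    s"approx. distinct values ${profile.approximateNumDistinctValues}")
}

// numeric columns additionally expose minimum, maximum, mean and standard deviation
profilingResult.profiles("price") match {
  case numeric: NumericColumnProfile =>
    println(s"price: min ${numeric.minimum}, max ${numeric.maximum}, " +
      s"mean ${numeric.mean}, stddev ${numeric.stdDev}")
  case _ => ()
}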

Based on valid data, Deequ can also suggest constraints derived from the values it observes, so that integrating them into the tests for future data is easy.
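
A short sketch of that feature, run against the cleaned products dataframe with Deequ’s default rule set; the printed output is again only illustrative:

import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

// let Deequ analyze the data and propose constraints, together with the code to enforce them
val suggestionResult = ConstraintSuggestionRunner()
  .onData(dataFrame)
  .addConstraintRules(Rules.DEFAULT)
  .run()

suggestionResult.constraintSuggestions.foreach { case (columnName, suggestions) =>
  suggestions.foreach { suggestion =>
    println(s"Suggestion for '$columnName': ${suggestion.description}")
    println(s"  corresponding code: ${suggestion.codeForConstraint}")
  }
}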

Conclusion

Deequ provides several benefits for Spark data analysis, including improved data quality through automated data validation and faster data processing through parallelization on Spark. These benefits result in increased accuracy, reduced processing time, and improved overall efficiency in data analysis tasks.

To be notified about my upcoming articles, you can stay connected with me. Follow me or sign up for an email subscription to get notified when I post something valuable. Enjoy learning!

Further Reading

  • [Spark caching, when and how?](https://medium.com/@omarlaraqui/caching-in-spark-when-and-how-367e77db454d)
  • [A Beginner’s Guide to Spark Execution Plan](https://medium.com/@omarlaraqui/a-beginners-guide-to-spark-execution-plan-b11f441005c9)
  • [Understand Spark Execution Modes](https://medium.com/@omarlaraqui/understand-spark-execution-modes-cc9978290dd6)
  • [The Medallion Architecture](https://medium.com/@omarlaraqui/the-medallion-architecture-21fe878d1aca)
  • [Build efficient tests for your Spark data pipeline using BDD with Cucumber](https://medium.com/@omarlaraqui/build-efficient-tests-for-your-spark-data-pipeline-using-bdds-with-cucumber-61f1bdc08faf)
