DDQ is a small library for checking constraints on Spark data structures. It can be used to ensure a certain level of data quality, especially when data is imported continuously.
To use DDQ, add it as a dependency to your project using JitPack.io. For sbt, add the following to your build.sbt:
resolvers += "jitpack" at "https://jitpack.io"
libraryDependencies += "com.github.FRosner" % "drunken-data-quality" % "x.y.z"
If you are not using any of the dependency management systems supported by JitPack, feel free to download one of the compiled artifacts in the release section. Alternatively you may of course also build from source.
import java.text.SimpleDateFormat

import de.frosner.ddq._

// Load the tables to check from the SQL context
val customers = sqlContext.table("customers")
val contracts = sqlContext.table("contracts")

// Chain the constraints, then execute them with run()
Check(customers)
  .hasNumRowsEqualTo(100000)
  .isNeverNull("customer_id")
  .hasUniqueKey("customer_id")
  .satisfies("customer_age > 0")
  .isConvertibleToDate("customer_birthday", new SimpleDateFormat("yyyy-MM-dd"))
  .hasForeignKey(contracts, "customer_id" -> "contract_owner_id")
  .run()
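To give a feeling for what some of these constraints express, here is a minimal plain-Scala sketch that applies the same ideas to in-memory collections. It is not how DDQ is implemented and does not use Spark. The `Customer` and `Contract` case classes, the `ConstraintSketch` object, and its helper functions are hypothetical names made up for this illustration, and the semantics are assumed from the names of the checks above (in particular, the foreign key check is assumed to mean that every checked value also appears in the referenced column).

```scala
// Illustration only: hypothetical types and helpers, not part of DDQ.
case class Customer(id: Option[Int], age: Int)
case class Contract(ownerId: Int)

object ConstraintSketch {
  // "Never null": every row has a value in the given column.
  def isNeverNull[A, B](rows: Seq[A])(column: A => Option[B]): Boolean =
    rows.forall(row => column(row).isDefined)

  // "Unique key": no two rows share the same value in the given column.
  def hasUniqueKey[A, B](rows: Seq[A])(column: A => B): Boolean =
    rows.map(column).distinct.size == rows.size

  // "Satisfies": every row fulfils the given predicate.
  def satisfies[A](rows: Seq[A])(predicate: A => Boolean): Boolean =
    rows.forall(predicate)

  // "Foreign key" (assumed meaning): every referencing value exists among the referenced keys.
  def hasForeignKey[K](references: Seq[K], keys: Seq[K]): Boolean = {
    val known = keys.toSet
    references.forall(known.contains)
  }

  def main(args: Array[String]): Unit = {
    val customers = Seq(Customer(Some(1), 34), Customer(Some(2), 51))
    val contracts = Seq(Contract(1), Contract(2))

    println("never null:  " + isNeverNull(customers)(_.id))
    println("unique key:  " + hasUniqueKey(customers)(_.id))
    println("satisfies:   " + satisfies(customers)(_.age > 0))
    println("foreign key: " + hasForeignKey(customers.flatMap(_.id.toList), contracts.map(_.ownerId)))
  }
}
```

With this sample data all four lines report `true`. Adding a second customer with id `Some(1)`, or one with id `None`, makes the corresponding check fail, which is the kind of problem a DDQ check is meant to surface on real tables.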
- Frank Rosner (Creator)
- Slavo N. (Contributor)
This project is licensed under the Apache License Version 2.0. For details please see the file called LICENSE.