FRosner/drunken-data-quality

Drunken Data Quality (DDQ)

Description

DDQ is a small library for checking constraints on Spark data structures. Use it to verify that data meets the quality you expect, especially when data is imported continuously.

Getting DDQ

To use DDQ, add it as a dependency to your project through JitPack.io. In your build.sbt, add the following, replacing x.y.z with the release version you want:

resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies += "com.github.FRosner" % "drunken-data-quality" % "x.y.z"

If you are not using one of the dependency management systems supported by JitPack, you can download a compiled artifact from the release section, or build DDQ from source.

Using DDQ

import java.text.SimpleDateFormat

import de.frosner.ddq._

// Load the tables to check
val customers = sqlContext.table("customers")
val contracts = sqlContext.table("contracts")

// Chain the constraints, then evaluate them all with run()
Check(customers)
  .hasNumRowsEqualTo(100000)
  .isNeverNull("customer_id")
  .hasUniqueKey("customer_id")
  .satisfies("customer_age > 0")
  .isConvertibleToDate("customer_birthday", new SimpleDateFormat("yyyy-MM-dd"))
  .hasForeignKey(contracts, "customer_id" -> "contract_owner_id")
  .run()
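
To give a feel for what constraints such as isNeverNull, hasUniqueKey and satisfies express, here is a rough sketch of equivalent checks written by hand against the plain Spark DataFrame API. This is not DDQ's implementation; the object ManualChecks and its method names are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helpers, NOT part of DDQ. Each returns true when the constraint holds.
object ManualChecks {

  // Holds when no row has a null value in the given column.
  def neverNull(df: DataFrame, column: String): Boolean =
    df.filter(df(column).isNull).count() == 0

  // Holds when the number of distinct values in the column equals the number of rows.
  def uniqueKey(df: DataFrame, column: String): Boolean =
    df.select(column).distinct().count() == df.count()

  // Holds when every row passes the SQL predicate.
  // Rows where the predicate evaluates to null are dropped by filter,
  // so they count as violations here.
  def satisfies(df: DataFrame, predicate: String): Boolean =
    df.filter(predicate).count() == df.count()
}
```

Each helper triggers its own Spark job, for example ManualChecks.neverNull(customers, "customer_id"). Writing many such checks by hand quickly gets repetitive, which is the use case the chained Check(...) style above is meant to cover.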

Authors

License

This project is licensed under the Apache License Version 2.0. For details please see the file called LICENSE.