A library to store metadata of relational databases including the schema, statistics, and integrity constraints.

davidimmhahn/metadata-ms

 
 


Metadata Management System


Architecture overview

Metadata of relational datasets (e.g., column statistics and inclusion dependencies) are useful for data-oriented tasks, such as query processing, data mining, and data integration. Data profiling techniques (as provided, for instance, by Metanome) determine such metadata for a given dataset. However, once the metadata have been acquired, they need to be processed further. In particular, it is highly beneficial to integrate and combine the different types of metadata and to let users explore them interactively.

This is where the Metadata Management System (MDMS for short) comes into play. It stores metadata in various persistence layers (Java serialization, SQLite, and Cassandra as of now) and thereby integrates the different types of metadata. Moreover, MDMS is intended to complement this persistence layer with an analytical layer that exposes a query language and provides various data mining operators for exploring the metadata.
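To illustrate what persistence via Java serialization means in general, here is a minimal sketch that round-trips one metadata object through a byte array. The class `ColumnStatistics` and the helpers `store` and `load` are hypothetical names made up for this example; they are not taken from the MDMS code base and say nothing about how MDMS lays out its stores.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSketch {

    /** Hypothetical metadata object; not an MDMS class. */
    public static class ColumnStatistics implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String column;
        public final long distinctValues;
        public final long nullValues;

        public ColumnStatistics(String column, long distinctValues, long nullValues) {
            this.column = column;
            this.distinctValues = distinctValues;
            this.nullValues = nullValues;
        }
    }

    /** Serializes an object into a byte array using standard Java serialization. */
    public static byte[] store(Serializable object) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        }
        return bytes.toByteArray();
    }

    /** Reads back an object that was written by store(...). */
    public static Object load(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ColumnStatistics original = new ColumnStatistics("customers.city", 2, 0);
        ColumnStatistics copy = (ColumnStatistics) load(store(original));
        System.out.println(copy.column + ": " + copy.distinctValues + " distinct, "
                + copy.nullValues + " nulls");
    }
}
```

A file-based store would write the same bytes to disk instead of keeping them in memory; SQLite and Cassandra persistence replace this step with database writes.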

Usage notes

Besides providing a library for metadata management, the MDMS ships a set of command-line utilities for managing metadata stores. Running any tool without parameters prints its usage.

Create a metadata store. Creating and initializing a metadata store is the first step; all metadata are subsequently managed in this store. To create a new metadata store, run the main class de.hpi.isg.mdms.tools.apps.CreateMetadataStoreApp.

Import a database schema. We provide a tool to automatically extract the basic schema information (tables, columns) of a database that is represented by CSV files. Importing such a schema is necessary (i) to configure data profiling algorithms appropriately, e.g., to define a set of schema elements to be profiled, and (ii) to integrate various metadata types by having them reference the imported schema elements. To import a schema from a set of CSV files, run the main class de.hpi.isg.mdms.tools.apps.CreateSchemaForCsvFilesApp.
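As a rough illustration of what "basic schema information" from CSV files can look like, the following sketch derives one table per file and one column per header field. It is not the implementation of CreateSchemaForCsvFilesApp: the class `CsvSchemaSketch`, the assumption that every file has a header row, and the naive splitting (no quoted separators) are all simplifications made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class CsvSchemaSketch {

    /** Derives a table name from a file name by dropping the extension (assumption). */
    public static String tableName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    /** Splits a header line into trimmed column names; quoted separators are not handled. */
    public static List<String> columnNames(String headerLine, String separator) {
        List<String> columns = new ArrayList<>();
        for (String field : headerLine.split(Pattern.quote(separator), -1)) {
            columns.add(field.trim());
        }
        return columns;
    }

    public static void main(String[] args) {
        // In-memory stand-ins for CSV files: file name -> file content.
        Map<String, String> files = new LinkedHashMap<>();
        files.put("customers.csv", "id,name,city\n1,Ada,Berlin\n");
        files.put("orders.csv", "order_id,customer_id,total\n10,1,9.99\n");

        for (Map.Entry<String, String> file : files.entrySet()) {
            String headerLine = file.getValue().split("\n", 2)[0];
            System.out.println(tableName(file.getKey()) + " -> " + columnNames(headerLine, ","));
        }
        // prints:
        // customers -> [id, name, city]
        // orders -> [order_id, customer_id, total]
    }
}
```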

Fill the metadata store. Extracting metadata from databases is the task of data profiling tools. This task is orthogonal to the goals of the MDMS, which aims at managing metadata, not discovering them. While the MDMS in general offers APIs to interact with data profiling algorithms, we also provide tools to import metadata from the Metanome data profiling tool. To do so, run the main classes de.hpi.isg.mdms.tools.apps.MetanomeDependencyImportApp (for functional dependencies, inclusion dependencies, and unique column combinations) and de.hpi.isg.mdms.tools.apps.MetanomeStatisticsImportApp (for column statistics).
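For readers unfamiliar with the three dependency types named above, here is a small sketch that checks each of them on in-memory rows. It only illustrates the definitions; it is neither a profiling algorithm like those in Metanome nor part of the MDMS API. The class `DependencySketch` and the sample tables are invented for this example, and all values are assumed to be non-null.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DependencySketch {

    /** Projects a row onto the given column indices. */
    static List<String> project(String[] row, int[] columns) {
        String[] values = new String[columns.length];
        for (int i = 0; i < columns.length; i++) {
            values[i] = row[columns[i]];
        }
        return Arrays.asList(values);
    }

    /** Unique column combination: no two rows share the same values in the given columns. */
    public static boolean isUnique(List<String[]> rows, int... columns) {
        Set<List<String>> seen = new HashSet<>();
        for (String[] row : rows) {
            if (!seen.add(project(row, columns))) return false;
        }
        return true;
    }

    /** Functional dependency lhs -> rhs: rows that agree on lhs also agree on rhs. */
    public static boolean holdsFd(List<String[]> rows, int[] lhs, int rhs) {
        Map<List<String>, String> seen = new HashMap<>();
        for (String[] row : rows) {
            String previous = seen.putIfAbsent(project(row, lhs), row[rhs]);
            if (previous != null && !previous.equals(row[rhs])) return false;
        }
        return true;
    }

    /** Inclusion dependency: every value of the dependent column occurs in the referenced column. */
    public static boolean holdsInd(List<String[]> dependent, int depColumn,
                                   List<String[]> referenced, int refColumn) {
        Set<String> referencedValues = new HashSet<>();
        for (String[] row : referenced) referencedValues.add(row[refColumn]);
        for (String[] row : dependent) {
            if (!referencedValues.contains(row[depColumn])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // customers(id, name, city) and orders(order_id, customer_id)
        List<String[]> customers = Arrays.asList(
                new String[]{"1", "Ada", "Berlin"},
                new String[]{"2", "Bob", "Berlin"},
                new String[]{"3", "Ada", "Paris"});
        List<String[]> orders = Arrays.asList(
                new String[]{"10", "1"},
                new String[]{"11", "1"},
                new String[]{"12", "3"});

        System.out.println("id unique: " + isUnique(customers, 0));                    // true
        System.out.println("name unique: " + isUnique(customers, 1));                  // false
        System.out.println("name -> city: " + holdsFd(customers, new int[]{1}, 2));    // false
        System.out.println("orders.customer_id in customers.id: "
                + holdsInd(orders, 1, customers, 0));                                  // true
    }
}
```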

Analyze metadata. This phase is currently in development. Some preview functionality can be found in the main classes de.hpi.isg.mdms.java.apps.PrimaryKeyClassifier and de.hpi.isg.mdms.java.apps.ForeignKeyClassifier (for PK and FK classification) and de.hpi.isg.mdms.flink.apps.KmeansUccsApp and de.hpi.isg.mdms.flink.apps.AprioriUccsApp (for data mining on unique column combinations).

Use a client. The MDMS currently offers two interfaces. The first is a CLI that offers all of the functionality described above; just run the main class de.hpi.isg.mdms.cli.apps.MDMSCliApp. The second is Apache Zeppelin, through which this CLI can also be used; check out metadata-ms-on-zeppelin.
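The steps above can be summarized as a sequence of main classes to run. The sketch below only assembles and prints the corresponding `java -cp` command lines with no tool parameters, which, as noted above, makes each tool explain its usage. The class `MdmsToolCommands` and the default classpath `mdms-tools.jar` are placeholders invented for this example; substitute the classpath of your own build.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MdmsToolCommands {

    /** Main classes of the command-line tools, in the order described in the usage notes. */
    public static final List<String> WORKFLOW = Arrays.asList(
            "de.hpi.isg.mdms.tools.apps.CreateMetadataStoreApp",
            "de.hpi.isg.mdms.tools.apps.CreateSchemaForCsvFilesApp",
            "de.hpi.isg.mdms.tools.apps.MetanomeDependencyImportApp",
            "de.hpi.isg.mdms.tools.apps.MetanomeStatisticsImportApp");

    /** Builds the command that starts a tool without parameters, so that it prints its usage. */
    public static List<String> usageCommand(String classpath, String mainClass) {
        return new ArrayList<>(Arrays.asList("java", "-cp", classpath, mainClass));
    }

    public static void main(String[] args) {
        // Hypothetical default classpath; pass the real one as the first argument.
        String classpath = args.length > 0 ? args[0] : "mdms-tools.jar";
        for (String mainClass : WORKFLOW) {
            System.out.println(String.join(" ", usageCommand(classpath, mainClass)));
        }
    }
}
```

Each returned list can be handed to `java.lang.ProcessBuilder` to actually start the tool instead of printing the command.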

Roadmap

  • domain model
  • SQLite persistence
  • Cassandra persistence (mostly done; some interoperability issues with Flink remain)
  • File-based persistence: high scalability, built on top of Avro, Parquet, or the like
  • Data mining operators: K-Means, Apriori
    • port to new analytics API
  • Primary/foreign key classifiers
    • port to new analytics API
  • Query language/analytical layer
    • SQL and CQL are usable of course
    • spike Scala-based analytics API
  • Integration with Metanome (results can be imported, but algorithms cannot be triggered)
  • Provide a CLI as the most basic MDMS operator
  • Frontend
    • test integration with Jupyter
    • build more visualizations

Project overview

  • Base modules

    • mdms-model: metamodel of relational schemata
    • mdms-dependencies: metamodel of most common dependencies (e.g., inclusion dependencies and functional dependencies)
    • mdms-util: general-purpose utilities used throughout the project
  • Persistence modules

    • mdms-simple: persistence using Java serialization
    • mdms-rdmbs: abstract persistence module for relational databases
    • mdms-sqlite: persistence with SQLite
    • mdms-cassandra: persistence with Cassandra
  • Application modules

    • mdms-clients: utilities to write MDMS-based applications
    • mdms-tools: basic MDMS applications, such as importing a schema from CSV files into a metadata store
    • mdms-java: Java-based utilities for MDMS applications
    • mdms-flink: Flink-based utilities for MDMS applications (complementary to mdms-java)
    • mdms-cli: CLI-based client to operate the metadata store

License

Unless explicitly stated otherwise, all files in this repository are licensed under the Apache Software License 2.0.

Copyright 2016 Sebastian Kruse

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
