
AI Pipeline Metadata Knowledge Graph is a metadata repository of 1.6 million AI pipelines and their components, such as tasks, datasets, models, metrics and code repositories. This repository provides a walk-through for constructing the metadata knowledge graph.


HewlettPackard/ai-metadata-knowledge-graph


AI Pipeline Metadata Knowledge Graph

The emergence of advanced Artificial Intelligence (AI) models has driven the development of frameworks and approaches that focus on automating model training and hyperparameter tuning of end-to-end AI pipelines. However, other crucial stages of these pipelines, such as dataset selection, feature engineering, and model optimization for deployment, have received less attention. Improving the efficiency of end-to-end AI pipelines requires metadata of past executions of AI pipelines and all their stages. Regenerating metadata history by re-executing existing AI pipelines is computationally challenging and impractical. To address this issue, we propose to source AI pipeline metadata from open-source platforms like Papers-with-Code, OpenML, and Hugging Face. However, integrating and unifying the varying terminologies and data formats from these diverse sources is a challenge. In this paper, we present a solution by introducing the Common Metadata Ontology (CMO), which is used to construct an extensive AI Pipeline Metadata Knowledge Graph (AIMKG) consisting of 1.6 million pipelines. Through semantic enhancements, the pipeline metadata in AIMKG is also enriched for downstream tasks such as search and recommendation of AI pipelines. We perform quantitative and qualitative evaluations on AIMKG to search and recommend relevant pipelines for a user query. For quantitative evaluation, we propose a custom aggregation model that outperforms other baselines by achieving a retrieval accuracy (R@1) of 76.3%. Our qualitative analysis shows that the AIMKG-based recommender retrieved relevant pipelines in 78% of test cases, compared to the state-of-the-art MLSchema-based recommender, which retrieved relevant responses in 51% of the cases. AIMKG serves as an atlas for navigating the evolving AI landscape, providing practitioners with a comprehensive factsheet for their applications. It guides AI pipeline optimization, offers insights and recommendations for improving AI pipelines, and serves as a foundation for data mining and analysis of evolving AI workflows.


Figure 1: Dashboard of the AI Pipeline Recommender, which uses the Dynamic AI Pipeline Constructor to recommend relevant pipelines


AIMKG Set up Guide

The construction details of AIMKG and the recommender are described in later sections. To set up AIMKG, follow the steps below.

  • Download the most recent version of Docker for your OS from https://docs.docker.com/desktop/release-notes/ and install it on your system
  • Navigate to the ai-pipeline-knowledge-graph folder, create a .env file from env-example, and modify its parameters
    • Change $USER and $UID in the .env file. To find the values, run echo $USER and echo $UID in your terminal
  • Create a virtual environment with python3 -m venv <myenv> and activate it using source <myenv>/bin/activate
  • Run pip install -r requirements.txt
  • Create a folder named graph_data where the graph database will be created
  • Download the neo4j plugin apoc-5.16.0-extended.jar and put it into a folder named plugins.
  • Download the sample dataset (dataset-small.zip) for AIMKG from here. Unzip it and put it into a folder named raw_data. The folder should look like raw_data/nodes and raw_data/relationships
  • Set these volume paths in the docker-compose.yml file, using full paths:
    • <path to raw_data>:/import
    • <path to graph_data>:/data
    • <path to plugins>:/var/lib/neo4j/plugins
  • Start the Docker container with docker compose up --build
  • Access the notebook at http://localhost:8888 and Neo4j at http://localhost:7474
  • Run the notebook named small_dataset.ipynb to ingest data into the graph database
  • The graph can be explored with the sample queries given below
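
The folder-related steps above can be scripted. The following is a minimal sketch, assuming the three folders sit side by side under one base directory and that the raw data folder is named raw_data; the helper name `prepare_folders` is hypothetical and not part of the repository. It creates the folders and prints the full-path volume strings to copy into docker-compose.yml.

```python
from pathlib import Path

# Host folder names follow the guide above; the container mount points are
# the ones listed for docker-compose.yml. The raw_data name is an assumption.
MOUNTS = {
    "raw_data": "/import",
    "graph_data": "/data",
    "plugins": "/var/lib/neo4j/plugins",
}


def prepare_folders(base_dir):
    """Create the three host folders and return docker-style volume strings."""
    base = Path(base_dir).resolve()
    volumes = []
    for name, container_path in MOUNTS.items():
        host_path = base / name
        host_path.mkdir(parents=True, exist_ok=True)
        # docker-compose.yml expects the full host path on the left-hand side
        volumes.append(f"{host_path}:{container_path}")
    return volumes
```

Calling `prepare_folders(".")` from the project folder returns three strings of the form `<full host path>:<container path>`, one per volume entry.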

Datasets

  • If you are downloading dataset-large.zip, execute the notebooks pwc_kg.ipynb, openml_kg.ipynb and hf_kg.ipynb IN THAT ORDER. The folder should look like raw_data/pwc, raw_data/open-ml and raw_data/huggingface
  • If you are directly downloading neo4j_dump.zip, unzip it, put its contents into the graph_data folder, and leave the raw_data folder empty. Because the data is already in Neo4j format, you can directly execute the sample queries below from the Neo4j browser. There is no need to execute any notebooks
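
For the large dataset, the notebook order matters. The sketch below is one possible way to run the three notebooks non-interactively in the stated order. It assumes `jupyter nbconvert` is available in the environment where the notebooks live and that they are in the current directory; the function names are hypothetical. Running the notebooks by hand from the Jupyter UI is equally valid.

```python
import subprocess

# Order stated in the text: Papers-with-Code, then OpenML, then Hugging Face.
NOTEBOOKS = ["pwc_kg.ipynb", "openml_kg.ipynb", "hf_kg.ipynb"]


def build_commands(notebooks=NOTEBOOKS):
    """Return one `jupyter nbconvert` command per notebook, preserving order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb]
        for nb in notebooks
    ]


def run_in_order(notebooks=NOTEBOOKS):
    """Execute each notebook in turn; check=True stops at the first failure."""
    for cmd in build_commands(notebooks):
        subprocess.run(cmd, check=True)
```

Stopping at the first failure keeps a later notebook from ingesting data on top of an incomplete earlier stage.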

Sample Queries

Following are some sample queries that can be run to test and visualize the data

  • Models used for text classification:
MATCH path= ((:Task {category:'classification', modality:'text,multimodal'})<-[e:executes]-(p:Pipeline)-[r:runs]->(m)) RETURN p, r, m
  • Pipelines with classification as task:
MATCH path = ((:Pipeline)-[]->(:Task {category:'classification'})) RETURN path
  • Datasets used for text recognition
MATCH path = ((:Task {category:'recognition', modality:'text'})-[]-(:Pipeline)-[]-(:Stage)-[]-(:Execution)-[]-(:Artifact)-[]-(:Dataset)) RETURN path
  • Metrics on MNIST dataset
MATCH path = ((:Dataset {datasetID:'mnist'})-[]-(:Artifact)-[]-(:Metrics)) RETURN path
  • Datasets and models used by pipelines that execute some form of 'image detection' task
MATCH (a:Artifact)-[r3]-(e:Execution)-[r4]-(s:Stage)-[r5]-(p:Pipeline)-[r6]-(t:Task{category:'detection', modality:'image,multimodal'})
WITH a,e,s,p,t,r3,r4,r5,r6
MATCH (d:Dataset)-[r1]-(a)-[r2]-(m:Model)
RETURN d, a, m, e, s, p, t, r1, r2, r3,r4, r5, r6 limit 100

  • Pipelines which are from papers-with-code and enriched with models from huggingface.
MATCH (t:Task)-[r1]-(p:Pipeline {source:'papers-with-code'})-[r2]-(s:Stage)-[r3]-(e:Execution)-[r4]-(a:Artifact)
WITH t,p,s,e,a,r1,r2,r3,r4
MATCH (d:Dataset)-[r5]-(a)-[r6]-(m:Model {source:'huggingface'})
RETURN t,p,s,e,a,r1,r2,r3,r4,d,m,r5,r6
  • Datasets, models and pipelines that use the model class 'gpt2'
MATCH (d:Dataset)-[r1]-(a:Artifact)-[r2]-(m:Model {modelClass:'gpt2'})
WITH d,a,m,r1,r2
MATCH (a)-[r3]-(e:Execution)-[r4]-(s:Stage)-[r5]-(p:Pipeline)-[r6]-(t:Task)
RETURN d, a, m, e, s, p, t, r1, r2, r3,r4, r5, r6 limit 100
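
The sample queries can also be run from Python instead of the Neo4j browser. The following is a minimal sketch using the official `neo4j` Python driver. It assumes the driver is installed separately and that the Bolt URI and credentials match your container configuration; none of these are specified in this README. The function names are hypothetical. The query is a parameterized variant of the first sample query, using the `Task`, `Pipeline`, `executes` and `runs` names shown above.

```python
def classification_models_query(category, modality, limit=25):
    """Build a parameterized variant of the first sample query."""
    query = (
        "MATCH (t:Task {category: $category, modality: $modality})"
        "<-[:executes]-(p:Pipeline)-[r:runs]->(m) "
        "RETURN p, r, m LIMIT $limit"
    )
    params = {"category": category, "modality": modality, "limit": limit}
    return query, params


def run_query(uri, user, password, query, params):
    """Run a Cypher query and return each record as a plain dict."""
    # Imported here so the query builder works without the driver installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(query, params)]
    finally:
        driver.close()
```

For example, `run_query(uri, user, password, *classification_models_query("classification", "text,multimodal"))` would return up to 25 matching records, where `uri`, `user` and `password` are whatever your Neo4j instance is configured with. Passing values as parameters rather than formatting them into the query string avoids quoting mistakes.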

AI Pipeline Recommendation Set Up Guide

Once Neo4j is up and running with AIMKG, the following steps open a UI for querying the graph in natural language or finding pipelines based on similar datasets, similar models or similar tasks

  • Navigate to aimkg-recommender-UI folder
  • Navigate to the utils folder and run compute_embeddings.py. This is a one-time step.
  • Navigate back to the aimkg-recommender-UI folder and run python app.py. The UI will start at the address shown in your terminal. A sample of the UI is shown in Figure 1 and the demo can be found here
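
To illustrate what finding pipelines by "similar datasets" can mean once embeddings exist, here is a generic sketch of cosine-similarity ranking. It is not the repository's implementation: the internals of compute_embeddings.py are not described here, and the function names and toy vectors below are assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_vec, embeddings, top_k=3):
    """Return the names of the top_k entries closest to query_vec."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Toy 2-dimensional vectors, invented for this example only.
TOY_EMBEDDINGS = {
    "mnist": [1.0, 0.0],
    "cifar10": [0.9, 0.1],
    "imdb": [0.0, 1.0],
}
```

With these toy vectors, `most_similar([1.0, 0.0], TOY_EMBEDDINGS, top_k=2)` ranks `mnist` first and `cifar10` second, since `imdb` is orthogonal to the query.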

Full Paper

The full paper, along with supplementary materials, can be found here: Constructing a metadata knowledge graph as an atlas for demystifying AI pipeline optimization

Video

The demo video of AI pipeline Recommendation can be found here - Demo of AI pipeline Recommender


AIMKG Construction

The construction of AIMKG is described in the figure below. The construction involves the following steps:

  • Data Collection
  • Exploratory Data Analysis
  • Common Metadata Ontology
  • Mapping Concepts to Common Metadata Ontology
  • Semantic Enrichments
  • Data Ingestion to Graph Database (Neo4j)

Construction of AI pipeline Metadata Knowledge Graph


1. Data Collection

  • Papers-with-code:

    Papers-with-Code provides extensive metadata for research papers and associated code repositories, encompassing over 1 million entries. The metadata covers various components and stages of AI pipelines described in the papers. Through their API, Papers-with-Code offers metadata including PDF URLs, GitHub repository links, task details, dataset information, methods employed, and evaluation metrics/results. While not all stages of metadata are available for every paper through the API, the information can still be obtained by referring to the research papers and their code repositories.

  • OpenML:

    OpenML provides metadata on machine learning pipelines logged by users, offering detailed information on tasks, datasets, flows, runs with parameter settings, and evaluations. OpenML encompasses eight major task types executed on various datasets, resulting in 1,600 unique tasks. For each task, the 500 most recent runs were collected, for a total of 330,000 runs.

  • Huggingface:

    Huggingface is a model hub that offers users access to numerous pretrained models. It covers a wide range of tasks, including computer vision, natural language processing, tabular data, reinforcement learning, and multimodal learning. Huggingface provides model-centric information, along with datasets and evaluations, enabling the construction of complete pipelines. Currently, approximately 50,000 pipelines have been collected from Huggingface.

2. Exploratory Data Analysis

The exploratory data analysis of the collected data showed different data structures and varying nomenclatures for similar concepts. For example, the concept of a model is referred to as methods in Papers-with-code, flow in OpenML and models in Huggingface.


Graph Data Model: Papers-with-code

Image 1

Graph Data Model: OpenML

Image 2

Graph Data Model: Common Metadata Framework

Image 3

Graph Data Model: Huggingface

Image 4

3. Common Metadata Ontology

The data collected from the sources above uses different nomenclature and data structures. To unify them, the Common Metadata Ontology (CMO) was designed based on the principles of the Common Metadata Framework (CMF), which is pipeline-centric. MLflow, which follows a model-centric approach, requires a separate instantiation for each model even when the models are executed for the same pipeline, for example entity extraction from semi-structured documents. CMF encompasses all the models and datasets of a pipeline under a single instantiation, which enables searching for the best execution path. An overview of CMO is shown below, and the details can be found in the common-metadata-ontology folder.


Overview of Common Metadata Ontology

CMO

4. Mapping

The concepts from Papers-with-code, OpenML and Huggingface are mapped to CMO to construct AIMKG. The details of the mapping of each source to the Common Metadata Ontology can be found in the mapping folder.
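
As a small illustration of the idea, the sketch below maps the three source-specific names for the model concept (methods, flow, models) to a single `Model` label, matching the example from the exploratory data analysis. The dictionary, the source identifiers and the `to_cmo` helper are hypothetical; the actual mapping files in the mapping folder may be organized differently.

```python
# Source-specific terms for the "model" concept, taken from the exploratory
# data analysis section. Source identifiers and layout are illustrative only.
SOURCE_TO_CMO = {
    ("papers-with-code", "methods"): "Model",
    ("openml", "flow"): "Model",
    ("huggingface", "models"): "Model",
}


def to_cmo(source, term):
    """Translate a source-specific term into its unified concept name."""
    key = (source.lower(), term.lower())
    if key not in SOURCE_TO_CMO:
        raise KeyError(f"no mapping defined for {term!r} from {source!r}")
    return SOURCE_TO_CMO[key]
```

Raising an error for unmapped terms, rather than passing them through, makes gaps in the mapping visible before data reaches the graph.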

5. Semantic Enrichments

To enable contextually relevant queries, semantic enrichments are performed on the data entities. For example, in the figure below, the user searched for the "Image Detection" task and its pipelines. Both "2D Object Detection" and "3D Object Detection" are returned as results, even though neither explicitly has "image" in its name. Such semantic enhancements are applied to tasks, datasets and models. The methods and techniques are detailed [here](semantic-enrichments/semantics_readme.md).

CMO
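
The effect of the enrichment can be sketched in a few lines. The example below assumes each task carries `category` and `modality` properties, as in the sample queries; the task records themselves are toy data invented for illustration, and `search_tasks` is a hypothetical helper rather than code from this repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    category: str  # enrichment: coarse task type
    modality: str  # enrichment: comma-separated modalities


# Toy records for illustration; not taken from AIMKG.
TASKS = [
    Task("2D Object Detection", "detection", "image,multimodal"),
    Task("3D Object Detection", "detection", "image,multimodal"),
    Task("Text Classification", "classification", "text,multimodal"),
]


def search_tasks(tasks, category, modality):
    """Match on the enriched properties instead of the task name."""
    return [
        task.name
        for task in tasks
        if task.category == category and modality in task.modality.split(",")
    ]
```

Here `search_tasks(TASKS, "detection", "image")` returns both object detection tasks even though neither name contains the word "image", which is the behavior described above.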

6. Data Ingestion

The gathered and semantically enriched data is then loaded into the Neo4j graph database to perform search and recommendation. The steps to set up the graph database are given in the AIMKG Set up Guide section.

Publications

  • Venkataramanan, Revathy, Aalap Tripathy, Tarun Kumar, Sergey Serebryakov, Annmary Justine, Arpit Shah, Suparna Bhattacharya et al. "Constructing a Metadata Knowledge Graph as an atlas for demystifying AI Pipeline optimization." Frontiers in Big Data 7: 1476506. Link to the paper

  • Venkataramanan, Revathy, Aalap Tripathy, Martin Foltin, Hong Yung Yip, Annmary Justine, and Amit Sheth. "Knowledge graph empowered machine learning pipelines for improved efficiency, reusability, and explainability." IEEE Internet Computing 27, no. 1 (2023): 81-88. Link to the paper
