This repository contains the metadata and data of different databases that we require for various purposes, specifically testing.
To import the data into your database, you can use the setup.sh
script. The script will drop and recreate the database if already present, so be careful when using it. This has the desired effect of idempotency, ie running it multiple times will result in the same state. Do remember to set the following environment variables to suit your postgres connection before running the script:
export DBPASSWORD="postgres"
export DBUSER="postgres"
export DBHOST="localhost"
export DBPORT=5432
./setup.sh
To set up the data in snowflake, you would need to have the snowflake cli installed (instructions), and have your credentials configured as per the docs. You can then run the following command to setup the data:
./setup_snowflake.sh
This will create 1 database per database in the repo as before, using public
as the default schema.
Note that the same sql files work for both the postgres and snowflake databases, so you can use the same sql files to setup both databases.
To set up the data in these systems, you would need your credentials to be configured in utils_dialects
. You can then run the following command to set up the databases:
python translate_ddl_dialect.py
This will create one new SQL file per database per dialect.
For SQLite, the .db
files will be saved in the folder sqlite_dbs
.
Note that BigQuery, MySQL and SQLite do not support schemas and hence the SQL files will be modified to skip schema creation.
If you only want to translate defog_data
for specific dialects, you can add the --dialects
argument to the command above. For example, the following command will translate the data for SQLite only
python translate_ddl_dialect.py --dialects sqlite
This is the recommended way to access the schema from the json files in a python environment. To use the python library in your code, navigate to this repository and install it using pip:
pip install -r requirements.txt # get dependencies
pip install -e .
The -e
allows us to edit the code in place, ie if we make changes to the code, we don't have to reinstall the package.
Once you have it installed, you can access the json metadata of each database as a dictionary using the following code:
import defog_data.metadata as md
md.academic
# {'table_metadata': {'cite': [{'data_type': 'bigint',
# 'column_name': 'cited',
# 'column_description': ['ID of the publication being cited']},
# ...
We also have joinable columns, split by database in supplementary.py. To access them, use the following code:
import defog_data.supplementary as sup
# columns that can be joined on
sup.columns_join
Each database (eg academic
) is organized in a folder with the following structure:
academic
├── academic.json
└── academic.sql
The json contains the metadata of the database along with the column and table descriptions, while the sql file contains the dump of the database (metadata + data). This is to facilitate easier importing of the data into a database, without worrying about the sequence of inserts, especially foreign key constraints which require the primary key from the parent table to be present before inserting into the child table.
The test in tests.py
just ensures that we are able to access the respective metadata for each table in each database. To run the tests, use the following command:
python -m unittest tests.py
To build for release, first bump the version in setup.py and then run the following commands:
python setup.py sdist bdist_wheel
twine upload dist/defog*