Guide to Implementing a dataset

Pre-Requisites

Please create a GitHub account before implementing a dataset; you can follow the instructions to install git here.

You will also need Python 3.6+. If you are installing Python, we recommend downloading Anaconda to manage a Python environment with the necessary packages.
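If you are unsure which interpreter your terminal picks up, a short check like the following can confirm it meets the 3.6+ requirement. This is only a sketch; the helper name python_is_supported is hypothetical and not part of any library.

```python
import sys

# Minimum version stated in this guide (Python 3.6+).
MIN_PYTHON = (3, 6)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) interpreter meets the minimum version."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); the patch level does not matter here.
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    if python_is_supported():
        print("Python", current, "meets the 3.6+ requirement")
    else:
        print("Python", current, "is too old; 3.6 or newer is needed")
```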

All commands in this guide are run from a terminal.

1. Assign yourself a dataset

Please select an issue on the biomedical data science repository here. Ideally, also check to see if the dataset already exists in the 🤗 Hub.

If the dataset does not exist OR if an issue is open (nobody is assigned to it), please feel free to assign yourself to the issue. To do so, comment on the issue that you would like to take this dataset, and an organizer will add you to the repo.

2. Set up a local version of the datasets repo to update

Fork the datasets repository from Hugging Face to your GitHub account. To do this, click the link to the repository and click "fork" in the upper-right corner. You should get an option to fork to your account, provided you are signed in to GitHub.

After you fork, clone the repository locally. You can do so as follows:

git clone git@github.com:<your_github_username>/datasets.git
cd datasets  # enter the directory

Next, add the original repository as your upstream remote so that you can receive updates from it. You can do so as follows:

git remote add upstream https://github.com/huggingface/datasets.git

You can optionally check that this was set properly by running the following command:

git remote -v

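The output of this command should look similar to the following (if you cloned over SSH as above, the origin URLs will start with git@github.com: instead):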

origin  https://github.com/<your_github_username>/datasets (fetch)
origin  https://github.com/<your_github_username>/datasets (push)
upstream    https://github.com/huggingface/datasets.git (fetch)
upstream    https://github.com/huggingface/datasets.git (push)
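To make the check above less error-prone, the output of git remote -v can also be inspected programmatically. The sketch below parses that text and reports which of the two expected remotes are missing; the function names parse_remotes and missing_remotes are hypothetical helpers, and the sample text is simply the expected output shown above.

```python
def parse_remotes(remote_output):
    """Map each remote name to its fetch and push URLs.

    Expects the text printed by `git remote -v`.
    """
    remotes = {}
    for line in remote_output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        name, url, kind = parts
        # kind looks like "(fetch)" or "(push)"; drop the parentheses.
        remotes.setdefault(name, {})[kind.strip("()")] = url
    return remotes


def missing_remotes(remote_output, required=("origin", "upstream")):
    """Return the required remote names that are not configured."""
    remotes = parse_remotes(remote_output)
    return [name for name in required if name not in remotes]


sample = """\
origin  https://github.com/<your_github_username>/datasets (fetch)
origin  https://github.com/<your_github_username>/datasets (push)
upstream    https://github.com/huggingface/datasets.git (fetch)
upstream    https://github.com/huggingface/datasets.git (push)
"""
print(missing_remotes(sample))  # [] means both remotes are present
```

To run it against your own clone, the text could be captured with Python's standard subprocess module instead of the sample string.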

If you do NOT have an origin for whatever reason, then run:

git remote add origin https://github.com/<your_github_username>/datasets.git

The goal of upstream is to keep your repository up to date with any changes made officially to the datasets library. You can do this by running the following commands:

git fetch upstream
git pull upstream master

Provided you have no merge conflicts, this will ensure the library stays up to date as you make changes. However, before you make changes, you should create a custom branch to hold them.

You can make a new branch as follows:

git checkout -b <name_of_my_dataset_branch>

Please do not make changes on the master branch!

Always make sure you're on the right branch with the following command:

git branch

The current branch will have an asterisk * in front of it; make sure it is your dataset branch.
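The asterisk convention is easy to check in a script as well. The following sketch reads the text printed by git branch and reports the active branch; current_branch and is_safe_to_edit are hypothetical names, and the sample output is made up for illustration.

```python
def current_branch(branch_output):
    """Return the branch marked with '*' in `git branch` output, or None."""
    for line in branch_output.splitlines():
        if line.startswith("*"):
            # Drop the asterisk and surrounding whitespace.
            # Note: in a detached HEAD state this returns git's
            # "(HEAD detached at ...)" text rather than a branch name.
            return line[1:].strip()
    return None


def is_safe_to_edit(branch_output, protected=("master",)):
    """Return True if the active branch is not one of the protected branches."""
    branch = current_branch(branch_output)
    return branch is not None and branch not in protected


sample = "  master\n* my_dataset_branch\n"
print(current_branch(sample))   # my_dataset_branch
print(is_safe_to_edit(sample))  # True
```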

3. Create a development environment

You can make an environment in any way you choose to. We highlight two possible options:

3a) Create a conda environment

Install Anaconda for your operating system.
Run the following commands while in the datasets folder (you can pick your Python version):

conda create --name <your_env_name_here> python=<your_python_version_here>  # Creates a conda env
conda activate <your_env_name_here> # Activate your conda environment
pip install -e ".[dev]"  # Install this while in the datasets folder

Make sure you are using the correct pip by running which pip; this should point to the pip within your conda environment. You can deactivate your environment at any time by either exiting your terminal or running conda deactivate.

3b) Create a venv environment

Python 3.3+ has venv automatically installed; official information is found here.

python3 -m venv <your_env_name_here>
source <your_env_name_here>/bin/activate  # activate environment
pip install -e ".[dev]"  # Install this while in the datasets folder

Note! If you already have the datasets library installed in your environment of choice, please uninstall it first (via pip uninstall datasets) and reinstall it in editable mode.
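One way to see which copy of a package an environment would import is to ask Python where it resolves the name. The sketch below uses the standard importlib machinery; package_location is a hypothetical helper name. With an editable install, the reported path would be expected to sit inside your cloned datasets folder rather than in site-packages.

```python
import importlib.util


def package_location(package_name):
    """Return the file a top-level package would be imported from.

    Returns None if the package cannot be found in this environment.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    location = package_location("datasets")
    if location is None:
        print("datasets is not installed in this environment")
    else:
        print("datasets would be imported from:", location)
```

Run this from a folder outside the cloned repository so that the result reflects the installed package.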

4. Implement your dataset

Make a new directory within the datasets folder as follows:

mkdir datasets/<name_of_my_dataset>
cd datasets/<name_of_my_dataset>
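The git add command in step 6 stages datasets/<name_of_my_dataset>/<name_of_my_dataset>.py, which suggests the loading script shares its name with its directory. Assuming that convention, the layout can be sketched in Python as follows; dataset_script_path and make_dataset_dir are hypothetical helpers, and my_dataset is a placeholder name.

```python
import tempfile
from pathlib import Path


def dataset_script_path(repo_root, dataset_name):
    """Return <repo_root>/datasets/<name>/<name>.py for a dataset name."""
    return Path(repo_root) / "datasets" / dataset_name / (dataset_name + ".py")


def make_dataset_dir(repo_root, dataset_name):
    """Create datasets/<name> (like the mkdir above) and return the script path."""
    script = dataset_script_path(repo_root, dataset_name)
    script.parent.mkdir(parents=True, exist_ok=True)
    return script


# Demonstrate in a throwaway directory instead of a real clone.
with tempfile.TemporaryDirectory() as fake_repo:
    script = make_dataset_dir(fake_repo, "my_dataset")
    print(script.relative_to(fake_repo).as_posix())  # datasets/my_dataset/my_dataset.py
```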

To implement your dataset, follow instructions in the necessary files:

Option 1: To add a dataset that has a public license, use template.py
Option 2: To add a dataset where files are local, use template_local.py

Check that your dataset is implemented correctly by running the following commands in Python:

import datasets
from datasets import load_dataset

data = load_dataset('your_dataset_name')

Run these commands from a folder that is not directly inside datasets; otherwise the local folder will interfere with the import statement.

5. Format code

Return to the main directory (assuming you are in your dataset-specific folder) via the following commands:

cd ../../  # return to the datasets main location
make style
make quality

This runs the black formatter, isort, and flake8 linting to keep the code readable and consistently styled. Flake8 linting errors may require manual changes.

6. Commit your changes

First, stage ("add") your work and commit it to your branch:

git add datasets/<name_of_my_dataset>/<name_of_my_dataset>.py
git commit

Ideally, include a descriptive message by running git commit -m "A message describing your changes"

Then, run the following commands to incorporate any new changes from the master branch of datasets:

git fetch upstream
git rebase upstream/master

Run these commands in your custom branch.

Push these changes to your fork with the following command:

git push -u origin <name_of_my_dataset_branch>

[Optional Step] Add unit tests and metadata by following the instructions here.

7. Make a pull request

Make a PR to implement your changes on the main repository here. To do so, click "New Pull Request". Then, choose the branch from your fork as the source and "base:master" as the target.
