Datadocs is static documentation for your datasets. I developed Datadocs to organize the vast number of datasets and fields we maintain at the Family Independence Initiative (FII). Datadocs is fully searchable and great for internal use as well as sharing with external partners (we use it for both purposes at FII).
Some key features of Datadocs:
- Fully searchable static documentation that can be hosted anywhere, including on Dropbox or S3.
- Logically categorize fields, making it easier to quickly understand what fields a dataset contains.
- Designate which fields are private or protected, especially useful when sharing with external partners.
- Designate which fields are raw versus which have undergone some transformation. It is common to create new variables based on the raw data one has, and keeping track of this makes for better, cleaner analysis.
You can view an example of Datadocs with some dummy data on my personal site at fullcontactphilanthropy/datadocs. Datadocs is written for and tested in Python 3.4.
Your documentation goes in the /docs folder. You should delete the contents of the included example data and do the following:
- Add your data as csv files - Datadocs requires each dataset to be a comma delimited csv file. Drop each of your csv files in the /docs folder and remove the "example.csv" file.
- Write yaml files for each csv - Each csv file needs a yaml file of the same name. For example, if your dataset is my_dataset.csv, then Datadocs requires you to include a file named my_dataset.yaml. I recommend copying the example.yaml file and using that as a base template. How to write your yaml files is discussed in more detail in the following section on "Documenting and registering your data".
- [Optional] Include markdown files - If you want, you can include markdown files with additional detail about your datasets. Like the yaml files, include a similarly named md file to provide additional documentation for a dataset. For example, add a my_dataset.md to render markdown content before the data documentation for your my_dataset.csv file.
- [Optional] Include an index.md file - If you want to provide additional documentation on the index page of your documentation, you can include an index.md file. The index.md file is useful for giving a high level view of your data before a reader dives into the datasets.
- Register your datasets - Before making your documentation, you need to register each dataset in the datadocs.yaml file. For example, to add the my_dataset.csv and its yaml and md files, simply add my_dataset to the list of datasets in the datadocs.yaml file.
- Make your documentation - From the root directory execute makedocs.py with Python 3. Your static documentation will be built in the /site folder.
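Before building, it can help to confirm that every registered dataset has both of its required files. The following is a minimal sketch of such a check, not part of Datadocs itself: the function name `missing_files` is hypothetical, and the dataset names are passed in by hand rather than read from datadocs.yaml.

```python
from pathlib import Path


def missing_files(docs_dir, dataset_names):
    """Return the required files that are absent for the given datasets.

    Only the csv and yaml files are checked, since md files are optional.
    """
    docs = Path(docs_dir)
    missing = []
    for name in dataset_names:
        for suffix in (".csv", ".yaml"):
            candidate = docs / (name + suffix)
            if not candidate.is_file():
                missing.append(candidate.name)
    return missing


if __name__ == "__main__":
    # Hypothetical dataset names; use the ones registered in datadocs.yaml.
    problems = missing_files("docs", ["my_dataset", "some_other_dataset"])
    if problems:
        print("Missing files:", ", ".join(problems))
    else:
        print("All registered datasets have csv and yaml files.")
```

Run it from the root directory so that the relative path "docs" resolves to the /docs folder.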
This section provides more detail on how to document and register your data in Datadocs.
Each dataset's yaml file has the following structure:
title: "Some title"
description: "Some description"
categories:
  - title: "Some category title"
    description: "Some category description"
    fields:
      - name: "Some field name as found in the .csv file"
        description: "Some description for this field"
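Writing this file by hand for a wide dataset is tedious, so a starter file can be generated from the csv header. The sketch below is one possible approach and is not shipped with Datadocs: `yaml_skeleton` is a hypothetical helper, it assumes fields nest under each category, and it places every column in a single placeholder category for you to reorganize.

```python
import csv
import io


def yaml_skeleton(csv_text, title):
    """Build starter yaml text listing every csv column under one category.

    Assumes column names contain no double quotes; quote them by hand if so.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    lines = [
        'title: "{}"'.format(title),
        'description: "TODO"',
        "categories:",
        '  - title: "Uncategorized"',
        '    description: "TODO"',
        "    fields:",
    ]
    for column in header:
        lines.append('      - name: "{}"'.format(column))
        lines.append('        description: "TODO"')
    return "\n".join(lines) + "\n"


# Hypothetical two-column dataset used only for illustration.
print(yaml_skeleton("household_id,income\n1,30000\n", "My dataset"))
```

Replace each "TODO" and split the fields into meaningful categories before building.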
Each field can also have optional attributes, as defined below.
The type attribute indicates a field's datatype. This attribute is optional, as Datadocs attempts to guess a field's datatype based on the data provided in your csv file. If you want to make sure the documentation shows the correct datatype, or you want to override Datadocs's guess, you can do so with type.
Usage:
fields:
  - name: "Some field name as found in the .csv file"
    description: "Some description for this field"
    type: "Date"
While you can provide any string for type, Datadocs expects one of the following:
- Boolean
- Categorical
- Date
- JSON
- Numeric
- Text
- Yaml
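To give a feel for why a guessed datatype can be wrong and worth overriding, here is a rough sketch of how guessing from raw csv strings could work. It is an illustration only and does not reproduce Datadocs's actual logic: `guess_type`, its helpers, the single date format, and the "few distinct values" rule are all assumptions.

```python
import json
from datetime import datetime


def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def _is_date(text):
    # Assumption: only ISO-style dates (YYYY-MM-DD) are recognized.
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def _is_json_container(text):
    if not text.startswith(("{", "[")):
        return False
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def guess_type(values):
    """Guess one of the type names above from a list of raw csv strings."""
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "Text"
    if all(v.lower() in ("true", "false") for v in present):
        return "Boolean"
    if all(_is_number(v) for v in present):
        return "Numeric"
    if all(_is_date(v) for v in present):
        return "Date"
    if all(_is_json_container(v) for v in present):
        return "JSON"
    # Few distinct values relative to the column length suggests categories.
    if len(present) >= 4 and len(set(present)) <= len(present) // 2:
        return "Categorical"
    return "Text"


print(guess_type(["2020-01-31", "2021-12-01"]))
print(guess_type(["01/31/2020", "12/01/2021"]))
```

The second call falls through to Text because the dates are not in the one format the sketch recognizes, which is the kind of case where setting type explicitly helps.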
The private attribute indicates whether a field is private. It is sometimes useful to document a field and disclose that it exists while making clear that it is protected or private. A good example might be a Social Security number.
Usage:
fields:
  - name: "Some field name as found in the .csv file"
    description: "Some description for this field"
    private: false
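As an illustration of how the private attribute can interact with the show_private setting in datadocs.yaml, the sketch below filters a list of field definitions before rendering. It is a simplified assumption about the behavior, not Datadocs's own code, and `visible_fields` is a hypothetical name.

```python
def visible_fields(fields, show_private):
    """Return the fields to render, dropping private ones unless show_private is set."""
    if show_private:
        return list(fields)
    # A field with no private attribute is treated as not private.
    return [field for field in fields if not field.get("private", False)]


# Hypothetical field definitions, as they might look after loading a yaml file.
fields = [
    {"name": "household_id", "description": "Household identifier"},
    {"name": "ssn", "description": "Social Security number", "private": True},
]
print([f["name"] for f in visible_fields(fields, show_private=False)])
# prints ['household_id']
```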
The transformed attribute indicates whether a field underwent some form of transformation. For example, if we have household size and the number of people in a household as raw data and we calculate the household's federal poverty line, the new variable would be considered transformed.
Usage:
fields:
  - name: "Some field name as found in the .csv file"
    description: "Some description for this field"
    transformed: false
Your datasets are registered in /docs/datadocs.yaml. If you had the following two datasets:
- my_dataset.csv
- some_other_dataset.csv
Your datadocs.yaml file might look like:
title: "My data documentation"
show_uncategorized: false
show_percent_answered: false
show_private: true
datasets:
- name: "my_dataset"
- name: "some_other_dataset"
Note the file above also includes some metadata and settings, defined below:
- title - Title for your documentation.
- show_uncategorized - Whether to show fields you have not provided documentation for. Setting this attribute to true can be useful for determining which fields have not been documented yet.
- show_percent_answered - Whether to show the percent of observations that are not null for each field. For example, if a field has ten observations and three of them are null, setting show_percent_answered to true would show in that field's documentation that 70% of observations are not null.
- show_private - Whether to include fields set to private when building the documentation. Setting this attribute to false can be useful when you share documentation externally: the fields stay documented in your yaml files, but readers of the built site will not know they exist.
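The show_percent_answered figure from the example above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only: `percent_answered` is a hypothetical helper, and treating empty strings as null is an assumption rather than documented Datadocs behavior.

```python
def percent_answered(values):
    """Return the percentage of observations that are not null (0.0 if empty)."""
    if not values:
        return 0.0
    # Assumption: both None and empty strings count as null.
    answered = sum(1 for v in values if v is not None and v != "")
    return 100.0 * answered / len(values)


# Ten observations, three of which are null.
observations = ["a", "b", None, "c", "", "d", "e", None, "f", "g"]
print(percent_answered(observations))  # prints 70.0
```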
Build your documentation by navigating to the root datadocs folder and typing the following at your command line:
$ python3 makedocs.py
A complete list of dependencies and version numbers is in the requirements.txt file in the root directory. Key dependencies are:
- Python 3
- pandas
- Markdown
- Jinja2
- PyYAML