Simplify track object + update contributing docs (#436)
* add missing space in fhandle docs

* abstract Track boilerplate, add Dataset._metadata

* fix 'object' usage

* set _track_metadata to None if it is not available, raise error if metadata is accessed and not available

* update contributing docs

Co-authored-by: Rachel Bittner <[email protected]>
rabitt and Rachel Bittner authored Jan 26, 2021
1 parent 2a26391 commit d067be5
Showing 61 changed files with 1,681 additions and 1,926 deletions.
49 changes: 38 additions & 11 deletions docs/source/contributing.rst
@@ -60,7 +60,7 @@ the ``please-do-not-edit`` flag is used.
1. Create an index
------------------

``mirdata``'s structure relies on ``JSON`` objects called `indexes`. Indexes contain information about the structure of the
``mirdata``'s structure relies on `indexes`. Indexes are dictionaries that contain information about the structure of the
dataset which is necessary for the loading and validating functionalities of ``mirdata``. In particular, indexes contain
information about the files included in the dataset, their location and checksums. The necessary steps are:

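For reference, each track entry in an index maps a file key to a relative path and its checksum. A hypothetical sketch (file names and checksums are placeholders, not from a real dataset):

.. code-block:: json

    {
        "tracks": {
            "track1": {
                "audio": ["audio/track1.wav", "<md5 checksum>"],
                "annotation": ["annotations/track1.csv", "<md5 checksum>"]
            }
        }
    }
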
@@ -200,8 +200,8 @@ You may find these examples useful as references:
* `A simple, fully downloadable dataset <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/tinysol.py>`_
* `A dataset which is partially downloadable <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/beatles.py>`_
* `A dataset with restricted access data <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/medleydb_melody.py#L33>`_
* `A dataset which uses global metadata <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/tinysol.py#L114>`_
* `A dataset which does not use global metadata <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/gtzan_genre.py#L36>`_
* `A dataset which uses dataset-level metadata <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/tinysol.py#L114>`_
* `A dataset which does not use dataset-level metadata <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/gtzan_genre.py#L36>`_
* `A dataset with a custom download function <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/maestro.py#L257>`_
* `A dataset with a remote index <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/acousticbrainz_genre.py>`_

@@ -222,7 +222,7 @@ To finish your contribution, include tests that check the integrity of your load
* For each audio/annotation file, reduce the audio length to 1-2 seconds and remove all but a few of the annotations.
* If the dataset has a metadata file, reduce the length to a few lines.

2. Test all of the dataset specific code, e.g. the public attributes of the Track object, the load functions and any other
2. Test all of the dataset specific code, e.g. the public attributes of the Track class, the load functions and any other
custom functions you wrote. See the `tests folder <https://github.com/mir-dataset-loaders/mirdata/tree/master/tests>`_ for reference.
If your loader has a custom download function, add tests similar to
`this loader <https://github.com/mir-dataset-loaders/mirdata/blob/master/tests/test_groove_midi.py#L96>`_.
@@ -231,7 +231,7 @@


.. note:: We have written automated tests for all loaders' ``cite``, ``download``, ``validate``, ``load``, ``track_ids`` functions,
as well as some basic edge cases of the ``Track`` object, so you don't need to write tests for these!
as well as some basic edge cases of the ``Track`` class, so you don't need to write tests for these!


.. _test_file:
@@ -530,9 +530,36 @@ Track Attributes
Custom track attributes should be global, track-level data.
For some datasets, there is a separate, dataset-level metadata file
with track-level metadata, e.g. as a csv. When a single file is needed
for more than one track, we recommend using writing a ``_load_metadata`` method
and passing it to a ``LargeData`` object, which is available globally throughout
the module to avoid loading the same file multiple times.
for more than one track, we recommend writing a ``_metadata`` cached property (which
returns a dictionary, either keyed by track_id or freeform)
in the Dataset class (see the dataset module example code above). When this is specified,
it will populate a track's hidden ``_track_metadata`` field, which can be accessed from
the Track class.

For example, if ``_metadata`` returns a dictionary of the form:

.. code-block:: python

    {
        'track1': {
            'artist': 'A',
            'genre': 'Z'
        },
        'track2': {
            'artist': 'B',
            'genre': 'Y'
        }
    }

the ``_track_metadata`` for ``track_id=track2`` will be:

.. code-block:: python

    {
        'artist': 'B',
        'genre': 'Y'
    }
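
A track can then expose fields from ``_track_metadata`` as plain attributes or properties. As a hypothetical sketch (the ``artist`` property below is illustrative, not part of the API):

.. code-block:: python

    class Track(core.Track):
        ...

        @property
        def artist(self):
            return self._track_metadata['artist']
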
Load methods vs Track properties
--------------------------------
@@ -553,7 +580,7 @@ Custom Decorators

cached_property
---------------
This is used primarily for Track objects.
This is used primarily for Track classes.

This decorator causes an object's function to behave like
an attribute (aka, like the ``@property`` decorator), but caches
@@ -562,14 +589,14 @@ for data which is relatively large and loaded from files.

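As a hypothetical sketch of the pattern (the property and loader names are illustrative):

.. code-block:: python

    class Track(core.Track):
        ...

        @core.cached_property
        def annotation(self):
            # loaded from disk on first access only, then cached
            return load_annotation(self.annotation_path)
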
docstring_inherit
-----------------
This decorator is used for children of the Dataset object, and
This decorator is used for children of the Dataset class, and
copies the Attributes from the parent class to the docstring of the child.
This gives us clear and complete docs without a lot of copy-paste.
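
In a loader it is typically applied to the ``Dataset`` subclass, roughly like this (a sketch):

.. code-block:: python

    @core.docstring_inherit(core.Dataset)
    class Dataset(core.Dataset):
        """The Example dataset"""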

copy_docs
---------
This decorator is used mainly for a dataset's ``load_`` functions, which
are attached to a loader's Dataset object. The attached function is identical,
are attached to a loader's Dataset class. The attached function is identical,
and this decorator simply copies the docstring from another function.
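
For example, as in the example loader below:

.. code-block:: python

    @core.copy_docs(load_audio)
    def load_audio(self, *args, **kwargs):
        return load_audio(*args, **kwargs)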

coerce_to_bytes_io/coerce_to_string_io
173 changes: 82 additions & 91 deletions docs/source/contributing_examples/example.py
@@ -14,12 +14,12 @@
6. Include a description about how the data can be accessed and the license it uses (if applicable)
"""

import csv
import logging
import os
import json
import os

import librosa
import csv
import numpy as np
# -- import whatever you need here and remove
# -- example imports you won't use
@@ -28,14 +28,15 @@
from mirdata import jams_utils
from mirdata import core, annotations


# -- Add any relevant citations here
BIBTEX = """@article{article-minimal,
author = "L[eslie] B. Lamport",
title = "The Gnats and Gnus Document Preparation System",
journal = "G-Animal's Journal",
year = "1986"
}"""
BIBTEX = """
@article{article-minimal,
author = "L[eslie] B. Lamport",
title = "The Gnats and Gnus Document Preparation System",
journal = "G-Animal's Journal",
year = "1986"
}
"""

# -- REMOTES is a dictionary containing all files that need to be downloaded.
# -- The keys should be descriptive (e.g. 'annotations', 'audio').
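# -- A hypothetical sketch of a single remote (URL, checksum and destination_dir are
# -- placeholders; assumes `from mirdata import download_utils`):
# REMOTES = {
#     "annotations": download_utils.RemoteFileMetadata(
#         filename="annotations.zip",
#         url="https://zenodo.org/record/0000000/files/annotations.zip?download=1",
#         checksum="00000000000000000000000000000000",
#         destination_dir="annotations",
#     ),
# }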
@@ -63,26 +64,7 @@
The dataset's license information goes here.
"""

# -- change this to load any top-level metadata
## delete this function if you don't have global metadata
def _load_metadata(data_home):
metadata_path = os.path.join(data_home, 'example_metadta.csv')
if not os.path.exists(metadata_path):
logging.info('Metadata file {} not found.'.format(metadata_path))
return None

# load metadata however makes sense for your dataset
metadata_path = os.path.join(data_home, 'example_metadata.json')
with open(metadata_path, 'r') as fhandle:
metadata = json.load(fhandle)

metadata['data_home'] = data_home

return metadata


DATA = core.LargeData('example_index.json', _load_metadata)
# DATA = core.LargeData('example_index.json') ## use this if your dataset has no metadata
DATA = core.LargeData('example_index.json')


class Track(core.Track):
@@ -99,29 +81,28 @@ class Track(core.Track):
# -- Add any of the dataset specific attributes here
"""
def __init__(self, track_id, data_home):
if track_id not in DATA.index:
raise ValueError(
'{} is not a valid track ID in Example'.format(track_id))

self.track_id = track_id

self._data_home = data_home
self._track_paths = DATA.index[track_id]

def __init__(self, track_id, data_home, dataset_name, index, metadata):

# -- this sets the following attributes:
# -- * track_id
# -- * _dataset_name
# -- * _data_home
# -- * _track_paths
# -- * _track_metadata
super().__init__(
track_id,
data_home,
dataset_name=dataset_name,
index=index,
metadata=metadata,
)

# -- add any dataset specific attributes here
self.audio_path = os.path.join(
self._data_home, self._track_paths['audio'][0])
self.annotation_path = os.path.join(
self._data_home, self._track_paths['annotation'][0])

# -- if the user doesn't have a metadata file, load None
self._metadata = DATA.metadata(data_home)
if self._metadata is not None and track_id in self._metadata:
self.some_metadata = self._metadata[track_id]['some_metadata']
else:
self.some_metadata = None

# -- `annotation` will behave like an attribute, but it will only be loaded
# -- and saved when someone accesses it. Useful when loading slightly
# -- bigger files or for bigger datasets. By default, we make any time
Expand Down Expand Up @@ -194,7 +175,7 @@ def audio(self):
"""(np.ndarray, float): DESCRIPTION audio signal, sample rate"""
return load_audio(self.audio_path)

# -- multitrack objects are themselves Tracks, and also need a to_jams method
# -- multitrack classes are themselves Tracks, and also need a to_jams method
# -- for any mixture-level annotations
def to_jams(self):
"""Jams: the track's data in jams format"""
@@ -206,43 +187,39 @@ def to_jams(self):
# -- see the documentation for `jams_utils.jams_converter` for all fields


def load_audio(audio_path):
@io.coerce_to_bytes_io
def load_audio(fhandle):
"""Load a Example audio file.
Args:
audio_path (str): path to audio file
fhandle (str or file-like): path or file-like object pointing to an audio file
Returns:
* np.ndarray - the mono audio signal
* np.ndarray - the audio signal
* float - The sample rate of the audio file
"""
# -- for example, the code below. This should be dataset specific!
# -- By default we load to mono
# -- change this if it doesn't make sense for your dataset.
if not os.path.exists(audio_path):
raise IOError("audio_path {} does not exist".format(audio_path))
return librosa.load(audio_path, sr=None, mono=True)


# -- Write any necessary loader functions for loading the dataset's data
def load_annotation(annotation_path):
@io.coerce_to_string_io
def load_annotation(fhandle):

# -- if there are some file paths for this annotation type in this dataset's
# -- index that are None/null, uncomment the lines below.
# if annotation_path is None:
# return None

if not os.path.exists(annotation_path):
raise IOError("annotation_path {} does not exist".format(annotation_path))

with open(annotation_path, 'r') as fhandle:
reader = csv.reader(fhandle, delimiter=' ')
intervals = []
annotation = []
for line in reader:
intervals.append([float(line[0]), float(line[1])])
annotation.append(line[2])
reader = csv.reader(fhandle, delimiter=' ')
intervals = []
annotation = []
for line in reader:
intervals.append([float(line[0]), float(line[1])])
annotation.append(line[2])

annotation_data = annotations.EventData(
np.array(intervals), np.array(annotation)
@@ -259,14 +236,15 @@ def __init__(self, data_home=None):
super().__init__(
data_home,
index=DATA.index,
name="Example",
track_object=Track,
name=NAME,
track_class=Track,
bibtex=BIBTEX,
remotes=REMOTES,
download_info=DOWNLOAD_INFO,
license_info=LICENSE_INFO,
)

# -- Copy any loaders you wrote that should be part of the Dataset object
# -- Copy any loaders you wrote that should be part of the Dataset class
# -- use this core.copy_docs decorator to copy the docs from the original
# -- load_ function
@core.copy_docs(load_audio)
Expand All @@ -277,28 +255,41 @@ def load_audio(self, *args, **kwargs):
def load_annotation(self, *args, **kwargs):
return load_annotation(*args, **kwargs)

# -- if your dataset needs to overwrite the default download logic, do it here.
# -- this function is usually not necessary unless you need very custom download logic
def download(
self, partial_download=None, force_overwrite=False, cleanup=False
):
"""Download the dataset
Args:
partial_download (list or None):
A list of keys of remotes to partially download.
If None, all data is downloaded
force_overwrite (bool):
If True, existing files are overwritten by the downloaded files.
cleanup (bool):
Whether to delete any zip/tar files after extracting.
Raises:
ValueError: if invalid keys are passed to partial_download
IOError: if a downloaded file's checksum is different from expected
"""
# see download_utils.downloader for basic usage - if you only need to call downloader
# once, you do not need this function at all.
# only write a custom function if you need it!
# -- if your dataset has a top-level metadata file, write a loader for it here
# -- you do not have to include this function if there is no metadata
@core.cached_property
def _metadata(self):
# load metadata however makes sense for your dataset
metadata_path = os.path.join(self.data_home, 'example_metadata.json')
with open(metadata_path, 'r') as fhandle:
metadata = json.load(fhandle)

return metadata

# -- if your dataset needs to overwrite the default download logic, do it here.
# -- this function is usually not necessary unless you need very custom download logic
def download(
self, partial_download=None, force_overwrite=False, cleanup=False
):
"""Download the dataset
Args:
partial_download (list or None):
A list of keys of remotes to partially download.
If None, all data is downloaded
force_overwrite (bool):
If True, existing files are overwritten by the downloaded files.
cleanup (bool):
Whether to delete any zip/tar files after extracting.
Raises:
ValueError: if invalid keys are passed to partial_download
IOError: if a downloaded file's checksum is different from expected
"""
# see download_utils.downloader for basic usage - if you only need to call downloader
# once, you do not need this function at all.
# only write a custom function if you need it!
