Simplify track object + update contributing docs (#436)
* add missing space in fhandle docs

* abstract Track boilerplate, add Dataset._metadata

* fix 'object' usage

* set _track_metadata to None if it is not available, raise error if metadata is accessed and not available

* update contributing docs

Co-authored-by: Rachel Bittner <[email protected]>
rabitt and Rachel Bittner authored Jan 26, 2021
1 parent 2a26391 commit d067be5
Showing 61 changed files with 1,681 additions and 1,926 deletions.
49 changes: 38 additions & 11 deletions docs/source/contributing.rst
@@ -60,7 +60,7 @@ the ``please-do-not-edit`` flag is used.
1. Create an index
------------------

``mirdata``'s structure relies on ``JSON`` objects called `indexes`. Indexes contain information about the structure of the
``mirdata``'s structure relies on `indexes`. Indexes are dictionaries that contain information about the structure of the
dataset which is necessary for the loading and validating functionalities of ``mirdata``. In particular, indexes contain
information about the files included in the dataset, their location and checksums. The necessary steps are:

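For reference, each track entry in an index maps a file key to a relative path and its checksum. A hypothetical sketch (file names and checksums are placeholders, not from a real dataset):

.. code-block:: json

    {
        "tracks": {
            "track1": {
                "audio": ["audio/track1.wav", "<md5 checksum>"],
                "annotation": ["annotations/track1.csv", "<md5 checksum>"]
            }
        }
    }
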
@@ -200,8 +200,8 @@ You may find these examples useful as references:
* `A simple, fully downloadable dataset <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/tinysol.py>`_
* `A dataset which is partially downloadable <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/beatles.py>`_
* `A dataset with restricted access data <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/medleydb_melody.py#L33>`_
* `A dataset which uses global metadata <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/tinysol.py#L114>`_
* `A dataset which does not use global metadata <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/gtzan_genre.py#L36>`_
* `A dataset which uses dataset-level metadata <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/tinysol.py#L114>`_
* `A dataset which does not use dataset-level metadata <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/gtzan_genre.py#L36>`_
* `A dataset with a custom download function <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/maestro.py#L257>`_
* `A dataset with a remote index <https://github.com/mir-dataset-loaders/mirdata/blob/master/mirdata/datasets/acousticbrainz_genre.py>`_

@@ -222,7 +222,7 @@ To finish your contribution, include tests that check the integrity of your load
* For each audio/annotation file, reduce the audio length to 1-2 seconds and remove all but a few of the annotations.
* If the dataset has a metadata file, reduce the length to a few lines.

2. Test all of the dataset specific code, e.g. the public attributes of the Track object, the load functions and any other
2. Test all of the dataset specific code, e.g. the public attributes of the Track class, the load functions and any other
custom functions you wrote. See the `tests folder <https://github.com/mir-dataset-loaders/mirdata/tree/master/tests>`_ for reference.
If your loader has a custom download function, add tests similar to
`this loader <https://github.com/mir-dataset-loaders/mirdata/blob/master/tests/test_groove_midi.py#L96>`_.
@@ -231,7 +231,7 @@


.. note:: We have written automated tests for all loaders' ``cite``, ``download``, ``validate``, ``load``, ``track_ids`` functions,
as well as some basic edge cases of the ``Track`` object, so you don't need to write tests for these!
as well as some basic edge cases of the ``Track`` class, so you don't need to write tests for these!


.. _test_file:
@@ -530,9 +530,36 @@ Track Attributes
Custom track attributes should be global, track-level data.
For some datasets, there is a separate, dataset-level metadata file
with track-level metadata, e.g. as a csv. When a single file is needed
for more than one track, we recommend using writing a ``_load_metadata`` method
and passing it to a ``LargeData`` object, which is available globally throughout
the module to avoid loading the same file multiple times.
for more than one track, we recommend writing a ``_metadata`` cached property (which
returns a dictionary, either keyed by track_id or freeform)
in the Dataset class (see the dataset module example code above). When this is specified,
it will populate a track's hidden ``_track_metadata`` field, which can be accessed from
the Track class.

For example, if ``_metadata`` returns a dictionary of the form:

.. code-block:: python

    {
        'track1': {
            'artist': 'A',
            'genre': 'Z'
        },
        'track2': {
            'artist': 'B',
            'genre': 'Y'
        }
    }

the ``_track_metadata`` for ``track_id=track2`` will be:

.. code-block:: python

    {
        'artist': 'B',
        'genre': 'Y'
    }
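
A track can then expose fields from ``_track_metadata`` as plain attributes or properties. As a hypothetical sketch (the ``artist`` property below is illustrative, not part of the API):

.. code-block:: python

    class Track(core.Track):
        ...

        @property
        def artist(self):
            return self._track_metadata['artist']
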
Load methods vs Track properties
--------------------------------
@@ -553,7 +580,7 @@ Custom Decorators

cached_property
---------------
This is used primarily for Track objects.
This is used primarily for Track classes.

This decorator causes an object's function to behave like
an attribute (aka, like the ``@property`` decorator), but caches
@@ -562,14 +589,14 @@ for data which is relatively large and loaded from files.

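As a hypothetical sketch of the pattern (the property and loader names are illustrative):

.. code-block:: python

    class Track(core.Track):
        ...

        @core.cached_property
        def annotation(self):
            # loaded from disk on first access only, then cached
            return load_annotation(self.annotation_path)
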
docstring_inherit
-----------------
This decorator is used for children of the Dataset object, and
This decorator is used for children of the Dataset class, and
copies the Attributes from the parent class to the docstring of the child.
This gives us clear and complete docs without a lot of copy-paste.
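
In a loader it is typically applied to the ``Dataset`` subclass, roughly like this (a sketch):

.. code-block:: python

    @core.docstring_inherit(core.Dataset)
    class Dataset(core.Dataset):
        """The Example dataset"""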

copy_docs
---------
This decorator is used mainly for a dataset's ``load_`` functions, which
are attached to a loader's Dataset object. The attached function is identical,
are attached to a loader's Dataset class. The attached function is identical,
and this decorator simply copies the docstring from another function.
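
For example, as in the example loader below:

.. code-block:: python

    @core.copy_docs(load_audio)
    def load_audio(self, *args, **kwargs):
        return load_audio(*args, **kwargs)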

coerce_to_bytes_io/coerce_to_string_io
173 changes: 82 additions & 91 deletions docs/source/contributing_examples/example.py
@@ -14,12 +14,12 @@
6. Include a description about how the data can be accessed and the license it uses (if applicable)
"""

import csv
import logging
import os
import json
import os

import librosa
import csv
import numpy as np
# -- import whatever you need here and remove
# -- example imports you won't use
@@ -28,14 +28,15 @@
from mirdata import jams_utils
from mirdata import core, annotations


# -- Add any relevant citations here
BIBTEX = """@article{article-minimal,
author = "L[eslie] B. Lamport",
title = "The Gnats and Gnus Document Preparation System",
journal = "G-Animal's Journal",
year = "1986"
}"""
BIBTEX = """
@article{article-minimal,
author = "L[eslie] B. Lamport",
title = "The Gnats and Gnus Document Preparation System",
journal = "G-Animal's Journal",
year = "1986"
}
"""

# -- REMOTES is a dictionary containing all files that need to be downloaded.
# -- The keys should be descriptive (e.g. 'annotations', 'audio').
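# -- A hypothetical sketch of a single remote (URL, checksum and destination_dir are
# -- placeholders; assumes `from mirdata import download_utils`):
# REMOTES = {
#     "annotations": download_utils.RemoteFileMetadata(
#         filename="annotations.zip",
#         url="https://zenodo.org/record/0000000/files/annotations.zip?download=1",
#         checksum="00000000000000000000000000000000",
#         destination_dir="annotations",
#     ),
# }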
@@ -63,26 +64,7 @@
The dataset's license information goes here.
"""

# -- change this to load any top-level metadata
## delete this function if you don't have global metadata
def _load_metadata(data_home):
metadata_path = os.path.join(data_home, 'example_metadta.csv')
if not os.path.exists(metadata_path):
logging.info('Metadata file {} not found.'.format(metadata_path))
return None

# load metadata however makes sense for your dataset
metadata_path = os.path.join(data_home, 'example_metadata.json')
with open(metadata_path, 'r') as fhandle:
metadata = json.load(fhandle)

metadata['data_home'] = data_home

return metadata


DATA = core.LargeData('example_index.json', _load_metadata)
# DATA = core.LargeData('example_index.json') ## use this if your dataset has no metadata
DATA = core.LargeData('example_index.json')


class Track(core.Track):
@@ -99,29 +81,28 @@ class Track(core.Track):
# -- Add any of the dataset specific attributes here
"""
def __init__(self, track_id, data_home):
if track_id not in DATA.index:
raise ValueError(
'{} is not a valid track ID in Example'.format(track_id))

self.track_id = track_id

self._data_home = data_home
self._track_paths = DATA.index[track_id]

def __init__(self, track_id, data_home, dataset_name, index, metadata):

# -- this sets the following attributes:
# -- * track_id
# -- * _dataset_name
# -- * _data_home
# -- * _track_paths
# -- * _track_metadata
super().__init__(
track_id,
data_home,
dataset_name=dataset_name,
index=index,
metadata=metadata,
)

# -- add any dataset specific attributes here
self.audio_path = os.path.join(
self._data_home, self._track_paths['audio'][0])
self.annotation_path = os.path.join(
self._data_home, self._track_paths['annotation'][0])

# -- if the user doesn't have a metadata file, load None
self._metadata = DATA.metadata(data_home)
if self._metadata is not None and track_id in self._metadata:
self.some_metadata = self._metadata[track_id]['some_metadata']
else:
self.some_metadata = None

# -- `annotation` will behave like an attribute, but it will only be loaded
# -- and saved when someone accesses it. Useful when loading slightly
# -- bigger files or for bigger datasets. By default, we make any time
Expand Down Expand Up @@ -194,7 +175,7 @@ def audio(self):
"""(np.ndarray, float): DESCRIPTION audio signal, sample rate"""
return load_audio(self.audio_path)

# -- multitrack objects are themselves Tracks, and also need a to_jams method
# -- multitrack classes are themselves Tracks, and also need a to_jams method
# -- for any mixture-level annotations
def to_jams(self):
"""Jams: the track's data in jams format"""
@@ -206,43 +187,39 @@ def to_jams(self):
# -- see the documentation for `jams_utils.jams_converter` for all fields


def load_audio(audio_path):
@io.coerce_to_bytes_io
def load_audio(fhandle):
"""Load a Example audio file.
Args:
audio_path (str): path to audio file
fhandle (str or file-like): path or file-like object pointing to an audio file
Returns:
* np.ndarray - the mono audio signal
* np.ndarray - the audio signal
* float - The sample rate of the audio file
"""
# -- for example, the code below. This should be dataset specific!
# -- By default we load to mono
# -- change this if it doesn't make sense for your dataset.
if not os.path.exists(audio_path):
raise IOError("audio_path {} does not exist".format(audio_path))
return librosa.load(audio_path, sr=None, mono=True)


# -- Write any necessary loader functions for loading the dataset's data
def load_annotation(annotation_path):
@io.coerce_to_string_io
def load_annotation(fhandle):

# -- if there are some file paths for this annotation type in this dataset's
# -- index that are None/null, uncomment the lines below.
# if annotation_path is None:
# return None

if not os.path.exists(annotation_path):
raise IOError("annotation_path {} does not exist".format(annotation_path))

with open(annotation_path, 'r') as fhandle:
reader = csv.reader(fhandle, delimiter=' ')
intervals = []
annotation = []
for line in reader:
intervals.append([float(line[0]), float(line[1])])
annotation.append(line[2])
reader = csv.reader(fhandle, delimiter=' ')
intervals = []
annotation = []
for line in reader:
intervals.append([float(line[0]), float(line[1])])
annotation.append(line[2])

annotation_data = annotations.EventData(
np.array(intervals), np.array(annotation)
@@ -259,14 +236,15 @@ def __init__(self, data_home=None):
super().__init__(
data_home,
index=DATA.index,
name="Example",
track_object=Track,
name=NAME,
track_class=Track,
bibtex=BIBTEX,
remotes=REMOTES,
download_info=DOWNLOAD_INFO,
license_info=LICENSE_INFO,
)

# -- Copy any loaders you wrote that should be part of the Dataset object
# -- Copy any loaders you wrote that should be part of the Dataset class
# -- use this core.copy_docs decorator to copy the docs from the original
# -- load_ function
@core.copy_docs(load_audio)
Expand All @@ -277,28 +255,41 @@ def load_audio(self, *args, **kwargs):
def load_annotation(self, *args, **kwargs):
return load_annotation(*args, **kwargs)

# -- if your dataset needs to overwrite the default download logic, do it here.
# -- this function is usually not necessary unless you need very custom download logic
def download(
self, partial_download=None, force_overwrite=False, cleanup=False
):
"""Download the dataset
Args:
partial_download (list or None):
A list of keys of remotes to partially download.
If None, all data is downloaded
force_overwrite (bool):
If True, existing files are overwritten by the downloaded files.
cleanup (bool):
Whether to delete any zip/tar files after extracting.
Raises:
ValueError: if invalid keys are passed to partial_download
IOError: if a downloaded file's checksum is different from expected
"""
# see download_utils.downloader for basic usage - if you only need to call downloader
# once, you do not need this function at all.
# only write a custom function if you need it!
# -- if your dataset has a top-level metadata file, write a loader for it here
# -- you do not have to include this function if there is no metadata
@core.cached_property
def _metadata(self):
# load metadata however makes sense for your dataset
metadata_path = os.path.join(self.data_home, 'example_metadata.json')
with open(metadata_path, 'r') as fhandle:
metadata = json.load(fhandle)

return metadata

# -- if your dataset needs to overwrite the default download logic, do it here.
# -- this function is usually not necessary unless you need very custom download logic
def download(
self, partial_download=None, force_overwrite=False, cleanup=False
):
"""Download the dataset
Args:
partial_download (list or None):
A list of keys of remotes to partially download.
If None, all data is downloaded
force_overwrite (bool):
If True, existing files are overwritten by the downloaded files.
cleanup (bool):
Whether to delete any zip/tar files after extracting.
Raises:
ValueError: if invalid keys are passed to partial_download
IOError: if a downloaded file's checksum is different from expected
"""
# see download_utils.downloader for basic usage - if you only need to call downloader
# once, you do not need this function at all.
# only write a custom function if you need it!
