Hugging Face has a dataset definition library that I think we could leverage both for the immediate task of splitting data and for some longer-term purposes.

HF not only defines a scheme for doing dataset splits, but also supports features like progressive downloading of a dataset and making the dataset available to both TensorFlow-based and PyTorch-based code.

We can use HF's off-the-shelf notion of images-in-a-folder, or define our own scheme that reads our existing metadata files:
https://huggingface.co/docs/datasets/en/image_dataset#imagefolder
https://huggingface.co/docs/datasets/en/image_dataset#legacy-loading-script
Splits in this library appear to be implemented by splitting the metadata rather than by creating separate folders. We already have metadata and keep all images in one folder, so we have the bones of this system already.
Here's where they keep the code: https://github.com/huggingface/datasets/tree/v2.21.0-release
It's also my hope that exploring this library will pave the way for folks using fibad to easily upload their datasets to HF and share them with other researchers.