Croissant refers to incomplete parquet branch in native parquet datasets #3101
Comments
@lhoestq is there anything we can contribute to fix this, or does it need to be done in the HuggingFace server?
Hi! Yes, if a dataset is already in Parquet, the Croissant file doesn't need to point to the parquet branch (which may contain incomplete data). You can check https://github.com/huggingface/dataset-viewer/blob/main/services/worker/src/worker/job_runners/dataset/croissant_crumbs.py and see if you can adapt the code for this case.
My impression is that the required change goes deeper than croissant_crumbs.py. That file assumes it already has a dataset info (containing configs and splits), and IIUC the dataset_info is retrieved from the parquet branch (or, more generally, there is a clear mapping dataset_info -> parquet branch folder hierarchy). The first step would be to specify how the configs and splits are distributed across the folders of the main branch, which, unlike the parquet branch, doesn't follow a predefined structure. For example https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet/tree/refs%2Fconvert%2Fparquet :
You can use `load_dataset_builder` to get the data files:
>>> from datasets import load_dataset_builder
>>> builder = load_dataset_builder("mlfoundations/dclm-baseline-1.0-parquet")
>>> builder.config.data_files
{
NamedSplit('train'): [
'hf://datasets/mlfoundations/dclm-baseline-1.0-parquet@817d6752765f6a41261085171dd546b104f60626/filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.parquet',
'hf://datasets/mlfoundations/dclm-baseline-1.0-parquet@817d6752765f6a41261085171dd546b104f60626/filtered/OH_eli5_vs_rw_v2_bigram_200k_train/fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train/processed_data/global-shard_01_of_10/local-shard_0_of_10/shard_00000001_processed.parquet'
...
]
}
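The entries above are `hf://` URLs pinned to a commit, while a Croissant file would need paths relative to the repository. A minimal sketch of splitting such a URL into its parts follows. The helper `parse_hf_dataset_url` and the `HfPath` type are hypothetical names, not part of `datasets`, and the pattern assumes a two-segment `namespace/name` repo id.

```python
import re
from typing import NamedTuple, Optional


class HfPath(NamedTuple):
    repo_id: str
    revision: Optional[str]
    path_in_repo: str


# Assumption: repo ids have the form "namespace/name"; the revision, when
# present, follows "@" and contains no unescaped "/".
_HF_URL = re.compile(
    r"^hf://datasets/(?P<repo>[^/@]+/[^/@]+)(?:@(?P<rev>[^/]+))?/(?P<path>.+)$"
)


def parse_hf_dataset_url(url: str) -> HfPath:
    """Split an hf:// dataset URL into repo id, optional revision, and path."""
    match = _HF_URL.match(url)
    if match is None:
        raise ValueError(f"not an hf:// dataset URL: {url!r}")
    return HfPath(match["repo"], match["rev"], match["path"])
```

Applied to the values of `builder.config.data_files`, this would give the repo-relative path of each parquet file on the main branch.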
Thanks @lhoestq, that's exactly what I was looking for. And combined with listing the configs, it should be able to cover the mapping config/split -> paths in main branch:
One challenge we may face is that the data_files are listed individually (without globs), so listing each file could lead to huge Croissant files.
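One way to keep the file list small is to collapse the individual paths into one glob per directory. This is only a sketch under a strong assumption: every file with the same extension in a directory belongs to the same split, which would have to be verified per dataset. The function name `collapse_to_globs` is hypothetical.

```python
import posixpath
from typing import Iterable, List


def collapse_to_globs(paths: Iterable[str]) -> List[str]:
    """Replace each file path by "<parent dir>/*<ext>", dropping duplicates.

    Assumption: all files sharing a directory and extension belong to the
    same config/split, so the glob does not over-match.
    """
    patterns = {}  # dict preserves first-seen order
    for path in paths:
        directory, filename = posixpath.split(path)
        extension = posixpath.splitext(filename)[1]
        patterns.setdefault(posixpath.join(directory, "*" + extension), None)
    return list(patterns)
```

For the dclm example, the shards under one `local-shard_0_of_10` folder would collapse to a single `.../local-shard_0_of_10/*.parquet` entry, so the output grows with the number of directories rather than the number of files.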
Alternatively, we can rely on the dataset-compatible-libraries job in dataset-viewer, which creates code snippets, e.g. for Dask, using a glob pattern for the parquet files. The glob pattern is computed using heuristics. For example, for dclm it obtains this code snippet and glob:
import dask.dataframe as dd
df = dd.read_parquet("hf://datasets/mlfoundations/dclm-baseline-1.0-parquet/**/*_train/**")
So we could just reuse the glob, which is stored along with the generated code snippet (no need to parse the code).
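As a rough illustration of reusing that glob, the sketch below turns it into a Croissant-style `FileSet` entry whose `includes` field is relative to the repository root. The helper name `fileset_from_glob`, the `@id` naming scheme, and the `"repo"` id used in `containedIn` are assumptions for illustration, not the dataset-viewer implementation.

```python
from typing import Dict


def fileset_from_glob(hf_glob: str, repo_id: str, split: str) -> Dict[str, object]:
    """Build a Croissant-style FileSet dict from an hf:// glob (sketch)."""
    prefix = f"hf://datasets/{repo_id}/"
    if not hf_glob.startswith(prefix):
        raise ValueError(f"glob does not belong to {repo_id!r}: {hf_glob!r}")
    return {
        "@type": "cr:FileSet",
        "@id": f"parquet-files-for-split-{split}",  # hypothetical naming scheme
        "containedIn": {"@id": "repo"},  # assumes a FileObject with id "repo"
        "encodingFormat": "application/x-parquet",
        "includes": hf_glob[len(prefix):],  # path relative to the repo root
    }
```

With the dclm glob above, `includes` would be `**/*_train/**`, which points at the main branch instead of the parquet branch.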
The Croissant file exposed by HuggingFace seems to correspond to the parquet branch of the dataset, even when the dataset is native parquet:
https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet
https://huggingface.co/datasets/ai4bharat/sangraha
https://huggingface.co/datasets/BleachNick/UltraEdit_500k
IIUC, the parquet branch is not complete for datasets >5GB (not exactly, since the 5GB limit is per split), but overall the branch can often be incomplete for large datasets. There are exceptions though: in this dataset the Parquet branch seems complete:
Instead, there should be a way of retrieving a Croissant referring to the main native-parquet branch. Maybe for backward compatibility it would be better to expose both Croissant files (parquet branch and main branch) although exposing only the "complete" one could also be an option.