fuzzy_dedup OOM issue #471
Comments
Thanks for raising the issue @chenrui17. I have a few recommendations to reduce memory and computational requirements at this scale:

- `char_ngrams=24`: use a larger ngram size to reduce false positives.
- `buckets_per_shuffle=1`: process 1 bucket per iteration of LSH to reduce memory requirements.
- `false_positive_check=False`: skip the false positive check, which is computationally expensive. In practice false positives are usually 1-2% of documents based on our experiments.

Some of the changes suggested above are becoming the default in Curator (see #386). Additionally, I would recommend keeping parquet files <= 2 GB uncompressed if you have large files. If you are using many small files, you can use the `files_per_partition` argument (see NeMo-Curator/nemo_curator/datasets/doc_dataset.py, lines 90 to 98 at 9c8f185) to combine multiple parquet files into a single block for processing in prior versions of Curator.
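Put together, the suggested settings might look roughly like the sketch below. This is only a sketch, assuming the `FuzzyDuplicatesConfig`/`FuzzyDuplicates` API from this era of Curator and a GPU Dask client already running; the paths, `text_field`, and `files_per_partition` values are placeholders.

```python
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset

# Combine several small parquet files into one partition; files_per_partition
# was available in prior versions of Curator, and 4 is a placeholder value.
dataset = DocumentDataset.read_parquet(
    "/path/to/dclm_parquet/",  # placeholder input path
    backend="cudf",
    files_per_partition=4,
)

# Settings recommended above; cache_dir and text_field are placeholders.
config = FuzzyDuplicatesConfig(
    cache_dir="/path/to/fuzzy_dedup_cache",
    id_field="nemo_id",
    text_field="text",
    char_ngrams=24,              # larger ngram size to reduce false positives
    buckets_per_shuffle=1,       # process 1 LSH bucket per iteration to reduce memory
    false_positive_check=False,  # skip the expensive false positive check (~1-2% of docs)
)

fuzzy_dup = FuzzyDuplicates(config=config)
duplicates = fuzzy_dup(dataset)  # DocumentDataset identifying duplicate documents
```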
Internally we've typically used 16-24 GPUs for processing data at this scale, so I'm not sure if these suggestions will prevent OOM errors on 5 GPUs, but I'm happy to follow up and see if this improves things.
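As a point of reference, the multi-GPU runs mentioned above rely on a Dask-CUDA cluster. A minimal single-node sketch is shown below; the pool size and device memory limit are placeholder values that would need tuning to the GPUs in use.

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# One Dask worker per visible GPU. The RMM pool size and device memory limit
# are placeholder values and should be tuned to the available GPU memory.
cluster = LocalCUDACluster(rmm_pool_size="60GB", device_memory_limit="70GB")
client = Client(cluster)
```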
Describe the bug
Using 5 x A100 GPUs to run a fuzzy_dedup task, I encountered OOM issues. Here is the error info:
Steps/Code to reproduce bug
Environment overview (please complete the following information)
Additional context
Using dclm-baseline 1.0 parquet data, about 8 TB of parquet data in total (after adding nemo_id, with no compression).
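For context on the preparation step mentioned above, adding a `nemo_id` column is typically done with Curator's `AddId` module. The following is only a rough sketch: the paths and id prefix are placeholders, and the exact I/O helpers and `AddId` arguments may differ by Curator version.

```python
from nemo_curator import AddId
from nemo_curator.datasets import DocumentDataset

# Read the raw dclm-baseline parquet shards (placeholder path).
dataset = DocumentDataset.read_parquet("/path/to/dclm_raw/", backend="cudf")

# Add a unique id column named "nemo_id"; the prefix is a placeholder.
add_id = AddId(id_field="nemo_id", id_prefix="dclm")
dataset = add_id(dataset)

# Write the dataset back out as parquet (placeholder output path).
dataset.to_parquet("/path/to/dclm_with_id/")
```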