-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fuzzy Dedup: Make skipping the False positive check the default #386
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think making the default false_positive_check=False
makes sense. Added some nits, otherwise are there any other files that should be updated? Maybe:
- https://github.com/NVIDIA/NeMo-Curator/blob/main/config/fuzzy_dedup_config.yaml
- https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/zyda2-tutorial/1_fuzzy_dedup
- https://github.com/NVIDIA/NeMo-Curator/blob/main/docs/user-guide/gpudeduplication.rst
?
> [!TIP] | ||
> When using these scripts in a multi-node environment (like Slurm, K8's etc.) it is recommended to start up a Dask cluster prior to execution and connect to the existing cluster via the `--scheduler-address` or `--scheduler-file` flag. | ||
> Use the `--help` flag to view all possible CLI options for the scripts and details on what they do. | ||
> The up to date documentation on running the Fuzzy Deduplication scripts can be found in the [NeMo-User guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html#id4). It is recommended to use the Python API directly rather than the CLI scripts for most cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> The up to date documentation on running the Fuzzy Deduplication scripts can be found in the [NeMo-User guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html#id4). It is recommended to use the Python API directly rather than the CLI scripts for most cases. | |
> The up to date documentation on running the fuzzy deduplication scripts can be found in the [NeMo Curator User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html#id4). It is recommended to use the Python API directly rather than the CLI scripts for most cases. |
num_buckets: int = 20 | ||
hashes_per_bucket: int = 13 | ||
use_64_bit_hash: bool = False | ||
buckets_per_shuffle: int = 1 | ||
|
||
false_positive_check: bool = True | ||
false_positive_check: bool = False | ||
# Only required for fp check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Only required for fp check | |
# Only required for false positive check |
Nit, but I prefer spelling out false_positive
each time. Going to be making a similar comment about "fpp" on #285 but I think it can get pretty difficult to modify our codebase otherwise.
raise ValueError( | ||
"Finding fuzzy duplicates requires a cache directory accessible via all workers to store intermediates" | ||
) | ||
fp_defaults = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fp_defaults = { | |
false_positive_defaults = { |
warnings.warn( | ||
"Using a higher number of anchor docs might lead to higher memory footprint and might impact performance", | ||
category=UserWarning, | ||
for arg, default in fp_defaults.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for arg, default in fp_defaults.items(): | |
for arg, default in false_positive_defaults.items(): |
"Using a small char_ngrams value might lead to a large number (~5%) of false positives during deduplication." | ||
" Using a value of at least 20 for char_ngrams is recommended." | ||
) | ||
unused_fp_args = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused_fp_args = [ | |
unused_false_positive_args = [ |
" Using a value of at least 20 for char_ngrams is recommended." | ||
) | ||
unused_fp_args = [ | ||
arg for arg in fp_defaults.keys() if getattr(self, arg) is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arg for arg in fp_defaults.keys() if getattr(self, arg) is not None | |
arg for arg in false_positive_defaults.keys() if getattr(self, arg) is not None |
unused_fp_args = [ | ||
arg for arg in fp_defaults.keys() if getattr(self, arg) is not None | ||
] | ||
if unused_fp_args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if unused_fp_args: | |
if unused_false_positive_args: |
] | ||
if unused_fp_args: | ||
warnings.warn( | ||
f"False positive check is disabled. Unused arguments {unused_fp_args} will be ignored", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f"False positive check is disabled. Unused arguments {unused_fp_args} will be ignored", | |
f"False positive check is disabled. Unused arguments {unused_false_positive_args} will be ignored", |
Description
This PR updates the FuzzyDedup modules making
false_positive_check=False
andchar_ngrams=24
the default params going forward.Motivation:
num_buckets
andhashes_per_bucket
, the false positive rate can be minimized based on intended jaccard similarity.20
buckets and13
hashes per bucket has a low false positive rate for documents with similarity < 0.8 and internal testing has shown a difference of 1-2% when skipping the fp check.24
which roughly corresponds to 5 word shingles, results in removal rates which are very similar to 5 char ngrams with the fp check enabled since 5 char ngrams are prone to more false positives.These new defaults should produce similar results to the previous defaults while being significantly less compute intensive and faster.
Will update tutorials in a followup PR.
Usage
# Add snippet demonstrating usage
Checklist