Semantic Dedup fails on dask compute_chunk_sizes (25.02 nightly) #461

Open
praateekmahajan opened this issue Jan 1, 2025 · 2 comments
Labels: bug (Something isn't working)

Comments

@praateekmahajan
Collaborator

Describe the bug

Running SemDedup fails during the clustering step when dask computes chunk sizes for the embeddings array; graph optimization raises ValueError: Cannot fuse tasks with multiple outputs.

    dedup_ids = semdup(dataset)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/semantic_dedup.py", line 647, in __call__
    self.clustering_model(embeddings_dataset)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/semantic_dedup.py", line 351, in __call__
    cupy_darr.compute_chunk_sizes()
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/array/core.py", line 1516, in compute_chunk_sizes
    tuple(int(chunk) for chunk in chunks) for chunks in compute(tuple(c))[0]
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/base.py", line 653, in compute
    dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/base.py", line 422, in collections_to_dsk
    dsk = opt(dsk, keys, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/array/optimization.py", line 39, in optimize
    dsk = fuse_linear_task_spec(dsk, keys=keys)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/_task_spec.py", line 1069, in fuse_linear_task_spec
    result[renamed_key] = Task.fuse(*linear_chain, key=renamed_key)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/_task_spec.py", line 484, in fuse
    raise ValueError(f"Cannot fuse tasks with multiple outputs {leafs}")
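
The failing call is compute_chunk_sizes() on the cupy-backed embeddings array (cupy_darr) inside the clustering model's __call__, and the ValueError comes from dask's fuse_linear_task_spec optimization pass. As an untested assumption (not verified in this issue), that pass appears to be gated by dask's existing optimization.fuse.active config key, so disabling low-level fusion might sidestep the error while the root cause is investigated:

    import dask

    # Untested workaround sketch: ask dask to skip low-level task fusion
    # (the pass that raises "Cannot fuse tasks with multiple outputs").
    # Whether this actually bypasses fuse_linear_task_spec on dask 2024.12.1
    # is an assumption, not something confirmed in this issue.
    dask.config.set({"optimization.fuse.active": False})

    # ... then run SemDedup / compute_chunk_sizes() as before.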

Steps/Code to reproduce bug

  1. Call the SemDedup module with the TCP protocol; it internally invokes the clustering model, which in turn calls compute_chunk_sizes() (a minimal sketch of this pattern follows).
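
A minimal, hedged sketch of the failing pattern (not the exact nemo_curator code path): a dask-CUDA cluster over TCP plus a cupy-backed dask array with unknown chunk sizes, followed by compute_chunk_sizes(). The toy dataframe below is a placeholder for the embeddings dataset, and whether this tiny graph reproduces the exact fusion error is not verified here.

    import cudf
    import dask_cudf
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    # TCP protocol, matching the failing nightly configuration.
    cluster = LocalCUDACluster(protocol="tcp")
    client = Client(cluster)

    # Placeholder column standing in for the SemDedup embeddings data.
    ddf = dask_cudf.from_cudf(cudf.DataFrame({"x": list(range(8))}), npartitions=2)

    # to_dask_array() leaves chunk sizes unknown (nan), so a call to
    # compute_chunk_sizes() is needed; on dask 2024.12.1 this is where graph
    # optimization raises ValueError: Cannot fuse tasks with multiple outputs.
    darr = ddf["x"].to_dask_array()
    darr.compute_chunk_sizes()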

Environment overview

crossfit                           0.0.8
cudf-cu12                          25.2.0a217
cugraph-cu12                       25.2.0a52
cuml-cu12                          25.2.0a36
dask                               2024.12.1
dask-cuda                          25.2.0a13
dask-cudf-cu12                     25.2.0a215
dask-expr                          1.1.21
dask_labextension                  7.0.0
dask-mpi                           2022.4.0
distributed                        2024.12.1
distributed-ucxx-cu12              0.42.0a21
libcudf-cu12                       25.2.0a217
libucx-cu12                        1.17.0.post1
libucxx-cu12                       0.42.0a21
nemo_curator                       0.6.0rc0.dev1
pylibcudf-cu12                     25.2.0a217
pylibcugraph-cu12                  25.2.0a52
raft-dask-cu12                     25.2.0a28
rapids-dask-dependency             25.2.0a9
torch                              2.5.1
ucx-py-cu12                        0.42.0a7
ucxx-cu12                          0.42.0a20
curator 4fb7f54bd5cf44bee7a520b3d40d1a53049f78d4

Environment details

If an NVIDIA docker image is used, you don't need to specify these.
Otherwise, please provide:

  • OS version
  • Dask version
  • Python version

praateekmahajan added the bug label on Jan 1, 2025
@sarahyurick
Collaborator

Thanks for looking more into this; it's very helpful.

@praateekmahajan
Collaborator Author

@sarahyurick I'm not sure this is related to #437 (comment) (or at least I didn't have that in mind). I caught this issue in our nightly runs, where it recently started failing.
