Semantic Dedup fails on dask compute_chunk_sizes (25.02 nightly) #461

Open
praateekmahajan opened this issue Jan 1, 2025 · 2 comments
Labels: bug (Something isn't working)

Comments

@praateekmahajan
Collaborator

Describe the bug

Running SemDedup fails during the clustering step when dask computes chunk sizes for the embeddings array; graph optimization raises ValueError: Cannot fuse tasks with multiple outputs.

    dedup_ids = semdup(dataset)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/semantic_dedup.py", line 647, in __call__
    self.clustering_model(embeddings_dataset)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/semantic_dedup.py", line 351, in __call__
    cupy_darr.compute_chunk_sizes()
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/array/core.py", line 1516, in compute_chunk_sizes
    tuple(int(chunk) for chunk in chunks) for chunks in compute(tuple(c))[0]
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/base.py", line 653, in compute
    dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/base.py", line 422, in collections_to_dsk
    dsk = opt(dsk, keys, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/array/optimization.py", line 39, in optimize
    dsk = fuse_linear_task_spec(dsk, keys=keys)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/_task_spec.py", line 1069, in fuse_linear_task_spec
    result[renamed_key] = Task.fuse(*linear_chain, key=renamed_key)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/_task_spec.py", line 484, in fuse
    raise ValueError(f"Cannot fuse tasks with multiple outputs {leafs}")
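
The failing call is compute_chunk_sizes() on the cupy-backed embeddings array (cupy_darr) inside the clustering model's __call__, and the ValueError comes from dask's fuse_linear_task_spec optimization pass. As an untested assumption (not verified in this issue), that pass appears to be gated by dask's existing optimization.fuse.active config key, so disabling low-level fusion might sidestep the error while the root cause is investigated:

    import dask

    # Untested workaround sketch: ask dask to skip low-level task fusion
    # (the pass that raises "Cannot fuse tasks with multiple outputs").
    # Whether this actually bypasses fuse_linear_task_spec on dask 2024.12.1
    # is an assumption, not something confirmed in this issue.
    dask.config.set({"optimization.fuse.active": False})

    # ... then run SemDedup / compute_chunk_sizes() as before.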

Steps/Code to reproduce bug

  1. Call the SemDedup module with the TCP protocol; it internally invokes the clustering model, which in turn calls compute_chunk_sizes() (a minimal sketch of this pattern follows).
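
A minimal, hedged sketch of the failing pattern (not the exact nemo_curator code path): a dask-CUDA cluster over TCP plus a cupy-backed dask array with unknown chunk sizes, followed by compute_chunk_sizes(). The toy dataframe below is a placeholder for the embeddings dataset, and whether this tiny graph reproduces the exact fusion error is not verified here.

    import cudf
    import dask_cudf
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    # TCP protocol, matching the failing nightly configuration.
    cluster = LocalCUDACluster(protocol="tcp")
    client = Client(cluster)

    # Placeholder column standing in for the SemDedup embeddings data.
    ddf = dask_cudf.from_cudf(cudf.DataFrame({"x": list(range(8))}), npartitions=2)

    # to_dask_array() leaves chunk sizes unknown (nan), so a call to
    # compute_chunk_sizes() is needed; on dask 2024.12.1 this is where graph
    # optimization raises ValueError: Cannot fuse tasks with multiple outputs.
    darr = ddf["x"].to_dask_array()
    darr.compute_chunk_sizes()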

Environment overview

crossfit                           0.0.8
cudf-cu12                          25.2.0a217
cugraph-cu12                       25.2.0a52
cuml-cu12                          25.2.0a36
dask                               2024.12.1
dask-cuda                          25.2.0a13
dask-cudf-cu12                     25.2.0a215
dask-expr                          1.1.21
dask_labextension                  7.0.0
dask-mpi                           2022.4.0
distributed                        2024.12.1
distributed-ucxx-cu12              0.42.0a21
libcudf-cu12                       25.2.0a217
libucx-cu12                        1.17.0.post1
libucxx-cu12                       0.42.0a21
nemo_curator                       0.6.0rc0.dev1
pylibcudf-cu12                     25.2.0a217
pylibcugraph-cu12                  25.2.0a52
raft-dask-cu12                     25.2.0a28
rapids-dask-dependency             25.2.0a9
torch                              2.5.1
ucx-py-cu12                        0.42.0a7
ucxx-cu12                          0.42.0a20
curator 4fb7f54bd5cf44bee7a520b3d40d1a53049f78d4

Environment details

If an NVIDIA docker image is used, you don't need to specify these.
Otherwise, please provide:

  • OS version
  • Dask version
  • Python version

praateekmahajan added the bug label on Jan 1, 2025
@sarahyurick
Collaborator

Thanks for looking more into this; it's very helpful.

@praateekmahajan
Collaborator Author

@sarahyurick I'm not sure this is related to #437 (comment) (or at least I didn't have that in mind). I caught this issue in our nightly runs, where it recently started failing.
