Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forward-merge branch-24.10 into branch-24.12 #16917

Merged
merged 14 commits into from
Sep 26, 2024
Merged

Conversation

rapids-bot[bot]
Copy link

@rapids-bot rapids-bot bot commented Sep 25, 2024

Forward-merge triggered by push to branch-24.10 that creates a PR to keep branch-24.12 up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge. See forward-merger docs for more info.

mroeschke and others added 6 commits September 25, 2024 18:35
Before when `columns=` was a `cudf.Series/Index` we would call `return array.unique.to_pandas()`, but `.unique` is a method not a property so this would have raised an error.

Also took the time to refactor the helper methods here and push down the `errors=` keyword to `Frame._drop_column`

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #16712
This PR is a first pass at #15937. We will close #15937 after #15162 is closed

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16810
Fixes #16625

This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation.  Previously it was doing a linear march over the individual tdigests per group and merging them one by one.  This led to terrible performance for large numbers of groups.  In principle though, all this really was doing was a segmented sort of centroid values. So that's what this PR changes it to.  Speedup for 1,000,000 input tidests with 1,000,000 individual groups is ~1000x,

```
Old
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8
```


```
New
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8
```

Authors:
  - https://github.com/nvdbaranec
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #16780
This PR displays delta's for CPU and GPU usage metrics that are extracted from `cudf.pandas` pytests.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)

URL: #16864
…15979)

Part of #15903.
1. Introduces the Compressed Sparse Row (CSR) format to store the adjacency information of the column tree. 
2. Analogous to `reduce_to_column_tree`, `reduce_to_column_tree_csr` reduces node tree representation to column tree stored in CSR format.

TODO:
- [x] Correctness test

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #15979
This PR is a first pass of making tests deterministic, I noticed one of CI job failed due to an overflow error related to random data generation.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #16910
@rapids-bot rapids-bot bot requested review from a team as code owners September 25, 2024 21:03
@rapids-bot rapids-bot bot requested a review from KyleFromNVIDIA September 25, 2024 21:03
Copy link
Author

rapids-bot bot commented Sep 25, 2024

FAILURE - Unable to forward-merge due to an error, manual merge is necessary. Do not use the Resolve conflicts option in this PR, follow these instructions https://docs.rapids.ai/maintainers/forward-merger/

IMPORTANT: When merging this PR, do not use the auto-merger (i.e. the /merge comment). Instead, an admin must manually merge by changing the merging strategy to Create a Merge Commit. Otherwise, history will be lost and the branches become incompatible.

@rapids-bot rapids-bot bot requested review from wence-, vyasr and mhaseeb123 September 25, 2024 21:03
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. CMake CMake build issue pylibcudf Issues specific to the pylibcudf package labels Sep 25, 2024
@vyasr vyasr mentioned this pull request Sep 25, 2024
3 tasks
This adds Polars to the "all" dependency list, ensuring that devcontainers and developers using the conda environment can use the Polars GPU backend provided by cudf.

Authors:
  - Bradley Dice (https://github.com/bdice)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16875
@rapids-bot rapids-bot bot requested a review from a team as a code owner September 25, 2024 21:21
lithomas1 and others added 3 commits September 25, 2024 21:51
xref #15162

Authors:
  - Thomas Li (https://github.com/lithomas1)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #16042
Contributes to #15162

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16749
…uet` (#16684)

This PR piggybacks on the existing CPU/Arrow Parquet infrastructure in dask-expr. With this PR,

```python
df = dask_cudf.read_parquet(path, filesystem="arrow")
```
will produce a `cudf`-backed collection using PyArrow for IO (i.e. disk->`pa.Table`->`cudf.DataFrame`). Before this PR, passing `filesystem="arrow"` will simply result in an error.

Although this code path is not ideal for fast/local storage, it can be **very** efficient for remote storage (e.g. S3).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Matthew Murray (https://github.com/Matt711)
  - David Wendt (https://github.com/davidwendt)
  - Tianyu Liu (https://github.com/kingcrimsontianyu)
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - https://github.com/brandon-b-miller
  - https://github.com/nvdbaranec

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #16684
mroeschke and others added 2 commits September 25, 2024 22:59
Contributes to #15162

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #16785
Depends on #16836
This change adds a new host tree building algorithms for JSON reader and utf8 field name support.

This constructs the device_column_tree using an adjacency list created from parent information.
This adjacency list is pruned based on input schema, and also types are enforced as per schema. `mark_is_pruned`
Tree is constructed from pruned adjacency list, (with mixed types handling). `construct_tree`

utf8 field name support added: (spark requested)
utf8 decoding of field names during hashing of field nodes so that utf8 encoded field names also match to same column.

All unit tests passes, 1 unit test added where old algorithm fails.
This code is kept under experimental flag.

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #16545
@rapids-bot rapids-bot bot requested a review from a team as a code owner September 26, 2024 00:13
@github-actions github-actions bot added the Java Affects Java cuDF API. label Sep 26, 2024
Contributes to #15162

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16771
Contributes to #15162

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Matthew Murray (https://github.com/Matt711)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #16781
@raydouglass raydouglass merged commit 4ef6d1f into branch-24.12 Sep 26, 2024
199 of 200 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

9 participants