[RELEASE] cudf v23.10 #14224

Merged
merged 245 commits into from
Oct 11, 2023


raydouglass

❄️ Code freeze for branch-23.10 and v23.10 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-23.10 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-23.10 into main for the release

raydouglass and others added 30 commits July 20, 2023 16:29
Forward-merge branch-23.08 to branch-23.10
This PR enforces in `23.10` the deprecations that were carried through `23.08`. It removes `strings_to_categorical` parameter support in `read_parquet`.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Bradley Dice (https://github.com/bdice)

URL: #13732
…3729)

draft

This PR adds additional numeric dtypes to `GroupBy.apply` with `engine='jit'`.

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #13729
Forward-merge branch-23.08 to branch-23.10
Forward-merge branch-23.08 to branch-23.10
Forward-merge branch-23.08 to branch-23.10
Forward-merge branch-23.08 to branch-23.10
Forward-merge branch-23.08 to branch-23.10
`make_unique` in Cython's libcpp headers is not annotated with `except +`. As a consequence, if the constructor throws, we do not catch it in Python. To work around this (see cython/cython#5560 for details), provide our own implementation.

Due to the way assignments occur to temporaries, we need to now explicitly wrap all calls to `make_unique` in `move`, but that is arguably preferable to not being able to catch exceptions, and will not be necessary once we move to Cython 3.

- Closes #13743

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #13746
Checking for boolean values in a range results in incorrect behavior:
```python
In [1]: True in range(0, 0)
Out[1]: False


In [3]: True in range(0, 2)
Out[3]: True
```

This results in the following bug:
```python
In [23]: s = cudf.Series([True, False])

In [24]: s[0]
Out[24]: True

In [25]: type(s[0])
Out[25]: numpy.bool_

In [26]: True in s
Out[26]: True

In [27]: True in s.to_pandas()
Out[27]: False
```

This PR fixes the issue by checking that the value passed to `RangeIndex.__contains__` is actually an integer.
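The root cause is that Python's `bool` is a subclass of `int`, so a plain `range` membership test treats `True` as `1`. The following is a minimal sketch of the kind of guard described above, in plain Python. `range_contains` is a hypothetical helper used only for illustration, not the cuDF implementation, and it simplifies by rejecting every non-`int` key.

```python
def range_contains(rng: range, item: object) -> bool:
    """Membership test that refuses to treat booleans as integers.

    Hypothetical helper for illustration; not the cuDF implementation.
    Simplification: any key that is not a Python int is rejected.
    """
    # bool is a subclass of int, so it has to be excluded explicitly.
    if isinstance(item, bool) or not isinstance(item, int):
        return False
    return item in rng


print(True in range(0, 2))                 # True: plain range sees True as 1
print(range_contains(range(0, 2), True))   # False: the guard rejects booleans
print(range_contains(range(0, 2), 1))      # True: real integers still match
```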

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #13779
We currently allow construction of a series from mixed-dtype data by casting the values to a common type, as shown below:
```python

In [1]: import cudf

In [2]: import pandas as pd

In [3]: s = pd.Series([1, 2, 3], dtype='datetime64[ns]')


In [5]: p = pd.Series([10, 11])

In [6]: new_s = pd.concat([s, p])

In [7]: new_s
Out[7]: 
0    1970-01-01 00:00:00.000000001
1    1970-01-01 00:00:00.000000002
2    1970-01-01 00:00:00.000000003
0                               10
1                               11
dtype: object

In [8]: cudf.Series(new_s)
Out[8]: 
0   1970-01-01 00:00:00.000000
1   1970-01-01 00:00:00.000000
2   1970-01-01 00:00:00.000000
0   1970-01-01 00:00:00.000010
1   1970-01-01 00:00:00.000011
dtype: datetime64[us]
```
This behavior is incorrect, and it comes from the `pa.array` constructor. This PR ensures such cases are handled properly and raise an error.
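As a rough illustration of "raise instead of silently casting", the sketch below rejects input whose non-null values have more than one Python type. `ensure_single_type`, the choice of `TypeError`, and the message text are assumptions made for this example; the actual check and error in cuDF may differ.

```python
from datetime import datetime


def ensure_single_type(values):
    """Raise TypeError if the non-null values have more than one Python type.

    Hypothetical helper; the exception type and message are illustrative only.
    """
    kinds = {type(v) for v in values if v is not None}
    if len(kinds) > 1:
        names = sorted(k.__name__ for k in kinds)
        raise TypeError(f"Cannot create column with mixed types: {names}")
    return list(values)


mixed = [datetime(1970, 1, 1), datetime(1970, 1, 2), 10, 11]
try:
    ensure_single_type(mixed)
except TypeError as err:
    print(err)  # the mixed input is rejected rather than cast
```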

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Bradley Dice (https://github.com/bdice)

URL: #13768
Applying the unary negation operator to a boolean series converts it to an `int` type:
```python
In [1]: import cudf

In [2]: s = cudf.Series([True, False])

In [3]: s
Out[3]: 
0     True
1    False
dtype: bool

In [4]: -s
Out[4]: 
0   -1
1    0
dtype: int8

In [5]: -s.to_pandas()
Out[5]: 
0    False
1     True
dtype: bool
```
This PR fixes the above issue by returning the inverse of the boolean column instead of multiplying it by `-1`.
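The difference between the two strategies can be seen with plain Python lists. This is only a sketch of the idea; `negate` is a hypothetical function and does not reflect how cuDF columns are implemented.

```python
def negate(values):
    """Unary minus over a homogeneous list of bools or numbers.

    Hypothetical helper for illustration only.
    """
    if all(isinstance(v, bool) for v in values):
        # Boolean input: return the logical inverse and keep bool values.
        return [not v for v in values]
    # Numeric input: ordinary arithmetic negation.
    return [-v for v in values]


print(negate([True, False]))             # [False, True]
print([v * -1 for v in [True, False]])   # [-1, 0], the old integer result
```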

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13780
This PR preserves column names in various APIs by retaining `self._data._level_names` and also calculating when to preserve the column names.
Fixes: #13741, #13740

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ashwin Srinath (https://github.com/shwina)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13772
…13766)

If a columns argument is provided to the dataframe constructor, this should be used to select columns from the provided data dictionary. The previous logic did do this correctly, but didn't preserve the appropriate order of the resulting columns (which should come out in the order that the column selection is in).
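A minimal sketch of the ordering rule, using plain dictionaries in place of dataframes. `select_columns` is a hypothetical helper, and names in `columns` that are absent from the data are simply dropped here for brevity, which is an assumption of this example rather than a statement about cuDF or pandas behavior.

```python
def select_columns(data: dict, columns: list) -> dict:
    """Return the entries of `data` named in `columns`, in `columns` order.

    Hypothetical helper; missing names are skipped to keep the sketch short.
    """
    # Iterating over `columns` (not over `data`) is what fixes the ordering.
    return {name: data[name] for name in columns if name in data}


data = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
print(list(select_columns(data, ["c", "a"])))  # ['c', 'a']
```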

- Closes #13738

Authors:
  - Lawrence Mitchell (https://github.com/wence-)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #13766
This PR fixes various binary operations on dataframes and series where, for columns of certain dtypes, the result had incorrect values, incorrect column types, or missing columns altogether.
When pandas compatibility mode is enabled, this PR also makes the column ordering of binary op results match pandas.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)

URL: #13778
Forward-merge branch-23.08 to branch-23.10
This PR upgrades arrow version in `cudf` to `12.0.1`

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)

URL: #13728
… other types (#13786)

This PR fixes an issue when trying to merge a `datetime`|`timedelta` type column with a column of another type:
```python

In [1]: import cudf

In [2]: import pandas as pd

In [3]: df = cudf.DataFrame({'a': cudf.Series([10, 20, 30], dtype='datetime64[ns]')})

In [4]: df2 = df.astype('int')

In [5]: df.merge(df2)
Out[5]: 
      a
0  10.0
1  20.0
2  30.0

In [6]: df2.merge(df)
Out[6]: 
      a
0  10.0
1  20.0
2  30.0

In [7]: df
Out[7]: 
                              a
0 1970-01-01 00:00:00.000000010
1 1970-01-01 00:00:00.000000020
2 1970-01-01 00:00:00.000000030

In [8]: df2
Out[8]: 
    a
0  10
1  20
2  30

In [9]: df.to_pandas().merge(df2.to_pandas())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 df.to_pandas().merge(df2.to_pandas())

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/frame.py:10092, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
  10073 @substitution("")
  10074 @appender(_merge_doc, indents=2)
  10075 def merge(
   (...)
  10088     validate: str | None = None,
  10089 ) -> DataFrame:
  10090     from pandas.core.reshape.merge import merge
> 10092     return merge(
  10093         self,
  10094         right,
  10095         how=how,
  10096         on=on,
  10097         left_on=left_on,
  10098         right_on=right_on,
  10099         left_index=left_index,
  10100         right_index=right_index,
  10101         sort=sort,
  10102         suffixes=suffixes,
  10103         copy=copy,
  10104         indicator=indicator,
  10105         validate=validate,
  10106     )

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:110, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     93 @substitution("\nleft : DataFrame or named Series")
     94 @appender(_merge_doc, indents=0)
     95 def merge(
   (...)
    108     validate: str | None = None,
    109 ) -> DataFrame:
--> 110     op = _MergeOperation(
    111         left,
    112         right,
    113         how=how,
    114         on=on,
    115         left_on=left_on,
    116         right_on=right_on,
    117         left_index=left_index,
    118         right_index=right_index,
    119         sort=sort,
    120         suffixes=suffixes,
    121         indicator=indicator,
    122         validate=validate,
    123     )
    124     return op.get_result(copy=copy)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:707, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate)
    699 (
    700     self.left_join_keys,
    701     self.right_join_keys,
    702     self.join_names,
    703 ) = self._get_merge_keys()
    705 # validate the merge keys dtypes. We may need to coerce
    706 # to avoid incompatible dtypes
--> 707 self._maybe_coerce_merge_keys()
    709 # If argument passed to validate,
    710 # check if columns specified as unique
    711 # are in fact unique.
    712 if validate is not None:

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:1344, in _MergeOperation._maybe_coerce_merge_keys(self)
   1342 # datetimelikes must match exactly
   1343 elif needs_i8_conversion(lk.dtype) and not needs_i8_conversion(rk.dtype):
-> 1344     raise ValueError(msg)
   1345 elif not needs_i8_conversion(lk.dtype) and needs_i8_conversion(rk.dtype):
   1346     raise ValueError(msg)

ValueError: You are trying to merge on datetime64[ns] and int64 columns. If you wish to proceed you should use pd.concat
```
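The pandas check visible in the traceback rejects a merge when exactly one side of the key pair is datetime-like. A simplified sketch of that rule follows, using dtype names as strings. `is_datetime_like` and `check_merge_keys` are hypothetical helpers that cover only the two branches shown above, not the full pandas or cuDF logic.

```python
DATETIME_LIKE_PREFIXES = ("datetime64", "timedelta64")


def is_datetime_like(dtype: str) -> bool:
    """True for dtype names such as 'datetime64[ns]' or 'timedelta64[ns]'."""
    return dtype.startswith(DATETIME_LIKE_PREFIXES)


def check_merge_keys(left_dtype: str, right_dtype: str) -> None:
    """Raise ValueError when only one of the two key dtypes is datetime-like.

    Hypothetical helper mirroring the two branches in the traceback.
    """
    if is_datetime_like(left_dtype) != is_datetime_like(right_dtype):
        raise ValueError(
            f"You are trying to merge on {left_dtype} and {right_dtype} columns."
        )


check_merge_keys("int64", "int64")  # compatible keys pass silently
try:
    check_merge_keys("datetime64[ns]", "int64")
except ValueError as err:
    print(err)
```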

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #13786
…ter (#13791)

Adds JSON reader and writer to the list of components that support GDS.
Updates the supported data types in JSON reader and writer.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #13791
Arrow 12.0 uses the vendored CMake target `arrow::xsimd` instead of the global target name of `xsimd`. We need to use the new name so that libcudf can be used from the build directory by other projects.

Found by issue: NVIDIA/spark-rapids-jni#1306

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #13790
spark-rapids has [code for debugging JNI Tables/Columns][1] that is useful for debugging during dev work in cudf.

This PR proposes to start moving it to cudf/java. spark-rapids will be updated to call into cudf in a follow-up PR.

[1]: https://github.com/NVIDIA/spark-rapids/blob/b5cf25eef347d845bd77077d5cb9035262281f98/sql-plugin/src/main/java/com/nvidia/spark/rapids/GpuColumnVector.java

## Sample Usage with JShell
```Bash
(rapids) rapids@compose:~/cudf/java$ mvn dependency:build-classpath -Dmdep.outputFile=cudf-java-cp.txt
(rapids) rapids@compose:~/cudf/java$ jshell --class-path target/cudf-23.10.0-SNAPSHOT-cuda12.jar:$(< cudf-java-cp.txt) \
    -R -Dai.rapids.cudf.debug.output=log_error
```
```Java
|  Welcome to JShell -- Version 11.0.20
|  For an introduction type: /help intro

jshell> import ai.rapids.cudf.*;

jshell> Table tbl = new Table.TestBuilder().column(1,2,3,4,5,6).build()
tbl ==> Table{columns=[ColumnVector{rows=6, type=INT32, n ... e=140381937458144, rows=6}

jshell> TableDebug.get().debug("gera", tbl)
[main] ERROR ai.rapids.cudf.TableDebug - DEBUG gera Table{columns=[ColumnVector{rows=6, type=INT32, nullCount=Optional[0], offHeap=(ID: 4 7fad371d1a30)}], cudfTable=140381937458144, rows=6}
[main] ERROR ai.rapids.cudf.TableDebug - GPU COLUMN 0 - NC: 0 DATA: DeviceMemoryBufferView{address=0x7fad3be00000, length=24, id=-1} VAL: null
[main] ERROR ai.rapids.cudf.TableDebug - COLUMN 0 - INT32
[main] ERROR ai.rapids.cudf.TableDebug - 0 1
[main] ERROR ai.rapids.cudf.TableDebug - 1 2
[main] ERROR ai.rapids.cudf.TableDebug - 2 3
[main] ERROR ai.rapids.cudf.TableDebug - 3 4
[main] ERROR ai.rapids.cudf.TableDebug - 4 5
[main] ERROR ai.rapids.cudf.TableDebug - 5 6
```

Authors:
  - Gera Shegalov (https://github.com/gerashegalov)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13783
This PR removes some extra stores and loads that don't appear to be necessary in our groupby apply lowering which are possibly slowing things down. This came up during #13767.

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #13792
This PR enables computing the Pearson correlation between two columns of a group within a UDF. Concretely, syntax such as the following will be allowed and produce the same result as pandas:

```python
ans = df.groupby('key').apply(lambda group_df: group_df['x'].corr(group_df['y']))
```
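For reference, the quantity computed per group is the usual Pearson coefficient. The sketch below spells it out in plain Python over `(key, x, y)` rows. `pearson` and `grouped_corr` are hypothetical names, and the code says nothing about how the JIT engine evaluates the UDF.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def grouped_corr(rows):
    """Map each key to the Pearson correlation of its x and y values."""
    groups = defaultdict(lambda: ([], []))
    for key, x, y in rows:
        groups[key][0].append(x)
        groups[key][1].append(y)
    return {key: pearson(xs, ys) for key, (xs, ys) in groups.items()}


rows = [("a", 1, 2), ("a", 2, 4), ("a", 3, 6),
        ("b", 1, 3), ("b", 2, 2), ("b", 3, 1)]
print(grouped_corr(rows))  # group "a" is perfectly correlated, "b" inversely
```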

Authors:
  - Ashwin Srinath (https://github.com/shwina)
  - https://github.com/brandon-b-miller

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #13767
Fixes: #13049 
This PR allows errors from pyarrow to be propagated when an unbounded sequence is passed to the `pa.array` constructor.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #13799
karthikeyann and others added 8 commits September 27, 2023 03:59
… with mask (#14201)

Workaround for an illegal instruction error on sm90 when warp intrinsics are used with a mask other than `0xffffffff`.
Removed the mask and used `~0u` (`0xffffffff`) as MASK because
- all threads in the warp have correct data on error, since the threads with `is_within_bounds==true` update the error.
- `init_state` is not required at the last iteration, which is the only place where MASK is not `~0u`.

Fixes #14183

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Divye Gala (https://github.com/divyegala)
  - Elias Stehle (https://github.com/elstehle)
  - Mark Harris (https://github.com/harrism)

URL: #14201
This adds two more aggregations for groupby and reduction:
 * `HISTOGRAM`: Count the number of occurrences (aka frequency) for each element, and
 * `MERGE_HISTOGRAM`: Merge different outputs generated by `HISTOGRAM` aggregations

This is the prerequisite for implementing the exact distributed percentile aggregation (#13885). However, these two new aggregations may be useful in other use-cases that need to do frequency counting.
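To make the connection to distributed percentiles concrete, here is a small sketch using `collections.Counter`: each partition produces a histogram, the histograms are merged, and a percentile is read off the merged counts. The function names are hypothetical, and the nearest-rank percentile definition is an assumption of this example, not necessarily the definition used by the follow-up work.

```python
import math
from collections import Counter


def histogram(values):
    """Count occurrences of each element (analogous to HISTOGRAM)."""
    return Counter(values)


def merge_histograms(histograms):
    """Sum the counts of several histograms (analogous to MERGE_HISTOGRAM)."""
    merged = Counter()
    for hist in histograms:
        merged.update(hist)
    return merged


def percentile_from_histogram(hist, q):
    """Nearest-rank percentile for 0 < q <= 1, computed from value counts."""
    total = sum(hist.values())
    rank = max(1, math.ceil(q * total))
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value
    raise ValueError("empty histogram")


partitions = [[1, 2, 2, 3], [2, 3, 3, 9]]
merged = merge_histograms(histogram(p) for p in partitions)
print(percentile_from_histogram(merged, 0.5))  # 2
```

Because only value counts cross partition boundaries, the merged result is exact regardless of how the rows were split.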

Closes #13885.

Merging checklist:
 * [X] Working prototypes.
 * [X] Cleanup and docs.
 * [X]  Unit test.
 * [ ] Test with spark-rapids integration tests.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #14045
…12.2 (#14108)

Compile issues found by compiling libcudf with the `rapidsai/devcontainers:23.10-cpp-gcc9-cuda12.2-ubuntu20.04` docker container.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #14108
This PR adds a method to the ColumnView class to allow conversion from integers to hex.
Closes #14081

Authors:
  - Raza Jafri (https://github.com/razajafri)

Approvers:
  - Kuhu Shukla (https://github.com/kuhushukla)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #14205
This implements JNI for `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations in both groupby and reduction.

Depends on:
 * #14045

Contributes to:
 * #13885.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #14154
Fixes: #14088 

This PR preserves the `names` of the `column` object when constructing a `DataFrame` through the various constructor flows.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ashwin Srinath (https://github.com/shwina)

URL: #14110
Pass the error code to the host when a kernel detects invalid input.
If multiple error types are detected, they are combined using a bitwise OR so that the caller gets an aggregate error code that includes all types of errors that occurred.

Does not change the kernel side checks.
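The aggregation scheme can be sketched with Python's `IntFlag`, where each error type owns one bit. The `KernelError` members below are hypothetical placeholders; the real error codes are not listed in this description.

```python
from enum import IntFlag


class KernelError(IntFlag):
    """Hypothetical error bits, one per error type (illustrative only)."""
    NONE = 0
    ERROR_A = 1
    ERROR_B = 2
    ERROR_C = 4


def combine_errors(errors):
    """Fold individual error codes into one aggregate code with bitwise OR."""
    combined = KernelError.NONE
    for err in errors:
        combined |= err
    return combined


code = combine_errors([KernelError.ERROR_A, KernelError.NONE, KernelError.ERROR_C])
print(int(code))                    # 5
print(KernelError.ERROR_C in code)  # True: the caller can test each bit
print(KernelError.ERROR_B in code)  # False
```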

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - https://github.com/nvdbaranec
  - Divye Gala (https://github.com/divyegala)
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: #14167
Previously, the parquet chunked reader operated by controlling the size of output chunks only.  It would still ingest the entire input file and decompress it, which can take up a considerable amount of memory.  With this new 'progressive' support, we also 'chunk' at the input level.  Specifically, the user can pass a `pass_read_limit` value which controls how much memory is used for storing compressed/decompressed data.  The reader will make multiple passes over the file, reading as many row groups as it can to attempt to fit within this limit.  Within each pass, chunks are emitted as before. 

From the external user's perspective, the chunked read mechanism is the same.  You call `has_next()` and `read_chunk()`.  If the user has specified a value for `pass_read_limit` the set of chunks produced might end up being different (although the concatenation of all of them will still be the same). 

The core idea of the code change is to add the idea of the internal `pass`.  Previously we had a `file_intermediate_data` which held data across `read_chunk()` calls.   There is now a `pass_intermediate_data` struct which holds information specific to a given pass.  Many of the invariant things from the file level before (row groups and chunks to process) are now stored in the pass intermediate data.  As we begin each pass, we take the subset of global row groups and chunks that we are going to process for this pass, copy them to our intermediate data, and the remainder of the reader references this instead of the file-level data. 

In order to avoid breaking pre-existing interfaces, there's a new constructor for the `chunked_parquet_reader` class:

```
  chunked_parquet_reader(
    std::size_t chunk_read_limit,
    std::size_t pass_read_limit,
    parquet_reader_options const& options,
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
```
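The two-level chunking described above can be modeled in a few lines of Python. This is a conceptual sketch only: `plan_passes` and `read_chunks` are hypothetical names, row counts stand in for byte sizes, the greedy grouping is an assumption, and none of it mirrors the actual libcudf implementation.

```python
def plan_passes(row_group_sizes, pass_read_limit):
    """Greedily group consecutive row groups so each pass fits the limit.

    A single row group larger than the limit still gets a pass of its own.
    """
    passes, current, used = [], [], 0
    for index, size in enumerate(row_group_sizes):
        if current and used + size > pass_read_limit:
            passes.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        passes.append(current)
    return passes


def read_chunks(row_groups, pass_read_limit, chunk_rows):
    """Yield output chunks while holding only one pass's rows at a time."""
    sizes = [len(group) for group in row_groups]
    for pass_indices in plan_passes(sizes, pass_read_limit):
        # Only the row groups of the current pass are "loaded".
        loaded = [row for i in pass_indices for row in row_groups[i]]
        for start in range(0, len(loaded), chunk_rows):
            yield loaded[start:start + chunk_rows]


row_groups = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(list(read_chunks(row_groups, pass_read_limit=5, chunk_rows=4)))
# [[1, 2, 3, 4], [5], [6, 7, 8, 9]]
print(list(read_chunks(row_groups, pass_read_limit=100, chunk_rows=4)))
# [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

The chunk boundaries differ between the two limits, but concatenating the chunks yields the same rows in both cases.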

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #14079
@copy-pr-bot

copy-pr-bot bot commented Sep 28, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. CMake CMake build issue conda Java Affects Java cuDF API. labels Sep 28, 2023
galipremsagar and others added 4 commits September 28, 2023 13:16
This PR pins `dask` and `distributed` to `2023.9.2` for `23.10` release.

Authors:
   - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
   - Ray Douglass (https://github.com/raydouglass)
   - Peter Andreas Entschev (https://github.com/pentschev)
Fixes a bug where floating-point values were used in decimal128 rounding, giving wrong results.

Closes #14210.

Authors:
   - Bradley Dice (https://github.com/bdice)

Approvers:
   - Divye Gala (https://github.com/divyegala)
   - Mark Harris (https://github.com/harrism)
…nt values. (#14242)

This is a follow-up PR to #14233. This PR fixes a bug where floating-point values were used as intermediates in ceil/floor unary operations and cast operations that require rescaling for fixed-point types, giving inaccurate results.
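Why a floating-point intermediate is a problem for wide fixed-point types can be shown with Python integers, which are arbitrary precision. In the sketch below, `rescale_via_float` and `rescale_via_int` are hypothetical helpers that drop decimal digits from an unscaled value; they illustrate the precision loss in general and are not the libcudf code.

```python
def rescale_via_float(unscaled: int, drop_digits: int) -> int:
    """Inexact pattern: route the value through a 64-bit float."""
    return int(float(unscaled) / 10 ** drop_digits)


def rescale_via_int(unscaled: int, drop_digits: int) -> int:
    """Exact pattern: stay in integer arithmetic (floor division)."""
    return unscaled // 10 ** drop_digits


# This value needs more than the 53 bits of precision a double can hold.
value = 12345678901234567890123
print(rescale_via_int(value, 1))    # 1234567890123456789012
print(rescale_via_float(value, 1))  # differs in the low-order digits
```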

See also:
- #14233 (comment)
- #14243

Authors:
   - Bradley Dice (https://github.com/bdice)

Approvers:
   - Mike Wilson (https://github.com/hyperbolic2346)
   - Vukasin Milovanovic (https://github.com/vuule)
@raydouglass raydouglass merged commit 562f70e into main Oct 11, 2023
Labels
CMake CMake build issue Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.