[RELEASE] cudf v23.10 #14224
Merged
Forward-merge branch-23.08 to branch-23.10
This PR enforces, in `23.10`, the removal of code that was deprecated up to `23.08`. This PR removes `strings_to_categorical` parameter support in `read_parquet`. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - Richard (Rick) Zamora (https://github.com/rjzamora) - Bradley Dice (https://github.com/bdice) URL: #13732
…3729) draft This PR adds additional numeric dtypes to `GroupBy.apply` with `engine='jit'`. Authors: - https://github.com/brandon-b-miller Approvers: - Bradley Dice (https://github.com/bdice) URL: #13729
Branch 23.10 merge 23.08
Forward-merge branch-23.08 to branch-23.10
Forward-merge branch-23.08 to branch-23.10
Forward-merge branch-23.08 to branch-23.10
Forward-merge branch-23.08 to branch-23.10
Forward-merge branch-23.08 to branch-23.10
`make_unique` in Cython's libcpp headers is not annotated with `except +`. As a consequence, if the constructor throws, we do not catch it in Python. To work around this (see cython/cython#5560 for details), provide our own implementation. Due to the way assignments occur to temporaries, we need to now explicitly wrap all calls to `make_unique` in `move`, but that is arguably preferable to not being able to catch exceptions, and will not be necessary once we move to Cython 3. - Closes #13743 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #13746
Branch 23.10 merge 23.08
Checking for boolean values in a range results in incorrect behavior:

```python
In [1]: True in range(0, 0)
Out[1]: False

In [3]: True in range(0, 2)
Out[3]: True
```

This results in the following bug:

```python
In [23]: s = cudf.Series([True, False])

In [24]: s[0]
Out[24]: True

In [25]: type(s[0])
Out[25]: numpy.bool_

In [26]: True in s
Out[26]: True

In [26]: True in s.to_pandas()
Out[26]: False
```

This PR fixes this issue by properly checking whether an integer is passed to `RangeIndex.__contains__`.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Matthew Roeschke (https://github.com/mroeschke)

URL: #13779
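The underlying cause is that `bool` is a subclass of `int` in Python, so a plain range membership test treats `True` as `1`. A minimal pure-Python sketch of the kind of guard described above follows; the helper name `range_contains` is hypothetical, and this is not the cuDF implementation (it also ignores integral floats, which is a simplification):

```python
def range_contains(rng: range, item) -> bool:
    """Membership test that only accepts genuine (non-bool) integers."""
    # bool is a subclass of int, so it has to be excluded explicitly.
    if isinstance(item, bool) or not isinstance(item, int):
        return False
    return item in rng


# Plain Python treats True as 1, so this membership test succeeds.
print(True in range(0, 2))
# The guarded check rejects the boolean but still accepts the integer 1.
print(range_contains(range(0, 2), True))
print(range_contains(range(0, 2), 1))
```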
We currently allow construction from mixed-dtype data by casting the values to a common type, as below:

```python
In [1]: import cudf

In [2]: import pandas as pd

In [3]: s = pd.Series([1, 2, 3], dtype='datetime64[ns]')

In [5]: p = pd.Series([10, 11])

In [6]: new_s = pd.concat([s, p])

In [7]: new_s
Out[7]:
0    1970-01-01 00:00:00.000000001
1    1970-01-01 00:00:00.000000002
2    1970-01-01 00:00:00.000000003
0                               10
1                               11
dtype: object

In [8]: cudf.Series(new_s)
Out[8]:
0   1970-01-01 00:00:00.000000
1   1970-01-01 00:00:00.000000
2   1970-01-01 00:00:00.000000
0   1970-01-01 00:00:00.000010
1   1970-01-01 00:00:00.000011
dtype: datetime64[us]
```

This behavior is incorrect, and it comes from the `pa.array` constructor. This PR ensures we handle such cases properly and raise an error.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Matthew Roeschke (https://github.com/mroeschke)
- Bradley Dice (https://github.com/bdice)

URL: #13768
The negative unary op on a boolean series results in conversion to an `int` type:

```python
In [1]: import cudf

In [2]: s = cudf.Series([True, False])

In [3]: s
Out[3]:
0     True
1    False
dtype: bool

In [4]: -s
Out[4]:
0   -1
1    0
dtype: int8

In [5]: -s.to_pandas()
Out[5]:
0    False
1     True
dtype: bool
```

This PR fixes the above issue by returning the inversion of the boolean column instead of multiplying by `-1`.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Lawrence Mitchell (https://github.com/wence-)

URL: #13780
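The difference between the two behaviors can be shown without a GPU. The sketch below is a pure-Python illustration of "invert booleans instead of multiplying by `-1`"; the helper `negate` is hypothetical and operates on plain lists, not on cuDF columns:

```python
def negate(values):
    """Unary minus: logical inversion for all-bool input, arithmetic negation otherwise."""
    if all(isinstance(v, bool) for v in values):
        # Matches the pandas result shown above: -[True, False] -> [False, True]
        return [not v for v in values]
    return [-v for v in values]


# Arithmetic negation of a Python bool yields an int, which is the old behavior.
print(-True)
print(negate([True, False]))
print(negate([1, -2]))
```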
This PR preserves column names in various APIs by retaining `self._data._level_names` and also calculating when to preserve the column names. Fixes: #13741, #13740 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) - Lawrence Mitchell (https://github.com/wence-) URL: #13772
…13766) If a columns argument is provided to the dataframe constructor, this should be used to select columns from the provided data dictionary. The previous logic did do this correctly, but didn't preserve the appropriate order of the resulting columns (which should come out in the order that the column selection is in). - Closes #13738 Authors: - Lawrence Mitchell (https://github.com/wence-) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #13766
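The ordering rule described here can be sketched with plain dictionaries. This is only an illustration under simplifying assumptions: the helper `select_columns` is hypothetical, and names in `columns` that are missing from `data` are simply skipped rather than handled the way a real dataframe constructor would.

```python
def select_columns(data: dict, columns: list) -> dict:
    """Select entries from `data`, emitting them in the order given by `columns`."""
    # Iterating over `columns` (not over `data`) is what fixes the output order.
    return {name: data[name] for name in columns if name in data}


data = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
selected = select_columns(data, ["c", "a"])
print(list(selected))
```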
This PR fixes various cases in binary operations where columns are of certain dtypes and the binary operations on those dataframes and series don't yield correct results, correct resulting column types, or have missing columns altogether. This PR also introduces ensuring column ordering to match pandas binary ops column ordering when pandas compatibility mode is enabled. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Vyas Ramasubramani (https://github.com/vyasr) - Bradley Dice (https://github.com/bdice) URL: #13778
Forward-merge branch-23.08 to branch-23.10
This PR upgrades arrow version in `cudf` to `12.0.1` Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - Ray Douglass (https://github.com/raydouglass) URL: #13728
… other types (#13786)

This PR fixes an issue when trying to merge a `datetime`|`timedelta` type column with another type:

```python
In [1]: import cudf

In [2]: import pandas as pd

In [3]: df = cudf.DataFrame({'a': cudf.Series([10, 20, 30], dtype='datetime64[ns]')})

In [4]: df2 = df.astype('int')

In [5]: df.merge(df2)
Out[5]:
      a
0  10.0
1  20.0
2  30.0

In [6]: df2.merge(df)
Out[6]:
      a
0  10.0
1  20.0
2  30.0

In [7]: df
Out[7]:
                              a
0 1970-01-01 00:00:00.000000010
1 1970-01-01 00:00:00.000000020
2 1970-01-01 00:00:00.000000030

In [8]: df2
Out[8]:
    a
0  10
1  20
2  30

In [9]: df.to_pandas().merge(df2.to_pandas())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 df.to_pandas().merge(df2.to_pandas())

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/frame.py:10092, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
  10073 @substitution("")
  10074 @appender(_merge_doc, indents=2)
  10075 def merge(
   (...)
  10088     validate: str | None = None,
  10089 ) -> DataFrame:
  10090     from pandas.core.reshape.merge import merge
> 10092     return merge(
  10093         self,
  10094         right,
  10095         how=how,
  10096         on=on,
  10097         left_on=left_on,
  10098         right_on=right_on,
  10099         left_index=left_index,
  10100         right_index=right_index,
  10101         sort=sort,
  10102         suffixes=suffixes,
  10103         copy=copy,
  10104         indicator=indicator,
  10105         validate=validate,
  10106     )

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:110, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     93 @substitution("\nleft : DataFrame or named Series")
     94 @appender(_merge_doc, indents=0)
     95 def merge(
   (...)
    108     validate: str | None = None,
    109 ) -> DataFrame:
--> 110     op = _MergeOperation(
    111         left,
    112         right,
    113         how=how,
    114         on=on,
    115         left_on=left_on,
    116         right_on=right_on,
    117         left_index=left_index,
    118         right_index=right_index,
    119         sort=sort,
    120         suffixes=suffixes,
    121         indicator=indicator,
    122         validate=validate,
    123     )
    124     return op.get_result(copy=copy)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:707, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate)
    699 (
    700     self.left_join_keys,
    701     self.right_join_keys,
    702     self.join_names,
    703 ) = self._get_merge_keys()
    705 # validate the merge keys dtypes. We may need to coerce
    706 # to avoid incompatible dtypes
--> 707 self._maybe_coerce_merge_keys()
    709 # If argument passed to validate,
    710 # check if columns specified as unique
    711 # are in fact unique.
    712 if validate is not None:

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:1344, in _MergeOperation._maybe_coerce_merge_keys(self)
   1342 # datetimelikes must match exactly
   1343 elif needs_i8_conversion(lk.dtype) and not needs_i8_conversion(rk.dtype):
-> 1344     raise ValueError(msg)
   1345 elif not needs_i8_conversion(lk.dtype) and needs_i8_conversion(rk.dtype):
   1346     raise ValueError(msg)

ValueError: You are trying to merge on datetime64[ns] and int64 columns. If you wish to proceed you should use pd.concat
```

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: #13786
…ter (#13791) Adds JSON reader and writer to the list of components that support GDS. Updates the supported data types in JSON reader and writer. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Bradley Dice (https://github.com/bdice) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #13791
Arrow 12.0 uses the vendored CMake target `arrow::xsimd` instead of the global target name of `xsimd`. We need to use the new name so that libcudf can be used from the build directory by other projects. Found by issue: NVIDIA/spark-rapids-jni#1306 Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Bradley Dice (https://github.com/bdice) URL: #13790
spark-rapids has [code for debugging JNI Tables/Columns][1] that is useful for debugging during dev work in cudf. This PR proposes to start moving it to cudf/java. spark-rapids will be updated to call into cudf in a follow-up PR.

[1]: https://github.com/NVIDIA/spark-rapids/blob/b5cf25eef347d845bd77077d5cb9035262281f98/sql-plugin/src/main/java/com/nvidia/spark/rapids/GpuColumnVector.java

## Sample Usage with JShell

```Bash
(rapids) rapids@compose:~/cudf/java$ mvn dependency:build-classpath -Dmdep.outputFile=cudf-java-cp.txt
(rapids) rapids@compose:~/cudf/java$ jshell --class-path target/cudf-23.10.0-SNAPSHOT-cuda12.jar:$(< cudf-java-cp.txt) \
  -R -Dai.rapids.cudf.debug.output=log_error
```

```Java
|  Welcome to JShell -- Version 11.0.20
|  For an introduction type: /help intro

jshell> import ai.rapids.cudf.*;

jshell> Table tbl = new Table.TestBuilder().column(1,2,3,4,5,6).build()
tbl ==> Table{columns=[ColumnVector{rows=6, type=INT32, n ... e=140381937458144, rows=6}

jshell> TableDebug.get().debug("gera", tbl)
[main] ERROR ai.rapids.cudf.TableDebug - DEBUG gera Table{columns=[ColumnVector{rows=6, type=INT32, nullCount=Optional[0], offHeap=(ID: 4 7fad371d1a30)}], cudfTable=140381937458144, rows=6}
[main] ERROR ai.rapids.cudf.TableDebug - GPU COLUMN 0 - NC: 0 DATA: DeviceMemoryBufferView{address=0x7fad3be00000, length=24, id=-1} VAL: null
[main] ERROR ai.rapids.cudf.TableDebug - COLUMN 0 - INT32
[main] ERROR ai.rapids.cudf.TableDebug - 0 1
[main] ERROR ai.rapids.cudf.TableDebug - 1 2
[main] ERROR ai.rapids.cudf.TableDebug - 2 3
[main] ERROR ai.rapids.cudf.TableDebug - 3 4
[main] ERROR ai.rapids.cudf.TableDebug - 4 5
[main] ERROR ai.rapids.cudf.TableDebug - 5 6
```

Authors:
- Gera Shegalov (https://github.com/gerashegalov)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- Nghia Truong (https://github.com/ttnghia)

URL: #13783
This PR removes some extra stores and loads that don't appear to be necessary in our groupby apply lowering which are possibly slowing things down. This came up during #13767. Authors: - https://github.com/brandon-b-miller Approvers: - Bradley Dice (https://github.com/bdice) URL: #13792
This PR enables computing the Pearson correlation between two columns of a group within a UDF. Concretely, syntax such as the following will be allowed and produce the same result as pandas:

```python
ans = df.groupby('key').apply(lambda group_df: group_df['x'].corr(group_df['y']))
```

Authors:
- Ashwin Srinath (https://github.com/shwina)
- https://github.com/brandon-b-miller

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: #13767
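For reference, the quantity each group's UDF call computes can be written out in pure Python. This is a sketch of the per-group Pearson correlation only, not of the JIT lowering; `pearson` and `grouped_corr` are hypothetical helper names, and groups with zero variance would raise `ZeroDivisionError` here.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def grouped_corr(keys, xs, ys):
    """Group rows by key, then compute corr(x, y) within each group."""
    groups = defaultdict(lambda: ([], []))
    for key, x, y in zip(keys, xs, ys):
        groups[key][0].append(x)
        groups[key][1].append(y)
    return {key: pearson(gx, gy) for key, (gx, gy) in groups.items()}


result = grouped_corr(
    ["a", "a", "a", "b", "b", "b"],
    [1, 2, 3, 1, 2, 3],
    [2, 4, 6, 3, 2, 1],
)
print(result)
```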
Fixes: #13049 This PR allows errors from pyarrow to be propagated when an un-bounded sequence is passed to `pa.array` constructor. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) URL: #13799
… with mask (#14201)

Workaround for an illegal instruction error on sm90 for warp intrinsics with a non-`0xffffffff` mask.

Removed the mask and used `~0u` (`0xffffffff`) as MASK because:
- all threads in the warp have correct data on error, since the threads with `is_within_bounds == true` update the error.
- `init_state` is not required at the last iteration, the only one where MASK is not `~0u`.

Fixes #14183

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Divye Gala (https://github.com/divyegala)
- Elias Stehle (https://github.com/elstehle)
- Mark Harris (https://github.com/harrism)

URL: #14201
This adds two more aggregations for groupby and reduction:
* `HISTOGRAM`: Count the number of occurrences (aka frequency) for each element, and
* `MERGE_HISTOGRAM`: Merge different outputs generated by `HISTOGRAM` aggregations

This is the prerequisite for implementing the exact distributed percentile aggregation (#13885). However, these two new aggregations may be useful in other use-cases that need to do frequency counting.

Closes #13885.

Merging checklist:
* [X] Working prototypes.
* [X] Cleanup and docs.
* [X] Unit test.
* [ ] Test with spark-rapids integration tests.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #14045
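To illustrate why a histogram plus a merge step is enough for an exact distributed percentile, here is a small pure-Python sketch. It is not the libcudf API: the function names are hypothetical, and the nearest-rank percentile definition is an assumption chosen for simplicity.

```python
import math
from collections import Counter


def histogram(values):
    """HISTOGRAM-like step: frequency of each distinct element in one partition."""
    return Counter(values)


def merge_histograms(histograms):
    """MERGE_HISTOGRAM-like step: add up the counts from several partitions."""
    merged = Counter()
    for hist in histograms:
        merged.update(hist)
    return merged


def percentile_from_histogram(hist, q):
    """Exact nearest-rank percentile (0 < q <= 100) from value frequencies."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    rank = max(1, math.ceil(q / 100 * total))
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value


# Each partition is summarized independently, then the summaries are merged.
partials = [histogram([1, 2, 2, 3]), histogram([2, 3, 3, 10])]
merged = merge_histograms(partials)
print(percentile_from_histogram(merged, 50))
```

Because the merged counts are identical to the counts over the concatenated data, the percentile is exact rather than approximate, at the cost of one entry per distinct value.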
…12.2 (#14108) Compile issues found by compiling libcudf with the `rapidsai/devcontainers:23.10-cpp-gcc9-cuda12.2-ubuntu20.04` docker container. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Mark Harris (https://github.com/harrism) - David Wendt (https://github.com/davidwendt) - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) - Mike Wilson (https://github.com/hyperbolic2346) URL: #14108
This PR adds a method to ColumnView class to allow for conversion from Integers to hex closes #14081 Authors: - Raza Jafri (https://github.com/razajafri) Approvers: - Kuhu Shukla (https://github.com/kuhushukla) - Robert (Bobby) Evans (https://github.com/revans2) URL: #14205
This implements JNI for `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations in both groupby and reduction. Depends on: * #14045 Contributes to: * #13885. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Jason Lowe (https://github.com/jlowe) URL: #14154
Fixes: #14088 This PR preserves `names` of `column` object while constructing a `DataFrame` through various constructor flows. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) URL: #14110
Pass the error code to the host when a kernel detects invalid input. If multiple errors types are detected, they are combined using a bitwise OR so that caller gets the aggregate error code that includes all types of errors that occurred. Does not change the kernel side checks. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - https://github.com/nvdbaranec - Divye Gala (https://github.com/divyegala) - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) URL: #14167
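The bitwise-OR aggregation can be sketched on the host side with an `IntFlag`. The error names and the `aggregate_errors` helper below are hypothetical, chosen only to illustrate how one combined code can carry every error type that occurred; they are not the libcudf error codes.

```python
from enum import IntFlag
from functools import reduce
from operator import or_


class KernelError(IntFlag):
    """Hypothetical error categories; each one occupies its own bit."""
    NONE = 0
    INVALID_HEADER = 1
    OUT_OF_BOUNDS = 2
    UNSUPPORTED_ENCODING = 4


def aggregate_errors(per_thread_errors):
    """Combine individual error codes into one aggregate code with bitwise OR."""
    return reduce(or_, per_thread_errors, KernelError.NONE)


code = aggregate_errors([
    KernelError.NONE,
    KernelError.OUT_OF_BOUNDS,
    KernelError.INVALID_HEADER,
    KernelError.OUT_OF_BOUNDS,
])
# The caller can test for each error type independently.
print(KernelError.OUT_OF_BOUNDS in code, KernelError.UNSUPPORTED_ENCODING in code)
```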
Previously, the parquet chunked reader operated by controlling the size of output chunks only. It would still ingest the entire input file and decompress it, which can take up a considerable amount of memory. With this new 'progressive' support, we also 'chunk' at the input level. Specifically, the user can pass a `pass_read_limit` value which controls how much memory is used for storing compressed/decompressed data. The reader will make multiple passes over the file, reading as many row groups as it can to attempt to fit within this limit. Within each pass, chunks are emitted as before.

From the external user's perspective, the chunked read mechanism is the same. You call `has_next()` and `read_chunk()`. If the user has specified a value for `pass_read_limit`, the set of chunks produced might end up being different (although the concatenation of all of them will still be the same).

The core idea of the code change is to add the idea of the internal `pass`. Previously we had a `file_intermediate_data` which held data across `read_chunk()` calls. There is now a `pass_intermediate_data` struct which holds information specific to a given pass. Many of the invariant things from the file level before (row groups and chunks to process) are now stored in the pass intermediate data. As we begin each pass, we take the subset of global row groups and chunks that we are going to process for this pass and copy them to our intermediate data, and the remainder of the reader references this instead of the file-level data.

In order to avoid breaking pre-existing interfaces, there's a new constructor for the `chunked_parquet_reader` class:

```
chunked_parquet_reader(
  std::size_t chunk_read_limit,
  std::size_t pass_read_limit,
  parquet_reader_options const& options,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
```

Authors:
- https://github.com/nvdbaranec

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #14079
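The pass mechanism can be pictured as grouping consecutive row groups under a memory budget. The sketch below is a simplified model only: `plan_passes` is a hypothetical helper, the sizes stand in for the compressed/decompressed footprint of each row group, treating a limit of `0` as "no limit" is an assumption, and the real reader's accounting is more involved.

```python
def plan_passes(row_group_sizes, pass_read_limit):
    """Greedily assign consecutive row groups to passes under a byte limit."""
    if pass_read_limit <= 0:  # assumption: 0 means "no limit", i.e. a single pass
        return [list(range(len(row_group_sizes)))] if row_group_sizes else []
    passes, current, used = [], [], 0
    for index, size in enumerate(row_group_sizes):
        # Always take at least one row group per pass, even if it exceeds the limit.
        if current and used + size > pass_read_limit:
            passes.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        passes.append(current)
    return passes


passes = plan_passes([40, 40, 40, 100, 10], pass_read_limit=100)
print(passes)
# Concatenating the passes visits every row group exactly once, in order.
print([index for group in passes for index in group])
```

This mirrors the property stated above: the chunk boundaries may differ depending on the limit, but the concatenation of everything read stays the same.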
github-actions bot added labels on Sep 28, 2023:
- libcudf: Affects libcudf (C++/CUDA) code.
- Python: Affects Python cuDF API.
- CMake: CMake build issue
- conda
- Java: Affects Java cuDF API.
This PR pins `dask` and `distributed` to `2023.9.2` for `23.10` release. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Ray Douglass (https://github.com/raydouglass) - Peter Andreas Entschev (https://github.com/pentschev)
Fixes a bug where floating-point values were used in decimal128 rounding, giving wrong results. Closes #14210. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Divye Gala (https://github.com/divyegala) - Mark Harris (https://github.com/harrism)
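The failure mode is easy to reproduce outside cuDF: a 128-bit decimal can hold up to 38 digits, far more than a double can represent exactly. The sketch below contrasts integer-only rounding of an unscaled value with a floating-point intermediate. The helper `round_unscaled` is hypothetical, and half-away-from-zero rounding is an assumption made for the illustration.

```python
def round_unscaled(unscaled: int, digits_to_drop: int) -> int:
    """Drop trailing decimal digits using integer-only half-away-from-zero rounding."""
    if digits_to_drop <= 0:
        return unscaled
    divisor = 10 ** digits_to_drop
    quotient, remainder = divmod(abs(unscaled), divisor)
    if 2 * remainder >= divisor:
        quotient += 1
    return quotient if unscaled >= 0 else -quotient


big = 123456789012345678901234567895  # 30 digits, beyond double precision
exact = round_unscaled(big, 1)
via_float = int(round(big / 10))  # float intermediate loses low-order digits

print(exact)
print(via_float)
print(exact == via_float)
```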
…nt values. (#14242) This is a follow-up PR to #14233. This PR fixes a bug where floating-point values were used as intermediates in ceil/floor unary operations and cast operations that require rescaling for fixed-point types, giving inaccurate results. See also: - #14233 (comment) - #14243 Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Vukasin Milovanovic (https://github.com/vuule)
❄️ Code freeze for `branch-23.10` and v23.10 release

What does this mean?
Only critical/hotfix level issues should be merged into `branch-23.10` until release (merging of this PR).

What is the purpose of this PR?
`branch-23.10` into `main` for the release