Releases: rapidsai/cudf
v23.08.00
🚨 Breaking Changes
- Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
- Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
- Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
- Expose streams in all public copying APIs (#13629) @vyasr
- Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
- Remove deprecated cudf.set_allocator. (#13591) @bdice
- Change build.sh to use pip install instead of setup.py (#13507) @vyasr
- Remove unused max_rows_tensor parameter from subword tokenizer (#13463) @davidwendt
- Fix decimal scale reductions in `_get_decimal_type` (#13224) @charlesbluca
🐛 Bug Fixes
- Add CUDA version to cudf_kafka and libcudf-example build strings. (#13769) @bdice
- Fix typo in wheels-test.yaml. (#13763) @bdice
- Don't test strings shorter than the requested ngram size (#13758) @vyasr
- Add CUDA version to custreamz build string. (#13754) @bdice
- Fix writing of ORC files with empty child string columns (#13745) @vuule
- Remove the erroneous "empty level" short-circuit from ORC reader (#13722) @vuule
- Fix character counting when writing sliced tables into ORC (#13721) @vuule
- Parquet uses row group row count if missing from header (#13712) @hyperbolic2346
- Fix reading of RLE encoded boolean data from parquet files with V2 page headers (#13707) @etseidl
- Fix a corner case of list lexicographic comparator (#13701) @ttnghia
- Fix combined filtering and column projection in `dask_cudf.read_parquet` (#13697) @rjzamora
- Revert fetch-rapids changes (#13696) @vyasr
- Data generator - include offsets in the size estimate of list elements (#13688) @vuule
- Add `cuda-nvcc-impl` to `cudf` for `numba` CUDA 12 (#13673) @jakirkham
- Fix combined filtering and column projection in `read_parquet` (#13666) @rjzamora
- Use `thrust::identity` as hash functions for byte pair encoding (#13665) @PointKernel
- Fix loc-getitem ordering when index contains duplicate labels (#13659) @wence-
- [REVIEW] Introduce parity with pandas for `MultiIndex.loc` ordering & fix a bug in `Groupby` with `as_index` (#13657) @galipremsagar
- Fix memcheck error found in nvtext tokenize functions (#13649) @davidwendt
- Fix `has_nonempty_nulls` ignoring column offset (#13647) @ttnghia
- [Java] Avoid double-free corruption in case of an Exception while creating a ColumnView (#13645) @razajafri
- Fix memcheck error in ORC reader call to cudf::io::copy_uncompressed_kernel (#13643) @davidwendt
- Fix CUDA 12 conda environment to remove cubinlinker and ptxcompiler. (#13636) @bdice
- Fix inf/NaN comparisons for FLOAT orderby in window functions (#13635) @mythrocks
- Refactor `Index` search to simplify code and increase correctness (#13625) @wence-
- Fix compile warning for unused variable in split_re.cu (#13621) @davidwendt
- Fix tz_localize for dask_cudf Series (#13610) @shwina
- Fix issue with no decompressed data in ORC reader (#13609) @vuule
- Fix floating point window range extents. (#13606) @mythrocks
- Fix `localize(None)` for timezone-naive columns (#13603) @shwina
- Fixed a memory leak caused by Exception thrown while constructing a ColumnView (#13597) @razajafri
- Handle nullptr return value from bitmask_or in distinct_count (#13590) @wence-
- Bring parity with pandas in Index.join (#13589) @galipremsagar
- Fix cudf.melt when there are more than 255 columns (#13588) @hcho3
- Fix memory issues in cuIO due to removal of memory padding (#13586) @ttnghia
- Fix Parquet multi-file reading (#13584) @etseidl
- Fix memcheck error found in LISTS_TEST (#13579) @davidwendt
- Fix memcheck error found in STRINGS_TEST (#13578) @davidwendt
- Fix memcheck error found in INTEROP_TEST (#13577) @davidwendt
- Fix memcheck errors found in REDUCTION_TEST (#13574) @davidwendt
- Preemptive fix for hive-partitioning change in dask (#13564) @rjzamora
- Fix an issue with `dask_cudf.read_csv` when lines need to be skipped (#13555) @galipremsagar
- Fix out-of-bounds memory write in cudf::dictionary::detail::concatenate (#13554) @davidwendt
- Fix the null mask size in json reader (#13537) @karthikeyann
- Fix cudf::strings::strip for all-empty input column (#13533) @davidwendt
- Make sure to build without isolation or installing dependencies (#13524) @vyasr
- Remove preload lib from CMake for now (#13519) @vyasr
- Fix missing separator after null values in JSON writer (#13503) @karthikeyann
- Ensure `single_lane_block_sum_reduce` is safe to call in a loop (#13488) @wence-
- Update all versions in pyproject.toml files. (#13486) @bdice
- Remove applying nvbench that doesn't exist in 23.08 (#13484) @robertmaynard
- Fix chunked Parquet reader benchmark (#13482) @vuule
- Update JNI JSON reader column compatibility for Spark (#13477) @revans2
- Fix unsanitized output of scan with strings (#13455) @davidwendt
- Reject functions without bytecode from `_can_be_jitted` in GroupBy Apply (#13429) @brandon-b-miller
- Fix decimal scale reductions in `_get_decimal_type` (#13224) @charlesbluca
📖 Documentation
- Fix doxygen groups for io data sources and sinks (#13718) @davidwendt
- Add pandas compatibility note to DataFrame.query docstring (#13693) @beckernick
- Add pylibcudf to developer guide (#13639) @vyasr
- Fix repeated words in doxygen text (#13598) @karthikeyann
- Update docs for top-level API. (#13592) @bdice
- Fix the the doxygen text for cudf::concatenate and other places (#13561) @davidwendt
- Document stream validation approach used in testing (#13556) @vyasr
- Cleanup doc repetitions in libcudf (#13470) @karthikeyann
🚀 New Features
- Support `min` and `max` aggregations for list type in groupby and reduction (#13676) @ttnghia
- Add nvtext::jaccard_index API for strings columns (#13669) @davidwendt
- Add read_parquet_metadata libcudf API (#13663) @karthikeyann
- Expose streams in all public copying APIs (#13629) @vyasr
- Add XXHash_64 hash function to cudf (#13612) @davidwendt
- Java support: Floating point order-by columns for RANGE window functions (#13595) @mythrocks
- Use `cuco::static_map` to build string dictionaries in ORC writer (#13580) @vuule
- Add pylibcudf subpackage with gather implementation (#13562) @vyasr
- Add JNI for `lists::concatenate_list_elements` (#13547) @ttnghia
- Enable nested types for `lists::concatenate_list_elements` (#13545) @ttnghia
- Add unicode encoding for string columns in JSON writer (#13539) @karthikeyann
- Remove numba kernels from `find_index_of_val` (#13517) @brandon-b-miller
- Floating point order-by columns for RANGE window functions (#13512) @mythrocks
- Parse column chunk metadata statistics in parquet reader (#13472) @karthikeyann
- Add `abs` function to apply (#13408) @brandon-b-miller
- [FEA] AST filtering in parquet reader (#13348) @karthikeyann
- [FEA] Adds option to recover from invalid JSON lines in JSON tokenizer (#13344) @elstehle
- Ensure cccl packages don't clash with upstream version (#13235) @robertmaynard
- Update `struct_minmax_util` to experimental row comparator (#13069) @divyegala
- Add stream parameter to hashing APIs (#12090) @vyasr
🛠️ Improvements
- Pin `dask` and `distributed` for `23.08` release (#13802) @galipremsagar
- Relax protobuf pinnings. (#13770) @bdice
- Switch fully unbounded window functions to use aggregations (#13727) @mythrocks
- Switch to new wheel building pipeline (#13723) @vyasr
- Revert CUDA 12.0 CI workflows to branch-23.08. (#13719) @bdice
- Adding identify minimum version requirement (#13713) @hyperbolic2346
- Enforce deprecations and add clarifications around existing deprecations (#13710) @galipremsagar
- Optimize ORC reader performance for list data (#13708) @vyasr
- Fix limit overflow message in a docstring (#13703) @ahmet-uyar
- Alleviates JSON parser's need for multi-file sources to end with a newline (#13702) @elstehle
- Update cython-lint and replace flake8 with ruff (#13699) @vyasr
- Add `__dask_tokenize__` definitions to cudf classes (#13695) @rjzamora
- Convert libcudf hashing benchmarks to nvbench (#13694) @davidwendt
- Separate MurmurHash32 from hash_functions.cuh (#13681) @davidwendt
- Improve performance of cudf::strings::split on whitespace (#13680) @davidwendt
- Allow ORC and Parquet writers to write nullable columns without nulls as non-nullable (#13675) @vuule
- Raise a NotImplementedError in to_datetime when utc is passed (#13670) @shwina
- Add rmm_mode parameter to nvbench base fixture (#13668) @davidwendt
- Fix multiindex loc ordering in pandas-compat mode (#13660) @wence-
- Add nvtext hash_character_ngrams function (#13654) @davidwendt
- Avoid storing metadata in pointers in ORC and Parquet writers (#13648) @vuule
- Acquire spill lock in to/from_arrow (#13646) @shwina
- Expose stable versions of libcudf sort routines (#13634) @wence-
- Separate out hash_test.cpp source for each hash API (#13633) @davidwendt
- Remove deprecated cudf::strings::slice_strings (by delimiter) functions (#13628) @davidwendt
- Create separate libcudf hash APIs for each supported hash function (#13626) @davidwendt
- Add convert_dtypes API (#13623) @shwina
- Clean up cupy in dependencies.yaml. (#13617) @bdice
- Use cuda-version to constrain cudatoolkit. (#13615) @bdice
- Add murmurhash3_x64_128 function to libcudf (#13604) @davidwendt
- Performance improvement for cudf::strings::like (#13594) @davidwendt
- Remove deprecated cudf.set_allocator. (#13591) @bdice
- Clean up cudf device atomic with `cuda::atomic_ref` (#13583) @PointKernel
- Add java bindings for distinct count (#13573) @revans2
- Use nvcomp conda package. (#13566) @bdice
- Add exception to string_scalar if input string exceeds size_type (#13560) @davidwendt
- Add dispatch for `cudf.Dataframe` to/from `pyarrow.Table` conversion (#13558) @rjzamora
- Get rid of `cuco::pair_type` aliases (#13553) @PointKernel
- Introduce parity with pandas when `sort=False` in `Groupby` (#13551) @galipremsagar
- Update CMake in docker to 3.26.4 (#13550) @NvTimLi...
v23.06.01
🚨 Breaking Changes
- Fix batch processing for parquet writer (#13438) @ttnghia
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Change default value of the `observed=` argument in groupby to `True` to reflect the actual behaviour (#13296) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Cleanup ORC chunked writer (#13091) @ttnghia
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- [REVIEW] Upgrade to `arrow-11` (#12757) @galipremsagar
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🐛 Bug Fixes
- Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
- Fix writing of ORC files with empty rowgroups (#13466) @vuule
- Fix cudf::repeat logic when count is zero (#13459) @davidwendt
- Fix batch processing for parquet writer (#13438) @ttnghia
- Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
- Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
- Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Fix tokenize with non-space delimiter (#13403) @shwina
- Fix groupby head/tail for empty dataframe (#13398) @shwina
- Default to closed="right" in `IntervalIndex` constructor (#13394) @shwina
- Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
- Fix unused argument errors in nvcc 11.5 (#13387) @abellina
- Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
- Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
- Fix page size estimation in Parquet writer (#13364) @etseidl
- Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
- Support gcc 12 as the C++ compiler (#13316) @robertmaynard
- Correctly set bitmask size in `from_column_view` (#13315) @wence-
- Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
- Fix parquet schema interpretation issue (#13277) @hyperbolic2346
- Fix 64bit shift bug in avro reader (#13276) @karthikeyann
- Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
- Clean up buffers in case AssertionError (#13262) @razajafri
- Allow empty input table in ast `compute_column` (#13245) @wence-
- Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
- Fix the row index stream order in ORC reader (#13242) @vuule
- Make `is_decompression_disabled` and `is_compression_disabled` thread-safe (#13240) @vuule
- Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
- Fix race in ORC string dictionary creation (#13214) @revans2
- Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
- Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
- Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
- Fix `hostdevice_vector::subspan` (#13187) @ttnghia
- Use custom nvbench entry point to ensure `cudf::nvbench_base_fixture` usage (#13183) @robertmaynard
- Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
- Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
- Fix a few clang-format style check errors (#13146) @davidwendt
- [REVIEW] Fix `Series` and `DataFrame` constructors to validate index lengths (#13122) @galipremsagar
- Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
- Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
- Adds checks to make sure json reader won't overflow (#13115) @elstehle
- Fix `null_count` of columns returned by `chunked_parquet_reader` (#13111) @vuule
- Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
- [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
- Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Fix column selection `read_parquet` benchmarks (#13082) @vuule
- Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
- Add algorithm include in data_sink.hpp (#13068) @ahendriksen
- Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
- Prevent overflow with `skip_rows` in ORC and Parquet readers (#13063) @vuule
- Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
- [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
- Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
- Fix read_avro() skip_rows and num_rows. (#12912) @tpn
- Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
- Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina
🚀 New Features
- Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
- Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
- Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
- cuDF numba cuda 12 updates (#13337) @brandon-b-miller
- Add tz_convert method to convert between timestamps (#13328) @shwina
- Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
- Support the case=False argument to str.contains (#13290) @shwina
- Add an event handler for ColumnVector.close (#13279) @abellina
- JNI api for cudf::chunked_pack (#13278) @abellina
- Implement a chunked_pack API (#13260) @abellina
- Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
- JNI changes for range-extents in window functions. (#13199) @mythrocks
- Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
- Add IS_NULL operator to AST (#13145) @karthikeyann
- STRING order-by column for RANGE window functions (#13143) @mythrocks
- Update `contains_table` to experimental row hasher and equality comparator (#13119) @divyegala
- Automatically select `GroupBy.apply` algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
- Refactor Parquet chunked writer (#13076) @ttnghia
- Add Python bindings for string literal support in AST (#13073) @karthikeyann
- Add Java bindings for string literal support in AST (#13072) @karthikeyann
- Add string scalar support in AST (#13061) @karthikeyann
- Log cuIO warnings using the libcudf logger (#13043) @vuule
- Update `mixed_join` to use experimental row hasher and comparator (#13028) @divyegala
- Support structs of lists in row lexicographic comparator (#13005) @ttnghia
- Adding `hostdevice_span` that is a span createable from `hostdevice_vector` (#12981) @hyperbolic2346
- Add nvtext::minhash function (#12961) @davidwendt
- Support lists of structs in row lexicographic comparator (#12953) @ttnghia
- Update `join` to use experimental row hasher and comparator (#12787) @divyegala
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🛠️ Improvements
- Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
- Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
- Handle some corner-cases in indexing with boolean masks (#13402) @wence-
- Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
- [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
- Fix JNI method with mismatched parameter list (#13384) @ttnghia
- Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
- Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Move some nvtext benchmarks to nvbench (#13368) @davidwendt
- Run docs nightly too (#13366) @AyodeAwe
- Add warning for default `dtype` parameter in `get_dummies` (#13365) @galipremsagar
- Add log messages about kvikIO compatibility mode (#13363) @vuule
- Switch back to using primary shared-action-workflows branch (#13362) @vyasr
- Deprecate `StringIndex` and use `Index` instead (#13361) @galipremsagar
- Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
- Expunge most uses of `TypeVar(bound="Foo")` (#13346) @wence-
- Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
- Improve `distinct_count` with `cuco::static_set` (#13343) @PointKernel
- Fix `contiguous_split` performance (#13342) @ttnghia
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Update mypy to 1.3 (#13340) @wence-
- [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
- Add row-wise filtering step to `read_parquet` (#13334) @rjzamora
- Performance improvement for nvtext::minhash (#13333) @davidwendt
- Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
- Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
- Move `meta` calculation in `dask_cu...
v23.06.00
🚨 Breaking Changes
- Fix batch processing for parquet writer (#13438) @ttnghia
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Change default value of the `observed=` argument in groupby to `True` to reflect the actual behaviour (#13296) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Cleanup ORC chunked writer (#13091) @ttnghia
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- [REVIEW] Upgrade to `arrow-11` (#12757) @galipremsagar
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🐛 Bug Fixes
- Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
- Fix writing of ORC files with empty rowgroups (#13466) @vuule
- Fix cudf::repeat logic when count is zero (#13459) @davidwendt
- Fix batch processing for parquet writer (#13438) @ttnghia
- Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
- Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
- Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Fix tokenize with non-space delimiter (#13403) @shwina
- Fix groupby head/tail for empty dataframe (#13398) @shwina
- Default to closed="right" in `IntervalIndex` constructor (#13394) @shwina
- Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
- Fix unused argument errors in nvcc 11.5 (#13387) @abellina
- Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
- Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
- Fix page size estimation in Parquet writer (#13364) @etseidl
- Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
- Support gcc 12 as the C++ compiler (#13316) @robertmaynard
- Correctly set bitmask size in `from_column_view` (#13315) @wence-
- Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
- Fix parquet schema interpretation issue (#13277) @hyperbolic2346
- Fix 64bit shift bug in avro reader (#13276) @karthikeyann
- Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
- Clean up buffers in case AssertionError (#13262) @razajafri
- Allow empty input table in ast `compute_column` (#13245) @wence-
- Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
- Fix the row index stream order in ORC reader (#13242) @vuule
- Make `is_decompression_disabled` and `is_compression_disabled` thread-safe (#13240) @vuule
- Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
- Fix race in ORC string dictionary creation (#13214) @revans2
- Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
- Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
- Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
- Fix `hostdevice_vector::subspan` (#13187) @ttnghia
- Use custom nvbench entry point to ensure `cudf::nvbench_base_fixture` usage (#13183) @robertmaynard
- Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
- Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
- Fix a few clang-format style check errors (#13146) @davidwendt
- [REVIEW] Fix `Series` and `DataFrame` constructors to validate index lengths (#13122) @galipremsagar
- Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
- Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
- Adds checks to make sure json reader won't overflow (#13115) @elstehle
- Fix `null_count` of columns returned by `chunked_parquet_reader` (#13111) @vuule
- Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
- [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
- Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Fix column selection `read_parquet` benchmarks (#13082) @vuule
- Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
- Add algorithm include in data_sink.hpp (#13068) @ahendriksen
- Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
- Prevent overflow with `skip_rows` in ORC and Parquet readers (#13063) @vuule
- Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
- [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
- Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
- Fix read_avro() skip_rows and num_rows. (#12912) @tpn
- Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
- Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina
🚀 New Features
- Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
- Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
- Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
- cuDF numba cuda 12 updates (#13337) @brandon-b-miller
- Add tz_convert method to convert between timestamps (#13328) @shwina
- Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
- Support the case=False argument to str.contains (#13290) @shwina
- Add an event handler for ColumnVector.close (#13279) @abellina
- JNI api for cudf::chunked_pack (#13278) @abellina
- Implement a chunked_pack API (#13260) @abellina
- Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
- JNI changes for range-extents in window functions. (#13199) @mythrocks
- Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
- Add IS_NULL operator to AST (#13145) @karthikeyann
- STRING order-by column for RANGE window functions (#13143) @mythrocks
- Update `contains_table` to experimental row hasher and equality comparator (#13119) @divyegala
- Automatically select `GroupBy.apply` algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
- Refactor Parquet chunked writer (#13076) @ttnghia
- Add Python bindings for string literal support in AST (#13073) @karthikeyann
- Add Java bindings for string literal support in AST (#13072) @karthikeyann
- Add string scalar support in AST (#13061) @karthikeyann
- Log cuIO warnings using the libcudf logger (#13043) @vuule
- Update `mixed_join` to use experimental row hasher and comparator (#13028) @divyegala
- Support structs of lists in row lexicographic comparator (#13005) @ttnghia
- Adding `hostdevice_span` that is a span createable from `hostdevice_vector` (#12981) @hyperbolic2346
- Add nvtext::minhash function (#12961) @davidwendt
- Support lists of structs in row lexicographic comparator (#12953) @ttnghia
- Update `join` to use experimental row hasher and comparator (#12787) @divyegala
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🛠️ Improvements
- Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
- Handle some corner-cases in indexing with boolean masks (#13402) @wence-
- Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
- [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
- Fix JNI method with mismatched parameter list (#13384) @ttnghia
- Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
- Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Move some nvtext benchmarks to nvbench (#13368) @davidwendt
- Run docs nightly too (#13366) @AyodeAwe
- Add warning for default `dtype` parameter in `get_dummies` (#13365) @galipremsagar
- Add log messages about kvikIO compatibility mode (#13363) @vuule
- Switch back to using primary shared-action-workflows branch (#13362) @vyasr
- Deprecate `StringIndex` and use `Index` instead (#13361) @galipremsagar
- Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
- Expunge most uses of `TypeVar(bound="Foo")` (#13346) @wence-
- Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
- Improve `distinct_count` with `cuco::static_set` (#13343) @PointKernel
- Fix `contiguous_split` performance (#13342) @ttnghia
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Update mypy to 1.3 (#13340) @wence-
- [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
- Add row-wise filtering step to `read_parquet` (#13334) @rjzamora
- Performance improvement for nvtext::minhash (#13333) @davidwendt
- Fix some libcudf functions to set the null count on returning columns (#13331) @davidwendt
- Change cudf::detail::concatenate_masks to return null-count (#13330) @davidwendt
- Move `meta` calculation in `dask_cudf.read_parquet` (#13327) @rjzamora
- Changes to support Numpy >...
v23.04.01
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#13070) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Update minimum `pandas` and `numpy` pinnings (#12887) @galipremsagar
- Deprecate `names` & `dtype` in `Index.copy` (#12825) @galipremsagar
- Deprecate `Index.is_*` methods (#12820) @galipremsagar
- Deprecate `datetime_is_numeric` from `describe` (#12818) @galipremsagar
- Deprecate `na_sentinel` in `factorize` (#12817) @galipremsagar
- Make string methods return a Series with a useful Index (#12814) @shwina
- Produce useful guidance on overflow error in `to_csv` (#12705) @wence-
- Move `strings_udf` code into cuDF (#12669) @brandon-b-miller
- Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
- Replace message parsing with throwing more specific exceptions (#12426) @vyasr
🐛 Bug Fixes
- Pin curand version (#13127) @vyasr
- Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
- Fix `DataFrame` constructor to broadcast scalar inputs properly (#12997) @galipremsagar
- Drop `force_nullable_schema` from chunked parquet writer (#12996) @galipremsagar
- Fix gtest column utility comparator diff reporting (#12995) @davidwendt
- Handle index names while performing `groupby` (#12992) @galipremsagar
- Fix `__setitem__` on string columns when the scalar value ends in a null byte (#12991) @wence-
- Fix `sort_values` when column is all empty strings (#12988) @eriknw
- Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
- Pre-emptive fix for upstream `dask.dataframe.read_parquet` changes (#12983) @rjzamora
- Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
- Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
- cudftestutil supports static gtest dependencies (#12957) @robertmaynard
- Include gtest in build environment. (#12956) @vyasr
- Correctly handle scalar indices in `Index.__getitem__` (#12955) @wence-
- Avoid building cython twice (#12945) @galipremsagar
- Fix set index error for Series rolling window operations (#12942) @galipremsagar
- Fix calculation of null counts for Parquet statistics (#12938) @etseidl
- Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
- Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
- Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
- Fix conda recipe post-link.sh typo (#12916) @pentschev
- min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
- Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
- Use python -m pytest for nightly wheel tests (#12871) @bdice
- Parquet writer column_size() should return a size_t (#12870) @etseidl
- Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
- Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
- Remove tokenizers pre-install pinning. (#12854) @vyasr
- Fix parquet `RangeIndex` bug (#12838) @rjzamora
- Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
- Make string methods return a Series with a useful Index (#12814) @shwina
- Tell cudf_kafka to use header-only fmt (#12796) @vyasr
- Add `GroupBy.dtypes` (#12783) @galipremsagar
- Fix a leak in a test and clarify some test names (#12781) @revans2
- Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
- Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
- Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
- Fix a bug with `num_keys` in `_scatter_by_slice` (#12749) @thomcom
- Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
- Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
- Add `always_nullable` flag to Dremel encoding (#12727) @divyegala
- Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
- Fix faulty conditional logic in JIT `GroupBy.apply` (#12706) @brandon-b-miller
- Produce useful guidance on overflow error in `to_csv` (#12705) @wence-
- Handle parquet list data corner case (#12698) @nvdbaranec
- Fix missing trailing comma in json writer (#12688) @karthikeyann
- Remove child from newCudaAsyncMemoryResource (#12681) @abellina
- Handle bool types in `round` API (#12670) @galipremsagar
- Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
- Fix `from_arrow` to load a sliced arrow table (#12665) @galipremsagar
- Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
- Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
- Fix `find_common_dtype` and `values` to handle complex dtypes (#12537) @galipremsagar
- Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
- Fix `Series` comparison vs scalars (#12519) @brandon-b-miller
- Allow casting from `UDFString` back to `StringView` to call methods in `strings_udf` (#12363) @brandon-b-miller
📖 Documentation
- Fix `GroupBy.apply` doc examples rendering (#12994) @brandon-b-miller
- add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
- Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
- Add README symlink for dask-cudf. (#12946) @bdice
- Remove return type from @return doxygen tags (#12908) @davidwendt
- Fix docs build to be `pydata-sphinx-theme=0.13.0` compatible (#12874) @galipremsagar
- Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
- Enable doctests for GroupBy methods (#12658) @brandon-b-miller
- Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt
🚀 New Features
- Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
- Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
- Refactor orc chunked writer (#12949) @ttnghia
- Make Parquet writer `nullable` option application to single table writes (#12933) @vuule
- Refactor `io::orc::ProtobufWriter` (#12877) @ttnghia
- Make timezone table independent from ORC (#12805) @vuule
- Cache JIT `GroupBy.apply` functions (#12802) @brandon-b-miller
- Implement initial support for avro logical types (#6482) (#12788) @tpn
- Update `tests/column_utilities` to use `experimental::equality` row comparator (#12777) @divyegala
- Update `distinct/unique_count` to `experimental::row` hasher/comparator (#12776) @divyegala
- Update `hash_partition` to use `experimental::row::row_hasher` (#12761) @divyegala
- Update `is_sorted` to use `experimental::row::lexicographic` (#12752) @divyegala
- Update default data source in cuio reader benchmarks (#12740) @PointKernel
- Reenable stream identification library in CI (#12714) @vyasr
- Add `regex_program` strings splitting java APIs and tests (#12713) @cindyyuanjiang
- Add `regex_program` strings replacing java APIs and tests (#12701) @cindyyuanjiang
- Add `regex_program` strings extract java APIs and tests (#12699) @cindyyuanjiang
- Variable fragment sizes for Parquet writer (#12685) @etseidl
- Add segmented reduction support for fixed-point types (#12680) @davidwendt
- Move `strings_udf` code into cuDF (#12669) @brandon-b-miller
- Add `regex_program` searching APIs and related java classes (#12666) @cindyyuanjiang
- Add logging to libcudf (#12637) @vuule
- Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
- Convert `rank` to use to experimental row comparators (#12481) @divyegala
- Use rapids-cmake parallel testing feature (#12451) @robertmaynard
- Enable detection of undesired stream usage (#12089) @vyasr
🛠️ Improvements
- Pin `dask` and `distributed` for release (#13070) @galipremsagar
- Pin cupy in wheel tests to supported versions (#13041) @vyasr
- Pin numba version (#13001) @vyasr
- Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
- Stop setting package version attribute in wheels (#12977) @vyasr
- Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
- Remove default detail mrs: part7 (#12970) @vyasr
- Remove default detail mrs: part6 (#12969) @vyasr
- Remove default detail mrs: part5 (#12968) @vyasr
- Remove default detail mrs: part4 (#12967) @vyasr
- Remove default detail mrs: part3 (#12966) @vyasr
- Remove default detail mrs: part2 (#12965) @vyasr
- Remove default detail mrs: part1 (#12964) @vyasr
- Add `force_nullable_schema` parameter to Parquet writer. (#12952) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Remove remaining default stream parameters (#12943) @vyasr
- Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
- Implement `groupby.head` and `groupby.tail` (#12939) @wence-
- Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
- Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
- Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
- Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
- Pass `SCCACHE_S3_USE_SSL` to conda builds (#12910) @ajschmidt8
- Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
- Generate pyproject dependencies using dfg (#12906) @vyasr
- Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
- Fix `moto` env vars & pass `AWS_SESSION_TOKEN` to conda builds (#12902) @ajschmidt8
- Rewrite CSV wri...
v23.04.00
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#13070) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Update minimum `pandas` and `numpy` pinnings (#12887) @galipremsagar
- Deprecate `names` & `dtype` in `Index.copy` (#12825) @galipremsagar
- Deprecate `Index.is_*` methods (#12820) @galipremsagar
- Deprecate `datetime_is_numeric` from `describe` (#12818) @galipremsagar
- Deprecate `na_sentinel` in `factorize` (#12817) @galipremsagar
- Make string methods return a Series with a useful Index (#12814) @shwina
- Produce useful guidance on overflow error in `to_csv` (#12705) @wence-
- Move `strings_udf` code into cuDF (#12669) @brandon-b-miller
- Remove cudf::strings::repeat_strings_output_sizes and optional parameter from cudf::strings::repeat_strings (#12609) @davidwendt
- Replace message parsing with throwing more specific exceptions (#12426) @vyasr
🐛 Bug Fixes
- Fix memcheck script to execute only _TEST files found in bin/gtests/libcudf (#13006) @davidwendt
- Fix `DataFrame` constructor to broadcast scalar inputs properly (#12997) @galipremsagar
- Drop `force_nullable_schema` from chunked parquet writer (#12996) @galipremsagar
- Fix gtest column utility comparator diff reporting (#12995) @davidwendt
- Handle index names while performing `groupby` (#12992) @galipremsagar
- Fix `__setitem__` on string columns when the scalar value ends in a null byte (#12991) @wence-
- Fix `sort_values` when column is all empty strings (#12988) @eriknw
- Remove unused variable and fix memory issue in ORC writer (#12984) @ttnghia
- Pre-emptive fix for upstream `dask.dataframe.read_parquet` changes (#12983) @rjzamora
- Remove MANIFEST.in use auto-generated one for sdists and package_data for wheels (#12960) @vyasr
- Update to use rapids-export(COMPONENTS) feature. (#12959) @robertmaynard
- cudftestutil supports static gtest dependencies (#12957) @robertmaynard
- Include gtest in build environment. (#12956) @vyasr
- Correctly handle scalar indices in `Index.__getitem__` (#12955) @wence-
- Avoid building cython twice (#12945) @galipremsagar
- Fix set index error for Series rolling window operations (#12942) @galipremsagar
- Fix calculation of null counts for Parquet statistics (#12938) @etseidl
- Preserve integer dtype of hive-partitioned column containing nulls (#12930) @rjzamora
- Use get_current_device_resource for intermediate allocations in COLLECT_LIST window code (#12927) @karthikeyann
- Mark dlpack tensor deleter as noexcept to match PyCapsule_Destructor signature. (#12921) @bdice
- Fix conda recipe post-link.sh typo (#12916) @pentschev
- min_rows and num_rows are swapped in ComputePageSizes declaration in Parquet reader (#12886) @etseidl
- Expect cupy to now support bool arrays for dlpack. (#12883) @vyasr
- Use python -m pytest for nightly wheel tests (#12871) @bdice
- Parquet writer column_size() should return a size_t (#12870) @etseidl
- Fix cudf::hash_partition kernel launch error with decimal128 types (#12863) @davidwendt
- Fix an issue with parquet chunked reader undercounting string lengths. (#12859) @nvdbaranec
- Remove tokenizers pre-install pinning. (#12854) @vyasr
- Fix parquet `RangeIndex` bug (#12838) @rjzamora
- Remove KAFKA_HOST_TEST from compute-sanitizer check (#12831) @davidwendt
- Make string methods return a Series with a useful Index (#12814) @shwina
- Tell cudf_kafka to use header-only fmt (#12796) @vyasr
- Add `GroupBy.dtypes` (#12783) @galipremsagar
- Fix a leak in a test and clarify some test names (#12781) @revans2
- Fix bug in all-null list due to join_list_elements special handling (#12767) @karthikeyann
- Add try/except for expected null-schema error in read_parquet (#12756) @rjzamora
- Throw an exception if an unsupported page encoding is detected in Parquet reader (#12754) @etseidl
- Fix a bug with `num_keys` in `_scatter_by_slice` (#12749) @thomcom
- Bump pinned rapids wheel deps to 23.4 (#12735) @sevagh
- Rework logic in cudf::strings::split_record to improve performance (#12729) @davidwendt
- Add `always_nullable` flag to Dremel encoding (#12727) @divyegala
- Fix memcheck read error in compound segmented reduce (#12722) @davidwendt
- Fix faulty conditional logic in JIT `GroupBy.apply` (#12706) @brandon-b-miller
- Produce useful guidance on overflow error in `to_csv` (#12705) @wence-
- Handle parquet list data corner case (#12698) @nvdbaranec
- Fix missing trailing comma in json writer (#12688) @karthikeyann
- Remove child from newCudaAsyncMemoryResource (#12681) @abellina
- Handle bool types in `round` API (#12670) @galipremsagar
- Ensure all of device bitmask is initialized in from_arrow (#12668) @wence-
- Fix `from_arrow` to load a sliced arrow table (#12665) @galipremsagar
- Fix dask-cudf read_parquet bug for multi-file aggregation (#12663) @rjzamora
- Fix AllocateLikeTest gtests reading uninitialized null-mask (#12643) @davidwendt
- Fix `find_common_dtype` and `values` to handle complex dtypes (#12537) @galipremsagar
- Fix fetching of MultiIndex values when a label is passed (#12521) @galipremsagar
- Fix `Series` comparison vs scalars (#12519) @brandon-b-miller
- Allow casting from `UDFString` back to `StringView` to call methods in `strings_udf` (#12363) @brandon-b-miller
📖 Documentation
- Fix `GroupBy.apply` doc examples rendering (#12994) @brandon-b-miller
- add sphinx building and s3 uploading for dask-cudf docs (#12982) @quasiben
- Add developer documentation forbidding default parameters in detail APIs (#12978) @vyasr
- Add README symlink for dask-cudf. (#12946) @bdice
- Remove return type from @return doxygen tags (#12908) @davidwendt
- Fix docs build to be `pydata-sphinx-theme=0.13.0` compatible (#12874) @galipremsagar
- Add skeleton API and prose documentation for dask-cudf (#12725) @wence-
- Enable doctests for GroupBy methods (#12658) @brandon-b-miller
- Add comment about CUB patch for SegmentedSortInt.Bool gtest (#12611) @davidwendt
🚀 New Features
- Add JNI method for strings::replace multi variety (#12979) @NVnavkumar
- Add nunique aggregation support for cudf::segmented_reduce (#12972) @davidwendt
- Refactor orc chunked writer (#12949) @ttnghia
- Make Parquet writer `nullable` option application to single table writes (#12933) @vuule
- Refactor `io::orc::ProtobufWriter` (#12877) @ttnghia
- Make timezone table independent from ORC (#12805) @vuule
- Cache JIT `GroupBy.apply` functions (#12802) @brandon-b-miller
- Implement initial support for avro logical types (#6482) (#12788) @tpn
- Update `tests/column_utilities` to use `experimental::equality` row comparator (#12777) @divyegala
- Update `distinct/unique_count` to `experimental::row` hasher/comparator (#12776) @divyegala
- Update `hash_partition` to use `experimental::row::row_hasher` (#12761) @divyegala
- Update `is_sorted` to use `experimental::row::lexicographic` (#12752) @divyegala
- Update default data source in cuio reader benchmarks (#12740) @PointKernel
- Reenable stream identification library in CI (#12714) @vyasr
- Add `regex_program` strings splitting java APIs and tests (#12713) @cindyyuanjiang
- Add `regex_program` strings replacing java APIs and tests (#12701) @cindyyuanjiang
- Add `regex_program` strings extract java APIs and tests (#12699) @cindyyuanjiang
- Variable fragment sizes for Parquet writer (#12685) @etseidl
- Add segmented reduction support for fixed-point types (#12680) @davidwendt
- Move `strings_udf` code into cuDF (#12669) @brandon-b-miller
- Add `regex_program` searching APIs and related java classes (#12666) @cindyyuanjiang
- Add logging to libcudf (#12637) @vuule
- Add compound aggregations to cudf::segmented_reduce (#12573) @davidwendt
- Convert `rank` to use to experimental row comparators (#12481) @divyegala
- Use rapids-cmake parallel testing feature (#12451) @robertmaynard
- Enable detection of undesired stream usage (#12089) @vyasr
🛠️ Improvements
- Pin `dask` and `distributed` for release (#13070) @galipremsagar
- Pin cupy in wheel tests to supported versions (#13041) @vyasr
- Pin numba version (#13001) @vyasr
- Rework gtests SequenceTest to remove using namespace cudf (#12985) @davidwendt
- Stop setting package version attribute in wheels (#12977) @vyasr
- Move detail reduction functions to cudf::reduction::detail namespace (#12971) @davidwendt
- Remove default detail mrs: part7 (#12970) @vyasr
- Remove default detail mrs: part6 (#12969) @vyasr
- Remove default detail mrs: part5 (#12968) @vyasr
- Remove default detail mrs: part4 (#12967) @vyasr
- Remove default detail mrs: part3 (#12966) @vyasr
- Remove default detail mrs: part2 (#12965) @vyasr
- Remove default detail mrs: part1 (#12964) @vyasr
- Add `force_nullable_schema` parameter to Parquet writer. (#12952) @galipremsagar
- Declare a different name for nan_equality.UNEQUAL to prevent Cython warnings. (#12947) @bdice
- Remove remaining default stream parameters (#12943) @vyasr
- Fix cudf::segmented_reduce gtest for ANY aggregation (#12940) @davidwendt
- Implement `groupby.head` and `groupby.tail` (#12939) @wence-
- Fix libcudf gtests to pass null-count=0 for empty validity masks (#12923) @davidwendt
- Migrate parquet encoding to use experimental row operators (#12918) @PointKernel
- Fix benchmarks coded in namespace cudf and using namespace cudf (#12915) @karthikeyann
- Fix io/text gtests coded in namespace cudf::test (#12914) @karthikeyann
- Pass `SCCACHE_S3_USE_SSL` to conda builds (#12910) @ajschmidt8
- Fix FST, JSON gtests & benchmarks coded in namespace cudf::test (#12907) @karthikeyann
- Generate pyproject dependencies using dfg (#12906) @vyasr
- Update libcudf counting functions to specify cudf::size_type (#12904) @davidwendt
- Fix `moto` env vars & pass `AWS_SESSION_TOKEN` to conda builds (#12902) @ajschmidt8
- Rewrite CSV writer benchmark with nvbench (#12901) @PointKernel
- Rework some code logic to reduce iterator and comparator inlining to improve compile time (#12900) @davidwendt
- Deprecate `line_te...
[NIGHTLY] v23.06.00
🔗 Links
🚨 Breaking Changes
- Fix batch processing for parquet writer (#13438) @ttnghia
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Use std::overflow_error when output would exceed column size limit (#13323) @davidwendt
- Remove null mask and null count from column_view constructors (#13311) @vyasr
- Change default value of the `observed=` argument in groupby to `True` to reflect the actual behaviour (#13296) @shwina
- Throw error if UNINITIALIZED is passed to cudf::state_null_count (#13292) @davidwendt
- Remove default null-count parameter from cudf::make_strings_column factory (#13227) @davidwendt
- Remove UNKNOWN_NULL_COUNT where it can be easily computed (#13205) @vyasr
- Update minimum Python version to Python 3.9 (#13196) @shwina
- Refactor contiguous_split API into contiguous_split.hpp (#13186) @abellina
- Cleanup Parquet chunked writer (#13094) @ttnghia
- Cleanup ORC chunked writer (#13091) @ttnghia
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Remove deprecated regex functions from libcudf (#13067) @davidwendt
- [REVIEW] Upgrade to `arrow-11` (#12757) @galipremsagar
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🐛 Bug Fixes
- Fix valid count computation in offset_bitmask_binop kernel (#13489) @davidwendt
- Fix writing of ORC files with empty rowgroups (#13466) @vuule
- Fix cudf::repeat logic when count is zero (#13459) @davidwendt
- Fix batch processing for parquet writer (#13438) @ttnghia
- Fix invalid use of std::exclusive_scan in Parquet writer (#13434) @etseidl
- Patch numba if it is imported first to ensure minor version compatibility works. (#13433) @bdice
- Fix cudf::strings::replace_with_backrefs hang on empty match result (#13418) @davidwendt
- Use <NA> instead of null to match pandas. (#13415) @bdice
- Fix tokenize with non-space delimiter (#13403) @shwina
- Fix groupby head/tail for empty dataframe (#13398) @shwina
- Default to closed="right" in `IntervalIndex` constructor (#13394) @shwina
- Correctly reorder and reindex scan groupbys with null keys (#13389) @wence-
- Fix unused argument errors in nvcc 11.5 (#13387) @abellina
- Updates needed to work with jitify that leverages libcudacxx (#13383) @robertmaynard
- Fix unused parameter warning/error in parquet/page_data.cu (#13367) @davidwendt
- Fix page size estimation in Parquet writer (#13364) @etseidl
- Fix subword_tokenize error when input contains no tokens (#13320) @davidwendt
- Support gcc 12 as the C++ compiler (#13316) @robertmaynard
- Correctly set bitmask size in `from_column_view` (#13315) @wence-
- Fix approach to detecting assignment for gte/lte operators (#13285) @vyasr
- Fix parquet schema interpretation issue (#13277) @hyperbolic2346
- Fix 64bit shift bug in avro reader (#13276) @karthikeyann
- Fix unused variables/parameters in parquet/writer_impl.cu (#13263) @davidwendt
- Clean up buffers in case AssertionError (#13262) @razajafri
- Allow empty input table in ast `compute_column` (#13245) @wence-
- Fix structs_column_wrapper constructors to copy input column wrappers (#13243) @davidwendt
- Fix the row index stream order in ORC reader (#13242) @vuule
- Make `is_decompression_disabled` and `is_compression_disabled` thread-safe (#13240) @vuule
- Add [[maybe_unused]] to nvbench environment. (#13219) @bdice
- Fix race in ORC string dictionary creation (#13214) @revans2
- Add scalar argtypes to udf cache keys (#13194) @brandon-b-miller
- Fix unused parameter warning/error in grouped_rolling.cu (#13192) @davidwendt
- Avoid skbuild 0.17.2 which affected the cmake -DPython_LIBRARY string (#13188) @sevagh
- Fix `hostdevice_vector::subspan` (#13187) @ttnghia
- Use custom nvbench entry point to ensure `cudf::nvbench_base_fixture` usage (#13183) @robertmaynard
- Fix slice_strings to return empty strings for stop < start indices (#13178) @davidwendt
- Allow compilation with any GTest version 1.11+ (#13153) @robertmaynard
- Fix a few clang-format style check errors (#13146) @davidwendt
- [REVIEW] Fix `Series` and `DataFrame` constructors to validate index lengths (#13122) @galipremsagar
- Fix hash join when the input tables have nulls on only one side (#13120) @ttnghia
- Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURES in Python package build. (#13117) @davidwendt
- Adds checks to make sure json reader won't overflow (#13115) @elstehle
- Fix `null_count` of columns returned by `chunked_parquet_reader` (#13111) @vuule
- Fixes sliced list and struct column bug in JSON chunked writer (#13108) @karthikeyann
- [REVIEW] Fix missing confluent kafka version (#13101) @galipremsagar
- Use make_empty_lists_column instead of make_empty_column(type_id::LIST) (#13099) @davidwendt
- Raise `NotImplementedError` when attempting to construct cuDF objects from timezone-aware datetimes (#13086) @shwina
- Fix column selection `read_parquet` benchmarks (#13082) @vuule
- Fix bugs in iterative groupby apply algorithm (#13078) @brandon-b-miller
- Add algorithm include in data_sink.hpp (#13068) @ahendriksen
- Fix tests/identify_stream_usage.cpp (#13066) @ahendriksen
- Prevent overflow with `skip_rows` in ORC and Parquet readers (#13063) @vuule
- Add except declaration in Cython interface for regex_program::create (#13054) @davidwendt
- [REVIEW] Fix branch version in CI scripts (#13029) @galipremsagar
- Fix OOB memory access in CSV reader when reading without NA values (#13011) @vuule
- Fix read_avro() skip_rows and num_rows. (#12912) @tpn
- Purge nonempty nulls from byte_cast list outputs. (#11971) @bdice
- Fix consumption of CPU-backed interchange protocol dataframes (#11392) @shwina
🚀 New Features
- Remove numba JIT kernel usage from dataframe copy tests (#13385) @brandon-b-miller
- Add JNI for ORC/Parquet writer compression statistics (#13376) @ttnghia
- Use _compile_or_get in JIT groupby apply (#13350) @brandon-b-miller
- cuDF numba cuda 12 updates (#13337) @brandon-b-miller
- Add tz_convert method to convert between timestamps (#13328) @shwina
- Optionally return compression statistics from ORC and Parquet writers (#13294) @vuule
- Support the case=False argument to str.contains (#13290) @shwina
- Add an event handler for ColumnVector.close (#13279) @abellina
- JNI api for cudf::chunked_pack (#13278) @abellina
- Implement a chunked_pack API (#13260) @abellina
- Update cudf recipes to use GTest version to >=1.13 (#13207) @robertmaynard
- JNI changes for range-extents in window functions. (#13199) @mythrocks
- Add support for DatetimeTZDtype and tz_localize (#13163) @shwina
- Add IS_NULL operator to AST (#13145) @karthikeyann
- STRING order-by column for RANGE window functions (#13143) @mythrocks
- Update `contains_table` to experimental row hasher and equality comparator (#13119) @divyegala
- Automatically select `GroupBy.apply` algorithm based on if the UDF is jittable (#13113) @brandon-b-miller
- Refactor Parquet chunked writer (#13076) @ttnghia
- Add Python bindings for string literal support in AST (#13073) @karthikeyann
- Add Java bindings for string literal support in AST (#13072) @karthikeyann
- Add string scalar support in AST (#13061) @karthikeyann
- Log cuIO warnings using the libcudf logger (#13043) @vuule
- Update `mixed_join` to use experimental row hasher and comparator (#13028) @divyegala
- Support structs of lists in row lexicographic comparator (#13005) @ttnghia
- Adding `hostdevice_span` that is a span createable from `hostdevice_vector` (#12981) @hyperbolic2346
- Add nvtext::minhash function (#12961) @davidwendt
- Support lists of structs in row lexicographic comparator (#12953) @ttnghia
- Update `join` to use experimental row hasher and comparator (#12787) @divyegala
- Implement Python drop_duplicates with cudf::stable_distinct. (#11656) @brandon-b-miller
🛠️ Improvements
- Bump typing_extensions minimum version to 4.0.0 (#13618) @shwina
- Drop extraneous dependencies from cudf conda recipe. (#13406) @bdice
- Handle some corner-cases in indexing with boolean masks (#13402) @wence-
- Add cudf::stable_distinct public API, tests, and benchmarks. (#13392) @bdice
- [JNI] Pass this ColumnVector to the onClosed event handler (#13386) @abellina
- Fix JNI method with mismatched parameter list (#13384) @ttnghia
- Split up experimental_row_operator_tests.cu to improve its compile time (#13382) @davidwendt
- Deprecate cudf::strings::slice_strings APIs that accept delimiters (#13373) @davidwendt
- Remove UNKNOWN_NULL_COUNT (#13372) @vyasr
- Move some nvtext benchmarks to nvbench (#13368) @davidwendt
- run docs nightly too (#13366) @AyodeAwe
- Add warning for default `dtype` parameter in `get_dummies` (#13365) @galipremsagar
- Add log messages about kvikIO compatibility mode (#13363) @vuule
- Switch back to using primary shared-action-workflows branch (#13362) @vyasr
- Deprecate `StringIndex` and use `Index` instead (#13361) @galipremsagar
- Ensure columns have valid null counts in CUDF JNI. (#13355) @mythrocks
- Expunge most uses of `TypeVar(bound="Foo")` (#13346) @wence-
- Remove all references to UNKNOWN_NULL_COUNT in Python (#13345) @vyasr
- Improve `distinct_count` with `cuco::static_set` (#13343) @PointKernel
- Fix `contiguous_split` performance (#13342) @ttnghia
- Remove default UNKNOWN_NULL_COUNT from cudf::column member functions (#13341) @davidwendt
- Update mypy to 1.3 (#13340) @wence-
- [Java] Purge non-empty nulls when setting validity (#13335) @razajafri
- Add row-wise filtering step to `read_parquet` (#13334) @rjzamora
- Performance improvement for nvtext::minhash (#13333) @davidwendt
- Fix some libcudf functions to ...
v23.02.00
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Remove column names (#12578) @vuule
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Upgrade to `arrow-10.0.1` (#12327) @galipremsagar
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
- Remove deprecated code for 23.02 (#12281) @vyasr
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Rename `cudf::structs::detail::superimpose_parent_nulls` APIs (#12230) @ttnghia
- Remove JIT type names, refactor id_to_type. (#12158) @bdice
- Floor division uses integer division for integral arguments (#12131) @wence-
🐛 Bug Fixes
- Fix a mask data corruption in UDF (#12647) @galipremsagar
- pre-commit: Update isort version to 5.12.0 (#12645) @wence-
- tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
- Revert regex program java APIs and tests (#12639) @cindyyuanjiang
- Fix leaks in ColumnVectorTest (#12625) @jlowe
- Handle when spillable buffers own each other (#12607) @madsbk
- Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
- lists: Transfer dtypes correctly through list.get (#12586) @wence-
- timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
- Fixing BUG, `get_next_chunk()` should use the blocking function `device_read()` (#12584) @madsbk
- Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
- `partition_by_hash()`: support index (#12554) @madsbk
- Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
- Update List Lexicographical Comparator (#12538) @divyegala
- Dynamically read PTX version (#12534) @brandon-b-miller
- build.sh switch to use `RAPIDS` magic value (#12525) @robertmaynard
- Loosen runtime arrow pinning (#12522) @vyasr
- Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
- Fix issues with parquet chunked reader (#12488) @nvdbaranec
- Fix missing metadata transfer in concat for `ListColumn` (#12487) @galipremsagar
- Rename libcudf substring source files to slice (#12484) @davidwendt
- Fix compile issue with arrow 10 (#12465) @ttnghia
- Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
- Fix xfail incompatibilities (#12423) @vyasr
- Fix bug in Parquet column index encoding (#12404) @etseidl
- When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
- Fix get_json_object to return empty column on empty input (#12384) @davidwendt
- Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
- Fix reductions any/all return value for empty input (#12374) @davidwendt
- Fix debug compile errors in parquet.hpp (#12372) @davidwendt
- Purge non-empty nulls in `cudf::make_lists_column` (#12370) @ttnghia
- Use correct memory resource in io::make_column (#12364) @vyasr
- Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- Fix NumericPairIteratorTest for float values (#12306) @davidwendt
- Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
- Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
- Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
- Fix compile issue in `json_chunked_reader.cpp` (#12280) @ttnghia
- Change reductions any/all to return valid values for empty input (#12279) @davidwendt
- Only exclude join keys that are indices from key columns (#12271) @wence-
- Fix spill to device limit (#12252) @madsbk
- Correct behaviour of sort in `concat` for singleton concatenations (#12247) @wence-
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
- Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
- Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
- Fix page size calculation in Parquet writer (#12182) @etseidl
- Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
- Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
- Floor division uses integer division for integral arguments (#12131) @wence-
📖 Documentation
- Fix link to NVTX (#12598) @sameerz
- Include missing groupby functions in documentation (#12580) @quasiben
- Fix documentation author (#12527) @bdice
- Update libcudf reduction docs for casting output types (#12526) @davidwendt
- Add JSON reader page in user guide (#12499) @GregoryKimball
- Link unsupported iteration API docstrings (#12482) @galipremsagar
- `strings_udf` doc update (#12469) @brandon-b-miller
- Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
- Update pre-commit hooks guide (#12395) @bdice
- Update test docs to not use detail comparison utilities (#12332) @PointKernel
- Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
- Add eval to docs. (#12322) @vyasr
- Turn on xfail_strict=true (#12244) @wence-
- Update 10 minutes to cuDF (#12114) @wence-
🚀 New Features
- Use kvikIO as the default IO backend (#12574) @vuule
- Use `has_nonempty_nulls` instead of `may_contain_non_empty_nulls` in `superimpose_nulls` and `push_down_nulls` (#12560) @ttnghia
- Add strings methods removeprefix and removesuffix (#12557) @davidwendt
- Add `regex_program` java APIs and unit tests (#12548) @cindyyuanjiang
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Make string quoting optional on CSV write (#12539) @mythrocks
- Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
- Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
- `one_hot_encode` to use experimental row comparators (#12478) @divyegala
- Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
- Add JSON Writer (#12474) @karthikeyann
- Refactor `thrust_copy_if` into `cudf::detail::copy_if_safe` (#12455) @ttnghia
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Extract `tokenize_json.hpp` detail header from `src/io/json/nested_json.hpp` (#12432) @ttnghia
- JNI bindings to write CSV (#12425) @mythrocks
- Nested JSON depth benchmark (#12371) @karthikeyann
- Implement `lists::reverse` (#12336) @ttnghia
- Use `device_read` in experimental `read_json` (#12314) @vuule
- Implement JNI for `strings::reverse` (#12283) @ttnghia
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
- Add environment variable to control host memory allocation in `hostdevice_vector` (#12251) @vuule
- Add cudf::strings::reverse function (#12227) @davidwendt
- Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
- Support `replace` in `strings_udf` (#12207) @brandon-b-miller
- Add support to read binary encoded decimals in parquet (#12205) @PointKernel
- Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
- Updating `stream_compaction/unique` to use new row comparators (#12159) @divyegala
- Add device buffer datasource (#12024) @PointKernel
- Implement groupby apply with JIT (#11452) @bwyogatama
🛠️ Improvements
- Update shared workflow branches (#12696) @ajschmidt8
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Don't upload `libcudf-example` to Anaconda.org (#12671) @ajschmidt8
- Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
- Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Version a parquet writer xfail (#12579) @galipremsagar
- Remove column names (#12578) @vuule
- Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
- Add support for `category` dtypes in CSV reader (#12571) @galipremsagar
- Remove `spill_lock` parameter from `SpillableBuffer.get_ptr()` (#12564) @madsbk
- Optimize `cudf::make_lists_column` (#12547) @ttnghia
- Remove `cudf::strings::repeat_strings_output_sizes` from Java and JNI (#12546) @ttnghia
- Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
- Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
- Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
- Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
- Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
- More `@acquire_spill_lock()` and `as_buffer(..., exposed=False)` (#12535) @madsbk
- Guard CUDA runtime APIs with error checking (#12531) @PointKernel
- Update TODOs from issue 10432. (#12528) @bdice
- Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Fix SUM/MEAN aggregation type support. (#12503) @bdice
- Stop using pandas._testing (#12492) @vyasr
- Fix ROLLING_TEST gtests coded in namespace cudf::test (#12490) @davidwendt
- Fix erroneously skipped ORC ZSTD test (#12486) @vuule
- Rework nvtext::generate_character_ngrams to use make_strings_children (#12480) @davidwendt
- Raise warnings as errors in the test suite (#12468) @v...
v22.12.01
🚨 Breaking Changes
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Rollback of `DeviceBufferLike` (#12009) @madsbk
- Remove unused `managed_allocator` (#12005) @vyasr
- Pass column names to `write_csv` instead of `table_metadata` pointer (#11972) @vuule
- Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
- Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
- Remove validation that requires introspection (#11938) @vyasr
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
- Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
- part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
🐛 Bug Fixes
- strings_udf: use libcudf caching of character tables (#12343) @wence-
- Fix include line for IO Cython modules (#12250) @vyasr
- Make dask pinning looser (#12231) @vyasr
- Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
- Fix `from_dict` backend dispatch to match upstream `dask` (#12203) @galipremsagar
- Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
- Fix compression in ORC writer (#12194) @vuule
- Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
- Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
- Fix decimal binary operations (#12142) @galipremsagar
- Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
- Safely allocate `udf_string` pointers in `strings_udf` (#12138) @brandon-b-miller
- Fix/disable jitify lto (#12122) @robertmaynard
- Fix conditional_full_join benchmark (#12121) @GregoryKimball
- Fix regex working-memory-size refactor error (#12119) @davidwendt
- Add in negative size checks for columns (#12118) @revans2
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Fix reading of CSV files with blank second row (#12098) @vuule
- Fix an error in IO with `GzipFile` type (#12085) @galipremsagar
- Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
- Fix alignment of compressed blocks in ORC writer (#12077) @vuule
- Fix singleton-range `__setitem__` edge case (#12075) @wence-
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Force using old fmt in nvbench. (#12067) @vyasr
- Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
- Allow falling back to `shim_60.ptx` by default in `strings_udf` (#12056) @brandon-b-miller
- Force black exclusions for pre-commit. (#12036) @bdice
- Add `memory_usage` & `items` implementation for `Struct` column & dtype (#12033) @galipremsagar
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
- Fix issues when both `usecols` and `names` options are used in `read_csv` (#12018) @vuule
- Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
- Revert "Replace most of preprocessor usage in nvcomp adapter with `constexpr`" (#11999) @vuule
- Fix bug where `df.loc` resulting in single row could give wrong index (#11998) @eriknw
- Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
- Fix maximum page size estimate in Parquet writer (#11962) @vuule
- Fix local offset handling in bgzip reader (#11918) @upsj
- Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
- Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
- Fix type casting in Series.setitem (#11904) @wence-
- Fix memcheck error in get_dremel_data (#11903) @davidwendt
- Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
- Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
- Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
- Fix writing of Parquet files with many fragments (#11869) @etseidl
- Fix RangeIndex unary operators. (#11868) @vyasr
- JNI Avoid NPE for reading host binary data (#11865) @revans2
- Fix decimal benchmark input data generation (#11863) @karthikeyann
- Fix pre-commit copyright check (#11860) @galipremsagar
- Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
- Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
- Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
- Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
- add V2 page header support to parquet reader (#11778) @etseidl
- Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
- Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice
📖 Documentation
- Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
- Add symlinks to notebooks. (#12128) @bdice
- Add `truncate` API to python doc pages (#12109) @galipremsagar
- Update Numba docs links. (#12107) @bdice
- Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
- Fix link to c++ developer guide from `CONTRIBUTING.md` (#12084) @brandon-b-miller
- Add pivot_table and crosstab to docs. (#12014) @bdice
- Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
- Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
- Add dtype docs pages and docstrings for `cudf` specific dtypes (#11974) @galipremsagar
- Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
- Rename libcudf++ to libcudf. (#11953) @bdice
- Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
- Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
- Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
- Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
- Add developer docs for writing tests (#11199) @vyasr
🚀 New Features
- Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
- Support `+` in `strings_udf` (#12117) @brandon-b-miller
- Support `upper` and `lower` in `strings_udf` (#12099) @brandon-b-miller
- Add wheel builds (#12096) @vyasr
- Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
- Support `strip`, `lstrip`, and `rstrip` in `strings_udf` (#12091) @brandon-b-miller
- Mark nvcomp zstd compression stable (#12059) @jbrennan333
- Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
- Enable building against the libarrow contained in pyarrow (#12034) @vyasr
- Add strings `like` jni and native method (#12032) @cindyyuanjiang
- Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
- byte_range support for JSON Lines format (#12017) @karthikeyann
- Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
- Add inplace arithmetic operators to `MaskedType` (#11987) @brandon-b-miller
- Implement JNI for chunked Parquet reader (#11961) @ttnghia
- Add method argument to DataFrame.quantile (#11957) @rjzamora
- Add gpu memory watermark apis to JNI (#11950) @abellina
- Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
- Enable returning string data from UDFs used through `apply` (#11933) @brandon-b-miller
- Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
- Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Enable CEC for `strings_udf` (#11884) @brandon-b-miller
- ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
- Implement chunked Parquet reader (#11867) @ttnghia
- Add `read_orc_metadata` to libcudf (#11815) @vuule
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95
🛠️ Improvements
- Reduce number of tests marked `spilling` (#12197) @madsbk
- Pin `dask` and `distributed` for release (#12165) @galipremsagar
- Don't rely on GNU find in headers_test.sh (#12164) @wence-
- Update cp.clip call (#12148) @quasiben
- Enable automatic column projection in groupby().agg (#12124) @rjzamora
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Spilling to host memory (#12106) @madsbk
- First pass of `pd.read_orc` changes in tests (#12103) @galipremsagar
- Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
- Remove CUDA 10 compatibility code. (#12088) @bdice
- Move and update `dask` nightly install in CI (#12082) @galipremsagar
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Remove macros that inspect the contents of exceptions (#12076) @vyasr
- Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
- Remove overflow err...
v22.12.00
🚨 Breaking Changes
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Rollback of `DeviceBufferLike` (#12009) @madsbk
- Remove unused `managed_allocator` (#12005) @vyasr
- Pass column names to `write_csv` instead of `table_metadata` pointer (#11972) @vuule
- Accept const refs instead of const unique_ptr refs in reduce and scan APIs. (#11960) @vyasr
- Default to equal NaNs in make_merge_sets_aggregation. (#11952) @bdice
- Remove validation that requires introspection (#11938) @vyasr
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Add tests ensuring that cudf's default stream is always used (#11875) @vyasr
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Default to equal NaNs in make_collect_set_aggregation. (#11621) @bdice
- Removing int8 column option from parquet byte_array writing (#11539) @hyperbolic2346
- part1: Simplify BaseIndex to an abstract class (#10389) @skirui-source
🐛 Bug Fixes
- Fix include line for IO Cython modules (#12250) @vyasr
- Make dask pinning looser (#12231) @vyasr
- Workaround for CUB segmented-sort bug with boolean keys (#12217) @davidwendt
- Fix `from_dict` backend dispatch to match upstream `dask` (#12203) @galipremsagar
- Merge branch-22.10 into branch-22.12 (#12198) @davidwendt
- Fix compression in ORC writer (#12194) @vuule
- Don't use CMake 3.25.0 as it has a show stopping FindCUDAToolkit bug (#12188) @robertmaynard
- Fix data corruption when reading ORC files with empty stripes (#12160) @vuule
- Fix decimal binary operations (#12142) @galipremsagar
- Ensure dlpack include is provided to cudf interop lib (#12139) @robertmaynard
- Safely allocate `udf_string` pointers in `strings_udf` (#12138) @brandon-b-miller
- Fix/disable jitify lto (#12122) @robertmaynard
- Fix conditional_full_join benchmark (#12121) @GregoryKimball
- Fix regex working-memory-size refactor error (#12119) @davidwendt
- Add in negative size checks for columns (#12118) @revans2
- Add JNI for `substring` without 'end' parameter. (#12113) @firestarman
- Fix reading of CSV files with blank second row (#12098) @vuule
- Fix an error in IO with `GzipFile` type (#12085) @galipremsagar
- Workaround groupby aggregate thrust::copy_if overflow (#12079) @davidwendt
- Fix alignment of compressed blocks in ORC writer (#12077) @vuule
- Fix singleton-range `__setitem__` edge case (#12075) @wence-
- Fix type promotion edge cases in numerical binops (#12074) @wence-
- Force using old fmt in nvbench. (#12067) @vyasr
- Fixes List offset bug in Nested JSON reader (#12060) @karthikeyann
- Allow falling back to `shim_60.ptx` by default in `strings_udf` (#12056) @brandon-b-miller
- Force black exclusions for pre-commit. (#12036) @bdice
- Add `memory_usage` & `items` implementation for `Struct` column & dtype (#12033) @galipremsagar
- Reduce/Remove reliance on `**kwargs` and `*args` in `IO` readers & writers (#12025) @galipremsagar
- Fixes bug in csv_reader_options construction in cython (#12021) @karthikeyann
- Fix issues when both `usecols` and `names` options are used in `read_csv` (#12018) @vuule
- Port thrust's pinned_allocator to cudf, since Thrust 1.17 removes the type (#12004) @robertmaynard
- Revert "Replace most of preprocessor usage in nvcomp adapter with `constexpr`" (#11999) @vuule
- Fix bug where `df.loc` resulting in single row could give wrong index (#11998) @eriknw
- Switch to DISABLE_DEPRECATION_WARNINGS to match other RAPIDS projects (#11989) @robertmaynard
- Fix maximum page size estimate in Parquet writer (#11962) @vuule
- Fix local offset handling in bgzip reader (#11918) @upsj
- Fix an issue reading struct-of-list types in Parquet. (#11910) @nvdbaranec
- Fix memcheck error in TypeInference.Timestamp gtest (#11905) @davidwendt
- Fix type casting in Series.setitem (#11904) @wence-
- Fix memcheck error in get_dremel_data (#11903) @davidwendt
- Fixes Unsupported column type error due to empty list columns in Nested JSON reader (#11897) @karthikeyann
- Fix segmented-sort to ignore indices outside the offsets (#11888) @davidwendt
- Fix cudf::stable_sorted_order for NaN and -NaN in FLOAT64 columns (#11874) @davidwendt
- Fix writing of Parquet files with many fragments (#11869) @etseidl
- Fix RangeIndex unary operators. (#11868) @vyasr
- JNI Avoid NPE for reading host binary data (#11865) @revans2
- Fix decimal benchmark input data generation (#11863) @karthikeyann
- Fix pre-commit copyright check (#11860) @galipremsagar
- Fix Parquet support for seconds and milliseconds duration types (#11854) @vuule
- Ensure better compiler cache results between cudf cal-ver branches (#11835) @robertmaynard
- Fix make_column_from_scalar for all-null strings column (#11807) @davidwendt
- Tell jitify_preprocess where to search for libnvrtc (#11787) @robertmaynard
- add V2 page header support to parquet reader (#11778) @etseidl
- Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing (#11752) @nvdbaranec
- Determine if Arrow has S3 support at runtime in unit test. (#11560) @bdice
📖 Documentation
- Use rapidsai CODE_OF_CONDUCT.md (#12166) @bdice
- Add symlinks to notebooks. (#12128) @bdice
- Add `truncate` API to python doc pages (#12109) @galipremsagar
- Update Numba docs links. (#12107) @bdice
- Remove "Multi-GPU with Dask-cuDF" notebook. (#12095) @bdice
- Fix link to c++ developer guide from `CONTRIBUTING.md` (#12084) @brandon-b-miller
- Add pivot_table and crosstab to docs. (#12014) @bdice
- Fix doxygen text for cudf::dictionary::encode (#11991) @davidwendt
- Replace default_stream_value with get_default_stream in docs. (#11985) @vyasr
- Add dtype docs pages and docstrings for `cudf` specific dtypes (#11974) @galipremsagar
- Update Unit Testing in libcudf guidelines to code tests outside the cudf::test namespace (#11959) @davidwendt
- Rename libcudf++ to libcudf. (#11953) @bdice
- Fix documentation referring to removed as_gpu_matrix method. (#11937) @bdice
- Remove "experimental" warning for struct columns in ORC reader and writer (#11880) @vuule
- Initial draft of policies and guidelines for libcudf usage. (#11853) @vyasr
- Add clear indication of non-GPU accelerated parameters in read_json docstring (#11825) @GregoryKimball
- Add developer docs for writing tests (#11199) @vyasr
🚀 New Features
- Adds an EventHandler to Java MemoryBuffer to be invoked on close (#12125) @abellina
- Support `+` in `strings_udf` (#12117) @brandon-b-miller
- Support `upper` and `lower` in `strings_udf` (#12099) @brandon-b-miller
- Add wheel builds (#12096) @vyasr
- Allow setting malloc heap size in string udfs (#12094) @brandon-b-miller
- Support `strip`, `lstrip`, and `rstrip` in `strings_udf` (#12091) @brandon-b-miller
- Mark nvcomp zstd compression stable (#12059) @jbrennan333
- Add debug-only onAllocated/onDeallocated to RmmEventHandler (#12054) @abellina
- Enable building against the libarrow contained in pyarrow (#12034) @vyasr
- Add strings `like` jni and native method (#12032) @cindyyuanjiang
- Cleanup common parsing code in JSON, CSV reader (#12022) @karthikeyann
- byte_range support for JSON Lines format (#12017) @karthikeyann
- Minor cleanup of root CMakeLists.txt for better organization (#11988) @robertmaynard
- Add inplace arithmetic operators to `MaskedType` (#11987) @brandon-b-miller
- Implement JNI for chunked Parquet reader (#11961) @ttnghia
- Add method argument to DataFrame.quantile (#11957) @rjzamora
- Add gpu memory watermark apis to JNI (#11950) @abellina
- Adds retryCount to RmmEventHandler.onAllocFailure (#11940) @abellina
- Enable returning string data from UDFs used through `apply` (#11933) @brandon-b-miller
- Switch over to rapids-cmake patches for thrust (#11921) @robertmaynard
- Add strings udf C++ classes and functions for phase II (#11912) @davidwendt
- Trim quotes for non-string values in nested json parsing (#11898) @karthikeyann
- Enable CEC for `strings_udf` (#11884) @brandon-b-miller
- ArrowIPCTableWriter writes an empty batch in the case of an empty table. (#11883) @firestarman
- Implement chunked Parquet reader (#11867) @ttnghia
- Add `read_orc_metadata` to libcudf (#11815) @vuule
- Support nested types as groupby keys in libcudf (#11792) @PointKernel
- Adding feature Truncate to DataFrame and Series (#11435) @VamsiTallam95
🛠️ Improvements
- Reduce number of tests marked `spilling` (#12197) @madsbk
- Pin `dask` and `distributed` for release (#12165) @galipremsagar
- Don't rely on GNU find in headers_test.sh (#12164) @wence-
- Update cp.clip call (#12148) @quasiben
- Enable automatic column projection in groupby().agg (#12124) @rjzamora
- Refactor `purge_nonempty_nulls` (#12111) @ttnghia
- Create an `int8` column in `read_csv` when all elements are missing (#12110) @vuule
- Spilling to host memory (#12106) @madsbk
- First pass of `pd.read_orc` changes in tests (#12103) @galipremsagar
- Expose engine argument in dask_cudf.read_json (#12101) @rjzamora
- Remove CUDA 10 compatibility code. (#12088) @bdice
- Move and update `dask` nightly install in CI (#12082) @galipremsagar
- Throw an error when libcudf is built without cuFile and `LIBCUDF_CUFILE_POLICY` is set to `"ALWAYS"` (#12080) @vuule
- Remove macros that inspect the contents of exceptions (#12076) @vyasr
- Fix ingest_raw_data performance issue in Nested JSON reader due to RVO (#12070) @karthikeyann
- Remove overflow error during decimal binops (#12063) @galipremsagar
- Change cudf::detail::...
[NIGHTLY] v23.02.00
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Remove column names (#12578) @vuule
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Upgrade to `arrow-10.0.1` (#12327) @galipremsagar
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
- Remove deprecated code for 23.02 (#12281) @vyasr
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Rename `cudf::structs::detail::superimpose_parent_nulls` APIs (#12230) @ttnghia
- Remove JIT type names, refactor id_to_type. (#12158) @bdice
- Floor division uses integer division for integral arguments (#12131) @wence-
🐛 Bug Fixes
- Fix update-version.sh (#12745) @raydouglass
- Fix a mask data corruption in UDF (#12647) @galipremsagar
- pre-commit: Update isort version to 5.12.0 (#12645) @wence-
- tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
- Revert regex program java APIs and tests (#12639) @cindyyuanjiang
- Fix leaks in ColumnVectorTest (#12625) @jlowe
- Handle when spillable buffers own each other (#12607) @madsbk
- Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
- lists: Transfer dtypes correctly through list.get (#12586) @wence-
- timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
- Fixing BUG, `get_next_chunk()` should use the blocking function `device_read()` (#12584) @madsbk
- Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
- `partition_by_hash()`: support index (#12554) @madsbk
- Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
- Update List Lexicographical Comparator (#12538) @divyegala
- Dynamically read PTX version (#12534) @brandon-b-miller
- build.sh switch to use `RAPIDS` magic value (#12525) @robertmaynard
- Loosen runtime arrow pinning (#12522) @vyasr
- Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
- Fix issues with parquet chunked reader (#12488) @nvdbaranec
- Fix missing metadata transfer in concat for `ListColumn` (#12487) @galipremsagar
- Rename libcudf substring source files to slice (#12484) @davidwendt
- Fix compile issue with arrow 10 (#12465) @ttnghia
- Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
- Fix xfail incompatibilities (#12423) @vyasr
- Fix bug in Parquet column index encoding (#12404) @etseidl
- When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
- Fix get_json_object to return empty column on empty input (#12384) @davidwendt
- Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
- Fix reductions any/all return value for empty input (#12374) @davidwendt
- Fix debug compile errors in parquet.hpp (#12372) @davidwendt
- Purge non-empty nulls in `cudf::make_lists_column` (#12370) @ttnghia
- Use correct memory resource in io::make_column (#12364) @vyasr
- Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- Fix NumericPairIteratorTest for float values (#12306) @davidwendt
- Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
- Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
- Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
- Fix compile issue in `json_chunked_reader.cpp` (#12280) @ttnghia
- Change reductions any/all to return valid values for empty input (#12279) @davidwendt
- Only exclude join keys that are indices from key columns (#12271) @wence-
- Fix spill to device limit (#12252) @madsbk
- Correct behaviour of sort in `concat` for singleton concatenations (#12247) @wence-
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
- Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
- Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
- Fix page size calculation in Parquet writer (#12182) @etseidl
- Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
- Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
- Floor division uses integer division for integral arguments (#12131) @wence-
📖 Documentation
- Fix link to NVTX (#12598) @sameerz
- Include missing groupby functions in documentation (#12580) @quasiben
- Fix documentation author (#12527) @bdice
- Update libcudf reduction docs for casting output types (#12526) @davidwendt
- Add JSON reader page in user guide (#12499) @GregoryKimball
- Link unsupported iteration API docstrings (#12482) @galipremsagar
- `strings_udf` doc update (#12469) @brandon-b-miller
- Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
- Update pre-commit hooks guide (#12395) @bdice
- Update test docs to not use detail comparison utilities (#12332) @PointKernel
- Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
- Add eval to docs. (#12322) @vyasr
- Turn on xfail_strict=true (#12244) @wence-
- Update 10 minutes to cuDF (#12114) @wence-
🚀 New Features
- Use kvikIO as the default IO backend (#12574) @vuule
- Use `has_nonempty_nulls` instead of `may_contain_non_empty_nulls` in `superimpose_nulls` and `push_down_nulls` (#12560) @ttnghia
- Add strings methods removeprefix and removesuffix (#12557) @davidwendt
- Add `regex_program` java APIs and unit tests (#12548) @cindyyuanjiang
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Make string quoting optional on CSV write (#12539) @mythrocks
- Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
- Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
- `one_hot_encode` to use experimental row comparators (#12478) @divyegala
- Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
- Add JSON Writer (#12474) @karthikeyann
- Refactor `thrust_copy_if` into `cudf::detail::copy_if_safe` (#12455) @ttnghia
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Extract `tokenize_json.hpp` detail header from `src/io/json/nested_json.hpp` (#12432) @ttnghia
- JNI bindings to write CSV (#12425) @mythrocks
- Nested JSON depth benchmark (#12371) @karthikeyann
- Implement `lists::reverse` (#12336) @ttnghia
- Use `device_read` in experimental `read_json` (#12314) @vuule
- Implement JNI for `strings::reverse` (#12283) @ttnghia
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
- Add environment variable to control host memory allocation in `hostdevice_vector` (#12251) @vuule
- Add cudf::strings::reverse function (#12227) @davidwendt
- Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
- Support `replace` in `strings_udf` (#12207) @brandon-b-miller
- Add support to read binary encoded decimals in parquet (#12205) @PointKernel
- Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
- Updating `stream_compaction/unique` to use new row comparators (#12159) @divyegala
- Add device buffer datasource (#12024) @PointKernel
- Implement groupby apply with JIT (#11452) @bwyogatama
🛠️ Improvements
- Update shared workflow branches (#12696) @ajschmidt8
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Don't upload `libcudf-example` to Anaconda.org (#12671) @ajschmidt8
- Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
- Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Version a parquet writer xfail (#12579) @galipremsagar
- Remove column names (#12578) @vuule
- Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
- Add support for `category` dtypes in CSV reader (#12571) @galipremsagar
- Remove `spill_lock` parameter from `SpillableBuffer.get_ptr()` (#12564) @madsbk
- Optimize `cudf::make_lists_column` (#12547) @ttnghia
- Remove `cudf::strings::repeat_strings_output_sizes` from Java and JNI (#12546) @ttnghia
- Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
- Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
- Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
- Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
- Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
- More `@acquire_spill_lock()` and `as_buffer(..., exposed=False)` (#12535) @madsbk
- Guard CUDA runtime APIs with error checking (#12531) @PointKernel
- Update TODOs from issue 10432. (#12528) @bdice
- Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Fix SUM/MEAN aggregation type support. (#12503) @bdice
- Stop using pandas._testing (#12492) @vyasr
- Fix ROLLING_TEST gtests coded in namespace cudf::test...