v22.08.00
🚨 Breaking Changes
- Remove legacy join APIs (#11274) @vyasr
- Remove lists::drop_list_duplicates (#11236) @ttnghia
- Remove Index.replace API (#11131) @vyasr
- Remove deprecated Index methods from Frame (#11073) @vyasr
- Remove public API of cudf.merge_sorted. (#11032) @bdice
- Drop python 3.7 in code-base (#11029) @galipremsagar
- Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
- Remove Arrow CUDA IPC code (#10995) @shwina
- Buffer: make .ptr read-only (#10872) @madsbk
🐛 Bug Fixes
- Fix distributed error related to loop_in_thread (#11428) @galipremsagar
- Relax arrow pinning to just 8.x and remove cuda build dependency from cudf recipe (#11412) @kkraus14
- Revert "Allow CuPy 11" (#11409) @jakirkham
- Fix moto timeouts (#11369) @galipremsagar
- Set +/-infinity as the identity values for floating-point numbers in device operators min and max (#11357) @ttnghia
- Fix memory_usage() for ListSeries (#11355) @thomcom
- Fix constructing Column from column_view with expired mask (#11354) @shwina
- Handle parquet corner case: Columns with more rows than are in the row group. (#11353) @nvdbaranec
- Fix DatetimeIndex & TimedeltaIndex constructors (#11342) @galipremsagar
- Fix unsigned-compare compile warning in IntPow binops (#11339) @davidwendt
- Fix performance issue and add a new code path to cudf::detail::contains (#11330) @ttnghia
- Pin pytorch to temporarily unblock from libcupti errors (#11289) @galipremsagar
- Workaround for nvcomp zstd overwriting blocks for orc due to underestimate of sizes (#11288) @jbrennan333
- Fix inconsistency when hashing two tables in cudf::detail::contains (#11284) @ttnghia
- Fix issue related to numpy array and category dtype (#11282) @galipremsagar
- Add NotImplementedError when on is specified in DataFrame.join. (#11275) @vyasr
- Fix invalid allocate_like() and empty_like() tests. (#11268) @nvdbaranec
- Returns DataFrame When Concating Along Axis 1 (#11263) @isVoid
- Fix compile error due to missing header (#11257) @ttnghia
- Fix a memory aliasing/crash issue in scatter for lists. (#11254) @nvdbaranec
- Fix tests/rolling/empty_input_test (#11238) @ttnghia
- Fix const qualifier when using host_span<bitmask_type const*> (#11220) @ttnghia
- Avoid using nvcompBatchedDeflateDecompressGetTempSizeEx in cuIO (#11213) @vuule
- Generate benchmark data with correct run length regardless of cardinality (#11205) @vuule
- Fix cumulative count index behavior (#11188) @brandon-b-miller
- Fix assertion in dask_cudf test_struct_explode (#11170) @rjzamora
- Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161) @res-life
- Fix compatibility issues with pandas 1.4.3 (#11152) @vyasr
- Ensure cuco export set is installed in cmake build (#11147) @jlowe
- Avoid redundant deepcopy in cudf.from_pandas (#11142) @galipremsagar
- Fix compile error due to missing header (#11126) @ttnghia
- Fix __cuda_array_interface__ failures (#11113) @galipremsagar
- Support octal and hex within regex character class pattern (#11112) @davidwendt
- Fix split_re matching logic for word boundaries (#11106) @davidwendt
- Handle multiple files metadata in read_parquet (#11105) @galipremsagar
- Fix index alignment for Series objects with repeated index (#11103) @shwina
- FindcuFile now searches in the current CUDA Toolkit location (#11101) @robertmaynard
- Fix regex word boundary logic to include underline (#11099) @davidwendt
- Exclude CudaFatalTest when selecting all Java tests (#11083) @jlowe
- Fix duplicate cudatoolkit pinning issue (#11070) @galipremsagar
- Maintain the input index in the result of a groupby-transform (#11068) @shwina
- Fix bug with row count comparison for expect_columns_equivalent(). (#11059) @nvdbaranec
- Fix BPE uninitialized size value for null and empty input strings (#11054) @davidwendt
- Include missing header for usage of get_current_device_resource() (#11047) @AtlantaPepsi
- Fix warn_unused_result error in parquet test (#11026) @karthikeyann
- Return empty dataframe when reading a Parquet file using empty columns option (#11018) @vuule
- Fix small error in page row count limiting (#10991) @etseidl
- Fix a row index entry error in ORC writer issue (#10989) @vuule
- Fix grouped covariance to require both values to be convertible to double. (#10891) @bdice
📖 Documentation
- Fix issues with day & night modes in python docs (#11400) @galipremsagar
- Update missing data handling APIs in docs (#11345) @galipremsagar
- Add lists filtering APIs to doxygen group. (#11336) @bdice
- Remove unused import in README sample (#11318) @vyasr
- Note null behavior in where docs (#11276) @brandon-b-miller
- Update docstring for spans in get_row_data_range (#11271) @vyasr
- Update nvCOMP integration table (#11231) @vuule
- Add dev docs for documentation writing (#11217) @vyasr
- Documentation fix for concatenate (#11187) @dagardner-nv
- Fix unresolved links in markdown (#11173) @karthikeyann
- Fix cudf version in README.md install commands (#11164) @jvanstraten
- Switch language from None to "en" in docs build (#11133) @galipremsagar
- Remove docs mentioning scalar_view since no such class exists. (#11132) @bdice
- Add docstring entry for DataFrame.value_counts (#11039) @galipremsagar
- Add docs to rolling var, std, count. (#11035) @bdice
- Fix docs for Numba UDFs. (#11020) @bdice
- Replace column comparison utilities functions with macros (#11007) @karthikeyann
- Fix Doxygen warnings in multiple headers files (#11003) @karthikeyann
- Fix doxygen warnings in utilities/ headers (#10974) @karthikeyann
- Fix Doxygen warnings in table header files (#10964) @karthikeyann
- Fix Doxygen warnings in column header files (#10963) @karthikeyann
- Fix Doxygen warnings in strings / header files (#10937) @karthikeyann
- Generate Doxygen Tag File for Libcudf (#10932) @isVoid
- Fix doxygen warnings in structs, lists headers (#10923) @karthikeyann
- Fix doxygen warnings in fixed_point.hpp (#10922) @karthikeyann
- Fix doxygen warnings in ast/, rolling, tdigest/, wrappers/, dictionary/ headers (#10921) @karthikeyann
- fix doxygen warnings in cudf/io/types.hpp, other header files (#10913) @karthikeyann
- fix doxygen warnings in cudf/io/ avro, csv, json, orc, parquet header files (#10912) @karthikeyann
- Fix doxygen warnings in cudf/*.hpp (#10896) @karthikeyann
- Add missing documentation in aggregation.hpp (#10887) @karthikeyann
- Revise PR template. (#10774) @bdice
🚀 New Features
- Change cmake to allow controlling Arrow version via cmake variable (#11429) @kkraus14
- Adding support for list<int8> columns to be written as byte arrays in parquet (#11328) @hyperbolic2346
- Adding byte array view structure (#11322) @hyperbolic2346
- Adding byte_array statistics (#11303) @hyperbolic2346
- Add column indexes to Parquet writer (#11302) @etseidl
- Provide an Option for Default Integer and Floating Bitwidth (#11272) @isVoid
- FST benchmark (#11243) @karthikeyann
- Adds the Finite-State Transducer algorithm (#11242) @elstehle
- Refactor collect_set to use cudf::distinct and cudf::lists::distinct (#11228) @ttnghia
- Treat zstd as stable in nvcomp releases 2.3.2 and later (#11226) @jbrennan333
- Add 24 bit dictionary support to Parquet writer (#11216) @devavret
- Enable positive group indices for extractAllRecord on JNI (#11215) @anthony-chang
- JNI bindings for NTH_ELEMENT window aggregation (#11201) @mythrocks
- Add JNI bindings for extractAllRecord (#11196) @anthony-chang
- Add cudf.options (#11193) @isVoid
- Add thrift support for parquet column and offset indexes (#11178) @etseidl
- Adding binary read/write as options for parquet (#11160) @hyperbolic2346
- Support nth_element for window functions (#11158) @mythrocks
- Implement lists::distinct and cudf::detail::stable_distinct (#11149) @ttnghia
- Implement Groupby pct_change (#11144) @skirui-source
- Add JNI for set operations (#11143) @ttnghia
- Remove deprecated PER_THREAD_DEFAULT_STREAM (#11134) @jbrennan333
- Added a Java method to check the existence of a list of keys in a map (#11128) @razajafri
- Feature/python benchmarking (#11125) @vyasr
- Support nan_equality in cudf::distinct (#11118) @ttnghia
- Added JNI for getMapValueForKeys (#11104) @razajafri
- Refactor semi_anti_join (#11100) @ttnghia
- Replace remaining instances of rmm::cuda_stream_default with cudf::default_stream_value (#11082) @jbrennan333
- Adds the Logical Stack algorithm (#11078) @elstehle
- Add doxygen-check pre-commit hook (#11076) @karthikeyann
- Use new nvCOMP API to optimize the decompression temp memory size (#11064) @vuule
- Add Doxygen CI check (#11057) @karthikeyann
- Support duplicate_keep_option in cudf::distinct (#11052) @ttnghia
- Support set operations (#11043) @ttnghia
- Support for ZLIB compression in ORC writer (#11036) @vuule
- Adding feature swaplevels (#11027) @VamsiTallam95
- Use nvCOMP for ZLIB decompression in ORC reader (#11024) @vuule
- Function for bfill, ffill #9591 (#11022) @Sreekiran096
- Generate group offsets from element labels (#11017) @ttnghia
- Feature axes (#10979) @VamsiTallam95
- Generate group labels from offsets (#10945) @ttnghia
- Add missing cuIO benchmark coverage for duration types (#10933) @vuule
- Dask-cuDF cumulative groupby ops (#10889) @brandon-b-miller
- Reindex Improvements (#10815) @brandon-b-miller
- Implement value_counts for DataFrame (#10813) @martinfalisse
🛠️ Improvements
- Pin dask & distributed for release (#11433) @galipremsagar
- Use documented header template for doxygen (#11430) @galipremsagar
- Relax arrow version in dev env (#11418) @galipremsagar
- Allow CuPy 11 (#11393) @jakirkham
- Improve multibyte_split performance (#11347) @cwharris
- Switch death test to use explicit trap. (#11326) @vyasr
- Add --output-on-failure to ctest args. (#11321) @vyasr
- Consolidate remaining DataFrame/Series APIs (#11315) @vyasr
- Add JNI support for the join_strings API (#11309) @revans2
- Add cupy version to setup.py install_requires (#11306) @vyasr
- removing some unused code (#11305) @hyperbolic2346
- Add test of wildcard selection (#11300) @vyasr
- Update parquet reader to take stream parameter (#11294) @PointKernel
- Spark list hashing (#11292) @bdice
- Remove legacy join APIs (#11274) @vyasr
- Fix cudf recipes syntax (#11273) @ajschmidt8
- Fix cudf recipe (#11267) @ajschmidt8
- Cleanup config files (#11266) @vyasr
- Run mypy on all packages (#11265) @vyasr
- Update to isort 5.10.1. (#11262) @vyasr
- Consolidate flake8 and pydocstyle configuration (#11260) @vyasr
- Remove redundant black config specifications. (#11258) @vyasr
- Ensure DeprecationWarnings are not introduced via pre-commit (#11255) @wence-
- Optimization to gpu::PreprocessColumnData in parquet reader. (#11252) @nvdbaranec
- Move rolling impl details to detail/ directory. (#11250) @mythrocks
- Remove lists::drop_list_duplicates (#11236) @ttnghia
- Use cudf::lists::distinct in Python binding (#11234) @ttnghia
- Use cudf::lists::distinct in Java binding (#11233) @ttnghia
- Use cudf::distinct in Java binding (#11232) @ttnghia
- Pin dask-cuda in dev environment (#11229) @galipremsagar
- Remove cruft in map_lookup (#11221) @mythrocks
- Deprecate skiprows & num_rows in parquet reader (#11218) @galipremsagar
- Remove Frame._index (#11210) @vyasr
- Improve performance for cudf::contains when searching for a scalar (#11202) @ttnghia
- Document why Development component is needing for CMake. (#11200) @vyasr
- cleanup unused code in rolling_test.hpp (#11195) @karthikeyann
- Standardize join internals around DataFrame (#11184) @vyasr
- Move character case table declarations from src to detail (#11183) @davidwendt
- Remove usage of Frame in StringMethods (#11181) @vyasr
- Expose get_json_object_options to Python (#11180) @SrikarVanavasam
- Fix decimal128 stats in parquet writer (#11179) @etseidl
- Modify CheckPageRows in parquet_test to use datasources (#11177) @etseidl
- Pin max version of cuda-python to 11.7.0 (#11174) @Ethyling
- Refactor and optimize Frame.where (#11168) @vyasr
- Add npos const static member to cudf::string_view (#11166) @davidwendt
- Move _drop_rows_by_label from Frame to IndexedFrame (#11157) @vyasr
- Clean up _copy_type_metadata (#11156) @vyasr
- Add nvcc conda package in dev environment (#11154) @galipremsagar
- Struct binary comparison op functionality for spark rapids (#11153) @rwlee
- Refactor inline conditionals. (#11151) @bdice
- Refactor Spark hashing tests (#11145) @bdice
- Add new _from_data_like_self factory (#11140) @vyasr
- Update get_cucollections to use rapids-cmake (#11139) @vyasr
- Remove unnecessary extra function for libcudacxx detection (#11138) @vyasr
- Allow initial value for cudf::reduce and cudf::segmented_reduce. (#11137) @SrikarVanavasam
- Remove Index.replace API (#11131) @vyasr
- Move char-type table function declarations from src to detail (#11127) @davidwendt
- Clean up repo root (#11124) @bdice
- Improve print formatting of strings containing newline characters. (#11108) @nvdbaranec
- Fix cudf::string_view::find() to return pos for empty string argument (#11107) @davidwendt
- Forward-merge branch-22.06 to branch-22.08 (#11086) @bdice
- Take iterators by value in clamp.cu. (#11084) @bdice
- Performance improvements for row to column conversions (#11075) @hyperbolic2346
- Remove deprecated Index methods from Frame (#11073) @vyasr
- Use per-page max compressed size estimate for compression (#11066) @devavret
- column to row refactor for performance (#11063) @hyperbolic2346
- Include skbuild directory into build.sh clean operation (#11060) @galipremsagar
- Unpin dask & distributed for development (#11058) @galipremsagar
- Add support for Series.between (#11051) @galipremsagar
- Fix groupby include (#11046) @bwyogatama
- Regex cleanup internal reclass and reclass_device classes (#11045) @davidwendt
- Remove public API of cudf.merge_sorted. (#11032) @bdice
- Drop python 3.7 in code-base (#11029) @galipremsagar
- Addition & integration of the integer power operator (#11025) @AtlantaPepsi
- Refactor lists::contains (#11019) @ttnghia
- Change build.sh to find C++ library by default and avoid shadowing CMAKE_ARGS (#11013) @vyasr
- Clean up parquet unit test (#11005) @PointKernel
- Add missing #pragma once to header files (#11004) @karthikeyann
- Cleanup iterator.cuh and add fixed point support for scalar_optional_accessor (#10999) @ttnghia
- Refactor cudf::contains (#10997) @ttnghia
- Remove Arrow CUDA IPC code (#10995) @shwina
- Change file extension for groupby benchmark (#10985) @ttnghia
- Sort recipe include checks. (#10984) @bdice
- Update cuCollections for thrust upgrade (#10983) @PointKernel
- Expose row-group size options in cudf ParquetWriter (#10980) @rjzamora
- Cleanup cudf::strings::detail::regex_parser class source (#10975) @davidwendt
- Handle missing fields as nulls in get_json_object() (#10970) @SrikarVanavasam
- Fix license families to match all-caps expected by conda-verify. (#10931) @bdice
- Include <optional> for GCC 11 compatibility. (#10927) @bdice
- Enable builds with scikit-build (#10919) @vyasr
- Improve distinct by using cuco::static_map::retrieve_all (#10916) @PointKernel
- update cudfjni to 22.08.0-SNAPSHOT (#10910) @pxLi
- Improve the capture of fatal cuda error (#10884) @sperlingxx
- Cleanup regex compiler operators and operands source (#10879) @davidwendt
- Buffer: make .ptr read-only (#10872) @madsbk
- Configurable NaN handling in device_row_comparators (#10870) @rwlee
- Register cudf.core.groupby.Grouper objects to dask grouper_dispatch (#10838) @brandon-b-miller
- Upgrade to arrow-8 (#10816) @galipremsagar
- Remove getattr method in RangeIndex class (#10538) @skirui-source
- Adding bins to value counts (#8247) @marlenezw