Cleanup CUB util_arch #2773

bernhardmgruber · 2024-11-12T10:56:36Z

This PR is a plain refactoring.

PR is blocked on C++14 support for constexpr ceil_div;

copy-pr-bot · 2025-01-13T14:25:53Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions · 2025-01-13T16:37:09Z

🟩 CI finished in 2h 09m: Pass: 100%/78 | Total: 2d 06h | Avg: 41m 41s | Max: 1h 13m | Hits: 148%/12368

🟩 cub: Pass: 100%/38 | Total: 1d 09h | Avg: 52m 51s | Max: 1h 13m | Hits: 33%/3108

🟩 cpu
  🟩 amd64              Pass: 100%/36  | Total:  1d 07h | Avg: 52m 38s | Max:  1h 13m | Hits:  33%/3108  
  🟩 arm64              Pass: 100%/2   | Total:  1h 53m | Avg: 56m 42s | Max: 57m 31s
🟩 ctk
  🟩 12.0               Pass: 100%/5   | Total:  4h 54m | Avg: 58m 49s | Max:  1h 00m | Hits:  33%/777   
  🟩 12.5               Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 07m
  🟩 12.6               Pass: 100%/31  | Total:  1d 02h | Avg: 51m 02s | Max:  1h 13m | Hits:  33%/2331  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 02m
  🟩 nvcc12.0           Pass: 100%/5   | Total:  4h 54m | Avg: 58m 49s | Max:  1h 00m | Hits:  33%/777   
  🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 07m
  🟩 nvcc12.6           Pass: 100%/29  | Total:  1d 00h | Avg: 50m 21s | Max:  1h 13m | Hits:  33%/2331  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 02m
  🟩 nvcc               Pass: 100%/36  | Total:  1d 07h | Avg: 52m 24s | Max:  1h 13m | Hits:  33%/3108  
🟩 cxx
  🟩 Clang14            Pass: 100%/4   | Total:  3h 58m | Avg: 59m 40s | Max:  1h 00m
  🟩 Clang15            Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
  🟩 Clang16            Pass: 100%/1   | Total: 55m 47s | Avg: 55m 47s | Max: 55m 47s
  🟩 Clang17            Pass: 100%/1   | Total: 57m 19s | Avg: 57m 19s | Max: 57m 19s
  🟩 Clang18            Pass: 100%/7   | Total:  5h 52m | Avg: 50m 24s | Max:  1h 02m
  🟩 GCC7               Pass: 100%/2   | Total:  1h 58m | Avg: 59m 14s | Max:  1h 01m
  🟩 GCC8               Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m
  🟩 GCC9               Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 00m
  🟩 GCC10              Pass: 100%/1   | Total: 57m 48s | Avg: 57m 48s | Max: 57m 48s
  🟩 GCC11              Pass: 100%/1   | Total: 58m 43s | Avg: 58m 43s | Max: 58m 43s
  🟩 GCC12              Pass: 100%/3   | Total:  1h 42m | Avg: 34m 19s | Max: 58m 26s
  🟩 GCC13              Pass: 100%/8   | Total:  5h 25m | Avg: 40m 40s | Max:  1h 01m
  🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 13m | Hits:  33%/1554  
  🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 06m | Hits:  33%/1554  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 07m
🟩 cxx_family
  🟩 Clang              Pass: 100%/14  | Total: 12h 46m | Avg: 54m 44s | Max:  1h 02m
  🟩 GCC                Pass: 100%/18  | Total: 14h 04m | Avg: 46m 54s | Max:  1h 01m
  🟩 MSVC               Pass: 100%/4   | Total:  4h 25m | Avg:  1h 06m | Max:  1h 13m | Hits:  33%/3108  
  🟩 NVHPC              Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 07m
🟩 gpu
  🟩 h100               Pass: 100%/2   | Total: 44m 31s | Avg: 22m 15s | Max: 28m 22s
  🟩 v100               Pass: 100%/36  | Total:  1d 08h | Avg: 54m 33s | Max:  1h 13m | Hits:  33%/3108  
🟩 jobs
  🟩 Build              Pass: 100%/31  | Total:  1d 06h | Avg: 58m 27s | Max:  1h 13m | Hits:  33%/3108  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 20m 00s | Avg: 20m 00s | Max: 20m 00s
  🟩 GraphCapture       Pass: 100%/1   | Total: 40m 56s | Avg: 40m 56s | Max: 40m 56s
  🟩 HostLaunch         Pass: 100%/3   | Total:  1h 05m | Avg: 21m 58s | Max: 30m 11s
  🟩 TestGPU            Pass: 100%/2   | Total:  1h 09m | Avg: 34m 51s | Max: 44m 46s
🟩 sm
  🟩 90                 Pass: 100%/2   | Total: 44m 31s | Avg: 22m 15s | Max: 28m 22s
  🟩 90a                Pass: 100%/1   | Total: 25m 06s | Avg: 25m 06s | Max: 25m 06s
🟩 std
  🟩 17                 Pass: 100%/14  | Total: 14h 16m | Avg:  1h 01m | Max:  1h 13m | Hits:  33%/2331  
  🟩 20                 Pass: 100%/24  | Total: 19h 12m | Avg: 48m 00s | Max:  1h 07m | Hits:  33%/777

🟩 thrust: Pass: 100%/37 | Total: 20h 03m | Avg: 32m 32s | Max: 1h 09m | Hits: 186%/9260

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 41m 52s | Avg: 20m 56s | Max: 27m 45s
🟩 cpu
  🟩 amd64              Pass: 100%/35  | Total: 19h 03m | Avg: 32m 40s | Max:  1h 09m | Hits: 186%/9260  
  🟩 arm64              Pass: 100%/2   | Total:  1h 00m | Avg: 30m 03s | Max: 32m 09s
🟩 ctk
  🟩 12.0               Pass: 100%/5   | Total:  3h 04m | Avg: 36m 49s | Max:  1h 00m | Hits: 142%/1852  
  🟩 12.5               Pass: 100%/2   | Total:  1h 47m | Avg: 53m 46s | Max: 55m 59s
  🟩 12.6               Pass: 100%/30  | Total: 15h 12m | Avg: 30m 24s | Max:  1h 09m | Hits: 198%/7408  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total: 52m 26s | Avg: 26m 13s | Max: 28m 04s
  🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 04m | Avg: 36m 49s | Max:  1h 00m | Hits: 142%/1852  
  🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 47m | Avg: 53m 46s | Max: 55m 59s
  🟩 nvcc12.6           Pass: 100%/28  | Total: 14h 19m | Avg: 30m 42s | Max:  1h 09m | Hits: 198%/7408  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total: 52m 26s | Avg: 26m 13s | Max: 28m 04s
  🟩 nvcc               Pass: 100%/35  | Total: 19h 11m | Avg: 32m 53s | Max:  1h 09m | Hits: 186%/9260  
🟩 cxx
  🟩 Clang14            Pass: 100%/4   | Total:  2h 01m | Avg: 30m 23s | Max: 32m 33s
  🟩 Clang15            Pass: 100%/1   | Total: 29m 13s | Avg: 29m 13s | Max: 29m 13s
  🟩 Clang16            Pass: 100%/1   | Total: 31m 01s | Avg: 31m 01s | Max: 31m 01s
  🟩 Clang17            Pass: 100%/1   | Total: 31m 09s | Avg: 31m 09s | Max: 31m 09s
  🟩 Clang18            Pass: 100%/7   | Total:  2h 43m | Avg: 23m 22s | Max: 33m 26s
  🟩 GCC7               Pass: 100%/2   | Total: 58m 27s | Avg: 29m 13s | Max: 29m 25s
  🟩 GCC8               Pass: 100%/1   | Total: 32m 41s | Avg: 32m 41s | Max: 32m 41s
  🟩 GCC9               Pass: 100%/2   | Total:  1h 06m | Avg: 33m 00s | Max: 34m 17s
  🟩 GCC10              Pass: 100%/1   | Total: 30m 19s | Avg: 30m 19s | Max: 30m 19s
  🟩 GCC11              Pass: 100%/1   | Total: 34m 51s | Avg: 34m 51s | Max: 34m 51s
  🟩 GCC12              Pass: 100%/1   | Total: 33m 16s | Avg: 33m 16s | Max: 33m 16s
  🟩 GCC13              Pass: 100%/8   | Total:  3h 00m | Avg: 22m 35s | Max: 35m 39s
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 58m | Avg: 59m 05s | Max:  1h 00m | Hits: 142%/3704  
  🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 45m | Avg: 55m 04s | Max:  1h 09m | Hits: 216%/5556  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 47m | Avg: 53m 46s | Max: 55m 59s
🟩 cxx_family
  🟩 Clang              Pass: 100%/14  | Total:  6h 16m | Avg: 26m 53s | Max: 33m 26s
  🟩 GCC                Pass: 100%/16  | Total:  7h 16m | Avg: 27m 15s | Max: 35m 39s
  🟩 MSVC               Pass: 100%/5   | Total:  4h 43m | Avg: 56m 40s | Max:  1h 09m | Hits: 186%/9260  
  🟩 NVHPC              Pass: 100%/2   | Total:  1h 47m | Avg: 53m 46s | Max: 55m 59s
🟩 gpu
  🟩 v100               Pass: 100%/37  | Total: 20h 03m | Avg: 32m 32s | Max:  1h 09m | Hits: 186%/9260  
🟩 jobs
  🟩 Build              Pass: 100%/31  | Total: 18h 33m | Avg: 35m 54s | Max:  1h 09m | Hits: 142%/7408  
  🟩 TestCPU            Pass: 100%/3   | Total: 51m 12s | Avg: 17m 04s | Max: 35m 13s | Hits: 365%/1852  
  🟩 TestGPU            Pass: 100%/3   | Total: 39m 17s | Avg: 13m 05s | Max: 14m 07s
🟩 sm
  🟩 90a                Pass: 100%/1   | Total: 19m 40s | Avg: 19m 40s | Max: 19m 40s
🟩 std
  🟩 17                 Pass: 100%/14  | Total:  9h 02m | Avg: 38m 43s | Max:  1h 00m | Hits: 142%/5556  
  🟩 20                 Pass: 100%/21  | Total: 10h 19m | Avg: 29m 30s | Max:  1h 09m | Hits: 253%/3704

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 13m 04s | Avg: 6m 32s | Max: 10m 57s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total: 13m 04s | Avg:  6m 32s | Max: 10m 57s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total: 13m 04s | Avg:  6m 32s | Max: 10m 57s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total: 13m 04s | Avg:  6m 32s | Max: 10m 57s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total: 13m 04s | Avg:  6m 32s | Max: 10m 57s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total: 13m 04s | Avg:  6m 32s | Max: 10m 57s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total: 13m 04s | Avg:  6m 32s | Max: 10m 57s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total: 13m 04s | Avg:  6m 32s | Max: 10m 57s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 07s | Avg:  2m 07s | Max:  2m 07s
  🟩 Test               Pass: 100%/1   | Total: 10m 57s | Avg: 10m 57s | Max: 10m 57s

🟩 python: Pass: 100%/1 | Total: 26m 25s | Avg: 26m 25s | Max: 26m 25s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 26m 25s | Avg: 26m 25s | Max: 26m 25s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 26m 25s | Avg: 26m 25s | Max: 26m 25s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 26m 25s | Avg: 26m 25s | Max: 26m 25s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 26m 25s | Avg: 26m 25s | Max: 26m 25s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 26m 25s | Avg: 26m 25s | Max: 26m 25s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 26m 25s | Avg: 26m 25s | Max: 26m 25s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 26m 25s | Avg: 26m 25s | Max: 26m 25s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 26m 25s | Avg: 26m 25s | Max: 26m 25s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
	Thrust
	CUDA Experimental
	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
+/-	python
+/-	CCCL C Parallel Library
+/-	Catch2Helper

🏃‍ Runner counts (total jobs: 78)

#	Runner
53	`linux-amd64-cpu16`
11	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`
4	`linux-arm64-cpu16`
1	`linux-amd64-gpu-h100-latest-1-testing`

fbusato · 2025-01-13T18:08:10Z

cub/cub/util_arch.cuh

+    ::cuda::std::max(1, Nominal4ByteItemsPerThread * 4 / ::cuda::std::max(4, int{sizeof(T)}));
+  static constexpr int BLOCK_THREADS =
+    ::cuda::std::min(Nominal4ByteBlockThreads,
+                     ::cuda::ceil_div(int{detail::max_smem_per_block} / (int{sizeof(T)} * ITEMS_PER_THREAD), 32) * 32);


remark. This is essentially round_up

Right. Looking forward to #3234 :)

implement `add_sat` split `signed`/`unsigned` implementation, improve implementation for MSVC improve device `add_sat` implementation add `add_sat` test improve generic `add_sat` implementation for signed types implement `sub_sat` allow more msvc intrinsics on x86 add op tests partially implement `mul_sat` implement `div_sat` and `saturate_cast` add `saturate_cast` test simplify `div_sat` test Deprectate C++11 and C++14 for libcu++ (#3173) * Deprectate C++11 and C++14 for libcu++ Co-authored-by: Bernhard Manfred Gruber <[email protected]> Implement `abs` and `div` from `cstdlib` (#3153) * implement integer abs functions * improve tests, fix constexpr support * just use the our implementation * implement `cuda::std::div` * prefer host's `div_t` like types * provide `cuda::std::abs` overloads for floats * allow fp abs for NVRTC * silence msvc's warning about conversion from floating point to integral Fix missing radix sort policies (#3174) Fixes NVBug 5009941 Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148) * introduces new arg{min,max} interface with two output iterators * adds fp inf tests * fixes docs * improves code example * fixes exec space specifier * trying to fix deprecation warning for more compilers * inlines unzip operator * trying to fix deprecation warning for nvhpc * integrates supression fixes in diagnostics * pre-ctk 11.5 deprecation suppression * fixes icc * fix for pre-ctk11.5 * cleans up deprecation suppression * cleanup Extend tuning documentation (#3179) Add codespell pre-commit hook, fix typos in CCCL (#3168) * Add codespell pre-commit hook * Automatic changes from codespell. * Manual changes. Fix parameter space for TUNE_LOAD in scan benchmark (#3176) fix various old compiler checks (#3178) implement C++26 `std::projected` (#3175) Fix pre-commit config for codespell and remaining typos (#3182) Massive cleanup of our config (#3155) Fix UB in atomics with automatic storage (#2586) * Adds specialized local cuda atomics and injects them into most atomics paths. Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> * Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478 * Remove extraneous double brackets in unformatted code. * Merge unsafe atomic logic into `__cuda_is_local`. * Use `const_cast` for type conversions in cuda_local.h * Fix build issues from interface changes * Fix missing __nanosleep on sm70- * Guard __isLocal from NVHPC * Use PTX instead of running nothing from NVHPC * fixup /s/nvrtc/nvhpc * Fixup missing CUDA ifdef surrounding device code * Fix codegen * Bypass some sort of compiler bug on GCC7 * Apply suggestions from code review * Use unsafe automatic storage atomics in codegen tests --------- Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> Co-authored-by: Michael Schellenberger Costa <[email protected]> Refactor the source code layout for `cuda.parallel` (#3177) * Refactor the source layout for cuda.parallel * Add copyright * Address review feedback * Don't import anything into `experimental` namespace * fix import --------- Co-authored-by: Ashwin Srinath <[email protected]> new type-erased memory resources (#2824) s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186) Document address stability of `thrust::transform` (#3181) * Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS * Reformat and fix UnaryFunction/BinaryFunction in transform docs * Mention transform can use proclaim_copyable_arguments * Document cuda::proclaims_copyable_arguments better * Deprecate depending on transform functor argument addresses Fixes: #3053 turn off cuda version check for clangd (#3194) [STF] jacobi example based on parallel_for (#3187) * Simple jacobi example with parallel for and reductions * clang-format * remove useless capture list fixes pre-nv_diag suppression issues (#3189) Prefer c2h::type_name over c2h::demangle (#3195) Fix memcpy_async* tests (#3197) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test Add type annotations and mypy checks for `cuda.parallel` (#3180) * Refactor the source layout for cuda.parallel * Add initial type annotations * Update pre-commit config * More typing * Fix bad merge * Fix TYPE_CHECKING and numpy annotations * typing bindings.py correctly * Address review feedback --------- Co-authored-by: Ashwin Srinath <[email protected]> Fix rendering of cuda.parallel docs (#3192) * Fix pre-commit config for codespell and remaining typos * Fix rendering of docs for cuda.parallel --------- Co-authored-by: Ashwin Srinath <[email protected]> Enable PDL for DeviceMergeSortBlockSortKernel (#3199) The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC. This commit enables PDL when launching the kernel. Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647) * adds benchmarks for reduce::arg{min,max} * preliminary streaming arg-extremum reduction * fixes implicit conversion * uses streaming dispatch class * changes arg benches to use new streaming reduce * streaming arg-extrema reduction * fixes style * fixes compilation failures * cleanups * adds rst style comments * declare vars const and use clamp * consolidates argmin argmax benchmarks * fixes thrust usage * drops offset type in arg-extrema benchmarks * fixes clang cuda * exec space macros * switch to signed global offset type for slightly better perf * clarifies documentation * applies minor benchmark style changes from review comments * fixes interface documentation and comments * list-init accumulating output op * improves style, comments, and tests * cleans up aggregate init * renames dispatch class usage in benchmarks * fixes merge conflicts * addresses review comments * addresses review comments * fixes assertion * removes superseded implementation * changes large problem tests to use new interface * removes obsolete tests for deprecated interface Fixes for Python 3.7 docs environment (#3206) Co-authored-by: Ashwin Srinath <[email protected]> Adds support for large number of items to `DeviceTransform` (#3172) * moves large problem test helper to common file * adds support for large num items to device transform * adds tests for large number of items to device interface * fixes format * addresses review comments cp_async_bulk: Fix test (#3198) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test * cp_async_bulk: Fix test The global memory pointer could be misaligned. cudax fixes for msvc 14.41 (#3200) avoid instantiating class templates in `is_same` implementation when possible (#3203) Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209) * Fix: make launchers a CUB detail; make kernel source functions hidden. * [pre-commit.ci] auto code formatting * Address review comments, fix which macro gets fixed. help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202) unify macros and cmake options that control the suppression of deprecation warnings (#3220) * unify macros and cmake options that control the suppression of deprecation warnings * suppress nvcc warning #186 in thrust header tests * suppress c++ dialect deprecation warnings in libcudacxx header tests Fx thread-reduce performance regression (#3225) cuda.parallel: In-memory caching of build objects (#3216) * Define __eq__ and __hash__ for Iterators * Define cache_with_key utility and use it to cache Reduce objects * Add tests for caching Reduce objects * Tighten up types * Updates to support 3.7 * Address review feedback * Introduce IteratorKind to hold iterator type information * Use the .kind to generate an abi_name * Remove __eq__ and __hash__ methods from IteratorBase * Move helper function * Formatting * Don't unpack tuple in cache key --------- Co-authored-by: Ashwin Srinath <[email protected]> Just enough ranges for c++14 `span` (#3211) use generalized concepts portability macros to simplify the `range` concept (#3217) fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR` Use Ruff to sort imports (#3230) * Update pyproject.tomls for import sorting * Update files after running pre-commit * Move ruff config to pyproject.toml --------- Co-authored-by: Ashwin Srinath <[email protected]> fix tuning_scan sm90 config issue (#3236) Co-authored-by: Shijie Chen <[email protected]> [STF] Logical token (#3196) * Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs. * Add missing files * Check if a task implementation can match a prototype where the void_interface arguments are ignored * Implement ctx.abstract_logical_data() which relies on a void data interface * Illustrate how to use abstract handles in local contexts * Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages * Small improvements in the examples * Do not try to allocate or move void data * Do not use I as a variable * fix linkage error * rename abtract_logical_data into logical_token * Document logical token * fix spelling error * fix sphinx error * reflect name changes * use meaningful variable names * simplify logical_token implementation because writeback is already disabled * add a unit test for token elision * implement token elision in host_launch * Remove unused type * Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens * Much simpler is_tuple_invocable_with_filtered implementation * Fix buggy test * Factorize code * Document that we can ignore tokens for task and host_launch * Documentation for logical data freeze Fix ReduceByKey tuning (#3240) Fix RLE tuning (#3239) cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233) * Forbid non-contiguous arrays as inputs (or outputs) * Implement a more robust way to check for contiguity * Don't bother if cublas unavailable * Fix how we check for zero-element arrays * sort imports --------- Co-authored-by: Ashwin Srinath <[email protected]> expands support for more offset types in segmented benchmark (#3231) Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253) * Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects * Do not add option twice ptx: Add add_instruction.py (#3190) This file helps create the necessary structure for new PTX instructions. Co-authored-by: Allard Hendriksen <[email protected]> Bump main to 2.9.0. (#3247) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop cub::Mutex (#3251) Fixes: #3250 Remove legacy macros from CUB util_arch.cuh (#3257) Fixes: #3256 Remove thrust::[unary|binary]_traits (#3260) Fixes: #3259 Architecture and OS identification macros (#3237) Bump main to 3.0.0. (#3265) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop thrust not1 and not2 (#3264) Fixes: #3263 CCCL Internal macro documentation (#3238) Deprecate GridBarrier and GridBarrierLifetime (#3258) Fixes: #1389 Require at least gcc7 (#3268) Fixes: #3267 Drop thrust::[unary|binary]_function (#3274) Fixes: #3273 Drop ICC from CI (#3277) [STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270) * Add a test to reproduce a bug observed with parallel_for on a host place * clang-format * use _CCCL_ASSERT * Attempt to debug * do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead * fix lambda expression * clang-format Enable thrust::identity test for non-MSVC (#3281) This seems to be an oversight when the test was added Co-authored-by: Michael Schellenberger Costa <[email protected]> Enable PDL in triple chevron launch (#3282) It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature. Disambiguate line continuations and macro continuations in <nv/target> (#3244) Drop VS 2017 from CI (#3287) Fixes: #3286 Drop ICC support in code (#3279) * Drop ICC from code Fixes: #3278 Co-authored-by: Michael Schellenberger Costa <[email protected]> Make CUB NVRTC commandline arguments come from a cmake template (#3292) Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295) Use process isolation instead of default hyper-v for Windows. (#3294) Try improving build times by using process isolation instead of hyper-v Co-authored-by: Michael Schellenberger Costa <[email protected]> [pre-commit.ci] pre-commit autoupdate (#3248) * [pre-commit.ci] pre-commit autoupdate updates: - [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6) - [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6) - [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1) Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop Thrust legacy arch macros (#3298) Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS Drop Thrust's compiler_fence.h (#3300) Drop CTK 11.x from CI (#3275) * Add cuda12.0-gcc7 devcontainer * Move MSVC2017 jobs to CTK 12.6 Those is the only combination where rapidsai has devcontainers * Add /Zc:__cplusplus for the libcudacxx tests * Only add excape hatch for affected CTKs * Workaround missing cudaLaunchKernelEx on MSVC cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSV_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK. * Workaround nvcc+MSVC issue * Regenerate devcontainers Fixes: #3249 Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop CUB's util_compiler.cuh (#3302) All contained macros were deprecated Update packman and repo_docs versions (#3293) Co-authored-by: Ashwin Srinath <[email protected]> Drop Thrust's deprecated compiler macros (#3301) Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305) Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506) * adds support for large number of items to three-way partition * adapts interface to use choose_signed_offset_t * integrates applicable feedback from device-select pr * changes behavior for empty problems * unifies grid constant macro * fixes kernel template specialization mismatch * integrates _CCCL_GRID_CONSTANT changes * resolve merge conflicts * fixes checks in test * fixes test verification * improves tests * makes few improvements to streaming dispatch * improves code comment on test * fixes unrelated compiler error * minor style improvements Refactor scan tunings (#3262) Require C++17 for compiling Thrust and CUB (#3255) * Issue an unsuppressable warning when compiling with < C++17 * Remove C++11/14 presets * Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers * Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14] * Remove CUB_ENABLE_DIALECT_CPP[11|14] * Update CI runs * Remove C++11/14 CI runs for CUB and Thrust * Raise compiler minimum versions for C++17 * Update ReadMe * Drop Thrust's cpp14_required.h * Add escape hatch for C++17 removal Fixes: #3252 Implement `views::empty` (#3254) * Disable pair conversion of subrange with clang in C++17 * Fix namespace views * Implement `views::empty` This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view Refactor `limits` and `climits` (#3221) * implement builtins for huge val, nan and nans * change `INFINITY` and `NAN` implementation for NVRTC cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311) * Add tests demonstrating usage of different iterators * Update documentation of reduce_into by merging import code snippet with the rest of the example * Add documentation for current iterators * Run pre-commit checks and update accordingly * Fix comments to refer to the proper lines in the code snippets in the docs Drop clang<14 from CI, update devcontainers. (#3309) Co-authored-by: Bernhard Manfred Gruber <[email protected]> [STF] Cleanup task dependencies object constructors (#3291) * Define tag types for access modes * - Rework how we build task_dep objects based on access mode tags - pack_state is now responsible for using a const_cast for read only data * Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums * It seems the const_cast was not necessarily so we can simplify it and not even do some dispatch based on access modes Disable test with a gcc-14 regression (#3297) Deprecate Thrust's cpp_compatibility.h macros (#3299) Remove dropped function objects from docs (#3319) Document `NV_TARGET` macros (#3313) [STF] Define ctx.pick_stream() which was missing for the unified context (#3326) * Define ctx.pick_stream() which was missing for the unified context * clang-format Deprecate cub::IterateThreadStore (#3337) Drop CUB's BinaryFlip operator (#3332) Deprecate cub::Swap (#3333) Clarify transform output can overlap input (#3323) Drop CUB APIs with a debug_synchronous parameter (#3330) Fixes: #3329 Drop CUB's util_compiler.cuh for real (#3340) PR #3302 planned to drop the file, but only dropped its content. This was an oversight. So let's drop the entire file. Drop cub::ValueCache (#3346) limits offset types for merge sort (#3328) Drop CDPv1 (#3344) Fixes: #3341 Drop thrust::void_t (#3362) Use cuda::std::addressof in Thrust (#3363) Fix all_of documentation for empty ranges (#3358) all_of always returns true on an empty range. [STF] Do not keep track of dangling events in a CUDA graph backend (#3327) * Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when the CUDA graph completes. Therefore keeping track of "dangling events" is a waste of time and resources. * replace can_ignore_dangling_events by track_dangling_events which leads to more readable code * When not storing the dangling events, we must still perform the deinit operations that were producing these events ! Extract scan kernels into NVRTC-compilable header (#3334) * Extract scan kernels into NVRTC-compilable header * Update cub/cub/device/dispatch/dispatch_scan.cuh Co-authored-by: Georgii Evtushenko <[email protected]> --------- Co-authored-by: Ashwin Srinath <[email protected]> Co-authored-by: Georgii Evtushenko <[email protected]> Drop deprecated aliases in Thrust functional (#3272) Fixes: #3271 Drop cub::DivideAndRoundUp (#3347) Use cuda::std::min/max in Thrust (#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (#2773) Deprecate thrust::null_type (#3367) Deprecate cub::DeviceSpmv (#3320) Fixes: #896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements Exit with error when RAPIDS CI fails. (#3385) cuda.parallel: Support structured types as algorithm inputs (#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation udpates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <[email protected]> Deprecate thrust::async (#3324) Fixes: #100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342) Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314) * add compiler-specific path * fix device code path * add _CCC_ASSUME Deprecate thrust::numeric_limits (#3366) Replace `typedef` with `using` in libcu++ (#3368) Deprecate thrust::optional (#3307) Fixes: #3306 Upgrade to Catch2 3.8 (#3310) Fixes: #1724 refactor `<cuda/std/cstdint>` (#3325) Co-authored-by: Bernhard Manfred Gruber <[email protected]> Update CODEOWNERS (#3331) * Update CODEOWNERS * Update CODEOWNERS * Update CODEOWNERS * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix sign-compare warning (#3408) Implement more cmath functions to be usable on host and device (#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <[email protected]> Fix assert definition for NVHPC due to constexpr issues (#3418) NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it. Fix this by always using the host definition which should also work on device. Fixes #3411 Extend CUB reduce benchmarks (#3401) * Rename max.cu to custom.cu, since it uses a custom operator * Extend types covered my min.cu to all fundamental types * Add some notes on how to collect tuning parameters Fixes: #3283 Update upload-pages-artifact to v3 (#3423) * Update upload-pages-artifact to v3 * Empty commit --------- Co-authored-by: Ashwin Srinath <[email protected]> Replace and deprecate thrust::cuda_cub::terminate (#3421) `std::linalg` accessors and `transposed_layout` (#2962) Add round up/down to multiple (#3234) [FEA]: Introduce Python module with CCCL headers (#3201) * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative * Run `copy_cccl_headers_to_aude_include()` before `setup()` * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path. * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel * Bug fix: cuda/_include only exists after shutil.copytree() ran. * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions) * Replace := operator (needs Python 3.8+) * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md * Restore original README.md: `pip3 install -e` now works on first pass. * cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178 * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996 * Install CCCL headers under cuda.cccl.include Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562 Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. * Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe. * Move numpy from [build-system] requires -> [project] dependencies * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH * Remove copy_license() and use license_files=["../../LICENSE"] instead. * Further modernize cuda_cccl/setup.py to use pathlib * Trivial simplifications in cuda_cccl/pyproject.toml * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml * Add taplo-pre-commit to .pre-commit-config.yaml * taplo-pre-commit auto-fixes * Use pathlib in cuda_cooperative/setup.py * CCCL_PYTHON_PATH in cuda_cooperative/setup.py * Modernize cuda_parallel/pyproject.toml, setup.py * Use pathlib in cuda_parallel/setup.py * Add `# TOML lint & format` comment. * Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml * Use pathlib in cuda/cccl/include_paths.py * pre-commit autoupdate (EXCEPT clang-format, which was manually restored) * Fixes after git merge main * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result' ``` =========================================================================== warnings summary =========================================================================== tests/test_reduce.py::test_reduce_non_contiguous /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080> Traceback (most recent call last): File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__ bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result)) ^^^^^^^^^^^^^^^^^ AttributeError: '_Reduce' object has no attribute 'build_result' warnings.warn(pytest.PytestUnraisableExceptionWarning(msg)) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ============================================================== ``` * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy` * Introduce cuda_cooperative/constraints.txt * Also add cuda_parallel/constraints.txt * Add `--constraint constraints.txt` in ci/test_python.sh * Update Copyright dates * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024) For completeness: The other repo took a long time to install into the pre-commit cache; so long it lead to timeouts in the CCCL CI. * Remove unused cuda_parallel jinja2 dependency (noticed by chance). * Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead. * Make cuda_cooperative, cuda_parallel testing completely independent. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]" This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2. Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971 * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Restore original ci/matrix.yaml [skip-rapids] * Use for loop in test_python.sh to avoid code duplication. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci] * Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d. * Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460) * Address feedback by @leofang --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> cuda.parallel: Add optional stream argument to reduce_into() (#3348) * Add optional stream argument to reduce_into() * Add tests to check for reduce_into() stream behavior * Move protocol related utils to separate file and rework __cuda_stream__ error messages * Fix synchronization issue in stream test and add one more invalid stream test case * Rename cuda stream validation function after removing leading underscore * Unpack values from __cuda_stream__ instead of indexing * Fix linting errors * Handle TypeError when unpacking invalid __cuda_stream__ return * Use stream to allocate cupy memory in new stream test Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434) Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419) * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ Fixes #3404 move to c++17, finalize device optimization fix msvc compilation, update tests Deprectate C++11 and C++14 for libcu++ (#3173) * Deprectate C++11 and C++14 for libcu++ Co-authored-by: Bernhard Manfred Gruber <[email protected]> Implement `abs` and `div` from `cstdlib` (#3153) * implement integer abs functions * improve tests, fix constexpr support * just use the our implementation * implement `cuda::std::div` * prefer host's `div_t` like types * provide `cuda::std::abs` overloads for floats * allow fp abs for NVRTC * silence msvc's warning about conversion from floating point to integral Fix missing radix sort policies (#3174) Fixes NVBug 5009941 Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148) * introduces new arg{min,max} interface with two output iterators * adds fp inf tests * fixes docs * improves code example * fixes exec space specifier * trying to fix deprecation warning for more compilers * inlines unzip operator * trying to fix deprecation warning for nvhpc * integrates supression fixes in diagnostics * pre-ctk 11.5 deprecation suppression * fixes icc * fix for pre-ctk11.5 * cleans up deprecation suppression * cleanup Extend tuning documentation (#3179) Add codespell pre-commit hook, fix typos in CCCL (#3168) * Add codespell pre-commit hook * Automatic changes from codespell. * Manual changes. Fix parameter space for TUNE_LOAD in scan benchmark (#3176) fix various old compiler checks (#3178) implement C++26 `std::projected` (#3175) Fix pre-commit config for codespell and remaining typos (#3182) Massive cleanup of our config (#3155) Fix UB in atomics with automatic storage (#2586) * Adds specialized local cuda atomics and injects them into most atomics paths. Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> * Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478 * Remove extraneous double brackets in unformatted code. * Merge unsafe atomic logic into `__cuda_is_local`. * Use `const_cast` for type conversions in cuda_local.h * Fix build issues from interface changes * Fix missing __nanosleep on sm70- * Guard __isLocal from NVHPC * Use PTX instead of running nothing from NVHPC * fixup /s/nvrtc/nvhpc * Fixup missing CUDA ifdef surrounding device code * Fix codegen * Bypass some sort of compiler bug on GCC7 * Apply suggestions from code review * Use unsafe automatic storage atomics in codegen tests --------- Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> Co-authored-by: Michael Schellenberger Costa <[email protected]> Refactor the source code layout for `cuda.parallel` (#3177) * Refactor the source layout for cuda.parallel * Add copyright * Address review feedback * Don't import anything into `experimental` namespace * fix import --------- Co-authored-by: Ashwin Srinath <[email protected]> new type-erased memory resources (#2824) s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186) Document address stability of `thrust::transform` (#3181) * Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS * Reformat and fix UnaryFunction/BinaryFunction in transform docs * Mention transform can use proclaim_copyable_arguments * Document cuda::proclaims_copyable_arguments better * Deprecate depending on transform functor argument addresses Fixes: #3053 turn off cuda version check for clangd (#3194) [STF] jacobi example based on parallel_for (#3187) * Simple jacobi example with parallel for and reductions * clang-format * remove useless capture list fixes pre-nv_diag suppression issues (#3189) Prefer c2h::type_name over c2h::demangle (#3195) Fix memcpy_async* tests (#3197) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test Add type annotations and mypy checks for `cuda.parallel` (#3180) * Refactor the source layout for cuda.parallel * Add initial type annotations * Update pre-commit config * More typing * Fix bad merge * Fix TYPE_CHECKING and numpy annotations * typing bindings.py correctly * Address review feedback --------- Co-authored-by: Ashwin Srinath <[email protected]> Fix rendering of cuda.parallel docs (#3192) * Fix pre-commit config for codespell and remaining typos * Fix rendering of docs for cuda.parallel --------- Co-authored-by: Ashwin Srinath <[email protected]> Enable PDL for DeviceMergeSortBlockSortKernel (#3199) The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC. This commit enables PDL when launching the kernel. Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647) * adds benchmarks for reduce::arg{min,max} * preliminary streaming arg-extremum reduction * fixes implicit conversion * uses streaming dispatch class * changes arg benches to use new streaming reduce * streaming arg-extrema reduction * fixes style * fixes compilation failures * cleanups * adds rst style comments * declare vars const and use clamp * consolidates argmin argmax benchmarks * fixes thrust usage * drops offset type in arg-extrema benchmarks * fixes clang cuda * exec space macros * switch to signed global offset type for slightly better perf * clarifies documentation * applies minor benchmark style changes from review comments * fixes interface documentation and comments * list-init accumulating output op * improves style, comments, and tests * cleans up aggregate init * renames dispatch class usage in benchmarks * fixes merge conflicts * addresses review comments * addresses review comments * fixes assertion * removes superseded implementation * changes large problem tests to use new interface * removes obsolete tests for deprecated interface Fixes for Python 3.7 docs environment (#3206) Co-authored-by: Ashwin Srinath <[email protected]> Adds support for large number of items to `DeviceTransform` (#3172) * moves large problem test helper to common file * adds support for large num items to device transform * adds tests for large number of items to device interface * fixes format * addresses review comments cp_async_bulk: Fix test (#3198) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test * cp_async_bulk: Fix test The global memory pointer could be misaligned. cudax fixes for msvc 14.41 (#3200) avoid instantiating class templates in `is_same` implementation when possible (#3203) Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209) * Fix: make launchers a CUB detail; make kernel source functions hidden. * [pre-commit.ci] auto code formatting * Address review comments, fix which macro gets fixed. help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202) unify macros and cmake options that control the suppression of deprecation warnings (#3220) * unify macros and cmake options that control the suppression of deprecation warnings * suppress nvcc warning #186 in thrust header tests * suppress c++ dialect deprecation warnings in libcudacxx header tests Fx thread-reduce performance regression (#3225) cuda.parallel: In-memory caching of build objects (#3216) * Define __eq__ and __hash__ for Iterators * Define cache_with_key utility and use it to cache Reduce objects * Add tests for caching Reduce objects * Tighten up types * Updates to support 3.7 * Address review feedback * Introduce IteratorKind to hold iterator type information * Use the .kind to generate an abi_name * Remove __eq__ and __hash__ methods from IteratorBase * Move helper function * Formatting * Don't unpack tuple in cache key --------- Co-authored-by: Ashwin Srinath <[email protected]> Just enough ranges for c++14 `span` (#3211) use generalized concepts portability macros to simplify the `range` concept (#3217) fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR` Use Ruff to sort imports (#3230) * Update pyproject.tomls for import sorting * Update files after running pre-commit * Move ruff config to pyproject.toml --------- Co-authored-by: Ashwin Srinath <[email protected]> fix tuning_scan sm90 config issue (#3236) Co-authored-by: Shijie Chen <[email protected]> [STF] Logical token (#3196) * Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs. * Add missing files * Check if a task implementation can match a prototype where the void_interface arguments are ignored * Implement ctx.abstract_logical_data() which relies on a void data interface * Illustrate how to use abstract handles in local contexts * Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages * Small improvements in the examples * Do not try to allocate or move void data * Do not use I as a variable * fix linkage error * rename abtract_logical_data into logical_token * Document logical token * fix spelling error * fix sphinx error * reflect name changes * use meaningful variable names * simplify logical_token implementation because writeback is already disabled * add a unit test for token elision * implement token elision in host_launch * Remove unused type * Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens * Much simpler is_tuple_invocable_with_filtered implementation * Fix buggy test * Factorize code * Document that we can ignore tokens for task and host_launch * Documentation for logical data freeze Fix ReduceByKey tuning (#3240) Fix RLE tuning (#3239) cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233) * Forbid non-contiguous arrays as inputs (or outputs) * Implement a more robust way to check for contiguity * Don't bother if cublas unavailable * Fix how we check for zero-element arrays * sort imports --------- Co-authored-by: Ashwin Srinath <[email protected]> expands support for more offset types in segmented benchmark (#3231) Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253) * Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects * Do not add option twice ptx: Add add_instruction.py (#3190) This file helps create the necessary structure for new PTX instructions. Co-authored-by: Allard Hendriksen <[email protected]> Bump main to 2.9.0. (#3247) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop cub::Mutex (#3251) Fixes: #3250 Remove legacy macros from CUB util_arch.cuh (#3257) Fixes: #3256 Remove thrust::[unary|binary]_traits (#3260) Fixes: #3259 Architecture and OS identification macros (#3237) Bump main to 3.0.0. (#3265) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop thrust not1 and not2 (#3264) Fixes: #3263 CCCL Internal macro documentation (#3238) Deprecate GridBarrier and GridBarrierLifetime (#3258) Fixes: #1389 Require at least gcc7 (#3268) Fixes: #3267 Drop thrust::[unary|binary]_function (#3274) Fixes: #3273 Drop ICC from CI (#3277) [STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270) * Add a test to reproduce a bug observed with parallel_for on a host place * clang-format * use _CCCL_ASSERT * Attempt to debug * do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead * fix lambda expression * clang-format Enable thrust::identity test for non-MSVC (#3281) This seems to be an oversight when the test was added Co-authored-by: Michael Schellenberger Costa <[email protected]> Enable PDL in triple chevron launch (#3282) It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature. Disambiguate line continuations and macro continuations in <nv/target> (#3244) Drop VS 2017 from CI (#3287) Fixes: #3286 Drop ICC support in code (#3279) * Drop ICC from code Fixes: #3278 Co-authored-by: Michael Schellenberger Costa <[email protected]> Make CUB NVRTC commandline arguments come from a cmake template (#3292) Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295) Use process isolation instead of default hyper-v for Windows. (#3294) Try improving build times by using process isolation instead of hyper-v Co-authored-by: Michael Schellenberger Costa <[email protected]> [pre-commit.ci] pre-commit autoupdate (#3248) * [pre-commit.ci] pre-commit autoupdate updates: - [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6) - [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6) - [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1) Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop Thrust legacy arch macros (#3298) Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS Drop Thrust's compiler_fence.h (#3300) Drop CTK 11.x from CI (#3275) * Add cuda12.0-gcc7 devcontainer * Move MSVC2017 jobs to CTK 12.6 Those is the only combination where rapidsai has devcontainers * Add /Zc:__cplusplus for the libcudacxx tests * Only add excape hatch for affected CTKs * Workaround missing cudaLaunchKernelEx on MSVC cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSV_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK. * Workaround nvcc+MSVC issue * Regenerate devcontainers Fixes: #3249 Co-authored-by: Michael Schellenberger Costa <[email protected]> Update packman and repo_docs versions (#3293) Co-authored-by: Ashwin Srinath <[email protected]> Drop Thrust's deprecated compiler macros (#3301) Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305) Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506) * adds support for large number of items to three-way partition * adapts interface to use choose_signed_offset_t * integrates applicable feedback from device-select pr * changes behavior for empty problems * unifies grid constant macro * fixes kernel template specialization mismatch * integrates _CCCL_GRID_CONSTANT changes * resolve merge conflicts * fixes checks in test * fixes test verification * improves tests * makes few improvements to streaming dispatch * improves code comment on test * fixes unrelated compiler error * minor style improvements Refactor scan tunings (#3262) Require C++17 for compiling Thrust and CUB (#3255) * Issue an unsuppressable warning when compiling with < C++17 * Remove C++11/14 presets * Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers * Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14] * Remove CUB_ENABLE_DIALECT_CPP[11|14] * Update CI runs * Remove C++11/14 CI runs for CUB and Thrust * Raise compiler minimum versions for C++17 * Update ReadMe * Drop Thrust's cpp14_required.h * Add escape hatch for C++17 removal Fixes: #3252 Implement `views::empty` (#3254) * Disable pair conversion of subrange with clang in C++17 * Fix namespace views * Implement `views::empty` This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view Refactor `limits` and `climits` (#3221) * implement builtins for huge val, nan and nans * change `INFINITY` and `NAN` implementation for NVRTC cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311) * Add tests demonstrating usage of different iterators * Update documentation of reduce_into by merging import code snippet with the rest of the example * Add documentation for current iterators * Run pre-commit checks and update accordingly * Fix comments to refer to the proper lines in the code snippets in the docs Drop clang<14 from CI, update devcontainers. (#3309) Co-authored-by: Bernhard Manfred Gruber <[email protected]> [STF] Cleanup task dependencies object constructors (#3291) * Define tag types for access modes * - Rework how we build task_dep objects based on access mode tags - pack_state is now responsible for using a const_cast for read only data * Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums * It seems the const_cast was not necessarily so we can simplify it and not even do some dispatch based on access modes Disable test with a gcc-14 regression (#3297) Deprecate Thrust's cpp_compatibility.h macros (#3299) Remove dropped function objects from docs (#3319) Document `NV_TARGET` macros (#3313) [STF] Define ctx.pick_stream() which was missing for the unified context (#3326) * Define ctx.pick_stream() which was missing for the unified context * clang-format Deprecate cub::IterateThreadStore (#3337) Drop CUB's BinaryFlip operator (#3332) Deprecate cub::Swap (#3333) Clarify transform output can overlap input (#3323) Drop CUB APIs with a debug_synchronous parameter (#3330) Fixes: #3329 Drop CUB's util_compiler.cuh for real (#3340) PR #3302 planned to drop the file, but only dropped its content. This was an oversight. So let's drop the entire file. Drop cub::ValueCache (#3346) limits offset types for merge sort (#3328) Drop CDPv1 (#3344) Fixes: #3341 Drop thrust::void_t (#3362) Use cuda::std::addressof in Thrust (#3363) Fix all_of documentation for empty ranges (#3358) all_of always returns true on an empty range. [STF] Do not keep track of dangling events in a CUDA graph backend (#3327) * Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when the CUDA graph completes. Therefore keeping track of "dangling events" is a waste of time and resources. * replace can_ignore_dangling_events by track_dangling_events which leads to more readable code * When not storing the dangling events, we must still perform the deinit operations that were producing these events ! Extract scan kernels into NVRTC-compilable header (#3334) * Extract scan kernels into NVRTC-compilable header * Update cub/cub/device/dispatch/dispatch_scan.cuh Co-authored-by: Georgii Evtushenko <[email protected]> --------- Co-authored-by: Ashwin Srinath <[email protected]> Co-authored-by: Georgii Evtushenko <[email protected]> Drop deprecated aliases in Thrust functional (#3272) Fixes: #3271 Drop cub::DivideAndRoundUp (#3347) Use cuda::std::min/max in Thrust (#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (#2773) Deprecate thrust::null_type (#3367) Deprecate cub::DeviceSpmv (#3320) Fixes: #896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements Exit with error when RAPIDS CI fails. (#3385) cuda.parallel: Support structured types as algorithm inputs (#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation udpates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <[email protected]> Deprecate thrust::async (#3324) Fixes: #100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342) Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314) * add compiler-specific path * fix device code path * add _CCC_ASSUME Deprecate thrust::numeric_limits (#3366) Replace `typedef` with `using` in libcu++ (#3368) Deprecate thrust::optional (#3307) Fixes: #3306 Upgrade to Catch2 3.8 (#3310) Fixes: #1724 refactor `<cuda/std/cstdint>` (#3325) Co-authored-by: Bernhard Manfred Gruber <[email protected]> Update CODEOWNERS (#3331) * Update CODEOWNERS * Update CODEOWNERS * Update CODEOWNERS * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix sign-compare warning (#3408) Implement more cmath functions to be usable on host and device (#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <[email protected]> Fix assert definition for NVHPC due to constexpr issues (#3418) NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it. Fix this by always using the host definition which should also work on device. Fixes #3411 Extend CUB reduce benchmarks (#3401) * Rename max.cu to custom.cu, since it uses a custom operator * Extend types covered my min.cu to all fundamental types * Add some notes on how to collect tuning parameters Fixes: #3283 Update upload-pages-artifact to v3 (#3423) * Update upload-pages-artifact to v3 * Empty commit --------- Co-authored-by: Ashwin Srinath <[email protected]> Replace and deprecate thrust::cuda_cub::terminate (#3421) `std::linalg` accessors and `transposed_layout` (#2962) Add round up/down to multiple (#3234) [FEA]: Introduce Python module with CCCL headers (#3201) * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative * Run `copy_cccl_headers_to_aude_include()` before `setup()` * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path. * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel * Bug fix: cuda/_include only exists after shutil.copytree() ran. * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions) * Replace := operator (needs Python 3.8+) * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md * Restore original README.md: `pip3 install -e` now works on first pass. * cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178 * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996 * Install CCCL headers under cuda.cccl.include Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562 Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. * Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe. * Move numpy from [build-system] requires -> [project] dependencies * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH * Remove copy_license() and use license_files=["../../LICENSE"] instead. * Further modernize cuda_cccl/setup.py to use pathlib * Trivial simplifications in cuda_cccl/pyproject.toml * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml * Add taplo-pre-commit to .pre-commit-config.yaml * taplo-pre-commit auto-fixes * Use pathlib in cuda_cooperative/setup.py * CCCL_PYTHON_PATH in cuda_cooperative/setup.py * Modernize cuda_parallel/pyproject.toml, setup.py * Use pathlib in cuda_parallel/setup.py * Add `# TOML lint & format` comment. * Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml * Use pathlib in cuda/cccl/include_paths.py * pre-commit autoupdate (EXCEPT clang-format, which was manually restored) * Fixes after git merge main * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result' ``` =========================================================================== warnings summary =========================================================================== tests/test_reduce.py::test_reduce_non_contiguous /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080> Traceback (most recent call last): File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__ bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result)) ^^^^^^^^^^^^^^^^^ AttributeError: '_Reduce' object has no attribute 'build_result' warnings.warn(pytest.PytestUnraisableExceptionWarning(msg)) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ============================================================== ``` * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy` * Introduce cuda_cooperative/constraints.txt * Also add cuda_parallel/constraints.txt * Add `--constraint constraints.txt` in ci/test_python.sh * Update Copyright dates * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024) For completeness: The other repo took a long time to install into the pre-commit cache; so long it lead to timeouts in the CCCL CI. * Remove unused cuda_parallel jinja2 dependency (noticed by chance). * Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead. * Make cuda_cooperative, cuda_parallel testing completely independent. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]" This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2. Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971 * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Restore original ci/matrix.yaml [skip-rapids] * Use for loop in test_python.sh to avoid code duplication. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci] * Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d. * Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460) * Address feedback by @leofang --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> cuda.parallel: Add optional stream argument to reduce_into() (#3348) * Add optional stream argument to reduce_into() * Add tests to check for reduce_into() stream behavior * Move protocol related utils to separate file and rework __cuda_stream__ error messages * Fix synchronization issue in stream test and add one more invalid stream test case * Rename cuda stream validation function after removing leading underscore * Unpack values from __cuda_stream__ instead of indexing * Fix linting errors * Handle TypeError when unpacking invalid __cuda_stream__ return * Use stream to allocate cupy memory in new stream test Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434) Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419) * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ Fixes #3404 Fix CI issues (#3443) update docs fix review restrict allowed types replace constexpr implementations with generic optimize `__is_arithmetic_integral`

@shwina

update docs update docs add `memcmp`, `memmove` and `memchr` implementations implement tests Use cuda::std::min/max in Thrust (NVIDIA#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (NVIDIA#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (NVIDIA#2773) Deprecate thrust::null_type (NVIDIA#3367) Deprecate cub::DeviceSpmv (NVIDIA#3320) Fixes: NVIDIA#896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (NVIDIA#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements Exit with error when RAPIDS CI fails. (NVIDIA#3385) cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation udpates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <[email protected]> Deprecate thrust::async (NVIDIA#3324) Fixes: NVIDIA#100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342) Fix broken `_CCCL_BUILTIN_ASSUME` macro (NVIDIA#3314) * add compiler-specific path * fix device code path * add _CCC_ASSUME Deprecate thrust::numeric_limits (NVIDIA#3366) Replace `typedef` with `using` in libcu++ (NVIDIA#3368) Deprecate thrust::optional (NVIDIA#3307) Fixes: NVIDIA#3306 Upgrade to Catch2 3.8 (NVIDIA#3310) Fixes: NVIDIA#1724 refactor `<cuda/std/cstdint>` (NVIDIA#3325) Co-authored-by: Bernhard Manfred Gruber <[email protected]> Update CODEOWNERS (NVIDIA#3331) * Update CODEOWNERS * Update CODEOWNERS * Update CODEOWNERS * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix sign-compare warning (NVIDIA#3408) Implement more cmath functions to be usable on host and device (NVIDIA#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (NVIDIA#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <[email protected]> Fix assert definition for NVHPC due to constexpr issues (NVIDIA#3418) NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it. Fix this by always using the host definition which should also work on device. Fixes NVIDIA#3411 Extend CUB reduce benchmarks (NVIDIA#3401) * Rename max.cu to custom.cu, since it uses a custom operator * Extend types covered my min.cu to all fundamental types * Add some notes on how to collect tuning parameters Fixes: NVIDIA#3283 Update upload-pages-artifact to v3 (NVIDIA#3423) * Update upload-pages-artifact to v3 * Empty commit --------- Co-authored-by: Ashwin Srinath <[email protected]> Replace and deprecate thrust::cuda_cub::terminate (NVIDIA#3421) `std::linalg` accessors and `transposed_layout` (NVIDIA#2962) Add round up/down to multiple (NVIDIA#3234) [FEA]: Introduce Python module with CCCL headers (NVIDIA#3201) * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative * Run `copy_cccl_headers_to_aude_include()` before `setup()` * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path. * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel * Bug fix: cuda/_include only exists after shutil.copytree() ran. * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions) * Replace := operator (needs Python 3.8+) * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md * Restore original README.md: `pip3 install -e` now works on first pass. * cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment)) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * NVIDIA#3201 (comment) * NVIDIA#3201 (comment) * Install CCCL headers under cuda.cccl.include Trigger for this change: * NVIDIA#3201 (comment) Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. * Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d6. * Move numpy from [build-system] requires -> [project] dependencies * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH * Remove copy_license() and use license_files=["../../LICENSE"] instead. * Further modernize cuda_cccl/setup.py to use pathlib * Trivial simplifications in cuda_cccl/pyproject.toml * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml * Add taplo-pre-commit to .pre-commit-config.yaml * taplo-pre-commit auto-fixes * Use pathlib in cuda_cooperative/setup.py * CCCL_PYTHON_PATH in cuda_cooperative/setup.py * Modernize cuda_parallel/pyproject.toml, setup.py * Use pathlib in cuda_parallel/setup.py * Add `# TOML lint & format` comment. * Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml * Use pathlib in cuda/cccl/include_paths.py * pre-commit autoupdate (EXCEPT clang-format, which was manually restored) * Fixes after git merge main * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result' ``` =========================================================================== warnings summary =========================================================================== tests/test_reduce.py::test_reduce_non_contiguous /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080> Traceback (most recent call last): File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__ bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result)) ^^^^^^^^^^^^^^^^^ AttributeError: '_Reduce' object has no attribute 'build_result' warnings.warn(pytest.PytestUnraisableExceptionWarning(msg)) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ============================================================== ``` * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy` * Introduce cuda_cooperative/constraints.txt * Also add cuda_parallel/constraints.txt * Add `--constraint constraints.txt` in ci/test_python.sh * Update Copyright dates * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024) For completeness: The other repo took a long time to install into the pre-commit cache; so long it lead to timeouts in the CCCL CI. * Remove unused cuda_parallel jinja2 dependency (noticed by chance). * Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead. * Make cuda_cooperative, cuda_parallel testing completely independent. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]" This reverts commit ea33a21. Error message: NVIDIA#3201 (comment) * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Restore original ci/matrix.yaml [skip-rapids] * Use for loop in test_python.sh to avoid code duplication. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci] * Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd. * Implement suggestion by @shwina (NVIDIA#3201 (review)) * Address feedback by @leofang --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348) * Add optional stream argument to reduce_into() * Add tests to check for reduce_into() stream behavior * Move protocol related utils to separate file and rework __cuda_stream__ error messages * Fix synchronization issue in stream test and add one more invalid stream test case * Rename cuda stream validation function after removing leading underscore * Unpack values from __cuda_stream__ instead of indexing * Fix linting errors * Handle TypeError when unpacking invalid __cuda_stream__ return * Use stream to allocate cupy memory in new stream test Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (NVIDIA#3434) Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419) * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ Fixes NVIDIA#3404 Fix CI issues (NVIDIA#3443) Remove deprecated `cub::min` (NVIDIA#3450) * Remove deprecated `cuda::{min,max}` * Drop unused `thrust::remove_cvref` file Fix typo in builtin (NVIDIA#3451) Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435) uses unsigned offset types in thrust's scan dispatch (NVIDIA#3436) Default transform_iterator's copy ctor (NVIDIA#3395) Fixes: NVIDIA#2393 Turn C++ dialect warning into error (NVIDIA#3453) Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` (NVIDIA#3437) * uses thrust's dynamic dispatch for merge_sort * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Refactor allocator handling of contiguous_storage (NVIDIA#3050) Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop thrust::detail::integer_traits (NVIDIA#3391) Add cuda::is_floating_point supporting half and bfloat (NVIDIA#3379) Co-authored-by: Michael Schellenberger Costa <[email protected]> Improve docs of std headers (NVIDIA#3416) Drop C++11 and C++14 support for all of cccl (NVIDIA#3417) * Drop C++11 and C++14 support for all of cccl --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> Deprecate a few CUB macros (NVIDIA#3456) Deprecate thrust universal iterator categories (NVIDIA#3461) Fix launch args order (NVIDIA#3465) Add `--extended-lambda` to the list of removed clangd flags (NVIDIA#3432) add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429) Add `_CCCL_BUILTIN_PREFETCH` (NVIDIA#3433) Drop universal iterator categories (NVIDIA#3474) Ensure that headers in `<cuda/*>` can be build with a C++ only compiler (NVIDIA#3472) Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470) Also ensure that we actually can enable FP8 due to FP16 and BF16 requirements Co-authored-by: Michael Schellenberger Costa <[email protected]> Moves CUB kernel entry points to a detail namespace (NVIDIA#3468) * moves emptykernel to detail ns * second batch * third batch * fourth batch * fixes cuda parallel * concatenates nested namespaces Deprecate block/warp algo specializations (NVIDIA#3455) Fixes: NVIDIA#3409 Refactor CUB's util_debug (NVIDIA#3345)

Cleanup util_arch (NVIDIA#2773) Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation udpates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <[email protected]> Deprecate thrust::async (NVIDIA#3324) Fixes: NVIDIA#100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342) Deprecate thrust::numeric_limits (NVIDIA#3366) Upgrade to Catch2 3.8 (NVIDIA#3310) Fixes: NVIDIA#1724 Fix sign-compare warning (NVIDIA#3408) Implement more cmath functions to be usable on host and device (NVIDIA#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (NVIDIA#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <[email protected]> cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348) * Add optional stream argument to reduce_into() * Add tests to check for reduce_into() stream behavior * Move protocol related utils to separate file and rework __cuda_stream__ error messages * Fix synchronization issue in stream test and add one more invalid stream test case * Rename cuda stream validation function after removing leading underscore * Unpack values from __cuda_stream__ instead of indexing * Fix linting errors * Handle TypeError when unpacking invalid __cuda_stream__ return * Use stream to allocate cupy memory in new stream test Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419) * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ Fixes NVIDIA#3404 Remove deprecated `cub::min` (NVIDIA#3450) * Remove deprecated `cuda::{min,max}` * Drop unused `thrust::remove_cvref` file Fix typo in builtin (NVIDIA#3451) Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435) Drop thrust::detail::integer_traits (NVIDIA#3391) Add cuda::is_floating_point supporting half and bfloat (NVIDIA#3379) Co-authored-by: Michael Schellenberger Costa <[email protected]> add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429) Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470) Also ensure that we actually can enable FP8 due to FP16 and BF16 requirements Co-authored-by: Michael Schellenberger Costa <[email protected]> Moves CUB kernel entry points to a detail namespace (NVIDIA#3468) * moves emptykernel to detail ns * second batch * third batch * fourth batch * fixes cuda parallel * concatenates nested namespaces Deprecate block/warp algo specializations (NVIDIA#3455) Fixes: NVIDIA#3409 fix documentation

Cleanup util_arch

29bda1b

bernhardmgruber force-pushed the util_arch branch from 8d61de9 to 29bda1b Compare January 13, 2025 14:25

bernhardmgruber marked this pull request as ready for review January 13, 2025 14:25

bernhardmgruber requested a review from a team as a code owner January 13, 2025 14:25

bernhardmgruber requested review from gevtushenko and alliepiper January 13, 2025 14:25

bernhardmgruber added the cub For all items related to CUB label Jan 13, 2025

fbusato approved these changes Jan 13, 2025

View reviewed changes

bernhardmgruber merged commit daab0a4 into NVIDIA:main Jan 14, 2025
97 checks passed

bernhardmgruber deleted the util_arch branch January 14, 2025 08:39

dkolsen-pgi mentioned this pull request Jan 15, 2025

[BUG]: BLOCK_THREADS won't compile with NVC++ due to non-constexpr assertion macro #3411

Closed

1 task

shwina pushed a commit to shwina/cccl that referenced this pull request Jan 16, 2025

Cleanup util_arch (NVIDIA#2773)

abda850

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025

Cleanup util_arch (NVIDIA#2773)

2e010b1

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025

Cleanup util_arch (NVIDIA#2773)

3fe2685

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025

Cleanup util_arch (NVIDIA#2773)

ec2cd74

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 23, 2025

Cleanup util_arch (NVIDIA#2773)

2ce3beb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cleanup CUB util_arch #2773

Cleanup CUB util_arch #2773

bernhardmgruber commented Nov 12, 2024 •

edited

Loading

copy-pr-bot bot commented Jan 13, 2025

github-actions bot commented Jan 13, 2025

🟩 cub: Pass: 100%/38 | Total: 1d 09h | Avg: 52m 51s | Max: 1h 13m | Hits: 33%/3108

🟩 thrust: Pass: 100%/37 | Total: 20h 03m | Avg: 32m 32s | Max: 1h 09m | Hits: 186%/9260

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 13m 04s | Avg: 6m 32s | Max: 10m 57s

🟩 python: Pass: 100%/1 | Total: 26m 25s | Avg: 26m 25s | Max: 26m 25s

👃 Inspect Changes

Modifications in project?

Modifications in project or dependencies?

🏃‍ Runner counts (total jobs: 78)

fbusato Jan 13, 2025

bernhardmgruber Jan 14, 2025

Cleanup CUB util_arch #2773

Cleanup CUB util_arch #2773

Conversation

bernhardmgruber commented Nov 12, 2024 • edited Loading

copy-pr-bot bot commented Jan 13, 2025

github-actions bot commented Jan 13, 2025

🟩 cub: Pass: 100%/38 | Total: 1d 09h | Avg: 52m 51s | Max: 1h 13m | Hits: 33%/3108

🟩 thrust: Pass: 100%/37 | Total: 20h 03m | Avg: 32m 32s | Max: 1h 09m | Hits: 186%/9260

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 13m 04s | Avg: 6m 32s | Max: 10m 57s

🟩 python: Pass: 100%/1 | Total: 26m 25s | Avg: 26m 25s | Max: 26m 25s

👃 Inspect Changes

Modifications in project?

Modifications in project or dependencies?

🏃‍ Runner counts (total jobs: 78)

fbusato Jan 13, 2025

Choose a reason for hiding this comment

bernhardmgruber Jan 14, 2025

Choose a reason for hiding this comment

bernhardmgruber commented Nov 12, 2024 •

edited

Loading