Refactor CUB's util_debug #3345

bernhardmgruber · 2025-01-10T21:26:31Z

No description provided.

gonidelis · 2025-01-21T21:34:16Z

Got a related issue or an explanation for that change? I am out of the loop.

github-actions · 2025-01-21T21:40:16Z

🟩 CI finished in 3h 27m: Pass: 100%/78 | Total: 2d 10h | Avg: 45m 14s | Max: 1h 46m | Hits: 157%/12708

🟩 cub: Pass: 100%/38 | Total: 1d 13h | Avg: 58m 45s | Max: 1h 46m | Hits: 86%/3528

🟩 cpu
  🟩 amd64              Pass: 100%/36  | Total:  1d 11h | Avg: 58m 40s | Max:  1h 46m | Hits:  86%/3528  
  🟩 arm64              Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 05m
🟩 ctk
  🟩 12.0               Pass: 100%/5   | Total:  4h 44m | Avg: 56m 51s | Max:  1h 02m | Hits:  87%/882   
  🟩 12.5               Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 08m
  🟩 12.6               Pass: 100%/31  | Total:  1d 06h | Avg: 58m 32s | Max:  1h 46m | Hits:  86%/2646  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 02m
  🟩 nvcc12.0           Pass: 100%/5   | Total:  4h 44m | Avg: 56m 51s | Max:  1h 02m | Hits:  87%/882   
  🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 08m
  🟩 nvcc12.6           Pass: 100%/29  | Total:  1d 04h | Avg: 58m 16s | Max:  1h 46m | Hits:  86%/2646  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 02m
  🟩 nvcc               Pass: 100%/36  | Total:  1d 11h | Avg: 58m 34s | Max:  1h 46m | Hits:  86%/3528  
🟩 cxx
  🟩 Clang14            Pass: 100%/4   | Total:  3h 40m | Avg: 55m 05s | Max: 58m 50s
  🟩 Clang15            Pass: 100%/1   | Total: 52m 24s | Avg: 52m 24s | Max: 52m 24s
  🟩 Clang16            Pass: 100%/1   | Total: 52m 05s | Avg: 52m 05s | Max: 52m 05s
  🟩 Clang17            Pass: 100%/1   | Total: 56m 40s | Avg: 56m 40s | Max: 56m 40s
  🟩 Clang18            Pass: 100%/7   | Total:  5h 37m | Avg: 48m 16s | Max:  1h 02m
  🟩 GCC7               Pass: 100%/2   | Total:  1h 55m | Avg: 57m 44s | Max: 57m 50s
  🟩 GCC8               Pass: 100%/1   | Total: 55m 12s | Avg: 55m 12s | Max: 55m 12s
  🟩 GCC9               Pass: 100%/2   | Total:  1h 56m | Avg: 58m 13s | Max:  1h 01m
  🟩 GCC10              Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m
  🟩 GCC11              Pass: 100%/1   | Total: 57m 22s | Avg: 57m 22s | Max: 57m 22s
  🟩 GCC12              Pass: 100%/3   | Total:  1h 44m | Avg: 34m 53s | Max: 57m 45s
  🟩 GCC13              Pass: 100%/8   | Total:  9h 56m | Avg:  1h 14m | Max:  1h 46m
  🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 10m | Hits:  87%/1764  
  🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 10m | Hits:  86%/1764  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 08m
🟩 cxx_family
  🟩 Clang              Pass: 100%/14  | Total: 11h 59m | Avg: 51m 23s | Max:  1h 02m
  🟩 GCC                Pass: 100%/18  | Total: 18h 26m | Avg:  1h 01m | Max:  1h 46m
  🟩 MSVC               Pass: 100%/4   | Total:  4h 33m | Avg:  1h 08m | Max:  1h 10m | Hits:  86%/3528  
  🟩 NVHPC              Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 08m
🟩 gpu
  🟩 h100               Pass: 100%/2   | Total: 46m 55s | Avg: 23m 27s | Max: 26m 25s
  🟩 v100               Pass: 100%/36  | Total:  1d 12h | Avg:  1h 00m | Max:  1h 46m | Hits:  86%/3528  
🟩 jobs
  🟩 Build              Pass: 100%/31  | Total:  1d 05h | Avg: 57m 00s | Max:  1h 10m | Hits:  86%/3528  
  🟩 DeviceLaunch       Pass: 100%/1   | Total:  1h 46m | Avg:  1h 46m | Max:  1h 46m
  🟩 GraphCapture       Pass: 100%/1   | Total:  1h 35m | Avg:  1h 35m | Max:  1h 35m
  🟩 HostLaunch         Pass: 100%/3   | Total:  2h 14m | Avg: 44m 56s | Max:  1h 25m
  🟩 TestGPU            Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 46m
🟩 sm
  🟩 90                 Pass: 100%/2   | Total: 46m 55s | Avg: 23m 27s | Max: 26m 25s
  🟩 90a                Pass: 100%/1   | Total: 25m 33s | Avg: 25m 33s | Max: 25m 33s
🟩 std
  🟩 17                 Pass: 100%/14  | Total: 13h 54m | Avg: 59m 37s | Max:  1h 10m | Hits:  87%/2646  
  🟩 20                 Pass: 100%/24  | Total: 23h 18m | Avg: 58m 16s | Max:  1h 46m | Hits:  84%/882

🟩 thrust: Pass: 100%/37 | Total: 20h 39m | Avg: 33m 29s | Max: 1h 08m | Hits: 185%/9180

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 37m 38s | Avg: 18m 49s | Max: 25m 04s
🟩 cpu
  🟩 amd64              Pass: 100%/35  | Total: 19h 37m | Avg: 33m 37s | Max:  1h 08m | Hits: 185%/9180  
  🟩 arm64              Pass: 100%/2   | Total:  1h 02m | Avg: 31m 10s | Max: 31m 57s
🟩 ctk
  🟩 12.0               Pass: 100%/5   | Total:  3h 06m | Avg: 37m 20s | Max: 55m 44s | Hits: 139%/1836  
  🟩 12.5               Pass: 100%/2   | Total:  1h 53m | Avg: 56m 51s | Max: 59m 52s
  🟩 12.6               Pass: 100%/30  | Total: 15h 39m | Avg: 31m 18s | Max:  1h 08m | Hits: 196%/7344  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total: 56m 00s | Avg: 28m 00s | Max: 28m 13s
  🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 06m | Avg: 37m 20s | Max: 55m 44s | Hits: 139%/1836  
  🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 53m | Avg: 56m 51s | Max: 59m 52s
  🟩 nvcc12.6           Pass: 100%/28  | Total: 14h 43m | Avg: 31m 32s | Max:  1h 08m | Hits: 196%/7344  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total: 56m 00s | Avg: 28m 00s | Max: 28m 13s
  🟩 nvcc               Pass: 100%/35  | Total: 19h 43m | Avg: 33m 48s | Max:  1h 08m | Hits: 185%/9180  
🟩 cxx
  🟩 Clang14            Pass: 100%/4   | Total:  2h 02m | Avg: 30m 36s | Max: 31m 44s
  🟩 Clang15            Pass: 100%/1   | Total: 32m 13s | Avg: 32m 13s | Max: 32m 13s
  🟩 Clang16            Pass: 100%/1   | Total: 30m 54s | Avg: 30m 54s | Max: 30m 54s
  🟩 Clang17            Pass: 100%/1   | Total: 34m 00s | Avg: 34m 00s | Max: 34m 00s
  🟩 Clang18            Pass: 100%/7   | Total:  2h 53m | Avg: 24m 51s | Max: 32m 30s
  🟩 GCC7               Pass: 100%/2   | Total:  1h 02m | Avg: 31m 11s | Max: 33m 09s
  🟩 GCC8               Pass: 100%/1   | Total: 31m 48s | Avg: 31m 48s | Max: 31m 48s
  🟩 GCC9               Pass: 100%/2   | Total:  1h 12m | Avg: 36m 01s | Max: 37m 06s
  🟩 GCC10              Pass: 100%/1   | Total: 33m 06s | Avg: 33m 06s | Max: 33m 06s
  🟩 GCC11              Pass: 100%/1   | Total: 35m 30s | Avg: 35m 30s | Max: 35m 30s
  🟩 GCC12              Pass: 100%/1   | Total: 35m 51s | Avg: 35m 51s | Max: 35m 51s
  🟩 GCC13              Pass: 100%/8   | Total:  3h 04m | Avg: 23m 02s | Max: 36m 55s
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 56m | Avg: 58m 26s | Max:  1h 01m | Hits: 139%/3672  
  🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 40m | Avg: 53m 26s | Max:  1h 08m | Hits: 215%/5508  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 53m | Avg: 56m 51s | Max: 59m 52s
🟩 cxx_family
  🟩 Clang              Pass: 100%/14  | Total:  6h 33m | Avg: 28m 06s | Max: 34m 00s
  🟩 GCC                Pass: 100%/16  | Total:  7h 35m | Avg: 28m 26s | Max: 37m 06s
  🟩 MSVC               Pass: 100%/5   | Total:  4h 37m | Avg: 55m 26s | Max:  1h 08m | Hits: 185%/9180  
  🟩 NVHPC              Pass: 100%/2   | Total:  1h 53m | Avg: 56m 51s | Max: 59m 52s
🟩 gpu
  🟩 v100               Pass: 100%/37  | Total: 20h 39m | Avg: 33m 29s | Max:  1h 08m | Hits: 185%/9180  
🟩 jobs
  🟩 Build              Pass: 100%/31  | Total: 19h 04m | Avg: 36m 54s | Max:  1h 08m | Hits: 139%/7344  
  🟩 TestCPU            Pass: 100%/3   | Total: 52m 06s | Avg: 17m 22s | Max: 36m 22s | Hits: 365%/1836  
  🟩 TestGPU            Pass: 100%/3   | Total: 43m 20s | Avg: 14m 26s | Max: 16m 09s
🟩 sm
  🟩 90a                Pass: 100%/1   | Total: 20m 29s | Avg: 20m 29s | Max: 20m 29s
🟩 std
  🟩 17                 Pass: 100%/14  | Total:  9h 09m | Avg: 39m 17s | Max:  1h 01m | Hits: 139%/5508  
  🟩 20                 Pass: 100%/21  | Total: 10h 51m | Avg: 31m 02s | Max:  1h 08m | Hits: 252%/3672

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 31s | Avg: 4m 45s | Max: 7m 31s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total:  9m 31s | Avg:  4m 45s | Max:  7m 31s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total:  9m 31s | Avg:  4m 45s | Max:  7m 31s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 31s | Avg:  4m 45s | Max:  7m 31s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total:  9m 31s | Avg:  4m 45s | Max:  7m 31s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total:  9m 31s | Avg:  4m 45s | Max:  7m 31s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total:  9m 31s | Avg:  4m 45s | Max:  7m 31s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total:  9m 31s | Avg:  4m 45s | Max:  7m 31s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 00s | Avg:  2m 00s | Max:  2m 00s
  🟩 Test               Pass: 100%/1   | Total:  7m 31s | Avg:  7m 31s | Max:  7m 31s

🟩 python: Pass: 100%/1 | Total: 47m 07s | Avg: 47m 07s | Max: 47m 07s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 47m 07s | Avg: 47m 07s | Max: 47m 07s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 47m 07s | Avg: 47m 07s | Max: 47m 07s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 47m 07s | Avg: 47m 07s | Max: 47m 07s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 47m 07s | Avg: 47m 07s | Max: 47m 07s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 47m 07s | Avg: 47m 07s | Max: 47m 07s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 47m 07s | Avg: 47m 07s | Max: 47m 07s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 47m 07s | Avg: 47m 07s | Max: 47m 07s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 47m 07s | Avg: 47m 07s | Max: 47m 07s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
+/-	python
+/-	CCCL C Parallel Library
+/-	Catch2Helper

🏃‍ Runner counts (total jobs: 78)

#	Runner
53	`linux-amd64-cpu16`
11	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`
4	`linux-arm64-cpu16`
1	`linux-amd64-gpu-h100-latest-1-testing`

@shwina

update docs update docs add `memcmp`, `memmove` and `memchr` implementations implement tests Use cuda::std::min/max in Thrust (NVIDIA#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (NVIDIA#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (NVIDIA#2773) Deprecate thrust::null_type (NVIDIA#3367) Deprecate cub::DeviceSpmv (NVIDIA#3320) Fixes: NVIDIA#896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (NVIDIA#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements Exit with error when RAPIDS CI fails. 
(NVIDIA#3385) cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation udpates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <[email protected]> Deprecate thrust::async (NVIDIA#3324) Fixes: NVIDIA#100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342) Fix broken `_CCCL_BUILTIN_ASSUME` macro (NVIDIA#3314) * add compiler-specific path * fix device code path * add _CCC_ASSUME Deprecate thrust::numeric_limits (NVIDIA#3366) Replace `typedef` with `using` in libcu++ (NVIDIA#3368) Deprecate thrust::optional (NVIDIA#3307) Fixes: NVIDIA#3306 Upgrade to Catch2 3.8 (NVIDIA#3310) Fixes: NVIDIA#1724 refactor `<cuda/std/cstdint>` (NVIDIA#3325) Co-authored-by: Bernhard Manfred Gruber <[email protected]> Update CODEOWNERS (NVIDIA#3331) * Update CODEOWNERS * Update CODEOWNERS * Update CODEOWNERS * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix sign-compare warning (NVIDIA#3408) Implement more cmath functions to be usable on host and device (NVIDIA#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (NVIDIA#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <[email protected]> Fix assert definition for NVHPC due to constexpr issues (NVIDIA#3418) NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks 
it. Fix this by always using the host definition which should also work on device. Fixes NVIDIA#3411 Extend CUB reduce benchmarks (NVIDIA#3401) * Rename max.cu to custom.cu, since it uses a custom operator * Extend types covered my min.cu to all fundamental types * Add some notes on how to collect tuning parameters Fixes: NVIDIA#3283 Update upload-pages-artifact to v3 (NVIDIA#3423) * Update upload-pages-artifact to v3 * Empty commit --------- Co-authored-by: Ashwin Srinath <[email protected]> Replace and deprecate thrust::cuda_cub::terminate (NVIDIA#3421) `std::linalg` accessors and `transposed_layout` (NVIDIA#2962) Add round up/down to multiple (NVIDIA#3234) [FEA]: Introduce Python module with CCCL headers (NVIDIA#3201) * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative * Run `copy_cccl_headers_to_aude_include()` before `setup()` * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path. * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel * Bug fix: cuda/_include only exists after shutil.copytree() ran. * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions) * Replace := operator (needs Python 3.8+) * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md * Restore original README.md: `pip3 install -e` now works on first pass. * cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment)) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * NVIDIA#3201 (comment) * NVIDIA#3201 (comment) * Install CCCL headers under cuda.cccl.include Trigger for this change: * NVIDIA#3201 (comment) Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. 
* Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d6. * Move numpy from [build-system] requires -> [project] dependencies * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH * Remove copy_license() and use license_files=["../../LICENSE"] instead. * Further modernize cuda_cccl/setup.py to use pathlib * Trivial simplifications in cuda_cccl/pyproject.toml * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml * Add taplo-pre-commit to .pre-commit-config.yaml * taplo-pre-commit auto-fixes * Use pathlib in cuda_cooperative/setup.py * CCCL_PYTHON_PATH in cuda_cooperative/setup.py * Modernize cuda_parallel/pyproject.toml, setup.py * Use pathlib in cuda_parallel/setup.py * Add `# TOML lint & format` comment. 
* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml * Use pathlib in cuda/cccl/include_paths.py * pre-commit autoupdate (EXCEPT clang-format, which was manually restored) * Fixes after git merge main * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result' ``` =========================================================================== warnings summary =========================================================================== tests/test_reduce.py::test_reduce_non_contiguous /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080> Traceback (most recent call last): File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__ bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result)) ^^^^^^^^^^^^^^^^^ AttributeError: '_Reduce' object has no attribute 'build_result' warnings.warn(pytest.PytestUnraisableExceptionWarning(msg)) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ============================================================== ``` * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy` * Introduce cuda_cooperative/constraints.txt * Also add cuda_parallel/constraints.txt * Add `--constraint constraints.txt` in ci/test_python.sh * Update Copyright dates * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024) For completeness: The other repo took a long time to install into the pre-commit cache; so long it lead to timeouts in the CCCL CI. * Remove unused cuda_parallel jinja2 dependency (noticed by chance). 
* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead. * Make cuda_cooperative, cuda_parallel testing completely independent. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]" This reverts commit ea33a21. Error message: NVIDIA#3201 (comment) * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Restore original ci/matrix.yaml [skip-rapids] * Use for loop in test_python.sh to avoid code duplication. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci] * Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd. 
* Implement suggestion by @shwina (NVIDIA#3201 (review)) * Address feedback by @leofang --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348) * Add optional stream argument to reduce_into() * Add tests to check for reduce_into() stream behavior * Move protocol related utils to separate file and rework __cuda_stream__ error messages * Fix synchronization issue in stream test and add one more invalid stream test case * Rename cuda stream validation function after removing leading underscore * Unpack values from __cuda_stream__ instead of indexing * Fix linting errors * Handle TypeError when unpacking invalid __cuda_stream__ return * Use stream to allocate cupy memory in new stream test Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (NVIDIA#3434) Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419) * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ Fixes NVIDIA#3404 Fix CI issues (NVIDIA#3443) Remove deprecated `cub::min` (NVIDIA#3450) * Remove deprecated `cuda::{min,max}` * Drop unused `thrust::remove_cvref` file Fix typo in builtin (NVIDIA#3451) Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435) uses unsigned offset types in thrust's scan dispatch (NVIDIA#3436) Default transform_iterator's copy ctor (NVIDIA#3395) Fixes: NVIDIA#2393 Turn C++ dialect warning into error (NVIDIA#3453) Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` (NVIDIA#3437) * uses thrust's dynamic dispatch for merge_sort * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Refactor allocator handling of contiguous_storage (NVIDIA#3050) Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop thrust::detail::integer_traits (NVIDIA#3391) Add cuda::is_floating_point 
supporting half and bfloat (NVIDIA#3379) Co-authored-by: Michael Schellenberger Costa <[email protected]> Improve docs of std headers (NVIDIA#3416) Drop C++11 and C++14 support for all of cccl (NVIDIA#3417) * Drop C++11 and C++14 support for all of cccl --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> Deprecate a few CUB macros (NVIDIA#3456) Deprecate thrust universal iterator categories (NVIDIA#3461) Fix launch args order (NVIDIA#3465) Add `--extended-lambda` to the list of removed clangd flags (NVIDIA#3432) add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429) Add `_CCCL_BUILTIN_PREFETCH` (NVIDIA#3433) Drop universal iterator categories (NVIDIA#3474) Ensure that headers in `<cuda/*>` can be build with a C++ only compiler (NVIDIA#3472) Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470) Also ensure that we actually can enable FP8 due to FP16 and BF16 requirements Co-authored-by: Michael Schellenberger Costa <[email protected]> Moves CUB kernel entry points to a detail namespace (NVIDIA#3468) * moves emptykernel to detail ns * second batch * third batch * fourth batch * fixes cuda parallel * concatenates nested namespaces Deprecate block/warp algo specializations (NVIDIA#3455) Fixes: NVIDIA#3409 Refactor CUB's util_debug (NVIDIA#3345)

bernhardmgruber force-pushed the ref_util_debug branch 2 times, most recently from fbc9fb1 to 8d05f7a Compare January 15, 2025 14:30

bernhardmgruber marked this pull request as ready for review January 15, 2025 14:31

bernhardmgruber requested review from a team as code owners January 15, 2025 14:31

bernhardmgruber requested review from alliepiper, gonidelis and elstehle January 15, 2025 14:31

NVIDIA deleted a comment from copy-pr-bot bot Jan 21, 2025

bernhardmgruber force-pushed the ref_util_debug branch from 8d05f7a to ca1fc92 Compare January 21, 2025 11:28

Refactor CUB's util_debug

62b9e00

bernhardmgruber force-pushed the ref_util_debug branch from ca1fc92 to 62b9e00 Compare January 21, 2025 18:11

gonidelis approved these changes Jan 21, 2025

miscco approved these changes Jan 22, 2025

elstehle approved these changes Jan 22, 2025

alliepiper approved these changes Jan 22, 2025

bernhardmgruber merged commit d47c1c1 into NVIDIA:main Jan 22, 2025
93 of 96 checks passed

bernhardmgruber deleted the ref_util_debug branch January 22, 2025 16:27

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025

Refactor CUB's util_debug (NVIDIA#3345)

d6ae7f3
