
Adds support for large number of items and large number of segments to DeviceSegmentedSort #3308

Merged
15 commits merged on Jan 14, 2025

Conversation

@elstehle (Collaborator) commented Jan 9, 2025

Description

⛓️ Depends on #3246

Closes #3132

Benchmark results:

We switched from a hardcoded int offset type to choose_offset_t. The following table shows the performance on H100 relative to main for (a) a user-provided 32-bit offset type (the leftmost two data columns) and (b) a user-provided 64-bit offset type (the rightmost two data columns).
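
A minimal usage sketch of what this enables (the 64-bit `num_items`/`num_segments` parameters reflect this PR's switch away from `int`; buffer setup and error handling are omitted, and the function name is illustrative):

```cpp
#include <cub/device/device_segmented_sort.cuh>

#include <cstddef>
#include <cstdint>

// Sorts float keys segment-by-segment, where the total number of keys may
// exceed INT_MAX. d_offsets holds num_segments + 1 offsets in CSR style;
// since they are 64-bit, choose_offset_t selects a 64-bit offset type.
void sort_large(const float* d_keys_in, float* d_keys_out,
                std::int64_t num_items, std::int64_t num_segments,
                const std::int64_t* d_offsets, cudaStream_t stream)
{
  void* d_temp_storage           = nullptr;
  std::size_t temp_storage_bytes = 0;

  // First call queries the required temporary storage size.
  cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                     d_keys_in, d_keys_out, num_items,
                                     num_segments, d_offsets, d_offsets + 1,
                                     stream);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);

  // Second call performs the segmented sort.
  cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                     d_keys_in, d_keys_out, num_items,
                                     num_segments, d_offsets, d_offsets + 1,
                                     stream);
  cudaFree(d_temp_storage);
}
```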

      Diff u32 vs i32.main   Diff u32 vs i32.main   Diff u64 vs i32.main   Diff u64 vs i32.main
      (any num items)        (2^30 num items)       (any num items)        (2^30 num items)
min   98.66%                 98.66%                 74.74%                 76.05%
max   102.24%                102.24%                106.10%                106.10%
avg   100.33%                100.26%                98.95%                 99.10%
Performance comparison for different offset types over main.i32 (i32/u32/i64/u64 are absolute times; u32/i32, i64/i32, and u64/i32 are run times relative to the i32 baseline):

T{ct} | Segments{io} | Elements{io} | Entropy | i32 | u32 | u32/i32 time | i64 | i64/i32 time | u64 | u64/i32 time
I8 2^12 = 4096 2^22 = 4194304 1 171.064 170.077 99.42% 171.608 100.32% 171.669 100.35%
I8 2^12 = 4096 2^26 = 67108864 1 989.288 984.608 99.53% 1004 101.49% 1007 101.79%
I8 2^12 = 4096 2^30 = 1073741824 1 15242 15185 99.63% 15336 100.62% 15422 101.18%
I8 2^16 = 65536 2^22 = 4194304 1 323.736 320.816 99.10% 321.864 99.42% 322.058 99.48%
I8 2^16 = 65536 2^26 = 67108864 1 1805 1792 99.28% 1808 100.17% 1807 100.11%
I8 2^16 = 65536 2^30 = 1073741824 1 10383 10300 99.20% 10525 101.37% 10545 101.56%
I8 2^20 = 1048576 2^22 = 4194304 1 152.102 151.422 99.55% 156.698 103.02% 157.003 103.22%
I8 2^20 = 1048576 2^26 = 67108864 1 4643 4648 100.11% 4683 100.86% 4676 100.71%
I8 2^20 = 1048576 2^30 = 1073741824 1 27983 28023 100.14% 28284 101.08% 28261 100.99%
I8 2^12 = 4096 2^22 = 4194304 0.201 169.123 169.35 100.13% 170.66 100.91% 170.81 101.00%
I8 2^12 = 4096 2^26 = 67108864 0.201 969.104 968.397 99.93% 987.917 101.94% 988.25 101.98%
I8 2^12 = 4096 2^30 = 1073741824 0.201 14944 15025 100.54% 15094 101.00% 15163 101.47%
I8 2^16 = 65536 2^22 = 4194304 0.201 321.08 320.83 99.92% 321.694 100.19% 322.101 100.32%
I8 2^16 = 65536 2^26 = 67108864 0.201 1783 1784 100.06% 1800 100.95% 1799 100.90%
I8 2^16 = 65536 2^30 = 1073741824 0.201 10120 10128 100.08% 10349 102.26% 10373 102.50%
I8 2^20 = 1048576 2^22 = 4194304 0.201 151.437 150.722 99.53% 156.332 103.23% 156.826 103.56%
I8 2^20 = 1048576 2^26 = 67108864 0.201 4643 4645 100.04% 4680 100.80% 4674 100.67%
I8 2^20 = 1048576 2^30 = 1073741824 0.201 27877 27906 100.10% 28166 101.04% 28145 100.96%
I16 2^12 = 4096 2^22 = 4194304 1 331.151 334.64 101.05% 335.582 101.34% 336.183 101.52%
I16 2^12 = 4096 2^26 = 67108864 1 2148 2159 100.51% 2196 102.23% 2198 102.33%
I16 2^12 = 4096 2^30 = 1073741824 1 34572 34688 100.34% 34521 99.85% 34527 99.87%
I16 2^16 = 65536 2^22 = 4194304 1 656.383 666.756 101.58% 661.087 100.72% 661.379 100.76%
I16 2^16 = 65536 2^26 = 67108864 1 4142 4194 101.26% 4210 101.64% 4214 101.74%
I16 2^16 = 65536 2^30 = 1073741824 1 25701 25861 100.62% 26508 103.14% 26501 103.11%
I16 2^20 = 1048576 2^22 = 4194304 1 166.968 167.761 100.47% 172.948 103.58% 173.548 103.94%
I16 2^20 = 1048576 2^26 = 67108864 1 10198 10359 101.58% 10305 101.05% 10308 101.08%
I16 2^20 = 1048576 2^30 = 1073741824 1 65475 66307 101.27% 66562 101.66% 66604 101.72%
I16 2^12 = 4096 2^22 = 4194304 0.201 329.561 333.422 101.17% 334.258 101.43% 335.089 101.68%
I16 2^12 = 4096 2^26 = 67108864 0.201 2109 2124 100.71% 2164 102.61% 2167 102.75%
I16 2^12 = 4096 2^30 = 1073741824 0.201 34106 34248 100.42% 34201 100.28% 34213 100.31%
I16 2^16 = 65536 2^22 = 4194304 0.201 655.958 666.458 101.60% 660.648 100.71% 661.596 100.86%
I16 2^16 = 65536 2^26 = 67108864 0.201 4130 4183 101.28% 4198 101.65% 4202 101.74%
I16 2^16 = 65536 2^30 = 1073741824 0.201 25185 25389 100.81% 26057 103.46% 26078 103.55%
I16 2^20 = 1048576 2^22 = 4194304 0.201 166.447 167.336 100.53% 173.288 104.11% 172.969 103.92%
I16 2^20 = 1048576 2^26 = 67108864 0.201 10194 10355 101.58% 10298 101.02% 10302 101.06%
I16 2^20 = 1048576 2^30 = 1073741824 0.201 65315 66158 101.29% 66389 101.64% 66447 101.73%
I32 2^12 = 4096 2^22 = 4194304 1 531.756 538.586 101.28% 544.199 102.34% 544.732 102.44%
I32 2^12 = 4096 2^26 = 67108864 1 4334 4332 99.95% 4415 101.87% 4418 101.94%
I32 2^12 = 4096 2^30 = 1073741824 1 79322 79113 99.74% 78212 98.60% 78226 98.62%
I32 2^16 = 65536 2^22 = 4194304 1 1005 1026 102.09% 1021 101.59% 1021 101.59%
I32 2^16 = 65536 2^26 = 67108864 1 6736 6848 101.66% 6889 102.27% 6892 102.32%
I32 2^16 = 65536 2^30 = 1073741824 1 52419 52379 99.92% 53737 102.51% 53750 102.54%
I32 2^20 = 1048576 2^22 = 4194304 1 159.591 162.123 101.59% 166.05 104.05% 164.699 103.20%
I32 2^20 = 1048576 2^26 = 67108864 1 15913 16124 101.33% 16097 101.16% 16053 100.88%
I32 2^20 = 1048576 2^30 = 1073741824 1 107649 108443 100.74% 109075 101.32% 109081 101.33%
I32 2^12 = 4096 2^22 = 4194304 0.201 534.099 536.152 100.38% 540.949 101.28% 540.975 101.29%
I32 2^12 = 4096 2^26 = 67108864 0.201 4280 4224 98.69% 4313 100.77% 4312 100.75%
I32 2^12 = 4096 2^30 = 1073741824 0.201 78510 77641 98.89% 76831 97.86% 76788 97.81%
I32 2^16 = 65536 2^22 = 4194304 0.201 1013 1026 101.28% 1021 100.79% 1020 100.69%
I32 2^16 = 65536 2^26 = 67108864 0.201 6787 6833 100.68% 6873 101.27% 6871 101.24%
I32 2^16 = 65536 2^30 = 1073741824 0.201 51874 51179 98.66% 52563 101.33% 52561 101.32%
I32 2^20 = 1048576 2^22 = 4194304 0.201 158.486 158.639 100.10% 165.821 104.63% 164.546 103.82%
I32 2^20 = 1048576 2^26 = 67108864 0.201 15905 16103 101.24% 16082 101.11% 16087 101.14%
I32 2^20 = 1048576 2^30 = 1073741824 0.201 107473 108220 100.70% 108789 101.22% 108796 101.23%
I64 2^12 = 4096 2^22 = 4194304 1 992.396 995.195 100.28% 823.864 83.02% 827.203 83.35%
I64 2^12 = 4096 2^26 = 67108864 1 13335 13358 100.17% 12605 94.53% 12683 95.11%
I64 2^12 = 4096 2^30 = 1073741824 1 263087 263605 100.20% 227206 86.36% 228033 86.68%
I64 2^16 = 65536 2^22 = 4194304 1 2499 2508 100.36% 1872 74.91% 1871 74.87%
I64 2^16 = 65536 2^26 = 67108864 1 12189 12237 100.39% 9363 76.82% 9386 77.00%
I64 2^16 = 65536 2^30 = 1073741824 1 164230 164130 99.94% 150192 91.45% 150436 91.60%
I64 2^20 = 1048576 2^22 = 4194304 1 456.432 459.818 100.74% 399.507 87.53% 398.75 87.36%
I64 2^20 = 1048576 2^26 = 67108864 1 39183 39567 100.98% 29295 74.76% 29295 74.76%
I64 2^20 = 1048576 2^30 = 1073741824 1 191465 193515 101.07% 145777 76.14% 146110 76.31%
I64 2^12 = 4096 2^22 = 4194304 0.201 980.146 987.815 100.78% 815.258 83.18% 820.249 83.69%
I64 2^12 = 4096 2^26 = 67108864 0.201 13039 13085 100.35% 12278 94.16% 12367 94.85%
I64 2^12 = 4096 2^30 = 1073741824 0.201 256766 257719 100.37% 222320 86.58% 223453 87.03%
I64 2^16 = 65536 2^22 = 4194304 0.201 2481 2506 101.01% 1869 75.33% 1869 75.33%
I64 2^16 = 65536 2^26 = 67108864 0.201 12051 12161 100.91% 9276 76.97% 9303 77.20%
I64 2^16 = 65536 2^30 = 1073741824 0.201 160395 160791 100.25% 146069 91.07% 146554 91.37%
I64 2^20 = 1048576 2^22 = 4194304 0.201 456.292 458.946 100.58% 399.642 87.58% 398.532 87.34%
I64 2^20 = 1048576 2^26 = 67108864 0.201 39165 39546 100.97% 29271 74.74% 29271 74.74%
I64 2^20 = 1048576 2^30 = 1073741824 0.201 190570 192387 100.95% 144503 75.83% 144933 76.05%
I128 2^12 = 4096 2^22 = 4194304 1 1816 1821 100.28% 1903 104.79% 1905 104.90%
I128 2^12 = 4096 2^26 = 67108864 1 45585 45616 100.07% 45823 100.52% 45829 100.54%
I128 2^12 = 4096 2^30 = 1073741824 1 803494 802234 99.84% 804028 100.07% 809422 100.74%
I128 2^16 = 65536 2^22 = 4194304 1 5426 5391 99.35% 5432 100.11% 5431 100.09%
I128 2^16 = 65536 2^26 = 67108864 1 20912 20874 99.82% 22138 105.86% 22158 105.96%
I128 2^16 = 65536 2^30 = 1073741824 1 579030 578849 99.97% 579926 100.15% 579673 100.11%
I128 2^20 = 1048576 2^22 = 4194304 1 1908 1897 99.42% 1914 100.31% 1913 100.26%
I128 2^20 = 1048576 2^26 = 67108864 1 86362 85799 99.35% 86501 100.16% 86509 100.17%
I128 2^20 = 1048576 2^30 = 1073741824 1 325561 324647 99.72% 345121 106.01% 345348 106.08%
I128 2^12 = 4096 2^22 = 4194304 0.201 1805 1804 99.94% 1885 104.43% 1890 104.71%
I128 2^12 = 4096 2^26 = 67108864 0.201 45078 44787 99.35% 45119 100.09% 45006 99.84%
I128 2^12 = 4096 2^30 = 1073741824 0.201 790849 790246 99.92% 789066 99.77% 788230 99.67%
I128 2^16 = 65536 2^22 = 4194304 0.201 5425 5390 99.35% 5430 100.09% 5433 100.15%
I128 2^16 = 65536 2^26 = 67108864 0.201 20733 20694 99.81% 21955 105.89% 21975 105.99%
I128 2^16 = 65536 2^30 = 1073741824 0.201 569609 569317 99.95% 570682 100.19% 570469 100.15%
I128 2^20 = 1048576 2^22 = 4194304 0.201 1907 1896 99.42% 1913 100.31% 1913 100.31%
I128 2^20 = 1048576 2^26 = 67108864 0.201 86335 85765 99.34% 86465 100.15% 86475 100.16%
I128 2^20 = 1048576 2^30 = 1073741824 0.201 322907 322056 99.74% 342227 105.98% 342593 106.10%
F32 2^12 = 4096 2^22 = 4194304 1 538.54 536.098 99.55% 546.961 101.56% 546.583 101.49%
F32 2^12 = 4096 2^26 = 67108864 1 4218 4206 99.72% 4387 104.01% 4390 104.08%
F32 2^12 = 4096 2^30 = 1073741824 1 76705 76848 100.19% 77639 101.22% 77684 101.28%
F32 2^16 = 65536 2^22 = 4194304 1 1039 1038 99.90% 1036 99.71% 1036 99.71%
F32 2^16 = 65536 2^26 = 67108864 1 6905 6882 99.67% 6968 100.91% 6970 100.94%
F32 2^16 = 65536 2^30 = 1073741824 1 51133 51170 100.07% 53807 105.23% 53809 105.23%
F32 2^20 = 1048576 2^22 = 4194304 1 169.133 171.395 101.34% 173.927 102.83% 174.545 103.20%
F32 2^20 = 1048576 2^26 = 67108864 1 16353 16429 100.46% 16448 100.58% 16454 100.62%
F32 2^20 = 1048576 2^30 = 1073741824 1 110096 109928 99.85% 111248 101.05% 111267 101.06%
F32 2^12 = 4096 2^22 = 4194304 0.201 540.257 538.898 99.75% 549.305 101.67% 549.943 101.79%
F32 2^12 = 4096 2^26 = 67108864 0.201 4164 4171 100.17% 4352 104.51% 4355 104.59%
F32 2^12 = 4096 2^30 = 1073741824 0.201 76053 76159 100.14% 77048 101.31% 77019 101.27%
F32 2^16 = 65536 2^22 = 4194304 0.201 1046 1047 100.10% 1044 99.81% 1045 99.90%
F32 2^16 = 65536 2^26 = 67108864 0.201 6941 6933 99.88% 7019 101.12% 7018 101.11%
F32 2^16 = 65536 2^30 = 1073741824 0.201 50694 50782 100.17% 53405 105.35% 53379 105.30%
F32 2^20 = 1048576 2^22 = 4194304 0.201 167.528 170.387 101.71% 173.007 103.27% 174.216 103.99%
F32 2^20 = 1048576 2^26 = 67108864 0.201 16257 16283 100.16% 16305 100.30% 16318 100.38%
F32 2^20 = 1048576 2^30 = 1073741824 0.201 109040 108883 99.86% 110188 101.05% 110189 101.05%
F64 2^12 = 4096 2^22 = 4194304 1 841.793 842.753 100.11% 873.314 103.74% 872.16 103.61%
F64 2^12 = 4096 2^26 = 67108864 1 13131 13422 102.22% 13193 100.47% 13092 99.70%
F64 2^12 = 4096 2^30 = 1073741824 1 231598 233485 100.81% 234794 101.38% 233545 100.84%
F64 2^16 = 65536 2^22 = 4194304 1 1889 1892 100.16% 1891 100.11% 1907 100.95%
F64 2^16 = 65536 2^26 = 67108864 1 9404 9400 99.96% 9743 103.60% 9757 103.75%
F64 2^16 = 65536 2^30 = 1073741824 1 159098 162490 102.13% 157836 99.21% 157135 98.77%
F64 2^20 = 1048576 2^22 = 4194304 1 392.044 392.222 100.05% 396.421 101.12% 396.611 101.16%
F64 2^20 = 1048576 2^26 = 67108864 1 29525 29571 100.16% 29836 101.05% 29587 100.21%
F64 2^20 = 1048576 2^30 = 1073741824 1 146251 146177 99.95% 152512 104.28% 150459 102.88%
F64 2^12 = 4096 2^22 = 4194304 0.201 833.185 834.042 100.10% 872.656 104.74% 858.737 103.07%
F64 2^12 = 4096 2^26 = 67108864 0.201 12806 13020 101.67% 12965 101.24% 12721 99.34%
F64 2^12 = 4096 2^30 = 1073741824 0.201 227853 229705 100.81% 232754 102.15% 229502 100.72%
F64 2^16 = 65536 2^22 = 4194304 0.201 1887 1891 100.21% 1906 101.01% 1890 100.16%
F64 2^16 = 65536 2^26 = 67108864 0.201 9341 9332 99.90% 9739 104.26% 9608 102.86%
F64 2^16 = 65536 2^30 = 1073741824 0.201 154879 158349 102.24% 152135 98.23% 150872 97.41%
F64 2^20 = 1048576 2^22 = 4194304 0.201 388.24 389.095 100.22% 395.222 101.80% 392.773 101.17%
F64 2^20 = 1048576 2^26 = 67108864 0.201 29508 29555 100.16% 29819 101.05% 29570 100.21%
F64 2^20 = 1048576 2^30 = 1073741824 0.201 145305 145135 99.88% 151392 104.19% 149473 102.87%

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle requested review from a team as code owners January 9, 2025 12:10
@elstehle force-pushed the enh/large-seg-support-seg-sort branch from 8889147 to 154861d on January 9, 2025 16:18
github-actions bot (Contributor) commented Jan 9, 2025

🟨 CI finished in 2h 28m: Pass: 92%/96 | Total: 2d 15h | Avg: 39m 32s | Max: 1h 30m | Hits: 352%/15012
  • 🟨 cub: Pass: 85%/47 | Total: 1d 15h | Avg: 50m 55s | Max: 1h 30m | Hits: 459%/3900

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  84%/45  | Total:  1d 14h | Avg: 50m 44s | Max:  1h 30m | Hits: 459%/3900  
      🟩 arm64              Pass: 100%/2   | Total:  1h 50m | Avg: 55m 02s | Max: 55m 26s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 53m | Avg: 56m 34s | Max: 58m 33s
      🔍 nvcc               Pass:  84%/45  | Total:  1d 14h | Avg: 50m 40s | Max:  1h 30m | Hits: 459%/3900  
    🔍 gpu: v100 🔍
      🟩 h100               Pass: 100%/2   | Total: 43m 26s | Avg: 21m 43s | Max: 23m 48s
      🔍 v100               Pass:  84%/45  | Total:  1d 15h | Avg: 52m 13s | Max:  1h 30m | Hits: 459%/3900  
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  82%/40  | Total:  1d 11h | Avg: 52m 31s | Max:  1h 05m | Hits: 459%/3900  
      🟩 DeviceLaunch       Pass: 100%/1   | Total:  1h 30m | Avg:  1h 30m | Max:  1h 30m
      🟩 GraphCapture       Pass: 100%/1   | Total: 27m 19s | Avg: 27m 19s | Max: 27m 19s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 29m | Avg: 29m 50s | Max: 48m 13s
      🟩 TestGPU            Pass: 100%/2   | Total:  1h 24m | Avg: 42m 27s | Max:  1h 01m
    🟨 ctk
      🟨 12.0               Pass:  62%/8   | Total:  7h 06m | Avg: 53m 21s | Max: 59m 48s | Hits: 445%/1560  
      🟩 12.5               Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
      🟨 12.6               Pass:  89%/37  | Total:  1d 06h | Avg: 49m 44s | Max:  1h 30m | Hits: 468%/2340  
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 53m | Avg: 56m 34s | Max: 58m 33s
      🟨 nvcc12.0           Pass:  62%/8   | Total:  7h 06m | Avg: 53m 21s | Max: 59m 48s | Hits: 445%/1560  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
      🟨 nvcc12.6           Pass:  88%/35  | Total:  1d 04h | Avg: 49m 21s | Max:  1h 30m | Hits: 468%/2340  
    🟨 cxx
      🟨 Clang9             Pass:  25%/4   | Total:  3h 27m | Avg: 51m 54s | Max: 52m 45s
      🟩 Clang10            Pass: 100%/1   | Total: 54m 23s | Avg: 54m 23s | Max: 54m 23s
      🟩 Clang11            Pass: 100%/1   | Total: 51m 07s | Avg: 51m 07s | Max: 51m 07s
      🟩 Clang12            Pass: 100%/1   | Total: 49m 47s | Avg: 49m 47s | Max: 49m 47s
      🟩 Clang13            Pass: 100%/1   | Total: 53m 38s | Avg: 53m 38s | Max: 53m 38s
      🟩 Clang14            Pass: 100%/1   | Total: 49m 41s | Avg: 49m 41s | Max: 49m 41s
      🟩 Clang15            Pass: 100%/1   | Total: 52m 39s | Avg: 52m 39s | Max: 52m 39s
      🟩 Clang16            Pass: 100%/1   | Total: 51m 48s | Avg: 51m 48s | Max: 51m 48s
      🟩 Clang17            Pass: 100%/1   | Total: 50m 17s | Avg: 50m 17s | Max: 50m 17s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 21m | Avg: 45m 55s | Max: 58m 33s
      🟨 GCC7               Pass:  25%/4   | Total:  3h 27m | Avg: 51m 57s | Max: 53m 56s
      🟩 GCC8               Pass: 100%/1   | Total: 52m 34s | Avg: 52m 34s | Max: 52m 34s
      🟨 GCC9               Pass:  66%/3   | Total:  2h 36m | Avg: 52m 15s | Max: 54m 52s
      🟩 GCC10              Pass: 100%/1   | Total: 54m 48s | Avg: 54m 48s | Max: 54m 48s
      🟩 GCC11              Pass: 100%/1   | Total: 54m 35s | Avg: 54m 35s | Max: 54m 35s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 35m | Avg: 31m 40s | Max: 51m 34s
      🟩 GCC13              Pass: 100%/8   | Total:  6h 51m | Avg: 51m 29s | Max:  1h 30m
      🟩 MSVC14.29          Pass: 100%/3   | Total:  2h 53m | Avg: 57m 59s | Max: 59m 48s | Hits: 445%/2340  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 57m | Avg: 58m 44s | Max:  1h 04m | Hits: 480%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
    🟨 cxx_family
      🟨 Clang              Pass:  84%/19  | Total: 15h 42m | Avg: 49m 36s | Max: 58m 33s
      🟨 GCC                Pass:  80%/21  | Total: 17h 13m | Avg: 49m 12s | Max:  1h 30m
      🟩 MSVC               Pass: 100%/5   | Total:  4h 51m | Avg: 58m 17s | Max:  1h 04m | Hits: 459%/3900  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
    🟨 std
      🟥 11                 Pass:   0%/5   | Total:  4h 17m | Avg: 51m 27s | Max: 53m 56s
      🟨 14                 Pass:  33%/3   | Total:  2h 43m | Avg: 54m 20s | Max: 59m 48s | Hits: 448%/780   
      🟩 17                 Pass: 100%/13  | Total: 11h 50m | Avg: 54m 39s | Max:  1h 00m | Hits: 460%/2340  
      🟩 20                 Pass: 100%/26  | Total: 21h 02m | Avg: 48m 33s | Max:  1h 30m | Hits: 468%/780   
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 43m 26s | Avg: 21m 43s | Max: 23m 48s
      🟩 90a                Pass: 100%/1   | Total: 21m 58s | Avg: 21m 58s | Max: 21m 58s
    
  • 🟩 thrust: Pass: 100%/46 | Total: 22h 46m | Avg: 29m 42s | Max: 52m 40s | Hits: 314%/11112

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 35m 37s | Avg: 17m 48s | Max: 23m 55s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total: 21h 50m | Avg: 29m 47s | Max: 52m 40s | Hits: 314%/11112 
      🟩 arm64              Pass: 100%/2   | Total: 56m 04s | Avg: 28m 02s | Max: 29m 51s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  4h 18m | Avg: 32m 21s | Max: 51m 31s | Hits: 300%/3704  
      🟩 12.5               Pass: 100%/2   | Total:  1h 36m | Avg: 48m 23s | Max: 49m 01s
      🟩 12.6               Pass: 100%/36  | Total: 16h 50m | Avg: 28m 04s | Max: 52m 40s | Hits: 321%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 50m 13s | Avg: 25m 06s | Max: 26m 03s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  4h 18m | Avg: 32m 21s | Max: 51m 31s | Hits: 300%/3704  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 36m | Avg: 48m 23s | Max: 49m 01s
      🟩 nvcc12.6           Pass: 100%/34  | Total: 16h 00m | Avg: 28m 15s | Max: 52m 40s | Hits: 321%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 50m 13s | Avg: 25m 06s | Max: 26m 03s
      🟩 nvcc               Pass: 100%/44  | Total: 21h 56m | Avg: 29m 54s | Max: 52m 40s | Hits: 314%/11112 
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  1h 48m | Avg: 27m 10s | Max: 30m 43s
      🟩 Clang10            Pass: 100%/1   | Total: 31m 24s | Avg: 31m 24s | Max: 31m 24s
      🟩 Clang11            Pass: 100%/1   | Total: 28m 50s | Avg: 28m 50s | Max: 28m 50s
      🟩 Clang12            Pass: 100%/1   | Total: 31m 03s | Avg: 31m 03s | Max: 31m 03s
      🟩 Clang13            Pass: 100%/1   | Total: 28m 38s | Avg: 28m 38s | Max: 28m 38s
      🟩 Clang14            Pass: 100%/1   | Total: 27m 17s | Avg: 27m 17s | Max: 27m 17s
      🟩 Clang15            Pass: 100%/1   | Total: 30m 00s | Avg: 30m 00s | Max: 30m 00s
      🟩 Clang16            Pass: 100%/1   | Total: 31m 01s | Avg: 31m 01s | Max: 31m 01s
      🟩 Clang17            Pass: 100%/1   | Total: 28m 59s | Avg: 28m 59s | Max: 28m 59s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 32m | Avg: 21m 50s | Max: 30m 21s
      🟩 GCC7               Pass: 100%/4   | Total:  1h 43m | Avg: 25m 49s | Max: 29m 53s
      🟩 GCC8               Pass: 100%/1   | Total: 29m 08s | Avg: 29m 08s | Max: 29m 08s
      🟩 GCC9               Pass: 100%/3   | Total:  1h 29m | Avg: 29m 59s | Max: 33m 25s
      🟩 GCC10              Pass: 100%/1   | Total: 29m 31s | Avg: 29m 31s | Max: 29m 31s
      🟩 GCC11              Pass: 100%/1   | Total: 29m 33s | Avg: 29m 33s | Max: 29m 33s
      🟩 GCC12              Pass: 100%/1   | Total: 33m 56s | Avg: 33m 56s | Max: 33m 56s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 51m | Avg: 21m 27s | Max: 35m 48s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  2h 25m | Avg: 48m 37s | Max: 51m 31s | Hits: 301%/5556  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 18m | Avg: 46m 02s | Max: 52m 40s | Hits: 327%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 36m | Avg: 48m 23s | Max: 49m 01s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  8h 18m | Avg: 26m 15s | Max: 31m 24s
      🟩 GCC                Pass: 100%/19  | Total:  8h 07m | Avg: 25m 37s | Max: 35m 48s
      🟩 MSVC               Pass: 100%/6   | Total:  4h 43m | Avg: 47m 19s | Max: 52m 40s | Hits: 314%/11112 
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 36m | Avg: 48m 23s | Max: 49m 01s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total: 22h 46m | Avg: 29m 42s | Max: 52m 40s | Hits: 314%/11112 
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 21h 17m | Avg: 31m 56s | Max: 52m 40s | Hits: 304%/9260  
      🟩 TestCPU            Pass: 100%/3   | Total: 50m 31s | Avg: 16m 50s | Max: 34m 45s | Hits: 365%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 38m 23s | Avg: 12m 47s | Max: 16m 01s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 16m 33s | Avg: 16m 33s | Max: 16m 33s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  1h 57m | Avg: 23m 28s | Max: 24m 53s
      🟩 14                 Pass: 100%/3   | Total:  1h 45m | Avg: 35m 00s | Max: 46m 14s | Hits: 302%/1852  
      🟩 17                 Pass: 100%/13  | Total:  7h 49m | Avg: 36m 08s | Max: 51m 31s | Hits: 304%/5556  
      🟩 20                 Pass: 100%/23  | Total: 10h 38m | Avg: 27m 46s | Max: 52m 40s | Hits: 335%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 57s | Avg: 4m 28s | Max: 6m 55s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 02s | Avg:  2m 02s | Max:  2m 02s
      🟩 Test               Pass: 100%/1   | Total:  6m 55s | Avg:  6m 55s | Max:  6m 55s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 96)

# Runner
69 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
11 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

github-actions bot (Contributor) commented

🟩 CI finished in 1h 38m: Pass: 100%/96 | Total: 20h 20m | Avg: 12m 42s | Max: 37m 19s | Hits: 425%/15012
  • 🟩 cub: Pass: 100%/47 | Total: 12h 24m | Avg: 15m 50s | Max: 34m 59s | Hits: 597%/3900

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total: 11h 54m | Avg: 15m 52s | Max: 34m 59s | Hits: 597%/3900  
      🟩 arm64              Pass: 100%/2   | Total: 30m 22s | Avg: 15m 11s | Max: 16m 21s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  2h 26m | Avg: 18m 15s | Max: 34m 59s | Hits: 597%/1560  
      🟩 12.5               Pass: 100%/2   | Total: 31m 35s | Avg: 15m 47s | Max: 15m 53s
      🟩 12.6               Pass: 100%/37  | Total:  9h 26m | Avg: 15m 19s | Max: 34m 50s | Hits: 597%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 18m 36s | Avg:  9m 18s | Max:  9m 30s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  2h 26m | Avg: 18m 15s | Max: 34m 59s | Hits: 597%/1560  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 31m 35s | Avg: 15m 47s | Max: 15m 53s
      🟩 nvcc12.6           Pass: 100%/35  | Total:  9h 08m | Avg: 15m 39s | Max: 34m 50s | Hits: 597%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 18m 36s | Avg:  9m 18s | Max:  9m 30s
      🟩 nvcc               Pass: 100%/45  | Total: 12h 05m | Avg: 16m 07s | Max: 34m 59s | Hits: 597%/3900  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 51m 29s | Avg: 12m 52s | Max: 13m 27s
      🟩 Clang10            Pass: 100%/1   | Total: 12m 38s | Avg: 12m 38s | Max: 12m 38s
      🟩 Clang11            Pass: 100%/1   | Total: 11m 03s | Avg: 11m 03s | Max: 11m 03s
      🟩 Clang12            Pass: 100%/1   | Total: 12m 19s | Avg: 12m 19s | Max: 12m 19s
      🟩 Clang13            Pass: 100%/1   | Total: 11m 52s | Avg: 11m 52s | Max: 11m 52s
      🟩 Clang14            Pass: 100%/1   | Total: 12m 37s | Avg: 12m 37s | Max: 12m 37s
      🟩 Clang15            Pass: 100%/1   | Total: 11m 17s | Avg: 11m 17s | Max: 11m 17s
      🟩 Clang16            Pass: 100%/1   | Total: 11m 59s | Avg: 11m 59s | Max: 11m 59s
      🟩 Clang17            Pass: 100%/1   | Total: 11m 57s | Avg: 11m 57s | Max: 11m 57s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 42m | Avg: 14m 34s | Max: 22m 32s
      🟩 GCC7               Pass: 100%/4   | Total: 49m 44s | Avg: 12m 26s | Max: 13m 01s
      🟩 GCC8               Pass: 100%/1   | Total: 11m 51s | Avg: 11m 51s | Max: 11m 51s
      🟩 GCC9               Pass: 100%/3   | Total: 38m 34s | Avg: 12m 51s | Max: 13m 45s
      🟩 GCC10              Pass: 100%/1   | Total: 11m 03s | Avg: 11m 03s | Max: 11m 03s
      🟩 GCC11              Pass: 100%/1   | Total: 12m 32s | Avg: 12m 32s | Max: 12m 32s
      🟩 GCC12              Pass: 100%/3   | Total: 38m 03s | Avg: 12m 41s | Max: 19m 36s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 09m | Avg: 16m 14s | Max: 24m 28s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  1h 42m | Avg: 34m 11s | Max: 34m 59s | Hits: 597%/2340  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 09m | Avg: 34m 40s | Max: 34m 50s | Hits: 597%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 31m 35s | Avg: 15m 47s | Max: 15m 53s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  4h 09m | Avg: 13m 07s | Max: 22m 32s
      🟩 GCC                Pass: 100%/21  | Total:  4h 51m | Avg: 13m 53s | Max: 24m 28s
      🟩 MSVC               Pass: 100%/5   | Total:  2h 51m | Avg: 34m 23s | Max: 34m 59s | Hits: 597%/3900  
      🟩 NVHPC              Pass: 100%/2   | Total: 31m 35s | Avg: 15m 47s | Max: 15m 53s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 25m 54s | Avg: 12m 57s | Max: 19m 36s
      🟩 v100               Pass: 100%/45  | Total: 11h 58m | Avg: 15m 58s | Max: 34m 59s | Hits: 597%/3900  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  9h 55m | Avg: 14m 53s | Max: 34m 59s | Hits: 597%/3900  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 22m 12s | Avg: 22m 12s | Max: 22m 12s
      🟩 GraphCapture       Pass: 100%/1   | Total: 17m 34s | Avg: 17m 34s | Max: 17m 34s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 02m | Avg: 20m 42s | Max: 21m 29s
      🟩 TestGPU            Pass: 100%/2   | Total: 47m 00s | Avg: 23m 30s | Max: 24m 28s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 25m 54s | Avg: 12m 57s | Max: 19m 36s
      🟩 90a                Pass: 100%/1   | Total:  6m 45s | Avg:  6m 45s | Max:  6m 45s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  1h 04m | Avg: 12m 57s | Max: 13m 45s
      🟩 14                 Pass: 100%/3   | Total: 59m 39s | Avg: 19m 53s | Max: 33m 52s | Hits: 597%/780   
      🟩 17                 Pass: 100%/13  | Total:  3h 45m | Avg: 17m 23s | Max: 34m 59s | Hits: 597%/2340  
      🟩 20                 Pass: 100%/26  | Total:  6h 34m | Avg: 15m 09s | Max: 34m 50s | Hits: 597%/780   
    
  • 🟩 thrust: Pass: 100%/46 | Total: 7h 19m | Avg: 9m 33s | Max: 37m 19s | Hits: 365%/11112

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 24m 32s | Avg: 12m 16s | Max: 18m 55s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total:  7h 09m | Avg:  9m 45s | Max: 37m 19s | Hits: 365%/11112 
      🟩 arm64              Pass: 100%/2   | Total:  9m 59s | Avg:  4m 59s | Max:  5m 24s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  1h 22m | Avg: 10m 15s | Max: 26m 36s | Hits: 365%/3704  
      🟩 12.5               Pass: 100%/2   | Total: 27m 39s | Avg: 13m 49s | Max: 13m 53s
      🟩 12.6               Pass: 100%/36  | Total:  5h 29m | Avg:  9m 09s | Max: 37m 19s | Hits: 365%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  5m 20s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  1h 22m | Avg: 10m 15s | Max: 26m 36s | Hits: 365%/3704  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 27m 39s | Avg: 13m 49s | Max: 13m 53s
      🟩 nvcc12.6           Pass: 100%/34  | Total:  5h 19m | Avg:  9m 23s | Max: 37m 19s | Hits: 365%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  5m 20s
      🟩 nvcc               Pass: 100%/44  | Total:  7h 09m | Avg:  9m 45s | Max: 37m 19s | Hits: 365%/11112 
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 22m 53s | Avg:  5m 43s | Max:  6m 18s
      🟩 Clang10            Pass: 100%/1   | Total:  6m 58s | Avg:  6m 58s | Max:  6m 58s
      🟩 Clang11            Pass: 100%/1   | Total:  5m 36s | Avg:  5m 36s | Max:  5m 36s
      🟩 Clang12            Pass: 100%/1   | Total:  5m 16s | Avg:  5m 16s | Max:  5m 16s
      🟩 Clang13            Pass: 100%/1   | Total:  5m 19s | Avg:  5m 19s | Max:  5m 19s
      🟩 Clang14            Pass: 100%/1   | Total:  5m 27s | Avg:  5m 27s | Max:  5m 27s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 16s | Avg:  5m 16s | Max:  5m 16s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 27s | Avg:  5m 27s | Max:  5m 27s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 31s | Avg:  5m 31s | Max:  5m 31s
      🟩 Clang18            Pass: 100%/7   | Total: 46m 21s | Avg:  6m 37s | Max: 13m 35s
      🟩 GCC7               Pass: 100%/4   | Total: 19m 19s | Avg:  4m 49s | Max:  5m 10s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 35s | Avg:  5m 35s | Max:  5m 35s
      🟩 GCC9               Pass: 100%/3   | Total: 15m 33s | Avg:  5m 11s | Max:  5m 39s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 45s | Avg:  5m 45s | Max:  5m 45s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 36s | Avg:  5m 36s | Max:  5m 36s
      🟩 GCC12              Pass: 100%/1   | Total:  5m 40s | Avg:  5m 40s | Max:  5m 40s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 05m | Avg:  8m 08s | Max: 18m 55s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  1h 19m | Avg: 26m 38s | Max: 27m 45s | Hits: 365%/5556  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 35m | Avg: 31m 45s | Max: 37m 19s | Hits: 365%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 27m 39s | Avg: 13m 49s | Max: 13m 53s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  1h 54m | Avg:  6m 00s | Max: 13m 35s
      🟩 GCC                Pass: 100%/19  | Total:  2h 02m | Avg:  6m 27s | Max: 18m 55s
      🟩 MSVC               Pass: 100%/6   | Total:  2h 55m | Avg: 29m 11s | Max: 37m 19s | Hits: 365%/11112 
      🟩 NVHPC              Pass: 100%/2   | Total: 27m 39s | Avg: 13m 49s | Max: 13m 53s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total:  7h 19m | Avg:  9m 33s | Max: 37m 19s | Hits: 365%/11112 
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  5h 43m | Avg:  8m 34s | Max: 29m 33s | Hits: 365%/9260  
      🟩 TestCPU            Pass: 100%/3   | Total: 52m 07s | Avg: 17m 22s | Max: 37m 19s | Hits: 365%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 44m 14s | Avg: 14m 44s | Max: 18m 55s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 29s | Avg:  4m 29s | Max:  4m 29s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total: 24m 33s | Avg:  4m 54s | Max:  6m 07s
      🟩 14                 Pass: 100%/3   | Total: 36m 52s | Avg: 12m 17s | Max: 25m 33s | Hits: 365%/1852  
      🟩 17                 Pass: 100%/13  | Total:  2h 27m | Avg: 11m 20s | Max: 28m 23s | Hits: 365%/5556  
      🟩 20                 Pass: 100%/23  | Total:  3h 26m | Avg:  8m 57s | Max: 37m 19s | Hits: 365%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 41s | Avg: 4m 20s | Max: 6m 43s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  1m 58s | Avg:  1m 58s | Max:  1m 58s
      🟩 Test               Pass: 100%/1   | Total:  6m 43s | Avg:  6m 43s | Max:  6m 43s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    


@fbusato (Contributor) commented Jan 11, 2025

Looks good! Indeed, many test cases were redundant, e.g., descending ordering and overwrite. The idea of using a histogram is great.
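
For context, a minimal sketch of what histogram-based verification can look like (host-side, for one segment; this is an illustration, not the actual test code from #3246): rather than comparing against a reference sort, one checks that the output segment is non-decreasing and that the input's and output's key histograms match, i.e., that the output is a permutation of the input.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Verifies one segment [begin, end): the output must be non-decreasing and
// must contain exactly the same keys as the input (matching histograms).
template <typename KeyT>
bool segment_sorted_correctly(const std::vector<KeyT>& in,
                              const std::vector<KeyT>& out,
                              std::size_t begin, std::size_t end)
{
  if (!std::is_sorted(out.begin() + begin, out.begin() + end))
  {
    return false;
  }
  std::unordered_map<KeyT, std::int64_t> histogram;
  for (std::size_t i = begin; i < end; ++i)
  {
    ++histogram[in[i]];  // count keys in the input segment
    --histogram[out[i]]; // cancel them against the output segment
  }
  for (const auto& bucket : histogram)
  {
    if (bucket.second != 0) // a non-zero count means the multisets differ
    {
      return false;
    }
  }
  return true;
}
```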

@elstehle (Collaborator, Author) commented

Thank you for your review, @fbusato! I assume your review is limited to the testing changes in #3246. I have updated that PR to address your feedback. If you are happy with the new tests, could you please approve #3246? That will let this PR focus on the algorithmic changes that enable support for a large number of items and segments in DeviceSegmentedSort.

@elstehle force-pushed the enh/large-seg-support-seg-sort branch from f38b403 to 83a390e on January 14, 2025 05:25
github-actions bot (Contributor) commented

🟩 CI finished in 1h 16m: Pass: 100%/78 | Total: 17h 19m | Avg: 13m 19s | Max: 36m 50s | Hits: 422%/12340
  • 🟩 cub: Pass: 100%/38 | Total: 10h 36m | Avg: 16m 44s | Max: 36m 50s | Hits: 590%/3120

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total: 10h 05m | Avg: 16m 48s | Max: 36m 50s | Hits: 590%/3120  
      🟩 arm64              Pass: 100%/2   | Total: 30m 54s | Avg: 15m 27s | Max: 16m 25s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 23m | Avg: 16m 43s | Max: 34m 50s | Hits: 590%/780   
      🟩 12.5               Pass: 100%/2   | Total: 33m 57s | Avg: 16m 58s | Max: 17m 22s
      🟩 12.6               Pass: 100%/31  | Total:  8h 38m | Avg: 16m 43s | Max: 36m 50s | Hits: 590%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 18m 50s | Avg:  9m 25s | Max:  9m 45s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 23m | Avg: 16m 43s | Max: 34m 50s | Hits: 590%/780   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 33m 57s | Avg: 16m 58s | Max: 17m 22s
      🟩 nvcc12.6           Pass: 100%/29  | Total:  8h 19m | Avg: 17m 13s | Max: 36m 50s | Hits: 590%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 18m 50s | Avg:  9m 25s | Max:  9m 45s
      🟩 nvcc               Pass: 100%/36  | Total: 10h 17m | Avg: 17m 08s | Max: 36m 50s | Hits: 590%/3120  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 48m 56s | Avg: 12m 14s | Max: 13m 04s
      🟩 Clang15            Pass: 100%/1   | Total: 12m 18s | Avg: 12m 18s | Max: 12m 18s
      🟩 Clang16            Pass: 100%/1   | Total: 11m 45s | Avg: 11m 45s | Max: 11m 45s
      🟩 Clang17            Pass: 100%/1   | Total: 12m 29s | Avg: 12m 29s | Max: 12m 29s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 47m | Avg: 15m 21s | Max: 26m 04s
      🟩 GCC7               Pass: 100%/2   | Total: 23m 43s | Avg: 11m 51s | Max: 11m 57s
      🟩 GCC8               Pass: 100%/1   | Total: 11m 34s | Avg: 11m 34s | Max: 11m 34s
      🟩 GCC9               Pass: 100%/2   | Total: 24m 00s | Avg: 12m 00s | Max: 12m 10s
      🟩 GCC10              Pass: 100%/1   | Total: 11m 47s | Avg: 11m 47s | Max: 11m 47s
      🟩 GCC11              Pass: 100%/1   | Total: 12m 59s | Avg: 12m 59s | Max: 12m 59s
      🟩 GCC12              Pass: 100%/3   | Total: 37m 27s | Avg: 12m 29s | Max: 19m 26s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 25m | Avg: 18m 07s | Max: 29m 49s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 11m | Avg: 35m 30s | Max: 36m 10s | Hits: 590%/1560  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 11m | Avg: 35m 48s | Max: 36m 50s | Hits: 590%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 33m 57s | Avg: 16m 58s | Max: 17m 22s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  3h 12m | Avg: 13m 47s | Max: 26m 04s
      🟩 GCC                Pass: 100%/18  | Total:  4h 26m | Avg: 14m 48s | Max: 29m 49s
      🟩 MSVC               Pass: 100%/4   | Total:  2h 22m | Avg: 35m 39s | Max: 36m 50s | Hits: 590%/3120  
      🟩 NVHPC              Pass: 100%/2   | Total: 33m 57s | Avg: 16m 58s | Max: 17m 22s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 25m 47s | Avg: 12m 53s | Max: 19m 26s
      🟩 v100               Pass: 100%/36  | Total: 10h 10m | Avg: 16m 57s | Max: 36m 50s | Hits: 590%/3120  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  7h 49m | Avg: 15m 08s | Max: 36m 50s | Hits: 590%/3120  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 25m 46s | Avg: 25m 46s | Max: 25m 46s
      🟩 GraphCapture       Pass: 100%/1   | Total: 17m 19s | Avg: 17m 19s | Max: 17m 19s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 11m | Avg: 23m 46s | Max: 26m 04s
      🟩 TestGPU            Pass: 100%/2   | Total: 52m 07s | Avg: 26m 03s | Max: 29m 49s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 25m 47s | Avg: 12m 53s | Max: 19m 26s
      🟩 90a                Pass: 100%/1   | Total:  6m 27s | Avg:  6m 27s | Max:  6m 27s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  4h 01m | Avg: 17m 16s | Max: 36m 10s | Hits: 590%/2340  
      🟩 20                 Pass: 100%/24  | Total:  6h 34m | Avg: 16m 25s | Max: 36m 50s | Hits: 590%/780   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 6h 07m | Avg: 9m 56s | Max: 34m 30s | Hits: 365%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 20m 43s | Avg: 10m 21s | Max: 14m 34s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total:  5h 58m | Avg: 10m 13s | Max: 34m 30s | Hits: 365%/9220  
      🟩 arm64              Pass: 100%/2   | Total:  9m 34s | Avg:  4m 47s | Max:  5m 00s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total: 46m 38s | Avg:  9m 19s | Max: 26m 36s | Hits: 365%/1844  
      🟩 12.5               Pass: 100%/2   | Total: 27m 56s | Avg: 13m 58s | Max: 14m 18s
      🟩 12.6               Pass: 100%/30  | Total:  4h 53m | Avg:  9m 46s | Max: 34m 30s | Hits: 365%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 03s | Avg:  5m 01s | Max:  5m 03s
      🟩 nvcc12.0           Pass: 100%/5   | Total: 46m 38s | Avg:  9m 19s | Max: 26m 36s | Hits: 365%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 27m 56s | Avg: 13m 58s | Max: 14m 18s
      🟩 nvcc12.6           Pass: 100%/28  | Total:  4h 42m | Avg: 10m 06s | Max: 34m 30s | Hits: 365%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 03s | Avg:  5m 01s | Max:  5m 03s
      🟩 nvcc               Pass: 100%/35  | Total:  5h 57m | Avg: 10m 12s | Max: 34m 30s | Hits: 365%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 21m 05s | Avg:  5m 16s | Max:  5m 40s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 26s | Avg:  5m 26s | Max:  5m 26s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 53s | Avg:  5m 53s | Max:  5m 53s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 20s | Avg:  5m 20s | Max:  5m 20s
      🟩 Clang18            Pass: 100%/7   | Total: 45m 26s | Avg:  6m 29s | Max: 12m 16s
      🟩 GCC7               Pass: 100%/2   | Total: 10m 17s | Avg:  5m 08s | Max:  5m 25s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 13s | Avg:  5m 13s | Max:  5m 13s
      🟩 GCC9               Pass: 100%/2   | Total: 11m 00s | Avg:  5m 30s | Max:  5m 53s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 49s | Avg:  5m 49s | Max:  5m 49s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 38s | Avg:  5m 38s | Max:  5m 38s
      🟩 GCC12              Pass: 100%/1   | Total:  6m 08s | Avg:  6m 08s | Max:  6m 08s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 03m | Avg:  7m 56s | Max: 14m 34s
      🟩 MSVC14.29          Pass: 100%/2   | Total: 53m 58s | Avg: 26m 59s | Max: 27m 22s | Hits: 365%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 34m | Avg: 31m 39s | Max: 34m 30s | Hits: 365%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 27m 56s | Avg: 13m 58s | Max: 14m 18s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  1h 23m | Avg:  5m 56s | Max: 12m 16s
      🟩 GCC                Pass: 100%/16  | Total:  1h 47m | Avg:  6m 43s | Max: 14m 34s
      🟩 MSVC               Pass: 100%/5   | Total:  2h 28m | Avg: 29m 47s | Max: 34m 30s | Hits: 365%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total: 27m 56s | Avg: 13m 58s | Max: 14m 18s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total:  6h 07m | Avg:  9m 56s | Max: 34m 30s | Hits: 365%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  4h 38m | Avg:  8m 58s | Max: 31m 46s | Hits: 365%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 49m 18s | Avg: 16m 26s | Max: 34m 30s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 40m 02s | Avg: 13m 20s | Max: 14m 34s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 46s | Avg:  4m 46s | Max:  4m 46s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  2h 30m | Avg: 10m 45s | Max: 28m 42s | Hits: 365%/5532  
      🟩 20                 Pass: 100%/21  | Total:  3h 16m | Avg:  9m 21s | Max: 34m 30s | Hits: 365%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 01s | Avg: 5m 00s | Max: 8m 06s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  1m 55s | Avg:  1m 55s | Max:  1m 55s
      🟩 Test               Pass: 100%/1   | Total:  8m 06s | Avg:  8m 06s | Max:  8m 06s
    
  • 🟩 python: Pass: 100%/1 | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

Resolved review threads:
  • cub/cub/device/dispatch/dispatch_segmented_sort.cuh (outdated)
  • cub/cub/device/device_segmented_sort.cuh
  • cub/cub/device/dispatch/dispatch_segmented_sort.cuh (outdated)
  • cub/cub/device/dispatch/dispatch_segmented_sort.cuh (outdated)
@miscco (Collaborator) left a comment:

Minor nits

@bernhardmgruber removed their request for review January 14, 2025 08:51
@elstehle force-pushed the enh/large-seg-support-seg-sort branch from 83a390e to 56e3b23 on January 14, 2025 10:12
@elstehle mentioned this pull request Jan 14, 2025
github-actions bot (Contributor) commented

🟩 CI finished in 1h 16m: Pass: 100%/78 | Total: 16h 52m | Avg: 12m 58s | Max: 38m 00s | Hits: 422%/12340
  • 🟩 cub: Pass: 100%/38 | Total: 10h 06m | Avg: 15m 57s | Max: 38m 00s | Hits: 590%/3120

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total:  9h 37m | Avg: 16m 02s | Max: 38m 00s | Hits: 590%/3120  
      🟩 arm64              Pass: 100%/2   | Total: 28m 47s | Avg: 14m 23s | Max: 14m 41s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 18m | Avg: 15m 37s | Max: 31m 04s | Hits: 590%/780   
      🟩 12.5               Pass: 100%/2   | Total: 34m 56s | Avg: 17m 28s | Max: 18m 01s
      🟩 12.6               Pass: 100%/31  | Total:  8h 13m | Avg: 15m 55s | Max: 38m 00s | Hits: 590%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 18m 31s | Avg:  9m 15s | Max:  9m 28s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 18m | Avg: 15m 37s | Max: 31m 04s | Hits: 590%/780   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 34m 56s | Avg: 17m 28s | Max: 18m 01s
      🟩 nvcc12.6           Pass: 100%/29  | Total:  7h 54m | Avg: 16m 22s | Max: 38m 00s | Hits: 590%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 18m 31s | Avg:  9m 15s | Max:  9m 28s
      🟩 nvcc               Pass: 100%/36  | Total:  9h 48m | Avg: 16m 20s | Max: 38m 00s | Hits: 590%/3120  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 46m 39s | Avg: 11m 39s | Max: 12m 01s
      🟩 Clang15            Pass: 100%/1   | Total: 11m 42s | Avg: 11m 42s | Max: 11m 42s
      🟩 Clang16            Pass: 100%/1   | Total: 11m 30s | Avg: 11m 30s | Max: 11m 30s
      🟩 Clang17            Pass: 100%/1   | Total: 11m 53s | Avg: 11m 53s | Max: 11m 53s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 42m | Avg: 14m 41s | Max: 22m 57s
      🟩 GCC7               Pass: 100%/2   | Total: 23m 25s | Avg: 11m 42s | Max: 11m 56s
      🟩 GCC8               Pass: 100%/1   | Total: 11m 31s | Avg: 11m 31s | Max: 11m 31s
      🟩 GCC9               Pass: 100%/2   | Total: 23m 34s | Avg: 11m 47s | Max: 11m 53s
      🟩 GCC10              Pass: 100%/1   | Total: 12m 09s | Avg: 12m 09s | Max: 12m 09s
      🟩 GCC11              Pass: 100%/1   | Total: 12m 30s | Avg: 12m 30s | Max: 12m 30s
      🟩 GCC12              Pass: 100%/3   | Total: 37m 50s | Avg: 12m 36s | Max: 19m 31s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 05m | Avg: 15m 44s | Max: 22m 34s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 06m | Avg: 33m 25s | Max: 35m 47s | Hits: 590%/1560  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 13m | Avg: 36m 39s | Max: 38m 00s | Hits: 590%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 34m 56s | Avg: 17m 28s | Max: 18m 01s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  3h 04m | Avg: 13m 11s | Max: 22m 57s
      🟩 GCC                Pass: 100%/18  | Total:  4h 06m | Avg: 13m 42s | Max: 22m 34s
      🟩 MSVC               Pass: 100%/4   | Total:  2h 20m | Avg: 35m 02s | Max: 38m 00s | Hits: 590%/3120  
      🟩 NVHPC              Pass: 100%/2   | Total: 34m 56s | Avg: 17m 28s | Max: 18m 01s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 25m 52s | Avg: 12m 56s | Max: 19m 31s
      🟩 v100               Pass: 100%/36  | Total:  9h 40m | Avg: 16m 07s | Max: 38m 00s | Hits: 590%/3120  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  7h 42m | Avg: 14m 54s | Max: 38m 00s | Hits: 590%/3120  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 21m 59s | Avg: 21m 59s | Max: 21m 59s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 52s | Avg: 15m 52s | Max: 15m 52s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 01m | Avg: 20m 23s | Max: 22m 33s
      🟩 TestGPU            Pass: 100%/2   | Total: 45m 31s | Avg: 22m 45s | Max: 22m 57s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 25m 52s | Avg: 12m 56s | Max: 19m 31s
      🟩 90a                Pass: 100%/1   | Total:  6m 53s | Avg:  6m 53s | Max:  6m 53s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  3h 55m | Avg: 16m 47s | Max: 35m 47s | Hits: 590%/2340  
      🟩 20                 Pass: 100%/24  | Total:  6h 11m | Avg: 15m 28s | Max: 38m 00s | Hits: 590%/780   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 6h 08m | Avg: 9m 57s | Max: 33m 59s | Hits: 365%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 16m 37s | Avg:  8m 18s | Max: 11m 00s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total:  5h 58m | Avg: 10m 14s | Max: 33m 59s | Hits: 365%/9220  
      🟩 arm64              Pass: 100%/2   | Total:  9m 35s | Avg:  4m 47s | Max:  5m 00s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total: 48m 22s | Avg:  9m 40s | Max: 28m 01s | Hits: 365%/1844  
      🟩 12.5               Pass: 100%/2   | Total: 29m 27s | Avg: 14m 43s | Max: 15m 05s
      🟩 12.6               Pass: 100%/30  | Total:  4h 50m | Avg:  9m 40s | Max: 33m 59s | Hits: 365%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  5m 06s
      🟩 nvcc12.0           Pass: 100%/5   | Total: 48m 22s | Avg:  9m 40s | Max: 28m 01s | Hits: 365%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 29m 27s | Avg: 14m 43s | Max: 15m 05s
      🟩 nvcc12.6           Pass: 100%/28  | Total:  4h 40m | Avg: 10m 00s | Max: 33m 59s | Hits: 365%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  5m 06s
      🟩 nvcc               Pass: 100%/35  | Total:  5h 58m | Avg: 10m 13s | Max: 33m 59s | Hits: 365%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 21m 09s | Avg:  5m 17s | Max:  5m 35s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 42s | Avg:  5m 42s | Max:  5m 42s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 17s | Avg:  5m 17s | Max:  5m 17s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 46s | Avg:  5m 46s | Max:  5m 46s
      🟩 Clang18            Pass: 100%/7   | Total: 44m 42s | Avg:  6m 23s | Max: 12m 26s
      🟩 GCC7               Pass: 100%/2   | Total: 10m 53s | Avg:  5m 26s | Max:  5m 34s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 19s | Avg:  5m 19s | Max:  5m 19s
      🟩 GCC9               Pass: 100%/2   | Total: 10m 21s | Avg:  5m 10s | Max:  5m 26s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 28s | Avg:  5m 28s | Max:  5m 28s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 49s | Avg:  5m 49s | Max:  5m 49s
      🟩 GCC12              Pass: 100%/1   | Total:  6m 23s | Avg:  6m 23s | Max:  6m 23s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 00m | Avg:  7m 33s | Max: 15m 37s
      🟩 MSVC14.29          Pass: 100%/2   | Total: 57m 26s | Avg: 28m 43s | Max: 29m 25s | Hits: 365%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 33m | Avg: 31m 19s | Max: 33m 59s | Hits: 365%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 29m 27s | Avg: 14m 43s | Max: 15m 05s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  1h 22m | Avg:  5m 54s | Max: 12m 26s
      🟩 GCC                Pass: 100%/16  | Total:  1h 44m | Avg:  6m 32s | Max: 15m 37s
      🟩 MSVC               Pass: 100%/5   | Total:  2h 31m | Avg: 30m 16s | Max: 33m 59s | Hits: 365%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total: 29m 27s | Avg: 14m 43s | Max: 15m 05s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total:  6h 08m | Avg:  9m 57s | Max: 33m 59s | Hits: 365%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  4h 40m | Avg:  9m 02s | Max: 30m 39s | Hits: 365%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 48m 45s | Avg: 16m 15s | Max: 33m 59s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 39m 03s | Avg: 13m 01s | Max: 15m 37s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 23s | Avg:  4m 23s | Max:  4m 23s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  2h 35m | Avg: 11m 06s | Max: 30m 39s | Hits: 365%/5532  
      🟩 20                 Pass: 100%/21  | Total:  3h 16m | Avg:  9m 20s | Max: 33m 59s | Hits: 365%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 11m 10s | Avg: 5m 35s | Max: 9m 10s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 00s | Avg:  2m 00s | Max:  2m 00s
      🟩 Test               Pass: 100%/1   | Total:  9m 10s | Avg:  9m 10s | Max:  9m 10s
    
  • 🟩 python: Pass: 100%/1 | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@elstehle merged commit 08420d4 into NVIDIA:main Jan 14, 2025
95 checks passed
shwina pushed a commit to shwina/cccl that referenced this pull request Jan 16, 2025
…o `DeviceSegmentedSort` (NVIDIA#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025
…o `DeviceSegmentedSort` (NVIDIA#3308)

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025
…o `DeviceSegmentedSort` (NVIDIA#3308)
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 20, 2025
implement `add_sat`

split `signed`/`unsigned` implementation, improve implementation for MSVC

improve device `add_sat` implementation

add `add_sat` test

improve generic `add_sat` implementation for signed types

implement `sub_sat`

allow more msvc intrinsics on x86

add op tests

partially implement `mul_sat`

implement `div_sat` and `saturate_cast`

add `saturate_cast` test

simplify `div_sat` test
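
The saturating-arithmetic work above targets the C++26-style `add_sat` family. As a reference for the intended clamping semantics only (not the library's actual implementation, which per the notes above uses per-compiler builtins and device-specific paths), a minimal sketch:

```cpp
// Reference semantics for saturating addition (illustrative only):
// clamp the mathematical sum into the representable range instead of
// wrapping (unsigned) or hitting UB (signed overflow).
#include <limits>
#include <type_traits>

template <class T>
constexpr T add_sat_reference(T x, T y)
{
  static_assert(std::is_integral_v<T>);
  if (x > T{0} && y > std::numeric_limits<T>::max() - x)
  {
    return std::numeric_limits<T>::max(); // would overflow upward
  }
  if (x < T{0} && y < std::numeric_limits<T>::min() - x)
  {
    return std::numeric_limits<T>::min(); // would overflow downward (signed only)
  }
  return static_cast<T>(x + y);
}

static_assert(add_sat_reference<signed char>(100, 100) == 127);
static_assert(add_sat_reference<unsigned char>(200, 100) == 255);
```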

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral
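
A small usage sketch of these interfaces, assuming they mirror their host `<cstdlib>` counterparts as the bullets describe:

```cpp
#include <cuda/std/cstdlib>

__host__ __device__ void cstdlib_demo()
{
  auto qr = cuda::std::div(7, 2);  // quotient/remainder pair: qr.quot == 3, qr.rem == 1
  int m   = cuda::std::abs(-5);    // m == 5
  float f = cuda::std::abs(-1.5f); // floating-point overload mentioned above
  (void) qr; (void) m; (void) f;
}
```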

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates suppression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup
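
A hedged usage sketch of the new two-output-iterator interface; the two-phase temp-storage call is the standard CUB pattern, while the exact parameter order should be checked against the headers:

```cpp
#include <cub/device/device_reduce.cuh>

#include <cuda_runtime.h>

void argmin_two_outputs(const float* d_in, float* d_min_out, int* d_index_out,
                        int num_items, cudaStream_t stream)
{
  void* d_temp_storage      = nullptr;
  size_t temp_storage_bytes = 0;
  // First call: query temporary storage size.
  cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes,
                            d_in, d_min_out, d_index_out, num_items, stream);
  cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);
  // Second call: run the reduction; extremum and index land in separate outputs.
  cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes,
                            d_in, d_min_out, d_index_out, num_items, stream);
  cudaFreeAsync(d_temp_storage, stream);
}
```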

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>

* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>
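
The core idea of the fix, as a simplified sketch rather than the actual libcu++ code path (`__isLocal` is the CUDA intrinsic that later bullets guard for NVHPC):

```cpp
// Simplified sketch: an "atomic" whose target lives in automatic (local)
// storage is private to the calling thread, so a plain read-modify-write
// is safe and sidesteps the UB of atomic ops on local-space addresses.
__device__ int fetch_add(int* ptr, int v)
{
  if (__isLocal(ptr)) // automatic storage: no other thread can observe it
  {
    int old = *ptr;
    *ptr    = old + v;
    return old;
  }
  return atomicAdd(ptr, v); // shared/global: a real hardware atomic
}
```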

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <[email protected]>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053
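
What the documented opt-in looks like in user code, sketched with a made-up `scale` functor:

```cpp
#include <cuda/functional>

#include <thrust/device_vector.h>
#include <thrust/transform.h>

struct scale
{
  float s;
  __host__ __device__ float operator()(float x) const { return s * x; }
};

void run(thrust::device_vector<float>& v)
{
  // Promise that the functor does not care about its argument's address,
  // so thrust::transform may pass copies (enabling, e.g., vectorized loads).
  thrust::transform(v.begin(), v.end(), v.begin(),
                    cuda::proclaim_copyable_arguments(scale{2.0f}));
}
```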

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test
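
A schematic of the two fixes; the inline PTX below is an assumption standing in for the proxy fence the message names, and the elided steps are marked:

```cpp
#include <cuda/barrier>

__global__ void fixed_test_pattern(int* gmem)
{
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;
  // ... initialize bar, write test data to gmem ...

  // Fix 1: order the generic-proxy global writes before the async-proxy copy.
  asm volatile("fence.proxy.async.global;" ::: "memory");

  // ... memcpy_async_tx(...), bar.arrive(...) / bar.wait(...) ...

  // Fix 2: keep all threads alive until everyone is done with the barrier,
  // since its destructor runs when the kernel ends.
  __syncthreads();
}
```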

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <[email protected]>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fix thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <[email protected]>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <[email protected]>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abstract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <[email protected]>

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <[email protected]>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple holding a universal reference that is out of scope by the time we use it; use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.
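
For reference, a sketch of how a launch opts into PDL with the plain CUDA runtime API; the triple-chevron change wires up the equivalent internally:

```cpp
#include <cuda_runtime.h>

__global__ void kernel();

void launch_with_pdl(cudaStream_t stream)
{
  // Programmatic Dependent Launch: let this kernel overlap with the tail
  // of the preceding kernel on the same stream.
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t config{};
  config.gridDim  = dim3(1);
  config.blockDim = dim3(1);
  config.stream   = stream;
  config.attrs    = &attr;
  config.numAttrs = 1;

  cudaLaunchKernelEx(&config, kernel);
}
```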

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel, ...) in the bug report template as in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <[email protected]>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
That is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop CUB's util_compiler.cuh (#3302)

All contained macros were deprecated

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <[email protected]>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressible warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view
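
A usage sketch, assuming the libcu++ view mirrors `std::ranges::empty_view`:

```cpp
#include <cuda/std/ranges>

// views::empty is a variable template; it models an always-empty range.
constexpr auto ev = cuda::std::ranges::views::empty<int>;
static_assert(ev.size() == 0);
static_assert(ev.empty());
```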

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* Rework how we build task_dep objects based on access mode tags
  - pack_state is now responsible for using a const_cast for read-only data

* Greatly simplify the previous attempt: do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessary, so we can simplify it and not even do some dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.
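
A minimal sketch of the documented behavior, with a hypothetical `is_positive` predicate:

```cpp
#include <thrust/execution_policy.h>
#include <thrust/logical.h>

struct is_positive
{
  __host__ __device__ bool operator()(int x) const { return x > 0; }
};

bool vacuous_truth()
{
  int* first = nullptr;
  // [first, first) is empty: there is no counterexample, so all_of is true.
  return thrust::all_of(thrust::seq, first, first, is_positive{});
}
```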

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events!

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <[email protected]>

---------

Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Georgii Evtushenko <[email protected]>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`
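
A hedged usage sketch of the new specializations:

```cpp
#include <cuda_fp16.h>

#include <cuda/std/limits>

__host__ __device__ void half_limits_demo()
{
  static_assert(cuda::std::numeric_limits<__half>::is_specialized);
  __half lo  = cuda::std::numeric_limits<__half>::lowest();
  __half eps = cuda::std::numeric_limits<__half>::epsilon();
  (void) lo; (void) eps;
}
```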

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements
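
In user-facing terms, the offset type is now deduced from the segment-offset iterators instead of being hardcoded to `int`, so 64-bit offsets unlock item and segment counts beyond the 32-bit range. A hedged usage sketch:

```cpp
#include <cub/device/device_segmented_sort.cuh>

#include <cstdint>

void sort_many_segments(const float* d_keys_in, float* d_keys_out,
                        const std::int64_t* d_offsets, // num_segments + 1 entries
                        std::int64_t num_items, std::int64_t num_segments)
{
  void* d_temp_storage      = nullptr;
  size_t temp_storage_bytes = 0;
  // The 64-bit offset iterators drive the internal offset-type selection.
  cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                     d_keys_in, d_keys_out, num_items, num_segments,
                                     d_offsets, d_offsets + 1);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                     d_keys_in, d_keys_out, num_items, num_segments,
                                     d_offsets, d_offsets + 1);
  cudaFree(d_temp_storage);
}
```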

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCC_ASSUME
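
The shape of such a macro, as a hedged sketch with a hypothetical `MY_ASSUME` name (the real `_CCCL_BUILTIN_ASSUME` dispatch lives in the CCCL config headers):

```cpp
// One macro, mapped onto per-compiler assumption builtins.
#if defined(__CUDA_ARCH__) || defined(__clang__)
#  define MY_ASSUME(expr) __builtin_assume(expr)
#elif defined(_MSC_VER)
#  define MY_ASSUME(expr) __assume(expr)
#else // GCC and friends: synthesize the assumption from unreachable()
#  define MY_ASSUME(expr) ((expr) ? (void) 0 : __builtin_unreachable())
#endif
```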

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8  (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions
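
A tiny sketch of what host/device availability means in practice, assuming these functions landed in `<cuda/std/cmath>`:

```cpp
#include <cuda/std/cmath>

__host__ __device__ float cmath_demo(float x)
{
  // The same call compiles in both host and device translation:
  return cuda::std::sqrt(cuda::std::exp(x) + 1.0f);
}
```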

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run, so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)
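
A hedged sketch of the new utilities; the names and header are assumed from the commit title and the existing `cuda::ceil_div`:

```cpp
#include <cuda/cmath>

static_assert(cuda::round_up(10, 4) == 12);  // next multiple of 4
static_assert(cuda::round_down(10, 4) == 8); // previous multiple of 4
static_assert(cuda::ceil_div(10, 4) == 3);   // companion utility
```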

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_cuda_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404

move to c++17, finalize device optimization

fix msvc compilation, update tests

Deprectate C++11 and C++14 for libcu++ (#3173)

* Deprectate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use the our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates supression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>

* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <[email protected]>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <[email protected]>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fx thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <[email protected]>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <[email protected]>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abtract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <[email protected]>

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <[email protected]>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <[email protected]>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
Those is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add excape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSV_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <[email protected]>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data

* Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessarily so we can simplify it and not even do some dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events !

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <[email protected]>

---------

Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Georgii Evtushenko <[email protected]>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes
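
A rough illustration of the analytical-verification idea mentioned above: sort an input whose sorted form is known in closed form, then verify in O(n) against that form instead of sorting a reference copy. The helper below is hypothetical, not the PR's test code:

```
#include <thrust/device_vector.h>
#include <thrust/equal.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

#include <cstddef>

// Hypothetical helper: sorting the descending sequence n-1, ..., 1, 0 must
// yield 0, 1, ..., n-1, which we can check against a counting iterator
// without materializing a reference result.
bool verify_sort_analytically(std::size_t n)
{
  thrust::device_vector<int> keys(n);
  thrust::sequence(keys.rbegin(), keys.rend()); // descending when read forward
  thrust::sort(keys.begin(), keys.end());
  return thrust::equal(keys.begin(), keys.end(), thrust::make_counting_iterator(0));
}
```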

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on the type returned by the segment offset begin/end iterators (see the sketch after this list)

* minor style improvements
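
For illustration, a minimal sketch of the user-facing effect: the 64-bit value type of the offset iterators selects the wide offset path, and `num_items`/`num_segments` may now exceed `INT_MAX` (the toy sizes below obviously do not):

```
#include <cub/device/device_segmented_sort.cuh>

#include <thrust/device_vector.h>

#include <cstddef>
#include <cstdint>

int main()
{
  thrust::device_vector<int> keys_in{3, 1, 2, 9, 7, 8};
  thrust::device_vector<int> keys_out(keys_in.size());
  // Two segments, [0, 3) and [3, 6), described by 64-bit offsets
  thrust::device_vector<std::int64_t> offsets{0, 3, 6};

  const auto num_items            = static_cast<std::int64_t>(keys_in.size());
  const std::int64_t num_segments = 2;

  // First call computes the required temporary storage size
  std::size_t temp_storage_bytes = 0;
  cub::DeviceSegmentedSort::SortKeys(nullptr, temp_storage_bytes,
                                     thrust::raw_pointer_cast(keys_in.data()),
                                     thrust::raw_pointer_cast(keys_out.data()),
                                     num_items, num_segments,
                                     offsets.begin(), offsets.begin() + 1);

  thrust::device_vector<unsigned char> temp_storage(temp_storage_bytes);
  cub::DeviceSegmentedSort::SortKeys(thrust::raw_pointer_cast(temp_storage.data()),
                                     temp_storage_bytes,
                                     thrust::raw_pointer_cast(keys_in.data()),
                                     thrust::raw_pointer_cast(keys_out.data()),
                                     num_items, num_segments,
                                     offsets.begin(), offsets.begin() + 1);
  return 0;
}
```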

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCCL_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306
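
The libcu++ type is the natural replacement; a minimal sketch:

```
#include <cuda/std/optional>

__host__ __device__ cuda::std::optional<int> first_positive(const int* data, int n)
{
  for (int i = 0; i < n; ++i)
  {
    if (data[i] > 0)
    {
      return data[i];
    }
  }
  return cuda::std::nullopt; // no positive element found
}
```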

Upgrade to Catch2 3.8 (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions (see the sketch after this list)

* Implement exponential functions
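
A sketch of the host/device usability these commits describe, assuming the roots functions land as `cuda::std::sqrt` in `<cuda/std/cmath>`:

```
#include <cuda/std/cmath>

// Callable from both host and device code paths.
__host__ __device__ float hypotenuse(float a, float b)
{
  return cuda::std::sqrt(a * a + b * b);
}
```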

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <[email protected]>
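
A minimal sketch of the preferred spelling going forward:

```
#include <cuda/std/type_traits>

// cuda::std::remove_cvref_t replaces the deprecated thrust::remove_cvref
static_assert(cuda::std::is_same_v<cuda::std::remove_cvref_t<const int&>, int>);
```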

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run, so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition, which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)
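
A sketch of the intended usage, assuming the utilities are exposed as constexpr `cuda::round_up` / `cuda::round_down` in `<cuda/cmath>`:

```
#include <cuda/cmath>

// round_up(x, m): smallest multiple of m that is >= x
// round_down(x, m): largest multiple of m that is <= x
static_assert(cuda::round_up(7, 4) == 8);
static_assert(cuda::round_down(7, 4) == 4);
```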

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo was archived by its owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long that it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404
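
A migration sketch; the names follow the Thrust change above (`cuda::std::min`/`cuda::std::max`), while the header placement is an assumption:

```
#include <cuda/std/algorithm> // assumption: cuda::std::min/max are reachable here

__host__ __device__ int clamp_to_limit(int v, int limit)
{
  // replaces the deprecated cub::min(v, limit)
  return cuda::std::min(v, limit);
}
```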

Fix CI issues (#3443)

update docs

fix review

restrict allowed types

replace constexpr implementations with generic

optimize `__is_arithmetic_integral`
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
…o `DeviceSegmentedSort` (NVIDIA#3308)

davebayer added a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
update docs

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
…o `DeviceSegmentedSort` (NVIDIA#3308)

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 23, 2025
…o `DeviceSegmentedSort` (NVIDIA#3308)

davebayer added a commit to davebayer/cccl that referenced this pull request Jan 23, 2025
Cleanup util_arch (NVIDIA#2773)
