
Adds support for large number of items and large number of segments to DeviceSegmentedSort #3308

Merged
15 commits merged on Jan 14, 2025

Conversation

@elstehle (Collaborator) commented Jan 9, 2025

Description

⛓️ Depends on #3246

Closes #3132

Benchmark results:

We switched from a hardcoded int offset type to choose_offset_t. The following table shows the performance on H100 relative to main for (a) a user-provided 32-bit offset type (the leftmost two data columns) and (b) a user-provided 64-bit offset type (the rightmost two data columns).
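
A minimal usage sketch of what this enables (the 64-bit `num_items`/`num_segments` parameters reflect this PR's switch away from `int`; buffer setup and error handling are omitted, and the function name is illustrative):

```cpp
#include <cub/device/device_segmented_sort.cuh>

#include <cstddef>
#include <cstdint>

// Sorts float keys segment-by-segment, where the total number of keys may
// exceed INT_MAX. d_offsets holds num_segments + 1 offsets in CSR style;
// since they are 64-bit, choose_offset_t selects a 64-bit offset type.
void sort_large(const float* d_keys_in, float* d_keys_out,
                std::int64_t num_items, std::int64_t num_segments,
                const std::int64_t* d_offsets, cudaStream_t stream)
{
  void* d_temp_storage           = nullptr;
  std::size_t temp_storage_bytes = 0;

  // First call queries the required temporary storage size.
  cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                     d_keys_in, d_keys_out, num_items,
                                     num_segments, d_offsets, d_offsets + 1,
                                     stream);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);

  // Second call performs the segmented sort.
  cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                     d_keys_in, d_keys_out, num_items,
                                     num_segments, d_offsets, d_offsets + 1,
                                     stream);
  cudaFree(d_temp_storage);
}
```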

      Diff u32 vs i32.main   Diff u32 vs i32.main   Diff u64 vs i32.main   Diff u64 vs i32.main
      (any num items)        (2^30 num items)       (any num items)        (2^30 num items)
min   98.66%                 98.66%                 74.74%                 76.05%
max   102.24%                102.24%                106.10%                106.10%
avg   100.33%                100.26%                98.95%                 99.10%
Performance comparison for different offset types over main.i32 (i32/u32/i64/u64 are absolute times; u32/i32, i64/i32, and u64/i32 are run times relative to the i32 baseline):

T{ct} | Segments{io} | Elements{io} | Entropy | i32 | u32 | u32/i32 time | i64 | i64/i32 time | u64 | u64/i32 time
I8 2^12 = 4096 2^22 = 4194304 1 171.064 170.077 99.42% 171.608 100.32% 171.669 100.35%
I8 2^12 = 4096 2^26 = 67108864 1 989.288 984.608 99.53% 1004 101.49% 1007 101.79%
I8 2^12 = 4096 2^30 = 1073741824 1 15242 15185 99.63% 15336 100.62% 15422 101.18%
I8 2^16 = 65536 2^22 = 4194304 1 323.736 320.816 99.10% 321.864 99.42% 322.058 99.48%
I8 2^16 = 65536 2^26 = 67108864 1 1805 1792 99.28% 1808 100.17% 1807 100.11%
I8 2^16 = 65536 2^30 = 1073741824 1 10383 10300 99.20% 10525 101.37% 10545 101.56%
I8 2^20 = 1048576 2^22 = 4194304 1 152.102 151.422 99.55% 156.698 103.02% 157.003 103.22%
I8 2^20 = 1048576 2^26 = 67108864 1 4643 4648 100.11% 4683 100.86% 4676 100.71%
I8 2^20 = 1048576 2^30 = 1073741824 1 27983 28023 100.14% 28284 101.08% 28261 100.99%
I8 2^12 = 4096 2^22 = 4194304 0.201 169.123 169.35 100.13% 170.66 100.91% 170.81 101.00%
I8 2^12 = 4096 2^26 = 67108864 0.201 969.104 968.397 99.93% 987.917 101.94% 988.25 101.98%
I8 2^12 = 4096 2^30 = 1073741824 0.201 14944 15025 100.54% 15094 101.00% 15163 101.47%
I8 2^16 = 65536 2^22 = 4194304 0.201 321.08 320.83 99.92% 321.694 100.19% 322.101 100.32%
I8 2^16 = 65536 2^26 = 67108864 0.201 1783 1784 100.06% 1800 100.95% 1799 100.90%
I8 2^16 = 65536 2^30 = 1073741824 0.201 10120 10128 100.08% 10349 102.26% 10373 102.50%
I8 2^20 = 1048576 2^22 = 4194304 0.201 151.437 150.722 99.53% 156.332 103.23% 156.826 103.56%
I8 2^20 = 1048576 2^26 = 67108864 0.201 4643 4645 100.04% 4680 100.80% 4674 100.67%
I8 2^20 = 1048576 2^30 = 1073741824 0.201 27877 27906 100.10% 28166 101.04% 28145 100.96%
I16 2^12 = 4096 2^22 = 4194304 1 331.151 334.64 101.05% 335.582 101.34% 336.183 101.52%
I16 2^12 = 4096 2^26 = 67108864 1 2148 2159 100.51% 2196 102.23% 2198 102.33%
I16 2^12 = 4096 2^30 = 1073741824 1 34572 34688 100.34% 34521 99.85% 34527 99.87%
I16 2^16 = 65536 2^22 = 4194304 1 656.383 666.756 101.58% 661.087 100.72% 661.379 100.76%
I16 2^16 = 65536 2^26 = 67108864 1 4142 4194 101.26% 4210 101.64% 4214 101.74%
I16 2^16 = 65536 2^30 = 1073741824 1 25701 25861 100.62% 26508 103.14% 26501 103.11%
I16 2^20 = 1048576 2^22 = 4194304 1 166.968 167.761 100.47% 172.948 103.58% 173.548 103.94%
I16 2^20 = 1048576 2^26 = 67108864 1 10198 10359 101.58% 10305 101.05% 10308 101.08%
I16 2^20 = 1048576 2^30 = 1073741824 1 65475 66307 101.27% 66562 101.66% 66604 101.72%
I16 2^12 = 4096 2^22 = 4194304 0.201 329.561 333.422 101.17% 334.258 101.43% 335.089 101.68%
I16 2^12 = 4096 2^26 = 67108864 0.201 2109 2124 100.71% 2164 102.61% 2167 102.75%
I16 2^12 = 4096 2^30 = 1073741824 0.201 34106 34248 100.42% 34201 100.28% 34213 100.31%
I16 2^16 = 65536 2^22 = 4194304 0.201 655.958 666.458 101.60% 660.648 100.71% 661.596 100.86%
I16 2^16 = 65536 2^26 = 67108864 0.201 4130 4183 101.28% 4198 101.65% 4202 101.74%
I16 2^16 = 65536 2^30 = 1073741824 0.201 25185 25389 100.81% 26057 103.46% 26078 103.55%
I16 2^20 = 1048576 2^22 = 4194304 0.201 166.447 167.336 100.53% 173.288 104.11% 172.969 103.92%
I16 2^20 = 1048576 2^26 = 67108864 0.201 10194 10355 101.58% 10298 101.02% 10302 101.06%
I16 2^20 = 1048576 2^30 = 1073741824 0.201 65315 66158 101.29% 66389 101.64% 66447 101.73%
I32 2^12 = 4096 2^22 = 4194304 1 531.756 538.586 101.28% 544.199 102.34% 544.732 102.44%
I32 2^12 = 4096 2^26 = 67108864 1 4334 4332 99.95% 4415 101.87% 4418 101.94%
I32 2^12 = 4096 2^30 = 1073741824 1 79322 79113 99.74% 78212 98.60% 78226 98.62%
I32 2^16 = 65536 2^22 = 4194304 1 1005 1026 102.09% 1021 101.59% 1021 101.59%
I32 2^16 = 65536 2^26 = 67108864 1 6736 6848 101.66% 6889 102.27% 6892 102.32%
I32 2^16 = 65536 2^30 = 1073741824 1 52419 52379 99.92% 53737 102.51% 53750 102.54%
I32 2^20 = 1048576 2^22 = 4194304 1 159.591 162.123 101.59% 166.05 104.05% 164.699 103.20%
I32 2^20 = 1048576 2^26 = 67108864 1 15913 16124 101.33% 16097 101.16% 16053 100.88%
I32 2^20 = 1048576 2^30 = 1073741824 1 107649 108443 100.74% 109075 101.32% 109081 101.33%
I32 2^12 = 4096 2^22 = 4194304 0.201 534.099 536.152 100.38% 540.949 101.28% 540.975 101.29%
I32 2^12 = 4096 2^26 = 67108864 0.201 4280 4224 98.69% 4313 100.77% 4312 100.75%
I32 2^12 = 4096 2^30 = 1073741824 0.201 78510 77641 98.89% 76831 97.86% 76788 97.81%
I32 2^16 = 65536 2^22 = 4194304 0.201 1013 1026 101.28% 1021 100.79% 1020 100.69%
I32 2^16 = 65536 2^26 = 67108864 0.201 6787 6833 100.68% 6873 101.27% 6871 101.24%
I32 2^16 = 65536 2^30 = 1073741824 0.201 51874 51179 98.66% 52563 101.33% 52561 101.32%
I32 2^20 = 1048576 2^22 = 4194304 0.201 158.486 158.639 100.10% 165.821 104.63% 164.546 103.82%
I32 2^20 = 1048576 2^26 = 67108864 0.201 15905 16103 101.24% 16082 101.11% 16087 101.14%
I32 2^20 = 1048576 2^30 = 1073741824 0.201 107473 108220 100.70% 108789 101.22% 108796 101.23%
I64 2^12 = 4096 2^22 = 4194304 1 992.396 995.195 100.28% 823.864 83.02% 827.203 83.35%
I64 2^12 = 4096 2^26 = 67108864 1 13335 13358 100.17% 12605 94.53% 12683 95.11%
I64 2^12 = 4096 2^30 = 1073741824 1 263087 263605 100.20% 227206 86.36% 228033 86.68%
I64 2^16 = 65536 2^22 = 4194304 1 2499 2508 100.36% 1872 74.91% 1871 74.87%
I64 2^16 = 65536 2^26 = 67108864 1 12189 12237 100.39% 9363 76.82% 9386 77.00%
I64 2^16 = 65536 2^30 = 1073741824 1 164230 164130 99.94% 150192 91.45% 150436 91.60%
I64 2^20 = 1048576 2^22 = 4194304 1 456.432 459.818 100.74% 399.507 87.53% 398.75 87.36%
I64 2^20 = 1048576 2^26 = 67108864 1 39183 39567 100.98% 29295 74.76% 29295 74.76%
I64 2^20 = 1048576 2^30 = 1073741824 1 191465 193515 101.07% 145777 76.14% 146110 76.31%
I64 2^12 = 4096 2^22 = 4194304 0.201 980.146 987.815 100.78% 815.258 83.18% 820.249 83.69%
I64 2^12 = 4096 2^26 = 67108864 0.201 13039 13085 100.35% 12278 94.16% 12367 94.85%
I64 2^12 = 4096 2^30 = 1073741824 0.201 256766 257719 100.37% 222320 86.58% 223453 87.03%
I64 2^16 = 65536 2^22 = 4194304 0.201 2481 2506 101.01% 1869 75.33% 1869 75.33%
I64 2^16 = 65536 2^26 = 67108864 0.201 12051 12161 100.91% 9276 76.97% 9303 77.20%
I64 2^16 = 65536 2^30 = 1073741824 0.201 160395 160791 100.25% 146069 91.07% 146554 91.37%
I64 2^20 = 1048576 2^22 = 4194304 0.201 456.292 458.946 100.58% 399.642 87.58% 398.532 87.34%
I64 2^20 = 1048576 2^26 = 67108864 0.201 39165 39546 100.97% 29271 74.74% 29271 74.74%
I64 2^20 = 1048576 2^30 = 1073741824 0.201 190570 192387 100.95% 144503 75.83% 144933 76.05%
I128 2^12 = 4096 2^22 = 4194304 1 1816 1821 100.28% 1903 104.79% 1905 104.90%
I128 2^12 = 4096 2^26 = 67108864 1 45585 45616 100.07% 45823 100.52% 45829 100.54%
I128 2^12 = 4096 2^30 = 1073741824 1 803494 802234 99.84% 804028 100.07% 809422 100.74%
I128 2^16 = 65536 2^22 = 4194304 1 5426 5391 99.35% 5432 100.11% 5431 100.09%
I128 2^16 = 65536 2^26 = 67108864 1 20912 20874 99.82% 22138 105.86% 22158 105.96%
I128 2^16 = 65536 2^30 = 1073741824 1 579030 578849 99.97% 579926 100.15% 579673 100.11%
I128 2^20 = 1048576 2^22 = 4194304 1 1908 1897 99.42% 1914 100.31% 1913 100.26%
I128 2^20 = 1048576 2^26 = 67108864 1 86362 85799 99.35% 86501 100.16% 86509 100.17%
I128 2^20 = 1048576 2^30 = 1073741824 1 325561 324647 99.72% 345121 106.01% 345348 106.08%
I128 2^12 = 4096 2^22 = 4194304 0.201 1805 1804 99.94% 1885 104.43% 1890 104.71%
I128 2^12 = 4096 2^26 = 67108864 0.201 45078 44787 99.35% 45119 100.09% 45006 99.84%
I128 2^12 = 4096 2^30 = 1073741824 0.201 790849 790246 99.92% 789066 99.77% 788230 99.67%
I128 2^16 = 65536 2^22 = 4194304 0.201 5425 5390 99.35% 5430 100.09% 5433 100.15%
I128 2^16 = 65536 2^26 = 67108864 0.201 20733 20694 99.81% 21955 105.89% 21975 105.99%
I128 2^16 = 65536 2^30 = 1073741824 0.201 569609 569317 99.95% 570682 100.19% 570469 100.15%
I128 2^20 = 1048576 2^22 = 4194304 0.201 1907 1896 99.42% 1913 100.31% 1913 100.31%
I128 2^20 = 1048576 2^26 = 67108864 0.201 86335 85765 99.34% 86465 100.15% 86475 100.16%
I128 2^20 = 1048576 2^30 = 1073741824 0.201 322907 322056 99.74% 342227 105.98% 342593 106.10%
F32 2^12 = 4096 2^22 = 4194304 1 538.54 536.098 99.55% 546.961 101.56% 546.583 101.49%
F32 2^12 = 4096 2^26 = 67108864 1 4218 4206 99.72% 4387 104.01% 4390 104.08%
F32 2^12 = 4096 2^30 = 1073741824 1 76705 76848 100.19% 77639 101.22% 77684 101.28%
F32 2^16 = 65536 2^22 = 4194304 1 1039 1038 99.90% 1036 99.71% 1036 99.71%
F32 2^16 = 65536 2^26 = 67108864 1 6905 6882 99.67% 6968 100.91% 6970 100.94%
F32 2^16 = 65536 2^30 = 1073741824 1 51133 51170 100.07% 53807 105.23% 53809 105.23%
F32 2^20 = 1048576 2^22 = 4194304 1 169.133 171.395 101.34% 173.927 102.83% 174.545 103.20%
F32 2^20 = 1048576 2^26 = 67108864 1 16353 16429 100.46% 16448 100.58% 16454 100.62%
F32 2^20 = 1048576 2^30 = 1073741824 1 110096 109928 99.85% 111248 101.05% 111267 101.06%
F32 2^12 = 4096 2^22 = 4194304 0.201 540.257 538.898 99.75% 549.305 101.67% 549.943 101.79%
F32 2^12 = 4096 2^26 = 67108864 0.201 4164 4171 100.17% 4352 104.51% 4355 104.59%
F32 2^12 = 4096 2^30 = 1073741824 0.201 76053 76159 100.14% 77048 101.31% 77019 101.27%
F32 2^16 = 65536 2^22 = 4194304 0.201 1046 1047 100.10% 1044 99.81% 1045 99.90%
F32 2^16 = 65536 2^26 = 67108864 0.201 6941 6933 99.88% 7019 101.12% 7018 101.11%
F32 2^16 = 65536 2^30 = 1073741824 0.201 50694 50782 100.17% 53405 105.35% 53379 105.30%
F32 2^20 = 1048576 2^22 = 4194304 0.201 167.528 170.387 101.71% 173.007 103.27% 174.216 103.99%
F32 2^20 = 1048576 2^26 = 67108864 0.201 16257 16283 100.16% 16305 100.30% 16318 100.38%
F32 2^20 = 1048576 2^30 = 1073741824 0.201 109040 108883 99.86% 110188 101.05% 110189 101.05%
F64 2^12 = 4096 2^22 = 4194304 1 841.793 842.753 100.11% 873.314 103.74% 872.16 103.61%
F64 2^12 = 4096 2^26 = 67108864 1 13131 13422 102.22% 13193 100.47% 13092 99.70%
F64 2^12 = 4096 2^30 = 1073741824 1 231598 233485 100.81% 234794 101.38% 233545 100.84%
F64 2^16 = 65536 2^22 = 4194304 1 1889 1892 100.16% 1891 100.11% 1907 100.95%
F64 2^16 = 65536 2^26 = 67108864 1 9404 9400 99.96% 9743 103.60% 9757 103.75%
F64 2^16 = 65536 2^30 = 1073741824 1 159098 162490 102.13% 157836 99.21% 157135 98.77%
F64 2^20 = 1048576 2^22 = 4194304 1 392.044 392.222 100.05% 396.421 101.12% 396.611 101.16%
F64 2^20 = 1048576 2^26 = 67108864 1 29525 29571 100.16% 29836 101.05% 29587 100.21%
F64 2^20 = 1048576 2^30 = 1073741824 1 146251 146177 99.95% 152512 104.28% 150459 102.88%
F64 2^12 = 4096 2^22 = 4194304 0.201 833.185 834.042 100.10% 872.656 104.74% 858.737 103.07%
F64 2^12 = 4096 2^26 = 67108864 0.201 12806 13020 101.67% 12965 101.24% 12721 99.34%
F64 2^12 = 4096 2^30 = 1073741824 0.201 227853 229705 100.81% 232754 102.15% 229502 100.72%
F64 2^16 = 65536 2^22 = 4194304 0.201 1887 1891 100.21% 1906 101.01% 1890 100.16%
F64 2^16 = 65536 2^26 = 67108864 0.201 9341 9332 99.90% 9739 104.26% 9608 102.86%
F64 2^16 = 65536 2^30 = 1073741824 0.201 154879 158349 102.24% 152135 98.23% 150872 97.41%
F64 2^20 = 1048576 2^22 = 4194304 0.201 388.24 389.095 100.22% 395.222 101.80% 392.773 101.17%
F64 2^20 = 1048576 2^26 = 67108864 0.201 29508 29555 100.16% 29819 101.05% 29570 100.21%
F64 2^20 = 1048576 2^30 = 1073741824 0.201 145305 145135 99.88% 151392 104.19% 149473 102.87%

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle requested review from a team as code owners January 9, 2025 12:10
@elstehle force-pushed the enh/large-seg-support-seg-sort branch from 8889147 to 154861d on January 9, 2025 16:18
github-actions bot (Contributor) commented Jan 9, 2025

🟨 CI finished in 2h 28m: Pass: 92%/96 | Total: 2d 15h | Avg: 39m 32s | Max: 1h 30m | Hits: 352%/15012
  • 🟨 cub: Pass: 85%/47 | Total: 1d 15h | Avg: 50m 55s | Max: 1h 30m | Hits: 459%/3900

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  84%/45  | Total:  1d 14h | Avg: 50m 44s | Max:  1h 30m | Hits: 459%/3900  
      🟩 arm64              Pass: 100%/2   | Total:  1h 50m | Avg: 55m 02s | Max: 55m 26s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 53m | Avg: 56m 34s | Max: 58m 33s
      🔍 nvcc               Pass:  84%/45  | Total:  1d 14h | Avg: 50m 40s | Max:  1h 30m | Hits: 459%/3900  
    🔍 gpu: v100 🔍
      🟩 h100               Pass: 100%/2   | Total: 43m 26s | Avg: 21m 43s | Max: 23m 48s
      🔍 v100               Pass:  84%/45  | Total:  1d 15h | Avg: 52m 13s | Max:  1h 30m | Hits: 459%/3900  
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  82%/40  | Total:  1d 11h | Avg: 52m 31s | Max:  1h 05m | Hits: 459%/3900  
      🟩 DeviceLaunch       Pass: 100%/1   | Total:  1h 30m | Avg:  1h 30m | Max:  1h 30m
      🟩 GraphCapture       Pass: 100%/1   | Total: 27m 19s | Avg: 27m 19s | Max: 27m 19s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 29m | Avg: 29m 50s | Max: 48m 13s
      🟩 TestGPU            Pass: 100%/2   | Total:  1h 24m | Avg: 42m 27s | Max:  1h 01m
    🟨 ctk
      🟨 12.0               Pass:  62%/8   | Total:  7h 06m | Avg: 53m 21s | Max: 59m 48s | Hits: 445%/1560  
      🟩 12.5               Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
      🟨 12.6               Pass:  89%/37  | Total:  1d 06h | Avg: 49m 44s | Max:  1h 30m | Hits: 468%/2340  
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 53m | Avg: 56m 34s | Max: 58m 33s
      🟨 nvcc12.0           Pass:  62%/8   | Total:  7h 06m | Avg: 53m 21s | Max: 59m 48s | Hits: 445%/1560  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
      🟨 nvcc12.6           Pass:  88%/35  | Total:  1d 04h | Avg: 49m 21s | Max:  1h 30m | Hits: 468%/2340  
    🟨 cxx
      🟨 Clang9             Pass:  25%/4   | Total:  3h 27m | Avg: 51m 54s | Max: 52m 45s
      🟩 Clang10            Pass: 100%/1   | Total: 54m 23s | Avg: 54m 23s | Max: 54m 23s
      🟩 Clang11            Pass: 100%/1   | Total: 51m 07s | Avg: 51m 07s | Max: 51m 07s
      🟩 Clang12            Pass: 100%/1   | Total: 49m 47s | Avg: 49m 47s | Max: 49m 47s
      🟩 Clang13            Pass: 100%/1   | Total: 53m 38s | Avg: 53m 38s | Max: 53m 38s
      🟩 Clang14            Pass: 100%/1   | Total: 49m 41s | Avg: 49m 41s | Max: 49m 41s
      🟩 Clang15            Pass: 100%/1   | Total: 52m 39s | Avg: 52m 39s | Max: 52m 39s
      🟩 Clang16            Pass: 100%/1   | Total: 51m 48s | Avg: 51m 48s | Max: 51m 48s
      🟩 Clang17            Pass: 100%/1   | Total: 50m 17s | Avg: 50m 17s | Max: 50m 17s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 21m | Avg: 45m 55s | Max: 58m 33s
      🟨 GCC7               Pass:  25%/4   | Total:  3h 27m | Avg: 51m 57s | Max: 53m 56s
      🟩 GCC8               Pass: 100%/1   | Total: 52m 34s | Avg: 52m 34s | Max: 52m 34s
      🟨 GCC9               Pass:  66%/3   | Total:  2h 36m | Avg: 52m 15s | Max: 54m 52s
      🟩 GCC10              Pass: 100%/1   | Total: 54m 48s | Avg: 54m 48s | Max: 54m 48s
      🟩 GCC11              Pass: 100%/1   | Total: 54m 35s | Avg: 54m 35s | Max: 54m 35s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 35m | Avg: 31m 40s | Max: 51m 34s
      🟩 GCC13              Pass: 100%/8   | Total:  6h 51m | Avg: 51m 29s | Max:  1h 30m
      🟩 MSVC14.29          Pass: 100%/3   | Total:  2h 53m | Avg: 57m 59s | Max: 59m 48s | Hits: 445%/2340  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 57m | Avg: 58m 44s | Max:  1h 04m | Hits: 480%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
    🟨 cxx_family
      🟨 Clang              Pass:  84%/19  | Total: 15h 42m | Avg: 49m 36s | Max: 58m 33s
      🟨 GCC                Pass:  80%/21  | Total: 17h 13m | Avg: 49m 12s | Max:  1h 30m
      🟩 MSVC               Pass: 100%/5   | Total:  4h 51m | Avg: 58m 17s | Max:  1h 04m | Hits: 459%/3900  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 05m
    🟨 std
      🟥 11                 Pass:   0%/5   | Total:  4h 17m | Avg: 51m 27s | Max: 53m 56s
      🟨 14                 Pass:  33%/3   | Total:  2h 43m | Avg: 54m 20s | Max: 59m 48s | Hits: 448%/780   
      🟩 17                 Pass: 100%/13  | Total: 11h 50m | Avg: 54m 39s | Max:  1h 00m | Hits: 460%/2340  
      🟩 20                 Pass: 100%/26  | Total: 21h 02m | Avg: 48m 33s | Max:  1h 30m | Hits: 468%/780   
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 43m 26s | Avg: 21m 43s | Max: 23m 48s
      🟩 90a                Pass: 100%/1   | Total: 21m 58s | Avg: 21m 58s | Max: 21m 58s
    
  • 🟩 thrust: Pass: 100%/46 | Total: 22h 46m | Avg: 29m 42s | Max: 52m 40s | Hits: 314%/11112

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 35m 37s | Avg: 17m 48s | Max: 23m 55s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total: 21h 50m | Avg: 29m 47s | Max: 52m 40s | Hits: 314%/11112 
      🟩 arm64              Pass: 100%/2   | Total: 56m 04s | Avg: 28m 02s | Max: 29m 51s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  4h 18m | Avg: 32m 21s | Max: 51m 31s | Hits: 300%/3704  
      🟩 12.5               Pass: 100%/2   | Total:  1h 36m | Avg: 48m 23s | Max: 49m 01s
      🟩 12.6               Pass: 100%/36  | Total: 16h 50m | Avg: 28m 04s | Max: 52m 40s | Hits: 321%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 50m 13s | Avg: 25m 06s | Max: 26m 03s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  4h 18m | Avg: 32m 21s | Max: 51m 31s | Hits: 300%/3704  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 36m | Avg: 48m 23s | Max: 49m 01s
      🟩 nvcc12.6           Pass: 100%/34  | Total: 16h 00m | Avg: 28m 15s | Max: 52m 40s | Hits: 321%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 50m 13s | Avg: 25m 06s | Max: 26m 03s
      🟩 nvcc               Pass: 100%/44  | Total: 21h 56m | Avg: 29m 54s | Max: 52m 40s | Hits: 314%/11112 
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  1h 48m | Avg: 27m 10s | Max: 30m 43s
      🟩 Clang10            Pass: 100%/1   | Total: 31m 24s | Avg: 31m 24s | Max: 31m 24s
      🟩 Clang11            Pass: 100%/1   | Total: 28m 50s | Avg: 28m 50s | Max: 28m 50s
      🟩 Clang12            Pass: 100%/1   | Total: 31m 03s | Avg: 31m 03s | Max: 31m 03s
      🟩 Clang13            Pass: 100%/1   | Total: 28m 38s | Avg: 28m 38s | Max: 28m 38s
      🟩 Clang14            Pass: 100%/1   | Total: 27m 17s | Avg: 27m 17s | Max: 27m 17s
      🟩 Clang15            Pass: 100%/1   | Total: 30m 00s | Avg: 30m 00s | Max: 30m 00s
      🟩 Clang16            Pass: 100%/1   | Total: 31m 01s | Avg: 31m 01s | Max: 31m 01s
      🟩 Clang17            Pass: 100%/1   | Total: 28m 59s | Avg: 28m 59s | Max: 28m 59s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 32m | Avg: 21m 50s | Max: 30m 21s
      🟩 GCC7               Pass: 100%/4   | Total:  1h 43m | Avg: 25m 49s | Max: 29m 53s
      🟩 GCC8               Pass: 100%/1   | Total: 29m 08s | Avg: 29m 08s | Max: 29m 08s
      🟩 GCC9               Pass: 100%/3   | Total:  1h 29m | Avg: 29m 59s | Max: 33m 25s
      🟩 GCC10              Pass: 100%/1   | Total: 29m 31s | Avg: 29m 31s | Max: 29m 31s
      🟩 GCC11              Pass: 100%/1   | Total: 29m 33s | Avg: 29m 33s | Max: 29m 33s
      🟩 GCC12              Pass: 100%/1   | Total: 33m 56s | Avg: 33m 56s | Max: 33m 56s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 51m | Avg: 21m 27s | Max: 35m 48s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  2h 25m | Avg: 48m 37s | Max: 51m 31s | Hits: 301%/5556  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 18m | Avg: 46m 02s | Max: 52m 40s | Hits: 327%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 36m | Avg: 48m 23s | Max: 49m 01s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  8h 18m | Avg: 26m 15s | Max: 31m 24s
      🟩 GCC                Pass: 100%/19  | Total:  8h 07m | Avg: 25m 37s | Max: 35m 48s
      🟩 MSVC               Pass: 100%/6   | Total:  4h 43m | Avg: 47m 19s | Max: 52m 40s | Hits: 314%/11112 
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 36m | Avg: 48m 23s | Max: 49m 01s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total: 22h 46m | Avg: 29m 42s | Max: 52m 40s | Hits: 314%/11112 
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 21h 17m | Avg: 31m 56s | Max: 52m 40s | Hits: 304%/9260  
      🟩 TestCPU            Pass: 100%/3   | Total: 50m 31s | Avg: 16m 50s | Max: 34m 45s | Hits: 365%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 38m 23s | Avg: 12m 47s | Max: 16m 01s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 16m 33s | Avg: 16m 33s | Max: 16m 33s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  1h 57m | Avg: 23m 28s | Max: 24m 53s
      🟩 14                 Pass: 100%/3   | Total:  1h 45m | Avg: 35m 00s | Max: 46m 14s | Hits: 302%/1852  
      🟩 17                 Pass: 100%/13  | Total:  7h 49m | Avg: 36m 08s | Max: 51m 31s | Hits: 304%/5556  
      🟩 20                 Pass: 100%/23  | Total: 10h 38m | Avg: 27m 46s | Max: 52m 40s | Hits: 335%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 57s | Avg: 4m 28s | Max: 6m 55s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  8m 57s | Avg:  4m 28s | Max:  6m 55s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 02s | Avg:  2m 02s | Max:  2m 02s
      🟩 Test               Pass: 100%/1   | Total:  6m 55s | Avg:  6m 55s | Max:  6m 55s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 39s | Avg: 27m 39s | Max: 27m 39s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 96)

# Runner
69 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
11 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

github-actions bot (Contributor) commented

🟩 CI finished in 1h 38m: Pass: 100%/96 | Total: 20h 20m | Avg: 12m 42s | Max: 37m 19s | Hits: 425%/15012
  • 🟩 cub: Pass: 100%/47 | Total: 12h 24m | Avg: 15m 50s | Max: 34m 59s | Hits: 597%/3900

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total: 11h 54m | Avg: 15m 52s | Max: 34m 59s | Hits: 597%/3900  
      🟩 arm64              Pass: 100%/2   | Total: 30m 22s | Avg: 15m 11s | Max: 16m 21s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  2h 26m | Avg: 18m 15s | Max: 34m 59s | Hits: 597%/1560  
      🟩 12.5               Pass: 100%/2   | Total: 31m 35s | Avg: 15m 47s | Max: 15m 53s
      🟩 12.6               Pass: 100%/37  | Total:  9h 26m | Avg: 15m 19s | Max: 34m 50s | Hits: 597%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 18m 36s | Avg:  9m 18s | Max:  9m 30s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  2h 26m | Avg: 18m 15s | Max: 34m 59s | Hits: 597%/1560  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 31m 35s | Avg: 15m 47s | Max: 15m 53s
      🟩 nvcc12.6           Pass: 100%/35  | Total:  9h 08m | Avg: 15m 39s | Max: 34m 50s | Hits: 597%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 18m 36s | Avg:  9m 18s | Max:  9m 30s
      🟩 nvcc               Pass: 100%/45  | Total: 12h 05m | Avg: 16m 07s | Max: 34m 59s | Hits: 597%/3900  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 51m 29s | Avg: 12m 52s | Max: 13m 27s
      🟩 Clang10            Pass: 100%/1   | Total: 12m 38s | Avg: 12m 38s | Max: 12m 38s
      🟩 Clang11            Pass: 100%/1   | Total: 11m 03s | Avg: 11m 03s | Max: 11m 03s
      🟩 Clang12            Pass: 100%/1   | Total: 12m 19s | Avg: 12m 19s | Max: 12m 19s
      🟩 Clang13            Pass: 100%/1   | Total: 11m 52s | Avg: 11m 52s | Max: 11m 52s
      🟩 Clang14            Pass: 100%/1   | Total: 12m 37s | Avg: 12m 37s | Max: 12m 37s
      🟩 Clang15            Pass: 100%/1   | Total: 11m 17s | Avg: 11m 17s | Max: 11m 17s
      🟩 Clang16            Pass: 100%/1   | Total: 11m 59s | Avg: 11m 59s | Max: 11m 59s
      🟩 Clang17            Pass: 100%/1   | Total: 11m 57s | Avg: 11m 57s | Max: 11m 57s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 42m | Avg: 14m 34s | Max: 22m 32s
      🟩 GCC7               Pass: 100%/4   | Total: 49m 44s | Avg: 12m 26s | Max: 13m 01s
      🟩 GCC8               Pass: 100%/1   | Total: 11m 51s | Avg: 11m 51s | Max: 11m 51s
      🟩 GCC9               Pass: 100%/3   | Total: 38m 34s | Avg: 12m 51s | Max: 13m 45s
      🟩 GCC10              Pass: 100%/1   | Total: 11m 03s | Avg: 11m 03s | Max: 11m 03s
      🟩 GCC11              Pass: 100%/1   | Total: 12m 32s | Avg: 12m 32s | Max: 12m 32s
      🟩 GCC12              Pass: 100%/3   | Total: 38m 03s | Avg: 12m 41s | Max: 19m 36s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 09m | Avg: 16m 14s | Max: 24m 28s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  1h 42m | Avg: 34m 11s | Max: 34m 59s | Hits: 597%/2340  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 09m | Avg: 34m 40s | Max: 34m 50s | Hits: 597%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 31m 35s | Avg: 15m 47s | Max: 15m 53s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  4h 09m | Avg: 13m 07s | Max: 22m 32s
      🟩 GCC                Pass: 100%/21  | Total:  4h 51m | Avg: 13m 53s | Max: 24m 28s
      🟩 MSVC               Pass: 100%/5   | Total:  2h 51m | Avg: 34m 23s | Max: 34m 59s | Hits: 597%/3900  
      🟩 NVHPC              Pass: 100%/2   | Total: 31m 35s | Avg: 15m 47s | Max: 15m 53s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 25m 54s | Avg: 12m 57s | Max: 19m 36s
      🟩 v100               Pass: 100%/45  | Total: 11h 58m | Avg: 15m 58s | Max: 34m 59s | Hits: 597%/3900  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  9h 55m | Avg: 14m 53s | Max: 34m 59s | Hits: 597%/3900  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 22m 12s | Avg: 22m 12s | Max: 22m 12s
      🟩 GraphCapture       Pass: 100%/1   | Total: 17m 34s | Avg: 17m 34s | Max: 17m 34s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 02m | Avg: 20m 42s | Max: 21m 29s
      🟩 TestGPU            Pass: 100%/2   | Total: 47m 00s | Avg: 23m 30s | Max: 24m 28s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 25m 54s | Avg: 12m 57s | Max: 19m 36s
      🟩 90a                Pass: 100%/1   | Total:  6m 45s | Avg:  6m 45s | Max:  6m 45s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  1h 04m | Avg: 12m 57s | Max: 13m 45s
      🟩 14                 Pass: 100%/3   | Total: 59m 39s | Avg: 19m 53s | Max: 33m 52s | Hits: 597%/780   
      🟩 17                 Pass: 100%/13  | Total:  3h 45m | Avg: 17m 23s | Max: 34m 59s | Hits: 597%/2340  
      🟩 20                 Pass: 100%/26  | Total:  6h 34m | Avg: 15m 09s | Max: 34m 50s | Hits: 597%/780   
    
  • 🟩 thrust: Pass: 100%/46 | Total: 7h 19m | Avg: 9m 33s | Max: 37m 19s | Hits: 365%/11112

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 24m 32s | Avg: 12m 16s | Max: 18m 55s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total:  7h 09m | Avg:  9m 45s | Max: 37m 19s | Hits: 365%/11112 
      🟩 arm64              Pass: 100%/2   | Total:  9m 59s | Avg:  4m 59s | Max:  5m 24s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  1h 22m | Avg: 10m 15s | Max: 26m 36s | Hits: 365%/3704  
      🟩 12.5               Pass: 100%/2   | Total: 27m 39s | Avg: 13m 49s | Max: 13m 53s
      🟩 12.6               Pass: 100%/36  | Total:  5h 29m | Avg:  9m 09s | Max: 37m 19s | Hits: 365%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  5m 20s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  1h 22m | Avg: 10m 15s | Max: 26m 36s | Hits: 365%/3704  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 27m 39s | Avg: 13m 49s | Max: 13m 53s
      🟩 nvcc12.6           Pass: 100%/34  | Total:  5h 19m | Avg:  9m 23s | Max: 37m 19s | Hits: 365%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  5m 20s
      🟩 nvcc               Pass: 100%/44  | Total:  7h 09m | Avg:  9m 45s | Max: 37m 19s | Hits: 365%/11112 
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 22m 53s | Avg:  5m 43s | Max:  6m 18s
      🟩 Clang10            Pass: 100%/1   | Total:  6m 58s | Avg:  6m 58s | Max:  6m 58s
      🟩 Clang11            Pass: 100%/1   | Total:  5m 36s | Avg:  5m 36s | Max:  5m 36s
      🟩 Clang12            Pass: 100%/1   | Total:  5m 16s | Avg:  5m 16s | Max:  5m 16s
      🟩 Clang13            Pass: 100%/1   | Total:  5m 19s | Avg:  5m 19s | Max:  5m 19s
      🟩 Clang14            Pass: 100%/1   | Total:  5m 27s | Avg:  5m 27s | Max:  5m 27s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 16s | Avg:  5m 16s | Max:  5m 16s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 27s | Avg:  5m 27s | Max:  5m 27s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 31s | Avg:  5m 31s | Max:  5m 31s
      🟩 Clang18            Pass: 100%/7   | Total: 46m 21s | Avg:  6m 37s | Max: 13m 35s
      🟩 GCC7               Pass: 100%/4   | Total: 19m 19s | Avg:  4m 49s | Max:  5m 10s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 35s | Avg:  5m 35s | Max:  5m 35s
      🟩 GCC9               Pass: 100%/3   | Total: 15m 33s | Avg:  5m 11s | Max:  5m 39s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 45s | Avg:  5m 45s | Max:  5m 45s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 36s | Avg:  5m 36s | Max:  5m 36s
      🟩 GCC12              Pass: 100%/1   | Total:  5m 40s | Avg:  5m 40s | Max:  5m 40s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 05m | Avg:  8m 08s | Max: 18m 55s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  1h 19m | Avg: 26m 38s | Max: 27m 45s | Hits: 365%/5556  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 35m | Avg: 31m 45s | Max: 37m 19s | Hits: 365%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 27m 39s | Avg: 13m 49s | Max: 13m 53s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  1h 54m | Avg:  6m 00s | Max: 13m 35s
      🟩 GCC                Pass: 100%/19  | Total:  2h 02m | Avg:  6m 27s | Max: 18m 55s
      🟩 MSVC               Pass: 100%/6   | Total:  2h 55m | Avg: 29m 11s | Max: 37m 19s | Hits: 365%/11112 
      🟩 NVHPC              Pass: 100%/2   | Total: 27m 39s | Avg: 13m 49s | Max: 13m 53s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total:  7h 19m | Avg:  9m 33s | Max: 37m 19s | Hits: 365%/11112 
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  5h 43m | Avg:  8m 34s | Max: 29m 33s | Hits: 365%/9260  
      🟩 TestCPU            Pass: 100%/3   | Total: 52m 07s | Avg: 17m 22s | Max: 37m 19s | Hits: 365%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 44m 14s | Avg: 14m 44s | Max: 18m 55s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 29s | Avg:  4m 29s | Max:  4m 29s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total: 24m 33s | Avg:  4m 54s | Max:  6m 07s
      🟩 14                 Pass: 100%/3   | Total: 36m 52s | Avg: 12m 17s | Max: 25m 33s | Hits: 365%/1852  
      🟩 17                 Pass: 100%/13  | Total:  2h 27m | Avg: 11m 20s | Max: 28m 23s | Hits: 365%/5556  
      🟩 20                 Pass: 100%/23  | Total:  3h 26m | Avg:  8m 57s | Max: 37m 19s | Hits: 365%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 41s | Avg: 4m 20s | Max: 6m 43s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  8m 41s | Avg:  4m 20s | Max:  6m 43s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  1m 58s | Avg:  1m 58s | Max:  1m 58s
      🟩 Test               Pass: 100%/1   | Total:  6m 43s | Avg:  6m 43s | Max:  6m 43s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 49s | Avg: 27m 49s | Max: 27m 49s
    


@fbusato (Contributor) commented Jan 11, 2025

Looks good! Indeed, many test cases were redundant, e.g., descending ordering and overwrite. The idea of using a histogram is great.
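
For context, a minimal sketch of what histogram-based verification can look like (host-side, for one segment; this is an illustration, not the actual test code from #3246): rather than comparing against a reference sort, one checks that the output segment is non-decreasing and that the input's and output's key histograms match, i.e., that the output is a permutation of the input.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Verifies one segment [begin, end): the output must be non-decreasing and
// must contain exactly the same keys as the input (matching histograms).
template <typename KeyT>
bool segment_sorted_correctly(const std::vector<KeyT>& in,
                              const std::vector<KeyT>& out,
                              std::size_t begin, std::size_t end)
{
  if (!std::is_sorted(out.begin() + begin, out.begin() + end))
  {
    return false;
  }
  std::unordered_map<KeyT, std::int64_t> histogram;
  for (std::size_t i = begin; i < end; ++i)
  {
    ++histogram[in[i]];  // count keys in the input segment
    --histogram[out[i]]; // cancel them against the output segment
  }
  for (const auto& bucket : histogram)
  {
    if (bucket.second != 0) // a non-zero count means the multisets differ
    {
      return false;
    }
  }
  return true;
}
```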

@elstehle (Collaborator, Author) commented

Thank you for your review, @fbusato! I assume your review is limited to the testing changes in #3246. I have updated that PR to address your feedback. If you are happy with the new tests, could you please approve #3246? That will let this PR focus on the algorithmic changes that enable support for a large number of items and segments in DeviceSegmentedSort.

@elstehle force-pushed the enh/large-seg-support-seg-sort branch from f38b403 to 83a390e on January 14, 2025 05:25
github-actions bot (Contributor) commented

🟩 CI finished in 1h 16m: Pass: 100%/78 | Total: 17h 19m | Avg: 13m 19s | Max: 36m 50s | Hits: 422%/12340
  • 🟩 cub: Pass: 100%/38 | Total: 10h 36m | Avg: 16m 44s | Max: 36m 50s | Hits: 590%/3120

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total: 10h 05m | Avg: 16m 48s | Max: 36m 50s | Hits: 590%/3120  
      🟩 arm64              Pass: 100%/2   | Total: 30m 54s | Avg: 15m 27s | Max: 16m 25s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 23m | Avg: 16m 43s | Max: 34m 50s | Hits: 590%/780   
      🟩 12.5               Pass: 100%/2   | Total: 33m 57s | Avg: 16m 58s | Max: 17m 22s
      🟩 12.6               Pass: 100%/31  | Total:  8h 38m | Avg: 16m 43s | Max: 36m 50s | Hits: 590%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 18m 50s | Avg:  9m 25s | Max:  9m 45s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 23m | Avg: 16m 43s | Max: 34m 50s | Hits: 590%/780   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 33m 57s | Avg: 16m 58s | Max: 17m 22s
      🟩 nvcc12.6           Pass: 100%/29  | Total:  8h 19m | Avg: 17m 13s | Max: 36m 50s | Hits: 590%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 18m 50s | Avg:  9m 25s | Max:  9m 45s
      🟩 nvcc               Pass: 100%/36  | Total: 10h 17m | Avg: 17m 08s | Max: 36m 50s | Hits: 590%/3120  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 48m 56s | Avg: 12m 14s | Max: 13m 04s
      🟩 Clang15            Pass: 100%/1   | Total: 12m 18s | Avg: 12m 18s | Max: 12m 18s
      🟩 Clang16            Pass: 100%/1   | Total: 11m 45s | Avg: 11m 45s | Max: 11m 45s
      🟩 Clang17            Pass: 100%/1   | Total: 12m 29s | Avg: 12m 29s | Max: 12m 29s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 47m | Avg: 15m 21s | Max: 26m 04s
      🟩 GCC7               Pass: 100%/2   | Total: 23m 43s | Avg: 11m 51s | Max: 11m 57s
      🟩 GCC8               Pass: 100%/1   | Total: 11m 34s | Avg: 11m 34s | Max: 11m 34s
      🟩 GCC9               Pass: 100%/2   | Total: 24m 00s | Avg: 12m 00s | Max: 12m 10s
      🟩 GCC10              Pass: 100%/1   | Total: 11m 47s | Avg: 11m 47s | Max: 11m 47s
      🟩 GCC11              Pass: 100%/1   | Total: 12m 59s | Avg: 12m 59s | Max: 12m 59s
      🟩 GCC12              Pass: 100%/3   | Total: 37m 27s | Avg: 12m 29s | Max: 19m 26s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 25m | Avg: 18m 07s | Max: 29m 49s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 11m | Avg: 35m 30s | Max: 36m 10s | Hits: 590%/1560  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 11m | Avg: 35m 48s | Max: 36m 50s | Hits: 590%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 33m 57s | Avg: 16m 58s | Max: 17m 22s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  3h 12m | Avg: 13m 47s | Max: 26m 04s
      🟩 GCC                Pass: 100%/18  | Total:  4h 26m | Avg: 14m 48s | Max: 29m 49s
      🟩 MSVC               Pass: 100%/4   | Total:  2h 22m | Avg: 35m 39s | Max: 36m 50s | Hits: 590%/3120  
      🟩 NVHPC              Pass: 100%/2   | Total: 33m 57s | Avg: 16m 58s | Max: 17m 22s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 25m 47s | Avg: 12m 53s | Max: 19m 26s
      🟩 v100               Pass: 100%/36  | Total: 10h 10m | Avg: 16m 57s | Max: 36m 50s | Hits: 590%/3120  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  7h 49m | Avg: 15m 08s | Max: 36m 50s | Hits: 590%/3120  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 25m 46s | Avg: 25m 46s | Max: 25m 46s
      🟩 GraphCapture       Pass: 100%/1   | Total: 17m 19s | Avg: 17m 19s | Max: 17m 19s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 11m | Avg: 23m 46s | Max: 26m 04s
      🟩 TestGPU            Pass: 100%/2   | Total: 52m 07s | Avg: 26m 03s | Max: 29m 49s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 25m 47s | Avg: 12m 53s | Max: 19m 26s
      🟩 90a                Pass: 100%/1   | Total:  6m 27s | Avg:  6m 27s | Max:  6m 27s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  4h 01m | Avg: 17m 16s | Max: 36m 10s | Hits: 590%/2340  
      🟩 20                 Pass: 100%/24  | Total:  6h 34m | Avg: 16m 25s | Max: 36m 50s | Hits: 590%/780   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 6h 07m | Avg: 9m 56s | Max: 34m 30s | Hits: 365%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 20m 43s | Avg: 10m 21s | Max: 14m 34s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total:  5h 58m | Avg: 10m 13s | Max: 34m 30s | Hits: 365%/9220  
      🟩 arm64              Pass: 100%/2   | Total:  9m 34s | Avg:  4m 47s | Max:  5m 00s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total: 46m 38s | Avg:  9m 19s | Max: 26m 36s | Hits: 365%/1844  
      🟩 12.5               Pass: 100%/2   | Total: 27m 56s | Avg: 13m 58s | Max: 14m 18s
      🟩 12.6               Pass: 100%/30  | Total:  4h 53m | Avg:  9m 46s | Max: 34m 30s | Hits: 365%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 03s | Avg:  5m 01s | Max:  5m 03s
      🟩 nvcc12.0           Pass: 100%/5   | Total: 46m 38s | Avg:  9m 19s | Max: 26m 36s | Hits: 365%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 27m 56s | Avg: 13m 58s | Max: 14m 18s
      🟩 nvcc12.6           Pass: 100%/28  | Total:  4h 42m | Avg: 10m 06s | Max: 34m 30s | Hits: 365%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 03s | Avg:  5m 01s | Max:  5m 03s
      🟩 nvcc               Pass: 100%/35  | Total:  5h 57m | Avg: 10m 12s | Max: 34m 30s | Hits: 365%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 21m 05s | Avg:  5m 16s | Max:  5m 40s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 26s | Avg:  5m 26s | Max:  5m 26s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 53s | Avg:  5m 53s | Max:  5m 53s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 20s | Avg:  5m 20s | Max:  5m 20s
      🟩 Clang18            Pass: 100%/7   | Total: 45m 26s | Avg:  6m 29s | Max: 12m 16s
      🟩 GCC7               Pass: 100%/2   | Total: 10m 17s | Avg:  5m 08s | Max:  5m 25s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 13s | Avg:  5m 13s | Max:  5m 13s
      🟩 GCC9               Pass: 100%/2   | Total: 11m 00s | Avg:  5m 30s | Max:  5m 53s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 49s | Avg:  5m 49s | Max:  5m 49s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 38s | Avg:  5m 38s | Max:  5m 38s
      🟩 GCC12              Pass: 100%/1   | Total:  6m 08s | Avg:  6m 08s | Max:  6m 08s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 03m | Avg:  7m 56s | Max: 14m 34s
      🟩 MSVC14.29          Pass: 100%/2   | Total: 53m 58s | Avg: 26m 59s | Max: 27m 22s | Hits: 365%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 34m | Avg: 31m 39s | Max: 34m 30s | Hits: 365%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 27m 56s | Avg: 13m 58s | Max: 14m 18s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  1h 23m | Avg:  5m 56s | Max: 12m 16s
      🟩 GCC                Pass: 100%/16  | Total:  1h 47m | Avg:  6m 43s | Max: 14m 34s
      🟩 MSVC               Pass: 100%/5   | Total:  2h 28m | Avg: 29m 47s | Max: 34m 30s | Hits: 365%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total: 27m 56s | Avg: 13m 58s | Max: 14m 18s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total:  6h 07m | Avg:  9m 56s | Max: 34m 30s | Hits: 365%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  4h 38m | Avg:  8m 58s | Max: 31m 46s | Hits: 365%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 49m 18s | Avg: 16m 26s | Max: 34m 30s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 40m 02s | Avg: 13m 20s | Max: 14m 34s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 46s | Avg:  4m 46s | Max:  4m 46s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  2h 30m | Avg: 10m 45s | Max: 28m 42s | Hits: 365%/5532  
      🟩 20                 Pass: 100%/21  | Total:  3h 16m | Avg:  9m 21s | Max: 34m 30s | Hits: 365%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 01s | Avg: 5m 00s | Max: 8m 06s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  8m 06s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  1m 55s | Avg:  1m 55s | Max:  1m 55s
      🟩 Test               Pass: 100%/1   | Total:  8m 06s | Avg:  8m 06s | Max:  8m 06s
    
  • 🟩 python: Pass: 100%/1 | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 25m 23s | Avg: 25m 23s | Max: 25m 23s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

Resolved review threads:
  • cub/cub/device/dispatch/dispatch_segmented_sort.cuh (outdated)
  • cub/cub/device/device_segmented_sort.cuh
  • cub/cub/device/dispatch/dispatch_segmented_sort.cuh (outdated)
  • cub/cub/device/dispatch/dispatch_segmented_sort.cuh (outdated)
@miscco (Collaborator) left a comment:

Minor nits

@bernhardmgruber removed their request for review January 14, 2025 08:51
@elstehle force-pushed the enh/large-seg-support-seg-sort branch from 83a390e to 56e3b23 on January 14, 2025 10:12
@elstehle mentioned this pull request Jan 14, 2025
github-actions bot (Contributor) commented

🟩 CI finished in 1h 16m: Pass: 100%/78 | Total: 16h 52m | Avg: 12m 58s | Max: 38m 00s | Hits: 422%/12340
  • 🟩 cub: Pass: 100%/38 | Total: 10h 06m | Avg: 15m 57s | Max: 38m 00s | Hits: 590%/3120

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total:  9h 37m | Avg: 16m 02s | Max: 38m 00s | Hits: 590%/3120  
      🟩 arm64              Pass: 100%/2   | Total: 28m 47s | Avg: 14m 23s | Max: 14m 41s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 18m | Avg: 15m 37s | Max: 31m 04s | Hits: 590%/780   
      🟩 12.5               Pass: 100%/2   | Total: 34m 56s | Avg: 17m 28s | Max: 18m 01s
      🟩 12.6               Pass: 100%/31  | Total:  8h 13m | Avg: 15m 55s | Max: 38m 00s | Hits: 590%/2340  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 18m 31s | Avg:  9m 15s | Max:  9m 28s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 18m | Avg: 15m 37s | Max: 31m 04s | Hits: 590%/780   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 34m 56s | Avg: 17m 28s | Max: 18m 01s
      🟩 nvcc12.6           Pass: 100%/29  | Total:  7h 54m | Avg: 16m 22s | Max: 38m 00s | Hits: 590%/2340  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 18m 31s | Avg:  9m 15s | Max:  9m 28s
      🟩 nvcc               Pass: 100%/36  | Total:  9h 48m | Avg: 16m 20s | Max: 38m 00s | Hits: 590%/3120  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 46m 39s | Avg: 11m 39s | Max: 12m 01s
      🟩 Clang15            Pass: 100%/1   | Total: 11m 42s | Avg: 11m 42s | Max: 11m 42s
      🟩 Clang16            Pass: 100%/1   | Total: 11m 30s | Avg: 11m 30s | Max: 11m 30s
      🟩 Clang17            Pass: 100%/1   | Total: 11m 53s | Avg: 11m 53s | Max: 11m 53s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 42m | Avg: 14m 41s | Max: 22m 57s
      🟩 GCC7               Pass: 100%/2   | Total: 23m 25s | Avg: 11m 42s | Max: 11m 56s
      🟩 GCC8               Pass: 100%/1   | Total: 11m 31s | Avg: 11m 31s | Max: 11m 31s
      🟩 GCC9               Pass: 100%/2   | Total: 23m 34s | Avg: 11m 47s | Max: 11m 53s
      🟩 GCC10              Pass: 100%/1   | Total: 12m 09s | Avg: 12m 09s | Max: 12m 09s
      🟩 GCC11              Pass: 100%/1   | Total: 12m 30s | Avg: 12m 30s | Max: 12m 30s
      🟩 GCC12              Pass: 100%/3   | Total: 37m 50s | Avg: 12m 36s | Max: 19m 31s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 05m | Avg: 15m 44s | Max: 22m 34s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 06m | Avg: 33m 25s | Max: 35m 47s | Hits: 590%/1560  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 13m | Avg: 36m 39s | Max: 38m 00s | Hits: 590%/1560  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 34m 56s | Avg: 17m 28s | Max: 18m 01s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  3h 04m | Avg: 13m 11s | Max: 22m 57s
      🟩 GCC                Pass: 100%/18  | Total:  4h 06m | Avg: 13m 42s | Max: 22m 34s
      🟩 MSVC               Pass: 100%/4   | Total:  2h 20m | Avg: 35m 02s | Max: 38m 00s | Hits: 590%/3120  
      🟩 NVHPC              Pass: 100%/2   | Total: 34m 56s | Avg: 17m 28s | Max: 18m 01s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 25m 52s | Avg: 12m 56s | Max: 19m 31s
      🟩 v100               Pass: 100%/36  | Total:  9h 40m | Avg: 16m 07s | Max: 38m 00s | Hits: 590%/3120  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  7h 42m | Avg: 14m 54s | Max: 38m 00s | Hits: 590%/3120  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 21m 59s | Avg: 21m 59s | Max: 21m 59s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 52s | Avg: 15m 52s | Max: 15m 52s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 01m | Avg: 20m 23s | Max: 22m 33s
      🟩 TestGPU            Pass: 100%/2   | Total: 45m 31s | Avg: 22m 45s | Max: 22m 57s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 25m 52s | Avg: 12m 56s | Max: 19m 31s
      🟩 90a                Pass: 100%/1   | Total:  6m 53s | Avg:  6m 53s | Max:  6m 53s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  3h 55m | Avg: 16m 47s | Max: 35m 47s | Hits: 590%/2340  
      🟩 20                 Pass: 100%/24  | Total:  6h 11m | Avg: 15m 28s | Max: 38m 00s | Hits: 590%/780   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 6h 08m | Avg: 9m 57s | Max: 33m 59s | Hits: 365%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 16m 37s | Avg:  8m 18s | Max: 11m 00s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total:  5h 58m | Avg: 10m 14s | Max: 33m 59s | Hits: 365%/9220  
      🟩 arm64              Pass: 100%/2   | Total:  9m 35s | Avg:  4m 47s | Max:  5m 00s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total: 48m 22s | Avg:  9m 40s | Max: 28m 01s | Hits: 365%/1844  
      🟩 12.5               Pass: 100%/2   | Total: 29m 27s | Avg: 14m 43s | Max: 15m 05s
      🟩 12.6               Pass: 100%/30  | Total:  4h 50m | Avg:  9m 40s | Max: 33m 59s | Hits: 365%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  5m 06s
      🟩 nvcc12.0           Pass: 100%/5   | Total: 48m 22s | Avg:  9m 40s | Max: 28m 01s | Hits: 365%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 29m 27s | Avg: 14m 43s | Max: 15m 05s
      🟩 nvcc12.6           Pass: 100%/28  | Total:  4h 40m | Avg: 10m 00s | Max: 33m 59s | Hits: 365%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 05s | Avg:  5m 02s | Max:  5m 06s
      🟩 nvcc               Pass: 100%/35  | Total:  5h 58m | Avg: 10m 13s | Max: 33m 59s | Hits: 365%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 21m 09s | Avg:  5m 17s | Max:  5m 35s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 42s | Avg:  5m 42s | Max:  5m 42s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 17s | Avg:  5m 17s | Max:  5m 17s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 46s | Avg:  5m 46s | Max:  5m 46s
      🟩 Clang18            Pass: 100%/7   | Total: 44m 42s | Avg:  6m 23s | Max: 12m 26s
      🟩 GCC7               Pass: 100%/2   | Total: 10m 53s | Avg:  5m 26s | Max:  5m 34s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 19s | Avg:  5m 19s | Max:  5m 19s
      🟩 GCC9               Pass: 100%/2   | Total: 10m 21s | Avg:  5m 10s | Max:  5m 26s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 28s | Avg:  5m 28s | Max:  5m 28s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 49s | Avg:  5m 49s | Max:  5m 49s
      🟩 GCC12              Pass: 100%/1   | Total:  6m 23s | Avg:  6m 23s | Max:  6m 23s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 00m | Avg:  7m 33s | Max: 15m 37s
      🟩 MSVC14.29          Pass: 100%/2   | Total: 57m 26s | Avg: 28m 43s | Max: 29m 25s | Hits: 365%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 33m | Avg: 31m 19s | Max: 33m 59s | Hits: 365%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 29m 27s | Avg: 14m 43s | Max: 15m 05s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  1h 22m | Avg:  5m 54s | Max: 12m 26s
      🟩 GCC                Pass: 100%/16  | Total:  1h 44m | Avg:  6m 32s | Max: 15m 37s
      🟩 MSVC               Pass: 100%/5   | Total:  2h 31m | Avg: 30m 16s | Max: 33m 59s | Hits: 365%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total: 29m 27s | Avg: 14m 43s | Max: 15m 05s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total:  6h 08m | Avg:  9m 57s | Max: 33m 59s | Hits: 365%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  4h 40m | Avg:  9m 02s | Max: 30m 39s | Hits: 365%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 48m 45s | Avg: 16m 15s | Max: 33m 59s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 39m 03s | Avg: 13m 01s | Max: 15m 37s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 23s | Avg:  4m 23s | Max:  4m 23s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  2h 35m | Avg: 11m 06s | Max: 30m 39s | Hits: 365%/5532  
      🟩 20                 Pass: 100%/21  | Total:  3h 16m | Avg:  9m 20s | Max: 33m 59s | Hits: 365%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 11m 10s | Avg: 5m 35s | Max: 9m 10s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  9m 10s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 00s | Avg:  2m 00s | Max:  2m 00s
      🟩 Test               Pass: 100%/1   | Total:  9m 10s | Avg:  9m 10s | Max:  9m 10s
    
  • 🟩 python: Pass: 100%/1 | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 26m 20s | Avg: 26m 20s | Max: 26m 20s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@elstehle merged commit 08420d4 into NVIDIA:main Jan 14, 2025
95 checks passed
shwina pushed a commit to shwina/cccl that referenced this pull request Jan 16, 2025
…o `DeviceSegmentedSort` (NVIDIA#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025
…o `DeviceSegmentedSort` (NVIDIA#3308)

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025
…o `DeviceSegmentedSort` (NVIDIA#3308)
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 20, 2025
implement `add_sat`

split `signed`/`unsigned` implementation, improve implementation for MSVC

improve device `add_sat` implementation

add `add_sat` test

improve generic `add_sat` implementation for signed types

implement `sub_sat`

allow more msvc intrinsics on x86

add op tests

partially implement `mul_sat`

implement `div_sat` and `saturate_cast`

add `saturate_cast` test

simplify `div_sat` test
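
The saturating-arithmetic work above targets the C++26-style `add_sat` family. As a reference for the intended clamping semantics only (not the library's actual implementation, which per the notes above uses per-compiler builtins and device-specific paths), a minimal sketch:

```cpp
// Reference semantics for saturating addition (illustrative only):
// clamp the mathematical sum into the representable range instead of
// wrapping (unsigned) or hitting UB (signed overflow).
#include <limits>
#include <type_traits>

template <class T>
constexpr T add_sat_reference(T x, T y)
{
  static_assert(std::is_integral_v<T>);
  if (x > T{0} && y > std::numeric_limits<T>::max() - x)
  {
    return std::numeric_limits<T>::max(); // would overflow upward
  }
  if (x < T{0} && y < std::numeric_limits<T>::min() - x)
  {
    return std::numeric_limits<T>::min(); // would overflow downward (signed only)
  }
  return static_cast<T>(x + y);
}

static_assert(add_sat_reference<signed char>(100, 100) == 127);
static_assert(add_sat_reference<unsigned char>(200, 100) == 255);
```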

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral
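
A small usage sketch of these interfaces, assuming they mirror their host `<cstdlib>` counterparts as the bullets describe:

```cpp
#include <cuda/std/cstdlib>

__host__ __device__ void cstdlib_demo()
{
  auto qr = cuda::std::div(7, 2);  // quotient/remainder pair: qr.quot == 3, qr.rem == 1
  int m   = cuda::std::abs(-5);    // m == 5
  float f = cuda::std::abs(-1.5f); // floating-point overload mentioned above
  (void) qr; (void) m; (void) f;
}
```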

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates suppression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup
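
A hedged usage sketch of the new two-output-iterator interface; the two-phase temp-storage call is the standard CUB pattern, while the exact parameter order should be checked against the headers:

```cpp
#include <cub/device/device_reduce.cuh>

#include <cuda_runtime.h>

void argmin_two_outputs(const float* d_in, float* d_min_out, int* d_index_out,
                        int num_items, cudaStream_t stream)
{
  void* d_temp_storage      = nullptr;
  size_t temp_storage_bytes = 0;
  // First call: query temporary storage size.
  cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes,
                            d_in, d_min_out, d_index_out, num_items, stream);
  cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);
  // Second call: run the reduction; extremum and index land in separate outputs.
  cub::DeviceReduce::ArgMin(d_temp_storage, temp_storage_bytes,
                            d_in, d_min_out, d_index_out, num_items, stream);
  cudaFreeAsync(d_temp_storage, stream);
}
```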

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>

* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>
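
The core idea of the fix, as a simplified sketch rather than the actual libcu++ code path (`__isLocal` is the CUDA intrinsic that later bullets guard for NVHPC):

```cpp
// Simplified sketch: an "atomic" whose target lives in automatic (local)
// storage is private to the calling thread, so a plain read-modify-write
// is safe and sidesteps the UB of atomic ops on local-space addresses.
__device__ int fetch_add(int* ptr, int v)
{
  if (__isLocal(ptr)) // automatic storage: no other thread can observe it
  {
    int old = *ptr;
    *ptr    = old + v;
    return old;
  }
  return atomicAdd(ptr, v); // shared/global: a real hardware atomic
}
```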

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <[email protected]>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053
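
What the documented opt-in looks like in user code, sketched with a made-up `scale` functor:

```cpp
#include <cuda/functional>

#include <thrust/device_vector.h>
#include <thrust/transform.h>

struct scale
{
  float s;
  __host__ __device__ float operator()(float x) const { return s * x; }
};

void run(thrust::device_vector<float>& v)
{
  // Promise that the functor does not care about its argument's address,
  // so thrust::transform may pass copies (enabling, e.g., vectorized loads).
  thrust::transform(v.begin(), v.end(), v.begin(),
                    cuda::proclaim_copyable_arguments(scale{2.0f}));
}
```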

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test
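
A schematic of the two fixes; the inline PTX below is an assumption standing in for the proxy fence the message names, and the elided steps are marked:

```cpp
#include <cuda/barrier>

__global__ void fixed_test_pattern(int* gmem)
{
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;
  // ... initialize bar, write test data to gmem ...

  // Fix 1: order the generic-proxy global writes before the async-proxy copy.
  asm volatile("fence.proxy.async.global;" ::: "memory");

  // ... memcpy_async_tx(...), bar.arrive(...) / bar.wait(...) ...

  // Fix 2: keep all threads alive until everyone is done with the barrier,
  // since its destructor runs when the kernel ends.
  __syncthreads();
}
```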

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <[email protected]>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fix thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <[email protected]>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <[email protected]>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abstract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <[email protected]>

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <[email protected]>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple holding a universal reference that is out of scope by the time we use it; use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.
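
For reference, a sketch of how a launch opts into PDL with the plain CUDA runtime API; the triple-chevron change wires up the equivalent internally:

```cpp
#include <cuda_runtime.h>

__global__ void kernel();

void launch_with_pdl(cudaStream_t stream)
{
  // Programmatic Dependent Launch: let this kernel overlap with the tail
  // of the preceding kernel on the same stream.
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t config{};
  config.gridDim  = dim3(1);
  config.blockDim = dim3(1);
  config.stream   = stream;
  config.attrs    = &attr;
  config.numAttrs = 1;

  cudaLaunchKernelEx(&config, kernel);
}
```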

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel, ...) in the bug report template as in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <[email protected]>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
That is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop CUB's util_compiler.cuh (#3302)

All contained macros were deprecated

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <[email protected]>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressible warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view
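
A usage sketch, assuming the libcu++ view mirrors `std::ranges::empty_view`:

```cpp
#include <cuda/std/ranges>

// views::empty is a variable template; it models an always-empty range.
constexpr auto ev = cuda::std::ranges::views::empty<int>;
static_assert(ev.size() == 0);
static_assert(ev.empty());
```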

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* Rework how we build task_dep objects based on access mode tags
  - pack_state is now responsible for using a const_cast for read-only data

* Greatly simplify the previous attempt: do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessary, so we can simplify it and not even do some dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.
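
A minimal sketch of the documented behavior, with a hypothetical `is_positive` predicate:

```cpp
#include <thrust/execution_policy.h>
#include <thrust/logical.h>

struct is_positive
{
  __host__ __device__ bool operator()(int x) const { return x > 0; }
};

bool vacuous_truth()
{
  int* first = nullptr;
  // [first, first) is empty: there is no counterexample, so all_of is true.
  return thrust::all_of(thrust::seq, first, first, is_positive{});
}
```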

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events!

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <[email protected]>

---------

Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Georgii Evtushenko <[email protected]>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`
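
A hedged usage sketch of the new specializations:

```cpp
#include <cuda_fp16.h>

#include <cuda/std/limits>

__host__ __device__ void half_limits_demo()
{
  static_assert(cuda::std::numeric_limits<__half>::is_specialized);
  __half lo  = cuda::std::numeric_limits<__half>::lowest();
  __half eps = cuda::std::numeric_limits<__half>::epsilon();
  (void) lo; (void) eps;
}
```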

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements
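
In user-facing terms, the offset type is now deduced from the segment-offset iterators instead of being hardcoded to `int`, so 64-bit offsets unlock item and segment counts beyond the 32-bit range. A hedged usage sketch:

```cpp
#include <cub/device/device_segmented_sort.cuh>

#include <cstdint>

void sort_many_segments(const float* d_keys_in, float* d_keys_out,
                        const std::int64_t* d_offsets, // num_segments + 1 entries
                        std::int64_t num_items, std::int64_t num_segments)
{
  void* d_temp_storage      = nullptr;
  size_t temp_storage_bytes = 0;
  // The 64-bit offset iterators drive the internal offset-type selection.
  cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                     d_keys_in, d_keys_out, num_items, num_segments,
                                     d_offsets, d_offsets + 1);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                     d_keys_in, d_keys_out, num_items, num_segments,
                                     d_offsets, d_offsets + 1);
  cudaFree(d_temp_storage);
}
```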

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCC_ASSUME
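
The shape of such a macro, as a hedged sketch with a hypothetical `MY_ASSUME` name (the real `_CCCL_BUILTIN_ASSUME` dispatch lives in the CCCL config headers):

```cpp
// One macro, mapped onto per-compiler assumption builtins.
#if defined(__CUDA_ARCH__) || defined(__clang__)
#  define MY_ASSUME(expr) __builtin_assume(expr)
#elif defined(_MSC_VER)
#  define MY_ASSUME(expr) __assume(expr)
#else // GCC and friends: synthesize the assumption from unreachable()
#  define MY_ASSUME(expr) ((expr) ? (void) 0 : __builtin_unreachable())
#endif
```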

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8  (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions
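
A tiny sketch of what host/device availability means in practice, assuming these functions landed in `<cuda/std/cmath>`:

```cpp
#include <cuda/std/cmath>

__host__ __device__ float cmath_demo(float x)
{
  // The same call compiles in both host and device translation:
  return cuda::std::sqrt(cuda::std::exp(x) + 1.0f);
}
```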

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run, so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)
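
A hedged sketch of the new utilities; the names and header are assumed from the commit title and the existing `cuda::ceil_div`:

```cpp
#include <cuda/cmath>

static_assert(cuda::round_up(10, 4) == 12);  // next multiple of 4
static_assert(cuda::round_down(10, 4) == 8); // previous multiple of 4
static_assert(cuda::ceil_div(10, 4) == 3);   // companion utility
```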

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_cuda_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404

move to c++17, finalize device optimization

fix msvc compilation, update tests

Deprectate C++11 and C++14 for libcu++ (#3173)

* Deprectate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use the our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates supression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>

* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <[email protected]>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <[email protected]>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fx thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <[email protected]>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <[email protected]>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abtract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <[email protected]>

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <[email protected]>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <[email protected]>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
Those is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add excape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSV_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <[email protected]>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data

* Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessarily so we can simplify it and not even do some dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events !

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <[email protected]>

---------

Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Georgii Evtushenko <[email protected]>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes
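
A rough illustration of the analytical-verification idea mentioned above: sort an input whose sorted form is known in closed form, then verify in O(n) against that form instead of sorting a reference copy. The helper below is hypothetical, not the PR's test code:

```
#include <thrust/device_vector.h>
#include <thrust/equal.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

#include <cstddef>

// Hypothetical helper: sorting the descending sequence n-1, ..., 1, 0 must
// yield 0, 1, ..., n-1, which we can check against a counting iterator
// without materializing a reference result.
bool verify_sort_analytically(std::size_t n)
{
  thrust::device_vector<int> keys(n);
  thrust::sequence(keys.rbegin(), keys.rend()); // descending when read forward
  thrust::sort(keys.begin(), keys.end());
  return thrust::equal(keys.begin(), keys.end(), thrust::make_counting_iterator(0));
}
```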

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on the type returned by the segment offset begin/end iterators (see the sketch after this list)

* minor style improvements
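
For illustration, a minimal sketch of the user-facing effect: the 64-bit value type of the offset iterators selects the wide offset path, and `num_items`/`num_segments` may now exceed `INT_MAX` (the toy sizes below obviously do not):

```
#include <cub/device/device_segmented_sort.cuh>

#include <thrust/device_vector.h>

#include <cstddef>
#include <cstdint>

int main()
{
  thrust::device_vector<int> keys_in{3, 1, 2, 9, 7, 8};
  thrust::device_vector<int> keys_out(keys_in.size());
  // Two segments, [0, 3) and [3, 6), described by 64-bit offsets
  thrust::device_vector<std::int64_t> offsets{0, 3, 6};

  const auto num_items            = static_cast<std::int64_t>(keys_in.size());
  const std::int64_t num_segments = 2;

  // First call computes the required temporary storage size
  std::size_t temp_storage_bytes = 0;
  cub::DeviceSegmentedSort::SortKeys(nullptr, temp_storage_bytes,
                                     thrust::raw_pointer_cast(keys_in.data()),
                                     thrust::raw_pointer_cast(keys_out.data()),
                                     num_items, num_segments,
                                     offsets.begin(), offsets.begin() + 1);

  thrust::device_vector<unsigned char> temp_storage(temp_storage_bytes);
  cub::DeviceSegmentedSort::SortKeys(thrust::raw_pointer_cast(temp_storage.data()),
                                     temp_storage_bytes,
                                     thrust::raw_pointer_cast(keys_in.data()),
                                     thrust::raw_pointer_cast(keys_out.data()),
                                     num_items, num_segments,
                                     offsets.begin(), offsets.begin() + 1);
  return 0;
}
```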

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCCL_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306
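
The libcu++ type is the natural replacement; a minimal sketch:

```
#include <cuda/std/optional>

__host__ __device__ cuda::std::optional<int> first_positive(const int* data, int n)
{
  for (int i = 0; i < n; ++i)
  {
    if (data[i] > 0)
    {
      return data[i];
    }
  }
  return cuda::std::nullopt; // no positive element found
}
```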

Upgrade to Catch2 3.8 (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions (see the sketch after this list)

* Implement exponential functions
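
A sketch of the host/device usability these commits describe, assuming the roots functions land as `cuda::std::sqrt` in `<cuda/std/cmath>`:

```
#include <cuda/std/cmath>

// Callable from both host and device code paths.
__host__ __device__ float hypotenuse(float a, float b)
{
  return cuda::std::sqrt(a * a + b * b);
}
```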

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <[email protected]>
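
A minimal sketch of the preferred spelling going forward:

```
#include <cuda/std/type_traits>

// cuda::std::remove_cvref_t replaces the deprecated thrust::remove_cvref
static_assert(cuda::std::is_same_v<cuda::std::remove_cvref_t<const int&>, int>);
```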

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run, so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition, which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)
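
A sketch of the intended usage, assuming the utilities are exposed as constexpr `cuda::round_up` / `cuda::round_down` in `<cuda/cmath>`:

```
#include <cuda/cmath>

// round_up(x, m): smallest multiple of m that is >= x
// round_down(x, m): largest multiple of m that is <= x
static_assert(cuda::round_up(7, 4) == 8);
static_assert(cuda::round_down(7, 4) == 4);
```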

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo was archived by its owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long that it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404
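
A migration sketch; the names follow the Thrust change above (`cuda::std::min`/`cuda::std::max`), while the header placement is an assumption:

```
#include <cuda/std/algorithm> // assumption: cuda::std::min/max are reachable here

__host__ __device__ int clamp_to_limit(int v, int limit)
{
  // replaces the deprecated cub::min(v, limit)
  return cuda::std::min(v, limit);
}
```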

Fix CI issues (#3443)

update docs

fix review

restrict allowed types

replace constexpr implementations with generic

optimize `__is_arithmetic_integral`
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
…o `DeviceSegmentedSort` (NVIDIA#3308)

davebayer added a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
update docs

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
…o `DeviceSegmentedSort` (NVIDIA#3308)

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 23, 2025
…o `DeviceSegmentedSort` (NVIDIA#3308)

davebayer added a commit to davebayer/cccl that referenced this pull request Jan 23, 2025
Cleanup util_arch (NVIDIA#2773)
