Improves `DeviceSegmentedSort` test run time for large number of items and segments #3246
🟩 CI finished in 1h 07m: Pass: 100%/96 | Total: 20h 57m | Avg: 13m 05s | Max: 42m 53s | Hits: 98%/12392
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 96)
# | Runner |
---|---|
71 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
9 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
🟩 CI finished in 1h 53m: Pass: 100%/96 | Total: 13h 56m | Avg: 8m 42s | Max: 35m 43s | Hits: 99%/12392
a5dc0db to 2228a87
🟨 CI finished in 1h 45m: Pass: 98%/92 | Total: 1d 03h | Avg: 18m 02s | Max: 1h 16m | Hits: 160%/9748
🏃 Runner counts (total jobs: 92)
# | Runner |
---|---|
69 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
7 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
🟩 CI finished in 2h 37m: Pass: 100%/92 | Total: 1d 03h | Avg: 18m 11s | Max: 1h 16m | Hits: 160%/9748
2228a87 to b8cedc1
🟩 CI finished in 2h 01m: Pass: 100%/96 | Total: 2d 16h | Avg: 40m 14s | Max: 1h 06m | Hits: 303%/15012
🏃 Runner counts (total jobs: 96)
# | Runner |
---|---|
69 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
11 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
Could you please summarize the changes that helped to reduce the runtime?
Description
Closes #3222
Reduces the per-test run time from six minutes to six seconds.
Once this PR is merged, I'm planning to integrate a similar approach to `DeviceSegmentedRadixSort` in #3245.

The PR is touching two tests:

For (1), we switched from invoking `std::stable_sort` as a means of verifying that the items were sorted correctly to using histograms over the input items. This lowered per-test-instance run time from six minutes to six seconds for these tests.

For (2), (a) tests never finished and (b) segment generation was generating overlapping segments, which led to test failures, because it creates a race on which of the segments pointing to the same output region would be sorted first. So, we switched from generating random inputs to generating a sequence of `0, 1, 2, ..., max_histo_size-1, 0, 1, 2`. We use a fixed segment size over this input sequence, chunking it up, say, every 1000 items. We then use an analytical model to compute the histogram over the input values for a given segment and use that histogram to understand what the sorted output range of that segment would look like. E.g., if we know `0` is repeated four times in the first segment, we know the sorted sequence should start with `0` and, beginning at offset four, should continue with key `1`. So on and so forth.

Checklist
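For readers who want to see the histogram-based check from the description in concrete form, here is a minimal host-side sketch. It is not the code from this PR: `analytic_histogram` and `matches_histogram` are hypothetical names, `num_keys` stands in for `max_histo_size`, and the sketch assumes unsigned integer keys and ascending order. It computes the per-key counts of one segment of the periodic input sequence in closed form, then checks that a sorted segment is exactly the run-length expansion of those counts, without sorting a reference copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: per-key counts for the slice [begin, begin + len) of the
// periodic sequence 0, 1, ..., num_keys-1, 0, 1, ... computed without touching
// the data itself.
std::vector<std::size_t> analytic_histogram(std::size_t begin, std::size_t len, std::size_t num_keys)
{
  assert(num_keys > 0);
  // Every key occurs once per full period covered by the segment.
  std::vector<std::size_t> histo(num_keys, len / num_keys);
  // The leftover items form a partial period starting at key (begin % num_keys).
  const std::size_t remainder = len % num_keys;
  const std::size_t first_key = begin % num_keys;
  for (std::size_t i = 0; i < remainder; ++i)
  {
    ++histo[(first_key + i) % num_keys];
  }
  return histo;
}

// Hypothetical helper: returns true if `sorted` consists of histo[0] copies of
// key 0, followed by histo[1] copies of key 1, and so on, with nothing left over.
bool matches_histogram(const std::vector<std::size_t>& sorted, const std::vector<std::size_t>& histo)
{
  std::size_t offset = 0;
  for (std::size_t key = 0; key < histo.size(); ++key)
  {
    for (std::size_t n = 0; n < histo[key]; ++n, ++offset)
    {
      if (offset >= sorted.size() || sorted[offset] != key)
      {
        return false;
      }
    }
  }
  return offset == sorted.size();
}
```

With fixed-size segments, the expected histogram of segment `i` would be `analytic_histogram(i * segment_size, segment_size, num_keys)`, and the check is linear in the segment length. The real tests may well perform the comparison on the device; this sketch only illustrates the idea.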