PTX shfl_sync
#3241
🟩 CI finished in 1h 49m: Pass: 100%/170 | Total: 3d 02h | Avg: 26m 12s | Max: 1h 08m | Hits: 76%/22526
| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |
Modifications in project or dependencies?
| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 170)
# | Runner |
---|---|
125 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
15 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
I gave this a quick review. I would love to have @ahendriksen's opinion, since it touches his work on the PTX exposure. Also, he has a way better PTX understanding than me.
I have sent some comments in private as well. The `data` parameter should be a template parameter to allow shuffling any 32-bit value.
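A minimal host-side sketch of what "shuffling any 32-bit value" could look like, assuming the value is bit-copied to a `uint32_t` before the shuffle and back afterwards. The names `to_bits`, `from_bits`, `fake_shuffle_b32`, and `shuffle_any` are hypothetical and not part of this PR; the stand-in shuffle simply returns its input so the example runs without a GPU.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: view any 32-bit trivially copyable value as raw bits.
template <typename T>
std::uint32_t to_bits(const T& value) {
  static_assert(sizeof(T) == 4, "only 32-bit values can be shuffled");
  static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

// Hypothetical helper: rebuild a value of type T from raw bits.
template <typename T>
T from_bits(std::uint32_t bits) {
  static_assert(sizeof(T) == 4, "only 32-bit values can be shuffled");
  static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
  T value;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}

// Stand-in for the hardware shuffle (assumption): identity on the raw bits.
inline std::uint32_t fake_shuffle_b32(std::uint32_t bits) { return bits; }

// The data type is a template parameter, so float, int32_t, or any other
// 32-bit trivially copyable type goes through the same 32-bit path.
template <typename T>
T shuffle_any(const T& data) {
  return from_bits<T>(fake_shuffle_b32(to_bits(data)));
}
```

With this shape, `shuffle_any(1.5f)` and `shuffle_any(std::int32_t{-7})` both compile, while a 64-bit argument is rejected at compile time by the `static_assert`.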
template <dot_shfl_mode _ShuffleMode>
_CCCL_DEVICE static inline _CUDA_VSTD::uint32_t __shfl_sync_dst_lane(
  shfl_mode_t<_ShuffleMode> __shfl_mode,
As @ahendriksen said, this must be a template argument; otherwise it would not be usable in an `if constexpr`.
It is a bit annoying that the compiler has to instantiate multiple functions, one for each type, that perform exactly the same functionality.
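The template-argument point from this thread can be illustrated with a small C++17 sketch. It assumes the shuffle mode is carried by a tag type so that `if constexpr` can select the lane computation at compile time. The names `shuffle_mode`, `shuffle_mode_tag`, and `dst_lane` are hypothetical, and the arithmetic is deliberately simplified: it ignores the clamp and segment-mask handling of the real instruction.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the mode enum and tag type seen in the diff.
enum class shuffle_mode { up, down, bfly, idx };

template <shuffle_mode Mode>
struct shuffle_mode_tag {};

// Because Mode is a template argument, it is a constant expression and can
// drive `if constexpr`. A plain run-time function parameter could not.
template <shuffle_mode Mode>
constexpr std::uint32_t dst_lane(shuffle_mode_tag<Mode>, std::uint32_t lane, std::uint32_t offset) {
  if constexpr (Mode == shuffle_mode::up) {
    return lane - offset;  // simplified: no clamping, wraps if offset > lane
  } else if constexpr (Mode == shuffle_mode::down) {
    return lane + offset;  // simplified: no clamping at the warp boundary
  } else if constexpr (Mode == shuffle_mode::bfly) {
    return lane ^ offset;  // butterfly exchange partner
  } else {
    return offset;         // idx mode: the operand names the lane directly
  }
}
```

The trade-off raised above still applies: each distinct template argument produces its own instantiation, even when the generated code is identical.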
Co-authored-by: Bernhard Manfred Gruber <[email protected]>
🟩 CI finished in 1h 37m: Pass: 100%/170 | Total: 2d 17h | Avg: 23m 17s | Max: 1h 05m | Hits: 82%/22529
@ahendriksen @miscco I modified the return type and added the predicate as an output parameter in the last commit.
🟨 CI finished in 2h 59m: Pass: 98%/164 | Total: 3d 03h | Avg: 27m 26s | Max: 1h 13m | Hits: 434%/15316
🏃 Runner counts (total jobs: 164)
# | Runner |
---|---|
122 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
@@ -102,7 +96,7 @@ _CCCL_NODISCARD _CCCL_DEVICE static inline shfl_return_values<_Tp> shfl_sync(
       "shfl.sync.sync.idx.b32 %0|p, %2, %3, %4, %5; \n\t\t"
       "selp.s32 %1, 1, 0, p; \n\t"
       "}"
-      : "=r"(__ret), "=r"(__pred)
+      : "=r"(__ret), "=r"(__pred1)
Could we just `static_cast<_CUDA_VSTD::int32_t>(__pred)`? I thought `bool` was also 32-bit on GPUs.
I don't have a strong opinion on that. @ahendriksen do you have any preference, `bool` or `int` for `pred`?
By the way, `sizeof(bool)` is 1 even on CUDA. A single `bool` value can still physically occupy a 32-bit register, because there are no smaller registers on GPUs.
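As a rough illustration of the `bool` versus `int` question, the sketch below assumes the predicate leaves the inline assembly as a 32-bit integer (as the `selp.s32 %1, 1, 0, p` line in the diff suggests) and is converted to `bool` only when the result is packaged. The `shuffle_result` struct and `make_result` function are hypothetical stand-ins, not the PR's `shfl_return_values`.

```cpp
#include <cstdint>

// Holds on mainstream host compilers and, per the comment above, on CUDA.
// The C++ standard itself leaves sizeof(bool) implementation-defined.
static_assert(sizeof(bool) == 1, "bool is expected to be one byte in memory");

// Hypothetical result type: the shuffled value plus the predicate.
template <typename T>
struct shuffle_result {
  T data;
  bool pred;
};

// The raw predicate arrives as a 32-bit integer (0 or 1) and is narrowed to
// bool here, so callers never see the integer representation.
template <typename T>
shuffle_result<T> make_result(T data, std::int32_t raw_pred) {
  return shuffle_result<T>{data, raw_pred != 0};
}
```

Comparing against zero, rather than using a `static_cast`, makes the conversion explicit and independent of the exact nonzero value the assembly produces.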
🟨 CI finished in 1h 37m: Pass: 99%/164 | Total: 1d 01h | Avg: 9m 19s | Max: 1h 11m | Hits: 536%/17656
🟩 CI finished in 13h 56m: Pass: 100%/164 | Total: 1d 00h | Avg: 9m 00s | Max: 37m 07s | Hits: 536%/17656
Related to #2976
Description
Provide a C++ implementation of PTX `shfl_sync`. In addition to what the CUDA intrinsics offer, the function provides the following features: