Add support for org.apache.spark.sql.catalyst.expressions.Bin #2760
base: branch-25.02
Conversation
Signed-off-by: Yan Feng <[email protected]>
LGTM, would love to see some Java tests.
```cpp
#include <cudf/column/column_device_view.cuh>
#include <cudf/column/column_factories.hpp>
#include <cudf/detail/null_mask.hpp>
#include <cudf/detail/nvtx/ranges.hpp>
```
```cpp
TEST_F(LongToBinaryStringTests, FromLongToBinary)
{
  auto const longs = cudf::test::fixed_width_column_wrapper<int64_t>{0L, 1L, 10L, -0L, -1L};
```
Could we add some edge cases like null, LONG_MAX, and LONG_MIN?
Do we have a test case at the plugin level to make sure Bin(13.3) returns 1101?
> Do we have a test case at the plugin level to make sure Bin(13.3) returns 1101?

I tested it locally, and Bin(13.3) indeed returns 1101. I will soon submit the plugin PR.
This is really a Spark test, not a good test for binary ops. Spark only accepts a long as the input to bin, so any floating-point input gets a cast to long inserted before bin is called. That test is really a test that Spark is doing the right thing. Even then, it would probably be something we would want to put in the integration tests if we tested it at all.
Note that you can also pass in a string as an input and Spark will still try to cast it to a long before calling bin.
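For context, Spark's cast-then-bin semantics can be mimicked on the host. A minimal sketch, assuming Java's Long.toBinaryString-style two's-complement formatting (`spark_bin` is a hypothetical helper name, not the plugin's API):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Host-side sketch of Spark's bin() semantics: any numeric input is first
// cast to a 64-bit long, then formatted as the unsigned two's-complement
// binary string of that long.
std::string spark_bin(double input) {
  auto const value = static_cast<int64_t>(input);  // the cast Spark inserts
  auto bits        = static_cast<uint64_t>(value);
  if (bits == 0) { return "0"; }
  std::string out;
  while (bits != 0) {
    out.insert(out.begin(), (bits & 1) ? '1' : '0');
    bits >>= 1;
  }
  return out;
}
```

With this model, `spark_bin(13.3)` yields `"1101"` (13.3 is truncated to 13 before formatting), and a negative input such as -1 yields 64 ones.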
```cpp
{
  auto const size = 64 - __clzll(value);
  // If the value is 0, the size should be 1
  return size > 0 ? size : 1;
```
nit: we can first check if the value == 0 to save a `__clzll` call in this case.
But wouldn't that introduce thread divergence? I understand how that can be an advantage on a CPU, but I don't really see it on a GPU.
The PTX translated from the code is as follows:

```cpp
auto const size = 64 - __clzll(value);
return size > 0 ? size : 1;
```

```
// PTX
clz.b64  %r1, %rd1;      // %r1: number of leading zeros in %rd1 (value)
mov.u32  %r2, 64;        // %r2: constant value 64
sub.s32  %r3, %r2, %r1;  // %r3: size = 64 - %r1
max.s32  %r4, %r3, 1;    // %r4: return max(size, 1)
```

versus:

```cpp
if (!value) return 1;
return 64 - __clzll(value);
```

```
// PTX
setp.eq.s64 %p1, %rd1, 0;  // %p1: true if %rd1 (value) is zero
clz.b64  %r1, %rd1;        // %r1: number of leading zeros in %rd1 (value)
mov.u32  %r2, 64;          // %r2: constant value 64
sub.s32  %r3, %r2, %r1;    // %r3: size = 64 - %r1
selp.b32 %r4, 1, %r3, %p1; // %r4: if %p1 is true, return 1; otherwise %r3
```
> But wouldn't that introduce thread divergence? I understand how that can be an advantage on a CPU, but I don't really see it on a GPU.

Yes, I was thinking there must be an if/else to check the size == 0 case, so we could put it earlier than clz to save some calls without introducing a new branch. But it looks like the compiler will optimize `size > 0 ? size : 1;` into a `max.s32`, so it's branchless and the original approach looks better anyway.
I have now changed it to this: `return max(64 - __clzll(value), 1);`
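The `max(64 - __clzll(value), 1)` length computation can be checked on the host; a quick sketch, using GCC/Clang's `__builtin_clzll` as a stand-in for CUDA's `__clzll` (helper names here are illustrative, not the PR's code):

```cpp
#include <cassert>
#include <cstdint>
#include <algorithm>

// Host stand-in for CUDA's __clzll. On the device, __clzll(0) is defined and
// returns 64; __builtin_clzll(0) is undefined on the host, so special-case it.
int clzll(int64_t value) {
  auto const bits = static_cast<uint64_t>(value);
  return bits == 0 ? 64 : __builtin_clzll(bits);
}

// The branch-free string length used in the kernel: a value of 0 still
// needs one character ("0"), hence the max with 1.
int binary_string_size(int64_t value) {
  return std::max(64 - clzll(value), 1);
}
```

For example, 10 (binary 1010) needs 4 characters, 0 needs 1, and any negative value needs all 64.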
```cpp
auto results = spark_rapids_jni::long_to_binary_string(longs, cudf::get_default_stream());

auto const expected = cudf::test::strings_column_wrapper{
  "0", "1", "1010", "0", "1111111111111111111111111111111111111111111111111111111111111111"};
```
The main thing for me is going to be performance. Do you have any numbers comparing the CPU to the GPU? If you want to wait for a plugin patch that is fine. But we need to do it somewhere.
```cpp
auto const value = d_longs.element<LongType>(idx);
char* d_buffer   = d_chars + d_offsets[idx];
for (auto i = d_sizes[idx] - 1; i >= 0; --i) {
  *d_buffer++ = value & (1LL << i) ? '1' : '0';
```
```cpp
*d_buffer++ = '0' + ((value & (1LL << i)) >> i);
```

Perhaps this approach is more efficient since it avoids branching, which might degrade performance on GPUs due to warp divergence.
Maybe `*d_buffer++ = '0' + ((value & (1LL << i)) != 0);`? It will be (very slightly) cheaper and easier to read.
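Both formulations are branch-free and produce the same characters; a host-side sketch with hypothetical helper names (note the cast to unsigned, which the shift variant needs at bit 63 to avoid a sign-extending arithmetic right shift):

```cpp
#include <cassert>
#include <cstdint>

// Shift variant: move the masked bit back down to 0/1. With signed
// arithmetic, (value & (1LL << 63)) >> 63 would sign-extend to -1, so
// work on the unsigned reinterpretation instead.
char digit_shift(int64_t value, int i) {
  auto const bits = static_cast<uint64_t>(value);
  return static_cast<char>('0' + ((bits & (1ULL << i)) >> i));
}

// Comparison variant: != yields exactly 0 or 1, no shift needed.
char digit_compare(int64_t value, int i) {
  auto const bits = static_cast<uint64_t>(value);
  return static_cast<char>('0' + ((bits & (1ULL << i)) != 0));
}
```

Both compile down to branchless code; the comparison form sidesteps the bit-63 sign-extension pitfall entirely.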
@thirtiseven Yes, this one is also a branch-free expression, since the compiler shall use a `setne` instruction, which avoids branching by directly setting a register based on the zero flag (ZF):

```
cmp   rax, 0
setne al
and   al, 1
add   eax, 48
```

The corresponding code for my alternative would be translated into:

```
sar rax, cl
add rax, 48
```
|
||
__device__ void operator()(cudf::size_type idx) | ||
{ | ||
if (d_longs.is_null(idx)) { |
NIT: just a nice-to-have improvement: use `if constexpr` instead of `if` and add an extra template parameter `nullable` for this functor, because we already know whether the column_view is nullable or not.
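That suggestion could look roughly like this. A host-compilable sketch under C++17, where `Column` is a stand-in for `cudf::column_device_view` and `bin_size_fn` is a hypothetical name:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for cudf::column_device_view, just enough for the sketch.
struct Column {
  std::vector<int64_t> values;
  std::vector<bool> valid;  // empty when the column has no null mask
  bool is_null(int idx) const { return !valid.empty() && !valid[idx]; }
};

// Functor templated on nullability: for non-nullable columns the null
// check is compiled away entirely by if constexpr.
template <bool nullable>
struct bin_size_fn {
  Column const& d_longs;
  int operator()(int idx) const {
    if constexpr (nullable) {
      if (d_longs.is_null(idx)) { return 0; }  // null rows produce size 0
    }
    auto bits = static_cast<uint64_t>(d_longs.values[idx]);
    int size  = 0;
    while (bits != 0) { ++size; bits >>= 1; }
    return size > 0 ? size : 1;  // value 0 still needs one character
  }
};
```

The dispatch site would then pick `bin_size_fn<true>` or `bin_size_fn<false>` based on `column_view::nullable()`, paying for the null check only when a null mask exists.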
About the performance, we may try the following approach. Instead of

```cpp
for (auto i = d_sizes[idx] - 1; i >= 0; --i) {
  *d_buffer++ = value & (1LL << i) ? '1' : '0';
```

create a buffer on the stack:

```cpp
char buf[64];
// 64 such lines in total:
buf[0] = value & (1LL << 0) ? '1' : '0';
buf[1] = value & (1LL << 1) ? '1' : '0';
buf[2] = value & (1LL << 2) ? '1' : '0';
...
buf[62] = value & (1LL << 62) ? '1' : '0';
buf[63] = value & (1LL << 63) ? '1' : '0';
// Some of the above lines are unnecessary if the long values have leading
// zeros, but keep them to avoid GPU divergence.

// Write out 4 chars at a time:
num_chars = x;
while (num_chars > 4) {
  int* d_int_buf = (int*)(d_buffer);
  int* int_buf   = (int*)buf;
  d_int_buf[i]   = int_buf[i];
  num_chars -= 4;
}
// write out the remaining chars

// Or use memcpy to copy multiple chars at one time:
memcpy(dst, src, size);
```
I found an example, maybe it's useful:

```cpp
void intToBin(uint64_t i, char* out) {
  assert((reinterpret_cast<uintptr_t>(out) % sizeof(uint64_t)) == 0);  // check alignment
  uint64_t zeros   = 0x3030303030303030ULL;  // "00000000"
  uint64_t mask    = 0x0101010101010101ULL;
  uint64_t* out_64 = reinterpret_cast<uint64_t*>(out);
  unsigned long bit_index = 0;
  _BitScanForward64(&bit_index, i);
  i = i << bit_index;
  out_64[0] = zeros | _pdep_u64(i >> 56, mask);
  out_64[1] = zeros | _pdep_u64(i >> 48, mask);
  out_64[2] = zeros | _pdep_u64(i >> 40, mask);
  out_64[3] = zeros | _pdep_u64(i >> 32, mask);
  out_64[4] = zeros | _pdep_u64(i >> 24, mask);
  out_64[5] = zeros | _pdep_u64(i >> 16, mask);
  out_64[6] = zeros | _pdep_u64(i >> 8, mask);
  out_64[7] = zeros | _pdep_u64(i >> 0, mask);
  _BitScanReverse64(&bit_index, i);
  out[64 - bit_index] = '\0';
}
```

Sorry, I did not find a CUDA version of `_pdep_u64`.
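There is indeed no `_pdep_u64` intrinsic in CUDA, but for this particular mask (one bit per byte) the deposit can be emulated with a multiply. A host sketch of the idea, assuming a little-endian byte order and using hypothetical helper names; unlike plain pdep, this variant also reverses bit order so the output reads MSB-first:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Emulates _pdep_u64(b, 0x0101010101010101) for one input byte, with bit
// order reversed so byte 0 of the result holds bit 7 of b. The magic
// multiplier places each source bit at a distinct byte's top bit (no two
// partial products overlap, so there are no carries); masking and shifting
// then turns each lane into 0x00 or 0x01.
uint64_t spread_bits_msb_first(uint8_t b) {
  return ((static_cast<uint64_t>(b) * 0x8040201008040201ULL) &
          0x8080808080808080ULL) >> 7;
}

// Writes 8 ASCII digits ('0'/'1') for one byte, most significant bit first.
void byte_to_digits(uint8_t b, char* out) {
  uint64_t const digits = 0x3030303030303030ULL | spread_bits_msb_first(b);
  std::memcpy(out, &digits, 8);  // little-endian: lane 0 lands first
}
```

For example, 0x0D (binary 00001101) expands to the eight characters "00001101". Whether this beats the simple per-character loop on a GPU would still need benchmarking.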
Contributes to NVIDIA/spark-rapids#11648

This PR adds support for org.apache.spark.sql.catalyst.expressions.Bin. What it does is take a `long` and output its binary representation.