Avoid converting Decimal32/Decimal64 in to_arrow and from_arrow APIs #17422
base: branch-25.02
Conversation
/ok to test
Looks like a style fix is needed, and you'll also need to merge in the latest 25.02 to get all of CI passing again. I'll review in a few.
@vyasr Updated to fix the style issue and updated the branch.
/ok to test
Partial review - looks good so far! Still need to go through the changes in `tests/interop`.
A few questions:
/ok to test
Hmm, it looks like the failing pyarrow tests might be due to a lack of support in pyarrow for decimal64/decimal32. I'll have to go fix that 😦
The failing tests will get addressed by apache/arrow#44882 and apache/arrow#45014. Once those are merged and the next pyarrow release is out, we can update the pyarrow version so that it will support decimal32/decimal64 directly in pyarrow.
Oops, I forgot to push the change. I've pushed the updates now; it should be good to run CI (hopefully).
@zeroshade Let's merge #17794 first, it is nearly done -- just waiting on CI to complete. I asked for a slight modification to the skip criteria there, so there will be minor merge conflicts with this PR.
/okay to test
"git_shallow" : false, | ||
"patches" : [ | ||
{ | ||
"file" : "${current_json_dir}/nanoarrow_clang_tidy_compliance.diff", |
Is this patch still relevant? I don’t see it here. Can this override be removed?
@zeroshade Can we remove this? I'm okay with merging this PR and opening a follow-up for cleanup, since CI has passed.
Looks like #17618 removed this patch. I'll merge this PR and file a follow-up to delete this override.
I realized we are missing some reviews. If we need to push any changes to this branch for other reviews, we should go ahead and delete this. Otherwise, I think merging this and removing the overrides in a follow-up is the best-case scenario.
Yes this patch should be removed, but I agree that it's not worth rerunning CI just for that. If there are no other changes here we can do it in a follow-up just fine.
/okay to test
@galipremsagar @davidwendt @bdice I don't see where the … Btw: we are working on pushing out a 19.0.1 patch fix for pyarrow to fix the issue, but it'll be a few days before that happens still.
Yes, sorry Matt. We are having some build issues in CI that we are trying to work through; they are showing up here as well.
/ok to test
I think the CI issues should be cleared up now. I triggered a new run.
Requesting reviews from the @rapidsai/cudf-java-codeowners and @rapidsai/cudf-python-codeowners teams.
I need to run some tests on Spark to know whether this is a breaking change. Spark uses Java Arrow (a version that we do not control) for interchange with Python. I am concerned that the version of pyarrow in use is also not compatible with this change, or, if it is, that it will send back decimal values that are not compatible with the Java version.
Approving python changes with one question about what the skipped tests mean for decimal32/64 support when the user doesn't have pyarrow 19.
pytest.param(
    Decimal64Dtype(6, 3),
    marks=pytest.mark.skipif(
        pa._generated_version.version_tuple[0] < 19,
        reason="decimal64 format string only supported in pyarrow >=19",
    ),
),
pytest.param(
    Decimal64Dtype(10, 6),
    marks=pytest.mark.skipif(
        pa._generated_version.version_tuple[0] < 19,
        reason="decimal64 format string only supported in pyarrow >=19",
    ),
),
pytest.param(
    Decimal64Dtype(16, 7),
    marks=pytest.mark.skipif(
        pa._generated_version.version_tuple[0] < 19,
        reason="decimal64 format string only supported in pyarrow >=19",
    ),
),
pytest.param(
    Decimal32Dtype(6, 3),
    marks=pytest.mark.skipif(
        pa._generated_version.version_tuple[0] < 19,
        reason="decimal32 format string only supported in pyarrow >=19",
    ),
),
Question: does this mean that decimal32/64 are no longer supported unless the user has pyarrow >= 19? Or do I misunderstand what is going on?
Nope, you're correct. (Well, they are still supported, but you can't convert to pyarrow unless you have pyarrow >= 19.)
The only way to avoid that would be to add a flag at the C++ level telling it to fall back and upcast the decimal32/64 to decimal128, and then conditionally set that flag when the user is on pyarrow < 19.
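A minimal sketch of that caller-side fallback from Python, assuming cudf's public decimal dtypes and `Series.to_arrow` behave as in current releases (the column values are illustrative):

```python
import cudf
import pyarrow as pa

s = cudf.Series([1234, 5678]).astype(cudf.Decimal64Dtype(precision=10, scale=2))

if pa._generated_version.version_tuple[0] < 19:
    # Old pyarrow rejects the decimal64 format string, so upcast first;
    # the decimal128 copy becomes explicit instead of happening silently.
    s = s.astype(cudf.Decimal128Dtype(precision=10, scale=2))

arrow_array = s.to_arrow()
```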
Does this tie in with the Java discussion here? #17422 (review)
Do we need a way to make this support older pyarrow versions with the previous behavior?
If we want to make it support older pyarrow versions, we'll need to decide on what the interface change to `to_arrow_device` / `to_arrow_host` should be to pass an option telling it to upcast decimal32/64 instead of doing zero-copy.
Where exactly is it failing? I would expect that the computation of `got` proceeds fine, but `assert_eq` fails because we do the comparison by converting the cudf object to host via `to_arrow`. Is that right?
If so, I am OK with skipping this. I think the same argument that @zeroshade made in #17422 (comment) applies here too: users can still do this, but they have to perform the cast themselves first. I don't view that as a negative since it makes explicit a copy that was previously happening implicitly anyway. We could make the test "work" by casting manually, but at that point we'd have to also inject extra information about the source dtype pre-cast since that information would be lost.
Relatedly, I expect that in the next release we'll move to pylibcudf exclusively exposing the capsule interface rather than pyarrow types in its interop, at which point we'll be even less tied to a particular pyarrow version's specific support for the Arrow spec than we are now.
If we wanted to be extra helpful to users, we could insert a dtype check in `to_arrow` and raise a more informative error. Given the upcoming changes, I'm inclined to say that's not worthwhile, though.
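For illustration, such a guard might look roughly like this (a hypothetical helper, not part of the PR; the name and message are made up):

```python
import cudf
import pyarrow as pa

_PA_MAJOR = pa._generated_version.version_tuple[0]

def _check_arrow_decimal_support(dtype):
    # Hypothetical guard: fail early with a clear message instead of
    # letting an older pyarrow choke on an unrecognized C ABI format string.
    if _PA_MAJOR < 19 and isinstance(
        dtype, (cudf.Decimal32Dtype, cudf.Decimal64Dtype)
    ):
        raise NotImplementedError(
            f"Converting {dtype} to pyarrow requires pyarrow >= 19; "
            "cast to Decimal128Dtype first."
        )
```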
@vyasr it fails on the pyarrow side because it doesn't recognize the schema format string coming from the C ABI. Older versions of pyarrow reject bit widths of 32 and 64.
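For context, these are the decimal format strings defined by the Arrow C data interface (a reference sketch; the version gate mirrors the one used in the tests above):

```python
import pyarrow as pa

# Decimal format strings sent across the C ABI, per the Arrow C data
# interface spec ("d:precision,scale[,bitWidth]"):
#   "d:7,3"       -> decimal128(7, 3)   bit width omitted, defaults to 128
#   "d:7,3,32"    -> decimal32(7, 3)    rejected by pyarrow < 19
#   "d:16,7,64"   -> decimal64(16, 7)   rejected by pyarrow < 19
#   "d:38,10,256" -> decimal256(38, 10)
NARROW_DECIMALS_OK = pa._generated_version.version_tuple[0] >= 19
```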
> Relatedly, I expect that in the next release we'll move to pylibcudf exclusively exposing the capsule interface rather than pyarrow types in its interop, at which point we'll be even less tied to a particular pyarrow version's specific support for the Arrow spec than we are now.
Has this work begun yet? I also really want to expose `to_arrow_device` and `from_arrow_device` in Python. Currently, pylibcudf forces a copy between CPU and GPU for arrow.
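As a sketch of what that capsule-based interop looks like from the consumer side (`producer` stands in for a hypothetical pylibcudf object implementing the Arrow PyCapsule protocol; pyarrow >= 14 is assumed):

```python
import pyarrow as pa

def consume_host(producer):
    # pa.table() accepts any object exposing __arrow_c_stream__ (pyarrow >= 14).
    # The schema travels as C ABI format strings, which is exactly where a
    # pre-19 pyarrow rejects the decimal32/64 bit widths.
    return pa.table(producer)

# A device-aware consumer would instead request __arrow_c_device_array__ /
# __arrow_c_device_stream__ capsules, avoiding the device-to-host copy.
```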
I am approving this for the Java side of things, but I am not happy with it, because it paints us into a corner. We support Spark 3.2.0+ right now. The Python interface for Spark 3.2.0 through 3.4.x requires at least pyarrow 1.0.0. Spark 3.5.0+ currently requires at least pyarrow 4.0.0.
Neither of these is compatible with pyarrow >= 19. So it will not work for us to support any kind of decimal transfer. The reason I am approving this is that the Spark RAPIDS plugin doesn't currently support decimal for Python UDFs. It looks like that was a missed feature when we added decimal support. I made a 3-line change to our plugin to add decimal support, tested it, and it worked just fine without this patch. With this patch it fails.
I would love the ability to serialize the data in a way that is compatible with older versions of arrow, but it is not a requirement right now.
@revans2 The only way for it to be compatible with older versions of arrow would be to have it perform the cast/copy/conversion to decimal128. That said, what makes those incompatible with pyarrow >= 19 if they only specify a minimum of 1.0 / 4.0? Do they have a maximum version they support? If the Spark interfaces don't work with the newer versions of pyarrow, that's a problem and should be addressed either by us in the Arrow project or by Spark.
We don't technically have any control over what the customer's environment is set up as, beyond the requirements that the CUDA drivers are installed and our plugin is on the classpath. Even the CUDA runtime can be too much for some of them (I'm looking at you, Dataproc). It is highly unlikely that anyone is running with the 5-year-old pyarrow 1.0.0, but it is possible (Spark 3.2.0, which we have customers running, is 4 years old). We want our plugin to be compatible with whatever setup they have already set up and tested. It is, however, very likely that our customers are not running pyarrow 19.0.0, which is 13 days old as of writing this. The environments we run in are often set by CSPs and large enterprises where they find something that works and don't touch it for years, except when the security people force them to. I want to reduce the friction for adopting our product as much as possible.
But you do have a point. I can work around it myself and cast the decimal values to DECIMAL_128 before I send them. It would slow things down, especially for nested data, but it would work.
There's one thread where @wence- asked about decimal32/64 support in the Python test suite; we should make sure we reach a resolution everyone is happy with. Aside from that, this looks great. Thanks as always for the attention to detail in the test suite!
"git_shallow" : false, | ||
"patches" : [ | ||
{ | ||
"file" : "${current_json_dir}/nanoarrow_clang_tidy_compliance.diff", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this patch should be removed, but I agree that it's not worth rerunning CI just for that. If there are no other changes here we can do it in a follow-up just fine.
pytest.param( | ||
Decimal64Dtype(6, 3), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal64 format string only supported in pyarrow >=19", | ||
), | ||
), | ||
pytest.param( | ||
Decimal64Dtype(10, 6), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal64 format string only supported in pyarrow >=19", | ||
), | ||
), | ||
pytest.param( | ||
Decimal64Dtype(16, 7), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal64 format string only supported in pyarrow >=19", | ||
), | ||
), | ||
pytest.param( | ||
Decimal32Dtype(6, 3), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal32 format string only supported in pyarrow >=19", | ||
), | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where exactly is it failing? I would expect that the computation of got
proceeds fine, but assert_eq
fails because we do the comparison by converting the cudf object to host via to_arrow
. Is that right?
If so, I am OK with skipping this. I think the same argument that @zeroshade made in #17422 (comment) applies here too: users can still do this, but they have to perform the cast themselves first. I don't view that as a negative since it makes explicit a copy that was previously happening implicitly anyway. We could make the test "work" by casting manually, but at that point we'd have to also inject extra information about the source dtype pre-cast since that information would be lost.
Relatedly, I expect that in the next release we'll move to pylibcudf exclusively exposing the capsule interface rather than pyarrow types in its interop, at which point we'll be even less tied to a particular pyarrow version's specific support for the Arrow spec than we are now.
pytest.param( | ||
Decimal64Dtype(6, 3), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal64 format string only supported in pyarrow >=19", | ||
), | ||
), | ||
pytest.param( | ||
Decimal64Dtype(10, 6), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal64 format string only supported in pyarrow >=19", | ||
), | ||
), | ||
pytest.param( | ||
Decimal64Dtype(16, 7), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal64 format string only supported in pyarrow >=19", | ||
), | ||
), | ||
pytest.param( | ||
Decimal32Dtype(6, 3), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal32 format string only supported in pyarrow >=19", | ||
), | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we wanted to be extra helpful to users we could insert a dtype check in to_arrow
and raise a more informative error. Given the upcoming changes I'm inclined to say that's not worthwhile, though.
Description
Now that the Arrow format includes `Decimal32` and `Decimal64` data types, cuDF no longer needs to convert them to decimal128 when importing/exporting values via the `to_arrow` and `from_arrow` APIs. Instead, we can just treat them like any other fixed-width data type and use the buffers directly. This doesn't fully address #17080, as it doesn't make any changes to the Parquet side of things.
This also incorporates the changes from #17405 which are needed for debug tests. That should get merged first, and then I can rebase this.
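A minimal sketch of the new behavior from Python, assuming pyarrow >= 19 (where `pa.decimal32` exists) and cudf's public decimal dtypes:

```python
import cudf
import pyarrow as pa

s = cudf.Series([100, 250]).astype(cudf.Decimal32Dtype(precision=6, scale=2))

# With this change, the narrow decimal is exported as-is rather than
# being widened to decimal128 on the way out.
arr = s.to_arrow()
assert arr.type == pa.decimal32(6, 2)

# The reverse direction keeps the 32-bit representation as well.
back = cudf.Series.from_arrow(arr)
```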
Checklist