Avoid converting Decimal32/Decimal64 in to_arrow and from_arrow APIs #17422
base: branch-25.02
Conversation
/ok to test
Looks like a style fix is needed, and you'll also need to merge in the latest 25.02 to get all of CI passing again. I'll review in a few.
@vyasr Updated to fix the style issue and updated the branch.
/ok to test
Partial review - looks good so far! Still need to go through the changes in `tests/interop`.
A few questions:
/ok to test
Hmm, it looks like the failing pyarrow tests might be due to a lack of support in pyarrow for decimal64/decimal32. I'll have to go fix that 😦
The failing tests will get addressed by apache/arrow#44882 and apache/arrow#45014. Once those are merged and the next pyarrow release is out, we can update the pyarrow version so that it will support decimal32/decimal64 directly in pyarrow.
Oops, I forgot to push the change. I've pushed the updates now; it should be good to run CI (hopefully).
@zeroshade Let's merge #17794 first, it is nearly done -- just waiting on CI to complete. I asked for a slight modification to the skip criteria there, so there will be minor merge conflicts with this PR.
/okay to test
"git_shallow" : false, | ||
"patches" : [ | ||
{ | ||
"file" : "${current_json_dir}/nanoarrow_clang_tidy_compliance.diff", |
Is this patch still relevant? I don’t see it here. Can this override be removed?
@zeroshade Can we remove this? I'm okay with merging this PR and opening a follow-up for cleanup, since CI has passed.
Looks like #17618 removed this patch. I'll merge this PR and file a follow-up to delete this override.
I realized we are missing some reviews. If we need to push any changes to this branch for other reviews, we should go ahead and delete this. Otherwise, I think merging this and removing the overrides in a follow-up is the best-case scenario.
Yes this patch should be removed, but I agree that it's not worth rerunning CI just for that. If there are no other changes here we can do it in a follow-up just fine.
/okay to test
@galipremsagar @davidwendt @bdice I don't see where the … Btw: we are working on pushing out a 19.0.1 patch fix for pyarrow to fix the issue, but it'll be a few days before that happens still.
Yes, sorry Matt. We are having some build issues in CI that we are trying to work through; they are showing up here as well.
/ok to test
I think the CI issues should be cleared up now. I triggered a new run.
Requesting reviews from the @rapidsai/cudf-java-codeowners and @rapidsai/cudf-python-codeowners teams.
I need to run some tests on Spark to know whether this is a breaking change. Spark uses Java Arrow (a version that we do not control) for interchange with Python. I am concerned that the version of pyarrow in use is also not compatible with this change, or, if it is, that it will send back decimal values that are not compatible with the Java version.
Approving python changes with one question about what the skipped tests mean for decimal32/64 support when the user doesn't have pyarrow 19.
pytest.param(
    Decimal64Dtype(6, 3),
    marks=pytest.mark.skipif(
        pa._generated_version.version_tuple[0] < 19,
        reason="decimal64 format string only supported in pyarrow >=19",
    ),
),
pytest.param(
    Decimal64Dtype(10, 6),
    marks=pytest.mark.skipif(
        pa._generated_version.version_tuple[0] < 19,
        reason="decimal64 format string only supported in pyarrow >=19",
    ),
),
pytest.param(
    Decimal64Dtype(16, 7),
    marks=pytest.mark.skipif(
        pa._generated_version.version_tuple[0] < 19,
        reason="decimal64 format string only supported in pyarrow >=19",
    ),
),
pytest.param(
    Decimal32Dtype(6, 3),
    marks=pytest.mark.skipif(
        pa._generated_version.version_tuple[0] < 19,
        reason="decimal32 format string only supported in pyarrow >=19",
    ),
),
Question: does this mean that decimal32/64 are no longer supported unless the user has pyarrow >= 19? Or do I misunderstand what is going on?
Nope, you're correct. (Well, they are still supported, but you can't convert to pyarrow unless you have pyarrow >= 19.)
The only way to avoid that would be to add a flag at the C++ level telling it to fall back and upcast the decimal32/64 to decimal128, and then conditionally set that flag when the user is on pyarrow < 19.
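A minimal sketch of that caller-side fallback from Python, assuming cudf's public decimal dtypes and `Series.to_arrow` behave as in current releases (the column values are illustrative):

```python
import cudf
import pyarrow as pa

s = cudf.Series([1234, 5678]).astype(cudf.Decimal64Dtype(precision=10, scale=2))

if pa._generated_version.version_tuple[0] < 19:
    # Old pyarrow rejects the decimal64 format string, so upcast first;
    # the decimal128 copy becomes explicit instead of happening silently.
    s = s.astype(cudf.Decimal128Dtype(precision=10, scale=2))

arrow_array = s.to_arrow()
```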
Does this tie in with the Java discussion here? #17422 (review)
Do we need a way to make this support older pyarrow versions with the previous behavior?
If we want to make it support older pyarrow versions, we'll need to decide on what the interface change to `to_arrow_device` / `to_arrow_host` should be to pass an option telling it to upcast decimal32/64 instead of doing zero-copy.
Where exactly is it failing? I would expect that the computation of `got` proceeds fine, but `assert_eq` fails because we do the comparison by converting the cudf object to host via `to_arrow`. Is that right?
If so, I am OK with skipping this. I think the same argument that @zeroshade made in #17422 (comment) applies here too: users can still do this, but they have to perform the cast themselves first. I don't view that as a negative since it makes explicit a copy that was previously happening implicitly anyway. We could make the test "work" by casting manually, but at that point we'd have to also inject extra information about the source dtype pre-cast since that information would be lost.
Relatedly, I expect that in the next release we'll move to pylibcudf exclusively exposing the capsule interface rather than pyarrow types in its interop, at which point we'll be even less tied to a particular pyarrow version's specific support for the Arrow spec than we are now.
If we wanted to be extra helpful to users, we could insert a dtype check in `to_arrow` and raise a more informative error. Given the upcoming changes, I'm inclined to say that's not worthwhile, though.
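For illustration, such a guard might look roughly like this (a hypothetical helper, not part of the PR; the name and message are made up):

```python
import cudf
import pyarrow as pa

_PA_MAJOR = pa._generated_version.version_tuple[0]

def _check_arrow_decimal_support(dtype):
    # Hypothetical guard: fail early with a clear message instead of
    # letting an older pyarrow choke on an unrecognized C ABI format string.
    if _PA_MAJOR < 19 and isinstance(
        dtype, (cudf.Decimal32Dtype, cudf.Decimal64Dtype)
    ):
        raise NotImplementedError(
            f"Converting {dtype} to pyarrow requires pyarrow >= 19; "
            "cast to Decimal128Dtype first."
        )
```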
@vyasr it fails on the pyarrow side because it doesn't recognize the schema format string coming from the C ABI. Older versions of pyarrow reject bit widths of 32 and 64.
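For context, these are the decimal format strings defined by the Arrow C data interface (a reference sketch; the version gate mirrors the one used in the tests above):

```python
import pyarrow as pa

# Decimal format strings sent across the C ABI, per the Arrow C data
# interface spec ("d:precision,scale[,bitWidth]"):
#   "d:7,3"       -> decimal128(7, 3)   bit width omitted, defaults to 128
#   "d:7,3,32"    -> decimal32(7, 3)    rejected by pyarrow < 19
#   "d:16,7,64"   -> decimal64(16, 7)   rejected by pyarrow < 19
#   "d:38,10,256" -> decimal256(38, 10)
NARROW_DECIMALS_OK = pa._generated_version.version_tuple[0] >= 19
```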
> Relatedly, I expect that in the next release we'll move to pylibcudf exclusively exposing the capsule interface rather than pyarrow types in its interop, at which point we'll be even less tied to a particular pyarrow version's specific support for the Arrow spec than we are now.
Has this work begun yet? I also really want to expose `to_arrow_device` and `from_arrow_device` in Python. Currently, pylibcudf forces a copy between CPU and GPU for arrow.
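As a sketch of what that capsule-based interop looks like from the consumer side (`producer` stands in for a hypothetical pylibcudf object implementing the Arrow PyCapsule protocol; pyarrow >= 14 is assumed):

```python
import pyarrow as pa

def consume_host(producer):
    # pa.table() accepts any object exposing __arrow_c_stream__ (pyarrow >= 14).
    # The schema travels as C ABI format strings, which is exactly where a
    # pre-19 pyarrow rejects the decimal32/64 bit widths.
    return pa.table(producer)

# A device-aware consumer would instead request __arrow_c_device_array__ /
# __arrow_c_device_stream__ capsules, avoiding the device-to-host copy.
```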
I am approving this for the Java side of things, but I am not happy with it, because it paints us into a corner. We support Spark 3.2.0+ right now. The Python interface for Spark 3.2.0 through 3.4.x requires at least pyarrow 1.0.0. Spark 3.5.0+ currently requires at least pyarrow 4.0.0.
Neither of these is compatible with pyarrow >= 19. So it will not work for us to support any kind of decimal transfer. The reason I am approving this is that the Spark RAPIDS plugin doesn't currently support decimal for Python UDFs. It looks like that was a missed feature when we added decimal support. I made a 3-line change to our plugin to add decimal support, tested it, and it worked just fine without this patch. With this patch it fails.
I would love the ability to serialize the data in a way that is compatible with older versions of arrow, but it is not a requirement right now.
@revans2 The only way for it to be compatible with older versions of arrow would be to have it perform the cast/copy/conversion to decimal128. That said, what makes those incompatible with pyarrow >= 19 if they only specify a minimum of 1.0 / 4.0? Do they have a maximum version they support? If the Spark interfaces don't work with the newer versions of pyarrow, that's a problem and should be addressed either by us in the Arrow project or by Spark.
We don't technically have any control over what the customer's environment is set up as, beyond the requirements that the CUDA drivers are installed and our plugin is on the classpath. Even the CUDA runtime can be too much for some of them (I'm looking at you, Dataproc). It is highly unlikely that anyone is running with the 5-year-old pyarrow 1.0.0, but it is possible (Spark 3.2.0, which we have customers running, is 4 years old). We want our plugin to be compatible with whatever setup they have already set up and tested. It is, however, very likely that our customers are not running pyarrow 19.0.0, which is 13 days old as of writing this. The environments we run in are often set by CSPs and large enterprises where they find something that works and don't touch it for years, except when the security people force them to. I want to reduce the friction for adopting our product as much as possible.
But you do have a point. I can work around it myself and cast the decimal values to DECIMAL_128 before I send them. It would slow things down, especially for nested data, but it would work.
There's one thread where @wence- asked about decimal32/64 support in the Python test suite; we should make sure we reach a resolution everyone is happy with. Aside from that, this looks great. Thanks as always for the attention to detail in the test suite!
"git_shallow" : false, | ||
"patches" : [ | ||
{ | ||
"file" : "${current_json_dir}/nanoarrow_clang_tidy_compliance.diff", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this patch should be removed, but I agree that it's not worth rerunning CI just for that. If there are no other changes here we can do it in a follow-up just fine.
pytest.param( | ||
Decimal64Dtype(6, 3), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal64 format string only supported in pyarrow >=19", | ||
), | ||
), | ||
pytest.param( | ||
Decimal64Dtype(10, 6), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal64 format string only supported in pyarrow >=19", | ||
), | ||
), | ||
pytest.param( | ||
Decimal64Dtype(16, 7), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal64 format string only supported in pyarrow >=19", | ||
), | ||
), | ||
pytest.param( | ||
Decimal32Dtype(6, 3), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal32 format string only supported in pyarrow >=19", | ||
), | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where exactly is it failing? I would expect that the computation of got
proceeds fine, but assert_eq
fails because we do the comparison by converting the cudf object to host via to_arrow
. Is that right?
If so, I am OK with skipping this. I think the same argument that @zeroshade made in #17422 (comment) applies here too: users can still do this, but they have to perform the cast themselves first. I don't view that as a negative since it makes explicit a copy that was previously happening implicitly anyway. We could make the test "work" by casting manually, but at that point we'd have to also inject extra information about the source dtype pre-cast since that information would be lost.
Relatedly, I expect that in the next release we'll move to pylibcudf exclusively exposing the capsule interface rather than pyarrow types in its interop, at which point we'll be even less tied to a particular pyarrow version's specific support for the Arrow spec than we are now.
pytest.param( | ||
Decimal64Dtype(6, 3), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal64 format string only supported in pyarrow >=19", | ||
), | ||
), | ||
pytest.param( | ||
Decimal64Dtype(10, 6), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal64 format string only supported in pyarrow >=19", | ||
), | ||
), | ||
pytest.param( | ||
Decimal64Dtype(16, 7), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal64 format string only supported in pyarrow >=19", | ||
), | ||
), | ||
pytest.param( | ||
Decimal32Dtype(6, 3), | ||
marks=pytest.mark.skipif( | ||
pa._generated_version.version_tuple[0] < 19, | ||
reason="decimal32 format string only supported in pyarrow >=19", | ||
), | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we wanted to be extra helpful to users we could insert a dtype check in to_arrow
and raise a more informative error. Given the upcoming changes I'm inclined to say that's not worthwhile, though.
Description
Now that the Arrow format includes `Decimal32` and `Decimal64` data types, cuDF no longer needs to convert them to decimal128 when importing/exporting values via the `to_arrow` and `from_arrow` APIs. Instead, we can just treat them like any other fixed-width data type and use the buffers directly. This doesn't fully address #17080, as it doesn't make any changes to the Parquet side of things.
This also incorporates the changes from #17405 which are needed for debug tests. That should get merged first, and then I can rebase this.
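A minimal sketch of the new behavior from Python, assuming pyarrow >= 19 (where `pa.decimal32` exists) and cudf's public decimal dtypes:

```python
import cudf
import pyarrow as pa

s = cudf.Series([100, 250]).astype(cudf.Decimal32Dtype(precision=6, scale=2))

# With this change, the narrow decimal is exported as-is rather than
# being widened to decimal128 on the way out.
arr = s.to_arrow()
assert arr.type == pa.decimal32(6, 2)

# The reverse direction keeps the 32-bit representation as well.
back = cudf.Series.from_arrow(arr)
```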
Checklist