Implement String Accumulations with nanoarrow #60667

WillAyd · 2025-01-06T22:04:44Z

This is an add-on to the great work that @rhshadrach is doing in #60633.

Here is performance on this branch:

In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: ser = pd.Series(["foo", "bar", "baz"] * 10000, dtype=pd.ArrowDtype(pa.string()))

In [4]: %timeit ser.cumsum()
504 ms ± 29.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit ser.cummin()
1.23 ms ± 12.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit ser.cummax()
1.23 ms ± 36.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

versus without nanoarrow:

In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: ser = pd.Series(["foo", "bar", "baz"] * 10000, dtype=pd.ArrowDtype(pa.string()))

In [4]: %timeit ser.cumsum()
1.72 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit ser.cummin()
1.49 ms ± 81.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit ser.cummax()
1.58 ms ± 42.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

While all three show some improvement, cummin / cummax don't gain as much (roughly 1.2-1.3x, versus about 3.4x for cumsum). I believe this is because the implementations still have to access the Python runtime within a tight loop to perform comparisons. If we cared to optimize further, we could look at utf8proc.
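For reference, here is a minimal pure-Python sketch of what the three accumulations compute. It is not the nanoarrow implementation in this PR; the helper names (`_accumulate`, `cumsum_str`, `cummin_str`, `cummax_str`) are hypothetical, and the null handling (nulls are skipped but keep their position) is an assumption modeled on the default `skipna=True` behavior.

```python
from typing import Callable, List, Optional


def _accumulate(
    values: List[Optional[str]],
    combine: Callable[[str, str], str],
) -> List[Optional[str]]:
    """Running accumulation; None entries are skipped but keep their slot."""
    out: List[Optional[str]] = []
    running: Optional[str] = None
    for value in values:
        if value is None:
            # Assumption: a null yields a null and does not reset the state.
            out.append(None)
            continue
        running = value if running is None else combine(running, value)
        out.append(running)
    return out


def cumsum_str(values: List[Optional[str]]) -> List[Optional[str]]:
    # "Sum" of strings is running concatenation.
    return _accumulate(values, lambda acc, v: acc + v)


def cummin_str(values: List[Optional[str]]) -> List[Optional[str]]:
    # One string comparison per element.
    return _accumulate(values, min)


def cummax_str(values: List[Optional[str]]) -> List[Optional[str]]:
    return _accumulate(values, max)


print(cumsum_str(["foo", "bar", "baz"]))
print(cummin_str(["foo", "bar", "baz"]))
print(cummax_str(["foo", "bar", "baz"]))
```

One thing the sketch makes visible: the cumsum output grows quadratically with the input. For the 30,000 three-character strings in the benchmark, the result holds 3 * 30000 * 30001 / 2 = 1,350,045,000 bytes of string data, which may account for much of the remaining cumsum time, whereas cummin / cummax only ever emit one of the existing values.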


WillAyd · 2025-01-08T20:54:58Z

subprojects/nanoarrow.wrap

@@ -0,0 +1,9 @@
+[wrap-file]


Ideally we would use the Meson WrapDB entry, but an upstream bug in nanoarrow prevents us from using the latest release (0.6.0). See also apache/arrow-nanoarrow#702.

Whenever 0.7.0 gets released we can go back to using the Meson WrapDB entry, although I don't think there is any huge rush for that either.

rhshadrach and others added 10 commits January 6, 2025 15:12

ENH: Implement cum* methods for PyArrow strings

e837689

cleanup

009d11b

Cleanup

3a9200d

fixup

d625522

Fix extension tests

de728ad

xfail test when there is no pyarrow

7c12f15

mypy fixups

a3650a9

Implement string accumulations with nanoarrow

83104a4

bump CI meson installs

7b8e782

Suppress warnings

f24c79f

WillAyd force-pushed the pyarrow-string-funcs branch from 2a3c754 to f24c79f Compare January 6, 2025 22:56

WillAyd added 3 commits January 6, 2025 18:03

Remove C++20 concept for now

78872f9

bump meson-python

2325b24

Use nanoarrow C++ helpers and iterate stream in accumulations

0cb78cb

WillAyd force-pushed the pyarrow-string-funcs branch from ad47864 to 0cb78cb Compare January 7, 2025 01:49

WillAyd added 2 commits January 7, 2025 10:56

Work around nanoarrow bug

c8b4fde

Revert back to C++17

192dba6

WillAyd force-pushed the pyarrow-string-funcs branch from 1700cae to 192dba6 Compare January 7, 2025 20:39

WillAyd added 3 commits January 8, 2025 10:53

Remove meson version pins

53fc8d9

Revert "Remove meson version pins"

c8d92b3

This reverts commit 53fc8d9.

Use nanoarrow commit

60c3e6f
