GH-15936 add data frame transformation using polars #15942

wendycwong · 2023-11-22T00:46:16Z

This PR fixes issue #15936.

Several problems here:

datatable cannot be installed on Python 3.10 or higher. @st-pasha is working on a fix as we speak.
If datatable is not available, we automatically switch to polars and pyarrow. datatable remains the default toolbox.
A customer reported a test frame that fails when transformed using polars/pyarrow. I added a test to make sure the new implementation works. It works on my laptop. Let's see if Jenkins agrees.
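The fallback order described above (datatable by default, polars plus pyarrow when datatable is missing or when polars is requested explicitly) can be sketched as follows. This is only an illustration: `is_installed` and `choose_backend` are hypothetical names, not the helpers used in the PR, and returning None when nothing suitable is installed is an assumption about the remaining branch.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def choose_backend(requested="datatable", installed=is_installed):
    """Pick the conversion backend, or None to fall back to the single-threaded path.

    `installed` is injectable so the decision can be checked without
    actually installing or uninstalling anything.
    """
    # datatable is the default unless polars was requested explicitly.
    if requested != "polars" and installed("datatable"):
        return "datatable"
    # polars needs pyarrow for its to_pandas() conversion.
    if installed("polars") and installed("pyarrow"):
        return "polars"
    return None  # assumption: caller uses the original conversion path
```

Passing the availability check as an argument keeps the decision logic separate from the environment, which is in the spirit of the later review request not to modify the test machines during tests.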

tomasfryda · 2023-11-22T09:07:20Z

h2o-py/h2o/frame.py

-                        return dt_frame.to_pandas()
+                        if can_use_datatable() and not(module == 'polars'): # default to datatable unless specified
+                            return self.convert_with_datatable(fileName)
+                        elif can_use_polars() and can_use_pyarrow():  # 


This seems incorrect to me. You require pyarrow, yet you don't use it with polars.

AFAIK there are 4 scenarios that can occur here:

using pyarrow to load the csv and then convert it to pandas (pyarrow's internal data structure is supported in recent pandas versions, so no copy occurs here)

using datatable (we have that)

using polars (this is what you added in this PR)

using polars and pyarrow to avoid copying and reformatting data when converting from polars to pandas

I added the code that I used to benchmark that along with the results (in seconds).

    with MeasureTime("h2o -> csv -> pyArrow -> pandas"):
        _, tmp = tempfile.mkstemp(suffix=".csv")
        os.unlink(tmp)
        try:
            h2o.export_file(frame, tmp)
            dt_frame = pyarrow.csv.read_csv(tmp)
            pd_frame = dt_frame.to_pandas()
        finally:
            os.unlink(tmp)

    with MeasureTime("h2o -> csv -> datatable -> pandas"):
        _, tmp = tempfile.mkstemp(suffix=".csv")
        os.unlink(tmp)
        try:
            h2o.export_file(frame, tmp)
            dt_frame = dt.fread(tmp)
            pd_frame = dt_frame.to_pandas()
        finally:
            os.unlink(tmp)

    with MeasureTime("h2o -> csv -> polars -> pandas"):
        _, tmp = tempfile.mkstemp(suffix=".csv")
        os.unlink(tmp)
        try:
            h2o.export_file(frame, tmp)
            dt_frame = pl.read_csv(tmp)
            pd_frame = dt_frame.to_pandas()
        finally:
            os.unlink(tmp)

    with MeasureTime("h2o -> csv -> polars (pyArrow) -> pandas"):
        _, tmp = tempfile.mkstemp(suffix=".csv")
        os.unlink(tmp)
        try:
            h2o.export_file(frame, tmp)
            dt_frame = pl.read_csv(tmp)
            pd_frame = dt_frame.to_pandas(use_pyarrow_extension_array=True)
        finally:
            os.unlink(tmp)
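The MeasureTime helper used in the benchmark is not shown in the thread. A minimal version might look like the sketch below; the class body, the `elapsed` attribute, and the output format are assumptions made for illustration, not the reviewer's actual code.

```python
import time


class MeasureTime:
    """Context manager that prints and stores the wall-clock time of its block."""

    def __init__(self, label):
        self.label = label
        self.elapsed = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print("{}: {:.3f} s".format(self.label, self.elapsed))
        return False  # never swallow exceptions raised inside the block


# Usage with a stand-in workload instead of an H2O export.
with MeasureTime("sum of squares") as timer:
    total = sum(i * i for i in range(100000))
```

time.perf_counter is used because it is a monotonic, high-resolution clock, so the measurement is not affected by system clock adjustments.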

@tomasfryda

Polars uses pyarrow to do the to_pandas conversion. Hence, I need to have pyarrow. Here is the link: https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.to_pandas.html

I tried to just use your pyarrow suggestion. Python 3.6 works fine but Python 3.11 is not happy. I get this error:

It just means that the null value is an empty string instead of a float NaN.

Oh, sorry, I thought pyarrow was used only with use_pyarrow_extension_array. But you are right.

Using use_pyarrow_extension_array requires recent pandas and pyarrow. It would be nice to let users opt in, since it can bring a significant performance benefit (especially if you don't have enough memory to duplicate the data), but the restrictions on pandas and pyarrow versions make it a poor default.

I tried using pyarrow all the way; it was not able to handle NaN when used with Python 3.6. However, it does not have a problem with Python 3.9 or above.
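The NaN discrepancy mentioned above (a missing cell coming back as an empty string rather than a float NaN) can be illustrated with the standard library alone. This sketch does not reproduce the pyarrow behaviour itself; `parse_numeric_cell` is a hypothetical helper that shows the kind of normalisation a test comparison would need.

```python
import csv
import io
import math


def parse_numeric_cell(cell):
    """Convert one CSV cell to float, mapping an empty cell to NaN."""
    return float("nan") if cell == "" else float(cell)


raw = "x,y\n1.5,2\n,4\n"  # the second data row has a missing value in column x
rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]

as_read = [row[0] for row in data]                        # missing cell stays ''
converted = [parse_numeric_cell(row[0]) for row in data]  # missing cell becomes NaN

print(as_read, math.isnan(converted[1]))
```

A frame comparison that expects NaN will report a mismatch against the empty-string form even though both represent the same missing value.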

sebhrusen · 2023-11-23T11:42:39Z

h2o-py/h2o/frame.py

@@ -1932,7 +1932,7 @@ def structure(self):
            else:
                print("num {}".format(" ".join(it[0] if it else "nan" for it in h2o.as_list(self[:10, i], False)[1:])))

-    def as_data_frame(self, use_pandas=True, header=True, multi_thread=False):
+    def as_data_frame(self, use_pandas=True, header=True, multi_thread=False, module='datatable'):


I find the API even more confusing now (well, it already was). Why not simply have something like:

    def as_data_frame(self, format='pandas', header=True):
        """
        :param string format: the format of the data frame, can be one of:
            - list: the data frame is represented as a list of list (native python).
            - pandas: returns a pandas data frame (default)
            - datatable: returns a datatable frame (not supported for Py 3.10+)
            - polars: ...
            - pyarrow: ...
          Note that if speed is a concern (large data), it is recommended to use one of `polars` (recommended),
          `pyarrow` or `datatable` if it is already installed in your environment; the resulting dataframe
          can then easily be converted to `pandas` by calling `to_pandas()` on it.
        """

Note that the @deprecated_params annotation would let us convert the use_pandas param to either result_format='pandas' or result_format='list', and the multi_thread param to one of the other preferred formats.

It would also greatly simplify the implementation.
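The mapping from the old boolean flag to a format string could look roughly like the sketch below. It deliberately does not use h2o's own @deprecated_params, whose signature is not shown in this thread; `map_deprecated_param` is a hypothetical stand-in, and the function body is a placeholder.

```python
import functools
import warnings


def map_deprecated_param(old_name, new_name, translate):
    """Decorator: accept `old_name` as a keyword, warn, and pass `new_name=translate(value)`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old_name in kwargs:
                if new_name in kwargs:
                    raise TypeError(
                        "pass either {!r} or {!r}, not both".format(old_name, new_name)
                    )
                warnings.warn(
                    "{!r} is deprecated, use {!r} instead".format(old_name, new_name),
                    DeprecationWarning,
                    stacklevel=2,
                )
                kwargs[new_name] = translate(kwargs.pop(old_name))
            return func(*args, **kwargs)
        return wrapper
    return decorator


@map_deprecated_param("use_pandas", "result_format",
                      lambda flag: "pandas" if flag else "list")
def as_data_frame(result_format="pandas", header=True):
    # Stand-in body: a real implementation would dispatch on result_format.
    return result_format
```

With this shape, existing calls such as as_data_frame(use_pandas=False) keep working and emit a DeprecationWarning, while new code passes result_format directly.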

as_data_frame is just used to convert an H2OFrame to a pandas dataframe. The issue here is that the user wants it to be faster using multi-threading. The final result is always a pandas dataframe. I am just using either datatable or polars/pyarrow to enable the multi-thread mode. If the user just calls as_data_frame without anything special, they get the original single-threaded conversion. Only the people in the know will know about multi-thread and the different python toolboxes.

"The final result is always a pandas dataframe."

Well, not really, we often use it to convert the h2o frame to python lists as well.

"Only the people in the know will know about multi-thread and the different python toolboxes."

Not if it's mentioned in the documentation that it's preferable to use polars for top performance (and then convert to pandas if needed).
I mean, with your current changes you already ask users to set multi_thread=True when they just want something faster; they don't care whether it uses one thread or ten. On top of that, you ask them to choose the module that will be used as an intermediate to generate this data frame. It's way too complicated.

My goal is to simplify this, and at the same time provide a more powerful tool that would also convert directly to polars or pyarrow if the user needs to.

Seb:

The issue I am working on is only about converting an H2O frame to a pandas frame for a customer. The only thing the user needs to set is multi_thread=True if they want multi-threading. I want them to use datatable by default and, if they insist, they can choose polars. Pyarrow does not convert NaN correctly for Python versions < 3.9, so I am not going to fix that now.

The interface is complicated but necessary to achieve what the customer is looking for.

You are right that the default output is not always a pandas frame. However, the work in this issue is only about transforming an H2O frame to a pandas frame.

The use_pandas parameter is confusing. result_format is much better.

I deprecated the use_pandas parameter in favor of pandas_frame, which denotes that the output is a pandas frame when pandas_frame is true. I don't know what else to simplify. Please tell me what you have in mind. Thanks.

sebhrusen

provided suggestions to avoid modifying the test environment during the tests

h2o-py/h2o/frame.py

h2o-py/h2o/utils/shared_utils.py

h2o-py/tests/testdir_misc/pyunit_gh_15729_15936_datatable_polars_2_pandas_large.py

GH-15936: fixed all tests problems with python 3.11.

…tions. Move can_install_module to test environment per suggestions from Seb.

h2o-py/tests/testdir_misc/pyunit_gh_15729_15936_datatable_polars_2_pandas.py

h2o-py/tests/testdir_misc/pyunit_gh_15729_15936_datatable_polars_2_pandas_large.py

h2o-py/tests/testdir_misc/pyunit_gh_15936_polars2pandas_error.py

…ments.txt per seb suggestions

h2o-py/requirements.txt

sebhrusen

@wendycwong thanks for the changes.
Please still specify the versions in test-requirements.txt

h2o-py/test-requirements.txt

sebhrusen

thank you @wendycwong, and sorry for the long back and forth: I seem to have difficulties expressing myself clearly these days :)

Btw, as test-requirements.txt had to be changed, I'm wondering if we don't need to build fresh images…

h2o-py/h2o/frame.py

sebhrusen

thank you @wendycwong

wendycwong requested review from mn-mikke, sebhrusen and tomasfryda November 22, 2023 00:46

wendycwong changed the base branch from master to rel-3.44.0 November 22, 2023 00:46

wendycwong force-pushed the wendy_gh_15936_data_frame_polar branch from 8189898 to ef8a7ff Compare November 22, 2023 00:49

tomasfryda reviewed Nov 22, 2023

sebhrusen reviewed Nov 23, 2023

wendycwong force-pushed the wendy_gh_15936_data_frame_polar branch from 1beb33a to 39faaf3 Compare November 28, 2023 17:29

wendycwong requested review from sebhrusen and tomasfryda November 28, 2023 23:10

sebhrusen requested changes Nov 29, 2023

wendycwong added 2 commits November 30, 2023 15:45

GH-15936: add support to polar and pyarrow

4432da9

GH-15936: fixed all tests problems with python 3.11.

GH-15936: use local_context to avoid changing test machines configura…

65fdc84

…tions. Move can_install_module to test environment per suggestions from Seb.

wendycwong force-pushed the wendy_gh_15936_data_frame_polar branch from aefca7c to 65fdc84 Compare December 1, 2023 00:41

wendycwong requested a review from sebhrusen December 1, 2023 00:43

sebhrusen reviewed Dec 1, 2023

wendycwong added 3 commits December 1, 2023 09:45

add polars, datatable and pyarrow as optional requirements to require…

8a20512

…ments.txt per seb suggestions

remove can_install_modules

e1971bc

remove unwanted methods.

cad8e3e

sebhrusen reviewed Dec 1, 2023

h2o-py/requirements.txt

wendycwong added 2 commits December 1, 2023 09:51

removed unwanted imports

15077db

add polars/datatable/pyarrows to test requirements.

a00b567

wendycwong requested a review from sebhrusen December 1, 2023 17:58

simplified tests.

1be7447

sebhrusen previously approved these changes Dec 4, 2023

h2o-py/test-requirements.txt

added python module version

38e2a55

wendycwong dismissed sebhrusen’s stale review via 38e2a55 December 4, 2023 14:50

wendycwong requested a review from sebhrusen December 4, 2023 14:51

sebhrusen previously approved these changes Dec 4, 2023

Fix failing test due to new warning message from as_data_frame.

f86f1dd

wendycwong dismissed sebhrusen’s stale review via f86f1dd December 5, 2023 00:32

sebhrusen reviewed Dec 5, 2023

h2o-py/h2o/frame.py

wendycwong added 2 commits December 5, 2023 08:59

capture correct warning message

69e3710

make warning only appears once.

4bd9c85

wendycwong requested a review from sebhrusen December 5, 2023 18:16

sebhrusen approved these changes Dec 5, 2023

wendycwong merged commit 965a03c into rel-3.44.0 Dec 6, 2023
2 checks passed

wendycwong deleted the wendy_gh_15936_data_frame_polar branch December 6, 2023 18:28
