Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-15936 add data frame transformation using polars #15942

Merged
merged 12 commits into from
Dec 6, 2023

Conversation

wendycwong
Copy link
Contributor

This PR fixed the issue here: #15936

Several problems here:

  1. Datatable cannot be installed on Python 3.10 or higher. @st-pasha is fixing this issue as we speak;
  2. If datatable is not available, we will automatically switch to using python polars and pyarrow. Datatable is the default toolbox.
  3. Customer complained about a test frame that is failing when transforming using polars/pyarrow. I added a test to make sure the new implementation works. It works on my laptop. Let's see if Jenkins agrees.

@wendycwong wendycwong changed the base branch from master to rel-3.44.0 November 22, 2023 00:46
@wendycwong wendycwong force-pushed the wendy_gh_15936_data_frame_polar branch from 8189898 to ef8a7ff Compare November 22, 2023 00:49
return dt_frame.to_pandas()
if can_use_datatable() and not(module == 'polars'): # default to datatable unless specified
return self.convert_with_datatable(fileName)
elif can_use_polars() and can_use_pyarrow(): #
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems incorrect to me. You require the pyarrow and yet you don't use it with the polars.

AFAIK there are 4 scenarios that can occur here:

  • using pyarrow to load the csv and then convert it to pandas (pyarrow internal datastructure is supported in recent pandas versions so no copy occurs here)
  • using datatable (we have that)
  • using polars (this is what you added in this PR)
  • using polars and pyarrow to avoid copying and reformatting data when converting from polars to pandas

I added the code that I used to benchmark that along with the results (in seconds).

  with MeasureTime("h2o -> csv -> pyArrow -> pandas"):
      _, tmp = tempfile.mkstemp(suffix=".csv")
      os.unlink(tmp)
      try:
          h2o.export_file(frame, tmp)
          dt_frame = pyarrow.csv.read_csv(tmp)
          pd_frame = dt_frame.to_pandas()
      finally:
          os.unlink(tmp)
      
   with MeasureTime("h2o -> csv -> datatable -> pandas"):
      _, tmp = tempfile.mkstemp(suffix=".csv")
      os.unlink(tmp)
      try:
          h2o.export_file(frame, tmp)
          dt_frame = dt.fread(tmp)
          pd_frame = dt_frame.to_pandas()
      finally:
          os.unlink(tmp)
          
  with MeasureTime("h2o -> csv -> polars -> pandas"):
      _, tmp = tempfile.mkstemp(suffix=".csv")
      os.unlink(tmp)
      try:
          h2o.export_file(frame, tmp)
          dt_frame = pl.read_csv(tmp)
          pd_frame = dt_frame.to_pandas()
      finally:
          os.unlink(tmp)
          
  with MeasureTime("h2o -> csv -> polars (pyArrow) -> pandas"):
      _, tmp = tempfile.mkstemp(suffix=".csv")
      os.unlink(tmp)
      try:
          h2o.export_file(frame, tmp)
          dt_frame = pl.read_csv(tmp)
          pd_frame = dt_frame.to_pandas(use_pyarrow_extension_array=True)
      finally:
          os.unlink(tmp)       

Screenshot 2023-11-22 at 9 52 51

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomasfryda

Polars uses pyarrow to do the to_pandas conversion. Hence, I need to have pyarrow. Here is the link: https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.to_pandas.html

I tried to just use your pyarrow suggestion. Python 3.6 works fine but Python 3.11 is not happy. I get this error:

image

It just means that the null value is empty string instead of float nan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry I thought pyarrrow is used just when using the use_pyarrow_extension_array. But you are right.

Using use_pyarrow_extension_array requires recent pandas and pyarrow. It would be nice to allow user to use it since it can bring significant performance benefit (especially if you don't have enough memory to duplicate the data) but the restrictions on pandas and pyarrow versions are not good for having it as a default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using pyarrow all the way, it was not able to handle NaN when used with Python 3.6. However, it does not have a problem with Python 3.9 or above.

@@ -1932,7 +1932,7 @@ def structure(self):
else:
print("num {}".format(" ".join(it[0] if it else "nan" for it in h2o.as_list(self[:10, i], False)[1:])))

def as_data_frame(self, use_pandas=True, header=True, multi_thread=False):
def as_data_frame(self, use_pandas=True, header=True, multi_thread=False, module='datatable'):
Copy link
Contributor

@sebhrusen sebhrusen Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the api even more confusing now (well, it already was), why not simply have something like:

def as_data_frame(self, format='pandas', header=True):
    """
    :param string format: the format of the data frame, can be one of:
         - list: the data frame is represented as a list of list (native python).
         - pandas: returns a pandas data frame (default)
         - datatable: returns a datable frame (not supported for Py 3.10+)
         - polars: ...
         - pyarrow: ...
         Note that if speed is a concern (large data), it is recommended 
        to use one of `polars` (recommended), `pyarrow` or `datatable`
         if it is already installed in your environment, the resulting dataframe 
         can then easily be converted to `pandas` by calling `to_pandas()` on it.
   """

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that the @deprecated_params annotation would allow to convert the use_pandas param to either result_format='pandas' or result_format='list' and the multi_thread param to one of the preferred other format

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would also greatly simplify implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The as_data_frame is just used to convert h2oFrame to pandas dataframe. The issue here is the user wants it to be faster using multi-thread. The final result is always a pandas dataframe. I am just using either datatable or polars/pyarrow to use the multi-thread mode. If the user just call as_data_frame without anything special, they will get the original single thread mode of conversion. Only the people in the know will know about multi-thread and the different python toolboxes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The final result is always a pandas dataframe.

Well, not really, we often use it to convert the h2o frame to python lists as well.

Only the people in the know will know about multi-thread and the different python toolboxes.

Not if it's mentioned in the documentation that it's preferable to use polars for top performance (and then convert to pandas if needed).
I mean with your current changes, you already ask them not only to set the parameter multi_thread=True when users just want something faster, they don't care if it will use one or 10 threads. And then on top of that, you ask them to decide on the module that will be used as an intermediate to generate this dataframe: it's way too complicated.

My goal is to simplify this, and at the same time provide a more powerful tool that would also convert directly to polars or pyarrow if the user needs to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seb:

The work I am working to fix is to convert h2o Frame to Pandas frame only for a customer. The only thing the user needs to set is multi-thread=True if they want multi-thread. I want them to use datatable as a default and if they insist, they will choose polar. Pyarrow does not convert NaN correctly for Python version < 3.9 and hence I am not going to fix it now.

The interface is complicated but necessary to achieve what the customer is looking for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right regarding the default output is not always panda frame. However, the work in the issue is just for transforming h2o frame to panda frame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use_pandas is a confusing parameter. The result_format is much better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deprecated the use_pandas parameter to pandas_frame to denote the output is pandas frame if pandas_frame is true. I don't know what else to simplify. Please suggest to me what you are thinking. Thanks.

@wendycwong wendycwong force-pushed the wendy_gh_15936_data_frame_polar branch from 1beb33a to 39faaf3 Compare November 28, 2023 17:29
Copy link
Contributor

@sebhrusen sebhrusen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

provided suggestions to avoid modifying the test environment during the tests

h2o-py/h2o/frame.py Outdated Show resolved Hide resolved
h2o-py/h2o/utils/shared_utils.py Outdated Show resolved Hide resolved
wendycwong added 2 commits November 30, 2023 15:45
GH-15936: fixed all tests problems with python 3.11.
…tions. Move can_install_module to test environment per suggestions from Seb.
@wendycwong wendycwong force-pushed the wendy_gh_15936_data_frame_polar branch from aefca7c to 65fdc84 Compare December 1, 2023 00:41
@wendycwong wendycwong requested a review from sebhrusen December 1, 2023 00:43
@wendycwong wendycwong requested a review from sebhrusen December 1, 2023 17:58
sebhrusen
sebhrusen previously approved these changes Dec 4, 2023
Copy link
Contributor

@sebhrusen sebhrusen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wendycwong thanks for the changes.
Please still specify the versions in test-requirements.txt

h2o-py/test-requirements.txt Outdated Show resolved Hide resolved
sebhrusen
sebhrusen previously approved these changes Dec 4, 2023
Copy link
Contributor

@sebhrusen sebhrusen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @wendycwong, and sorry for the long back and forth: I seem to have difficulties expressing myself clearly those days :)

Btw, as test-requirements.txt had to be changed, I'm wondering if we don't need to build fresh images…

h2o-py/h2o/frame.py Outdated Show resolved Hide resolved
@wendycwong wendycwong requested a review from sebhrusen December 5, 2023 18:16
Copy link
Contributor

@sebhrusen sebhrusen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @wendycwong

@wendycwong wendycwong merged commit 965a03c into rel-3.44.0 Dec 6, 2023
2 checks passed
@wendycwong wendycwong deleted the wendy_gh_15936_data_frame_polar branch December 6, 2023 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants