GH-15936 add data frame transformation using polars #15942
8189898
to
ef8a7ff
Compare
h2o-py/h2o/frame.py
return dt_frame.to_pandas()
if can_use_datatable() and not(module == 'polars'): # default to datatable unless specified
    return self.convert_with_datatable(fileName)
elif can_use_polars() and can_use_pyarrow(): #
This seems incorrect to me. You require `pyarrow`, and yet you don't use it with polars.
AFAIK there are 4 scenarios that can occur here:
- using pyarrow to load the csv and then convert it to pandas (pyarrow internal datastructure is supported in recent pandas versions so no copy occurs here)
- using datatable (we have that)
- using polars (this is what you added in this PR)
- using polars and pyarrow to avoid copying and reformatting data when converting from polars to pandas
I added the code that I used to benchmark that along with the results (in seconds).
import os
import tempfile

import datatable as dt
import h2o
import polars as pl
import pyarrow.csv

# `frame` is an existing H2OFrame; MeasureTime is the timing helper used for the benchmark (not shown).

with MeasureTime("h2o -> csv -> pyArrow -> pandas"):
    # mkstemp creates the file; remove it so that only the unique path is kept
    _, tmp = tempfile.mkstemp(suffix=".csv")
    os.unlink(tmp)
    try:
        h2o.export_file(frame, tmp)
        dt_frame = pyarrow.csv.read_csv(tmp)
        pd_frame = dt_frame.to_pandas()
    finally:
        os.unlink(tmp)

with MeasureTime("h2o -> csv -> datatable -> pandas"):
    _, tmp = tempfile.mkstemp(suffix=".csv")
    os.unlink(tmp)
    try:
        h2o.export_file(frame, tmp)
        dt_frame = dt.fread(tmp)
        pd_frame = dt_frame.to_pandas()
    finally:
        os.unlink(tmp)

with MeasureTime("h2o -> csv -> polars -> pandas"):
    _, tmp = tempfile.mkstemp(suffix=".csv")
    os.unlink(tmp)
    try:
        h2o.export_file(frame, tmp)
        dt_frame = pl.read_csv(tmp)
        pd_frame = dt_frame.to_pandas()
    finally:
        os.unlink(tmp)

with MeasureTime("h2o -> csv -> polars (pyArrow) -> pandas"):
    _, tmp = tempfile.mkstemp(suffix=".csv")
    os.unlink(tmp)
    try:
        h2o.export_file(frame, tmp)
        dt_frame = pl.read_csv(tmp)
        # keep the Arrow-backed columns instead of converting them to NumPy arrays
        pd_frame = dt_frame.to_pandas(use_pyarrow_extension_array=True)
    finally:
        os.unlink(tmp)
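The benchmark relies on a `MeasureTime` helper that is not shown in the comment. A minimal sketch of what such a helper could look like, assuming it is a context manager that prints the elapsed wall-clock time in seconds (the class body and the `elapsed` attribute are assumptions, not the reviewer's actual code):

```python
import time


class MeasureTime:
    """Hypothetical stand-in for the MeasureTime helper used in the benchmark."""

    def __init__(self, label):
        self.label = label
        self.elapsed = None  # seconds, filled in when the block exits

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print("{}: {:.3f}".format(self.label, self.elapsed))
        return False  # never swallow exceptions raised inside the timed block
```

Because the export to CSV happens inside each timed block, every measurement includes the `h2o.export_file` cost as well as the parsing and conversion steps.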
Polars uses pyarrow to do the to_pandas conversion. Hence, I need to have pyarrow. Here is the link: https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.to_pandas.html
I tried to just use your pyarrow suggestion. Python 3.6 works fine but Python 3.11 is not happy. I get this error:
[image: screenshot of the error]
It just means that the null value is an empty string instead of a float NaN.
Oh, sorry, I thought pyarrow was used only with `use_pyarrow_extension_array`. But you are right.
Using `use_pyarrow_extension_array` requires recent pandas and pyarrow versions. It would be nice to let users enable it, since it can bring a significant performance benefit (especially if you don't have enough memory to duplicate the data), but the restrictions on pandas and pyarrow versions make it a poor default.
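One way to offer this as an opt-in is to check the installed versions first and fall back to the regular conversion otherwise. The sketch below assumes minimum versions of pandas 1.5 and pyarrow 8.0; those thresholds, and the helper names `parse_version`, `can_use_arrow_extension` and `polars_to_pandas`, are illustrative assumptions and should be checked against the polars documentation:

```python
# Assumed minimum versions for Arrow-backed pandas columns; verify before relying on them.
MIN_PANDAS = (1, 5)
MIN_PYARROW = (8, 0)


def parse_version(text):
    """Turn '2.1.4' or '1.5.0rc1' into a tuple of the leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def can_use_arrow_extension(pandas_version, pyarrow_version):
    """True when both libraries meet the assumed minimum versions."""
    return (parse_version(pandas_version) >= MIN_PANDAS
            and parse_version(pyarrow_version) >= MIN_PYARROW)


def polars_to_pandas(pl_frame, prefer_arrow=False):
    """Convert a polars frame, using Arrow-backed columns only when requested and supported."""
    import pandas
    import pyarrow

    use_arrow = prefer_arrow and can_use_arrow_extension(
        pandas.__version__, pyarrow.__version__)
    return pl_frame.to_pandas(use_pyarrow_extension_array=use_arrow)
```

With `prefer_arrow=False` as the default, users on older pandas or pyarrow keep the current behaviour, and only those who ask for the Arrow-backed result depend on the newer versions.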
I tried using pyarrow all the way, it was not able to handle NaN when used with Python 3.6. However, it does not have a problem with Python 3.9 or above.
h2o-py/h2o/frame.py
@@ -1932,7 +1932,7 @@ def structure(self):
             else:
                 print("num {}".format(" ".join(it[0] if it else "nan" for it in h2o.as_list(self[:10, i], False)[1:])))
 
-    def as_data_frame(self, use_pandas=True, header=True, multi_thread=False):
+    def as_data_frame(self, use_pandas=True, header=True, multi_thread=False, module='datatable'):
I find the API even more confusing now (well, it already was). Why not simply have something like:
def as_data_frame(self, format='pandas', header=True):
    """
    :param string format: the format of the data frame, can be one of:
        - list: the data frame is represented as a list of lists (native Python).
        - pandas: returns a pandas data frame (default)
        - datatable: returns a datatable frame (not supported for Py 3.10+)
        - polars: ...
        - pyarrow: ...
      Note that if speed is a concern (large data), it is recommended
      to use one of `polars` (recommended), `pyarrow` or `datatable`
      if it is already installed in your environment, the resulting dataframe
      can then easily be converted to `pandas` by calling `to_pandas()` on it.
    """
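To illustrate why a single format parameter could simplify the implementation, here is a rough sketch of a dispatcher over an already exported CSV file. The function name `read_exported_csv` and the `result_format` parameter are hypothetical and not part of h2o; each third-party library is imported lazily so that only the requested one needs to be installed:

```python
import csv


def read_exported_csv(path, result_format="pandas"):
    """Load a CSV file into the requested representation (hypothetical helper)."""
    if result_format == "list":
        # native Python: a list of rows, each row a list of strings
        with open(path, newline="") as handle:
            return [row for row in csv.reader(handle)]
    if result_format == "pandas":
        import pandas
        return pandas.read_csv(path)
    if result_format == "polars":
        import polars
        return polars.read_csv(path)
    if result_format == "pyarrow":
        import pyarrow.csv
        return pyarrow.csv.read_csv(path)
    if result_format == "datatable":
        import datatable
        return datatable.fread(path)
    raise ValueError("unsupported result_format: {!r}".format(result_format))
```

A caller who wants speed but still needs pandas would then request `polars`, `pyarrow` or `datatable` and call `to_pandas()` on the result, as the proposed docstring suggests.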
Note that the `@deprecated_params` annotation would allow converting the `use_pandas` param to either `result_format='pandas'` or `result_format='list'`, and the `multi_thread` param to one of the preferred other formats.
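The idea of mapping deprecated keywords onto a new one can be sketched with a small decorator. This is not h2o's `@deprecated_params` implementation: the decorator name `deprecated_params_sketch`, its signature, and the reduced `as_data_frame` below are assumptions made for illustration only:

```python
import functools
import warnings


def deprecated_params_sketch(converters):
    """converters maps an old keyword name to a function: old value -> dict of new keywords."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for old_name, convert in converters.items():
                if old_name in kwargs:
                    warnings.warn("'{}' is deprecated".format(old_name),
                                  DeprecationWarning, stacklevel=2)
                    converted = convert(kwargs.pop(old_name))
                    for new_name, new_value in converted.items():
                        # an explicitly passed new keyword wins over the converted one
                        kwargs.setdefault(new_name, new_value)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated_params_sketch({
    "use_pandas": lambda value: {"result_format": "pandas" if value else "list"},
})
def as_data_frame(result_format="pandas", header=True):
    # reduced to returning the chosen format so the mapping is easy to see
    return result_format
```

Existing calls such as `as_data_frame(use_pandas=False)` keep working and emit a `DeprecationWarning`, while new code only needs `result_format`.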
it would also greatly simplify implementation
as_data_frame is only used to convert an h2oFrame to a pandas dataframe. The issue here is that the user wants the conversion to be faster by using multi-thread. The final result is always a pandas dataframe. I am just using either datatable or polars/pyarrow to get the multi-thread mode. If the user just calls as_data_frame without anything special, they get the original single-thread conversion. Only the people in the know will know about multi-thread and the different python toolboxes.
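The selection logic described here can be summarised in a few lines. The sketch below is a simplified reading of the diff and of this comment, not the code in the PR: `choose_backend` and `is_installed` are hypothetical names, and the final fallback to the single-thread path when no suitable library is installed is an assumption.

```python
import importlib.util


def is_installed(module_name):
    """True when the top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def choose_backend(multi_thread=False, module="datatable", installed=is_installed):
    """Return which conversion path would be used for the given options."""
    if not multi_thread:
        return "single_thread"  # original behaviour when nothing special is requested
    if installed("datatable") and module != "polars":
        return "datatable"  # default multi-thread path unless polars is requested
    if installed("polars") and installed("pyarrow"):
        return "polars"  # polars needs pyarrow for its to_pandas conversion
    return "single_thread"  # assumed fallback when no suitable library is available
```

Passing the `installed` check as a parameter keeps the decision testable without modifying the environment.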
> The final result is always a pandas dataframe.

Well, not really: we often use it to convert the h2o frame to Python lists as well.

> Only the people in the know will know about multi-thread and the different python toolboxes.

Not if the documentation mentions that it's preferable to use `polars` for top performance (and then convert to pandas if needed).
With your current changes, you already ask users to set the parameter `multi_thread=True` when they just want something faster; they don't care whether it uses one or 10 threads. On top of that, you ask them to decide on the `module` that will be used as an intermediate to generate this dataframe: it's way too complicated.
My goal is to simplify this, and at the same time provide a more powerful tool that would also convert directly to `polars` or `pyarrow` if the user needs to.
Seb:
The work I am doing here is only to convert an h2o frame to a pandas frame, for a customer. The only thing the user needs to set is multi_thread=True if they want multi-thread. I want them to use datatable as the default and, if they insist, they can choose polars. Pyarrow does not convert NaN correctly for Python versions < 3.9, hence I am not going to fix that now.
The interface is complicated, but it is necessary to achieve what the customer is looking for.
You are right that the default output is not always a pandas frame. However, the work in the issue is only about transforming an h2o frame into a pandas frame.
The use_pandas is a confusing parameter. The result_format is much better.
I deprecated the use_pandas parameter in favor of pandas_frame, which denotes that the output is a pandas frame when pandas_frame is true. I don't know what else to simplify. Please suggest what you have in mind. Thanks.
1beb33a
to
39faaf3
Compare
provided suggestions to avoid modifying the test environment during the tests
h2o-py/tests/testdir_misc/pyunit_gh_15729_15936_datatable_polars_2_pandas_large.py
GH-15936: fixed all tests problems with python 3.11.
…tions. Move can_install_module to test environment per suggestions from Seb.
aefca7c
to
65fdc84
Compare
h2o-py/tests/testdir_misc/pyunit_gh_15729_15936_datatable_polars_2_pandas.py
h2o-py/tests/testdir_misc/pyunit_gh_15729_15936_datatable_polars_2_pandas_large.py
h2o-py/tests/testdir_misc/pyunit_gh_15936_polars2pandas_error.py
…ments.txt per seb suggestions
@wendycwong thanks for the changes.
Please still specify the versions in test-requirements.txt
thank you @wendycwong, and sorry for the long back and forth: I seem to have difficulties expressing myself clearly these days :)
Btw, as test-requirements.txt had to be changed, I'm wondering if we don't need to build fresh images…
thank you @wendycwong
This PR fixed the issue here: #15936
Several problems here: