-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-15887: allow Py H2OFrame constructor to accept an existing H2OFrame #15898
Conversation
column_names, column_types, na_strings, skipped_columns, force_col_types) | ||
if isinstance(python_obj, H2OFrame): | ||
sc = h2o.h2o.shallow_copy(python_obj, destination_frame) | ||
if skipped_columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sebhrusen : I am confused here. The user want to convert a panda frame to H2O frame. If the frame is already H2O frame, should it be easier just to return it? I was thinking this is just a simple check for as_data_frame method....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the user is using a constructor like H2OFrame
, I don't think it's a good idea to return the exact same frame. That's why I return a shallow copy instead: it's cheap (no data copy) but it's not the same object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You worried about user changing the original frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You worried about user changing the original frame.
it's one concern, but not the major one.
I'm more concerned about user expectations: H2OFrame(…)
is a constructor, not a factory, and therefore I think it should always return a fresh instance.
It could be implemented as a factory but it's not:
- if you call
H2OFrame([1, 2, 3])
twice, you get 2 different objects. - if you call
H2OFrame(h2o_fr, skipped_columns=['foo', 'bar'])
, you expect a filtered version of the given frame and it will be a new object. - if you call
H2OFrame(h2o_fr, skipped_columns=['do_not_exist'])
, you expect the same as above, but as the columns don't exist, you will get a new copy of the given frame. - therefore, if you just call
H2OFrame(h2o_fr)
it's more natural to still return a new instance, and therefore a copy (shallow because there's no reason to deep clone here). If we were returning the given frame directly here, it would be an unnecessary special case that may confuse the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused here. The user want to convert a panda frame to H2O frame
well, that's what he wants to do when passing a panda frame, but who knows what he wants to do when passing a H2OFrame? It's an edge case.
Maybe to convince you, here is how pandas itself behaves:
In [24]: import pandas as pd
In [25]: df = pd.DataFrame([1, 2, 3])
In [26]: id(df)
Out[26]: 5090919696
In [27]: df2 = pd.DataFrame([1, 2, 3])
In [28]: id(df2)
Out[28]: 5061990032
In [29]: df3 = pd.DataFrame(df)
In [30]: id(df3)
Out[30]: 5043948112
you see: df3 != df
ea78c28
to
9ba3073
Compare
h2o-py/tests/testdir_apis/Data_Manipulation/pyunit_h2oH2OFrame.py
Outdated
Show resolved
Hide resolved
@@ -162,43 +162,6 @@ def _gen_header(cols): | |||
return ["C" + str(c) for c in range(1, cols + 1, 1)] | |||
|
|||
|
|||
def _check_lists_of_lists(python_obj): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just moved those functions with import business logic to frame.py
where it should be.
shared_utils
should ideally only contain simple utility functions can be used anywhere, nothing too specific to a business module, otherwise it's messy.
@tomasfryda : I have no more questions regarding this PR. Please approve after you finish your review. Thanks, W |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thank you @sebhrusen !
Please don't forget to update the issue numbers in comments (GH-#####
).
@tomasfryda thanks for the review. I created and referenced the issues discovered when adding tests for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thank you!
adresses issue #15887
when called with an existing h2o frame, the Py
H2OFrame
constructor now produces a shallow copy of its argument: no need to reupload data, no need to copy data themselves in the backend, but the new frame obtained is still different from the original and can be modified independently.Note
More an inconsistency in the existing behaviour of
H2OFrame
than a real bug, but still labelled asbug