Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-15887: allow Py H2OFrame constructor to accept an existing H2OFrame #15898

Merged
merged 5 commits into from
Nov 22, 2023

Conversation

sebhrusen
Copy link
Contributor

@sebhrusen sebhrusen commented Nov 1, 2023

adresses issue #15887

when called with an existing h2o frame, the Py H2OFrame constructor now produces a shallow copy of its argument: no need to reupload data, no need to copy data themselves in the backend, but the new frame obtained is still different from the original and can be modified independently.

Note
More an inconsistency in the existing behaviour of H2OFrame than a real bug, but still labelled as bug

@sebhrusen sebhrusen added the bug label Nov 1, 2023
@sebhrusen sebhrusen added this to the 3.44.0.2 milestone Nov 1, 2023
@sebhrusen sebhrusen linked an issue Nov 1, 2023 that may be closed by this pull request
@sebhrusen sebhrusen changed the base branch from master to rel-3.44.0 November 1, 2023 21:11
@sebhrusen sebhrusen modified the milestones: 3.44.0.2, 3.44.0.3 Nov 3, 2023
@sebhrusen sebhrusen self-assigned this Nov 3, 2023
column_names, column_types, na_strings, skipped_columns, force_col_types)
if isinstance(python_obj, H2OFrame):
sc = h2o.h2o.shallow_copy(python_obj, destination_frame)
if skipped_columns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sebhrusen : I am confused here. The user want to convert a panda frame to H2O frame. If the frame is already H2O frame, should it be easier just to return it? I was thinking this is just a simple check for as_data_frame method....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the user is using a constructor like H2OFrame, I don't think it's a good idea to return the exact same frame. That's why I return a shallow copy instead: it's cheap (no data copy) but it's not the same object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You worried about user changing the original frame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You worried about user changing the original frame.

it's one concern, but not the major one.
I'm more concerned about user expectations: H2OFrame(…) is a constructor, not a factory, and therefore I think it should always return a fresh instance.
It could be implemented as a factory but it's not:

  • if you call H2OFrame([1, 2, 3]) twice, you get 2 different objects.
  • if you call H2OFrame(h2o_fr, skipped_columns=['foo', 'bar']), you expect a filtered version of the given frame and it will be a new object.
  • if you call H2OFrame(h2o_fr, skipped_columns=['do_not_exist']), you expect the same as above, but as the columns don't exist, you will get a new copy of the given frame.
  • therefore, if you just call H2OFrame(h2o_fr) it's more natural to still return a new instance, and therefore a copy (shallow because there's no reason to deep clone here). If we were returning the given frame directly here, it would be an unnecessary special case that may confuse the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wendycwong

I am confused here. The user want to convert a panda frame to H2O frame

well, that's what he wants to do when passing a panda frame, but who knows what he wants to do when passing a H2OFrame? It's an edge case.

Maybe to convince you, here is how pandas itself behaves:

In [24]: import pandas as pd

In [25]: df = pd.DataFrame([1, 2, 3])

In [26]: id(df)
Out[26]: 5090919696

In [27]: df2 = pd.DataFrame([1, 2, 3])

In [28]: id(df2)
Out[28]: 5061990032

In [29]: df3 = pd.DataFrame(df)

In [30]: id(df3)
Out[30]: 5043948112

you see: df3 != df

wendycwong
wendycwong previously approved these changes Nov 13, 2023
@@ -162,43 +162,6 @@ def _gen_header(cols):
return ["C" + str(c) for c in range(1, cols + 1, 1)]


def _check_lists_of_lists(python_obj):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just moved those functions with import business logic to frame.py where it should be.
shared_utils should ideally only contain simple utility functions can be used anywhere, nothing too specific to a business module, otherwise it's messy.

@wendycwong
Copy link
Contributor

@tomasfryda : I have no more questions regarding this PR. Please approve after you finish your review. Thanks, W

tomasfryda
tomasfryda previously approved these changes Nov 22, 2023
Copy link
Contributor

@tomasfryda tomasfryda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thank you @sebhrusen !

Please don't forget to update the issue numbers in comments (GH-#####).

@sebhrusen
Copy link
Contributor Author

@tomasfryda thanks for the review. I created and referenced the issues discovered when adding tests for H2OFrame, feel free to update your review.

Copy link
Contributor

@tomasfryda tomasfryda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you!

@sebhrusen sebhrusen changed the title GH-25887: allow Py H2OFrame constructor to accept an existing H2OFrame GH-15887: allow Py H2OFrame constructor to accept an existing H2OFrame Nov 22, 2023
@sebhrusen sebhrusen merged commit bd84581 into rel-3.44.0 Nov 22, 2023
2 checks passed
@sebhrusen sebhrusen deleted the seb/gh-15887 branch November 22, 2023 21:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Accept H2OFrame as input to H2OFrame itself
3 participants