-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-15887: allow Py H2OFrame constructor to accept an existing H2OFrame #15898
Merged
Merged
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
6afc6d8
generates a shallow copy when Py H2OFrame constructor is called with …
sebhrusen 77bf637
add support for columns selection params
sebhrusen 9ba3073
added proper tests for H2OFrame constructor and discovered a bunch of…
sebhrusen e050266
cosmetics
sebhrusen 7cb5063
added references to new issues discovered when writing H2OFrame tests
sebhrusen File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -162,43 +162,6 @@ def _gen_header(cols): | |
return ["C" + str(c) for c in range(1, cols + 1, 1)] | ||
|
||
|
||
def _check_lists_of_lists(python_obj): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. just moved those functions with import business logic to |
||
# check we have a lists of flat lists | ||
# returns longest length of sublist | ||
most_cols = 1 | ||
for l in python_obj: | ||
# All items in the list must be a list! | ||
if not isinstance(l, (tuple, list)): | ||
raise ValueError("`python_obj` is a mixture of nested lists and other types.") | ||
most_cols = max(most_cols, len(l)) | ||
for ll in l: | ||
# in fact, we must have a list of flat lists! | ||
if isinstance(ll, (tuple, list)): | ||
raise ValueError("`python_obj` is not a list of flat lists!") | ||
return most_cols | ||
|
||
|
||
def _handle_python_lists(python_obj, check_header): | ||
# convert all inputs to lol | ||
if _is_list_of_lists(python_obj): # do we have a list of lists: [[...], ..., [...]] ? | ||
ncols = _check_lists_of_lists(python_obj) # must be a list of flat lists, raise ValueError if not | ||
elif isinstance(python_obj, (list, tuple)): # single list | ||
ncols = 1 | ||
python_obj = [[e] for e in python_obj] | ||
else: # scalar | ||
python_obj = [[python_obj]] | ||
ncols = 1 | ||
# create the header | ||
if check_header == 1: | ||
header = python_obj[0] | ||
python_obj = python_obj[1:] | ||
else: | ||
header = _gen_header(ncols) | ||
# shape up the data for csv.DictWriter | ||
# data_to_write = [dict(list(zip(header, row))) for row in python_obj] | ||
return header, python_obj | ||
|
||
|
||
def stringify_dict(d): | ||
return stringify_list(["{'key': %s, 'value': %s}" % (_quoted(k), v) for k, v in d.items()]) | ||
|
||
|
@@ -244,38 +207,6 @@ def _is_num_list(l): | |
def _is_list_of_lists(o): | ||
return any(isinstance(l, (tuple, list)) for l in o) | ||
|
||
|
||
def _handle_numpy_array(python_obj, header): | ||
return _handle_python_lists(python_obj.tolist(), header) | ||
|
||
|
||
def _handle_pandas_data_frame(python_obj, header): | ||
data = _handle_python_lists(python_obj.values.tolist(), -1)[1] | ||
return list(str(c) for c in python_obj.columns), data | ||
|
||
|
||
def _handle_python_dicts(python_obj, check_header): | ||
header = list(python_obj.keys()) if python_obj else _gen_header(1) | ||
is_valid = all(re.match(r"^[a-zA-Z_][a-zA-Z0-9_.]*$", col) for col in header) # is this a valid header? | ||
if not is_valid: | ||
raise ValueError( | ||
"Did not get a valid set of column names! Must match the regular expression: ^[a-zA-Z_][a-zA-Z0-9_.]*$ ") | ||
for k in python_obj: # check that each value entry is a flat list/tuple or single int, float, or string | ||
v = python_obj[k] | ||
if isinstance(v, (tuple, list)): # if value is a tuple/list, then it must be flat | ||
if _is_list_of_lists(v): | ||
raise ValueError("Values in the dictionary must be flattened!") | ||
elif is_type(v, str, numeric): | ||
python_obj[k] = [v] | ||
else: | ||
raise ValueError("Encountered invalid dictionary value when constructing H2OFrame. Got: {0}".format(v)) | ||
|
||
zipper = getattr(itertools, "zip_longest", None) or getattr(itertools, "izip_longest", None) or zip | ||
rows = list(map(list, zipper(*list(python_obj.values())))) | ||
data_to_write = [dict(list(zip(header, row))) for row in rows] | ||
return header, data_to_write | ||
|
||
|
||
def _is_fr(o): | ||
return o.__class__.__name__ == "H2OFrame" # hack to avoid circular imports | ||
|
||
|
@@ -426,14 +357,9 @@ def slice_is_normalized(s): | |
quoted = _quoted | ||
is_list = _is_list | ||
is_fr = _is_fr | ||
handle_python_dicts = _handle_python_dicts | ||
handle_pandas_data_frame = _handle_pandas_data_frame | ||
handle_numpy_array = _handle_numpy_array | ||
is_list_of_lists = _is_list_of_lists | ||
is_num_list = _is_num_list | ||
is_str_list = _is_str_list | ||
handle_python_lists = _handle_python_lists | ||
check_lists_of_lists = _check_lists_of_lists | ||
|
||
gen_model_file_name = "h2o-genmodel.jar" | ||
h2o_predictor_class = "hex.genmodel.tools.PredictCsv" | ||
|
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sebhrusen : I am confused here. The user want to convert a panda frame to H2O frame. If the frame is already H2O frame, should it be easier just to return it? I was thinking this is just a simple check for as_data_frame method....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the user is using a constructor like
H2OFrame
, I don't think it's a good idea to return the exact same frame. That's why I return a shallow copy instead: it's cheap (no data copy) but it's not the same object.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You worried about user changing the original frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's one concern, but not the major one.
I'm more concerned about user expectations:
H2OFrame(…)
is a constructor, not a factory, and therefore I think it should always return a fresh instance.It could be implemented as a factory but it's not:
H2OFrame([1, 2, 3])
twice, you get 2 different objects.H2OFrame(h2o_fr, skipped_columns=['foo', 'bar'])
, you expect a filtered version of the given frame and it will be a new object.H2OFrame(h2o_fr, skipped_columns=['do_not_exist'])
, you expect the same as above, but as the columns don't exist, you will get a new copy of the given frame.H2OFrame(h2o_fr)
it's more natural to still return a new instance, and therefore a copy (shallow because there's no reason to deep clone here). If we were returning the given frame directly here, it would be an unnecessary special case that may confuse the user.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wendycwong
well, that's what he wants to do when passing a panda frame, but who knows what he wants to do when passing a H2OFrame? It's an edge case.
Maybe to convince you, here is how pandas itself behaves:
you see:
df3 != df