Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-15677 Fix Match function bug #15896

Merged
merged 16 commits into from
Dec 13, 2023
Merged

Conversation

maurever
Copy link
Contributor

@maurever maurever commented Nov 1, 2023

Issue: #15677

TODO:

  • Do not add new parameter indexes, fix the bug in our implementation according to what R::match does. Using the iris dataset and calling match with c(“setosa”, “versicolor”) should generate a new column with value 1 assigned to rows with “setosa” and 2 assigned to rows with “versicolor”). The answer should be the same with all clients. This implies that R and Python should generate the same column values.
  • The default for nomatch should be NaN for both R and Python. In addition, only allow NaN or other numerical values for nomatch values. No string is supported.
  • Fix the documentation to reflect the changes: nomatch is defaulted to NaN and can only be numerical values. Remove the last sentences for incomparables.
  • Add parameter start_index with default value = 1, if a user wants to change indexing from 0 for example
  • Fix %in% to works as in base library but in different PR GH-15677 Fix %in% function #15929
  • Add better example in doc

@maurever maurever self-assigned this Nov 1, 2023
@maurever maurever marked this pull request as draft November 1, 2023 14:01
@maurever maurever changed the title GH-15677 Fix Match function bug POC GH-15677 Fix Match function bug Nov 1, 2023
@maurever maurever added this to the 3.46.0.1 milestone Nov 1, 2023
Copy link
Contributor

@sebhrusen sebhrusen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @maurever.
Added some suggestions for doc + concern about sorted numerical values.

h2o-py/h2o/frame.py Outdated Show resolved Hide resolved
h2o-r/h2o-package/R/frame.R Outdated Show resolved Hide resolved
h2o-r/h2o-package/R/frame.R Outdated Show resolved Hide resolved
@maurever maurever requested a review from sebhrusen November 17, 2023 14:17
@maurever maurever marked this pull request as ready for review November 17, 2023 14:24
@maurever
Copy link
Contributor Author

maurever commented Nov 17, 2023

@hannah-tillman Could you check the documentation of this feature, please? Thanks!

hannah-tillman
hannah-tillman previously approved these changes Nov 17, 2023
Copy link
Contributor

@hannah-tillman hannah-tillman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made a few docs updates -- good on my end :)

@maurever maurever requested a review from sebhrusen November 22, 2023 08:49
Copy link
Contributor

@sebhrusen sebhrusen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good, thanks @maurever
Apparently, you still need to rebase to resolve conflicts, so I will approve next time.

h2o-py/h2o/frame.py Outdated Show resolved Hide resolved
@maurever maurever force-pushed the maurever_GH-15677_fix_match_bug branch from e6ed78c to 4d30c38 Compare December 1, 2023 12:33
@maurever
Copy link
Contributor Author

maurever commented Dec 1, 2023

Thanks, @sebhrusen; the PR is updated based on your suggestion and rebased.

@maurever maurever requested a review from sebhrusen December 1, 2023 12:36
wendycwong
wendycwong previously approved these changes Dec 1, 2023
Copy link
Contributor

@sebhrusen sebhrusen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested the Py doc generation with the result rendered in the example, it works nicely!
Also updated the suggestion to use real data from the example.

Maybe same could be done in R.

h2o-py/h2o/frame.py Outdated Show resolved Hide resolved
@maurever
Copy link
Contributor Author

maurever commented Dec 5, 2023

@hannah-tillman, please check that the documentation is generated correctly. @sebhrusen also suggested adding output into the R example, but I think it will not work in the R. If you have any idea how to improve the R example to be similar to the Python example, please let me know (or feel free to commit this change here). Thanks!

sebhrusen
sebhrusen previously approved these changes Dec 5, 2023
Copy link
Contributor

@sebhrusen sebhrusen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @maurever

I took the initiative to commit a small fix in the Py example.
Regarding the R example, looking at R doc, I don't think it's possible to render the results the same way as we can in Py, common practice seems to be using a comment, like:

sample ##--> describe shortly what is expected when this is printed

for example, in this case

sample ##--> `match` column should be made of `1` for `setosa`, `2` for `versicolor` and `NaN` for `virginica`

@hannah-tillman
Copy link
Contributor

@maurever The python docs build fine 👍

For the R documentation, I went through the current R guide to find an example of something adding the output because I didn’t really remember that happening. What I was able to find was a rare case of output being added as a comment, so we could do the same for this? Would look like this:

#' \dontrun{
#' h2o.init()
#' data <- as.h2o(iris)
#' match_col <- h2o.match(data$Species, c("setosa", "versicolor", "setosa"))
#' iris_match <- h2o.cbind(data, match_col)
#' sample <- h2o.splitFrame(iris_match, ratios=0.05, seed=1)[1]
#' sample
#' # [[1]]
#' #   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species C1
#' # 1          5.2         3.5          1.5         0.2     setosa  1
#' # 2          5.0         3.5          1.3         0.3     setosa  1
#' # 3          7.0         3.2          4.7         1.4 versicolor  1
#' # 4          4.9         2.4          3.3         1.0 versicolor  1
#' # 5          5.5         2.4          3.8         1.1 versicolor  1
#' # 6          5.8         2.7          5.1         1.9  virginica  0
#' #
#' # [12 rows x 6 columns] 
#' }

This is the only example I was able to find for output in the R docs, though, so it is currently the only idea i have. Let me know if this is what you want.

@maurever
Copy link
Contributor Author

Thanks, @seb and @hannah-tillman. The PR is now ready for review.

@maurever maurever requested a review from sebhrusen December 11, 2023 10:04
Copy link
Contributor

@sebhrusen sebhrusen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG, thank you @maurever !

@maurever maurever merged commit 528edf3 into master Dec 13, 2023
63 of 68 checks passed
@maurever maurever deleted the maurever_GH-15677_fix_match_bug branch December 13, 2023 12:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants