initial thoughts on h2o integration #20


@topepo topepo commented Sep 19, 2021

@ledell, @juliasilge, @hfrick, and @DavisVaughan for a discussion this week.

Comment on lines +81 to +83
* I don't think that there is a way to convert an `H2OGrid` to a data frame. Using `as.data.frame(gbm_grid1@summary_table)` is close, but every column is character.

* I don't know how to get the holdout predictions either.
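
One possible workaround for the all-character columns, sketched in base R only. The data frame below is a made-up stand-in for `as.data.frame(gbm_grid1@summary_table)` (its column names and values are assumptions for illustration, not real output), so the example runs without an H2O cluster. Whether the real summary table behaves the same way may depend on the h2o version.

```r
# Mock stand-in for as.data.frame(gbm_grid1@summary_table): every column is
# character, as described above. Names and values are made up for illustration.
grid_chr <- data.frame(
  learn_rate = c("0.1", "0.01"),
  max_depth  = c("5", "3"),
  model_ids  = c("gbm_grid1_model_1", "gbm_grid1_model_2"),
  logloss    = c("0.55", "0.61"),
  stringsAsFactors = FALSE
)

# Re-type each column. Numeric-looking strings become integer or double;
# anything else stays character because as.is = TRUE prevents factors.
grid_df <- grid_chr
grid_df[] <- lapply(grid_df, type.convert, as.is = TRUE)

str(grid_df)
```

Assigning into `grid_df[]` keeps the data frame structure and column names while replacing each column with its converted version.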

It seems like once you extract the "model" objects from the grid, you can do a lot more. The grid is just a lightweight summary object?

library(h2o)

h2o.init()

# Import a sample binary outcome dataset into H2O
data <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(data), y)

# For binary classification, response should be a factor
data[, y] <- as.factor(data[, y])
test[, y] <- as.factor(test[, y])

# Split data into train & validation
ss <- h2o.splitFrame(data, seed = 1)
train <- ss[[1]]
valid <- ss[[2]]

# GBM hyperparameters
gbm_params1 <- list(learn_rate = c(0.01, 0.1),
                    max_depth = c(3, 5, 9),
                    sample_rate = c(0.8, 1.0),
                    col_sample_rate = c(0.2, 0.5, 1.0))

# Train and validate a cartesian grid of GBMs
gbm_grid1 <- h2o.grid("gbm", x = x, y = y,
                      grid_id = "gbm_grid1",
                      training_frame = train,
                      validation_frame = valid,
                      ntrees = 100,
                      seed = 1,
                      hyper_params = gbm_params1)

model_ids <- gbm_grid1@model_ids

# get the model objects
models <- lapply(model_ids, h2o.getModel)

# make holdout predictions on the assessment data (very noisy)
predictions <- lapply(models, h2o.predict, newdata = valid)
#> <SUPER NOISY HERE>
predictions[[1]]
#>   predict         p0        p1
#> 1       1 0.13336856 0.8666314
#> 2       1 0.11695254 0.8830475
#> 3       1 0.06611336 0.9338866
#> 4       1 0.07556947 0.9244305
#> 5       1 0.09630038 0.9036996
#> 6       0 0.79153668 0.2084633
#> 
#> [2489 rows x 3 columns]

# compute all the performance metrics on test data
performance <- h2o.performance(models[[1]], newdata = test)
h2o.auc(performance)
#> [1] 0.7813055
performance@metrics$AUC
#> [1] 0.7813055

Created on 2021-09-21 by the reprex package (v2.0.0.9000)

@DavisVaughan Yeah the grid is a summary table, and you can grab the models like you're doing above.

h2o/README.Rmd

There are some differences between tidymodels grid tuning and h2o:

* `h2o.grid()` seems to work on one resample at a time.

Not sure if this is helpful to note, but we have a parallelism arg for h2o.grid() which allows for multiple models to be trained in parallel. You specify an integer for how many models to train at once, e.g. parallelism = 5. So in theory it can work with multiple resamples at a time.

Ok, great. We can link that to the tidymodels control options for parallelism.

* The indices for the modeling and holdout data
* The grid values
* Seeds
* Data details (the data frame and x/y names)

@ledell ledell Jan 14, 2022

Is the goal here to apply the same (whole) grid to multiple resample train/valid pairs? Or do you need the more fine-grained control of assigning a specific resample train/valid set to a specific grid combo (e.g. learn_rate = 0.1, max_depth = 5, sample_rate = 0.8, col_sample_rate = 1.0)?

If the former, then we can already support that with the current API, I think. You'd just do something like:

gbm_grid1 <- h2o.grid("gbm", x = x, y = y, 
                      grid_id = "gbm_grid1", 
                      training_frame = data[train_idx_resample_1, ], 
                      validation_frame = data[valid_idx_resample_1, ], 
                      ntrees = 100, seed = 1, 
                      hyper_params = gbm_params1)
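
To extend the call above from one resample to many, something like the following might work. It is only a sketch: `make_resample_indices()` and `fit_grid_on_resample()` are hypothetical helpers, not part of h2o or tidymodels, and the simple v-fold split is an assumption standing in for whatever resampling scheme is actually used. The wrapper is defined but not run here, since it needs the h2o package, a running cluster, and an `H2OFrame`.

```r
# Hypothetical helper: split row numbers 1..n into v folds and return, for
# each fold, the modeling (train) and holdout (valid) row indices.
make_resample_indices <- function(n, v = 5, seed = 1) {
  set.seed(seed)
  fold_id <- sample(rep(seq_len(v), length.out = n))
  lapply(seq_len(v), function(i) {
    list(train = which(fold_id != i), valid = which(fold_id == i))
  })
}

# Hypothetical wrapper around the h2o.grid() call shown above. Each resample
# gets its own grid_id so the grids do not collide on the cluster.
fit_grid_on_resample <- function(i, idx, data, x, y, hyper_params) {
  h2o::h2o.grid("gbm", x = x, y = y,
                grid_id = paste0("gbm_grid_resample_", i),
                training_frame = data[idx$train, ],
                validation_frame = data[idx$valid, ],
                ntrees = 100, seed = 1,
                hyper_params = hyper_params)
}

# Index pairs for a toy data set of 100 rows and 5 folds.
resamples <- make_resample_indices(n = 100, v = 5)

# With H2O running, the whole grid could then be applied to every resample:
# grids <- lapply(seq_along(resamples), function(i) {
#   fit_grid_on_resample(i, resamples[[i]], data, x, y, gbm_params1)
# })
```

Keeping the indices separate from the fitting step means the same index pairs could be reused for every model specification.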

Yes, applying it to everything would be the plan.

@ledell ledell left a comment

@topepo @DavisVaughan Added a few more comments to the README (with some questions).
