# initial thoughts on h2o integration #20
---
title: "tidymodels and h2o integration"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Grid search differences

There are some differences between tidymodels grid tuning and h2o:

* `h2o.grid()` seems to work on one resample at a time.
* General grids (i.e., not Cartesian products) are not feasible.
* h2o has a more general methodology for processing the grid: tidymodels computes every grid point, while h2o can set up time-based stopping criteria (and, I think, others).
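To make the first two points concrete, here is a sketch of the contrast (both `dials::grid_latin_hypercube()` and `h2o.grid()`'s `hyper_params` argument are real; the specific parameter values are illustrative):

```r
# tidymodels can generate non-regular (e.g., space-filling) grids:
library(dials)
non_regular <- grid_latin_hypercube(
  trees(), learn_rate(),
  size = 10
)  # 10 arbitrary points in the space, not a crossing

# h2o.grid() takes per-parameter value lists and crosses them:
hyper_params <- list(
  ntrees     = c(50, 100, 500),
  learn_rate = c(0.01, 0.1)
)  # => a 3 x 2 Cartesian grid of 6 models
```

A non-regular grid like the one above has no direct representation as a `hyper_params` list, which is why general grids are the sticking point.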
There are some other potentially missing pieces described below.

## Interactions between tidymodels and h2o

Non-mutually exclusive approaches, ordered from least complex to most.

### parsnip integration

This would be for one-off model fits. The code would look like:
```{r eval = FALSE}
boost_fit <-
  boost_tree(trees = 12, learn_rate = 0.1) %>%
  set_mode("regression") %>%
  set_engine("h2o", model_id = "gbm fit") %>% # <- extra options here
  fit(mpg ~ ., data = mtcars)
```
This would create an h2o data frame, run the model, and return the object with the references it needs to make predictions.

We can write `tidy()` and serialization functions for saving the model. For simple models, we might be able to extract the coefficients and write predict methods that just do basic matrix algebra.
This may already be implemented in [`h2oparsnip`](https://github.com/stevenpawley/h2oparsnip), but I have not looked at the code in depth.
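The engine hook-up would go through parsnip's developer API. A minimal sketch (`set_model_engine()`, `set_dependency()`, and `set_fit()` are real parsnip registration functions; `h2o_gbm_wrapper()` is a hypothetical fitting wrapper we would have to write):

```r
library(parsnip)

# register the engine and its package dependency
set_model_engine("boost_tree", mode = "regression", eng = "h2o")
set_dependency("boost_tree", eng = "h2o", pkg = "h2o")

# tell parsnip how to fit: call a (hypothetical) wrapper that would
# convert the data with as.h2o() and dispatch to h2o's GBM
set_fit(
  model = "boost_tree",
  eng = "h2o",
  mode = "regression",
  value = list(
    interface = "formula",
    protect = c("formula", "data"),
    func = c(fun = "h2o_gbm_wrapper"),
    defaults = list()
  )
)
```

Prediction modules would be registered the same way via `set_pred()`.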
### Resamples as chunks

In R, have a method for `tune_grid()` that would do everything in R except the model fits. The loop for getting results across resamples would be in R and would look something like:
```{r eval = FALSE}
# rs is an rsample object
for (ind in seq_along(rs$splits)) {
  mod <-
    rs$splits[[ind]] %>%
    analysis() %>%
    as.h2o()
  val <-
    rs$splits[[ind]] %>%
    assessment() %>%
    as.h2o()
  res <-
    h2o.grid("gbm",
             x = x_names,
             y = y_names,
             grid_id = "gbm_grid1",
             training_frame = mod,
             validation_frame = val,
             seed = seed,
             hyper_params = grid_values)
  # plus more options for metrics etc.

  metrics <- foo("See notes below")

  if (control$save_pred) {
    preds <- bar("See notes below")
  }
}
```
We accumulate these in a stage-wise fashion across resamples.

This is somewhat inefficient since we are passing data back and forth for each resample, and it diminishes the efficiency with which h2o can process results.
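One way to cut down on the back-and-forth (a sketch, assuming the full data set fits on the h2o cluster) would be to upload the data once and slice per resample on the cluster side, rather than converting each analysis/assessment set separately:

```r
# upload once; rs is an rsample object built on full_data
full_h2o <- as.h2o(full_data)

for (ind in seq_along(rs$splits)) {
  split <- rs$splits[[ind]]
  mod <- full_h2o[split$in_id, ]                 # analysis rows
  val <- full_h2o[rsample::complement(split), ]  # assessment rows
  # ... h2o.grid() call as above
}
```

This keeps one copy of the data on the cluster but still runs the grid once per resample.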
Notes:

* I don't think that there is a way to convert an `H2OGrid` to a data frame. Using `as.data.frame(gbm_grid1@summary_table)` is close, but everything is character.
* I don't know how to get the holdout predictions either.

**Review comment** (on lines +81 to +83): It seems like once you extract the "model" objects from the grid, you can do a lot more. The grid is just a lightweight summary object?

```r
library(h2o)
h2o.init()

# Import a sample binary outcome dataset into H2O
data <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(data), y)

# For binary classification, response should be a factor
data[, y] <- as.factor(data[, y])
test[, y] <- as.factor(test[, y])

# Split data into train & validation
ss <- h2o.splitFrame(data, seed = 1)
train <- ss[[1]]
valid <- ss[[2]]

# GBM hyperparameters
gbm_params1 <- list(learn_rate = c(0.01, 0.1),
                    max_depth = c(3, 5, 9),
                    sample_rate = c(0.8, 1.0),
                    col_sample_rate = c(0.2, 0.5, 1.0))

# Train and validate a cartesian grid of GBMs
gbm_grid1 <- h2o.grid("gbm", x = x, y = y,
                      grid_id = "gbm_grid1",
                      training_frame = train,
                      validation_frame = valid,
                      ntrees = 100,
                      seed = 1,
                      hyper_params = gbm_params1)

model_ids <- gbm_grid1@model_ids

# get the model objects
models <- lapply(model_ids, h2o.getModel)

# make holdout predictions on the assessment data (very noisy)
predictions <- lapply(models, h2o.predict, newdata = valid)
#> <SUPER NOISY HERE>

predictions[[1]]
#>   predict         p0        p1
#> 1       1 0.13336856 0.8666314
#> 2       1 0.11695254 0.8830475
#> 3       1 0.06611336 0.9338866
#> 4       1 0.07556947 0.9244305
#> 5       1 0.09630038 0.9036996
#> 6       0 0.79153668 0.2084633
#>
#> [2489 rows x 3 columns]

# compute all the performance metrics on test data
performance <- h2o.performance(models[[1]], newdata = test)
h2o.auc(performance)
#> [1] 0.7813055
performance@metrics$AUC
#> [1] 0.7813055
```

Created on 2021-09-21 by the reprex package (v2.0.0.9000)

**Reply:** @DavisVaughan Yeah the grid is a summary table, and you can grab the models like you're doing above.
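For the all-character summary table noted above, base R's `type.convert()` can at least coerce the numeric columns back (a sketch; assumes `gbm_grid1` is an `H2OGrid` as in the reprex):

```r
grid_df <- as.data.frame(gbm_grid1@summary_table)

# re-type each column: character -> numeric/integer where possible
grid_df[] <- lapply(grid_df, type.convert, as.is = TRUE)
str(grid_df)
```

This would not recover the models or predictions, only a usable tuning-results table.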
### Loop within h2o

In this case, we would need an h2o API that we can call to give h2o:

* The indices for the modeling and holdout data
* The grid values
* Seeds
* Data details (the data frame and x/y names)
**Review comment:** Is the goal here to apply the same (whole) grid to multiple resample train/valid pairs? Or do you need the more fine-grained control of assigning a specific resample train/valid set to a specific grid combo (e.g., `learn_rate = 0.1, max_depth = 5, sample_rate = 0.8, col_sample_rate = 1.0`)? If the former, then we can already support that with the current API, I think. You'd just do something like:

```r
gbm_grid1 <- h2o.grid("gbm", x = x, y = y,
                      grid_id = "gbm_grid1",
                      training_frame = data[train_idx_resample_1, ],
                      validation_frame = data[valid_idx_resample_1, ],
                      ntrees = 100, seed = 1,
                      hyper_params = gbm_params1)
```

**Reply:** Yes, applying it to everything would be the plan.
then have h2o return:

* The holdout metrics and predictions for each grid/resample combination.

tidymodels would then fill in the slots for what the tune objects return.
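To make that contract concrete, here is a sketch of the request object tidymodels might assemble for such an API (every field name is hypothetical; none of this exists yet):

```r
# hypothetical request payload for a bulk h2o grid/resample API
payload <- list(
  resamples = lapply(rs$splits, function(split) {
    list(analysis   = split$in_id,                 # modeling row indices
         assessment = rsample::complement(split))  # holdout row indices
  }),
  grid = grid_values,    # data frame of hyperparameter combinations
  seed = 1234,
  data = "full_frame",   # id of an H2OFrame already on the cluster
  x    = x_names,
  y    = y_name
)
```

h2o's response would then be the per-combination holdout metrics (and, optionally, predictions) keyed by resample and grid row.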
**Review comment:** Not sure if this is helpful to note, but we have a `parallelism` arg for `h2o.grid()` which allows for multiple models to be trained in parallel. You specify an integer for how many models to train at once, e.g. `parallelism = 5`. So in theory it can work with multiple resamples at a time.

**Reply:** Ok, great. We can link that to the tidymodels control options for parallelism.