-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ls/standardize compound unit for samples/144 #147
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good start on a challenging problem! I have asked some questions throughout.
Should only be for each unique combo of task ID vars, NOT for every unique combo (now covered by `validate_output_type_ids`)
Double check that we don't ensemble predictions from any sets of models with different sets of values for task id variables that are not in the compound task id set (we should throw an error if this situation arises). Example:
Edit: this has been addressed and unit tested |
Instead of task ID values
The following commits fix a bug I identified while investigating when derived task ids should be included in the compound task id set. When derived task ids were not included in the compound task id set, Addressing this bug also meant that I could remove the extra |
The following is a summary meant to get reviewers oriented with the functionality in this pull request. Please provide your feedback by Wednesday, January 15. ContextThe hubEnsembles package currently only supports the simplest case of ensembling samples using a linear pool, which adheres to the following conditions:
We are interested in expanding to more complex cases, eliminating one condition at a time. Here, we start by eliminating condition 3. Note that we are NOT interested in eliminating conditions 1 and 2 at this time, though there is some built-in flexibility that will make eliminating them easier in the future. High level summary:Adds ability to subset provided samples, implemented by a map over groups defined by the columns making up the compound task id set and sampling the output type ids of the requested number of predictions to return. Ensembling steps (excluding validations):
Validation functions for ensembling samples
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have two minor, non-essential comments. I read through the code carefully, but did not run anything on my own. Basically, I think the changes look good, and the strategy of tackling one assumption at a time also seems reasonable.
R/linear_pool.R
Outdated
#' @param compound_taskid_set `character` vector of the compound task ID variable | ||
#' set. This argument is only relevant for `output_type` `"sample"`. NULL means | ||
#' that samples are from a multivariate joint distribution across all levels of | ||
#' all task id variables, while equality to `task_id_cols` means that the samples | ||
#' are from separate univariate distributions for each individual prediction task. | ||
#' NA means the compound_taskid_set is not relevant for the current modeling task. | ||
#' Defaults to NA. Derived task ids must be included if all of the task ids their | ||
#' values depend on are part of the compound_taskid_set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this parameter definition dense and hard to wade through. I also note that no additions are made in the @details section below, where maybe they should be as all other output types are covered there. So perhaps a briefer definition here (or the same?) and then adding some additional longer-form explanations, maybe with examples and possibly links to the definitions of the concepts of compound task id and derived task id, below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to list the default case first.
Moreover, you can make this a list of the expected inputs and ROxygen will happily parse it as a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken Zhian's suggestion of making the expected inputs a list and reordering the cases, plus added a section in the @details
about how linear_pool
deals with samples. Let me know if this is clearer @nickreich
edit: changes have been made and pushed
R/linear_pool_sample.R
Outdated
# draw the target number of samples from each model for each unique | ||
# combination of compound task ID set variables | ||
split_compound_taskid_set <- model_out_tbl |> | ||
split(f = model_out_tbl[, c("model_id", compound_taskid_set)]) | ||
model_out_tbl <- split_compound_taskid_set |> | ||
purrr::map(.f = function(split_outputs) { | ||
if (nrow(split_outputs) != 0) { | ||
# current_compound_taskid_set has 1 row, where the column | ||
# `target_samples` is the number of samples to draw for this | ||
# combination of model_id and compound task ID set variables | ||
current_compound_taskid_set <- split_outputs |> | ||
dplyr::distinct(dplyr::across(dplyr::all_of(compound_taskid_set)), .keep_all = TRUE) |> | ||
dplyr::left_join(samples_per_combo, by = c("model_id", compound_taskid_set)) | ||
provided_indices <- unique(split_outputs$output_type_id) | ||
|
||
sample_idx <- sample(x = provided_indices, size = current_compound_taskid_set$target_samples, replace = FALSE) | ||
dplyr::filter(split_outputs, .data[["output_type_id"]] %in% sample_idx) | ||
} | ||
}) |> | ||
purrr::list_rbind() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if the goal should be that the code is "readable" but I spent 10 minutes trying to parse this code (without running it) and can't understand what it is doing. Perhaps some more comments would help.
specifically:
- I don't understand what the split_compound_taskid_set object is, even after looking at the base::split() helpfile.
- I don't understand what the nested set of dplyr functions on line 72 is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I don't understand what the split_compound_taskid_set object is, even after looking at the base::split() helpfile.
@nickreich, this is a list of data frames, where all the rows in each data frame belong to samples from a single compound task ID set and model.
I agree with Nick that this could use some workshopping to make sure it's maintainable.
I have a few suggestions:
- A 10+ line anonymous function, which is an indication that we should give it a name and make it standalone. This is also an opportunity for us to give it explicit tests!
- Use the control structure to have an early return instead of controlling when you do stuff:
do_this <- function(thing) { if (nrow(thing) == 0) { return(thing) } # do stuff } not_this <- function(thing) { if (nrow(thing) > 0) { # do stuff } }
- I would rename the
split_compound_taskid_set
andsplit_outputs
tosplit_model_task_tbls
andmodel_task_tbl
or something like that to clearly indicate what kind of objects these are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zkamvar I have a couple of questions about points (1) and (2):
- For the refactoring into a separate function, which lines should be included in the refactor? Everything from 61 to 80? A subset of that?
- For the control structure, are you suggesting that I write two separate functions for the different cases? Or just the
do_this()
case? Also, is this refactoring supposed to be only for within the map, or for more which includes the map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- For the refactoring into a separate function, which lines should be included in the refactor? Everything from 61 to 80? A subset of that?
The anonymous function that you have from lines 66--79. Inline functions are generally okay, but once they get longer than 5 lines, they really should stand alone otherwise it's difficult to debug them down the road.
- For the control structure, are you suggesting that I write two separate functions for the different cases? Or just the
do_this()
case? Also, is this refactoring supposed to be only for within the map, or for more which includes the map?
I'm so sorry, think I over-engineered my answer 😞. The thing I'm describing here is called a guard clause.
Your function currently consists of an if statement that allows the function to do something. When that statement is FALSE, then the function does nothing. Instead, if you add a guard clause at the beginning that returns early if the function shouldn't do anything to the input, then it helps reduce indentation levels, which makes things easier to read and debug.
# good
if (nrow(thing) == 0) {
return(thing)
}
# do stuff
# -----------------------
# bad
if (nrow(thing) > 0) {
# do stuff
}
Let me know if you want to do some pairing on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarifications! I just wanted to make sure I was fully understanding everything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work! I appreciated the comments that provided context and I really appreciated the cleanup of test-linear_pool.R
by adding helper-test_data.R
, it makes it clear what is being tested 🙌🏽
I will admit that I don't fully grok how the compound task ID set is used. I get the concept of setting NULL
to mean that samples are derived from joint distribution across all of the task IDs, but when it comes to NA
or specifying a set that matches the task ID cols... I get kinda lost. It's not your fault, the sample thing is inherently confusing.
As with Nick, I had minor comments that would help improve the maintainability of the code.
And again, good work!
R/linear_pool_sample.R
Outdated
# draw the target number of samples from each model for each unique | ||
# combination of compound task ID set variables | ||
split_compound_taskid_set <- model_out_tbl |> | ||
split(f = model_out_tbl[, c("model_id", compound_taskid_set)]) | ||
model_out_tbl <- split_compound_taskid_set |> | ||
purrr::map(.f = function(split_outputs) { | ||
if (nrow(split_outputs) != 0) { | ||
# current_compound_taskid_set has 1 row, where the column | ||
# `target_samples` is the number of samples to draw for this | ||
# combination of model_id and compound task ID set variables | ||
current_compound_taskid_set <- split_outputs |> | ||
dplyr::distinct(dplyr::across(dplyr::all_of(compound_taskid_set)), .keep_all = TRUE) |> | ||
dplyr::left_join(samples_per_combo, by = c("model_id", compound_taskid_set)) | ||
provided_indices <- unique(split_outputs$output_type_id) | ||
|
||
sample_idx <- sample(x = provided_indices, size = current_compound_taskid_set$target_samples, replace = FALSE) | ||
dplyr::filter(split_outputs, .data[["output_type_id"]] %in% sample_idx) | ||
} | ||
}) |> | ||
purrr::list_rbind() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I don't understand what the split_compound_taskid_set object is, even after looking at the base::split() helpfile.
@nickreich, this is a list of data frames, where all the rows in each data frame belong to samples from a single compound task ID set and model.
I agree with Nick that this could use some workshopping to make sure it's maintainable.
I have a few suggestions:
- A 10+ line anonymous function, which is an indication that we should give it a name and make it standalone. This is also an opportunity for us to give it explicit tests!
- Use the control structure to have an early return instead of controlling when you do stuff:
do_this <- function(thing) { if (nrow(thing) == 0) { return(thing) } # do stuff } not_this <- function(thing) { if (nrow(thing) > 0) { # do stuff } }
- I would rename the
split_compound_taskid_set
andsplit_outputs
tosplit_model_task_tbls
andmodel_task_tbl
or something like that to clearly indicate what kind of objects these are.
R/linear_pool.R
Outdated
#' @param compound_taskid_set `character` vector of the compound task ID variable | ||
#' set. This argument is only relevant for `output_type` `"sample"`. NULL means | ||
#' that samples are from a multivariate joint distribution across all levels of | ||
#' all task id variables, while equality to `task_id_cols` means that the samples | ||
#' are from separate univariate distributions for each individual prediction task. | ||
#' NA means the compound_taskid_set is not relevant for the current modeling task. | ||
#' Defaults to NA. Derived task ids must be included if all of the task ids their | ||
#' values depend on are part of the compound_taskid_set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to list the default case first.
Moreover, you can make this a list of the expected inputs and ROxygen will happily parse it as a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i had one minor suggestion to add a comment, but overall it looks good!
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Evan Ray <[email protected]>
…ed samples per compound unit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you, @lshandross! I appreciate you taking the time to refactor the inline functions and to add documentation.
I've added some suggestions for nice-to-haves and bringing your attention to the ordering of output type IDs, but I don't think any of them are dealbreakers.
if (nrow(model_compound_set_tbl) == 0) { | ||
return(model_compound_set_tbl) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love to see this guard clause!
#' Helper function for drawing the requested number of samples from each model for | ||
#' every unique combination of compound task ID set variables when requesting a | ||
#' linear pool of the `sample` output type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding this context! It's really helpful
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Co-authored-by: Zhian N. Kamvar <[email protected]>
Adds ability to subset provided samples, implemented by a map over groups defined by the columns making up the compound task id set and sampling the output type ids of the requested number of predictions to return.