Ls/standardize compound unit for samples/144 #147

Open
wants to merge 72 commits into
base: main

Conversation

lshandross
Contributor

@lshandross lshandross commented Nov 27, 2024

Adds the ability to return a subset of the provided samples. This is implemented by mapping over groups defined by the columns that make up the compound task id set and, within each group, sampling as many output type ids as the requested number of predictions to return.

github-actions bot commented Nov 27, 2024

@lshandross lshandross marked this pull request as draft November 27, 2024 20:03
@lshandross lshandross marked this pull request as ready for review December 2, 2024 20:33
@lshandross lshandross requested a review from elray1 December 3, 2024 16:54
Contributor

@elray1 elray1 left a comment

This is a good start on a challenging problem! I have asked some questions throughout.

R/linear_pool.R (outdated, resolved)
R/linear_pool_sample.R (outdated, resolved)
R/linear_pool_sample.R (outdated, resolved)
R/linear_pool_sample.R (outdated, resolved)
R/linear_pool_sample.R (outdated, resolved)
R/linear_pool_sample.R (outdated, resolved)
tests/testthat/test-linear_pool.R (outdated, resolved)
R/validate_ensemble_inputs.R (resolved)
R/linear_pool_sample.R (resolved)
tests/testthat/test-linear_pool.R (outdated, resolved)
@elray1
Contributor

elray1 commented Dec 9, 2024

Double check that we don't ensemble predictions from any sets of models with different sets of values for task id variables that are not in the compound task id set (we should throw an error if this situation arises). Example:

  • hub is collecting trajectories (compound task id set doesn't include horizon or target date)
  • model A submits forecasts for horizons 1 through 4, model B only submits forecasts for horizons 1 and 2. We don't want to try to ensemble those.

Edit: this has been addressed and unit tested
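
A minimal sketch of the kind of check described above, using base R only. The helper name `same_joint_values()` and the toy columns are hypothetical and are not taken from the package; the actual validation in this pull request may be structured differently.

```r
# Toy model outputs: trajectories are joint across `horizon`, so `horizon`
# is NOT part of the compound task id set (hypothetical column names).
model_out <- data.frame(
  model_id = c(rep("A", 4), rep("B", 2)),
  location = "US",
  horizon = c(1:4, 1:2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if every model covers the same set of values
# for the task id variables that are not in the compound task id set.
same_joint_values <- function(tbl, non_compound_cols) {
  per_model <- split(tbl[, non_compound_cols, drop = FALSE], tbl$model_id)
  keys <- lapply(per_model, function(d) {
    # collapse each row of the non-compound columns into one string key
    sort(unique(do.call(paste, c(unname(as.list(d)), sep = "|"))))
  })
  all(vapply(keys, identical, logical(1), keys[[1]]))
}

# Model A covers horizons 1 through 4, model B only 1 and 2
consistent <- same_joint_values(model_out, "horizon")
if (!consistent) {
  message("Models forecast for different sets of joint-across values")
}
```

In a real validation this comparison would presumably be made within each compound unit and would raise an error rather than a message.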

@lshandross
Contributor Author

The following commits fix a bug I identified while investigating when derived task ids should be included in the compound task id set. When derived task ids were not included in the compound task id set, validate_compound_taskid_set() assumed that there was (additional) dependency across the values for those derived task ids and threw a false positive error. Thus, a derived_tasks parameter needed to be added for that validation, even though it's not used in the linear_pool_sample() calculation.

Addressing this bug also meant that I could remove the extra mutate() statement in the first linear_pool() test to fix the mismatched output type ids, which technically do not violate our requirement that sample forecasts must share output type ids across each unique combination of compound task id set variables for every model. (We may still be interested in updating the example data, though, for output type id consistency.)

@lshandross
Contributor Author

The following is a summary meant to get reviewers oriented with the functionality in this pull request. Please provide your feedback by Wednesday, January 15.

Context

The hubEnsembles package currently only supports the simplest case of ensembling samples using a linear pool, which adheres to the following conditions:

  1. The ensemble weights all models equally
  2. All component models provide an equal number of samples per compound unit
  3. There is no limit on the number of samples to return (i.e., we simply pool all forecasts into an ensemble and return them)

We are interested in expanding to more complex cases, eliminating one condition at a time. Here, we start by eliminating condition 3. Note that we are NOT interested in eliminating conditions 1 and 2 at this time, though there is some built-in flexibility that will make eliminating them easier in the future.

High level summary:

Adds the ability to return a subset of the provided samples. This is implemented by mapping over groups defined by the columns that make up the compound task id set and, within each group, sampling as many output type ids as the requested number of predictions to return.

Ensembling steps (excluding validations):

  1. linear_pool() splits the provided model inputs by output type and calls the helper linear_pool_sample() on the sample predictions
  2. Weights are validated to be equal for every model (an error is thrown otherwise) OR set to be equal if not provided
  3. If a subset of output samples is requested, do steps 4-5; otherwise, skip to step 6.
  4. Calculate the number of samples to output per model within each compound unit, implemented by a map. If the total number of output samples cannot be divided evenly among the models, the remainder is randomly distributed among them
    a. We use the weights to help determine how many samples to output for each model for a single compound unit
    b. If a model does not predict for a given compound unit, no output samples are requested from it for that compound unit; its share is instead split among the models that do predict for that compound unit
  5. Draw the correct number of samples to output from each model for each compound unit (defined as a unique combination of compound task id set variable values) by sampling the model's output type ids, executed by a map. This grouping ensures that any samples that are joint across non-compound task id set variables will be connected and drawn together
  6. Update the output type ids to be unique, so that predictions that are not joint are not falsely linked. The data type of the output_type_id column is either a string or an integer, depending on the column's original data type.
  7. Update the model_id to be that of the ensemble, which may be specified by the user
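
The allocation step in the list above (splitting the requested number of output samples among models, with the remainder distributed at random) can be illustrated with a short base R sketch. The function `allocate_samples()` and its arguments are hypothetical and only illustrate the idea for a single compound unit; they are not the package's implementation.

```r
# Hypothetical helper: split `n_output` samples among the models that
# predicted for ONE compound unit, in proportion to their weights
# (equal weights if none are given).
allocate_samples <- function(model_ids, n_output, weights = NULL) {
  if (is.null(weights)) {
    weights <- rep(1, length(model_ids))
  }
  weights <- weights / sum(weights)
  # whole samples each model is guaranteed; the small epsilon protects
  # against floating point values such as 2.9999999999
  base <- floor(n_output * weights + 1e-8)
  remainder <- n_output - sum(base)
  # hand out the leftover samples, one each, to randomly chosen models
  lucky <- sample.int(length(model_ids), size = remainder)
  base[lucky] <- base[lucky] + 1
  stats::setNames(base, model_ids)
}

set.seed(1)
# 10 samples over 3 models: two models get 3 samples, one gets 4
allocation <- allocate_samples(c("A", "B", "C"), n_output = 10)
```

Only the models that actually predict for the compound unit are passed in, so a missing model's share is automatically spread over the remaining ones.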

Validation functions for ensembling samples

  • validate_compound_taskid_set()
    • Checks that component model outputs are compatible with the specified compound task id set (i.e., samples that are joint across can be combined by their shared output type id values; note that the output type id values only need to be the same within each model)
    • Checks that all component models forecast for the same set of joint across values (e.g. if samples are joint across horizons, all models must forecast for every horizon). There is an option to return a data.frame that summarizes which predictions are missing for a particular model
  • validate_sample_inputs()
    • Checks validity of inputs to linear_pool_sample()
    • Checks that for each compound unit, all models provide the same number of samples
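
As a rough illustration of the last check (equal numbers of samples per compound unit), here is a base R sketch. `count_samples()` and the toy data are assumptions made for this example; validate_sample_inputs() itself may be implemented differently.

```r
# Toy sample outputs: model A provides 3 samples, model B provides 2,
# each joint across horizons 1 and 2 (hypothetical column names).
samples <- data.frame(
  model_id = rep(c("A", "B"), times = c(6, 4)),
  location = "US",
  horizon = c(rep(1:2, 3), rep(1:2, 2)),
  output_type_id = c(rep(1:3, each = 2), rep(1:2, each = 2))
)

# Hypothetical helper: number of distinct output type ids (samples) per
# model within each compound unit.
count_samples <- function(tbl, compound_taskid_set) {
  counts <- stats::aggregate(
    tbl["output_type_id"],
    by = tbl[c("model_id", compound_taskid_set)],
    FUN = function(x) length(unique(x))
  )
  names(counts)[names(counts) == "output_type_id"] <- "n_samples"
  counts
}

counts <- count_samples(samples, compound_taskid_set = "location")
# one logical per compound unit: do all models provide the same count?
unit <- interaction(counts["location"], drop = TRUE)
ok <- tapply(counts$n_samples, unit, function(n) length(unique(n)) == 1)
```

Here `ok` is FALSE for the single compound unit, which is the situation the validation is meant to reject.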

Contributor

@nickreich nickreich left a comment

I have two minor, non-essential comments. I read through the code carefully, but did not run anything on my own. Basically, I think the changes look good, and the strategy of tackling one assumption at a time also seems reasonable.

R/linear_pool.R Outdated
Comment on lines 12 to 19
#' @param compound_taskid_set `character` vector of the compound task ID variable
#' set. This argument is only relevant for `output_type` `"sample"`. NULL means
#' that samples are from a multivariate joint distribution across all levels of
#' all task id variables, while equality to `task_id_cols` means that the samples
#' are from separate univariate distributions for each individual prediction task.
#' NA means the compound_taskid_set is not relevant for the current modeling task.
#' Defaults to NA. Derived task ids must be included if all of the task ids their
#' values depend on are part of the compound_taskid_set.
Contributor

I find this parameter definition dense and hard to wade through. I also note that no additions are made in the @details section below, where maybe they should be as all other output types are covered there. So perhaps a briefer definition here (or the same?) and then adding some additional longer-form explanations, maybe with examples and possibly links to the definitions of the concepts of compound task id and derived task id, below.

Member

I would suggest to list the default case first.

Moreover, you can make this a list of the expected inputs and ROxygen will happily parse it as a list.
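
For illustration, one way the parameter could be written as a list with the default case first. This is a hypothetical rewording, not the text that was eventually committed; it assumes roxygen2 markdown support is enabled for the package (e.g. `Roxygen: list(markdown = TRUE)` in DESCRIPTION), and `linear_pool_stub()` is a placeholder function used only so the block is valid R.

```r
#' @param compound_taskid_set `character` vector of the compound task ID
#'   variable set; only relevant for the `"sample"` output type. One of:
#'   - `NA` (default): the compound task ID set is not relevant for the
#'     current modeling task.
#'   - `NULL`: samples are from a multivariate joint distribution across all
#'     levels of all task id variables.
#'   - Equal to `task_id_cols`: samples are from separate univariate
#'     distributions for each individual prediction task.
linear_pool_stub <- function(compound_taskid_set = NA) {
  # placeholder body: returns its argument so the default can be inspected
  invisible(compound_taskid_set)
}
```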

Contributor Author

@lshandross lshandross Jan 16, 2025

I've taken Zhian's suggestion of making the expected inputs a list and reordering the cases, plus added a section in the @details about how linear_pool deals with samples. Let me know if this is clearer @nickreich

edit: changes have been made and pushed

Comment on lines 61 to 81
  # draw the target number of samples from each model for each unique
  # combination of compound task ID set variables
  split_compound_taskid_set <- model_out_tbl |>
    split(f = model_out_tbl[, c("model_id", compound_taskid_set)])
  model_out_tbl <- split_compound_taskid_set |>
    purrr::map(.f = function(split_outputs) {
      if (nrow(split_outputs) != 0) {
        # current_compound_taskid_set has 1 row, where the column
        # `target_samples` is the number of samples to draw for this
        # combination of model_id and compound task ID set variables
        current_compound_taskid_set <- split_outputs |>
          dplyr::distinct(dplyr::across(dplyr::all_of(compound_taskid_set)), .keep_all = TRUE) |>
          dplyr::left_join(samples_per_combo, by = c("model_id", compound_taskid_set))
        provided_indices <- unique(split_outputs$output_type_id)

        sample_idx <- sample(x = provided_indices, size = current_compound_taskid_set$target_samples, replace = FALSE)
        dplyr::filter(split_outputs, .data[["output_type_id"]] %in% sample_idx)
      }
    }) |>
    purrr::list_rbind()
}
Contributor

I am not sure if the goal should be that the code is "readable" but I spent 10 minutes trying to parse this code (without running it) and can't understand what it is doing. Perhaps some more comments would help.

specifically:

  1. I don't understand what the split_compound_taskid_set object is, even after looking at the base::split() helpfile.
  2. I don't understand what the nested set of dplyr functions on line 72 is doing.

Member

  1. I don't understand what the split_compound_taskid_set object is, even after looking at the base::split() helpfile.

@nickreich, this is a list of data frames, where all the rows in each data frame belong to samples from a single compound task ID set and model.
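
A small toy example may make the shape of that list concrete. The data below are made up; the point is that base::split() with several grouping columns creates one data frame per combination of their values, including combinations that never occur (which come back with zero rows).

```r
# Toy table: model B has no rows for horizon 2 (made-up data)
model_out_tbl <- data.frame(
  model_id = c("A", "A", "B"),
  horizon = c(1, 2, 1),
  output_type_id = c("s1", "s1", "s1"),
  value = c(10, 12, 9)
)
compound_taskid_set <- "horizon"

# One data frame per combination of model_id and compound task id values
split_tbls <- split(
  model_out_tbl,
  f = model_out_tbl[, c("model_id", compound_taskid_set)]
)

# Names combine the grouping values, e.g. "A.1"; unobserved combinations
# such as "B.2" are present but empty
group_names <- names(split_tbls)
group_sizes <- vapply(split_tbls, nrow, integer(1))
```

The empty groups are why the mapped function needs to handle tables with zero rows.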

I agree with Nick that this could use some workshopping to make sure it's maintainable.

I have a few suggestions:

  1. This is a 10+ line anonymous function, which is an indication that we should give it a name and make it standalone. This is also an opportunity for us to give it explicit tests!
  2. Use the control structure to have an early return instead of controlling when you do stuff:
    do_this <- function(thing) {
      if (nrow(thing) == 0) {
        return(thing)
      }
      # do stuff
    }
    not_this <- function(thing) {
      if (nrow(thing) > 0) {
        # do stuff
      }
    }
  3. I would rename the split_compound_taskid_set and split_outputs to split_model_task_tbls and model_task_tbl or something like that to clearly indicate what kind of objects these are.
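
Putting the three suggestions together, a standalone helper might look roughly like the sketch below. The name `draw_samples()` and its `n_target` argument are hypothetical, and base R subsetting is used instead of dplyr to keep the example self-contained; it is not the refactor that was actually committed.

```r
# Hypothetical standalone helper: keep `n_target` randomly chosen samples
# (identified by output_type_id) from the table for one model and one
# compound unit.
draw_samples <- function(model_task_tbl, n_target) {
  # guard clause: empty groups are returned unchanged
  if (nrow(model_task_tbl) == 0) {
    return(model_task_tbl)
  }
  provided_ids <- unique(model_task_tbl$output_type_id)
  # sample positions rather than the ids themselves, so that a single
  # numeric id is not interpreted by sample() as the range 1:id
  keep <- provided_ids[sample.int(length(provided_ids), size = n_target)]
  model_task_tbl[model_task_tbl$output_type_id %in% keep, , drop = FALSE]
}

# Toy input: 3 samples, each joint across 2 horizons
model_task_tbl <- data.frame(
  horizon = rep(1:2, times = 3),
  output_type_id = rep(c("s1", "s2", "s3"), each = 2),
  value = 1:6
)
set.seed(1)
subset_tbl <- draw_samples(model_task_tbl, n_target = 2)
```

Because whole output type ids are kept or dropped, rows that belong to the same joint sample stay together, and a named function like this can be unit tested directly.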

Contributor Author

@zkamvar I have a couple of questions about points (1) and (2):

  1. For the refactoring into a separate function, which lines should be included in the refactor? Everything from 61 to 80? A subset of that?
  2. For the control structure, are you suggesting that I write two separate functions for the different cases? Or just the do_this() case? Also, is this refactoring supposed to be only for within the map, or for more which includes the map?

Member

  1. For the refactoring into a separate function, which lines should be included in the refactor? Everything from 61 to 80? A subset of that?

The anonymous function that you have from lines 66--79. Inline functions are generally okay, but once they get longer than 5 lines, they really should stand alone otherwise it's difficult to debug them down the road.

  2. For the control structure, are you suggesting that I write two separate functions for the different cases? Or just the do_this() case? Also, is this refactoring supposed to be only for within the map, or for more which includes the map?

I'm so sorry, I think I over-engineered my answer 😞. The thing I'm describing here is called a guard clause.

Your function currently consists of an if statement that allows the function to do something. When that statement is FALSE, then the function does nothing. Instead, if you add a guard clause at the beginning that returns early if the function shouldn't do anything to the input, then it helps reduce indentation levels, which makes things easier to read and debug.

# good
  if (nrow(thing) == 0) {
    return(thing)
  }
  # do stuff

# -----------------------

# bad
  if (nrow(thing) > 0) {
    # do stuff
  }

Let me know if you want to do some pairing on this.

Contributor Author

Thanks for the clarifications! I just wanted to make sure I was fully understanding everything

Member

@zkamvar zkamvar left a comment

Good work! I appreciated the comments that provided context and I really appreciated the cleanup of test-linear_pool.R by adding helper-test_data.R, it makes it clear what is being tested 🙌🏽

I will admit that I don't fully grok how the compound task ID set is used. I get the concept of setting NULL to mean that samples are derived from joint distribution across all of the task IDs, but when it comes to NA or specifying a set that matches the task ID cols... I get kinda lost. It's not your fault, the sample thing is inherently confusing.

As with Nick, I had minor comments that would help improve the maintainability of the code.

And again, good work!

R/linear_pool_sample.R (outdated, resolved)
tests/testthat/test-linear_pool.R (resolved)
@elray1 elray1 self-requested a review January 14, 2025 15:03
Contributor

@elray1 elray1 left a comment

I had one minor suggestion to add a comment, but overall it looks good!

Member

@zkamvar zkamvar left a comment

LGTM! Thank you, @lshandross! I appreciate you taking the time to refactor the inline functions and to add documentation.

I've added some suggestions for nice-to-haves and am bringing your attention to the ordering of output type IDs, but I don't think any of them are dealbreakers.

R/linear_pool_sample.R (outdated, resolved)
Comment on lines +134 to +136
if (nrow(model_compound_set_tbl) == 0) {
  return(model_compound_set_tbl)
}
Member

love to see this guard clause!

Comment on lines +113 to +115
#' Helper function for drawing the requested number of samples from each model for
#' every unique combination of compound task ID set variables when requesting a
#' linear pool of the `sample` output type.
Copy link
Member

Thank you for adding this context! It's really helpful

R/linear_pool_sample.R (outdated, resolved)
R/validate_ensemble_inputs.R (outdated, resolved)
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

When implementing weighted sampling, check that weights only vary by compound_taskid_set
4 participants