Ls/standardize compound unit for samples/144 #147

lshandross · 2024-11-27T19:42:38Z

Adds ability to subset provided samples, implemented by a map over groups defined by the columns making up the compound task id set and sampling the output type ids of the requested number of predictions to return.

github-actions · 2024-11-27T19:44:22Z

🚀 Deployed on https://67914ae8b9a3f607d126f8f6--hubensembles-pr-previews.netlify.app

elray1

This is a good start on a challenging problem! I have asked some questions throughout.

elray1 · 2024-12-09T19:05:02Z

Double check that we don't ensemble predictions from any sets of models with different sets of values for task id variables that are not in the compound task id set (we should throw an error if this situation arises). Example:

hub is collecting trajectories (compound task id set doesn't include horizon or target date)
model A submits forecasts for horizons 1 through 4, model B only submits forecasts for horizons 1 and 2. We don't want to try to ensemble those.

Edit: this has been addressed and unit tested
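For readers skimming the thread, the horizon example above can be sketched in base R as follows. This is only an illustration of the idea, not the check that was merged: `find_incomplete_models()` is a hypothetical helper, the toy data has a single compound unit, and a real check would also have to group by the compound task id set.

```r
# Toy model output: samples are joint across `horizon`, so `horizon` is NOT
# part of the compound task id set (here the set is just `location`).
model_out_tbl <- rbind(
  expand.grid(model_id = "model_A", location = "US", horizon = 1:4,
              output_type_id = 1:2, stringsAsFactors = FALSE),
  expand.grid(model_id = "model_B", location = "US", horizon = 1:2,
              output_type_id = 1:2, stringsAsFactors = FALSE)
)

# Hypothetical helper (not part of hubEnsembles): return the models whose set
# of values for a dependent task id differs from the union over all models.
find_incomplete_models <- function(tbl, dependent_task) {
  all_values <- sort(unique(tbl[[dependent_task]]))
  values_by_model <- split(tbl[[dependent_task]], tbl$model_id)
  is_complete <- vapply(
    values_by_model,
    function(v) identical(sort(unique(v)), all_values),
    logical(1)
  )
  names(is_complete)[!is_complete]
}

incomplete <- find_incomplete_models(model_out_tbl, "horizon")
if (length(incomplete) > 0) {
  message("Models missing horizons: ", paste(incomplete, collapse = ", "))
}
```

In this toy case model_B is flagged because it only covers horizons 1 and 2, which is the situation where we would want an error rather than an ensemble.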

lshandross · 2025-01-09T22:37:12Z

The following commits fix a bug I identified while investigating when derived task ids should be included in the compound task id set. When derived task ids were not included in the compound task id set, validate_compound_taskid_set() assumed that there was (additional) dependency across the values for those derived task ids and threw a false positive error. Thus, a derived_tasks parameter needed to be added for that validation, even though it's not used in the linear_pool_sample() calculation.

Addressing this bug also meant that I could remove the extra mutate() statement in the first linear_pool() test to fix the mismatched output type ids, which technically do not violate our requirement that sample forecasts must share output type ids across each unique combination of compound task id set variables for every model. (We may still be interested in updating the example data, though, for output type id consistency.)

lshandross · 2025-01-10T18:07:18Z

The following is a summary meant to get reviewers oriented with the functionality in this pull request. Please provide your feedback by Wednesday, January 15.

Context

The hubEnsembles package currently only supports the simplest case of ensembling samples using a linear pool, which adheres to the following conditions:

1. The ensemble weights all models equally
2. All component models provide an equal number of samples per compound unit
3. There is no limit on the number of samples to return (i.e., we simply pool all forecasts into an ensemble and return them)

We are interested in expanding to more complex cases, eliminating one condition at a time. Here, we start by eliminating condition 3. Note that we are NOT interested in eliminating conditions 1 and 2 at this time, though there is some built-in flexibility that will make eliminating them easier in the future.

High level summary:

Adds ability to subset provided samples, implemented by a map over groups defined by the columns making up the compound task id set and sampling the output type ids of the requested number of predictions to return.

Ensembling steps (excluding validations):

1. linear_pool() splits provided model inputs by output type and the helper linear_pool_sample() is called to act on the sample predictions
2. Weights are validated to be equal for every model (throws an error if otherwise) OR set to be equal if not provided

If a subset of output samples is requested, do steps 3-4; else, skip to step 5.

3. Calculate the number of samples to output per model within each compound unit, implemented by a map. If the total number of output samples cannot be divided evenly among the models, the remainder is randomly distributed among them
   a. We use the weights to help determine how many samples to output for each model for a single compound unit
   b. If a model does not predict for a given compound unit, no output samples will be requested from it for that compound unit and will instead be split among the models that do predict for that compound unit
4. Draw the correct number of samples to output from each model for each compound unit (defined as a unique combination of task id set variable values) by sampling the model's output type ids, executed by a map. This grouping ensures that any samples that are joint across non-compound task id set variables will be connected and drawn together
5. Update the output type ids to be unique, so as not to falsely link any predictions that are not joint across. The data type of the output_type_id column is either a string or an integer, depending on the column's original data type.
6. Update the model_id to be that of the ensemble, which may be specified by the user
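To make step 3 concrete, here is a minimal base R sketch of how a requested number of samples might be divided among the models within one compound unit. `allocate_samples()` is a hypothetical name, not the function in this PR, and the sketch assumes the caller passes only the models that actually predict for that compound unit (which covers 3b).

```r
# Hypothetical helper: split `n_output_samples` among models in proportion to
# their weights, then hand any remainder to randomly chosen models.
allocate_samples <- function(model_ids, weights, n_output_samples) {
  # whole samples each model is guaranteed
  base_counts <- floor(n_output_samples * weights / sum(weights))
  remainder <- n_output_samples - sum(base_counts)
  if (remainder > 0) {
    # pick `remainder` distinct models at random to receive one extra sample
    lucky <- sample.int(length(model_ids), size = remainder)
    base_counts[lucky] <- base_counts[lucky] + 1
  }
  stats::setNames(base_counts, model_ids)
}

set.seed(42)
counts <- allocate_samples(
  model_ids = c("model_A", "model_B", "model_C"),
  weights = rep(1, 3),
  n_output_samples = 10
)
print(counts)
```

With three equally weighted models and 10 requested samples, each model gets 3 and one randomly chosen model gets a fourth.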

Validation functions for ensembling samples

validate_compound_taskid_set()
- Checks that component model outputs are compatible with the specified compound task id set (i.e., samples that are joint across can be combined by their shared output type id values; note that the output type id values only need to be the same within each model)
- Checks that all component models forecast for the same set of joint across values (e.g., if samples are joint across horizons, all models must forecast for every horizon). There is an option to return a data.frame that summarizes which predictions are missing for a particular model
validate_sample_inputs()
- Checks validity of inputs to linear_pool_sample()
- Checks that for each compound unit, all models provide the same number of samples
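As a rough illustration of the last check (equal numbers of samples per compound unit), the following base R sketch counts distinct output type ids per model and compound unit and compares them. The helper names `count_samples()` and `has_equal_samples()` are hypothetical and this is not the code in validate_sample_inputs().

```r
# Toy data: within the single compound unit (location = "US"), model_A
# provides 3 samples and model_B provides 2, each joint across 2 horizons.
model_out_tbl <- data.frame(
  model_id = rep(c("model_A", "model_B"), times = c(6, 4)),
  location = "US",
  horizon = c(rep(1:2, each = 3), rep(1:2, each = 2)),
  output_type_id = c(1:3, 1:3, 1:2, 1:2)
)

# Hypothetical helper: number of distinct samples per model and compound unit
count_samples <- function(tbl, compound_taskid_set) {
  groups <- tbl[, c("model_id", compound_taskid_set), drop = FALSE]
  counts <- stats::aggregate(
    tbl["output_type_id"],
    by = groups,
    FUN = function(ids) length(unique(ids))
  )
  names(counts)[names(counts) == "output_type_id"] <- "n_samples"
  counts
}

# Hypothetical helper: TRUE if every compound unit has one shared sample count
has_equal_samples <- function(counts, compound_taskid_set) {
  per_unit <- split(counts$n_samples, counts[compound_taskid_set], drop = TRUE)
  all(vapply(per_unit, function(n) length(unique(n)) == 1, logical(1)))
}

counts <- count_samples(model_out_tbl, "location")
equal_ok <- has_equal_samples(counts, "location")
```

Here `equal_ok` is FALSE because the two models provide different numbers of samples for the same compound unit, which is the case the validation is meant to reject.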

nickreich

I have two minor, non-essential comments. I read through the code carefully, but did not run anything on my own. Basically, I think the changes look good, and the strategy of tackling one assumption at a time also seems reasonable.

nickreich · 2025-01-13T21:33:55Z

R/linear_pool.R

+#' @param compound_taskid_set `character` vector of the compound task ID variable
+#'   set. This argument is only relevant for `output_type` `"sample"`. NULL means
+#'   that samples are from a multivariate joint distribution across all levels of
+#'   all task id variables, while equality to `task_id_cols` means that the samples
+#'   are from separate univariate distributions for each individual prediction task.
+#'   NA means the compound_taskid_set is not relevant for the current modeling task.
+#'   Defaults to NA. Derived task ids must be included if all of the task ids their
+#'   values depend on are part of the compound_taskid_set.


I find this parameter definition dense and hard to wade through. I also note that no additions are made in the @details section below, where maybe they should be as all other output types are covered there. So perhaps a briefer definition here (or the same?) and then adding some additional longer-form explanations, maybe with examples and possibly links to the definitions of the concepts of compound task id and derived task id, below.

I would suggest to list the default case first.

Moreover, you can make this a list of the expected inputs and ROxygen will happily parse it as a list.

I've taken Zhian's suggestion of making the expected inputs a list and reordering the cases, plus added a section in the @details about how linear_pool deals with samples. Let me know if this is clearer @nickreich

edit: changes have been made and pushed
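For reference, the list layout suggested above could look roughly like the sketch below, with the default case first. The wording is adapted from the parameter text under review; `describe_compound_taskid_set()` is a hypothetical function that exists only to carry the roxygen block, and the `@md` tag is only needed if markdown is not already enabled for the package.

```r
#' Describe how a compound task ID set would be interpreted (illustration only)
#'
#' @param compound_taskid_set Only relevant for `output_type` `"sample"`.
#'   One of:
#'   - `NA` (default): the compound task ID set is not relevant for the
#'     current modeling task.
#'   - `NULL`: samples are from a multivariate joint distribution across all
#'     levels of all task ID variables.
#'   - a `character` vector of task ID variables: when equal to
#'     `task_id_cols`, samples are from separate univariate distributions
#'     for each individual prediction task.
#' @md
describe_compound_taskid_set <- function(compound_taskid_set = NA) {
  if (is.null(compound_taskid_set)) {
    return("joint across all task id variables")
  }
  if (length(compound_taskid_set) == 1 && is.na(compound_taskid_set)) {
    return("not relevant")
  }
  paste("compound task id set:", paste(compound_taskid_set, collapse = ", "))
}
```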

nickreich · 2025-01-13T21:49:17Z

R/linear_pool_sample.R

+    # draw the target number of samples from each model for each unique
+    # combination of compound task ID set variables
+    split_compound_taskid_set <- model_out_tbl |>
+      split(f = model_out_tbl[, c("model_id", compound_taskid_set)])
+    model_out_tbl <- split_compound_taskid_set |>
+      purrr::map(.f = function(split_outputs) {
+        if (nrow(split_outputs) != 0) {
+          # current_compound_taskid_set has 1 row, where the column
+          # `target_samples` is the number of samples to draw for this
+          # combination of model_id and compound task ID set variables
+          current_compound_taskid_set <- split_outputs |>
+            dplyr::distinct(dplyr::across(dplyr::all_of(compound_taskid_set)), .keep_all = TRUE) |>
+            dplyr::left_join(samples_per_combo, by = c("model_id", compound_taskid_set))
+          provided_indices <- unique(split_outputs$output_type_id)
+
+          sample_idx <- sample(x = provided_indices, size = current_compound_taskid_set$target_samples, replace = FALSE)
+          dplyr::filter(split_outputs, .data[["output_type_id"]] %in% sample_idx)
+        }
+      }) |>
+      purrr::list_rbind()
  }


I am not sure if the goal should be that the code is "readable" but I spent 10 minutes trying to parse this code (without running it) and can't understand what it is doing. Perhaps some more comments would help.

specifically:

I don't understand what the split_compound_taskid_set object is, even after looking at the base::split() helpfile.

I don't understand what the nested set of dplyr functions on line 72 is doing.

> I don't understand what the split_compound_taskid_set object is, even after looking at the base::split() helpfile.

@nickreich, this is a list of data frames, where all the rows in each data frame belong to samples from a single compound task ID set and model.

I agree with Nick that this could use some workshopping to make sure it's maintainable.

I have a few suggestions:

1. This is a 10+ line anonymous function, which is an indication that we should give it a name and make it standalone. This is also an opportunity for us to give it explicit tests!

2. Use the control structure to have an early return instead of controlling when you do stuff:

       do_this <- function(thing) {
         if (nrow(thing) == 0) {
           return(thing)
         }
         # do stuff
       }

       not_this <- function(thing) {
         if (nrow(thing) > 0) {
           # do stuff
         }
       }

3. I would rename the split_compound_taskid_set and split_outputs to split_model_task_tbls and model_task_tbl or something like that to clearly indicate what kind of objects these are.

@zkamvar I have a couple of questions about points (1) and (2):

For the refactoring into a separate function, which lines should be included in the refactor? Everything from 61 to 80? A subset of that?

For the control structure, are you suggesting that I write two separate functions for the different cases? Or just the do_this() case? Also, is this refactoring supposed to be only for within the map, or for more which includes the map?

> For the refactoring into a separate function, which lines should be included in the refactor? Everything from 61 to 80? A subset of that?

The anonymous function that you have from lines 66--79. Inline functions are generally okay, but once they get longer than 5 lines, they really should stand alone otherwise it's difficult to debug them down the road.

> For the control structure, are you suggesting that I write two separate functions for the different cases? Or just the do_this() case? Also, is this refactoring supposed to be only for within the map, or for more which includes the map?

I'm so sorry, I think I over-engineered my answer 😞. The thing I'm describing here is called a guard clause.

Your function currently consists of an if statement that allows the function to do something. When that statement is FALSE, then the function does nothing. Instead, if you add a guard clause at the beginning that returns early if the function shouldn't do anything to the input, then it helps reduce indentation levels, which makes things easier to read and debug.

    # good
    if (nrow(thing) == 0) {
      return(thing)
    }
    # do stuff

    # -----------------------

    # bad
    if (nrow(thing) > 0) {
      # do stuff
    }

Let me know if you want to do some pairing on this.

Thanks for the clarifications! I just wanted to make sure I was fully understanding everything
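To tie the thread together, here is a minimal base R sketch of the shape being discussed: the split object is a named list with one data frame per model and compound unit, and the drawing logic lives in a standalone function that starts with a guard clause. The names `draw_samples_sketch()` and the toy `samples_per_combo` values are assumptions for illustration, and this is not the refactored code in the PR.

```r
# Toy data: two models, one compound unit (location), samples joint across horizon
model_out_tbl <- data.frame(
  model_id = rep(c("model_A", "model_B"), each = 6),
  location = "US",
  horizon = rep(rep(1:2, each = 3), times = 2),
  output_type_id = rep(1:3, times = 4),
  value = seq_len(12)
)
compound_taskid_set <- "location"

# Assumed target counts per model and compound unit (illustrative values)
samples_per_combo <- data.frame(
  model_id = c("model_A", "model_B"),
  location = "US",
  target_samples = c(2, 1)
)

# Hypothetical standalone helper with a guard clause
draw_samples_sketch <- function(model_task_tbl, samples_per_combo,
                                compound_taskid_set) {
  if (nrow(model_task_tbl) == 0) {
    return(model_task_tbl)
  }
  key_cols <- c("model_id", compound_taskid_set)
  # one row holding the number of samples to draw for this model and unit
  target <- merge(unique(model_task_tbl[key_cols]), samples_per_combo,
                  by = key_cols)
  provided_ids <- unique(model_task_tbl$output_type_id)
  # sample positions rather than values, so a single numeric id is not
  # expanded by sample() into 1:id
  keep <- sample.int(length(provided_ids), size = target$target_samples)
  model_task_tbl[model_task_tbl$output_type_id %in% provided_ids[keep], ,
                 drop = FALSE]
}

# A named list: one data frame per combination of model and compound unit
split_model_task_tbls <- split(
  model_out_tbl,
  model_out_tbl[c("model_id", compound_taskid_set)]
)

set.seed(1)
drawn <- do.call(rbind, lapply(
  split_model_task_tbls, draw_samples_sketch,
  samples_per_combo = samples_per_combo,
  compound_taskid_set = compound_taskid_set
))
```

Because whole output type ids are kept or dropped, each retained sample still has a row for every horizon, so the joint structure across horizons is preserved. A standalone helper like this can also be tested directly, including the zero-row case that the guard clause handles.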

zkamvar

Good work! I appreciated the comments that provided context and I really appreciated the cleanup of test-linear_pool.R by adding helper-test_data.R, it makes it clear what is being tested 🙌🏽

I will admit that I don't fully grok how the compound task ID set is used. I get the concept of setting NULL to mean that samples are derived from joint distribution across all of the task IDs, but when it comes to NA or specifying a set that matches the task ID cols... I get kinda lost. It's not your fault, the sample thing is inherently confusing.

As with Nick, I had minor comments that would help improve the maintainability of the code.

And again, good work!

zkamvar · 2025-01-14T00:25:29Z

R/linear_pool_sample.R

> I don't understand what the split_compound_taskid_set object is, even after looking at the base::split() helpfile.

@nickreich, this is a list of data frames, where all the rows in each data frame belong to samples from a single compound task ID set and model.

I agree with Nick that this could use some workshopping to make sure it's maintainable.

I have a few suggestions:

1. This is a 10+ line anonymous function, which is an indication that we should give it a name and make it standalone. This is also an opportunity for us to give it explicit tests!

2. Use the control structure to have an early return instead of controlling when you do stuff:

       do_this <- function(thing) {
         if (nrow(thing) == 0) {
           return(thing)
         }
         # do stuff
       }

       not_this <- function(thing) {
         if (nrow(thing) > 0) {
           # do stuff
         }
       }

3. I would rename the split_compound_taskid_set and split_outputs to split_model_task_tbls and model_task_tbl or something like that to clearly indicate what kind of objects these are.

zkamvar · 2025-01-14T01:01:46Z

R/linear_pool.R

I would suggest to list the default case first.

Moreover, you can make this a list of the expected inputs and ROxygen will happily parse it as a list.

elray1

I had one minor suggestion to add a comment, but overall it looks good!

zkamvar

LGTM! Thank you, @lshandross! I appreciate you taking the time to refactor the inline functions and to add documentation.

I've added some suggestions for nice-to-haves and bringing your attention to the ordering of output type IDs, but I don't think any of them are dealbreakers.

zkamvar · 2025-01-22T18:41:13Z

R/linear_pool_sample.R

+  if (nrow(model_compound_set_tbl) == 0) {
+    return(model_compound_set_tbl)
+  }


love to see this guard clause!

zkamvar · 2025-01-22T18:41:34Z

R/linear_pool_sample.R

+#' Helper function for drawing the requested number of samples from each model for
+#' every unique combination of compound task ID set variables when requesting a
+#' linear pool of the `sample` output type.


Thank you for adding this context! It's really helpful

lshandross added 4 commits November 26, 2024 17:04

Support target_samples < provided_samples for linear pool sample

591e5cc

Update existing tests and validations

ce4e81a

Minor fixes for failing validations

2673df4

Add tests for linear_pool_sample() with compound tasks

8479d10

lshandross marked this pull request as draft November 27, 2024 20:03

lshandross added 2 commits December 2, 2024 14:54

Remove debugging code

790863d

Fix warnings generated by tests

11f1935

lshandross marked this pull request as ready for review December 2, 2024 20:33

Refactor validate_ensemble_outputs() to reduce cyclomatic complexity

f0061e2

lshandross requested a review from elray1 December 3, 2024 16:54

elray1 requested changes Dec 4, 2024

lshandross added 4 commits December 5, 2024 10:51

validate_output_type_ids for sample predictions

d08c3fb

Fix bug in validate_output_type_ids()

74fcdce

Remove same num samples from each component model validation

810aa52

Should only be for each unique combo of task ID vars, NOT for every unique combo (now covered by `validate_output_type_ids`)

Fix linear_pool tests linting

6338d57

lshandross added 11 commits December 9, 2024 16:23

Group by compound_taskid_set values in linear_pool_sample

fb2312a

Instead of task ID values

Remove out-of-date comment

9d24bc7

Change arg comp_units_cols to compound_taskid_set for consistency

0fab6da

Fix compound_taskid_set param docs

c61c9fe

validate_output_type_ids for samples

a65ce4a

Write validate_compound_taskid_set() function

c685f2f

Update tests for validate_compound_taskid_set()

fc33b90

linear_pool_sample() handles component models diff compound task ID sets

aaf04c9

Update existing tests for new linear_pool_sample() functionality

a00bd18

Add linear_pool_sample() test for component models forecasting different locations

2c89864

Minor formatting fixes

14fd277

This was referenced Dec 12, 2024

When implementing weighted sampling, check that weights only vary by compound_taskid_set #149

Open

Extend sample handling cases in linear_pool #143

Closed

Reformat tests to put expect_error() call at top of code

b3a9604

lshandross added 2 commits January 9, 2025 17:40

Add derived_tasks param to fix false positive `validate_compound_taskid_set` error

9779cd1

Remove fix for inconsistent sample indices in example forecasts

bfe2785

This was referenced Jan 10, 2025

Change compound_taskid_set arg value default to those specified in the hub's model tasks schema (tasks.json) #148

Closed

handle sample types in linear_pool #27

Closed

nickreich approved these changes Jan 13, 2025

zkamvar approved these changes Jan 14, 2025

elray1 self-requested a review January 14, 2025 15:03

elray1 reviewed Jan 14, 2025

elray1 approved these changes Jan 14, 2025

lshandross and others added 13 commits January 16, 2025 15:30

Update compound_taskid_set param description

b151a34

Update linear_pool docs about samples

1fc2210

Simplify current_compound_taskid_set calculation

4fef980

Co-authored-by: Zhian N. Kamvar <[email protected]>

Fix linting, update docs

fe3452b

Update models forecast different dependent tasks test inline comments

b4d89e3

Add inline comments explaining dependent_tasks variable

f8aaca7

Co-authored-by: Evan Ray <[email protected]>

Rename split_compound_taskid_set, split_outputs to be more intuitive

515e902

Styling improvements and linting fixes

743d8cd

Refactor draw_target_samples()

c6259fa

Remove unneeded @details section

7c68d8d

Rename (split_)compound_taskid_set_tbl to `(split_)model_compound_set_tbl`

39335da

Test linear_pool_sample throws error if n_output_samples > provided samples per compound unit

d554d01

Update NEWS.md

0ac7888

zkamvar approved these changes Jan 22, 2025

lshandross and others added 4 commits January 22, 2025 14:23

Order newly assigned unique indices

1898bc8

Co-authored-by: Zhian N. Kamvar <[email protected]>

Fix order of newly assigned unique indices for expected test outputs

6fe36a9

Reorder NULL vs not NULL cases for readability

2e536bc

Co-authored-by: Zhian N. Kamvar <[email protected]>

Implement guard clause in validate_compound_taskid_set()

10101da

Co-authored-by: Zhian N. Kamvar <[email protected]>
