Coordinating on use of example data across hubverse packages and documentation #6

elray1 · 2024-02-02T16:06:51Z

elray1
Feb 2, 2024
Maintainer

This post collects ideas that we've discussed before in non-centralized locations.

Goal

It would be nice to have an example hub with all output types, suitable for use in examples and documentation throughout the hubverse. There are several advantages to this:

- Centralized creation of example data means we don't have to think through examples and how to generate example data multiple times.
- When we make changes to hubverse data formats, there are fewer places to make updates.
- It could be helpful to hubverse users to have a consistent set of examples that are used across different packages.

Existing work

We have existing examples in the example simple forecast hub and the example complex scenario hub. The main issue I see with these examples is that they do not have all output types. We would like to be able to demonstrate functionality with examples of all output types.

A second issue related to the complex scenario hub (which includes more output types than the simple forecast hub) is that it is slightly less natural to use those data to demonstrate standard forecast evaluation methods, as evaluating scenario projections carefully is more complex than evaluating forecasts. So I would like to find another example to use for hubEvals.

Example complex forecast hub

I've started putting examples together in the example complex forecast hub. It still needs some documentation describing what's in there.

Output types included

So far, the hub has been populated with example forecasts including the following output types: quantile, mean, median, pmf.
Remaining output types to add include sample and cdf (links to related issues)

Use of example data in hubverse documentation and packages

It seems like these data are candidates for use in the following places:

- hubDocs
- hubEnsembles
- hubEvals
- Maybe others? hubVis? hubData?

For hubData (currently hubUtils), we want to demonstrate a connect_hub() |> collect() workflow, so for that package it may be helpful to work with an actual copy of a full hub setup.

For other packages, it seems like we can bypass that part of the workflow and assume the user has data frames of model output data and target data (if relevant). For those purposes, we could include those data as data objects in the package. To facilitate creation of those data objects, we could mirror the example complex forecast hub to an S3 bucket, so that the scripts that create the data objects in a particular package can run without requiring a local clone of the example hub in a specific location. This would make development easier.

nickreich · 2024-02-02T16:14:56Z

nickreich
Feb 2, 2024
Maintainer

Thank you so much for getting us started on this! Just to clarify your description above, are you saying that the stuff that existed before you started was not appropriate for a hubEvals example or that the stuff you put together is not appropriate?

2 replies

elray1 Feb 2, 2024
Maintainer Author

the stuff that existed before didn't work, and the goal of the new stuff in example-complex-forecast-hub is to address those limitations :)

nickreich Feb 2, 2024
Maintainer

got it. thanks for the clarification.

annakrystalli · 2024-02-03T09:16:14Z

annakrystalli
Feb 3, 2024
Maintainer

One thing we could also consider for example data is separating them out into a data package, a bit like the Long Term Ecological Research program (LTER) Network's lterdatasampler 📦 or Allison Horst's palmerpenguins 📦.

This would allow us to mix individual table data stored as rda in /data, data generating scripts and raw data in /data-raw as well as example hubs in the inst folder that are accessed via system.file().

Actually, I think this is really the way to go, and I'm surprised it only just occurred to me, given how highly I rate the above-mentioned packages!

0 replies

elray1 · 2024-02-07T01:20:18Z

elray1
Feb 7, 2024
Maintainer Author

I'm bringing broader discussion from this thread back here.

I see three options for how to create and package up example data:

1. Create a single unifying example data set that is then used in all of the hubverse packages. It would be natural to house this in a data package, maybe hubExamples or something. (Or we could reclaim the hubData name for this.)
2. Create example data sets specific to each hubverse package, perhaps all derived from example-complex-forecast-hub and/or example-complex-scenario-hub by taking different subsets of the data. These data might be housed within each specific package.
3. Create a suite of example data sets that we think is likely to suit the needs/desires for documentation in all individual packages within the hubverse ecosystem, but store them all in hubExamples. These would all be different subsets of the full example-complex-forecast-hub and example-complex-scenario-hub data sets.

Option 1 has the advantage of setting up a consistent, standardized example that will be familiar to package users who are looking at help files across different packages in the ecosystem.

On the other hand, it seems likely that any single example data set that we might set up in option 1 would be "too complex" to be useful as an example for every purpose, and so almost every help file example would end up doing some filtering on it. For example:

- The functionality that's been scoped out for the hubEvals package can only handle one model output_type at a time. Any example of that functionality will have to filter on output_type at least. It might be nice to demonstrate the necessity of this, but on the other hand it might get clunky to have to do this filtering in every example.
- In our first manuscript, and possibly documentation in hubDocs and/or hubUtils, we'd like to demonstrate different output types by actually displaying rows of data in model_output_tbls. This requires filtering on output type.
- On the other hand, the functionality in hubEnsembles will create ensemble outputs when handed a data frame with multiple output types (the resulting ensembles for different output types may or may not be compatible with each other). It would be nice to be able to demonstrate this in action, so we'd like some example data that mixes output types.
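To make the filtering concern concrete, here is a minimal base R sketch. The mixed_outputs data frame is a toy stand-in for a model_output_tbl; its model names and values are made up for illustration and do not come from any example hub.

```r
# Toy stand-in for a model_output_tbl with mixed output types.
# All model names and values here are made up for illustration.
mixed_outputs <- data.frame(
  model_id = rep(c("model-a", "model-b"), each = 3),
  output_type = rep(c("mean", "median", "quantile"), times = 2),
  output_type_id = rep(c(NA, NA, "0.5"), times = 2),
  value = c(10, 9, 9, 12, 11, 11),
  stringsAsFactors = FALSE
)

# Filter down to a single output type, as an example for a function that
# handles one output_type at a time would need to do.
quantile_only <- mixed_outputs[mixed_outputs$output_type == "quantile", ]

# Alternatively, split once to get one data frame per output type.
by_type <- split(mixed_outputs, mixed_outputs$output_type)
```

Either form adds a line or two to every help file example, which is the clunkiness being weighed here.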

If we think including these kinds of filters throughout the documentation is too clunky, options 2 or 3 are indicated. I don't love either of these options, but of them, I prefer option 2 (packages are expected to maintain their own example data, pulling subsets from the upstream example hub repos).

I think I could be persuaded to go with either option 1 or 2.

0 replies

nickreich · 2024-02-07T13:30:51Z

nickreich
Feb 7, 2024
Maintainer

I like @elray1's suggestion above for option 2. Trying to flesh out how we would operationalize this a bit more:

- A limited number of distinct hub repositories/buckets would house sets of example data. For starters, these would live in example-complex-forecast-hub and example-complex-scenario-hub.
- Most hubverse packages would store subsets of those examples as documented .rda data objects.
  - This would be operationalized by having .R scripts in the data-raw folder that pull/filter example data from the example hubs and then save them.
- Some hubverse packages might also want to demonstrate the connect_hub() |> collect() functionality. I'm still not clear on the "right" way to do this.
  - Maybe now that some hub data are on S3 buckets, we could connect to those cloud buckets and the examples would always work? (Although that maybe feels fragile, e.g. would CRAN accept that as an example?)
  - Maybe packages could store a minimal example hub within the inst folder.
- I suggest that all example data (at least the .rda kind) should be centrally documented (e.g. in a single table on the hubverse docs page). This would give us a clear standard for documenting what kind of examples we have, and make it easier to repurpose example data that already exist. I'm working on a prototype table for this right now, will share soon.
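A minimal sketch of what one of those data-raw scripts might look like, using only base R. The hub_outputs data frame and its values, the object name forecast_outputs_subset, and the file name are all placeholders for illustration. A real script would read from one of the example hubs and would write into the package's data folder, for example via usethis::use_data().

```r
# data-raw/forecast_outputs_subset.R (hypothetical file name)

# Placeholder for data pulled from an example hub; values are made up.
hub_outputs <- data.frame(
  model_id = c("model-a", "model-a", "model-b", "model-b"),
  location = c("25", "48", "25", "48"),
  output_type = c("mean", "quantile", "mean", "quantile"),
  output_type_id = c(NA, "0.5", NA, "0.5"),
  value = c(10, 9, 12, 11),
  stringsAsFactors = FALSE
)

# Keep only the rows wanted for this package's examples.
forecast_outputs_subset <- hub_outputs[hub_outputs$output_type == "quantile", ]
rownames(forecast_outputs_subset) <- NULL

# Save as .rda. A temporary directory stands in for the package's data folder
# so that this sketch does not write into a real package.
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
rda_path <- file.path(out_dir, "forecast_outputs_subset.rda")
save(forecast_outputs_subset, file = rda_path)
```

Keeping the script in data-raw means the saved object can be regenerated whenever the upstream example hub changes.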

0 replies

elray1 · 2024-03-06T19:24:55Z

elray1
Mar 6, 2024
Maintainer Author

Following up on discussion in our hubverse dev meeting on Feb 28, in this comment I will try to outline what it would look like to use a hubverse data package, option 1 in my comment above. This does not represent a decision, just trying to articulate the option clearly to facilitate decision making about how we want to organize things.

A working name for the package is hubExamples.

In brief, the idea is that the package would make three examples of hub data available:

1. An example hub, including full file structure, admin and task config files, model outputs, etc. This could be used as an example for packages that want to demonstrate functionality for loading hub data via, e.g., hubData::connect_hub() |> collect().
2. A set of (three?) data objects containing example data pulled from the example-complex-forecast-hub. There might be one data object with a subset of the model outputs in that hub, and 2 (or more?) data objects with target data. The number of objects with target data depends somewhat on the resolution of the discussion "Proposal for target data (a.k.a. "truth data") formats" (#9). This could be used in packages that want to demonstrate working with hub model outputs and/or target data directly, skipping steps that involve reading in data.
3. A set of (three?) data objects containing example data pulled from the example-complex-scenario-hub. There might be one object with a subset of the model outputs in that hub, and 2 (or more?) data objects with target data, again depending on decisions made elsewhere about the structure of target data. Use cases are similar to example 2, but allow for examples with modeling scenarios.

I'll describe these examples and their use cases in more detail in the following subsections.

Example 1: full file structure

Data structures

In the package repository, the hub will be located in the inst folder, so that it is copied into the package's installation folder when the package is installed. We might draw from the example-complex-forecast-hub, putting a copy of that hub into the package, with a couple of modifications:

- We could pull a subset of the model output files and/or a subset of the modeling tasks (e.g., just 2 locations, 2 horizons, 2 models).
- To avoid warnings about non-portable file paths thrown by R CMD check, we could just name it example-hub. Then the longest path used would be, e.g., "hubExamples/inst/example-hub/model-output/MOBS-GLEAM_FLUH/2022-10-22-MOBS-GLEAM_FLUH.csv", which is 88 characters, so R CMD check would be happy.
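The path-length arithmetic can be checked with a few lines of base R. This is a sketch that assumes the portability limit in question is 100 bytes per path, counted from the package's top-level directory. The helper find_long_paths() is hypothetical and not part of any hubverse package.

```r
# Longest path quoted above, relative to the directory containing the package.
longest_path <- paste0(
  "hubExamples/inst/example-hub/model-output/",
  "MOBS-GLEAM_FLUH/2022-10-22-MOBS-GLEAM_FLUH.csv"
)

# Assumed portability limit (in bytes) on paths stored in a package tarball.
max_path_bytes <- 100L

path_bytes <- nchar(longest_path, type = "bytes")
path_ok <- path_bytes <= max_path_bytes

# Hypothetical helper: list files under a package directory whose paths,
# prefixed with the package directory name, exceed the limit.
find_long_paths <- function(pkg_dir, limit = max_path_bytes) {
  rel <- list.files(pkg_dir, recursive = TRUE, all.files = TRUE)
  full <- file.path(basename(pkg_dir), rel)
  full[nchar(full, type = "bytes") > limit]
}
```

Running find_long_paths() on a local clone before adding more model output files would flag any path that creeps over the assumed limit.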

Example uses in other hubverse packages

With this in place, here's an example of what downstream use could look like in hubData, when documenting the use of connect_hub():

library(hubData)
# Locate the example hub shipped with the (proposed) hubExamples package
hub_path <- system.file("example-hub", package = "hubExamples")
hub_con <- connect_hub(hub_path)

Example 2: data objects for an example forecast hub

Data structures

The proposal is to have approximately 3 data objects that contain model outputs and target data derived from the example-complex-forecast-hub. Note that currently the data in this hub are derived from the 2022/23 FluSight hub, with step-ahead quantile, mean, and median forecasts of hospital admissions and pmf forecasts of a categorical intensity level. We don't currently have cdf or sample examples in this hub, but plan to add them.

A first data object might be called forecast_outputs, and it would be a model_output_tbl containing a subset of the model outputs from the example-complex-forecast-hub. The goal would be to identify a minimal set that's sufficient for illustrating data uses in all hubverse packages. One possible specification is:

- 2 or 3 model_ids: Examples in hubVis and hubEnsembles benefit from having multiple models. I would vote for including three models, since examples of visualizations and ensembles with only two models might not be that interesting. But 2 would probably be sufficient.
- 2 locations: Functionality in hubVis benefits from having multiple locations. Maybe choose locations with different populations or signal/noise ratios? TX and MA?
- 2 reference_dates
- 2-4 horizons
- All output_types (to eventually include pmf, cdf, quantile, mean, median, sample)
- For quantile forecasts, 7 quantile levels sufficient for computing some scores and creating plots: 0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975

This data set would have columns like the following:

   location reference_date horizon target_end_date target          output_type output_type_id value model_id
   <chr>    <date>           <int> <date>          <chr>           <chr>       <chr>          <dbl> <chr>
 1 US       2022-10-22           0 2022-10-22      wk inc flu hosp quantile    0.01             943 Flusight-baseline
 2 US       2022-10-22           0 2022-10-22      wk inc flu hosp quantile    0.025           1114 Flusight-baseline
 3 US       2022-10-22           0 2022-10-22      wk inc flu hosp quantile    0.05            1211 Flusight-baseline
 4 US       2022-10-22           0 2022-10-22      wk inc flu hosp quantile    0.1             1329 Flusight-baseline
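As a sketch of the quantile-level subsetting in the specification, the four rows displayed above can be re-entered by hand and filtered to the seven proposed levels. Only the two relevant columns are reproduced. The object names shown_rows, keep_levels, and subset_rows are placeholders.

```r
# The four rows displayed above, re-entered by hand (relevant columns only).
shown_rows <- data.frame(
  output_type_id = c("0.01", "0.025", "0.05", "0.1"),
  value = c(943, 1114, 1211, 1329),
  stringsAsFactors = FALSE
)

# The seven quantile levels proposed in the specification.
keep_levels <- c(0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975)

# output_type_id is a character column, so convert it before comparing.
# Rounding both sides guards against floating-point representation surprises.
is_kept <- round(as.numeric(shown_rows$output_type_id), 3) %in%
  round(keep_levels, 3)
subset_rows <- shown_rows[is_kept, ]
```

Of the four rows shown, only the 0.025 and 0.1 quantiles are among the proposed levels, so subset_rows keeps those two.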

A second data object might be called forecast_target_timeseries or forecast_target_ts, and would contain the example target data in time series format. We could subset to the same set of locations that was selected above. This would include columns like the following:

   date       location value
   <date>     <chr>    <dbl>
 1 2020-01-11 01           0
 2 2020-01-11 15           0
 3 2020-01-11 18           0
 4 2020-01-11 27           0

One or more additional data objects would include observed target values. We have not yet established the precise format for this/these objects. If a single data object, we might call it forecast_target_values.

Example uses in other hubverse packages

The hubEnsembles package includes the linear_pool function, which can accept a model_output_tbl with a mix of multiple different output_types, including mean, quantile, and pmf. It currently does not support other output types such as median and sample (with potential future support for samples, but not for median). Documentation for that function could include code similar to the following:

library(dplyr)
library(hubEnsembles)
# Keep only the output types that linear_pool() supports
hubExamples::forecast_outputs |>
  filter(output_type %in% c("mean", "quantile", "pmf")) |>
  linear_pool()

Planned functionality in the hubEvals package would allow for only one output_type to be provided in calls to score_mdl_output(). Documentation for that function could include code similar to the following:

library(dplyr)
# score_mdl_output() is planned hubEvals functionality; one output_type per call
hubExamples::forecast_outputs |>
  filter(output_type == "quantile") |>
  score_mdl_output()

[Note: On one hand, these filter statements feel kind of distracting. On the other hand, they may serve a useful documentation purpose, calling the reader's attention to the fact that these functions accept subsets of output_types.]

Example 3: data objects for an example scenario modeling hub

The setup and use cases here would be very similar to example 2, but we'd like to provide examples of scenario model outputs and targets.

8 replies

elray1 Mar 7, 2024
Maintainer Author

Yes, confirming: the data sets under examples 2 and 3 would be stored as rda (very similar to RData) files in the package.

Clarifying: the idea I was proposing was to include all three of the examples.

Re: the "Example 1" setup with actual data files: I think the main intended use of that is for documenting functions in hubData that are related to reading in data from files in a hub, like connect_hub and load_model_metadata. This may also be helpful for some of the functions in hubAdmin, like validate_config. It seems like, to give examples of those kinds of functions, some package will need to include the data files and worry about the file structure, path lengths, etc., no?

If we want to strongly encourage all other packages to just refer to the native data objects in their documentation, maybe one option could be to remove Example 1 from this hubExamples package, keeping it only in hubData. A downside to that would be that if we do end up with multiple packages that have functionality that works directly with hub file structures, they will both/all need to also replicate that setup. But maybe this is not a real concern, or there is another way around it?

bsweger Mar 7, 2024
Collaborator

Ah, I see--thanks for clarifying!

Apologies for being so behind in the conversation, but I think I'm with you now: the goal for option 1 is to provide a way to have working code snippets in hubData without requiring users to have (or clone) a hub?

If that's the case, would pointing those working code snippets to an S3 bucket serve the purpose (while also signaling the idea of cloud-enabled hubs)?

elray1 Mar 7, 2024
Maintainer Author

interesting thought. @annakrystalli, any opinion about skipping the inclusion of actual data files in this package and using S3 bucketed hubs in examples for hubData and hubAdmin?

annakrystalli Mar 7, 2024
Maintainer

For me that does not cover all needs of hub admins or teams submitting to hubs who for the time being will always work with local versions of the data. And for testing we want to be able to test both local and S3 behaviour.

annakrystalli Mar 7, 2024
Maintainer

In general I feel we should be able to demo both local and cloud functionality.

elray1 · 2024-03-13T19:17:00Z

elray1
Mar 13, 2024
Maintainer Author

I am closing this discussion in favor of new, more focused discussions over on the shiny new hubExamples repository.

0 replies
