Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add extract_fit_time #191

Merged
merged 22 commits into from
Apr 5, 2024
Merged

add extract_fit_time #191

merged 22 commits into from
Apr 5, 2024

Conversation

EmilHvitfeldt
Copy link
Member

@EmilHvitfeldt EmilHvitfeldt commented Dec 16, 2022

This is part of a series of PRs to add a new generic extract_fit_time() to allow users to investigate how long different tidymodels objects takes to fit.

TODO:

  • make sure names are good.
    • right now id clashed with id returned from tune_grid()
  • Make work with non-recipe preprocessor
  • add tests
  • Make sure this change is backwards compatible
  • remove hardhat from Remotes

Related PRs
tidymodels/hardhat#218
tidymodels/recipes#1071
tidymodels/parsnip#853

# pak::pak(c(
#   "tidymodels/hardhat@extract_fit_time",
#   "tidymodels/recipes@extract_fit_time",
#   "tidymodels/parsnip@extract_fit_time",
#   "tidymodels/workflows@extract_fit_time"
# ))

library(tidymodels)

data(ames, package = "modeldata")

rec_ames <- recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_normalize(all_predictors())

lm_spec <- linear_reg()

wf_spec <- workflow(rec_ames, lm_spec)

wf_fit <- fit(wf_spec, ames)

# recipes
wf_fit |>
  extract_recipe() |>
  extract_fit_time()
#> # A tibble: 1 × 2
#>   id      time
#>   <chr>  <dbl>
#> 1 recipe 0.259

wf_fit |>
  extract_recipe() |>
  extract_fit_time(summarize = FALSE)
#> # A tibble: 6 × 2
#>   id                      time
#>   <chr>                  <dbl>
#> 1 prep.dummy_edFdo     0.0190 
#> 2 bake.dummy_edFdo     0.0350 
#> 3 prep.nzv_oSTvE       0.197  
#> 4 bake.nzv_oSTvE       0      
#> 5 prep.normalize_1D9e7 0.00500
#> 6 bake.normalize_1D9e7 0.00300

# parsnip
wf_fit |>
  extract_fit_parsnip() |>
  extract_fit_time() # defaults to summarize = FALSE
#> # A tibble: 1 × 2
#>   id           time
#>   <chr>       <dbl>
#> 1 linear_reg 0.0310

wf_fit |>
  extract_fit_parsnip() |>
  extract_fit_time(summarize = TRUE)
#> Error in `extract_fit_time()`:
#> ! `summarize = TRUE` is not supported for `model_fit` objects.

# workflow
wf_fit |>
  extract_fit_time()
#> # A tibble: 1 × 3
#>   stage    id        time
#>   <chr>    <chr>    <dbl>
#> 1 workflow workflow 0.290

wf_fit |>
  extract_fit_time(summarize = FALSE)
#> # A tibble: 7 × 3
#>   stage      id                      time
#>   <chr>      <chr>                  <dbl>
#> 1 preprocess prep.dummy_edFdo     0.0190 
#> 2 preprocess bake.dummy_edFdo     0.0350 
#> 3 preprocess prep.nzv_oSTvE       0.197  
#> 4 preprocess bake.nzv_oSTvE       0      
#> 5 preprocess prep.normalize_1D9e7 0.00500
#> 6 preprocess bake.normalize_1D9e7 0.00300
#> 7 model      linear_reg           0.0310

Created on 2023-06-15 with reprex v2.0.2

Merge commit '882b06bb27592055493b10065b3ee805399fad38'

#Conflicts:
#	DESCRIPTION
Copy link
Contributor

@simonpcouch simonpcouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the interface is super solid, and the potential uses for this in stacks, workflow sets, and tuning results are many!

Just made a few small docs suggestions—I think, as often as we can, we should use the same term to refer to this value. I think "elapsed fit time" is a good cue, in that it self-documents system.time()$elapsed and aligns with the function names. :)

Pending those TODO's, thumbs up from me!

R/extract.R Outdated Show resolved Hide resolved
R/extract.R Outdated Show resolved Hide resolved
R/extract.R Outdated Show resolved Hide resolved
@simonpcouch
Copy link
Contributor

re:

right now id clashed with id returned from tune_grid()

workflowsets uses wflow_id, so we have some precedent for distinguishing different types of ids. there’s also id2 in iterative searches, unfortunately named with a different pattern (though that may be internal-only?).

I'm not sure I have any silver bullet recommendations. process_id? Probably something with *_id.

This may be a separate issue from naming, but there may also be some natural linkage to .config here. The processing for that will likely happen in tune, but perhaps something worth anticipating.

@EmilHvitfeldt
Copy link
Member Author

I like process_id!

@EmilHvitfeldt EmilHvitfeldt marked this pull request as ready for review June 16, 2023 19:31
@EmilHvitfeldt EmilHvitfeldt requested a review from hfrick June 16, 2023 20:00
Copy link
Member

@hfrick hfrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if stage_id is a good alternative to process_id?

NEWS.md Show resolved Hide resolved
Merge branch 'main' into extract_fit_time

# Conflicts:
#	DESCRIPTION
@simonpcouch
Copy link
Contributor

I think a part of this interface that's been tough to wrap my head around is the degree of summarization. I have a hunch that getting summarization right will make deciding on column names a bit easier.

At the moment, output looks like:

library(tidymodels)

data(ames, package = "modeldata")

rec_ames <- 
  recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_normalize(all_predictors())

lm_spec <- linear_reg()

wf_fit <- fit(workflow(rec_ames, lm_spec), ames)
wf_fit2 <- fit(workflow(Sale_Price ~ ., lm_spec), ames)

extract_fit_time(wf_fit)
#> # A tibble: 1 × 3
#>   stage    process_id  time
#>   <chr>    <chr>      <dbl>
#> 1 workflow workflow   0.419
extract_fit_time(wf_fit, summarize = FALSE)
#> # A tibble: 7 × 3
#>   stage      process_id              time
#>   <chr>      <chr>                  <dbl>
#> 1 preprocess prep.dummy_wjSqo     0.0290 
#> 2 preprocess bake.dummy_wjSqo     0.156  
#> 3 preprocess prep.nzv_lUZwC       0.199  
#> 4 preprocess bake.nzv_lUZwC       0      
#> 5 preprocess prep.normalize_I9sdp 0.00400
#> 6 preprocess bake.normalize_I9sdp 0.00300
#> 7 model      linear_reg           0.0280

extract_fit_time(wf_fit2)
#> # A tibble: 1 × 3
#>   stage    process_id  time
#>   <chr>    <chr>      <dbl>
#> 1 workflow workflow   0.173
extract_fit_time(wf_fit2, summarize = FALSE)
#> # A tibble: 1 × 3
#>   stage process_id  time
#>   <chr> <chr>      <dbl>
#> 1 model linear_reg 0.173

With a recipe, the summarize argument reduces the number of rows, but with a formula preprocessor, the only change is the values of stage and process_id.

Workflows already have the built-in unit ("stage") pre (preprocess), fit (model), post (postprocess). I imagine, in the tuning context, this is likely the unit of observation I'm interesting in timing:

extract_fit_time(wf_fit, summarize = FALSE) %>%
  group_by(stage) %>%
  summarize(time = sum(time))
#> # A tibble: 2 × 2
#>   stage        time
#>   <chr>       <dbl>
#> 1 model      0.0280
#> 2 preprocess 0.392

It'd be nice if that was the default.

An argument like unit = c("workflow", "stage", "process") in favor of summarize might feel better for workflows. That would make the analogous arguments for the recipe method, though, "stage" and "process", which are totally unnatural compared to summarize = TRUE or FALSE or unit = "recipe" or "step". Hmph.

Created on 2024-03-27 with reprex v2.1.0

@simonpcouch
Copy link
Contributor

The other remaining question for me is whether we should be thinking of some functionality that measures predict elapsed time alongside this. extract_predict_time() is definitely a stranger interface, as the outputs of bake() and predict.model_fit() are just tibbles. Also, whatever that function is couldn't be passed as an extract control argument to resampling functions, because those can only be called on a fitted workflow.

My general leaning is that since 1) the current methods can point out when bakeing, at least on data, take a notable amount of time and 2) predict.model_fit() tends to take much less time than fit.model_spec(), we can probably live with just measuring fit time and not providing an analogous interface for prediction.

Copy link
Contributor

@simonpcouch simonpcouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking solid. :) Just one docs clarification.

R/extract.R Outdated Show resolved Hide resolved
@EmilHvitfeldt EmilHvitfeldt merged commit 78426ac into main Apr 5, 2024
10 checks passed
@EmilHvitfeldt EmilHvitfeldt deleted the extract_fit_time branch April 5, 2024 21:52
Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 20, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants