Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch from multisession parallelization to multicore in evaluate stage #53

Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 11 additions & 2 deletions pipeline/03-evaluate.R
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,17 @@ tictoc::tic("Evaluate")
# Load libraries, helpers, and recipes from files
purrr::walk(list.files("R/", "\\.R$", full.names = TRUE), source)

# Enable parallel backend for generating stats more quickly
plan(multisession, workers = num_threads)
# Enable parallel backend for generating stats faster.
# In the past we used the 'multisession' parallelization strategy, but this
# strategy exhibits diminishing returns (and eventually worse performance) past
# 5 workers on the server, and it's not particularly fast either (~10 mins to
# complete this stage). The 'multicore' strategy has a higher risk of hogging
# server resources for the duration of execution, but it executes much faster
# than the multisession strategy (~80 seconds to complete this stage), so
# ultimately we think it's worth the risk; plus, we only use half the available
# cores in order to ensure we don't block execution of other important tasks on
# the server.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Since this is specific to our environment and not necessarily to the pipeline in general, I vote that we move this comment into the commit body.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, done in 583dd38.

plan(multicore, workers = ceiling(num_threads / 2))

# Renaming dictionary for input columns. We want the actual value of the column
# to become geography_id and the NAME of the column to become geography_name
Expand Down