Switch from multisession parallelization to multicore in evaluate stage #53

@jeancochrane jeancochrane commented Oct 31, 2024

This PR updates the evaluate stage of the pipeline to switch from the multisession parallelization strategy to multicore. This change is intended to fix the behavior we've been seeing on the server whereby the evaluate stage takes so long to complete that it makes development difficult.

I'm not sure why this behavior differs from last year, but my experiments with different numbers of workers using the "multisession" strategy on a minimal reprex showed that execution begins to slow down once the number of workers exceeds 5 (minimum runtime ~10 minutes). The background R processes must incur some sort of overhead, but I couldn't find anything in the docs explaining it. Switching to the "multicore" strategy resolves this problem of diminishing returns, but risks using more memory (due to forked process isolation) and more CPU resources (due to execution using logical cores rather than threads). To mitigate this risk, we reduce the number of workers to half the available logical cores on the machine that runs the pipeline. With 16 cores on the server (so 8 workers), the evaluate stage executes fast enough (~80 seconds) that I'm not too worried about hogging resources.
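For reference, here is a minimal sketch of the worker setup described above. It assumes `num_threads` is detected with `parallelly::availableCores()`; the pipeline may compute it differently, so treat that line as an assumption rather than the actual code.

```r
library(future)

# Assumption: logical cores are detected with parallelly; the pipeline
# may derive num_threads some other way.
num_threads <- as.integer(parallelly::availableCores())

# Use half the available logical cores, rounding up so that a
# single-core machine still gets one worker.
num_workers <- ceiling(num_threads / 2)

# Forked workers. This only has an effect where forking is supported
# (e.g. Linux outside RStudio).
plan(multicore, workers = num_workers)
```

Rounding up with `ceiling()` keeps the worker count at 1 or more even on very small machines.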

Note that this change also decreases the execution time for this stage on Batch from 200s to 60s.
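If anyone wants to reproduce this kind of timing comparison, something along these lines could work. This is only a sketch: `slow_task()` and `time_plan()` are hypothetical helpers, and the sleep-based workload is a stand-in rather than the real evaluate stage, so the numbers it produces won't match the ones quoted here.

```r
library(future)
library(furrr)

# Hypothetical stand-in for one unit of work in the evaluate stage.
slow_task <- function(i) {
  Sys.sleep(0.05)
  i^2
}

# Time the same workload under a given strategy name and worker count,
# then restore the sequential plan so workers are shut down.
time_plan <- function(strategy, workers, n = 40) {
  if (strategy == "multicore") {
    plan(multicore, workers = workers)
  } else {
    plan(multisession, workers = workers)
  }
  on.exit(plan(sequential), add = TRUE)
  system.time(future_map_dbl(seq_len(n), slow_task))[["elapsed"]]
}

# Compare elapsed time across a couple of worker counts.
for (w in c(2, 4)) {
  cat("multisession, workers =", w, ":", time_plan("multisession", w), "s\n")
}
```

Swapping `"multisession"` for `"multicore"` in the loop gives the other side of the comparison on a machine that supports forking.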

@jeancochrane jeancochrane marked this pull request as ready for review October 31, 2024 16:37

@dfsnow dfsnow left a comment

Nice! Good debugging.

If I recall, the historical reasons for using multisession over multicore were:

  1. The modeling used to be done on Windows laptops, not the current Linux VM, and only multisession works on Windows.
  2. The Linux VM was much smaller and memory-constrained when we first started using it, and multicore would crash the session (especially when we were still calculating CIs).
  3. multicore didn't play well with RStudio, which mattered when we were running the model interactively back in the day.

Given that these aren't really constraints anymore, I'm fine with switching over to multicore.

Comment on lines 15 to 24
# Enable parallel backend for generating stats faster.
# In the past we used the 'multisession' parallelization strategy, but this
# strategy exhibits diminishing returns (and eventually worse performance) past
# 5 workers on the server, and it's not particularly fast either (~10 mins to
# complete this stage). The 'multicore' strategy has a higher risk of hogging
# server resources for the duration of execution, but it executes much faster
# than the multisession strategy (~80 seconds to complete this stage), so
# ultimately we think it's worth the risk; plus, we only use half the available
# cores in order to ensure we don't block execution of other important tasks on
# the server.

suggestion: Since this is specific to our environment and not necessarily to the pipeline in general, I vote that we move this comment into the commit body.


Good call, done in 583dd38.

@jeancochrane

> 3. multicore didn't play well with RStudio, which mattered when we were running the model interactively back in the day.

Do you think this is an important enough consideration to continue to support running the script in RStudio @dfsnow? Maybe we could check interactive() and use multisession if true and multicore if false?

dfsnow commented Nov 5, 2024

I haven't tried running multicore in RStudio in a while. Maybe it works fine now? Let's test it. If it wigs out we can add an interactive() check.
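For illustration, an interactive() check might look roughly like the sketch below. `strategy_for_session()` is a hypothetical helper and `num_workers` is a placeholder value, neither taken from the pipeline. One caveat: interactive() is also TRUE in a plain interactive terminal session, not just in RStudio.

```r
library(future)

# Hypothetical helper: pick a strategy name based on whether the
# session is interactive (e.g. RStudio) or a batch run (e.g. Rscript).
strategy_for_session <- function(is_interactive = interactive()) {
  if (is_interactive) "multisession" else "multicore"
}

# Placeholder worker count, for illustration only.
num_workers <- 2

if (strategy_for_session() == "multisession") {
  plan(multisession, workers = num_workers)
} else {
  plan(multicore, workers = num_workers)
}
```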

@jeancochrane

RStudio does indeed seem to block multicore parallelization:

> plan(multicore, workers = ceiling(num_threads / 2))

Warning message:
In supportsMulticoreAndRStudio(...) :
  [ONE-TIME WARNING] Forked processing ('multicore') is not supported when running R
from RStudio because it is considered unstable. For more details, how to control forked
processing or not, and how to silence this warning in future R sessions, see ?parallelly::supportsMulticore

Checking htop confirms that this causes furrr to fall back to the sequential strategy. I updated the code in 583dd38 to use multicore if parallelly::supportsMulticore() returns TRUE and multisession otherwise.
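A rough sketch of that kind of fallback is below. This is not the code from 583dd38: `set_parallel_plan()` is a hypothetical helper, and deriving the worker count from `parallelly::availableCores()` is an assumption.

```r
library(future)

# Hypothetical helper: use forked workers where parallelly reports that
# forking is supported, and background R sessions otherwise (e.g. in
# RStudio or on Windows). Returns the name of the strategy it chose.
set_parallel_plan <- function(workers) {
  if (parallelly::supportsMulticore()) {
    plan(multicore, workers = workers)
    "multicore"
  } else {
    plan(multisession, workers = workers)
    "multisession"
  }
}

# Assumption: half the detected logical cores, rounded up.
num_workers <- ceiling(as.integer(parallelly::availableCores()) / 2)
chosen <- set_parallel_plan(num_workers)
cat("Using strategy:", chosen, "with", num_workers, "workers\n")
```

Checking `supportsMulticore()` rather than interactive() targets the actual constraint, since an interactive terminal session on Linux can still fork.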

@jeancochrane jeancochrane merged commit 2dc6500 into master Nov 5, 2024
4 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/switch-from-multisession-paralellization-to-multicore branch November 5, 2024 23:12