We currently have two local executors, one based on multithreading and one on multiprocessing, each with its own tests. We would rather maintain a single one, to simplify the code base and reduce CI duration.
The multiprocessing-based local_experimental executor is clearly the one we'd like to keep, because it also has the important feature of letting users stop a running job.
First complexity: V1 vs V2 (solved)
The multiprocessing-based local_experimental executor is only available for V2, meaning we would need to backport it to V1 to keep everything consistent. V1 is scheduled for full deprecation, so we will clearly not add this feature to it.
Simple workaround: just rename the executor from local to local_experimental, with no other changes.
Second complexity: process-based executor is much slower
When running our benchmarks, we found that (for the use cases we are simulating!) the local_experimental executor is much slower. It is also possible that this relates to #1772, although we did not explore that further.
A benchmark GHA run goes from typically 5 minutes to 35-40 minutes. We tried a simple fix, increasing max_workers, with no clear improvement.
We did not go far into analyzing this issue, since the actual goal was to reduce CI time rather than to dig into the differences between threads and processes. But it is easy to guess that the overhead of creating and closing processes is much higher than the equivalent overhead for threads - and our example workloads are probably dominated by that overhead, since the actual tasks are dummies.
Where to go from here
For the moment we cannot fully move from the thread-based executor to the process-based one, until we learn more about how the two perform in a realistic scenario. Any choice in this area also needs to be based on what we expect in terms of use cases, and on which trade-offs we can accept for a local deployment.
Thanks for the overview. High-level, with the container-based approach taking over more and more of how one would run local tests and do local development, I think we can limit the complexity of the use cases that the local executors cover. A core area is certainly running the CI, so we don't always have to simulate the Slurm interactions.
If the switch to the process executor leads to much increased CI times, it's certainly not worth fully switching to it for now.
This issue replaces the following attempts:
- local backend #2042
- local_experimental into local #2091
- local_experimental as a default runner #2093
- max_workers variable in local-executor config #2095
- local backend #2092
- local_experimental as a default runner #2094