I'd like to propose to introduce fine-grained parallelization of tasks in Mill.
Goals
Make using a BSP server alongside running Mill commands in the terminal more stable
Right now, using Mill BSP from Metals and running Mill commands in watch mode at the same time often results in instabilities, with both commands trying to compile the same thing at the same time and wiping each other's compile output. This is especially a problem with meta-builds, with one Mill instance sometimes not able to load what it just compiled, and crashing with NoClassDefFoundError exceptions.
Fine-grained parallelism would ensure that whatever the BSP server computes doesn't conflict with commands users run on the side.
Make running several Mill commands in the terminal more stable
One runs into the same kind of problem as above when running several commands in watch mode at the same time.
Proposal
How evaluators lock targets to compute them
Mill evaluators acquire locks on targets via a file like out/mill-evaluators/evaluator-$(id)-targets. Each line in this file is a target name, like foo.compile.
In order to pick targets to compute, an evaluator:
computes as many input hashes as it can, reads cached results on disk and retains only non-stale ones, so that it can determine which targets need to and can be computed immediately:
targets that don't have inputs
targets whose inputs are cached or already computed
acquires a lock on out/mill-evaluators, and while it holds it:
reads the targets other Mill evaluators have locked in their out/mill-evaluators/evaluator-*-targets files
picks targets that aren't locked
writes these targets in its out/mill-evaluators/evaluator-$(id)-targets to lock them
releases the lock on out/mill-evaluators
(Locks on out/mill-evaluators are short-lived overall.)
The evaluator can then compute its locked targets and write their results on disk.
Once it's done, it repeats that process, updating the list of targets that need and can be computed (other evaluators might have written new cached results on disk), and clearing / updating its own locked target list in out/mill-evaluators/evaluator-$(id)-targets.
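The locking round above could look roughly like this. This is a minimal single-round sketch; the class name `TargetPicker`, the method `pickTargets`, and the `lock` file used to lock the `out/mill-evaluators` directory are all illustrative, and error handling is elided:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class TargetPicker {
    // Claim ready-to-compute targets not already held by another evaluator.
    static Set<String> pickTargets(Path evaluatorsDir, String id, Set<String> ready)
            throws IOException {
        Path dirLockFile = evaluatorsDir.resolve("lock"); // hypothetical lock-file name
        try (FileChannel ch = FileChannel.open(dirLockFile,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock dirLock = ch.lock()) { // short-lived lock on out/mill-evaluators
            // Read the targets other evaluators have locked in their files
            Set<String> taken = new HashSet<>();
            String own = "evaluator-" + id + "-targets";
            try (DirectoryStream<Path> files =
                     Files.newDirectoryStream(evaluatorsDir, "evaluator-*-targets")) {
                for (Path f : files)
                    if (!f.getFileName().toString().equals(own))
                        taken.addAll(Files.readAllLines(f));
            }
            // Pick the unlocked ones and write them to our own file to lock them
            Set<String> picked = new TreeSet<>(ready);
            picked.removeAll(taken);
            Files.write(evaluatorsDir.resolve(own), picked);
            return picked;
        } // the lock on out/mill-evaluators is released here
    }
}
```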
Check for the aliveness of other evaluators
An out/mill-evaluators/evaluator-$(id)-targets file might be stale: it may have been written by an evaluator whose process ended abruptly. Other evaluators need to detect, ignore, and delete such files.
Throughout its lifetime, an evaluator needs to hold a lock on a file like out/mill-evaluators/evaluator-$(id)-active.
When it holds a lock on out/mill-evaluators, if an evaluator cannot find targets to compute (all are locked by other evaluators), it checks for the aliveness of evaluators that hold targets it needs. It does so by trying to acquire a lock on the out/mill-evaluators/evaluator-$(id)-active file of the other evaluator. If it succeeds, that means the other evaluator isn't alive, and its out/mill-evaluators/evaluator-$(id)-{targets,active} files can be removed.
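A sketch of that reaping step, under the same hypothetical file layout (`reapIfDead` is an illustrative name; in the real protocol this would run while holding the lock on out/mill-evaluators):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class Reaper {
    // Returns true if the other evaluator was dead and its files were cleaned up.
    static boolean reapIfDead(Path evaluatorsDir, String otherId) throws IOException {
        Path active = evaluatorsDir.resolve("evaluator-" + otherId + "-active");
        try (FileChannel ch = FileChannel.open(active,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = ch.tryLock();
            if (lock == null) return false; // lock still held: the evaluator is alive
            lock.release();
            // Nobody holds the -active lock, so the owner died: remove its stale files
            Files.deleteIfExists(active);
            Files.deleteIfExists(evaluatorsDir.resolve("evaluator-" + otherId + "-targets"));
            return true;
        }
    }
}
```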
Writing results in an atomic fashion
In the steps above, we assumed cached results can be read without holding a lock. If a cached result file is large, it might be read partially while it's being written.
To avoid that, cached results can first be written to a temporary file (say, the same path with a leading . and a .part extension), then moved atomically to their final destination.
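For instance, with java.nio (note that whether an atomic move may replace an existing target file is platform-dependent; on POSIX filesystems a rename replaces it):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicWrite {
    static void writeAtomically(Path dest, byte[] result) throws IOException {
        // Hidden sibling: same directory, "." prefix and ".part" extension
        Path tmp = dest.resolveSibling("." + dest.getFileName() + ".part");
        Files.write(tmp, result);
        // Readers see either the old content or the new one, never a partial file
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    }
}
```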
Linear chains of targets
If a computable target is the input of exactly one other target, and is that target's sole input, the downstream target can be locked and computed at the same time. This can be done recursively for the new target.
Indeed, given that the second (and maybe following) target can only be computed after the first one, no other evaluator can attempt to compute it earlier than the current one. The current one might as well proceed with it, possibly lowering the number of iterations needed to compute all requested targets.
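The chain-extension rule can be sketched as a pure function over the dependency graph (a hypothetical helper; `inputsOf` maps a target to its inputs, `dependentsOf` to its downstream targets):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Chains {
    // Extend a computable target into the longest linear chain hanging off it.
    static List<String> linearChain(String start,
                                    Map<String, Set<String>> inputsOf,
                                    Map<String, Set<String>> dependentsOf) {
        List<String> chain = new ArrayList<>(List.of(start));
        String current = start;
        while (true) {
            Set<String> downstream = dependentsOf.getOrDefault(current, Set.of());
            if (downstream.size() != 1) break; // current feeds several targets (or none)
            String next = downstream.iterator().next();
            if (!inputsOf.getOrDefault(next, Set.of()).equals(Set.of(current)))
                break; // next has other inputs, so another evaluator may get there first
            chain.add(next);
            current = next;
        }
        return chain;
    }
}
```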
Implementation concerns
Most changes to implement this proposal should happen in EvaluatorCore and GroupEvaluator.
In these files, "terminals" correspond to "targets" most of the time (and maybe workers too?), except if users run a command. The command runs last in that case, preceded by all the targets it depends on (IIUC).
Right now, the way targets are computed doesn't lend itself easily to these changes.
Currently, once an evaluator has a topologically-ordered list of targets to compute, it basically blindly throws futures for each of them at an ExecutionContext all at once. Each of these futures is handed the target inputs, and is in charge of checking whether a cached result can be read and is valid. With fine-grained parallelism, we need to hold a lock prior to computing targets, and we might decide to favor targets we can immediately lock over those we can't.
The evaluator should instead do something along these lines:
keep former target results and hashes in a map in memory
update this map as much as it can with valid cached results
find targets that can be computed (all their inputs are in the map)
proceed with the steps above to lock them and compute them
This needs to be done iteratively, until all requested targets are computed.
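The iterative loop above can be sketched like this. It's a toy single-threaded version where `compute` stands in for actual target evaluation, and where the comment marks the spot the locking round would occupy:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class Loop {
    static Map<String, Integer> evaluateAll(Map<String, Set<String>> inputsOf,
                                            Function<String, Integer> compute) {
        Map<String, Integer> results = new HashMap<>(); // former results / hashes live here
        while (results.size() < inputsOf.size()) {
            // Find targets whose inputs are all in the map already
            List<String> ready = new ArrayList<>();
            for (String t : inputsOf.keySet())
                if (!results.containsKey(t) && results.keySet().containsAll(inputsOf.get(t)))
                    ready.add(t);
            if (ready.isEmpty())
                throw new IllegalStateException("cycle, or input that can never be computed");
            // In the real evaluator, this is where the out/mill-evaluators locking
            // round would claim (a subset of) the ready targets before computing them
            for (String t : ready) results.put(t, compute.apply(t));
        }
        return results;
    }
}
```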
Once we complete the process of locking the out folder for evaluation, this probably is no longer such a pressing need:
Rather than instability, we get blocked terminals, but all commands should eventually complete one after the other with a correct result
That is the status quo for other tools like Bazel, and despite the obvious downsides (e.g. the terminal being blocked during IDE re-indexing) it works acceptably even for pretty large codebases, builds, and teams with pretty complex workflows, at much larger scales than I would expect Mill to be used.
Concurrency becomes a nice to have, but not a blocker for wide adoption
If we do this, we should reconsider the multi-process model and see if we can move stuff into a single process. The original multi-process model was more of a convenient shortcut than intentional design, and we want to avoid implementing complex inter-process coordination via intricate file locking protocols:
The various locks will be much easier to work with and much more performant than dealing with file locks. I suspect the performance overhead of aggressively juggling file locks may be prohibitive
File locks can behave surprisingly in complex systems, e.g. two threads in a single process can take the same file lock at the same time, violating mutual exclusion
File locks give up a ton of useful concurrency primitives: blocking queues, semaphores, countdown latches, RW locks, etc. that may be convenient/useful/necessary to implement our locking protocol
We lose some isolation benefits of multi-processing, but that might be the right tradeoff for the benefits above
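As a concrete illustration of the surprising semantics: OS-level file locks are held on behalf of a whole process, not a thread, so the JVM refuses a second overlapping lock attempt from the same process with OverlappingFileLockException rather than providing thread-level mutual exclusion:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockDemo {
    // Shows that a second overlapping lock attempt from the same JVM is rejected
    // outright rather than blocking like an in-memory mutex would.
    static boolean secondLockRejected() throws IOException {
        Path f = Files.createTempFile("demo", ".lock");
        try (FileChannel ch = FileChannel.open(f, StandardOpenOption.WRITE);
             FileLock first = ch.lock()) {
            try {
                ch.tryLock(); // same process, same region
                return false;
            } catch (OverlappingFileLockException e) {
                return true; // the JVM guards against the per-process OS semantics
            }
        }
    }
}
```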
We need to consider the recursive multi-stage nature of Mill's evaluation model in more detail. e.g.
We probably should only allow evaluations to happen at one level of the meta-build at a time, e.g. no parallel evaluations for meta-level-0 and meta-level-1
Within each meta-build level, there are three phases: Compilation, Resolution, and Evaluation.
Compilation is basically running the phases of the previous meta-level, and as mentioned probably shouldn't be parallel
Resolution probably can be parallelizable: all this involves is lazily materializing portions of the Mill task graph. We have hit initialization deadlocks in the past and put a coarse-grained lock to mitigate them (BSP: importing occasionally gets stuck: mill seems deadlocked #3100), but IMO the deadlocks should be fixable even without locking
Evaluation, which is what most of the proposal above discusses
During evaluation
If we move evaluations into a single process, we can probably use in-memory locking much more simply than the fancy two-level out/mill-evaluators locking protocol proposed above
We not only need to prevent concurrent writes while allowing concurrent reads, but we also need to prevent reads running concurrently with writes:
A downstream task relying on an upstream task result being present may crash if the upstream task deletes its .dest/ folder and is in the process of recomputing it
This should be simple to implement using RW locks
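A minimal sketch of per-task read-write locks, assuming a single process (the class `TaskLocks` and its method names are hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

public class TaskLocks {
    private final Map<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String task) {
        return locks.computeIfAbsent(task, t -> new ReentrantReadWriteLock());
    }

    // Downstream tasks read an upstream result under the read lock:
    // many readers may proceed concurrently.
    <T> T reading(String upstreamTask, Supplier<T> body) {
        ReentrantReadWriteLock.ReadLock l = lockFor(upstreamTask).readLock();
        l.lock();
        try { return body.get(); } finally { l.unlock(); }
    }

    // A task wiping and recomputing its .dest/ takes the write lock,
    // excluding both readers and other writers.
    <T> T recomputing(String task, Supplier<T> body) {
        ReentrantReadWriteLock.WriteLock l = lockFor(task).writeLock();
        l.lock();
        try { return body.get(); } finally { l.unlock(); }
    }
}
```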
We need to figure out how "far upstream" we need to lock when evaluating a task:
Locking all transitive upstream tasks with a read lock is correct, but possibly overly conservative
We may be able to do better by keeping track of which PathRefs the evaluating task depends upon, and only locking access to those transitive tasks
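Computing the set of tasks covered by the conservative rule is a simple graph traversal (a sketch; `inputsOf` maps each task to its direct inputs):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Upstream {
    // All tasks reachable through the inputs of `task`,
    // i.e. the conservative read-lock set.
    static Set<String> transitiveUpstream(String task, Map<String, Set<String>> inputsOf) {
        Set<String> seen = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>(inputsOf.getOrDefault(task, Set.of()));
        while (!todo.isEmpty()) {
            String t = todo.pop();
            if (seen.add(t)) todo.addAll(inputsOf.getOrDefault(t, Set.of()));
        }
        return seen;
    }
}
```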
From a timing perspective, I'd expect that if we do end up doing this, it would be next year (2025) at earliest.
It would be a pretty big overhaul, and probably would need to come after Scala 3 support lands later this year (end-2024).
Probably should come after 0.13.0, since I expect it will take quite a while (months?) to experiment/implement/stabilize, and I don't want to hold up Scala 3 support.
That would likely mean it lands in the next breaking version after 0.13.0, i.e. 0.14.0, probably mid-to-late 2025
Another idea I had is something along the lines of what Alex suggested.
But instead of having many file locks, have just one, and use the tryLock(long position, long size, boolean shared) method to lock regions of a single lock file.
Those regions would represent tasks in the build.mill definition.
For example we could have a file mill_tasks.lock:
```
0 -> app.compile
1 -> app.run
... etc. for every task
```
Then, acquiring tryLock(0, 1, false) would mark app.compile as busy, and others should wait.
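A sketch of that region-locking idea with java.nio's FileChannel.tryLock (`claimTask` is a hypothetical helper; region locks may extend beyond the end of the file, so the lock file doesn't even need content):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

public class RegionLocks {
    // Each task owns one byte of the shared lock file; claiming that byte
    // exclusively marks the task as busy. Returns null if someone else has it.
    static FileLock claimTask(FileChannel lockFile, int taskIndex) throws IOException {
        try {
            return lockFile.tryLock(taskIndex, 1, /* shared = */ false);
        } catch (OverlappingFileLockException e) {
            return null; // already claimed from within this JVM
        }
    }
}
```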
This way we wouldn't have a client at all, just one big Mill JAR/process.
It would simplify things a lot, especially around terminals etc., which are very fiddly, and client-server comms bugs...
The performance would be "questionable", but maybe good enough.
We could explore new JDK things like class-data-sharing https://docs.oracle.com/en/java/javase/21/vm/class-data-sharing.html (maybe a special "prepare" task that'd make a cache for class loading)
and other things from project Leyden https://openjdk.org/projects/leyden/
Played a bit with CDS and AOT class loading.. Not much of an improvement, or I am missing something here.. 😅
Maybe the bottleneck isn't classloading but something else.
Mill server has some really good caching.. 🚀
CDS and AOT class loading playground
VANILLA JDK 17

```
PS D:\projects\forks\mill> hyperfine "mill -i __.compile"
Benchmark 1: mill -i __.compile
  Time (mean ± σ):     17.168 s ±  2.322 s    [User: 36.820 s, System: 6.242 s]
  Range (min … max):   12.062 s … 20.865 s    10 runs

PS D:\projects\forks\mill> hyperfine "mill __.compile"
Benchmark 1: mill __.compile
  Time (mean ± σ):      6.320 s ±  2.374 s    [User: 3.019 s, System: 0.506 s]
  Range (min … max):    4.988 s … 12.780 s    10 runs
```

APP CLASS DATA SHARE JDK 17

```
# make an archive (~40MB !)
java -XX:ArchiveClassesAtExit=mill.jsa -cp C:\Users\sake_\.mill\download\0.12.4-23-2ff492.bat mill.runner.MillMain D:\projects\forks\mill __.compile

# export an env var for mill to use this archive
$env:JAVA_OPTS="-XX:SharedArchiveFile=mill.jsa -Xlog:cds,class+path"

PS D:\projects\forks\mill> hyperfine "mill -i __.compile"
Benchmark 1: mill -i __.compile
  Time (mean ± σ):     17.307 s ±  1.626 s    [User: 36.537 s, System: 6.446 s]
  Range (min … max):   14.006 s … 19.719 s    10 runs
```
Added a small client-server protocol, with upickle MessagePack. This allows two-way comms between the two. For example, the server can tell the client to run an interactive subprocess as Alex suggested.
We could use it for watch mode too, server could just send a message to client, so it restarts the process it was running.
Added in-memory per-task locks, so independent tasks can run.. independently. :D
Added a small benchmark for jvm client vs native.
Native is ~4x faster!
Seems like with this approach we could completely remove the -i mode.
The choice is moved into the build itself, which feels much more natural.
I don't remember ever running Maven/Gradle/sbt in a special mode just to have terminal input working.
Let me know if I am missing something here, still learning about Mill and this stuff to be honest. :)