I'd like to propose to introduce fine-grained parallelization of tasks in Mill.
Goals
Make using a BSP server alongside running Mill commands in the terminal more stable
Right now, using Mill BSP from Metals and running Mill commands in watch mode at the same time often results in instabilities, with both commands trying to compile the same thing at the same time and wiping each other's compile output. This is especially a problem with meta-builds, with one Mill instance sometimes not able to load what it just compiled, and crashing with NoClassDefFoundError exceptions.
Fine-grained parallelism would ensure that whatever the BSP server computes doesn't conflict with commands users run on the side.
Make running several Mill commands in the terminal more stable
One runs into the same kind of problem as above when running several commands in watch mode at the same time.
Proposal
How evaluators lock targets to compute them
Mill evaluators acquire locks on targets via a file like out/mill-evaluators/evaluator-$(id)-targets. Each line in this file is a target name, like foo.compile.
In order to pick targets to compute, an evaluator:
computes as many input hashes as it can, reads cached results on disk and retains only non-stale ones, so that it can determine which targets need to and can be computed immediately:
targets that don't have inputs
targets whose inputs are cached or already computed
acquires a lock on out/mill-evaluators, and while it holds it:
reads the targets other Mill evaluators have locked in their out/mill-evaluators/evaluator-*-targets files
picks targets that aren't locked
writes these targets in its out/mill-evaluators/evaluator-$(id)-targets to lock them
releases the lock on out/mill-evaluators
(Locks on out/mill-evaluators are short-lived overall.)
The evaluator can then compute its locked targets and write their results on disk.
Once it's done, it repeats that process, updating the list of targets that need and can be computed (other evaluators might have written new cached results on disk), and clearing / updating its own locked target list in out/mill-evaluators/evaluator-$(id)-targets.
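The locking round above could look roughly like this. This is a minimal single-round sketch; the class name `TargetPicker`, the method `pickTargets`, and the `lock` file used to lock the `out/mill-evaluators` directory are all illustrative, and error handling is elided:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class TargetPicker {
    // Claim ready-to-compute targets not already held by another evaluator.
    static Set<String> pickTargets(Path evaluatorsDir, String id, Set<String> ready)
            throws IOException {
        Path dirLockFile = evaluatorsDir.resolve("lock"); // hypothetical lock-file name
        try (FileChannel ch = FileChannel.open(dirLockFile,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock dirLock = ch.lock()) { // short-lived lock on out/mill-evaluators
            // Read the targets other evaluators have locked in their files
            Set<String> taken = new HashSet<>();
            String own = "evaluator-" + id + "-targets";
            try (DirectoryStream<Path> files =
                     Files.newDirectoryStream(evaluatorsDir, "evaluator-*-targets")) {
                for (Path f : files)
                    if (!f.getFileName().toString().equals(own))
                        taken.addAll(Files.readAllLines(f));
            }
            // Pick the unlocked ones and write them to our own file to lock them
            Set<String> picked = new TreeSet<>(ready);
            picked.removeAll(taken);
            Files.write(evaluatorsDir.resolve(own), picked);
            return picked;
        } // the lock on out/mill-evaluators is released here
    }
}
```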
Check for the aliveness of other evaluators
An out/mill-evaluators/evaluator-$(id)-targets file might be stale: it may have been written by an evaluator whose process ended abruptly. Other evaluators need to detect, ignore, and delete such files.
Throughout its lifetime, an evaluator needs to hold a lock on a file like out/mill-evaluators/evaluator-$(id)-active.
When it holds a lock on out/mill-evaluators, if an evaluator cannot find targets to compute (all are locked by other evaluators), it checks for the aliveness of evaluators that hold targets it needs. It does so by trying to acquire a lock on the out/mill-evaluators/evaluator-$(id)-active file of the other evaluator. If it succeeds, that means the other evaluator isn't alive, and its out/mill-evaluators/evaluator-$(id)-{targets,active} files can be removed.
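A sketch of that reaping step, under the same hypothetical file layout (`reapIfDead` is an illustrative name; in the real protocol this would run while holding the lock on out/mill-evaluators):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class Reaper {
    // Returns true if the other evaluator was dead and its files were cleaned up.
    static boolean reapIfDead(Path evaluatorsDir, String otherId) throws IOException {
        Path active = evaluatorsDir.resolve("evaluator-" + otherId + "-active");
        try (FileChannel ch = FileChannel.open(active,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = ch.tryLock();
            if (lock == null) return false; // lock still held: the evaluator is alive
            lock.release();
            // Nobody holds the -active lock, so the owner died: remove its stale files
            Files.deleteIfExists(active);
            Files.deleteIfExists(evaluatorsDir.resolve("evaluator-" + otherId + "-targets"));
            return true;
        }
    }
}
```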
Writing results in an atomic fashion
In the steps above, we assumed cached results can be read without holding a lock. If a cached result file is large, it might be read partially while it's being written.
To avoid that, cached results can first be written to a temporary file (say, the same path with a leading . and a .part extension), then moved atomically to their final destination.
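For instance, with java.nio (note that whether an atomic move may replace an existing target file is platform-dependent; on POSIX filesystems a rename replaces it):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicWrite {
    static void writeAtomically(Path dest, byte[] result) throws IOException {
        // Hidden sibling: same directory, "." prefix and ".part" extension
        Path tmp = dest.resolveSibling("." + dest.getFileName() + ".part");
        Files.write(tmp, result);
        // Readers see either the old content or the new one, never a partial file
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    }
}
```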
Linear chains of targets
If a computable target is the input of exactly one other target, and is that target's sole input, the downstream target can be locked and computed at the same time. This can be done recursively for the new target.
Indeed, given that the second (and maybe following) target can only be computed after the first one, no other evaluator can attempt to compute it earlier than the current one. The current one might as well proceed with it, possibly lowering the number of iterations needed to compute all requested targets.
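The chain-extension rule can be sketched as a pure function over the dependency graph (a hypothetical helper; `inputsOf` maps a target to its inputs, `dependentsOf` to its downstream targets):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Chains {
    // Extend a computable target into the longest linear chain hanging off it.
    static List<String> linearChain(String start,
                                    Map<String, Set<String>> inputsOf,
                                    Map<String, Set<String>> dependentsOf) {
        List<String> chain = new ArrayList<>(List.of(start));
        String current = start;
        while (true) {
            Set<String> downstream = dependentsOf.getOrDefault(current, Set.of());
            if (downstream.size() != 1) break; // current feeds several targets (or none)
            String next = downstream.iterator().next();
            if (!inputsOf.getOrDefault(next, Set.of()).equals(Set.of(current)))
                break; // next has other inputs, so another evaluator may get there first
            chain.add(next);
            current = next;
        }
        return chain;
    }
}
```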
Implementation concerns
Most changes to implement this proposal should happen in EvaluatorCore and GroupEvaluator.
In these files, "terminals" correspond to "targets" most of the time (and maybe workers too?), except if users run a command. The command runs last in that case, preceded by all the targets it depends on (IIUC).
Right now, the way targets are computed doesn't lend itself easily to these changes.
Currently, once an evaluator has a topologically-ordered list of targets to compute, it basically blindly throws futures for each of them at an ExecutionContext all at once. Each of these futures is handed the target inputs, and is in charge of checking whether a cached result can be read and is valid. With fine-grained parallelism, we need to hold a lock prior to computing targets, and we might decide to favor targets we can immediately lock over those we can't.
The evaluator should instead do something along these lines:
keep former target results and hashes in a map in memory
update this map as much as it can with valid cached results
find targets that can be computed (all their inputs are in the map)
proceed with the steps above to lock them and compute them
This needs to be done iteratively, until all requested targets are computed.
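The iterative loop above can be sketched like this. It's a toy single-threaded version where `compute` stands in for actual target evaluation, and where the comment marks the spot the locking round would occupy:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class Loop {
    static Map<String, Integer> evaluateAll(Map<String, Set<String>> inputsOf,
                                            Function<String, Integer> compute) {
        Map<String, Integer> results = new HashMap<>(); // former results / hashes live here
        while (results.size() < inputsOf.size()) {
            // Find targets whose inputs are all in the map already
            List<String> ready = new ArrayList<>();
            for (String t : inputsOf.keySet())
                if (!results.containsKey(t) && results.keySet().containsAll(inputsOf.get(t)))
                    ready.add(t);
            if (ready.isEmpty())
                throw new IllegalStateException("cycle, or input that can never be computed");
            // In the real evaluator, this is where the out/mill-evaluators locking
            // round would claim (a subset of) the ready targets before computing them
            for (String t : ready) results.put(t, compute.apply(t));
        }
        return results;
    }
}
```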
Once we complete the process of locking the out folder for evaluation, this probably is no longer such a pressing need:
Rather than instability, we get blocked terminals, but all commands should eventually complete one after the other with a correct result
That is the status quo for other tools like Bazel, and despite the obvious downsides (e.g. the terminal being blocked during IDE re-indexing) it works acceptably even for pretty large codebases, builds, and teams with pretty complex workflows, at much larger scales than I would expect Mill to be used.
Concurrency becomes a nice to have, but not a blocker for wide adoption
If we do this, we should reconsider the multi-process model and see if we can move stuff into a single process. The original multi-process model was more of a convenient shortcut than intentional design, and we want to avoid implementing complex inter-process coordination via intricate file locking protocols:
The various locks will be much easier to work with and much more performant than dealing with file locks. I suspect the performance overhead of aggressively juggling file locks may be prohibitive
File locks can behave surprisingly in complex systems, e.g. two threads in a single process can take the same file lock at the same time, violating mutual exclusion
File locks give up a ton of useful concurrency primitives: blocking queues, semaphores, countdown latches, RW locks, etc. that may be convenient/useful/necessary to implement our locking protocol
We lose some isolation benefits of multi-processing, but that might be the right tradeoff for the benefits above
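As a concrete illustration of the surprising semantics: OS-level file locks are held on behalf of a whole process, not a thread, so the JVM refuses a second overlapping lock attempt from the same process with OverlappingFileLockException rather than providing thread-level mutual exclusion:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockDemo {
    // Shows that a second overlapping lock attempt from the same JVM is rejected
    // outright rather than blocking like an in-memory mutex would.
    static boolean secondLockRejected() throws IOException {
        Path f = Files.createTempFile("demo", ".lock");
        try (FileChannel ch = FileChannel.open(f, StandardOpenOption.WRITE);
             FileLock first = ch.lock()) {
            try {
                ch.tryLock(); // same process, same region
                return false;
            } catch (OverlappingFileLockException e) {
                return true; // the JVM guards against the per-process OS semantics
            }
        }
    }
}
```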
We need to consider the recursive multi-stage nature of Mill's evaluation model in more detail. e.g.
We probably should only allow evaluations to happen at one level of the meta-build at a time, e.g. no parallel evaluations for meta-level-0 and meta-level-1
Within each meta-build level, there are three phases: Compilation, Resolution, and Evaluation.
Compilation is basically running the phases of the previous meta-level, and as mentioned probably shouldn't be parallel
Resolution probably can be parallelizable: all this involves is lazily materializing portions of the Mill task graph. We have hit initialization deadlocks in the past and put a coarse-grained lock to mitigate them (BSP: importing occasionally gets stuck: mill seems deadlocked #3100), but IMO the deadlocks should be fixable even without locking
Evaluation, which is what most of the proposal above discusses
During evaluation
If we move evaluations into a single process, we can probably use in-memory locking much more simply than the fancy two-level out/mill-evaluators locking protocol proposed above
We not only need to prevent concurrent writes while allowing concurrent reads, but we also need to prevent reads running concurrently with writes:
A downstream task relying on an upstream task result being present may crash if the upstream task deletes its .dest/ folder and is in the process of recomputing it
This should be simple to implement using RW locks
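A minimal sketch of per-task read-write locks, assuming a single process (the class `TaskLocks` and its method names are hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

public class TaskLocks {
    private final Map<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String task) {
        return locks.computeIfAbsent(task, t -> new ReentrantReadWriteLock());
    }

    // Downstream tasks read an upstream result under the read lock:
    // many readers may proceed concurrently.
    <T> T reading(String upstreamTask, Supplier<T> body) {
        ReentrantReadWriteLock.ReadLock l = lockFor(upstreamTask).readLock();
        l.lock();
        try { return body.get(); } finally { l.unlock(); }
    }

    // A task wiping and recomputing its .dest/ takes the write lock,
    // excluding both readers and other writers.
    <T> T recomputing(String task, Supplier<T> body) {
        ReentrantReadWriteLock.WriteLock l = lockFor(task).writeLock();
        l.lock();
        try { return body.get(); } finally { l.unlock(); }
    }
}
```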
We need to figure out how "far upstream" we need to lock when evaluating a task:
Locking all transitive upstream tasks with a read lock is correct, but possibly overly conservative
We may be able to do better by keeping track of which PathRefs the evaluating task depends upon, and only locking access to those transitive tasks
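Computing the set of tasks covered by the conservative rule is a simple graph traversal (a sketch; `inputsOf` maps each task to its direct inputs):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Upstream {
    // All tasks reachable through the inputs of `task`,
    // i.e. the conservative read-lock set.
    static Set<String> transitiveUpstream(String task, Map<String, Set<String>> inputsOf) {
        Set<String> seen = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>(inputsOf.getOrDefault(task, Set.of()));
        while (!todo.isEmpty()) {
            String t = todo.pop();
            if (seen.add(t)) todo.addAll(inputsOf.getOrDefault(t, Set.of()));
        }
        return seen;
    }
}
```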
From a timing perspective, I'd expect that if we do end up doing this, it would be next year (2025) at earliest.
It would be a pretty big overhaul, and probably would need to come after Scala 3 support lands later this year (end-2024).
Probably should come after 0.13.0, since I expect it will take quite a while (months?) to experiment/implement/stabilize, and I don't want to hold up Scala 3 support.
That would likely mean it lands in the next breaking version after 0.13.0, i.e. 0.14.0, probably mid-to-late 2025
Another idea I had is something along the lines of what Alex suggested.
But instead of having many file locks, have just one, and use the tryLock(long position, long size, boolean shared) method to lock regions of a single lock file.
Those regions would represent tasks in the build.mill definition.
For example we could have a file mill_tasks.lock:
```
0 -> app.compile
1 -> app.run
... etc. for every task
```
Then, acquiring tryLock(0, 1, false) would mark app.compile as busy, and others should wait.
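A sketch of that region-locking idea with java.nio's FileChannel.tryLock (`claimTask` is a hypothetical helper; region locks may extend beyond the end of the file, so the lock file doesn't even need content):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

public class RegionLocks {
    // Each task owns one byte of the shared lock file; claiming that byte
    // exclusively marks the task as busy. Returns null if someone else has it.
    static FileLock claimTask(FileChannel lockFile, int taskIndex) throws IOException {
        try {
            return lockFile.tryLock(taskIndex, 1, /* shared = */ false);
        } catch (OverlappingFileLockException e) {
            return null; // already claimed from within this JVM
        }
    }
}
```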
This way we wouldn't have a client at all, just one big Mill JAR/process.
It would simplify things a lot, especially around terminals etc., which are very fiddly, and client-server comms bugs...
The performance would be "questionable", but maybe good enough.
We could explore new JDK things like class-data-sharing https://docs.oracle.com/en/java/javase/21/vm/class-data-sharing.html (maybe a special "prepare" task that'd make a cache for class loading)
and other things from project Leyden https://openjdk.org/projects/leyden/
Played a bit with CDS and AOT class loading.. Not much of an improvement, or I am missing something here.. 😅
Maybe the bottleneck isn't classloading but something else.
Mill server has some really good caching.. 🚀
CDS and AOT class loading playground
VANILLA JDK 17

```
PS D:\projects\forks\mill> hyperfine "mill -i __.compile"
Benchmark 1: mill -i __.compile
  Time (mean ± σ):     17.168 s ±  2.322 s    [User: 36.820 s, System: 6.242 s]
  Range (min … max):   12.062 s … 20.865 s    10 runs

PS D:\projects\forks\mill> hyperfine "mill __.compile"
Benchmark 1: mill __.compile
  Time (mean ± σ):      6.320 s ±  2.374 s    [User: 3.019 s, System: 0.506 s]
  Range (min … max):    4.988 s … 12.780 s    10 runs
```

APP CLASS DATA SHARE JDK 17

```
# make an archive (~40MB !)
java -XX:ArchiveClassesAtExit=mill.jsa -cp C:\Users\sake_\.mill\download\0.12.4-23-2ff492.bat mill.runner.MillMain D:\projects\forks\mill __.compile

# export an env var for mill to use this archive
$env:JAVA_OPTS="-XX:SharedArchiveFile=mill.jsa -Xlog:cds,class+path"

PS D:\projects\forks\mill> hyperfine "mill -i __.compile"
Benchmark 1: mill -i __.compile
  Time (mean ± σ):     17.307 s ±  1.626 s    [User: 36.537 s, System: 6.446 s]
  Range (min … max):   14.006 s … 19.719 s    10 runs
```
Added a small client-server protocol, with upickle MessagePack. This allows two-way comms between the two. For example, the server can tell the client to run an interactive subprocess as Alex suggested.
We could use it for watch mode too, server could just send a message to client, so it restarts the process it was running.
Added in-memory per-task locks, so independent tasks can run.. independently. :D
Added a small benchmark for jvm client vs native.
Native is ~4x faster!
Seems like with this approach we could completely remove the -i mode.
The choice is moved into the build itself, which feels much more natural.
I don't remember ever running Maven/Gradle/sbt in a special mode just to have terminal input working.
Let me know if I am missing something here, still learning about Mill and this stuff to be honest. :)