[DRAFT] Support for MPI apps #2905
Conversation
Nice! This has the general functionality I'm looking for, and I can see myself working with the Parsl-generated mpirun invocations. Would you also mind storing the nodelist in an environment variable so I can still build my own mpirun? I think that's necessary, in general, because we haven't enumerated "all" of the possible launchers and some codes (e.g., DeepSpeed, Gaussian) use their own. I think I understand how things work on the internals and from a user perspective, but would you mind writing those docs before I dig more into reviewing? |
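A sketch of what that request would enable, assuming a hypothetical `PARSL_MPI_NODELIST` environment variable (the name is an assumption, not a confirmed part of this PR) holding a comma-separated list of the nodes provisioned for the task:

```python
import os
import subprocess

def launch_with_own_mpirun(app_binary: str) -> subprocess.CompletedProcess:
    """Build an mpirun invocation by hand from a nodelist environment
    variable. PARSL_MPI_NODELIST is a hypothetical name for the
    comma-separated hostname list requested in the comment above."""
    hosts = os.environ["PARSL_MPI_NODELIST"].split(",")
    cmd = [
        "mpirun",
        "-n", str(len(hosts)),       # one rank per node, for simplicity
        "-hosts", ",".join(hosts),   # MPICH/Hydra-style host list flag
        app_binary,
    ]
    return subprocess.run(cmd, check=True)
```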
@WardLT from my perspective, I'm mostly interested in trying to use this branch for something like a real application - it's hard, at least for me, to understand the usability/unusability without concrete applications. |
@yadudoc , do you have any prototype applications sketched out? |
I don't understand from a brief read of the code what makes the code that tries launch commands in order fail over from the first one - the failure conditions aren't really clear to me: for example, what makes the first launch command count as having failed? |
Launchers in parsl so far have been plugin-style configurable (i.e. you can supply arbitrary out-of-parsl codebase Launcher objects), and I feel like they should be here too: both because I think users should be able to plug in their own (that they wrote themselves or that someone else supplied) and because I think trying to do magic autodetect is probably wrong here. |
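For reference, the plugin pattern being described looks roughly like the sketch below — a hypothetical user-supplied launcher, assuming the base class lives at `parsl.launchers.launchers.Launcher` and uses Parsl's usual callable signature (the exact path and signature may differ across versions):

```python
from parsl.launchers.launchers import Launcher

class MyCustomLauncher(Launcher):
    """Hypothetical out-of-tree launcher: wraps the worker launch
    command with a code-specific multi-node starter instead of
    mpirun/srun/aprun."""

    def __call__(self, command: str, tasks_per_node: int, nodes_per_block: int) -> str:
        # Return a shell snippet that starts `command` across the block.
        # `my_starter` is a stand-in for a code-specific launcher binary.
        return (f"my_starter --nodes {nodes_per_block} "
                f"--per-node {tasks_per_node} /bin/bash -c '{command}'")
```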
@WardLT You're right, I should've started with the docs. I copied over the current MPI apps doc and updated it. Here's a quick link -> https://github.com/Parsl/parsl/blob/mpi_experimental_1/docs/userguide/mpi_apps_updated.rst I do not have any real apps to test with. Here are the things I think we need to confirm: |
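For orientation, the interface sketched in that doc draft looks roughly like the example below. The `parsl_resource_specification` keyword, its key names, and the `$PARSL_MPI_PREFIX` variable are assumptions here — the exact names were still in flux at this stage of the branch:

```python
from parsl import bash_app

@bash_app
def mpi_hello(parsl_resource_specification=None):
    # Assumed behavior: the worker composes an mpiexec/mpirun/srun/aprun
    # prefix from the resource specification and exposes it to the app.
    return '$PARSL_MPI_PREFIX hostname'

# Assumes parsl.load(...) has been called with an executor from this branch.
future = mpi_hello(
    parsl_resource_specification={'num_nodes': 2, 'ranks_per_node': 4}
)
```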
@benclifford You are right about real applications; I could really use a real-ish application to test with. Without a good data model and validation for resource_specification, the prefix being composed can be junk if the resource_specification is incorrect. The alternative was to propagate the KeyError exceptions back to the user; maybe that is a better approach here until there's some validation. |
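One shape that validation could take — a minimal sketch, with a hypothetical required-key set standing in for the still-undecided data model — so users see a clear error rather than a bare KeyError or a junk prefix:

```python
# Hypothetical key set for illustration; the real schema was still
# undecided at this point in the discussion.
REQUIRED_KEYS = {"num_nodes", "ranks_per_node"}

class InvalidResourceSpecification(Exception):
    """Raised when a task's resource_specification cannot be used."""

def validate_resource_spec(spec: dict) -> None:
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        raise InvalidResourceSpecification(
            f"resource_specification is missing keys: {sorted(missing)}"
        )
    for key in REQUIRED_KEYS:
        if not isinstance(spec[key], int) or spec[key] < 1:
            raise InvalidResourceSpecification(
                f"{key} must be a positive integer, got {spec[key]!r}"
            )
```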
Force-pushed from 5556e62 to 1633f51:
* Serialize and ship resource_specification from the app
* Better support for MPI functions
* Manager to identify batch scheduler and available nodes in the current batch job
* Manager places tokens for each node in an MPQueue nodes_q
* Workers unpack tasks to get the resource_specification
* Workers provision nodes from the nodes_q and place ownership tokens into an inflight_q
* Worker clears its tokens from the inflight_q and returns node tokens to the nodes_q upon task completion
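A condensed sketch of that token protocol, mirroring the commit message's names (the PR uses an MPQueue for inflight tokens as well; a per-worker list stands in here for a single worker, and the real manager/worker code is more involved):

```python
from multiprocessing import Queue

def seed_nodes(nodes_q: Queue, hostnames: list) -> None:
    # Manager: one token per node discovered from the batch scheduler.
    for host in hostnames:
        nodes_q.put(host)

def acquire_nodes(nodes_q: Queue, inflight: list, num_nodes: int) -> list:
    # Worker: pop one token per requested node; holding a token in
    # `inflight` marks ownership for the duration of the task.
    tokens = [nodes_q.get() for _ in range(num_nodes)]
    inflight.extend(tokens)
    return tokens

def release_nodes(nodes_q: Queue, inflight: list) -> None:
    # Worker, on task completion: clear owned tokens and return them.
    while inflight:
        nodes_q.put(inflight.pop())

if __name__ == "__main__":
    q: Queue = Queue()
    seed_nodes(q, ["nid001", "nid002", "nid003", "nid004"])
    mine: list = []
    print(acquire_nodes(q, mine, 2))  # e.g. ['nid001', 'nid002']
    release_nodes(q, mine)
```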
Force-pushed from 8f4cc28 to c32b0bd.
This Draft is now obsolete with most of the MPI work happening over on #3016 |
Description
Parsl currently has very limited support for certain MPI use-cases. The current pilot job model assumes that workers are launched onto cores and that each worker stays bound to those cores for its walltime. However, MPI applications generally need to bind to a subset of nodes (nodes > 1) from the batch job. Since the pilot job model fails here, we end up recommending using the executor+provider combination without a launcher (sketched below), so that workers can use a multi-node launcher such as mpiexec/mpirun/srun/aprun to launch MPI apps.
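For concreteness, that workaround looks roughly like the sketch below, assuming a Slurm machine (parameter values are placeholders): SimpleLauncher skips the per-core worker launch, so the pool sits on the lead node of a multi-node allocation and each app issues its own mpiexec/srun.

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SimpleLauncher
from parsl.providers import SlurmProvider

# Sketch of the current workaround: SimpleLauncher places a single
# worker pool on the lead node of a multi-node Slurm allocation, and
# each app is then responsible for its own mpiexec/srun invocation.
config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_mpi_workaround",
            max_workers=1,  # one app at a time owns the whole allocation
            provider=SlurmProvider(
                nodes_per_block=4,
                launcher=SimpleLauncher(),  # no per-core worker binding
            ),
        )
    ]
)
```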
This gets us into these new issues:
The solution here is to have a combination of:
Fixes # (issue)
Type of change
Choose which options apply, and delete the ones which do not apply.