
[DRAFT] Support for MPI apps #2905

Closed
wants to merge 1 commit into from

Conversation

yadudoc
Member

@yadudoc yadudoc commented Oct 10, 2023

Description

Parsl currently has very limited support for certain MPI use-cases. The current pilot job model assumes that workers are launched onto cores and remain bound to them for their walltime. However, MPI applications generally need to bind to a subset of nodes (nodes > 1) from the batch job. Since the pilot job model fails here, we end up recommending the executor+provider combination without a launcher, so that workers can use a multi-node launcher such as mpiexec/mpirun/srun/aprun to launch MPI apps. This leads to new issues:

*  Apps end up hardcoding launcher prefixes into bash_apps, which makes the apps machine-specific. Ideally, apps are machine-agnostic.
*  Apps might have varying resource requirements that are not easy to support. If the machine uses `mpiexec`, the hosts on which to launch the MPI app must be specified, which means the runtime has to know all available nodes and each app's needs, and do some internal resource bookkeeping to avoid overlapping placements.

The solution here is to have a combination of:

  1. Support for MPI-specific resource specifications at the app level and the worker level
  2. Identifying available resources at the manager level and allowing workers to share them based on resource requirements
  3. Generating MPI launcher prefixes from 2. that specify the appropriate resources, and making them available via environment variables
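To make step 3 concrete, here is a minimal sketch (not the PR's actual code) of composing an `mpiexec` launch prefix from an app-level resource specification and the nodes the manager assigned. The key names `num_nodes` and `ranks_per_node` follow the bare-minimum model described below; the flag choices assume Hydra-style `mpiexec` options.

```python
def compose_mpiexec_prefix(resource_spec, assigned_nodes):
    """Build an mpiexec prefix binding an MPI app to its assigned nodes.

    resource_spec keys (num_nodes, ranks_per_node) are the minimal model
    from this draft; assigned_nodes comes from the manager's bookkeeping.
    """
    ranks_per_node = resource_spec["ranks_per_node"]
    num_ranks = resource_spec["num_nodes"] * ranks_per_node
    hosts = ",".join(assigned_nodes)
    return f"mpiexec -n {num_ranks} -ppn {ranks_per_node} -hosts {hosts}"

prefix = compose_mpiexec_prefix({"num_nodes": 2, "ranks_per_node": 4},
                                ["nid001", "nid002"])
# prefix == "mpiexec -n 8 -ppn 4 -hosts nid001,nid002"
```

The worker would export a string like this via an environment variable so the bash_app can prepend it to its command line without hardcoding machine details.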

Fixes # (issue)

Type of change


  • New feature (non-breaking change that adds functionality)

@yadudoc yadudoc requested review from benclifford and WardLT October 10, 2023 17:27
@WardLT
Contributor

WardLT commented Oct 10, 2023

Nice! This has the general functionality I'm looking for, and I can see myself working with the Parsl-generated mpirun invocations.

Would you also mind storing the nodelist in an environment variable so I can still build my own mpirun? I think that's necessary, in general, because we haven't enumerated "all" of the possible launchers and some codes (e.g., deepspeed, Gaussian) use their own.
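A minimal sketch of what building one's own launch command from such a variable might look like (the env var name `PARSL_MPI_NODELIST` and the helper name are placeholders, not anything this PR defines):

```python
import os

def build_custom_mpirun(executable, ranks_per_node):
    """Hand-build an mpirun command from a worker-exported nodelist.

    Assumes the worker exports PARSL_MPI_NODELIST as a comma-separated
    list of hostnames (a hypothetical name for illustration).
    """
    nodes = os.environ["PARSL_MPI_NODELIST"].split(",")
    total_ranks = len(nodes) * ranks_per_node
    return f"mpirun -np {total_ranks} -host {','.join(nodes)} {executable}"

# Simulate the variable a worker might export:
os.environ["PARSL_MPI_NODELIST"] = "nid001,nid002"
cmd = build_custom_mpirun("./my_solver", 4)
# cmd == "mpirun -np 8 -host nid001,nid002 ./my_solver"
```

Codes with their own launchers (deepspeed, Gaussian) could consume the same nodelist and ignore the generated prefix entirely.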

I think I understand how things work on the internals and from a user perspective, but would you mind writing those docs before I dig more into reviewing?

@benclifford
Collaborator

@WardLT from my perspective, I'm mostly interested in trying to use this branch for something like a real application - it's hard, at least for me, to understand the usability/unusability without concrete applications.

@WardLT
Contributor

WardLT commented Oct 10, 2023

@yadudoc , do you have any prototype applications sketched out?

@benclifford
Collaborator

> Would you also mind storing the nodelist in an environment variable so I can still build my own mpirun? I think that's necessary, in general, because we haven't enumerated "all" of the possible launchers and some codes (e.g., deepspeed, Gaussian) use their own.

I don't understand from a brief read of the code what makes the code that tries launch commands in order fail over past the first one - the failure conditions aren't clear to me: for example, what makes compose_aprun_launch_cmd, the first launcher tried, fail?

Launchers in parsl so far have been plugin-style configurable (i.e. you can supply arbitrary out-of-parsl codebase Launcher objects), and I feel like they should be here too: both because I think users should be able to plug in their own (that they wrote themselves or that someone else supplied) and because I think trying to do magic autodetect is probably wrong here.
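The plugin style described above might look something like this sketch (the base-class shape and method signature are assumptions for this draft, not Parsl's actual Launcher interface, which wraps whole worker commands rather than individual MPI apps):

```python
class MPILauncherBase:
    """Hypothetical base class: users subclass this and pass an instance
    in configuration, instead of the runtime autodetecting a launcher."""

    def compose(self, command, resource_spec, nodes):
        raise NotImplementedError

class SrunMPILauncher(MPILauncherBase):
    """One possible built-in: compose an srun prefix for Slurm machines."""

    def compose(self, command, resource_spec, nodes):
        ranks = resource_spec["num_nodes"] * resource_spec["ranks_per_node"]
        return (f"srun --ntasks={ranks} "
                f"--ntasks-per-node={resource_spec['ranks_per_node']} "
                f"--nodelist={','.join(nodes)} {command}")

launch_cmd = SrunMPILauncher().compose(
    "./app", {"num_nodes": 2, "ranks_per_node": 4}, ["n1", "n2"])
```

A user-supplied launcher is then just another subclass, so no ordered aprun-then-srun-then-mpiexec autodetection is needed.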

@yadudoc
Member Author

yadudoc commented Oct 11, 2023

@WardLT You're right, I should've started with the docs. I copied over the current MPI apps doc and updated it. Here's a quick link -> https://github.com/Parsl/parsl/blob/mpi_experimental_1/docs/userguide/mpi_apps_updated.rst

I do not have any real apps to test with; I've basically been doing `mpiexec ... <hostname>` and using a very bare-bones MPI app that prints how the ranks are distributed, to confirm that the launcher prefix is working properly.

Here are the things I think we need to confirm:

  1. Right now we are only really using NUM_NODES and RANKS_PER_NODE, which are the absolute bare minimum that every launcher supports. I do not yet know what other features we'd like. I'm copying the model that Balsam uses from here: https://balsam.readthedocs.io/en/latest/user-guide/jobs/#defining-compute-resources, but just enough for an MVP.
  2. The new capability is to run multiple MPI apps with different node requirements. I do not have a good test-case for this beyond my toy app.
  3. Handling of failures at the launcher step is untested. Assume we'll lose nodes.
  4. There are likely unexpected interactions from default scheduler options in the provider that will set ranks per node to 1

@yadudoc
Copy link
Member Author

yadudoc commented Oct 11, 2023

@benclifford You are right about real applications, I could really use a real-ish application to test with.

Without a good data model for resource_specification and validation, the composed prefix can be junk if the resource_specification is incorrect. The alternative was to propagate the KeyError exceptions back to the user; maybe that is a better approach until there's some validation.

…erialize and ship resource_specification from the app

* Better support for MPI functions
* Manager to identify batch scheduler and available nodes in current batch job.
* Manager places tokens for each node in an MPQueue nodes_q
* Workers unpack tasks to get resource_specification
* Workers provision nodes from the nodes_q and place ownership tokens into an inflight_q
* Worker clears its tokens from the inflight_q and pushes node tokens back into the nodes_q upon task completion
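The token bookkeeping in the commit message can be modeled in miniature like this (the real code uses multiprocessing queues shared across worker processes; a plain `queue.Queue` and these helper names are illustrative only):

```python
import queue

# Manager side: one ownership token per node in the batch job.
nodes_q = queue.Queue()
for node in ["nid001", "nid002", "nid003", "nid004"]:
    nodes_q.put(node)

def provision(num_nodes):
    """Worker side: take one token per requested node.

    Blocks until enough nodes are free, which prevents two workers
    from placing MPI apps on overlapping nodes.
    """
    return [nodes_q.get() for _ in range(num_nodes)]

def release(tokens):
    """Return ownership tokens when the MPI task completes."""
    for token in tokens:
        nodes_q.put(token)

held = provision(2)   # worker claims two nodes for an MPI task
release(held)         # ...and frees them on task completion
```

The inflight_q in the commit message adds a second ledger recording which worker currently holds which tokens, so lost workers can be detected and their nodes reclaimed.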
@yadudoc
Member Author

yadudoc commented Feb 5, 2024

This draft is now obsolete; most of the MPI work is happening over on #3016.

@yadudoc yadudoc closed this Feb 5, 2024
@yadudoc yadudoc deleted the mpi_experimental_1 branch October 17, 2024 16:11