
[DRAFT] Support for MPI apps #2905

Closed
wants to merge 1 commit into from

Conversation

yadudoc
Member

@yadudoc yadudoc commented Oct 10, 2023

Description

Parsl currently has very limited support for certain MPI use-cases. The current pilot job model assumes that workers are launched onto cores and remain bound to them for their walltime. However, MPI applications generally need to bind to a subset of nodes (nodes > 1) from the batch job. Since the pilot job model fails here, we end up recommending the executor+provider combination without a launcher, so that workers can use a multi-node launcher such as mpiexec/mpirun/srun/aprun to launch MPI apps. This leads to new issues:

*  Apps end up hardcoding launcher prefixes into bash_apps, which makes the apps machine-specific. Ideally, apps are machine-agnostic.
*  Apps might have varying resource requirements that are not easy to support. If the machine uses `mpiexec`, the hosts on which to launch the MPI app must be specified, which means the runtime has to know all available nodes and each app's needs, and do some internal resource bookkeeping to avoid overlapping placements.

The solution here is to have a combination of:

  1. Support for MPI-specific resource specifications at the app level and the worker level
  2. Identifying available resources at the manager level and allowing workers to share them based on resource requirements
  3. Generating MPI launcher prefixes from 2. that specify the appropriate resources, and making them available via environment variables
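To make step 3 concrete, here is a minimal sketch (not the PR's actual code) of composing an `mpiexec` launch prefix from an app-level resource specification and the nodes the manager assigned. The key names `num_nodes` and `ranks_per_node` follow the bare-minimum model described below; the flag choices assume Hydra-style `mpiexec` options.

```python
def compose_mpiexec_prefix(resource_spec, assigned_nodes):
    """Build an mpiexec prefix binding an MPI app to its assigned nodes.

    resource_spec keys (num_nodes, ranks_per_node) are the minimal model
    from this draft; assigned_nodes comes from the manager's bookkeeping.
    """
    ranks_per_node = resource_spec["ranks_per_node"]
    num_ranks = resource_spec["num_nodes"] * ranks_per_node
    hosts = ",".join(assigned_nodes)
    return f"mpiexec -n {num_ranks} -ppn {ranks_per_node} -hosts {hosts}"

prefix = compose_mpiexec_prefix({"num_nodes": 2, "ranks_per_node": 4},
                                ["nid001", "nid002"])
# prefix == "mpiexec -n 8 -ppn 4 -hosts nid001,nid002"
```

The worker would export a string like this via an environment variable so the bash_app can prepend it to its command line without hardcoding machine details.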

Fixes # (issue)

Type of change


  • New feature (non-breaking change that adds functionality)

@yadudoc yadudoc requested review from benclifford and WardLT October 10, 2023 17:27
@WardLT
Contributor

WardLT commented Oct 10, 2023

Nice! This has the general functionality I'm looking for, and I can see myself working with the Parsl-generated mpirun invocations.

Would you also mind storing the nodelist in an environment variable so I can still build my own mpirun? I think that's necessary, in general, because we haven't enumerated "all" of the possible launchers and some codes (e.g., deepspeed, Gaussian) use their own.
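A minimal sketch of what building one's own launch command from such a variable might look like (the env var name `PARSL_MPI_NODELIST` and the helper name are placeholders, not anything this PR defines):

```python
import os

def build_custom_mpirun(executable, ranks_per_node):
    """Hand-build an mpirun command from a worker-exported nodelist.

    Assumes the worker exports PARSL_MPI_NODELIST as a comma-separated
    list of hostnames (a hypothetical name for illustration).
    """
    nodes = os.environ["PARSL_MPI_NODELIST"].split(",")
    total_ranks = len(nodes) * ranks_per_node
    return f"mpirun -np {total_ranks} -host {','.join(nodes)} {executable}"

# Simulate the variable a worker might export:
os.environ["PARSL_MPI_NODELIST"] = "nid001,nid002"
cmd = build_custom_mpirun("./my_solver", 4)
# cmd == "mpirun -np 8 -host nid001,nid002 ./my_solver"
```

Codes with their own launchers (deepspeed, Gaussian) could consume the same nodelist and ignore the generated prefix entirely.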

I think I understand how things work on the internals and from a user perspective, but would you mind writing those docs before I dig more into reviewing?

@benclifford
Collaborator

@WardLT from my perspective, I'm mostly interested in trying to use this branch for something like a real application - it's hard, at least for me, to understand the usability/unusability without concrete applications.

@WardLT
Contributor

WardLT commented Oct 10, 2023

@yadudoc , do you have any prototype applications sketched out?

@benclifford
Collaborator

> Would you also mind storing the nodelist in an environment variable so I can still build my own mpirun? I think that's necessary, in general, because we haven't enumerated "all" of the possible launchers and some codes (e.g., deepspeed, Gaussian) use their own.

I don't understand from a brief read of the code what makes the code that tries launch commands in order fail over past the first one - the failure conditions aren't clear to me: for example, what makes compose_aprun_launch_cmd, the first launcher tried, fail?

Launchers in parsl so far have been plugin-style configurable (i.e. you can supply arbitrary out-of-parsl codebase Launcher objects), and I feel like they should be here too: both because I think users should be able to plug in their own (that they wrote themselves or that someone else supplied) and because I think trying to do magic autodetect is probably wrong here.
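The plugin style described above might look something like this sketch (the base-class shape and method signature are assumptions for this draft, not Parsl's actual Launcher interface, which wraps whole worker commands rather than individual MPI apps):

```python
class MPILauncherBase:
    """Hypothetical base class: users subclass this and pass an instance
    in configuration, instead of the runtime autodetecting a launcher."""

    def compose(self, command, resource_spec, nodes):
        raise NotImplementedError

class SrunMPILauncher(MPILauncherBase):
    """One possible built-in: compose an srun prefix for Slurm machines."""

    def compose(self, command, resource_spec, nodes):
        ranks = resource_spec["num_nodes"] * resource_spec["ranks_per_node"]
        return (f"srun --ntasks={ranks} "
                f"--ntasks-per-node={resource_spec['ranks_per_node']} "
                f"--nodelist={','.join(nodes)} {command}")

launch_cmd = SrunMPILauncher().compose(
    "./app", {"num_nodes": 2, "ranks_per_node": 4}, ["n1", "n2"])
```

A user-supplied launcher is then just another subclass, so no ordered aprun-then-srun-then-mpiexec autodetection is needed.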

@yadudoc
Member Author

yadudoc commented Oct 11, 2023

@WardLT You're right, I should've started with the docs. I copied over the current MPI apps doc and updated it. Here's a quick link -> https://github.com/Parsl/parsl/blob/mpi_experimental_1/docs/userguide/mpi_apps_updated.rst

I do not have any real apps to test with; I've basically been doing `mpiexec ... <hostname>` and using a very bare-bones MPI app that prints how the ranks are distributed, to confirm that the launcher prefix is working properly.

Here are the things I think we need to confirm:

  1. Right now we are only really using NUM_NODES and RANKS_PER_NODE, which are the absolute bare minimum that every launcher supports. I do not yet know what other features we'd like. I'm copying the model that Balsam uses from here: https://balsam.readthedocs.io/en/latest/user-guide/jobs/#defining-compute-resources, but just enough for an MVP.
  2. The new capability is to run multiple MPI apps with different node requirements. I do not have a good test-case for this beyond my toy app.
  3. Handling of failures at the launcher step is untested. Assume we'll lose nodes.
  4. There are likely unexpected interactions from default scheduler options in the provider that will set ranks per node to 1

@yadudoc
Copy link
Member Author

yadudoc commented Oct 11, 2023

@benclifford You are right about real applications, I could really use a real-ish application to test with.

Without a good data model for resource_specification and validation, the composed prefix can be junk if the resource_specification is incorrect. The alternative was to propagate the KeyError exceptions back to the user; maybe that is a better approach until there's some validation.

…erialize and ship resource_specification from the app

* Better support for MPI functions
* Manager to identify batch scheduler and available nodes in current batch job.
* Manager places tokens for each node in an MPQueue nodes_q
* Workers unpack tasks to get resource_specification
* Workers provision nodes from the nodes_q and place ownership tokens into an inflight_q
* Worker clears its tokens from the inflight_q and pushes node tokens back into the nodes_q upon task completion
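The token bookkeeping in the commit message can be modeled in miniature like this (the real code uses multiprocessing queues shared across worker processes; a plain `queue.Queue` and these helper names are illustrative only):

```python
import queue

# Manager side: one ownership token per node in the batch job.
nodes_q = queue.Queue()
for node in ["nid001", "nid002", "nid003", "nid004"]:
    nodes_q.put(node)

def provision(num_nodes):
    """Worker side: take one token per requested node.

    Blocks until enough nodes are free, which prevents two workers
    from placing MPI apps on overlapping nodes.
    """
    return [nodes_q.get() for _ in range(num_nodes)]

def release(tokens):
    """Return ownership tokens when the MPI task completes."""
    for token in tokens:
        nodes_q.put(token)

held = provision(2)   # worker claims two nodes for an MPI task
release(held)         # ...and frees them on task completion
```

The inflight_q in the commit message adds a second ledger recording which worker currently holds which tokens, so lost workers can be detected and their nodes reclaimed.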
@yadudoc
Member Author

yadudoc commented Feb 5, 2024

This draft is now obsolete; most of the MPI work is happening over on #3016.

@yadudoc yadudoc closed this Feb 5, 2024
@yadudoc yadudoc deleted the mpi_experimental_1 branch October 17, 2024 16:11