[WIP] Slurm agent #3005

Draft
wants to merge 25 commits into
base: master

Conversation

JiangJiaWei1103
Contributor

@JiangJiaWei1103 JiangJiaWei1103 commented Dec 16, 2024

Tracking issue

flyteorg/flyte#5634

Why are the changes needed?

What changes were proposed in this pull request?

Implement the Slurm agent, which submits a user-defined flytekit task to a remote Slurm cluster to run. The agent has three core methods:

  1. create: Submit a Slurm job with sbatch to run a batch script on the Slurm cluster
  2. get: Check the Slurm job state
  3. delete (not yet tested): Cancel the Slurm job
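
To make the three methods concrete, here is a minimal sketch of the job lifecycle. It is not the agent code in this PR: `SlurmJobClient` and `Runner` are hypothetical names, and the use of `sbatch --parsable`, `scontrol show job`, and `scancel` is an assumption about one reasonable way to submit, poll, and cancel. The command runner is injected so the sketch runs without a cluster.

```python
import re
import shlex
from typing import Callable

# A runner takes a shell command and returns its stdout. In the agent this
# would go over SSH; here it is injected so the sketch needs no cluster.
Runner = Callable[[str], str]


class SlurmJobClient:
    """Hypothetical helper mirroring the create/get/delete methods."""

    def __init__(self, run: Runner) -> None:
        self._run = run

    def create(self, batch_script_path: str) -> str:
        # --parsable makes sbatch print "jobid[;cluster]" instead of a sentence.
        out = self._run(f"sbatch --parsable {shlex.quote(batch_script_path)}")
        return out.strip().split(";")[0]

    def get(self, job_id: str) -> str:
        # scontrol prints key=value pairs, including JobState=<STATE>.
        out = self._run(f"scontrol show job {shlex.quote(job_id)}")
        match = re.search(r"JobState=(\w+)", out)
        if match is None:
            raise RuntimeError(f"could not find JobState for job {job_id}")
        return match.group(1)

    def delete(self, job_id: str) -> None:
        # scancel signals the job to stop; it prints nothing on success.
        self._run(f"scancel {shlex.quote(job_id)}")
```

Passing a fake runner that returns canned output is enough to exercise all three methods locally before pointing the client at a real cluster.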

How was this patch tested?

We tested create and get in the following development environment:

  • Local: MacBook with flytekit installed
    • Test the Slurm agent locally following this guide
  • Remote: Single Ubuntu server with slurmctld and slurmd running
    • We plan to write a single-host setup tutorial and organize useful resources here
  • Communication is done with asyncssh
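
As a rough illustration of the asyncssh communication mentioned above, the sketch below runs a single command on the remote host. `run_remote` and `run_remote_sync` are hypothetical helpers, not functions from this PR, and the sketch assumes the host alias can be resolved with working credentials (for example through the user's SSH configuration).

```python
import asyncio


async def run_remote(host: str, command: str) -> str:
    """Run one command on `host` over SSH and return its stdout."""
    # Imported here so the module loads even where asyncssh is not installed.
    import asyncssh

    # Connection details (user, keys, port) are assumed to be resolvable for
    # this host; explicit options could be passed to connect() instead.
    async with asyncssh.connect(host) as conn:
        # check=True raises an error if the remote command exits non-zero.
        result = await conn.run(command, check=True)
        return str(result.stdout)


def run_remote_sync(host: str, command: str) -> str:
    """Blocking wrapper for callers that are not already in an event loop."""
    return asyncio.run(run_remote(host, command))
```

For example, `run_remote_sync("<host_alias>", "sinfo")` would return the partition summary printed by the cluster, provided the connection succeeds.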

Suppose we have a batch script to run on the Slurm cluster:

#!/bin/bash

echo "Working!" >> ./remote_touch.txt

We use the following Python script to test the Slurm agent on the client side:

import os

from flytekit import workflow
from flytekitplugins.slurm import Slurm, SlurmTask


echo_job = SlurmTask(
    name="echo-job-name",
    task_config=Slurm(
        slurm_host="<host_alias>",
        batch_script_path="<path_to_batch_script_within_slurm_cluster>",
        sbatch_conf={
            "partition": "debug",
            "job-name": "tiny-slurm",
        }
    )
)


@workflow
def wf() -> None:
    echo_job()


if __name__ == "__main__":
    from flytekit.clis.sdk_in_container import pyflyte
    from click.testing import CliRunner

    runner = CliRunner()
    path = os.path.realpath(__file__)

    # Local run
    print(">>> LOCAL EXEC <<<")
    result = runner.invoke(pyflyte.main, ["run", path, "wf"])
    print(result.output)
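
For readers wondering how `sbatch_conf` relates to the submitted job, one plausible mapping is sketched below. It assumes every key is a long `sbatch` option name and every value is passed as `--key=value`; `build_sbatch_command` is a hypothetical helper, and the agent in this PR may build the command differently.

```python
import shlex
from typing import Dict


def build_sbatch_command(batch_script_path: str, sbatch_conf: Dict[str, str]) -> str:
    """Turn an sbatch_conf mapping into a single shell command string."""
    parts = ["sbatch"]
    for option, value in sbatch_conf.items():
        # Assumption: keys are long option names such as "partition".
        parts.append(f"--{option}={value}")
    parts.append(batch_script_path)
    # shlex.join quotes any argument that contains shell-unsafe characters.
    return shlex.join(parts)


cmd = build_sbatch_command(
    "/home/u/echo.sh",
    {"partition": "debug", "job-name": "tiny-slurm"},
)
print(cmd)  # sbatch --partition=debug --job-name=tiny-slurm /home/u/echo.sh
```

The script path `/home/u/echo.sh` is a placeholder for illustration only.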

The test result is shown below:
[Image: slurm_basic_result]

Setup process

As stated above

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link


codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 76.63%. Comparing base (f99d50e) to head (9e6d8a6).
Report is 18 commits behind head on master.

Files with missing lines            Patch %   Lines
flytekit/extend/backend/utils.py    0.00%     2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #3005       +/-   ##
===========================================
+ Coverage   51.08%   76.63%   +25.55%     
===========================================
  Files         201      201               
  Lines       21231    21274       +43     
  Branches     2731     2733        +2     
===========================================
+ Hits        10846    16304     +5458     
+ Misses       9787     4219     -5568     
- Partials      598      751      +153     


Successfully submit and run the user-defined task as a normal Python
function on a remote Slurm cluster.

1. Inherit from PythonFunctionTask instead of PythonTask
2. Transfer the task module through SFTP
3. Interact with an Amazon S3 bucket on both localhost and the Slurm cluster

Signed-off-by: JiaWei Jiang <[email protected]>
Specifying `--raw-output-data-prefix` option handles task_module download.

Signed-off-by: JiaWei Jiang <[email protected]>
@flyte-bot
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

Add `ssh_conf` field to let users specify connection secret

Note that reconnection is done in both `get` and `delete`. This is just
a temporary workaround.

Signed-off-by: JiaWei Jiang <[email protected]>

Data scientists and MLEs developing Flyte workflows with the Slurm agent
don't need to know SSH connection details. We assume they only need to
specify which Slurm cluster to use by hostname.

Signed-off-by: JiaWei Jiang <[email protected]>

1. Write user-defined batch script to a tmp file
2. Transfer the batch script through sftp
3. Construct sbatch command to run on Slurm cluster

Signed-off-by: JiaWei Jiang <[email protected]>
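
The three steps in this commit can be sketched as follows. This is an illustration under stated assumptions, not the commit's code: `stage_batch_script` and `Uploader` are hypothetical names, and the transfer step is injected as a callable so the sketch runs without an SSH connection.

```python
import os
import shlex
import tempfile
from typing import Callable

# An uploader copies a local file to a remote path (SFTP in the agent).
Uploader = Callable[[str, str], None]


def stage_batch_script(script: str, remote_path: str, upload: Uploader) -> str:
    """Write the script locally, upload it, and return the sbatch command."""
    # 1. Write the user-defined batch script to a temporary file.
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as tmp:
        tmp.write(script)
        local_path = tmp.name
    try:
        # 2. Transfer the batch script to the cluster.
        upload(local_path, remote_path)
    finally:
        # The local copy is no longer needed once the transfer is attempted.
        os.remove(local_path)
    # 3. Construct the sbatch command to run on the Slurm cluster.
    return f"sbatch {shlex.quote(remote_path)}"
```

With asyncssh, the uploader could wrap `conn.start_sftp_client()` and the SFTP client's `put()` method; that wiring is left out here to keep the sketch runnable locally.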

1. Remove SFTP for batch script transfer
    * Assume Slurm batch script is present on Slurm cluster
2. Support directly specifying a remote batch script path

Signed-off-by: JiaWei Jiang <[email protected]>

Signed-off-by: JiaWei Jiang <[email protected]>

@Future-Outlier Future-Outlier self-assigned this Jan 13, 2025
@@ -326,7 +326,7 @@ class AsyncAgentExecutorMixin:

     def execute(self: PythonTask, **kwargs) -> LiteralMap:
         ctx = FlyteContext.current_context()
-        ss = ctx.serialization_settings or SerializationSettings(ImageConfig())
+        ss = ctx.serialization_settings or SerializationSettings(ImageConfig.auto_default_image())
Member

Why do we need this?
Is this for the shell task?

Contributor Author

@JiangJiaWei1103 JiangJiaWei1103 Jan 14, 2025

If we define a SlurmTask without specifying container_image (as in the example Python script above), ctx.serialization_settings will be None, and an error is raised stating that PythonAutoContainerTask needs an image.

I think this is just a temporary workaround for local testing, and I'm still pondering how to better handle this issue.

Member

Amazing Graph.

`SlurmTask` and `SlurmShellTask` now share the same agent.

Signed-off-by: JiaWei Jiang <[email protected]>

Labels
None yet
Projects
Status: In progress

4 participants