-
Notifications
You must be signed in to change notification settings - Fork 303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Slurm agent #3005
base: master
Are you sure you want to change the base?
[WIP] Slurm agent #3005
Conversation
Signed-off-by: jiangjiawei1103 <[email protected]>
Signed-off-by: jiangjiawei1103 <[email protected]>
Signed-off-by: jiangjiawei1103 <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3005 +/- ##
===========================================
+ Coverage 51.08% 76.63% +25.55%
===========================================
Files 201 201
Lines 21231 21274 +43
Branches 2731 2733 +2
===========================================
+ Hits 10846 16304 +5458
+ Misses 9787 4219 -5568
- Partials 598 751 +153 ☔ View full report in Codecov by Sentry. |
Successfully submit and run the user-defined task as a normal python function on a remote Slurm cluster. 1. Inherit from PythonFunctionTask instead of PythonTask 2. Transfer the task module through sftp 3. Interact with amazon s3 bucket on both localhost and Slurm cluster Signed-off-by: JiaWei Jiang <[email protected]>
Specifying `--raw-output-data-prefix` option handles task_module download. Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Code Review Agent Run Status
|
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Code Review Agent Run Status
|
Add `ssh_conf` filed to let users specify connection secret Note that reconnection is done in both `get` and `delete`. This is just a temporary workaround. Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Code Review Agent Run Status
|
For data scientists and MLEs developing flyte wf with Slurm agent, they don't actually need to know ssh connection details. We assume they only need to specify which Slurm cluster to use by hostname. Signed-off-by: JiaWei Jiang <[email protected]>
Code Review Agent Run Status
|
1. Write user-defined batch script to a tmp file 2. Transfer the batch script through sftp 3. Construct sbatch command to run on Slurm cluster Signed-off-by: JiaWei Jiang <[email protected]>
Code Review Agent Run Status
|
1. Remove SFTP for batch script transfer * Assume Slurm batch script is present on Slurm cluster 2. Support directly specifying a remote batch script path Signed-off-by: JiaWei Jiang <[email protected]>
Code Review Agent Run Status
|
Signed-off-by: pryce-turner <[email protected]>
Code Review Agent Run Status
|
Signed-off-by: JiaWei Jiang <[email protected]>
Code Review Agent Run Status
|
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Code Review Agent Run Status
|
@@ -326,7 +326,7 @@ class AsyncAgentExecutorMixin: | |||
|
|||
def execute(self: PythonTask, **kwargs) -> LiteralMap: | |||
ctx = FlyteContext.current_context() | |||
ss = ctx.serialization_settings or SerializationSettings(ImageConfig()) | |||
ss = ctx.serialization_settings or SerializationSettings(ImageConfig.auto_default_image()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this?
is this for shell task?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we define a SlurmTask
without specifying container_image
(as the example python script provided above), ctx.serialization_settings
will be None
. Then, an error is raised which describes that PythonAutoContainerTask
needs an image.
I think this is just a temporary workaround for local test and I'm still pondering how to better handle this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing Graph.
`SlurmTask` and `SlurmShellTask` now share the same agent. Signed-off-by: JiaWei Jiang <[email protected]>
Code Review Agent Run Status
|
Tracking issue
flyteorg/flyte#5634
Why are the changes needed?
What changes were proposed in this pull request?
Implement the Slurm agent, which submits the user-defined flytekit task to a remote Slurm cluster to run. Following describe three core methods:
create
: Submit a Slurm job withsbatch
to run a batch script on Slurm clusterget
: Check the Slurm job statedelete
(haven't been tested): Cancel the Slurm jobHow was this patch tested?
We test
create
andget
in the development environment described as follows:flytekit
installedslurmctld
andslurmd
runningasyncssh
Suppose we have a batch script to run on Slurm cluster:
We use the following python script to test Slurm agent on the client side:
The test result is shown as follows:
Setup process
As stated above
Check all the applicable boxes
Related PRs
Docs link