diff --git a/.github/workflows/main.yaml b/.github/workflows/main.yaml
index 8a39904..fae59f5 100644
--- a/.github/workflows/main.yaml
+++ b/.github/workflows/main.yaml
@@ -18,7 +18,7 @@ jobs:
       - name: Check Spelling
         uses: crate-ci/typos@7ad296c72fa8265059cc03d1eda562fbdfcd6df2 # v1.9.0
         with:
-          files: ./README.md ./spec.md
+          files: ./README.md ./spec-1.md

       - name: Lint and format Python code
         run: |
diff --git a/.github/workflows/test.yaml b/.github/workflows/test.yaml
index 4102937..409a880 100644
--- a/.github/workflows/test.yaml
+++ b/.github/workflows/test.yaml
@@ -34,5 +34,6 @@ jobs:
         run: |
           which jobspec
           flux start jobspec run ./examples/hello-world-jobspec.yaml
-          flux start jobspec run ./examples/hello-world-batch.yaml
+          flux start jobspec run ./examples/group-with-group.yaml
+          flux start jobspec run ./examples/task-with-group.yaml
           flux start python3 ./examples/flux/receive-job.py
diff --git a/README.md b/README.md
index 2baee13..70a282d 100644
--- a/README.md
+++ b/README.md
@@ -11,8 +11,9 @@
 It is a transformational layer, or a simple language that converts steps needed
 for a specific clusters scheduler. We are currently prototyping off of the Flux
 JobSpec, and intent to derive some variant between that and something more.
 It is JobSpec... the next generation! 🚀️

-⭐️ [Read the specification](spec.md) ⭐️
+⭐️ [Read the specification](spec-1.md) ⭐️
+Some drafts are included in [docs/drafts](docs/drafts)

 ## Usage
@@ -59,18 +60,21 @@ it there for the time being, mostly because it looks nicer. I'm sure someone wil
 jobspec run ./examples/hello-world-jobspec.yaml
 ```
 ```console
+=> flux workload
 => flux submit ƒDjkLvNF9 OK
 => flux submit ƒDjzAyfhh OK
 ```

 Add debug to see commands submit
+
 ```bash
 jobspec --debug run ./examples/hello-world-jobspec.yaml
 ```
 ```console
-=> flux submit ƒJYffwYkP OK
+=> flux workload
+=> flux submit ƒ2i6n8XHSP OK
    flux submit --job-name task-1 -N 1 bash -c echo Starting task 1; sleep 3; echo Finishing task 1
-=> flux submit ƒJYw8t4F1 OK
+=> flux submit ƒ2i6qafcUw OK
    flux submit --job-name task-2 -N 1 bash -c echo Starting task 2; sleep 3; echo Finishing task 2
 ```

@@ -81,7 +85,78 @@
 jobspec run -t flux ./examples/hello-world-jobspec.yaml
 jobspec run --transformer flux ./examples/hello-world-jobspec.yaml
 ```

-#### 3. Python Examples
+#### 3. Nested Examples
+
+Try running some advanced examples. Here is a group within a task.
+
+```bash
+jobspec --debug run ./examples/task-with-group.yaml
+```
+```console
+=> flux workload
+=> flux submit ƒ2iiMFBqxT OK
+   flux submit --job-name task-1 -N 1 bash -c echo Starting task 1; sleep 3; echo Finishing task 1
+=> flux batch ƒ2iiQpk7Qj OK
+   #!/bin/bash
+   flux submit --job-name task-2-task-0 --flags=waitable bash -c echo Starting task 2; sleep 3; echo Finishing task 2
+   flux job wait --all
+   flux job submit /tmp/jobspec-.bvu1v7vk/jobspec-5y9n9u0y
+```
+
+That's pretty intuitive, because we see that there is a flux submit first, followed by a batch that has a single task run. The last line "flux submit" shows how we are submitting the script that was just shown.
+What about a group within a group?
+
+```bash
+$ jobspec --debug run ./examples/group-with-group.yaml
+```
+```console
+=> flux workload
+=> flux batch ƒ2jEE7NPXM OK
+   #!/bin/bash
+   flux submit --job-name group-1-task-0 --flags=waitable bash -c echo Starting task 1 in group 1; sleep 3; echo Finishing task 1 in group 1
+   flux job submit --flags=waitable /tmp/jobspec-.ljjiywaa/jobspec-kb5y5lsl
+   # rm -rf /tmp/jobspec-.ljjiywaa/jobspec-kb5y5lsl
+   flux job wait --all
+   flux job submit /tmp/jobspec-.45jezez5/jobspec-8dr1udhx
+```
+
+The UI here needs some work, but here is what we see above.
+
+```console
+# This is the start of the workload - the entire next gen jobspec always produces one workload
+=> flux workload
+
+# This is the top level group that has the other group within - it's the top level "flux batch" that we submit
+=> flux batch ƒ2e7Ay6jvo OK
+
+   # This is showing the first script that is written
+   #!/bin/bash
+
+   # Here is the first job submit, now namespaced to group-1 (if the user, me, didn't give it a name)
+   flux submit --job-name group-1-task-0 --flags=waitable bash -c echo Starting task 1 in group 1; sleep 3; echo Finishing task 1 in group 1
+
+   # This is submitting group-2 - the jobspec is written in advance
+   flux job submit --flags=waitable /tmp/jobspec-.ljjiywaa/jobspec-kb5y5lsl
+
+   # And this is how we clean it up as we go - always after it's submitted. I'm commenting it out for now because rm -rf makes me nervous!
+   # rm -rf /tmp/jobspec-.ljjiywaa/jobspec-kb5y5lsl
+
+   # This is the actual end of the batch script
+   flux job wait --all
+
+# This is showing submitting the batch script above, kind of confusing because it looks like it's within it (it's not, just a bad UI for now)
+flux job submit /tmp/jobspec-.45jezez5/jobspec-8dr1udhx
+```
+
+And because I didn't clean it up, here are the contents of the batch within the batch for group-2:
+
+```bash
+#!/bin/bash
+flux submit --job-name group-2-task-0 --flags=waitable bash -c echo Starting task 1 in group 2; sleep 3; echo Finishing task 1 in group 2
+flux job wait --all
+```
+
+#### 4. Python Examples

 It could also be the case that you want something running inside a lead broker instance to receive Jobspecs incrementally and then
 run them. This Python example can help with that by showing how to accomplish the same, but from within Python.

@@ -90,6 +165,7 @@
 python3 ./examples/flux/receive-job.py
 ```
 ```console
+=> flux workload
 => flux submit ƒKCJG2ESB OK
 => flux submit ƒKCa5iZsd OK
 ```
diff --git a/spec.md b/docs/drafts/README.md
similarity index 100%
rename from spec.md
rename to docs/drafts/README.md
diff --git a/examples/group-with-group.yaml b/examples/group-with-group.yaml
new file mode 100644
index 0000000..680bece
--- /dev/null
+++ b/examples/group-with-group.yaml
@@ -0,0 +1,27 @@
+version: 1
+
+# This is an example of a group with a nested group,
+# which means we would have a flux batch within a flux batch!
+resources:
+  common:
+    count: 1
+    type: node
+
+groups:
+- name: group-1
+  resources: common
+  tasks:
+  - command:
+    - bash
+    - -c
+    - "echo Starting task 1 in group 1; sleep 3; echo Finishing task 1 in group 1"
+  - group: group-2
+
+- name: group-2
+  depends_on: ["group-1"]
+  resources: common
+  tasks:
+  - command:
+    - bash
+    - -c
+    - "echo Starting task 1 in group 2; sleep 3; echo Finishing task 1 in group 2"
\ No newline at end of file
diff --git a/examples/hello-world-batch.yaml b/examples/hello-world-batch.yaml
deleted file mode 100644
index 83c8f2d..0000000
--- a/examples/hello-world-batch.yaml
+++ /dev/null
@@ -1,28 +0,0 @@
-# This is a batch because we have resources at the top level
-# They are inherited by tasks
-version: 1
-requires:
-  io.archspec:
-    cpu.target: amd64
-
-resources:
-  count: 2
-  type: node
-
-tasks:
-- name: task-1
-  command:
-  - bash
-  - -c
-  - "echo Starting task 1; sleep 3; echo Finishing task 1"
-
-- name: task-2
-  depends_on: ["task-1"]
-
-  # This is an example of a custom "task-level' requires
-  requires:
-    hardware.gpu.available: "yes"
-  command:
-  - bash
-  - -c
-  - "echo Starting task 2; sleep 3; echo Finishing task 2"
\ No newline at end of file
diff --git a/jobspec/runner.py b/jobspec/runner.py
index 71e1649..75ef544 100644
--- a/jobspec/runner.py
+++ b/jobspec/runner.py
@@ -64,6 +64,9 @@ def parse(self, jobspec):
         """
         raise NotImplementedError

+    def announce(self):
+        pass
+
     def run(self, filename):
         """
         Run the transformer
@@ -74,6 +77,7 @@ def run(self, filename):
         # Get validated transformation steps
         # These will depend on the transformer logic
         steps = self.parse(jobspec)
+        self.announce()

         # Run each step to submit the job, and that's it.
         for step in steps:
diff --git a/jobspec/transformer/flux/steps.py b/jobspec/transformer/flux/steps.py
index 41cb6c1..f5fa6cb 100644
--- a/jobspec/transformer/flux/steps.py
+++ b/jobspec/transformer/flux/steps.py
@@ -1,3 +1,4 @@
+import copy
 import json
 import os
 import uuid
@@ -160,18 +161,20 @@ def write_tasks_script(self):
         """
         Generate the batch script.
         """
-        data = script_prefix
+        data = copy.deepcopy(script_prefix)
        for task in self.tasks:
             if task.name == "batch":
-                cmd, _ = self.generate_command(waitable=True)
+                cmd, _ = task.generate_command(waitable=True)
                 data.append(" ".join(cmd))

                 # This is the jobspec
-                data.append("# rm -rf {cmd[-1]}")
-            data.append(" ".join(task.generate_command(waitable=True)))
+                data.append(f"# rm -rf {cmd[-1]}")
+            else:
+                data.append(" ".join(task.generate_command(waitable=True)))

         # Ensure all jobs are waited on
         data.append("flux job wait --all")
-        return {"mode": 33216, "data": "\n".join(data), "encoding": "utf-8"}
+        script = "\n".join(data)
+        return {"mode": 33216, "data": script, "encoding": "utf-8"}

     def generate_command(self, waitable=False):
         """
diff --git a/jobspec/transformer/flux/workload.py b/jobspec/transformer/flux/workload.py
index 39448fe..c557644 100644
--- a/jobspec/transformer/flux/workload.py
+++ b/jobspec/transformer/flux/workload.py
@@ -2,6 +2,7 @@

 import jobspec.core as js
 import jobspec.core.resources as rcore
+from jobspec.logger import LogColors
 from jobspec.runner import TransformerBase

 from .steps import batch, stage, submit
@@ -20,6 +21,13 @@ class FluxWorkload(TransformerBase):
     name = "flux"
     description = "Flux Framework workload"

+    def announce(self):
+        """
+        Announce prints an additional prefix during run
+        """
+        prefix = "flux workload".ljust(15)
+        print(f"=> {LogColors.OKCYAN}{prefix}{LogColors.ENDC}")
+
     def parse(self, jobspec):
         """
         Parse the jobspec into tasks for flux.
@@ -47,6 +55,11 @@ def parse(self, jobspec):
         # We copy because otherwise the dict changes size when the task parser removes
         groups = copy.deepcopy(self.group_lookup)
         for name, group in groups.items():
+            # Check if the group was already removed by another group task,
+            # and don't run if it was!
+            if name not in self.group_lookup:
+                continue
+
             self.tasks.insert(
                 0, self.parse_group(group, name, self.resources, requires=self.requires)
             )
@@ -161,7 +174,7 @@ def parse_tasks(self, tasks, resources=None, attributes=None, requires=None, nam
         steps = []
         for i, task in enumerate(tasks):
             # Create a name based on the index or the task name
-            name = task.get("name") or f"task-{i}"
+            name = task.get("name") or f"{name_prefix}task-{i}"

             # Case 1: We found a group! Parse here and add to steps
             group_name = task.get("group")
diff --git a/spec-1.md b/spec-1.md
index 31239d2..39d4f6e 100644
--- a/spec-1.md
+++ b/spec-1.md
@@ -132,7 +132,7 @@ For this example, let's look at the "Mini Mummi" workflow, which requires:
 4. Within the top level batch, submitting other jobs to run training
 5. And doing the same for testing.

-While we could imagine another level of nexting (for example, the machine learning tasks each being a group with a subset of tasks) let's start with this design for now. It might look like this. Note that the requires (and other) sections are removed for brevity:
+While we could imagine another level of nesting (for example, the machine learning tasks each being a group with a subset of tasks) let's start with this design for now. It might look like this. Note that the requires (and other) sections are removed for brevity:

 ```yaml
 version: 1
@@ -383,7 +383,7 @@ groups:
     - command: ["pennant", "/opt/pennant/test/params.pnt"]
       depends_on: ["build"]
       attributes:
-        envirionment:
+        environment:
          LD_LIBRARY_PATH: /usr/local/cuda/lib
 ```
@@ -624,6 +624,7 @@ Additional properties and attributes we might want to consider

 - Next step - PR to add support for dependency.name to flux-core. I see where this needs to be added (and the logic) and would learn a lot from a hackathon (where I can lead and write the code)
 - Would like to get it supported in flux core first, but if not possible need to consider using jobids here (not ideal)
+- Environment still needs to be added to the implementation, along with using requires and attributes meaningfully.
 - for groups in groups (batches) along with submit we need logic to wait for completion before the instance cleans up.
 - user: variables specific to the user
 - parameters: task parameters or state that inform resource selection