
spec: add first draft for jobspec nextgen #11

Merged · 6 commits · Apr 29, 2024
3 changes: 2 additions & 1 deletion .devcontainer/devcontainer.json
@@ -10,7 +10,8 @@
},
"extensions": [
"ms-vscode.cmake-tools",
"golang.go"
"golang.go",
"ms-python.python"
]
}
},
2 changes: 1 addition & 1 deletion .github/workflows/main.yaml
@@ -18,7 +18,7 @@ jobs:
- name: Check Spelling
uses: crate-ci/typos@7ad296c72fa8265059cc03d1eda562fbdfcd6df2 # v1.9.0
with:
files: ./README.md
files: ./README.md ./spec-1.md

- name: Lint and format Python code
run: |
3 changes: 2 additions & 1 deletion .github/workflows/test.yaml
@@ -34,5 +34,6 @@ jobs:
run: |
which jobspec
flux start jobspec run ./examples/hello-world-jobspec.yaml
flux start jobspec run ./examples/hello-world-batch.yaml
flux start jobspec run ./examples/group-with-group.yaml
flux start jobspec run ./examples/task-with-group.yaml
flux start python3 ./examples/flux/receive-job.py
208 changes: 97 additions & 111 deletions README.md
@@ -8,42 +8,18 @@

This library includes a cluster-agnostic language to set up a job (one unit of work in a jobspec).
It is a transformational layer, or a simple language that converts the steps needed to prepare a job
for a specific clusters scheduler. If you think it looks too simple then I'd say it's a success,
for a specific cluster's scheduler. We are currently prototyping off of the Flux JobSpec, and intend
to derive some variant between that and something more. It is JobSpec... the next generation! 🚀️

## Usage

A transformer provides one or more steps for the jobspec to be transformed and understood for a particular
execution environment.

### Steps

Steps include:

| Name | Description |
|--------|-------------|
| write | write a file in the staging directory |
| set | a step to define a global setting |
| copy | copy a file into staging (currently just local) |
| submit | submit the job |
| batch | submit the job with a batch command (more common in HPC) |
| auth | authenticate with some service |
⭐️ [Read the specification](spec-1.md) ⭐️

Note that for the above, we assume a shared filesystem unless stage directs that this isn't the case.
These are the basic steps that @vsoch needs now for scheduling experiments, and more can be added (or tweaked) if needed.
Some drafts are included in [docs/drafts](docs/drafts)

### Settings

Any "set" directive can be used to set a more global setting on the transform. For example:

- stage: defines the staging directory. If not set, will be a temporary directory that is created
- sharedfs: true or false to say that the filesystem is shared or not (defaults to false)

For `sharedfs` it would be ideal to have a setting that is specific to the transformer, but unfortunately this could be true or false
for flux, so it has to be set. But this might be an interesting compatibility thing to test.

### Example
## Usage

This example will assume receiving a Jobspec on a flux cluster.
A JobSpec consists of one or more tasks that have dependencies. This level of dependency is what can be represented in a scheduler.
The JobSpec library here reads in the JobSpec and can map that into specific cluster submit commands.
Here is an example that assumes receiving a Jobspec on a flux cluster.
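
Before the walkthrough, here is a minimal, standalone Python sketch of the mapping idea (this is not the library's internal API). The task schema with `name`, `command`, and `depends_on` keys is an illustrative assumption; see [the spec](spec-1.md) for the authoritative format.

```python
# Illustrative sketch only: order tasks by their dependencies and print one
# submit command per task. The schema (tasks with "name", "command",
# "depends_on") is an assumption for this example. Requires PyYAML.
from graphlib import TopologicalSorter

import yaml

document = yaml.safe_load(
    """
tasks:
  - name: task-1
    command: ["echo", "hello from task 1"]
  - name: task-2
    command: ["echo", "hello from task 2"]
    depends_on: ["task-1"]
"""
)

tasks = {task["name"]: task for task in document["tasks"]}
graph = {name: task.get("depends_on", []) for name, task in tasks.items()}

# Walk the tasks in dependency order and print a (pretend) submit command
for name in TopologicalSorter(graph).static_order():
    command = " ".join(tasks[name]["command"])
    print(f"flux submit --job-name {name} {command}")
```

The real library does considerably more (groups, batch scripts, transformers), but the core idea is the same: walk tasks in dependency order and emit scheduler commands.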

#### 1. Start Flux

@@ -80,136 +56,147 @@ jobspec - it can be a file that the jobspec writes, and then the command is issu
it there for the time being, mostly because it looks nicer. I'm sure someone will disagree with me about that.

```bash
# Example showing without watching (waiting) and showing output
# Submit a basic set of jobs with dependencies
jobspec run ./examples/hello-world-jobspec.yaml
```
```console
=> flux workload
=> flux submit ƒDjkLvNF9 OK
=> flux submit ƒDjzAyfhh OK
```

# Example that shows waiting for output
jobspec run ./examples/hello-world-wait-jobspec.yaml
Add `--debug` to see the commands being submitted:

# Example with batch using flux
jobspec run ./examples/hello-world-batch.yaml
```bash
jobspec --debug run ./examples/hello-world-jobspec.yaml
```
```console
=> flux workload
=> flux submit ƒ2i6n8XHSP OK
flux submit --job-name task-1 -N 1 bash -c echo Starting task 1; sleep 3; echo Finishing task 1
=> flux submit ƒ2i6qafcUw OK
flux submit --job-name task-2 -N 1 bash -c echo Starting task 2; sleep 3; echo Finishing task 2
```

Note that the default transformer is flux, so the above are equivalent to:

```bash
jobspec run -t flux ./examples/hello-world-wait-jobspec.yaml
jobspec run --transformer flux ./examples/hello-world-wait-jobspec.yaml
jobspec run -t flux ./examples/hello-world-jobspec.yaml
jobspec run --transformer flux ./examples/hello-world-jobspec.yaml
```

#### 3. Python Examples
#### 3. Nested Examples

It could also be the case that you want something running inside a lead broker instance to receive Jobspecs incrementally and then
run them. This Python example can help with that by showing how to accomplish the same, but from within Python.
Try running some advanced examples. Here is a group within a task.

```bash
python3 ./examples/flux/receive-job.py
jobspec --debug run ./examples/task-with-group.yaml
```
```console
$ python3 examples/flux/receive-job.py
=> step write OK
=> step submit f7aChzM3u OK
=> step write OK
=> step submit f7aDYuwMH OK
=> flux workload
=> flux submit ƒ2iiMFBqxT OK
flux submit --job-name task-1 -N 1 bash -c echo Starting task 1; sleep 3; echo Finishing task 1
=> flux batch ƒ2iiQpk7Qj OK
#!/bin/bash
flux submit --job-name task-2-task-0 --flags=waitable bash -c echo Starting task 2; sleep 3; echo Finishing task 2
flux job wait --all
flux job submit /tmp/jobspec-.bvu1v7vk/jobspec-5y9n9u0y
```

Just for fun (posterity) I briefly tried having emoji here:
That's fairly intuitive: there is a flux submit first, followed by a batch that runs a single task. The last line, "flux job submit," shows how the batch script that was just shown gets submitted.
What about a group within a group?

![img/emoji.png](img/emoji.png)
```bash
jobspec --debug run ./examples/group-with-group.yaml
```
```console
=> flux workload
=> flux batch ƒ2jEE7NPXM OK
#!/bin/bash
flux submit --job-name group-1-task-0 --flags=waitable bash -c echo Starting task 1 in group 1; sleep 3; echo Finishing task 1 in group 1
flux job submit --flags=waitable /tmp/jobspec-.ljjiywaa/jobspec-kb5y5lsl
# rm -rf /tmp/jobspec-.ljjiywaa/jobspec-kb5y5lsl
flux job wait --all
flux job submit /tmp/jobspec-.45jezez5/jobspec-8dr1udhx
```

### Details
The UI here needs some work, but here is what we see above.

As an example, although you *could* submit a job with a command ready to go - assuming your cluster has the
software needed and files, and you just want to run it, assuming submission to a cluster you haven't
setup on, you might need the following logic:
```console
# This is the start of the workload - the entire next gen jobspec always produces one workload
=> flux workload

1. Write a script to file that is intended to install something.
2. Stage this file across nodes.
3. Submit the script to all nodes to do the install.
4. Write a script to file for your actual job.
5. Again, stage this file across nodes (assuming no share filesystem)
6. Submit the job, either as a submit or batch directive to a workload manager.
# This is the top level group that has the other group within - it's the top level "flux batch" that we submit
=> flux batch ƒ2e7Ay6jvo OK

The way that you do this with every workload manager (or cluster, more generally) is going to vary
quite a bit. However, with a transformation - a mapping of abstract steps to a specific cluster
workload manager, you can write those steps out very simply:
# This is showing the first script that is written
#!/bin/bash

```yaml
transform:
# Here is the first job submit, now namespaced to group-1 (if the user, me, didn't give it a name)
flux submit --job-name group-1-task-0 --flags=waitable bash -c echo Starting task 1 in group 1; sleep 3; echo Finishing task 1 in group 1

- step: write
filename: install.sh
executable: true
# This is submitting group-2 - the jobspec is written in advance
flux job submit --flags=waitable /tmp/jobspec-.ljjiywaa/jobspec-kb5y5lsl

- step: submit
filename: install.sh
wait: true
# And this will be how we clean it up as we go - always after it's submitted. I'm commenting it out for now because rm -rf makes me nervous!
# rm -rf /tmp/jobspec-.ljjiywaa/jobspec-kb5y5lsl

- step: write
filename: job.sh
executable: true
# This is the actual end of the batch script
flux job wait --all

- step: submit
filename: job.sh
wait: true
# This is showing submitting the batch script above, kind of confusing because it looks like it's within it (it's not, just a bad UI for now)
flux job submit /tmp/jobspec-.45jezez5/jobspec-8dr1udhx
```

The above assumes we have a shared filesystem, and by not setting the stage manually:
And because I didn't clean it up, here are the contents of the batch within the batch for group-2:

```yaml
- step: set
key: stage
value: /tmp/path-for-workflow
```bash
#!/bin/bash
flux submit --job-name group-2-task-0 --flags=waitable bash -c echo Starting task 1 in group 2; sleep 3; echo Finishing task 1 in group 2
flux job wait --all
```

We will use a custom one. If we didn't have a shared filesystem we would need to provide that detail. It's really akin
to a subsystem detail, because a job that assumes a shared fs won't be compatible.
#### 4. Python Examples

It could also be the case that you want something running inside a lead broker instance to receive Jobspecs incrementally and then
run them. This Python example can help with that by showing how to accomplish the same, but from within Python.

```yaml
- step: set
key: sharedfs
value: false
```bash
python3 ./examples/flux/receive-job.py
```
```console
=> flux workload
=> flux submit ƒKCJG2ESB OK
=> flux submit ƒKCa5iZsd OK
```

Just for fun (posterity) I briefly tried having emoji here:

![img/emoji.png](img/emoji.png)

Whenever there is a copy (not shown) this assumes the receiving cluster has some cluster-specific method for copy or
file mapping, even in the case without a shared filesystem. It could be ssh, or a filemap, or something else.
For an ephemeral cluster API, it might be an interaction with a storage provider, or just adding the file to an API call that
will (in and of itself) do that creation, akin to a startup script for an instance in Terraform. It really doesn't matter -
the user can expect the file to be written and shared across nodes. This is not intended to be a workflow or build tool -
it simply is a transformational layer that a jobspec can provide to setup a specific cluster environment. It works with a
jobspec in that you define your filenames (scripts) in the tasks->scripts directive. It also uses a plugin design, so a
cluster or institution can write a custom transformer to install, and it will be discovered
by name. This is intended to work with the prototype [rainbow](https://github.com/converged-computing/rainbow) scheduler.
Jobspec is an entity of [flux-framework](https://flux-framework.org).

### Frequently Asked Questions

#### Why not rely on Flux internals?
#### Is this a Flux jobspec?

If we lived in a universe of just flux, sure we wouldn't need this. But the world is more than Flux, and we want to extend our Jobspec to that world.
So we want a Jobspec to be able to handle a transformation of some logic (the above) into an execution that might not involve flux at all. It could be another workload manager (e.g., Slurm),
Kubernetes, or it could be a service that submits to some cloud batch API.
Despite the shared name, this is not a Flux jobspec. Type `man bash` to see that the term "jobspec" predates Flux. If we lived in a universe of just Flux, sure, we wouldn't need this. But the world is more than Flux, and we want to extend our Jobspec to that world - providing an abstraction that works with Flux, but also with other workload managers, compute environments, and application programming interfaces.

#### What are all the steps allowed?
#### What are steps?

They are currently shown in the example above, and better documentation will be written. Arguably, any transformation backend does not
need to support every kind of step, however if you provide a Jobspec to a transformer with a step not supported, you'll get an error.
A step is a custom setup or staging command that might be allowed for a specific environment. For example, workload managers that know how to map or stage files can use the "stage" step. General steps to write scripts can arguably be used anywhere with some form of filesystem, shared or not. The steps that are allowed for a task are shown in the [spec](spec-1.md). At the outset we will make an effort to only add steps that can be supported across transformer types.
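
As a rough, standalone illustration of the concept (these are not the library's actual step classes), a "write"-style step boils down to writing an executable file into a staging directory that a later submit step can reference:

```python
# Standalone sketch of the "step" idea - not jobspec's actual step classes.
import os
import stat


class WriteScriptStep:
    """Hypothetical step: write an executable script into a staging directory."""

    name = "write"

    def __init__(self, filename, content):
        self.filename = filename
        self.content = content

    def run(self, stage_dir):
        path = os.path.join(stage_dir, self.filename)
        with open(path, "w") as fh:
            fh.write(self.content)
        # Mark the script executable so a later submit step can run it
        os.chmod(path, os.stat(path).st_mode | stat.S_IEXEC)
        return path


# Example usage with a temporary staging directory
if __name__ == "__main__":
    import tempfile

    stage = tempfile.mkdtemp(prefix="jobspec-stage-")
    step = WriteScriptStep("job.sh", "#!/bin/bash\necho hello world\n")
    print(step.run(stage))
```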

#### Where are the different transformers defined?

We currently have our primary (core) transformers here in [jobspec/transformer](jobspec/transformer); however, a registry that discovers jobspec-* named Python modules can allow an out-of-tree install and use of a transformer. This use case anticipates clusters with some custom or private logic that cannot be shared in a public GitHub repository.
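
A minimal sketch of that kind of name-based discovery is below. It assumes a plugin ships a module whose import name starts with `jobspec_` and exposes a `Transformer` attribute; both the prefix and the attribute are illustrative assumptions here, not the registry's actual contract.

```python
# Sketch of name-based plugin discovery (assumed convention: modules named
# jobspec_<something> expose a Transformer class). Not the actual registry code.
import importlib
import pkgutil

discovered = {}
for module_info in pkgutil.iter_modules():
    if module_info.name.startswith("jobspec_"):
        module = importlib.import_module(module_info.name)
        transformer = getattr(module, "Transformer", None)
        if transformer is not None:
            discovered[module_info.name] = transformer

# Out-of-tree transformers installed on the system would show up here
print(sorted(discovered))
```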

#### How do you know this is a good idea?

I don't, and won't until I try it for experiments. I decided to try something like this after several days of preparing for experiments, and realizing that this transformation layer was entirely missing.

### Means of Interaction

There are several likely means of interacting with this library:

- As a service that runs at some frequency to receive jobs (written as a loop in Python in some context)
- As a cron job that does the same (an entry to crontab to run "jobspec" at some frequency)
- As a one off run (a single run of the above)
- As a one off run (an example above)

For the example usage here, and since the project I am working on is concerned with Flux, we will start with the simplest case - a client that is running inside a flux instance (meaning it can import flux) that reads in a jobspec with a section that defines a set of transforms, and then issues the commands to stage the setup and use flux to run the work defined by the jobspec.
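
For the service-style interaction above, a hypothetical polling loop could be as simple as the sketch below. The drop directory, file pattern, and interval are illustrative assumptions; it just shells out to the same `jobspec` CLI used throughout this README.

```python
# Hypothetical "service" loop: watch a directory for new jobspec files and
# run each one with the jobspec CLI. Paths and timing are assumptions.
import pathlib
import subprocess
import time

incoming = pathlib.Path("/tmp/jobspec-incoming")
seen = set()

while True:
    for path in sorted(incoming.glob("*.yaml")):
        if path in seen:
            continue
        # Hand the jobspec to the CLI (equivalent to: jobspec run <file>)
        subprocess.run(["jobspec", "run", str(path)], check=False)
        seen.add(path)
    time.sleep(30)
```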

@@ -243,11 +230,10 @@ just register the empty step with the name you want to skip. As an example, let'
```python
import jobspec.steps as steps

# This will not fail validation that the step is unknowb, but skip it
# This will not fail validation that the step is unknown, but skip it
Transformer.register_step(steps.EmptyStep, name="stage")
```


## License

HPCIC DevTools is distributed under the terms of the MIT license.