core: add core of transform layer with steps
This is for Flux, which probably is not as important as other
use cases since it understands a jobspec, but it demonstrates
the basic staging, writing, and submission of a script. I will
add other steps I need now, and then likely other transformers
for things that are not Flux.

Signed-off-by: vsoch <[email protected]>
vsoch committed Mar 18, 2024
1 parent 52f8b7a commit 3ad8df9
Showing 33 changed files with 1,282 additions and 70 deletions.
20 changes: 20 additions & 0 deletions .devcontainer/Dockerfile
@@ -0,0 +1,20 @@
FROM fluxrm/flux-sched:jammy

LABEL maintainer="Vanessasaurus <@vsoch>"

# Match the default user id for a single system so we aren't root
ARG USERNAME=vscode
ARG USER_UID=1000
ARG USER_GID=1000
ENV USERNAME=${USERNAME}
ENV USER_UID=${USER_UID}
ENV USER_GID=${USER_GID}
USER root
RUN apt-get update && apt-get install -y less python3-pip

# Add the group and user that match our ids
RUN groupadd -g ${USER_GID} ${USERNAME} && \
adduser --disabled-password --uid ${USER_UID} --gid ${USER_GID} --gecos "" ${USERNAME} && \
echo "${USERNAME} ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers

USER $USERNAME
18 changes: 18 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,18 @@
{
  "name": "Jobspec Developer Environment",
  "dockerFile": "Dockerfile",
  "context": "../",

  "customizations": {
    "vscode": {
      "settings": {
        "terminal.integrated.defaultProfile.linux": "bash"
      },
      "extensions": [
        "ms-vscode.cmake-tools",
        "golang.go"
      ]
    }
  },
  "postStartCommand": "git config --global --add safe.directory /workspaces/jobspec"
}
137 changes: 135 additions & 2 deletions README.md
@@ -6,10 +6,143 @@

![https://github.com/compspec/jobspec/blob/main/img/jobspec-bot.png?raw=true](https://github.com/compspec/jobspec/blob/main/img/jobspec-bot.png?raw=true)

This is intended to work with the prototype [rainbow](https://github.com/converged-computing/rainbow) scheduler.
This library includes a cluster-agnostic language to set up a job (one unit of work in a jobspec).
It is a transformational layer - a simple language that converts the steps needed to prepare a job
for a specific cluster's scheduler. If you think it looks too simple, then I'd say it's a success.

## Usage

### Example

Start up the development environment to find yourself in a container with flux. Start a test instance:

```bash
flux start --test-size=4
```

Note that we have 4 faux nodes and 40 faux cores.

```bash
flux resource list
```
```console
     STATE NNODES NCORES NGPUS NODELIST
      free      4     40     0 194c2b9f4f3c,194c2b9f4f3c,194c2b9f4f3c,194c2b9f4f3c
 allocated      0      0     0
      down      0      0     0
```

Ensure you have jobspec installed! Yes, we are in vscode, installing to the container, so we use sudo. YOLO.

```bash
sudo pip install -e .
```

We are going to run the [examples/hello-world-jobspec.yaml](examples/hello-world-jobspec.yaml). This setup is overly
complex for the task, because we don't actually need to do any staging or special work, but it's an example, so that is intended.
Also note that the design of this file is subject to change. For example, we don't have to include the transform directly in the
jobspec - it could be a file that the jobspec writes, after which the command is issued. I like it better as a piece of the jobspec,
so am putting it there for the time being, mostly because it looks nicer. I'm sure someone will disagree with me about that.

```bash
# Inferred from the CLI added in this commit; flux is the default transformer (see -t/--transform)
jobspec run examples/hello-world-jobspec.yaml
```

### Details

As an example, you *could* submit a job with a command ready to go - assuming your cluster already has the
software and files needed, and you just want to run it. But for submission to a cluster you haven't
set up, you might need the following logic:

1. Write a script to file that is intended to install something.
2. Stage this file across nodes.
3. Submit the script to all nodes to do the install.
4. Write a script to file for your actual job.
5. Again, stage this file across nodes (assuming no shared filesystem)
6. Submit the job, either as a submit or batch directive to a workload manager.

The way that you do this with every workload manager (or cluster, more generally) is going to vary
quite a bit. However, with a transformation - a mapping of abstract steps onto a specific cluster's
workload manager - you can write those steps out very simply:

```yaml
transform:

  - step: write
    filename: install.sh
    executable: true

  - step: stage
    filename: install.sh

  - step: submit
    filename: install.sh
    wait: true

  - step: write
    filename: job.sh
    executable: true

  - step: stage
    filename: job.sh

  - step: submit
    filename: job.sh
    wait: true
```
The above assumes we don't have a shared filesystem, and that the receiving cluster has some cluster-specific method for staging or
file mapping. It could be ssh, a filemap, or something else. For an ephemeral cluster API, it might be an interaction with
a storage provider, or just adding the file to an API call that will (in and of itself) do that creation, akin to a startup script for
an instance in Terraform. It really doesn't matter - the user can expect the file to be written and shared across nodes.

This is not intended to be a workflow or build tool - it is simply a transformational layer that a jobspec can provide
to set up a specific cluster environment. It works with a jobspec in that you define your filenames (scripts) in the tasks->scripts
directive. It also uses a plugin design, so a cluster or institution can write and install a custom transformer, and it will be discovered
by name. This is intended to work with the prototype [rainbow](https://github.com/converged-computing/rainbow) scheduler.
Jobspec is an entity of [flux-framework](https://flux-framework.org).

**under development**

### Frequently Asked Questions

#### Why not rely on Flux internals?

We want a Jobspec to be able to handle a transformation of some logic (the above) into an execution that might not involve Flux at all. It could be another workload manager (e.g., Slurm), or it could be a service that submits to some cloud batch API.

#### What are all the steps allowed?

They are currently shown in the example above, and better documentation will be written. Arguably, a transformation backend does not
need to support every kind of step; however, if you provide a Jobspec to a transformer with a step it does not support, you'll get an error.

#### Where are the different transformers defined?

We currently have our primary (core) transformers here in [jobspec/transformer](jobspec/transformer), however a registry that discovers jobspec-* named Python modules can allow an out-of-tree install and use of a transformer. This use case anticipates clusters with some custom or private logic that cannot be shared in a public GitHub repository.

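To make the plugin idea concrete, here is a standalone sketch. The class and method names below are hypothetical illustrations of the design (abstract steps mapped onto one cluster type, selected by name) and are not the actual classes shipped in [jobspec/transformer](jobspec/transformer):

```python
# A standalone sketch of the plugin idea only - the base class and method names here
# are hypothetical, not the API shipped in jobspec/transformer in this commit.
from abc import ABC, abstractmethod


class TransformerSketch(ABC):
    """Each transformer maps the abstract steps (write, stage, submit) onto one cluster type."""

    # transformers are selected and discovered by name (e.g., "flux", or a custom name)
    name = "abstract"

    @abstractmethod
    def write(self, filename, executable=False):
        """Write a script to the filesystem the job will see."""

    @abstractmethod
    def stage(self, filename):
        """Make the file available across nodes (ssh, filemap, object storage, ...)."""

    @abstractmethod
    def submit(self, filename, wait=False):
        """Hand the script to the workload manager, optionally waiting for it."""


class MyClusterTransformer(TransformerSketch):
    """What an out-of-tree plugin (e.g., a jobspec-mycluster package) might provide."""

    name = "mycluster"

    def write(self, filename, executable=False):
        print(f"would write {filename} (executable={executable})")

    def stage(self, filename):
        print(f"would stage {filename} across nodes")

    def submit(self, filename, wait=False):
        print(f"would submit {filename} (wait={wait})")
```

An out-of-tree package would provide something like the second class, and the registry would discover it by name.
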
#### How do you know this is a good idea?

I don't, and won't until I try it for experiments. I decided to try something like it after several days of preparing for experiments, and realizing that this transformation layer was entirely missing.

### Means of Interaction

There are several likely means of interacting with this library:

- As a service that runs at some frequency to receive jobs (written as a loop in Python in some context, sketched below)
- As a cron job that does the same (an entry to crontab to run "jobspec" at some frequency)
- As a one-off run (a single run of the above)

For the example usage here, and since the project I am working on is concerned with Flux, we will start with the simplest case - a client running inside a Flux instance (meaning it can import flux) that reads in a jobspec with a section defining a set of transforms, and then issues the commands to stage the setup and use Flux to run the work defined by the jobspec.

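As a minimal sketch of the first (service) pattern, assuming the `jobspec` CLI added in this commit is on the PATH, and using an illustrative watch directory and polling interval:

```python
# Minimal polling-loop sketch: shells out to the "jobspec run" CLI added in this commit.
# The watch directory, polling interval, and ".done" rename are illustrative assumptions.
import subprocess
import time
from pathlib import Path

WATCH_DIR = Path("/var/spool/jobspecs")  # illustrative location for incoming jobspecs
INTERVAL = 30  # seconds between polls


def run_once():
    for spec in sorted(WATCH_DIR.glob("*.yaml")):
        # flux is the CLI's default transformer; -t selects another one
        subprocess.run(["jobspec", "run", "-t", "flux", str(spec)], check=False)
        # Rename so the same jobspec is not submitted again on the next pass
        spec.rename(spec.with_name(spec.name + ".done"))


if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(INTERVAL)
```

The cron variant is the same idea with the loop replaced by a crontab entry that invokes `jobspec run` at some frequency, and a one-off run is a single invocation of the same command.
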
## TODO

- write the hello world example with flux
- add the staging example
- write the same, but using batch

## Developer

### Organization

While you can write an external transformer (as a plugin), a set of core transformers are provided here:

- [jobspec/transformer](jobspec/transformer): core transformer classes that ship internally here.

## License
40 changes: 40 additions & 0 deletions examples/hello-world-jobspec.yaml
@@ -0,0 +1,40 @@
# TODO test these out on x86, then create arm specs
version: 1
resources:
  - count: 4
    type: node
    with:
      - count: 1
        label: hello-world
        type: slot
        with:
          - count: 4
            type: core

task:
  command: ["/bin/bash", "job.sh"]
  transform:
    - step: write
      filename: job.sh
      executable: true

    # This is only provided as an example - in the devcontainer it's just one physical machine!
    #- step: stage
    #  filename: install.sh

    - step: submit
      filename: job.sh

  scripts:
    - name: job.sh
      content: |
        #!/bin/bash
        echo hello world from $(hostname)
  count:
    per_slot: 1
  resources:
    hardware:
      hardware.gpu.available: 'no'
    io.archspec:
      cpu.target: amd64
  slot: hello-world
41 changes: 41 additions & 0 deletions examples/hello-world-wait-jobspec.yaml
@@ -0,0 +1,41 @@
# TODO test these out on x86, then create arm specs
version: 1
resources:
  - count: 4
    type: node
    with:
      - count: 1
        label: hello-world
        type: slot
        with:
          - count: 4
            type: core

task:
  command: ["/bin/bash", "job.sh"]
  transform:
    - step: write
      filename: job.sh
      executable: true

    # This is only provided as an example - in the devcontainer it's just one physical machine!
    #- step: stage
    #  filename: install.sh

    - step: submit
      filename: job.sh
      wait: true

  scripts:
    - name: job.sh
      content: |
        #!/bin/bash
        echo hello world from $(hostname)
  count:
    per_slot: 1
  resources:
    hardware:
      hardware.gpu.available: 'no'
    io.archspec:
      cpu.target: amd64
  slot: hello-world
106 changes: 106 additions & 0 deletions jobspec/cli/__init__.py
@@ -0,0 +1,106 @@
#!/usr/bin/env python

import argparse
import os
import sys

import jobspec
from jobspec.logger import setup_logger


def get_parser():
    parser = argparse.ArgumentParser(
        description="Jobspec",
        formatter_class=argparse.RawTextHelpFormatter,
    )

    # Global Variables
    parser.add_argument(
        "--debug",
        dest="debug",
        help="use verbose logging to debug.",
        default=False,
        action="store_true",
    )

    parser.add_argument(
        "--quiet",
        dest="quiet",
        help="suppress additional output.",
        default=False,
        action="store_true",
    )
    parser.add_argument(
        "--version",
        dest="version",
        help="show software version.",
        default=False,
        action="store_true",
    )

    subparsers = parser.add_subparsers(
        help="actions",
        title="actions",
        description="actions",
        dest="command",
    )
    subparsers.add_parser("version", description="show software version")

    # Maybe this warrants a better name, but this seems to be what we'd want to do -
    # run a jobspec
    run = subparsers.add_parser(
        "run",
        formatter_class=argparse.RawTextHelpFormatter,
        description="receive and run a jobspec",
    )
    run.add_argument("-t", "--transform", help="transformer to use", default="flux")
    run.add_argument("jobspec", help="jobspec yaml file", default="jobspec.yaml")
    return parser


def run_jobspec():
    """
    This is the main entrypoint.
    """
    parser = get_parser()

    def help(return_code=0):
        """
        Print help, including the software version and active client,
        and exit with the return code.
        """
        version = jobspec.__version__

        print("\nJobspec v%s" % version)
        parser.print_help()
        sys.exit(return_code)

    # If the user didn't provide any arguments, show the full help
    if len(sys.argv) == 1:
        help()

    # If an error occurs while parsing the arguments, the interpreter will exit with value 2
    args, extra = parser.parse_known_args()

    if args.debug is True:
        os.environ["MESSAGELEVEL"] = "DEBUG"

    # Show the version and exit
    if args.command == "version" or args.version:
        print(jobspec.__version__)
        sys.exit(0)

    setup_logger(
        quiet=args.quiet,
        debug=args.debug,
    )

    # Here we can assume instantiated to get args
    if args.command == "run":
        from .run import main
    else:
        help(1)
    main(args, extra)


if __name__ == "__main__":
    run_jobspec()
Binary file added jobspec/cli/__pycache__/run.cpython-310.pyc
Binary file not shown.
