core: add core of transform layer with steps
This is for Flux, which probably is not as important as other
use cases since it understands a jobspec, but it demonstrates
the basic staging, writing, and submission of a script. I will
add other steps I need now, and then likely other transformers
for things that are not Flux.

Signed-off-by: vsoch <[email protected]>
vsoch committed Mar 18, 2024
1 parent 52f8b7a commit 3ad8df9
Showing 33 changed files with 1,282 additions and 70 deletions.
20 changes: 20 additions & 0 deletions .devcontainer/Dockerfile
@@ -0,0 +1,20 @@
FROM fluxrm/flux-sched:jammy

LABEL maintainer="Vanessasaurus <@vsoch>"

# Match the default user id for a single system so we aren't root
ARG USERNAME=vscode
ARG USER_UID=1000
ARG USER_GID=1000
ENV USERNAME=${USERNAME}
ENV USER_UID=${USER_UID}
ENV USER_GID=${USER_GID}
USER root
RUN apt-get update && apt-get install -y less python3-pip

# Add the group and user that match our ids
RUN groupadd -g ${USER_GID} ${USERNAME} && \
adduser --disabled-password --uid ${USER_UID} --gid ${USER_GID} --gecos "" ${USERNAME} && \
echo "${USERNAME} ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers

USER $USERNAME
18 changes: 18 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,18 @@
{
  "name": "Jobspec Developer Environment",
  "dockerFile": "Dockerfile",
  "context": "../",

  "customizations": {
    "vscode": {
      "settings": {
        "terminal.integrated.defaultProfile.linux": "bash"
      },
      "extensions": [
        "ms-vscode.cmake-tools",
        "golang.go"
      ]
    }
  },
  "postStartCommand": "git config --global --add safe.directory /workspaces/jobspec"
}
137 changes: 135 additions & 2 deletions README.md
@@ -6,10 +6,143 @@

![https://github.com/compspec/jobspec/blob/main/img/jobspec-bot.png?raw=true](https://github.com/compspec/jobspec/blob/main/img/jobspec-bot.png?raw=true)

This is intended to work with the prototype [rainbow](https://github.com/converged-computing/rainbow) scheduler.
This library includes a cluster-agnostic language to set up a job (one unit of work in a jobspec).
It is a transformational layer - a simple language that converts the steps needed to prepare a job
for a specific cluster's scheduler. If you think it looks too simple, then I'd say it's a success.

## Usage

### Example

Start up the development environment to find yourself in a container with flux. Start a test instance:

```bash
flux start --test-size=4
```

Note that we have 4 faux nodes and 40 faux cores.

```bash
flux resource list
```
```console
     STATE NNODES NCORES NGPUS NODELIST
      free      4     40     0 194c2b9f4f3c,194c2b9f4f3c,194c2b9f4f3c,194c2b9f4f3c
 allocated      0      0     0
      down      0      0     0
```

Ensure you have jobspec installed! Yes, we are in vscode, installing to the container, so we use sudo. YOLO.

```bash
sudo pip install -e .
```

We are going to run the [examples/hello-world-jobspec.yaml](examples/hello-world-jobspec.yaml). This setup is overly
complex for the task, because we don't actually need to do any staging or special work, but it's an example, so that is intended.
Also note that the design of this file is subject to change. For example, we don't have to include the transform directly in the
jobspec - it could be a file that the jobspec writes, after which the command is issued. I like it better as a piece of the jobspec,
so am putting it there for the time being, mostly because it looks nicer. I'm sure someone will disagree with me about that.

```bash
# Inferred from the CLI added in this commit; flux is the default transformer (see -t/--transform)
jobspec run examples/hello-world-jobspec.yaml
```

### Details

As an example, you *could* submit a job with a command ready to go - assuming your cluster already has the
software and files needed, and you just want to run it. But for submission to a cluster you haven't
set up, you might need the following logic:

1. Write a script to file that is intended to install something.
2. Stage this file across nodes.
3. Submit the script to all nodes to do the install.
4. Write a script to file for your actual job.
5. Again, stage this file across nodes (assuming no shared filesystem)
6. Submit the job, either as a submit or batch directive to a workload manager.

The way that you do this with every workload manager (or cluster, more generally) is going to vary
quite a bit. However, with a transformation - a mapping of abstract steps onto a specific cluster's
workload manager - you can write those steps out very simply:

```yaml
transform:

  - step: write
    filename: install.sh
    executable: true

  - step: stage
    filename: install.sh

  - step: submit
    filename: install.sh
    wait: true

  - step: write
    filename: job.sh
    executable: true

  - step: stage
    filename: job.sh

  - step: submit
    filename: job.sh
    wait: true
```
The above assumes we don't have a shared filesystem, and that the receiving cluster has some cluster-specific method for staging or
file mapping. It could be ssh, a filemap, or something else. For an ephemeral cluster API, it might be an interaction with
a storage provider, or just adding the file to an API call that will (in and of itself) do that creation, akin to a startup script for
an instance in Terraform. It really doesn't matter - the user can expect the file to be written and shared across nodes.

This is not intended to be a workflow or build tool - it is simply a transformational layer that a jobspec can provide
to set up a specific cluster environment. It works with a jobspec in that you define your filenames (scripts) in the tasks->scripts
directive. It also uses a plugin design, so a cluster or institution can write and install a custom transformer, and it will be discovered
by name. This is intended to work with the prototype [rainbow](https://github.com/converged-computing/rainbow) scheduler.
Jobspec is an entity of [flux-framework](https://flux-framework.org).

**under development**

### Frequently Asked Questions

#### Why not rely on Flux internals?

We want a Jobspec to be able to handle a transformation of some logic (the above) into an execution that might not involve Flux at all. It could be another workload manager (e.g., Slurm), or it could be a service that submits to some cloud batch API.

#### What are all the steps allowed?

They are currently shown in the example above, and better documentation will be written. Arguably, a transformation backend does not
need to support every kind of step; however, if you provide a Jobspec to a transformer with a step it does not support, you'll get an error.

#### Where are the different transformers defined?

We currently have our primary (core) transformers here in [jobspec/transformer](jobspec/transformer), however a registry that discovers jobspec-* named Python modules can allow an out-of-tree install and use of a transformer. This use case anticipates clusters with some custom or private logic that cannot be shared in a public GitHub repository.

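To make the plugin idea concrete, here is a standalone sketch. The class and method names below are hypothetical illustrations of the design (abstract steps mapped onto one cluster type, selected by name) and are not the actual classes shipped in [jobspec/transformer](jobspec/transformer):

```python
# A standalone sketch of the plugin idea only - the base class and method names here
# are hypothetical, not the API shipped in jobspec/transformer in this commit.
from abc import ABC, abstractmethod


class TransformerSketch(ABC):
    """Each transformer maps the abstract steps (write, stage, submit) onto one cluster type."""

    # transformers are selected and discovered by name (e.g., "flux", or a custom name)
    name = "abstract"

    @abstractmethod
    def write(self, filename, executable=False):
        """Write a script to the filesystem the job will see."""

    @abstractmethod
    def stage(self, filename):
        """Make the file available across nodes (ssh, filemap, object storage, ...)."""

    @abstractmethod
    def submit(self, filename, wait=False):
        """Hand the script to the workload manager, optionally waiting for it."""


class MyClusterTransformer(TransformerSketch):
    """What an out-of-tree plugin (e.g., a jobspec-mycluster package) might provide."""

    name = "mycluster"

    def write(self, filename, executable=False):
        print(f"would write {filename} (executable={executable})")

    def stage(self, filename):
        print(f"would stage {filename} across nodes")

    def submit(self, filename, wait=False):
        print(f"would submit {filename} (wait={wait})")
```

An out-of-tree package would provide something like the second class, and the registry would discover it by name.
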
#### How do you know this is a good idea?

I don't, and won't until I try it for experiments. I decided to try something like it after several days of preparing for experiments, and realizing that this transformation layer was entirely missing.

### Means of Interaction

There are several likely means of interacting with this library:

- As a service that runs at some frequency to receive jobs (written as a loop in Python in some context, sketched below)
- As a cron job that does the same (an entry to crontab to run "jobspec" at some frequency)
- As a one-off run (a single run of the above)

For the example usage here, and since the project I am working on is concerned with Flux, we will start with the simplest case - a client running inside a Flux instance (meaning it can import flux) that reads in a jobspec with a section defining a set of transforms, and then issues the commands to stage the setup and use Flux to run the work defined by the jobspec.

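As a minimal sketch of the first (service) pattern, assuming the `jobspec` CLI added in this commit is on the PATH, and using an illustrative watch directory and polling interval:

```python
# Minimal polling-loop sketch: shells out to the "jobspec run" CLI added in this commit.
# The watch directory, polling interval, and ".done" rename are illustrative assumptions.
import subprocess
import time
from pathlib import Path

WATCH_DIR = Path("/var/spool/jobspecs")  # illustrative location for incoming jobspecs
INTERVAL = 30  # seconds between polls


def run_once():
    for spec in sorted(WATCH_DIR.glob("*.yaml")):
        # flux is the CLI's default transformer; -t selects another one
        subprocess.run(["jobspec", "run", "-t", "flux", str(spec)], check=False)
        # Rename so the same jobspec is not submitted again on the next pass
        spec.rename(spec.with_name(spec.name + ".done"))


if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(INTERVAL)
```

The cron variant is the same idea with the loop replaced by a crontab entry that invokes `jobspec run` at some frequency, and a one-off run is a single invocation of the same command.
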
## TODO

- write the hello world example with flux
- add the staging example
- write the same, but using batch

## Developer

### Organization

While you can write an external transformer (as a plugin), a set of core transformers are provided here:

- [jobspec/transformer](jobspec/transformer): core transformer classes that ship internally here.

## License
40 changes: 40 additions & 0 deletions examples/hello-world-jobspec.yaml
@@ -0,0 +1,40 @@
# TODO test these out on x86, then create arm specs
version: 1
resources:
  - count: 4
    type: node
    with:
      - count: 1
        label: hello-world
        type: slot
        with:
          - count: 4
            type: core

task:
  command: ["/bin/bash", "job.sh"]
  transform:
    - step: write
      filename: job.sh
      executable: true

    # This is only provided as an example - in the devcontainer it's just one physical machine!
    #- step: stage
    #  filename: install.sh

    - step: submit
      filename: job.sh

  scripts:
    - name: job.sh
      content: |
        #!/bin/bash
        echo hello world from $(hostname)
  count:
    per_slot: 1
  resources:
    hardware:
      hardware.gpu.available: 'no'
    io.archspec:
      cpu.target: amd64
  slot: hello-world
41 changes: 41 additions & 0 deletions examples/hello-world-wait-jobspec.yaml
@@ -0,0 +1,41 @@
# TODO test these out on x86, then create arm specs
version: 1
resources:
  - count: 4
    type: node
    with:
      - count: 1
        label: hello-world
        type: slot
        with:
          - count: 4
            type: core

task:
  command: ["/bin/bash", "job.sh"]
  transform:
    - step: write
      filename: job.sh
      executable: true

    # This is only provided as an example - in the devcontainer it's just one physical machine!
    #- step: stage
    #  filename: install.sh

    - step: submit
      filename: job.sh
      wait: true

  scripts:
    - name: job.sh
      content: |
        #!/bin/bash
        echo hello world from $(hostname)
  count:
    per_slot: 1
  resources:
    hardware:
      hardware.gpu.available: 'no'
    io.archspec:
      cpu.target: amd64
  slot: hello-world
106 changes: 106 additions & 0 deletions jobspec/cli/__init__.py
@@ -0,0 +1,106 @@
#!/usr/bin/env python

import argparse
import os
import sys

import jobspec
from jobspec.logger import setup_logger


def get_parser():
    parser = argparse.ArgumentParser(
        description="Jobspec",
        formatter_class=argparse.RawTextHelpFormatter,
    )

    # Global Variables
    parser.add_argument(
        "--debug",
        dest="debug",
        help="use verbose logging to debug.",
        default=False,
        action="store_true",
    )

    parser.add_argument(
        "--quiet",
        dest="quiet",
        help="suppress additional output.",
        default=False,
        action="store_true",
    )
    parser.add_argument(
        "--version",
        dest="version",
        help="show software version.",
        default=False,
        action="store_true",
    )

    subparsers = parser.add_subparsers(
        help="actions",
        title="actions",
        description="actions",
        dest="command",
    )
    subparsers.add_parser("version", description="show software version")

    # Maybe this warrants a better name, but this seems to be what we'd want to do -
    # run a jobspec
    run = subparsers.add_parser(
        "run",
        formatter_class=argparse.RawTextHelpFormatter,
        description="receive and run a jobspec",
    )
    run.add_argument("-t", "--transform", help="transformer to use", default="flux")
    run.add_argument("jobspec", help="jobspec yaml file", default="jobspec.yaml")
    return parser


def run_jobspec():
    """
    This is the main entrypoint.
    """
    parser = get_parser()

    def help(return_code=0):
        """
        Print help, including the software version and active client,
        and exit with the return code.
        """
        version = jobspec.__version__

        print("\nJobspec v%s" % version)
        parser.print_help()
        sys.exit(return_code)

    # If the user didn't provide any arguments, show the full help
    if len(sys.argv) == 1:
        help()

    # If an error occurs while parsing the arguments, the interpreter will exit with value 2
    args, extra = parser.parse_known_args()

    if args.debug is True:
        os.environ["MESSAGELEVEL"] = "DEBUG"

    # Show the version and exit
    if args.command == "version" or args.version:
        print(jobspec.__version__)
        sys.exit(0)

    setup_logger(
        quiet=args.quiet,
        debug=args.debug,
    )

    # Here we can assume instantiated to get args
    if args.command == "run":
        from .run import main
    else:
        help(1)
    main(args, extra)


if __name__ == "__main__":
    run_jobspec()
Binary file added jobspec/cli/__pycache__/run.cpython-310.pyc
Binary file not shown.
