WIP to add consistent run (#21)
* WIP to add consistent run

the server is currently too flakey with the port forward to be reliable for communication.
I need to rethink how to do this because I am not happy with it.

Signed-off-by: vsoch <[email protected]>
vsoch authored Jan 23, 2023
1 parent 1c49436 commit 2e73ab6
Showing 53 changed files with 3,592 additions and 247 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@ and **Merged pull requests**. Critical items to know are:
The versions coincide with releases on pip. Only major versions will be released as tags on Github.

## [0.0.x](https://github.com/converged-computing/flux-cloud/tree/main) (0.0.x)
- support for submit and batch, to run jobs on the same MiniCluster (0.1.15)
- minikube docker pull needs message, update tests and typo (0.1.14)
- wait until pods terminated and removed between applies (0.1.13)
- add support for custom placement group name (0.1.12)
68 changes: 65 additions & 3 deletions docs/getting_started/commands.md
@@ -1,6 +1,11 @@
# Commands

The following commands are provided by Flux Cloud.
The following commands are provided by Flux Cloud. For running jobs, you can either do:

- **apply**/**run**: Single- or multi-job submission intended for jobs that use different containers, re-creating pods each time.
- **batch**/**submit**: Single- or multi-job submission intended for jobs that share a common container base, reusing the same set of pods.

Both are described in the following sections.

## list

@@ -43,6 +48,8 @@ $ flux-cloud apply -e k8s-size-8-m5.large --size 2

## run

> Up, apply, down in one command, ideal for completely headless runs and jobs with different containers.

The main command is a "run" that is going to, for each cluster:

1. Create the cluster
@@ -131,7 +138,9 @@ $ flux-cloud up -e n1-standard-1-2 --force-cluster

## apply

And then run experiments (as you feel) with "apply."
> Ideal for running multiple jobs with different containers.

After "up" you can choose to run experiments (as you feel) with "apply."

```bash
$ flux-cloud apply
@@ -150,9 +159,61 @@ To force overwrite of existing results (by default they are skipped)
$ flux-cloud apply -e n1-standard-1-2 --force
```

Note that by default, we always wait for a previous run to be cleaned up
Apply creates one CRD per job, so that's a lot of
pod creation and deletion. This is in comparison to "submit", which
brings up a MiniCluster once and then submits commands to it, allowing
Flux to serve as the scheduler. Note that by default, we always wait for a previous run to be cleaned up
before continuing.

## submit

> Ideal for one or more commands across the same container(s) and MiniCluster size.

```bash
$ flux-cloud up --cloud minikube
$ flux-cloud submit --cloud minikube
$ flux-cloud down --cloud minikube
```

Submit always checks whether the MiniCluster has already been created and, if not, creates it
before submitting jobs. For submit (and for batch, its equivalent that also brings the cluster up and down),
your commands aren't provided in the CRD
but are instead sent to the Flux Restful API. Submit and batch generate one CRD
per MiniCluster size and reuse the same MiniCluster across jobs. This is different
from apply, which generates one CRD per job to run.
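
To make the difference concrete, here is a minimal sketch that counts how many CRDs each mode would generate. The `sizes` and `jobs` arrays are made up for illustration (they are not read from an experiments.yaml), the script does not call flux-cloud, and it assumes every job is run at every MiniCluster size:

```bash
#!/bin/bash

# Hypothetical experiment matrix, for illustration only
sizes=(2 4 8)                 # MiniCluster sizes
jobs=(job-a job-b job-c)      # made-up job names

# apply: one CRD per job (assumed here: every job runs at every size)
apply_crds=$(( ${#sizes[@]} * ${#jobs[@]} ))

# submit / batch: one CRD per MiniCluster size, shared across jobs
submit_crds=${#sizes[@]}

echo "apply:  ${apply_crds} CRDs"
echo "submit: ${submit_crds} CRDs"
```

Under these assumptions the gap grows with the number of jobs, since submit only scales with the number of MiniCluster sizes.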

## batch

> Up, submit, down in one command, ideal for jobs with the same container(s)

The "batch" command is comparable to "run" except we are running commands
across the same set of containers. We don't need to bring pods up/down each time,
and we are using Flux in our cluster to handle scheduling.
This command is going to:

1. Create the cluster
2. Run each of the experiments, saving output and timing, on the same pods
3. Bring down the cluster

The output is organized in the same way, and as before, you can choose to run a single
command with "submit"

```bash
$ flux-cloud batch --cloud aws
```

Note that since we are communicating with the Flux Restful API, you are required to
provide a `FLUX_USER` and `FLUX_TOKEN` for the API. If you are running this programmatically,
the Flux Restful Client will handle this. However, if you (for example) press Ctrl+C to
cancel a run, you'll need to copy and paste the username and token that were previously shown
before running submit again to continue where you left off. Batch is equivalent to:

```bash
$ flux-cloud up
$ flux-cloud submit
$ flux-cloud down
```
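
If you are resuming after a cancelled run, a small guard like the one below can confirm that the credentials are present before you call submit again. This is only a sketch: it assumes `FLUX_USER` and `FLUX_TOKEN` are passed as environment variables, and `have_flux_credentials` is a hypothetical helper, not part of flux-cloud.

```bash
#!/bin/bash

# Hypothetical helper: succeeds only if both credentials are set and non-empty.
# Assumes FLUX_USER and FLUX_TOKEN are provided through the environment.
have_flux_credentials() {
    [[ -n "${FLUX_USER:-}" && -n "${FLUX_TOKEN:-}" ]]
}

if have_flux_credentials; then
    echo "Credentials found, you can re-run: flux-cloud submit"
else
    echo "Export FLUX_USER and FLUX_TOKEN (as shown by the earlier run) first" >&2
fi
```

The check only tests that the values exist. It cannot tell whether they match the running MiniCluster.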

## down

And then bring down your first (or named) cluster:
@@ -174,6 +235,7 @@ You can also use `--force-cluster` here:
$ flux-cloud down --force-cluster
```


## debug

For any command, you can add `--debug` as a main client argument to see additional information. E.g.,
3 changes: 2 additions & 1 deletion docs/getting_started/examples.md
@@ -3,8 +3,9 @@
The easiest thing to do is arguably to start with an example,
and then customize it. Here we will add examples as we create them.

- [up-apply-down](https://github.com/converged-computing/flux-cloud/tree/main/examples/up-apply-down)
- [up-apply-down](https://github.com/converged-computing/flux-cloud/tree/main/examples/up-apply-down): shows using `flux-cloud apply` for individual CRD submission.
- [osu-benchmarks](https://github.com/converged-computing/flux-cloud/tree/main/examples/osu-benchmarks)
- [up-submit-down](https://github.com/converged-computing/flux-cloud/tree/main/examples/up-submit-down): shows using `flux-cloud submit` for batch submission.

The above example runs a single command in a single Kubernetes cluster and MiniCluster,
and it's lammps!
35 changes: 31 additions & 4 deletions docs/getting_started/minikube.md
@@ -2,10 +2,17 @@

> Running on a local MiniKube cluster
Flux Cloud (as of version 0.1.0) can run on MiniKube! The main steps of running experiments are:
Flux Cloud (as of version 0.1.0) can run on MiniKube! The main steps of running experiments with
different container bases are:

- **up** to bring up a cluster
- **apply** to apply one or more experiments defined by an experiments.yaml
- **apply** to apply one or more CRDs from experiments defined by an experiments.yaml
- **down** to destroy a cluster

or one or more commands with the same container base(s):

- **up** to bring up a cluster
- **submit** to submit one or more experiments to the same set of pods defined by an experiments.yaml
- **down** to destroy a cluster

Each of these commands can be run in isolation, and we provide a single command **run** to
@@ -19,7 +26,6 @@ want to remove the abstraction at any point and run the commands on your own, yo
You should first [install minikube](https://minikube.sigs.k8s.io/docs/start/)
and kubectl.


## Run Experiments

Each experiment is defined by the matrix and variables in an `experiment.yaml` that is used to
@@ -29,7 +35,11 @@ provide this library for you to easily edit and use! Take a look at the [example
directory for a few that we provide. We will walk through a generic one here to launch
an experiment on a MiniKube Kubernetes cluster. Note that before doing this step you should
have installed flux-cloud, along with kubectl and minikube. Note that if it's not the default,
you'll need to specify using MiniKube:
you'll need to specify using MiniKube

### Apply / Run

> Ideal if you need to run multiple jobs on different containers

```bash
$ flux-cloud run --cloud minikube experiments.yaml
@@ -108,3 +118,20 @@ spec:
workingDir: /home/flux/examples/reaxff/HNS
command: {{ job.command }}
```

### Submit

> Ideal for one or more commands across the same container(s) and MiniCluster size.

```bash
$ flux-cloud up --cloud minikube
$ flux-cloud submit --cloud minikube
$ flux-cloud down --cloud minikube
```

Submit always checks whether the MiniCluster has already been created and, if not, creates it
before submitting jobs. For submit (and for batch, its equivalent that also brings the cluster up and down),
your commands aren't provided in the CRD
but are instead sent to the Flux Restful API. Submit and batch generate one CRD
per MiniCluster size and reuse the same MiniCluster across jobs. This is different
from apply, which generates one CRD per job to run.
18 changes: 16 additions & 2 deletions docs/index.rst
@@ -18,16 +18,30 @@ and save the output, and bring it down. This is what flux cloud does! With Flux
4. Run the experiments (each a MiniCluster) and save output and timings.
5. Bring down the cluster as soon as you are done.

For all of the above, you can either run with one command `flux-cloud run` or break into three:
For all of the above, there are two modes of execution. If you have different containers you want to run for jobs,
then you would want to use **run** or **apply** to create separate sets of pods, each time bringing them up and down.
That can be done either with the single command `flux-cloud run` or broken into three:

.. code-block:: console

    $ flux-cloud up
    $ flux-cloud apply
    $ flux-cloud down

If you instead want to run one or more commands *across the same set of pods*, meaning that your container
base(s) do not need to change, you can use **submit**:

And given any failure of a command, you are given the option to try again or exit and cancel. E.g.,
.. code-block:: console

    $ flux-cloud up
    $ flux-cloud submit
    $ flux-cloud down

For the single-command equivalent, use `flux-cloud batch`. The difference in this second mode is that we actually
use Flux as the scheduler, and runs are much more efficient because we don't need to bring pods down and bring them
back up each time.

For either approach, given any failure of a command, you are given the option to try again or exit and cancel. E.g.,
when you are developing, you can run "apply" and then easily debug until you are done and ready to bring the cluster
down.

62 changes: 62 additions & 0 deletions examples/up-submit-down/README.md
@@ -0,0 +1,62 @@
# Up, Submit, Down

This is an example of using flux cloud to bring up a cluster, install the Flux Operator
(and then you would use it as you please) and run jobs with submit (on the same
MiniCluster) and then bring it down.
You should have kubectl and gcloud OR minikube installed for this demo. Note that
we use the [experiments.yaml](experiments.yaml) file as a default,
and we only provide basic metadata needed for a single experiment.

## Up

```bash
$ flux-cloud up
```

This will bring up your cluster, per the size and machine type defined
in your experiments file, and install the operator.

## Submit

A "submit" means running the single (or multiple) experiments defined in your
experiments.yaml on the same MiniCluster, without bringing it down between jobs.
This means we are using Flux as the scheduler proper, and we don't need to bring pods
up and down unnecessarily (and submit a gazillion YAML files). The only YAML CRDs
needed are one for each MiniCluster size you run across.

```bash
$ flux-cloud submit --cloud minikube
$ flux-cloud submit --cloud google
```

## Down

To bring it down:

```bash
$ flux-cloud down
```

## Batch

Run all three with one command:

```bash
$ flux-cloud batch --cloud minikube
$ flux-cloud batch --cloud google
```


## Plot

I threw together a script to compare running times with info and output times,
where:

running time < info < output

```bash
$ pip install pandas matplotlib seaborn
```
```bash
$ python plot_results.py data/k8s-size-4-n1-standard-1/meta.json
```
@@ -0,0 +1,12 @@
#!/bin/bash

NAMESPACE="flux-operator"
JOB="lammps-job"
brokerPrefix="${JOB}-0"

for pod in $(kubectl get pods --namespace "${NAMESPACE}" --field-selector=status.phase=Running --output=jsonpath='{.items[*].metadata.name}'); do
    if [[ "${pod}" == "${brokerPrefix}"* ]]; then
        echo "${pod}"
        break
    fi
done