WIP to add consistent run (#21)

* WIP to add consistent run

The server is currently too flaky with the port forward to be reliable for communication. I need to rethink how to do this because I am not happy with it.

Signed-off-by: vsoch <[email protected]>
converged-computing · Jan 23, 2023 · 2e73ab6
1 parent 1c49436
commit 2e73ab6

Showing 53 changed files with 3,592 additions and 247 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -14,6 +14,7 @@ and **Merged pull requests**. Critical items to know are:
 The versions coincide with releases on pip. Only major versions will be released as tags on Github.
 
 ## [0.0.x](https://github.com/converged-computing/flux-cloud/tree/main) (0.0.x)
+ - support for submit and batch, to run jobs on the same MiniCluster (0.1.15)
  - minikube docker pull needs message, update tests and typo (0.1.14)
  - wait until pods terminated and removed between applies (0.1.13)
  - add support for custom placement group name (0.1.12)

diff --git a/docs/getting_started/commands.md b/docs/getting_started/commands.md
@@ -1,6 +1,11 @@
 # Commands
 
-The following commands are provided by Flux Cloud.
+The following commands are provided by Flux Cloud. For running jobs, you can either do:
+
+- **apply**/**run**: A single/multi job submission intended for different containers, re-creating pods each time.
+- **batch**/**submit**: A single/multi job submission intended for a common container base where we use the same set of pods.
+
+Both are described in the following sections.
 
 ## list
 
@@ -43,6 +48,8 @@ $ flux-cloud apply -e k8s-size-8-m5.large --size 2
 
 ## run
 
+> Up, apply, down in one command, ideal for completely headless runs and jobs with different containers.
+
 The main command is a "run" that is going to, for each cluster:
 
 1. Create the cluster
@@ -131,7 +138,9 @@ $ flux-cloud up -e n1-standard-1-2 --force-cluster
 
 ## apply
 
-And then run experiments (as you feel) with "apply."
+> Ideal for running multiple jobs with different containers.
+
+After "up" you can choose to run experiments (as you feel) with "apply."
 
 ```bash
 $ flux-cloud apply
@@ -150,9 +159,61 @@ To force overwrite of existing results (by default they are skipped)
 $ flux-cloud apply -e n1-standard-1-2 --force
 ```
 
-Note that by default, we always wait for a previous run to be cleaned up
+Apply creates one CRD per job, so that's a lot of
+pod creation and deletion. This is in comparison to "submit", which
+brings up a MiniCluster once and then executes commands on it, allowing
+Flux to serve as the scheduler. Note that by default, we always wait for a previous run to be cleaned up
 before continuing.
 
+## submit
+
+> Ideal for one or more commands across the same container(s) and MiniCluster size.
+
+```bash
+$ flux-cloud up --cloud minikube
+$ flux-cloud submit --cloud minikube
+$ flux-cloud down --cloud minikube
+```
+
+Submit always checks whether the MiniCluster is already created and, if not, creates it
+before submitting jobs. For submit (and for batch, which also brings the cluster up and down)
+your commands aren't provided in the CRD,
+but rather sent to the Flux Restful API. Submit / batch will also generate one CRD
+per MiniCluster size, but use the same MiniCluster across jobs. This is different
+from apply, which generates one CRD per job to run.
+
+## batch
+
+> Up, submit, down in one command, ideal for jobs with the same container(s)
+
+The "batch" command is comparable to "run" except we are running commands
+across the same set of containers. We don't need to bring pods up/down each time,
+and we are using Flux in our cluster to handle scheduling.
+This command is going to:
+
+1. Create the cluster
+2. Run each of the experiments, saving output and timing, on the same pods
+3. Bring down the cluster
+
+The output is organized in the same way, and as before, you can choose to run a single
+command with "submit".
+
+```bash
+$ flux-cloud batch --cloud aws
+```
+
+Note that since we are communicating with the Flux Restful API, you are required to
+provide a `FLUX_USER` and `FLUX_TOKEN` for the API. If you are running this programmatically,
+the Flux Restful Client will handle this. However, if you, for example, press Ctrl+C to
+cancel a run, you'll need to copy and paste the username and token that were previously shown
+before running submit again to continue where you left off. Batch is equivalent to:
+
+```bash
+$ flux-cloud up
+$ flux-cloud submit
+$ flux-cloud down
+```
+
 ## down
 
 And then bring down your first (or named) cluster:
@@ -174,6 +235,7 @@ You can also use `--force-cluster` here:
 $ flux-cloud down --force-cluster
 ```
 
+
 ## debug
 
 For any command, you can add `--debug` as a main client argument to see additional information. E.g.,
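
The commands.md hunk above documents that batch is equivalent to up, submit, down. A minimal sketch of that sequence as a shell function follows, written so that `down` is still attempted if `submit` fails (a design choice of this sketch, not a documented behavior of `flux-cloud batch`). The function name `up_submit_down` and the `FLUX_CLOUD_BIN` override are hypothetical and only here for illustration; the subcommands and the `--cloud` flag are the ones shown in the docs.

```bash
#!/bin/bash
# Sketch only: FLUX_CLOUD_BIN and up_submit_down are hypothetical names.
# FLUX_CLOUD_BIN lets you swap in a stub (e.g., echo) for a dry run.
FLUX_CLOUD_BIN="${FLUX_CLOUD_BIN:-flux-cloud}"

up_submit_down() {
    local cloud="$1"
    local status=0

    # Bring up the cluster; stop here if that fails.
    "${FLUX_CLOUD_BIN}" up --cloud "${cloud}" || return 1

    # Submit jobs to the same MiniCluster, remembering any failure.
    "${FLUX_CLOUD_BIN}" submit --cloud "${cloud}" || status=$?

    # Always attempt to bring the cluster down, even if submit failed.
    "${FLUX_CLOUD_BIN}" down --cloud "${cloud}" || status=$?

    return "${status}"
}
```

Calling `up_submit_down minikube` would run the three steps in order. Setting `FLUX_CLOUD_BIN=echo` first prints the commands instead of running them, which is a cheap way to check the sequence without a cluster.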

diff --git a/docs/getting_started/examples.md b/docs/getting_started/examples.md
@@ -3,8 +3,9 @@
 The easiest thing to do is arguably to start with an example,
 and then customize it. Here we will add examples as we create them.
 
-- [up-apply-down](https://github.com/converged-computing/flux-cloud/tree/main/examples/up-apply-down)
+- [up-apply-down](https://github.com/converged-computing/flux-cloud/tree/main/examples/up-apply-down): shows using `flux-cloud apply` for individual CRD submission.
 - [osu-benchmarks](https://github.com/converged-computing/flux-cloud/tree/main/examples/osu-benchmarks)
+- [up-submit-down](https://github.com/converged-computing/flux-cloud/tree/main/examples/up-submit-down): shows using `flux-cloud submit` for batch submission.
 
 The above example runs a single command in a single Kubernetes cluster and MiniCluster,
 and it's lammps!

diff --git a/docs/getting_started/minikube.md b/docs/getting_started/minikube.md
@@ -2,10 +2,17 @@
 
 > Running on a local MiniKube cluster
 
-Flux Cloud (as of version 0.1.0) can run on MiniKube! The main steps of running experiments are:
+Flux Cloud (as of version 0.1.0) can run on MiniKube! The main steps of running experiments with
+different container bases are:
 
  - **up** to bring up a cluster
- - **apply** to apply one or more experiments defined by an experiments.yaml
+ - **apply** to apply one or more CRDs from experiments defined by an experiments.yaml
+ - **down** to destroy a cluster
+
+or one or more commands with the same container base(s):
+
+ - **up** to bring up a cluster
+ - **submit** to submit one or more experiments to the same set of pods defined by an experiments.yaml
  - **down** to destroy a cluster
 
 Each of these commands can be run in isolation, and we provide a single command **run** to
@@ -19,7 +26,6 @@ want to remove the abstraction at any point and run the commands on your own, yo
 You should first [install minikube](https://minikube.sigs.k8s.io/docs/start/)
 and kubectl.
 
-
 ## Run Experiments
 
 Each experiment is defined by the matrix and variables in an `experiment.yaml` that is used to
@@ -29,7 +35,11 @@ provide this library for you to easily edit and use! Take a look at the [example
 directory for a few that we provide. We will walk through a generic one here to launch
 an experiment on a MiniKube Kubernetes cluster. Note that before doing this step you should
 have installed flux-cloud, along with kubectl and minikube. Note that if it's not the default,
-you'll need to specify using MiniKube:
+you'll need to specify using MiniKube.
+
+### Apply / Run
+
+> Ideal if you need to run multiple jobs on different containers
 
 ```bash
 $ flux-cloud run --cloud minikube experiments.yaml
@@ -108,3 +118,20 @@ spec:
       workingDir: /home/flux/examples/reaxff/HNS
       command: {{ job.command }}
 ```
+
+### Submit
+
+> Ideal for one or more commands across the same container(s) and MiniCluster size.
+
+```bash
+$ flux-cloud up --cloud minikube
+$ flux-cloud submit --cloud minikube
+$ flux-cloud down --cloud minikube
+```
+
+Submit always checks whether the MiniCluster is already created and, if not, creates it
+before submitting jobs. For submit (and for batch, which also brings the cluster up and down)
+your commands aren't provided in the CRD,
+but rather sent to the Flux Restful API. Submit / batch will also generate one CRD
+per MiniCluster size, but use the same MiniCluster across jobs. This is different
+from apply, which generates one CRD per job to run.
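
Submit talks to the Flux Restful API, and commands.md in this same commit notes that a `FLUX_USER` and `FLUX_TOKEN` are required when resuming a cancelled run. A small guard before re-running `flux-cloud submit` might look like the following. This is a sketch that assumes the two values are supplied as environment variables (the docs name them but do not say how they are passed), and `require_flux_credentials` is a hypothetical helper, not part of flux-cloud.

```bash
#!/bin/bash
# Hypothetical helper: fail early if the API credentials are missing.
# Assumes FLUX_USER and FLUX_TOKEN are provided via the environment.
require_flux_credentials() {
    local missing=0
    if [ -z "${FLUX_USER:-}" ]; then
        echo "FLUX_USER is not set" >&2
        missing=1
    fi
    if [ -z "${FLUX_TOKEN:-}" ]; then
        echo "FLUX_TOKEN is not set" >&2
        missing=1
    fi
    # Returns 0 only when both variables are non-empty.
    return "${missing}"
}
```

It could be chained as `require_flux_credentials && flux-cloud submit --cloud minikube`, so that a resumed run stops with a clear message instead of failing against the API.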
diff --git a/docs/index.rst b/docs/index.rst
@@ -18,16 +18,30 @@ and save the output, and bring it down. This is what flux cloud does! With Flux
 4. Run the experiments (each a MiniCluster) and save output and timings.
 5. Bring down the cluster as soon as you are done.
 
-For all of the above, you can either run with one command `flux-cloud run` or break into three:
+For all of the above, there are two modes of execution. If you have different containers you want to run for jobs,
+then you would want to use **run** or **apply** to create separate sets of pods, each time bringing them up and down.
+That can be done either with one command, `flux-cloud run`, or broken into three:
 
 .. code-block:: console
 
     $ flux-cloud up
     $ flux-cloud apply
     $ flux-cloud down
 
+If you want to instead run one or more commands *across the same set of pods*, meaning that your container
+base(s) do not need to change, you can use **submit**:
 
-And given any failure of a command, you are given the option to try again or exit and cancel. E.g.,
+.. code-block:: console
+
+    $ flux-cloud up
+    $ flux-cloud submit
+    $ flux-cloud down
+
+And for the single command equivalent, do `flux-cloud batch`. The difference in the latter is that we will actually
+be using Flux as a scheduler, and have much more efficient runs in that we don't need to bring down pods and bring them
+back up each time.
+
+For either approach, given any failure of a command, you are given the option to try again or exit and cancel. E.g.,
 when you are developing, you can run "apply" and then easily debug until you are done and ready to bring the cluster
 down.
 

diff --git a/examples/up-submit-down/README.md b/examples/up-submit-down/README.md
@@ -0,0 +1,62 @@
+# Up, Submit, Down
+
+This is an example of using flux cloud to bring up a cluster, install the Flux Operator
+(and then you would use it as you please) and run jobs with submit (on the same
+MiniCluster) and then bring it down.
+You should have kubectl and gcloud OR minikube installed for this demo. Note that
+we use the [experiments.yaml](experiments.yaml) file as a default,
+and we only provide basic metadata needed for a single experiment.
+
+## Up
+
+```bash
+$ flux-cloud up
+```
+
+This will bring up your cluster, per the size and machine type defined
+in your experiments file, and install the operator.
+
+## Submit
+
+A "submit" means running the single (or multiple) experiments defined in your
+experiments.yaml on the same MiniCluster, without bringing it down between jobs.
+This means we are using Flux as the scheduler proper, and we don't need to bring pods
+up and down unnecessarily (and submit a gazillion YAML files). The only YAML CRDs
+needed are one for each size of MiniCluster you run across.
+
+```bash
+$ flux-cloud submit --cloud minikube
+$ flux-cloud submit --cloud google
+```
+
+## Down
+
+To bring it down:
+
+```bash
+$ flux-cloud down
+```
+
+## Batch
+
+Run all three with one command:
+
+```bash
+$ flux-cloud batch --cloud minikube
+$ flux-cloud batch --cloud google
+```
+
+
+## Plot
+
+I threw together a script to compare running times with info and output times,
+where:
+
+running time < info < output
+
+```bash
+$ pip install pandas matplotlib seaborn
+```
+```bash
+$ python plot_results.py data/k8s-size-4-n1-standard-1/meta.json
+```
diff --git a/examples/up-submit-down/data/k8s-size-4-n1-standard-1/.scripts/broker-id.sh b/examples/up-submit-down/data/k8s-size-4-n1-standard-1/.scripts/broker-id.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+
+NAMESPACE="flux-operator"
+JOB="lammps-job"
+brokerPrefix="${JOB}-0"
+
+for pod in $(kubectl get pods --namespace ${NAMESPACE} --field-selector=status.phase=Running --output=jsonpath='{.items[*].metadata.name}'); do
+    if [[ "${pod}" == ${brokerPrefix}* ]]; then
+        echo ${pod}
+        break
+    fi
+done
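
The loop in broker-id.sh could be factored so that the prefix match is testable without a cluster. The sketch below is one assumed way to do that, not the script's actual structure: `first_with_prefix` is a hypothetical helper that takes the prefix and a list of pod names as arguments and prints the first match. It quotes the prefix so that it is compared literally rather than as a glob pattern.

```bash
#!/bin/bash
# Hypothetical helper: print the first name starting with the given prefix.
first_with_prefix() {
    local prefix="$1"
    shift
    local name
    for name in "$@"; do
        case "${name}" in
            "${prefix}"*)
                echo "${name}"
                return 0
                ;;
        esac
    done
    # No name matched the prefix.
    return 1
}
```

With a running cluster it could be called as `first_with_prefix "lammps-job-0" $(kubectl get pods --namespace flux-operator --field-selector=status.phase=Running --output=jsonpath='{.items[*].metadata.name}')`, reusing the kubectl query from the script. The non-zero return when nothing matches lets a caller distinguish "no broker pod yet" from an empty name.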