Commit
add support for job repeats (#6)
* add support for job repeats: this works by way of adding a suffix to the command name
* add support for aws eks
* tweak helpers path in minicluster-run and help of script
* generate cluster from config instead
* default coloring instead of fabulous - too much!
* repeats should not be deleted, otherwise only works for one run
* run missing experiment id option
* raise value errors instead, add debugging to show template

Signed-off-by: vsoch <[email protected]>
vsoch authored Jan 4, 2023
1 parent f17e0d6 commit 7fd2e0a
Showing 30 changed files with 783 additions and 142 deletions.
31 changes: 31 additions & 0 deletions .github/workflows/release.yaml
@@ -0,0 +1,31 @@
name: release cloud-select

on:
  release:
    types: [created]

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Install
        run: conda create --quiet --name fc twine

      - name: Install dependencies
        run: |
          export PATH="/usr/share/miniconda/bin:$PATH"
          source activate fc
          pip install -e .[all]
          pip install setuptools wheel twine

      - name: Build and publish
        env:
          TWINE_USERNAME: ${{ secrets.PYPI_USER }}
          TWINE_PASSWORD: ${{ secrets.PYPI_PASS }}
        run: |
          export PATH="/usr/share/miniconda/bin:$PATH"
          source activate fc
          python setup.py sdist bdist_wheel
          twine upload dist/*
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
flux_cloud.egg-info
.eggs
build
vendor
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@ and **Merged pull requests**. Critical items to know are:
The versions coincide with releases on pip. Only major versions will be released as tags on Github.

## [0.0.x](https://github.com/converged-computing/flux-cloud/tree/main) (0.0.x)
- support for Amazon EKS and running commands over iterations (0.0.12)
- better control of exit codes, addition of force cluster (0.0.11)
- support for experiment id selection, addition of osu-benchmarks example (0.0.1)
- initial skeleton release of project (0.0.0)
83 changes: 83 additions & 0 deletions docs/getting_started/aws.md
@@ -0,0 +1,83 @@
# AWS

> Running on Amazon Elastic Kubernetes Service (EKS)

The flux-cloud software provides easy wrappers (and templates) for running
the Flux Operator on Amazon. The main steps of running experiments are:

- **up** to bring up a cluster
- **apply** to apply one or more experiments defined by an experiments.yaml
- **down** to destroy a cluster

Each of these commands can be run in isolation, and we provide a single command **run** to
automate the entire workflow. We emphasize the term "wrapper" because we are using scripts on your
machine to do the work (e.g., kubectl and eksctl), and importantly, for every step we show
you the command and, if it fails, give you a chance to bail out. We do this so that if you
want to remove the abstraction at any point and run the commands on your own, you can.

## Pre-requisites

You should first [install eksctl](https://github.com/weaveworks/eksctl) and make sure you have access to an AWS cloud (e.g.,
with credentials or similar in your environment), e.g.:

```bash
export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
export AWS_SESSION_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

The session token may not be required, depending on your setup.
We assume you also have [kubectl](https://kubernetes.io/docs/tasks/tools/).

### Setup SSH

You'll need an ssh key for EKS. Here is how to generate it:

```bash
ssh-keygen
# Ensure you enter the path to ~/.ssh/id_eks
```

This is used so you can ssh (connect) to your workers!
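If you want flux-cloud to use this key, the settings support pointing at it via the `aws.ssh_key` field described in the settings documentation. A sketch, assuming the path from the step above:

```yaml
# settings.yml (excerpt): tell flux-cloud about the EKS ssh key generated above
aws:
  ssh_key: ~/.ssh/id_eks
```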

### Cloud

Finally, ensure that aws is either your default cloud (the `default_cloud` in your settings.yml)
or specify it with `--cloud` when you run commands.

## Run Experiments

**IMPORTANT**: for any experiment, when you choose an instance type, you absolutely
need to choose one that has [IsTrunkingCompatible](https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/aws/vpc/limits.go)
set to true. E.g., `m5.large` has it set to true, so it would work.

Each experiment is defined by the matrix and variables in an `experiments.yaml` that is used to
populate a `minicluster-template.yaml` that you can either provide yourself or use a template provided by the
library. One of the goals of the Flux Cloud Experiment runner is not just to run things, but to
provide this library for you to easily edit and use! Take a look at the [examples](../examples)
directory for a few that we provide. We will walk through a generic one here to launch
an experiment on a Kubernetes cluster. Note that before doing this step you should
have installed flux-cloud, along with kubectl and eksctl, and set your defaults (e.g., region)
in your settings.

```bash
$ flux-cloud run experiments.yaml
```

Note that since the experiments file defaults to that name, you can also just do:

```bash
$ flux-cloud run
```

This assumes an `experiments.yaml` in the present working directory. Take a look at an `experiments.yaml` in an example directory.
Note that machines and size are required for the matrix, and variables get piped into all experiments (in full). Under variables,
both "commands" and "ids" are required and must be equal in length (each command is assigned to one id
for output). To just run the first entry in the matrix (test mode), do:

```bash
$ flux-cloud run experiments.yaml --test
```

Note that you can also use the other commands in place of a single run, notably "up," "apply," and "down."
By default, results will be written to a temporary output directory, but you can customize this with `--outdir`.
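The constraints described above (machines and size required in the matrix, and "commands" and "ids" equal in length) can be sanity-checked with a few lines of Python. This is an illustrative sketch, not flux-cloud's actual validation code; the field names follow the docs:

```python
# Illustrative validator for the experiments.yaml structure described above.
# Not flux-cloud's real implementation.

def validate_experiments(spec):
    """Raise ValueError if the experiment spec violates the documented rules."""
    matrix = spec.get("matrix", {})
    for required in ("machines", "size"):
        if required not in matrix:
            raise ValueError(f"matrix.{required} is required")

    variables = spec.get("variables", {})
    commands = variables.get("commands")
    ids = variables.get("ids")
    if not commands or not ids:
        raise ValueError("variables.commands and variables.ids are required")
    if len(commands) != len(ids):
        raise ValueError("commands and ids must be equal in length")


spec = {
    "matrix": {"machines": ["m5.large"], "size": [2]},
    "variables": {
        "commands": ["./osu_get_latency"],
        "ids": ["osu-get-latency"],
    },
}
validate_experiments(spec)  # no exception: the spec is valid
```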
28 changes: 28 additions & 0 deletions docs/getting_started/commands.md
@@ -150,5 +150,33 @@ You can also use `--force-cluster` here:
$ flux-cloud down --force-cluster
```

## debug

For any command, you can add `--debug` as a main client argument to see additional information. E.g.,
the cluster config created for eksctl:

```bash
$ flux-cloud --debug up
```
```console
No experiment ID provided, assuming first experiment m5.large-2.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: flux-cluster
  region: us-east-1
  version: "1.23"
# availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1d"]
managedNodeGroups:
  - name: workers
    instanceType: m5.large
    minSize: 2
    maxSize: 2
    labels: { "fluxoperator": "true" }
...
```
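The commit notes that the cluster is now generated from a config. A simplified sketch of how a ClusterConfig like the one above might be rendered from an experiment's machine and size (illustrative only, not the library's actual template or code):

```python
# Render an eksctl ClusterConfig from an experiment entry.
# Illustrative sketch; flux-cloud's real template may differ.
CLUSTER_TEMPLATE = """apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: {name}
  region: {region}
  version: "{version}"
managedNodeGroups:
  - name: workers
    instanceType: {machine}
    minSize: {size}
    maxSize: {size}
    labels: {{ "fluxoperator": "true" }}
"""


def render_cluster_config(machine, size, name="flux-cluster",
                          region="us-east-1", version="1.23"):
    """Fill the template with one experiment's machine and size."""
    return CLUSTER_TEMPLATE.format(
        name=name, region=region, version=version, machine=machine, size=size
    )


print(render_cluster_config("m5.large", 2))
```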

And that's it! I think there might be a more elegant way to determine which cluster is running;
however, if the user decides to launch more than one, it might be harder. More thinking / docs / examples coming soon.
32 changes: 30 additions & 2 deletions docs/getting_started/experiments.md
@@ -70,6 +70,30 @@ minicluster:
  namespace: flux-operator
```
### Kubernetes

While it's recommended to define defaults for Kubernetes (e.g., version) in your `settings.yml`, you can override them per experiment
via a "cluster" attribute in your `experiments.yaml`. Unlike settings, this supports a field for "tags" that should be a list of strings:

```yaml
cluster:
  version: "1.23"
  tags:
    - lammps
```

Note that the above is for a Google GKE cluster, where tags is a simple list of strings. For AWS EKS, you need to provide key=value pairs:

```yaml
cluster:
  version: "1.22"
  tags:
    - analysis=lammps
```

This is validated at runtime when you create the cluster. In both cases, the tags are converted to comma-separated values
that are provided to the command line client.
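A sketch of that conversion, with GKE tags passed through as plain strings and EKS tags validated as key=value pairs before joining (illustrative, not flux-cloud's actual implementation):

```python
def tags_to_cli(tags, cloud):
    """Validate tags per cloud and join them for the command line client.

    GKE tags are plain strings; EKS tags must be key=value pairs.
    Illustrative sketch only.
    """
    if cloud == "aws":
        for tag in tags:
            if "=" not in tag:
                raise ValueError(f"EKS tags must be key=value pairs, got: {tag}")
    return ",".join(tags)


print(tags_to_cli(["lammps"], "google"))        # lammps
print(tags_to_cli(["analysis=lammps"], "aws"))  # analysis=lammps
```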

### Jobs

The jobs specification defines what commands (required) you want run across each Kubernetes cluster.
@@ -86,17 +110,21 @@ jobs:
```

If you have different working directories or container images, you can define them here.
Note that each job can have a command (required) plus an optional working directory, image,
and repeats count.

```yaml
# Each job can have a command and working directory
jobs:
  osu_get_latency:
    command: './osu_get_latency'
    image: ghcr.io/awesome/science:latest
    workdir: /path/to/science
    repeats: 3
```

For repeats, we add another level to the output directory, and represent the result data as
subdirectories of the machine and size from 1..N. Note also that likely in the future we
can provide a default template and require all these variables
defined. For now we require you to provide the template.
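The repeats layout described above might be sketched as follows, with one subdirectory per run under the machine-size directory (a hypothetical path scheme based on the description, not the library's exact code):

```python
from pathlib import PurePosixPath


def repeat_output_dirs(outdir, machine, size, job, repeats):
    """Return one output directory per repeat, numbered 1..N.

    Hypothetical sketch of the layout described in the docs.
    """
    base = PurePosixPath(outdir) / f"{machine}-{size}" / job
    return [base / str(i) for i in range(1, repeats + 1)]


for path in repeat_output_dirs("/tmp/out", "m5.large", 2, "osu_get_latency", 3):
    print(path)
```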

### Custom Resource Definition
6 changes: 6 additions & 0 deletions docs/getting_started/google.md
@@ -31,6 +31,12 @@ $ gcloud components install kubectl
```
or just [on your own](https://kubernetes.io/docs/tasks/tools/).

## Cloud

Finally, ensure that google is either your default cloud (the `default_cloud` in your settings.yml)
or specify it with `--cloud` when you run commands.


## Run Experiments

Each experiment is defined by the matrix and variables in an `experiment.yaml` that is used to
1 change: 1 addition & 0 deletions docs/getting_started/index.md
@@ -12,4 +12,5 @@ examples
experiments
settings
google
aws
```
11 changes: 11 additions & 0 deletions docs/getting_started/install.md
@@ -25,6 +25,17 @@ $ flux-cloud config get google:project
google:project dinosaur
```

Ensure your default cloud is set to the one you want!

```bash
$ flux-cloud config get default_cloud
default_cloud aws

$ flux-cloud config set default_cloud google
default_cloud google
```

We don't discriminate or judge between clouds; we like them all!
Also set your editor of choice, and then you can edit settings in it (it defaults to vim):

```bash
6 changes: 5 additions & 1 deletion docs/getting_started/settings.md
@@ -50,6 +50,10 @@ The following settings are available for Flux Cloud
| google.zone | string | The default zone to use in Google Cloud | us-central1-a | true |
| google.machine | string | The default machine to use | n2-standard-1 | true |
| google.project | string | The default google project to use | unset | true |
| aws | object | A group of settings for Amazon EKS | NA | true |
| aws.region | string | The default region to use in Amazon EKS | us-east-1 | true |
| aws.machine | string | The default machine to use | m5.large | true |
| aws.ssh_key | string | If ssh access is desired, provide an ssh key you've generated | unset | false |

For the above, you'll notice the only setting you really need to define (per the user guide)
is your Google Cloud project. AWS gets everything from the environment.
4 changes: 2 additions & 2 deletions docs/index.md
@@ -28,8 +28,8 @@ when you are developing, you can run "apply" and then easily debug until you are
down.

This project is currently 🚧️ Under Construction! 🚧️ and optimized for the creator @vsoch's use case
to run experiments in Google Cloud (GKE) and Amazon Web Services (EKS). We likely will add more features
and clouds as they are needed or requested. This is a *converged computing* project that aims
to unite the worlds and technologies typical of cloud computing and
high performance computing.

2 changes: 0 additions & 2 deletions fluxcloud/client/__init__.py
@@ -182,8 +182,6 @@ def get_parser():
action="store_true",
default=False,
)

for command in apply, up, down:
command.add_argument(
"--id",
"-e",