Commit
add support for job repeats (#6)
* add support for job repeats: this works by way of adding a suffix to the command name
* add support for aws eks
* tweak helpers path in minicluster-run and help of script
* generate cluster from config instead
* default coloring instead of fabulous - too much!
* repeats should not be deleted, otherwise only works for one run
* run missing experiment id option
* raise value errors instead, add debugging to show template

Signed-off-by: vsoch <[email protected]>
vsoch authored Jan 4, 2023
1 parent f17e0d6 commit 7fd2e0a
Showing 30 changed files with 783 additions and 142 deletions.
31 changes: 31 additions & 0 deletions .github/workflows/release.yaml
@@ -0,0 +1,31 @@
name: release cloud-select

on:
  release:
    types: [created]

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Install
        run: conda create --quiet --name fc twine

      - name: Install dependencies
        run: |
          export PATH="/usr/share/miniconda/bin:$PATH"
          source activate fc
          pip install -e .[all]
          pip install setuptools wheel twine

      - name: Build and publish
        env:
          TWINE_USERNAME: ${{ secrets.PYPI_USER }}
          TWINE_PASSWORD: ${{ secrets.PYPI_PASS }}
        run: |
          export PATH="/usr/share/miniconda/bin:$PATH"
          source activate fc
          python setup.py sdist bdist_wheel
          twine upload dist/*
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
flux_cloud.egg-info
.eggs
build
vendor
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@ and **Merged pull requests**. Critical items to know are:
The versions coincide with releases on pip. Only major versions will be released as tags on Github.

## [0.0.x](https://github.com/converged-computing/flux-cloud/tree/main) (0.0.x)
- support for Amazon EKS and running commands over iterations (0.0.12)
- better control of exit codes, addition of force cluster (0.0.11)
- support for experiment id selection, addition of osu-benchmarks example (0.0.1)
- initial skeleton release of project (0.0.0)
83 changes: 83 additions & 0 deletions docs/getting_started/aws.md
@@ -0,0 +1,83 @@
# AWS

> Running on Amazon Elastic Kubernetes Service (EKS)

The flux-cloud software provides easy wrappers (and templates) for running
the Flux Operator on Amazon. The main steps of running experiments are:

- **up** to bring up a cluster
- **apply** to apply one or more experiments defined by an experiments.yaml
- **down** to destroy a cluster

Each of these commands can be run in isolation, and we provide a single command **run** to
automate the entire workflow. We emphasize the term "wrapper" because we are using scripts on your
machine to do the work (e.g., kubectl and eksctl), and importantly, for every step we show
you the command and, if it fails, give you a chance to bail out. We do this so that if you
want to remove the abstraction at any point and run the commands on your own, you can.

## Pre-requisites

You should first [install eksctl](https://github.com/weaveworks/eksctl) and make sure you have access to an AWS cloud (e.g.,
with credentials or similar in your environment), e.g.:

```bash
export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
export AWS_SESSION_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

The session token may not be required, depending on your setup.
We assume you also have [kubectl](https://kubernetes.io/docs/tasks/tools/).

### Setup SSH

You'll need an ssh key for EKS. Here is how to generate it:

```bash
ssh-keygen
# Ensure you enter the path to ~/.ssh/id_eks
```

This is used so you can ssh (connect) to your workers!
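If you want flux-cloud to use this key, the settings support pointing at it via the `aws.ssh_key` field described in the settings documentation. A sketch, assuming the path from the step above:

```yaml
# settings.yml (excerpt): tell flux-cloud about the EKS ssh key generated above
aws:
  ssh_key: ~/.ssh/id_eks
```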

### Cloud

Finally, ensure that aws is either your default cloud (the `default_cloud` in your settings.yml)
or specify it with `--cloud` when you run commands.

## Run Experiments

**IMPORTANT**: for any experiment, when you choose an instance type, you absolutely
need to choose one that has [IsTrunkingCompatible](https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/aws/vpc/limits.go)
set to true. E.g., `m5.large` has it set to true, so it would work.

Each experiment is defined by the matrix and variables in an `experiments.yaml` that is used to
populate a `minicluster-template.yaml` that you can either provide yourself or use a template provided by the
library. One of the goals of the Flux Cloud Experiment runner is not just to run things, but to
provide this library for you to easily edit and use! Take a look at the [examples](../examples)
directory for a few that we provide. We will walk through a generic one here to launch
an experiment on a Kubernetes cluster. Note that before doing this step you should
have installed flux-cloud, along with kubectl and eksctl, and set your defaults (e.g., region)
in your settings.

```bash
$ flux-cloud run experiments.yaml
```

Note that since the experiments file defaults to that name, you can also just do:

```bash
$ flux-cloud run
```

This assumes an `experiments.yaml` in the present working directory. Take a look at an `experiments.yaml` in an example directory.
Note that machines and size are required for the matrix, and variables get piped into all experiments (in full). Under variables,
both "commands" and "ids" are required and must be equal in length (each command is assigned to one id
for output). To just run the first entry in the matrix (test mode), do:

```bash
$ flux-cloud run experiments.yaml --test
```

Note that you can also use the other commands in place of a single run, notably "up," "apply," and "down."
By default, results will be written to a temporary output directory, but you can customize this with `--outdir`.
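The constraints described above (machines and size required in the matrix, and "commands" and "ids" equal in length) can be sanity-checked with a few lines of Python. This is an illustrative sketch, not flux-cloud's actual validation code; the field names follow the docs:

```python
# Illustrative validator for the experiments.yaml structure described above.
# Not flux-cloud's real implementation.

def validate_experiments(spec):
    """Raise ValueError if the experiment spec violates the documented rules."""
    matrix = spec.get("matrix", {})
    for required in ("machines", "size"):
        if required not in matrix:
            raise ValueError(f"matrix.{required} is required")

    variables = spec.get("variables", {})
    commands = variables.get("commands")
    ids = variables.get("ids")
    if not commands or not ids:
        raise ValueError("variables.commands and variables.ids are required")
    if len(commands) != len(ids):
        raise ValueError("commands and ids must be equal in length")


spec = {
    "matrix": {"machines": ["m5.large"], "size": [2]},
    "variables": {
        "commands": ["./osu_get_latency"],
        "ids": ["osu-get-latency"],
    },
}
validate_experiments(spec)  # no exception: the spec is valid
```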
28 changes: 28 additions & 0 deletions docs/getting_started/commands.md
@@ -150,5 +150,33 @@ You can also use `--force-cluster` here:
$ flux-cloud down --force-cluster
```

## debug

For any command, you can add `--debug` as a main client argument to see additional information. E.g.,
the cluster config created for eksctl:

```bash
$ flux-cloud --debug up
```
```console
No experiment ID provided, assuming first experiment m5.large-2.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: flux-cluster
  region: us-east-1
  version: "1.23"
# availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1d"]
managedNodeGroups:
  - name: workers
    instanceType: m5.large
    minSize: 2
    maxSize: 2
    labels: { "fluxoperator": "true" }
...
```
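The commit notes that the cluster is now generated from a config. A simplified sketch of how a ClusterConfig like the one above might be rendered from an experiment's machine and size (illustrative only, not the library's actual template or code):

```python
# Render an eksctl ClusterConfig from an experiment entry.
# Illustrative sketch; flux-cloud's real template may differ.
CLUSTER_TEMPLATE = """apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: {name}
  region: {region}
  version: "{version}"
managedNodeGroups:
  - name: workers
    instanceType: {machine}
    minSize: {size}
    maxSize: {size}
    labels: {{ "fluxoperator": "true" }}
"""


def render_cluster_config(machine, size, name="flux-cluster",
                          region="us-east-1", version="1.23"):
    """Fill the template with one experiment's machine and size."""
    return CLUSTER_TEMPLATE.format(
        name=name, region=region, version=version, machine=machine, size=size
    )


print(render_cluster_config("m5.large", 2))
```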

And that's it! I think there might be a more elegant way to determine which cluster is running;
however, if the user decides to launch more than one, it might be harder. More thinking / docs / examples coming soon.
32 changes: 30 additions & 2 deletions docs/getting_started/experiments.md
@@ -70,6 +70,30 @@ minicluster:
  namespace: flux-operator
```
### Kubernetes

While it's recommended to define defaults for Kubernetes (e.g., version) in your `settings.yml`, you can override them per experiment
via a "cluster" attribute in your `experiments.yaml`. Unlike settings, this supports a field for "tags" that should be a list of strings:

```yaml
cluster:
  version: "1.23"
  tags:
    - lammps
```

Note that the above is for a Google GKE cluster, where tags is a simple list of strings. For AWS EKS, you need to provide key=value pairs:

```yaml
cluster:
  version: "1.22"
  tags:
    - analysis=lammps
```

This is validated at runtime when you create the cluster. In both cases, the tags are converted to comma-separated values
that are provided to the command line client.
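A sketch of that conversion, with GKE tags passed through as plain strings and EKS tags validated as key=value pairs before joining (illustrative, not flux-cloud's actual implementation):

```python
def tags_to_cli(tags, cloud):
    """Validate tags per cloud and join them for the command line client.

    GKE tags are plain strings; EKS tags must be key=value pairs.
    Illustrative sketch only.
    """
    if cloud == "aws":
        for tag in tags:
            if "=" not in tag:
                raise ValueError(f"EKS tags must be key=value pairs, got: {tag}")
    return ",".join(tags)


print(tags_to_cli(["lammps"], "google"))        # lammps
print(tags_to_cli(["analysis=lammps"], "aws"))  # analysis=lammps
```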

### Jobs

The jobs specification defines what commands (required) you want run across each Kubernetes cluster.
@@ -86,17 +110,21 @@ jobs:
```

If you have different working directories or container images, you can define them here.
Note that each job can have a command (required) plus an optional working directory, image,
and repeats count.

```yaml
# Each job can have a command and working directory
jobs:
  osu_get_latency:
    command: './osu_get_latency'
    image: ghcr.io/awesome/science:latest
    workdir: /path/to/science
    repeats: 3
```

For repeats, we add another level to the output directory, and represent the result data as
subdirectories of the machine and size from 1..N. Note also that likely in the future we
can provide a default template and require all these variables
defined. For now we require you to provide the template.
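The repeats layout described above might be sketched as follows, with one subdirectory per run under the machine-size directory (a hypothetical path scheme based on the description, not the library's exact code):

```python
from pathlib import PurePosixPath


def repeat_output_dirs(outdir, machine, size, job, repeats):
    """Return one output directory per repeat, numbered 1..N.

    Hypothetical sketch of the layout described in the docs.
    """
    base = PurePosixPath(outdir) / f"{machine}-{size}" / job
    return [base / str(i) for i in range(1, repeats + 1)]


for path in repeat_output_dirs("/tmp/out", "m5.large", 2, "osu_get_latency", 3):
    print(path)
```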

### Custom Resource Definition
6 changes: 6 additions & 0 deletions docs/getting_started/google.md
@@ -31,6 +31,12 @@ $ gcloud components install kubectl
```
or just [on your own](https://kubernetes.io/docs/tasks/tools/).

## Cloud

Finally, ensure that google is either your default cloud (the `default_cloud` in your settings.yml)
or specify it with `--cloud` when you run commands.


## Run Experiments

Each experiment is defined by the matrix and variables in an `experiment.yaml` that is used to
1 change: 1 addition & 0 deletions docs/getting_started/index.md
@@ -12,4 +12,5 @@ examples
experiments
settings
google
aws
```
11 changes: 11 additions & 0 deletions docs/getting_started/install.md
@@ -25,6 +25,17 @@ $ flux-cloud config get google:project
google:project dinosaur
```

Ensure your default cloud is set to the one you want!

```bash
$ flux-cloud config get default_cloud
default_cloud aws

$ flux-cloud config set default_cloud google
default_cloud google
```

We don't discriminate or judge between clouds; we like them all!
Also set your editor of choice, and then you can edit settings in it (it defaults to vim):

```bash
6 changes: 5 additions & 1 deletion docs/getting_started/settings.md
@@ -50,6 +50,10 @@ The following settings are available for Flux Cloud
| google.zone | string | The default zone to use in Google Cloud | us-central1-a | true |
| google.machine | string | The default machine to use | n2-standard-1 | true |
| google.project | string | The default google project to use | unset | true |
| aws | object | A group of settings for Amazon EKS | NA | true |
| aws.region | string | The default region to use in Amazon EKS | us-east-1 | true |
| aws.machine | string | The default machine to use | m5.large | true |
| aws.ssh_key | string | If ssh access is desired, provide an ssh key you've generated | unset | false |

For the above, you'll notice the only setting you really need to define (per the user guide)
is your Google Cloud project. AWS gets everything from the environment.
4 changes: 2 additions & 2 deletions docs/index.md
@@ -28,8 +28,8 @@ when you are developing, you can run "apply" and then easily debug until you are
down.

This project is currently 🚧️ Under Construction! 🚧️ and optimized for the creator @vsoch's use case
to run experiments in Google Cloud (GKE) and Amazon Web Services (EKS). We likely will add more features
and clouds as they are needed or requested. This is a *converged computing* project that aims
to unite the worlds and technologies typical of cloud computing and
high performance computing.

2 changes: 0 additions & 2 deletions fluxcloud/client/__init__.py
@@ -182,8 +182,6 @@ def get_parser():
action="store_true",
default=False,
)

for command in apply, up, down:
command.add_argument(
"--id",
"-e",