diff --git a/.github/workflows/release.yaml b/.github/workflows/release.yaml
new file mode 100644
index 0000000..b2d8743
--- /dev/null
+++ b/.github/workflows/release.yaml
@@ -0,0 +1,31 @@
+name: release flux-cloud
+
+on:
+  release:
+    types: [created]
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v3
+
+    - name: Install
+      run: conda create --quiet --name fc twine
+
+    - name: Install dependencies
+      run: |
+        export PATH="/usr/share/miniconda/bin:$PATH"
+        source activate fc
+        pip install -e .[all]
+        pip install setuptools wheel twine
+    - name: Build and publish
+      env:
+        TWINE_USERNAME: ${{ secrets.PYPI_USER }}
+        TWINE_PASSWORD: ${{ secrets.PYPI_PASS }}
+      run: |
+        export PATH="/usr/share/miniconda/bin:$PATH"
+        source activate fc
+        python setup.py sdist bdist_wheel
+        twine upload dist/*
diff --git a/.gitignore b/.gitignore
index 98c66c6..50fc771 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
+flux_cloud.egg-info
 .eggs
 build
 vendor
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 6eb9c94..0d36bc4 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -14,6 +14,7 @@ and **Merged pull requests**. Critical items to know are:
 The versions coincide with releases on pip. Only major versions will be released as tags on Github.
 
 ## [0.0.x](https://github.com/converged-computing/flux-cloud/tree/main) (0.0.x)
+- support for Amazon EKS and running commands over iterations (0.0.12)
 - better control of exit codes, addition of force cluster (0.0.11)
 - support for experiment id selection, addition of osu-benchmarks example (0.0.1)
 - initial skeleton release of project (0.0.0)
diff --git a/docs/getting_started/aws.md b/docs/getting_started/aws.md
new file mode 100644
index 0000000..5652996
--- /dev/null
+++ b/docs/getting_started/aws.md
@@ -0,0 +1,83 @@
+# AWS
+
+> Running on Amazon Elastic Kubernetes Service (EKS)
+
+The flux-cloud software provides easy wrappers (and templates) for running
+the Flux Operator on Amazon. The main steps of running experiments are:
+
+ - **up** to bring up a cluster
+ - **apply** to apply one or more experiments defined by an experiments.yaml
+ - **down** to destroy a cluster
+
+Each of these commands can be run in isolation, and we provide a single command **run** to
+automate the entire thing. We emphasize the term "wrapper" as we are using scripts on your
+machine to do the work (e.g., kubectl and eksctl) and importantly, for every step we show
+you the command, and if it fails, give you a chance to bail out. We do this so if you
+want to remove the abstraction at any point and run the commands on your own, you can.
+
+## Pre-requisites
+
+You should first [install eksctl](https://github.com/weaveworks/eksctl) and make sure you have access to an AWS cloud,
+e.g., with credentials or similar exported in your environment:
+
+```bash
+export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxxxx
+export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+export AWS_SESSION_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+```
+
+The last session token may not be required depending on your setup.
+We assume you also have [kubectl](https://kubernetes.io/docs/tasks/tools/) installed.
+
+### Setup SSH
+
+You'll need an ssh key for EKS. Here is how to generate it:
+
+```bash
+ssh-keygen
+# Ensure you enter the path to ~/.ssh/id_eks
+```
+
+This is used so you can ssh (connect) to your workers!
+
+### Cloud
+
+Finally, ensure that aws is either your default cloud (the `default_cloud` in your settings.yml)
+or you specify it with `--cloud` when you do run.
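+
+For example, to make aws your default (a sketch using the same config command
+shown in the install guide):
+
+```bash
+$ flux-cloud config set default_cloud aws
+```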
+
+## Run Experiments
+
+**IMPORTANT**: for any experiment, when you choose an instance type you absolutely
+need to choose one that has [IsTrunkingCompatible](https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/aws/vpc/limits.go)
+true. E.g., `m5.large` has it set to true, so it would work.
+
+Each experiment is defined by the matrix and variables in an `experiments.yaml` that is used to
+populate a `minicluster-template.yaml` that you can either provide, or use a template provided by the
+library. One of the goals of the Flux Cloud Experiment runner is not just to run things, but to
+provide this library for you to easily edit and use! Take a look at the [examples](../examples)
+directory for a few that we provide. We will walk through a generic one here to launch
+an experiment on a Kubernetes cluster. Note that before doing this step you should
+have installed flux-cloud, along with kubectl and eksctl, and set your defaults (e.g., region)
+in your settings.
+
+```bash
+$ flux-cloud run experiments.yaml
+```
+
+Since the experiments file defaults to that name, given an experiments.yaml in the
+present working directory you can also just do:
+
+```bash
+$ flux-cloud run
+```
+
+Take a look at an `experiments.yaml` in an example directory.
+Note that machines and size are required for the matrix, and variables get piped into all experiments (in full). Under variables,
+both "commands" and "ids" are required, and must be equal in length (each command is assigned to one id
+for output).
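+
+As a concrete sketch (field names inferred from the description above; see the
+examples directory for complete, real files), an experiments.yaml might look like:
+
+```yaml
+matrix:
+  machines: ["m5.large"]
+  size: [2]
+
+variables:
+  commands:
+    - ./osu_get_latency
+  ids:
+    - osu_get_latency
+```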
+To just run the first entry in the matrix (test mode) do:
+
+```bash
+$ flux-cloud run experiments.yaml --test
+```
+
+Note that you can also use the other commands in place of a single run, notably "up," "apply," and "down."
+By default, results will be written to a temporary output directory, but you can customize this with `--outdir`.
diff --git a/docs/getting_started/commands.md b/docs/getting_started/commands.md
index 5047eee..64da8ca 100644
--- a/docs/getting_started/commands.md
+++ b/docs/getting_started/commands.md
@@ -150,5 +150,33 @@ You can also use `--force-cluster` here:
 $ flux-cloud down --force-cluster
 ```
 
+## debug
+
+For any command, you can add `--debug` as a main client argument to see additional information, e.g.,
+the cluster config created for eksctl:
+
+```bash
+$ flux-cloud --debug up
+```
+```console
+No experiment ID provided, assuming first experiment m5.large-2.
+apiVersion: eksctl.io/v1alpha5
+kind: ClusterConfig
+
+metadata:
+  name: flux-cluster
+  region: us-east-1
+  version: 1.23
+
+# availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1d"]
+managedNodeGroups:
+  - name: workers
+    instanceType: m5.large
+    minSize: 2
+    maxSize: 2
+    labels: { "fluxoperator": "true" }
+...
+```
+
 And that's it! I think there might be a more elegant way to determine what cluster
 is running, however if the user decides to launch more than one, it might be harder.
 More thinking / docs / examples coming soon.
diff --git a/docs/getting_started/experiments.md b/docs/getting_started/experiments.md
index a66e44d..7c9f19f 100644
--- a/docs/getting_started/experiments.md
+++ b/docs/getting_started/experiments.md
@@ -70,6 +70,30 @@ minicluster:
   namespace: flux-operator
 ```
 
+### Kubernetes
+
+While it's recommended to define defaults for Kubernetes (e.g., version) in your `settings.yml`, you can one-off edit them
+via a "cluster" attribute in your `experiments.yaml`. Unlike settings, this supports a field for "tags" that should be a list of strings:
+
+```yaml
+cluster:
+  version: "1.23"
+  tags:
+    - lammps
+```
+
+Note that the above is for a Google GKE cluster - tags is a single list of tags. For AWS EKS, you need to provide key value pairs:
+
+```yaml
+cluster:
+  version: "1.22"
+  tags:
+    - analysis=lammps
+```
+
+This is validated at runtime when you create the cluster. For both clouds, tags are converted to comma separated values to provide
+to the command line client.
+
 ### Jobs
 
 The jobs specification defines what commands (required) you want run across each Kubernetes cluster.
@@ -86,17 +110,21 @@ jobs:
   osu_get_latency:
     command: './osu_get_latency'
 ```
 
 If you have different working directories or container images, you can define that here:
+Note that each job can have a command (required) and an optional working directory, image,
+and repeats.
 
 ```yaml
-# Each job can have a command and working directory
 jobs:
   osu_get_latency:
     command: './osu_get_latency'
     image: ghcr.io/awesome/science:latest
     workdir: /path/to/science
+    repeats: 3
 ```
 
-Note that likely in the future we can provide a default template and require all these variables
+For repeats, we add another level to the output directory, and represent the result data as
+subdirectories of the machine and size from 1..N. Note also that likely in the future we
+can provide a default template and require all these variables
 defined. For now we require you to provide the template.
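+
+For example, with `repeats: 3` as above, results for a hypothetical `m5.large-2`
+experiment would be organized like this sketch (repeated jobs are suffixed 1..N,
+following how expand_jobs in this change names them, each with its own log):
+
+```console
+data/
+└── m5.large-2/
+    ├── osu_get_latency-1/
+    │   └── log.out
+    ├── osu_get_latency-2/
+    │   └── log.out
+    └── osu_get_latency-3/
+        └── log.out
+```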
 
 ### Custom Resource Definition
diff --git a/docs/getting_started/google.md b/docs/getting_started/google.md
index 10ab010..ae38586 100644
--- a/docs/getting_started/google.md
+++ b/docs/getting_started/google.md
@@ -31,6 +31,12 @@ $ gcloud components install kubectl
 ```
 or just [on your own](https://kubernetes.io/docs/tasks/tools/).
 
+## Cloud
+
+Finally, ensure that google is either your default cloud (the `default_cloud` in your settings.yml)
+or you specify it with `--cloud` when you do run.
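+
+For example (a sketch; the text above only promises a `--cloud` flag on run,
+so check `flux-cloud run --help` for the exact usage):
+
+```bash
+$ flux-cloud run --cloud google
+```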
""" res = utils.run_command(cmd) + + # An optional cleanup function (also can run if not successful) + if cleanup_func is not None: + cleanup_func() + if res.returncode != 0: raise ValueError("nonzero exit code, exiting.") def __str__(self): return "[flux-cloud-client]" - def get_script(self, name): + def get_script(self, name, cloud=None): """ Get a named script from the cloud's script folder """ - script = os.path.join(here, "clouds", self.name, "scripts", name) + cloud = cloud or self.name + script = os.path.join(here, "clouds", cloud, "scripts", name) if os.path.exists(script): return script + def get_shared_script(self, name): + """ + Get a named shared script + """ + return self.get_script(name, cloud="shared") + def experiment_is_run(self, setup, experiment): """ Determine if all jobs are already run in an experiment @@ -102,11 +116,93 @@ def down(self, *args, **kwargs): """ raise NotImplementedError - def apply(self, *args, **kwargs): + def apply(self, setup, experiment): """ - Apply (run) one or more experiments. + Apply a CRD to run the experiment and wait for output. + + This is really just running the setup! """ - raise NotImplementedError + # Here is where we need a template! + if setup.template is None or not os.path.exists(setup.template): + raise ValueError( + "You cannot run experiments without a minicluster-template.yaml" + ) + apply_script = self.get_shared_script("minicluster-run") + + jobs = experiment.get("jobs", []) + minicluster = setup.get_minicluster(experiment) + if not jobs: + logger.warning(f"Experiment {experiment} has no jobs, nothing to run.") + return + + # The experiment is defined by the machine type and size + experiment_dir = os.path.join(setup.outdir, experiment["id"]) + + # Jobname is used for output + for jobname, job in jobs.items(): + + job_output = os.path.join(experiment_dir, jobname) + logfile = os.path.join(job_output, "log.out") + + # Do we have output? + if os.path.exists(logfile) and not setup.force: + logger.warning( + f"{logfile} already exists and force is False, skipping." 
 
 For the above, you'll notice the only setting you really need to define (per the user guide)
-is your Google Cloud project.
+is your Google Cloud project. AWS gets everything else from the environment.
diff --git a/docs/index.md b/docs/index.md
index 03025dc..907c147 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -28,8 +28,8 @@ when you are developing, you can run "apply" and then easily debug until you are
 down. This project is currently 🚧️ Under Construction! 🚧️ and optimized for the creator @vsoch's use case
-to run experiments in Google Cloud. We likely will add more features and clouds as they are needed
-or requested. This is a *converged computing* project that aims
+to run experiments in Google Cloud (GKE) and Amazon Web Services (EKS). We likely will add more features
+and clouds as they are needed or requested. This is a *converged computing* project that aims
 to unite the worlds and technologies typical of cloud computing and high performance computing.
diff --git a/fluxcloud/client/__init__.py b/fluxcloud/client/__init__.py
index 8e7d47e..1ba8882 100644
--- a/fluxcloud/client/__init__.py
+++ b/fluxcloud/client/__init__.py
@@ -182,8 +182,6 @@ def get_parser():
         action="store_true",
         default=False,
     )
-
-
     for command in apply, up, down:
         command.add_argument(
             "--id",
             "-e",
diff --git a/fluxcloud/main/client.py b/fluxcloud/main/client.py
index 3590b82..a544bc5 100644
--- a/fluxcloud/main/client.py
+++ b/fluxcloud/main/client.py
@@ -3,7 +3,9 @@
 #
 # SPDX-License-Identifier: Apache-2.0
 
+import copy
 import os
+import shutil
 
 import fluxcloud.utils as utils
 from fluxcloud.logger import logger
@@ -30,25 +32,37 @@ def __repr__(self):
         return str(self)
 
     @timed
-    def run_timed(self, name, cmd):
+    def run_timed(self, name, cmd, cleanup_func=None):
         """
         Run a timed command, and handle nonzero exit codes.
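+
+        If given, cleanup_func is called after the command runs, whether or
+        not it succeeded. E.g., a sketch (mirroring the AWS runner below):
+
+            self.run_timed("create-cluster", cmd, cleanup)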
+ """ + + name = "aws" + + def __init__(self, **kwargs): + super(AmazonCloud, self).__init__(**kwargs) + self.region = kwargs.get("region") or "us-east-1" + + # This could eventually just be provided + self.config_template = os.path.join(here, "templates", "cluster-config.yaml") + + def up(self, setup, experiment=None): + """ + Bring up a cluster + """ + experiment = experiment or setup.get_single_experiment() + create_script = self.get_script("cluster-create") + + # ssh key if provided must exist + ssh_key = self.settings.aws.get("ssh_key") + if ssh_key and not os.path.exists(ssh_key): + raise ValueError("ssh_key defined and does not exist: {ssh_key}") + + tags = self.get_tags(experiment) + + # Create the cluster with creation script, write to temporary file + template = self.generate_config(setup, experiment) + config_file = utils.get_tmpfile(prefix="eksctl-config", suffix=".yaml") + utils.write_file(template, config_file) + + # Most of these are not needed, but provided for terminal printing + # and consistent output with Google GKE runner + cmd = [ + create_script, + "--region", + self.region, + "--machine", + setup.get_machine(experiment), + "--cluster", + setup.get_cluster_name(experiment), + "--cluster-version", + setup.settings.kubernetes["version"], + "--config", + config_file, + "--size", + setup.get_size(experiment), + ] + if setup.force_cluster: + cmd.append("--force-cluster") + if tags: + cmd += ["--tags", ",".join(tags)] + + # Cleanup function to remove temporary file + def cleanup(): + if os.path.exists(config_file): + os.remove(config_file) + + return self.run_timed("create-cluster", cmd, cleanup) + + def get_tags(self, experiment): + """ + Convert cluster tags into list of key value pairs + """ + tags = {} + for tag in experiment.get("cluster", {}).get("tags") or []: + if "=" not in tag: + raise ValueError( + f"Cluster tags must be provided in format key=value, found {tag}" + ) + key, value = tag.split("=", 1) + tags[key] = value + return tags + + def generate_config(self, setup, experiment): + """ + Generate the config to create the cluster. + + Note that we could use the command line client alone but it doesn't + support all options. Badoom fzzzz. 
+ """ + template = jinja2.Template(utils.read_file(self.config_template)) + values = {} + + # Cluster name, kubernetes version, and region + values["cluster_name"] = setup.get_cluster_name(experiment) + values["region"] = self.region + values["machine"] = setup.get_machine(experiment) + values["kubernetes_version"] = setup.settings.kubernetes["version"] + values["size"] = setup.get_size(experiment) + values["ssh_key"] = self.settings.aws.get("ssh_key") + + # Optional booleans + for key in ["private_networking", "efa_enabled"]: + value = self.settings.aws.get("private_networking") + if value is True: + values[key] = value + + result = template.render(**values) + logger.debug(result) + return result + + def down(self, setup, experiment=None): + """ + Destroy a cluster + """ + experiment = experiment or setup.get_single_experiment() + destroy_script = self.get_script("cluster-destroy") + + # Create the cluster with creation script + cmd = [ + destroy_script, + "--region", + self.region, + "--cluster", + setup.get_cluster_name(experiment), + ] + if setup.force_cluster: + cmd.append("--force-cluster") + return self.run_timed("destroy-cluster", cmd) diff --git a/fluxcloud/main/clouds/aws/scripts/cluster-create b/fluxcloud/main/clouds/aws/scripts/cluster-create new file mode 100755 index 0000000..9824b79 --- /dev/null +++ b/fluxcloud/main/clouds/aws/scripts/cluster-create @@ -0,0 +1,145 @@ +#!/bin/bash + +SHORT="p:,c:,r:,v:,m:,t:,s:,b:,e:,f:,o:,h" +LONG="cluster:,region:,cluster-version:,machine:,tags:,size:,branch:,repository:,force-cluster,config:,help" +OPTS=$(getopt -a -n create --options $SHORT --longoptions $LONG -- "$@") + +eval set -- "$OPTS" + +HERE=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) +ROOT=$(dirname $(dirname ${HERE})) + +# Source shared helper scripts +. $ROOT/shared/scripts/helpers.sh + +# Defaults - these are in the config but left here for information +CLUSTER_NAME="flux-cluster" +REGION="us-east-1" +CLUSTER_VERSION="1.23" +MACHINE_TYPE="m5.large" +FORCE_CLUSTER="false" +SIZE=4 +TAGS="creator=flux-cloud" +REPOSITORY="flux-framework/flux-operator" +BRANCH="main" + +function usage() { + echo "This is the Amazon EKS (elastic kubernetes service) cluster creator." 
+ echo "usage: cluster-create --config /path/to/cluster-config.yaml" +} + +while : +do + case "$1" in + --config) + CONFIG_FILE=$2 + shift 2 + ;; + -c | --cluster) + CLUSTER_NAME=$2 + shift 2 + ;; + -r | --region) + REGION=$2 + shift 2 + ;; + -v | --cluster-version) + CLUSTER_VERSION=$2 + shift 2 + ;; + -m | --machine) + MACHINE_TYPE=$2 + shift 2 + ;; + -t | --tags) + TAGS=$2 + shift 2 + ;; + -f | --force-cluster) + FORCE_CLUSTER="true" + shift 1 + ;; + -b | --branch) + BRANCH=$2 + shift 2 + ;; + -e | --repository) + REPOSITORY=$2 + shift 2 + ;; + -s | --size) + SIZE=$2 + shift 2 + ;; + -h | --help) + usage + exit 2 + ;; + --) + shift; + break + ;; + *) + echo $@ + echo "Unexpected option: $1" + ;; + esac +done + +# Required arguments +if [ -z ${REGION+x} ]; then + echo "Please provide your AWS region with --region"; + exit 1 +fi + +if [ -z ${CONFIG_FILE+x} ]; then + echo "Please provide your AWS cluster config file with --config"; + exit 1 +fi + +if [ -z ${MACHINE_TYPE+x} ]; then + echo "Please provide your Amazon EKS machine type with --machine"; + exit 1 +fi + +print_magenta " cluster : ${CLUSTER_NAME}" +print_magenta " version : ${CLUSTER_VERSION}" +print_magenta " machine : ${MACHINE_TYPE}" +print_magenta " region : ${REGION}" +print_magenta " tags : ${TAGS}" +print_magenta " size : ${SIZE}" +print_magenta "repository : ${REPOSITORY}" +print_magenta " branch : ${BRANCH}" +print_magenta " ssh-key : ${SSH_KEY}" + +is_installed kubectl +is_installed eksctl +is_installed wget + +# Check if it already exists +eksctl get clusters --name ${CLUSTER_NAME} --region ${REGION} --color fabulous +retval=$? +if [[ "${retval}" == "0" ]]; then + print_blue "${CLUSTER_NAME} in ${REGION} already exists." + echo + exit 0 +fi + +if [[ "${FORCE_CLUSTER}" != "true" ]]; then + prompt "Do you want to create this cluster?" +fi + +run_echo eksctl create cluster -f ${CONFIG_FILE} + +# Show nodes +run_echo kubectl get nodes + +# Deploy the operator TODO should be variables here +tmpfile=$(mktemp /tmp/flux-operator.XXXXXX.yaml) +rm -rf $tmpfile +run_echo wget -O $tmpfile https://raw.githubusercontent.com/${REPOSITORY}/${BRANCH}/examples/dist/flux-operator.yaml +kubectl apply -f $tmpfile +rm -rf $tmpfile + +run_echo kubectl get namespace +run_echo kubectl describe namespace operator-system diff --git a/fluxcloud/main/clouds/aws/scripts/cluster-destroy b/fluxcloud/main/clouds/aws/scripts/cluster-destroy new file mode 100755 index 0000000..1374301 --- /dev/null +++ b/fluxcloud/main/clouds/aws/scripts/cluster-destroy @@ -0,0 +1,78 @@ +#!/bin/bash + +SHORT="c:,r:,f,h" +LONG="cluster:,region:,force-cluster,help" +OPTS=$(getopt -a -n create --options $SHORT --longoptions $LONG -- "$@") + +eval set -- "$OPTS" + +HERE=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) +ROOT=$(dirname $(dirname ${HERE})) + +# Source shared helper scripts +. $ROOT/shared/scripts/helpers.sh + +# Defaults +CLUSTER_NAME="flux-cluster" +FORCE_CLUSTER="false" +REGION="us-east-1" + +function usage() { + echo "This is the Amazon EKS (Elastic Kubernetes Services) cluster destroyer." 
+ echo "usage: cluster-destroy --cluster --region " +} + +while : +do + case "$1" in + -c | --cluster) + CLUSTER_NAME=$2 + shift 2 + ;; + -r | --region) + REGION=$2 + shift 2 + ;; + -h | --help) + usage + exit 2 + ;; + -f | --force-cluster) + FORCE_CLUSTER="true" + shift 1 + ;; + --) + shift; + break + ;; + *) + echo "Unexpected option: $1" + ;; + esac +done + +if [ -z ${REGION+x} ]; then + echo "Please provide your Amazon EKS region with --region"; + exit 1 +fi + +echo " cluster : ${CLUSTER_NAME}" +echo " region : ${REGION}" + +is_installed eksctl + +# The cluster must exist to delete it +eksctl get clusters --name ${CLUSTER_NAME} --region ${REGION} --color fabulous +retval=$? +if [[ "${retval}" != "0" ]]; then + print_blue "${CLUSTER_NAME} in ${REGION} does not exist." + echo + exit 0 +fi + +if [[ "${FORCE_CLUSTER}" != "true" ]]; then + prompt "Are you sure you want to delete this cluster?" +fi + +cmd="eksctl delete cluster --name=${CLUSTER_NAME} --region=${REGION} --wait --force" +run_echo ${cmd} diff --git a/fluxcloud/main/clouds/aws/templates/cluster-config.yaml b/fluxcloud/main/clouds/aws/templates/cluster-config.yaml new file mode 100644 index 0000000..68f1ad2 --- /dev/null +++ b/fluxcloud/main/clouds/aws/templates/cluster-config.yaml @@ -0,0 +1,23 @@ +apiVersion: eksctl.io/v1alpha5 +kind: ClusterConfig + +metadata: + name: {{ cluster_name }} + region: {{ region }} + version: "{{ kubernetes_version }}" + {% if tags %}tags:{% for tag in tags %} + "{{ tag[0] }}": "{{ tag[1] }}" + {% endfor %}{% endif %} + +# availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1d"] +managedNodeGroups: + - name: workers + instanceType: {{ machine }} + minSize: {{ size }} + maxSize: {{ size }} + labels: { "fluxoperator": "true" } + {% if ssh_key %}ssh: + allow: true + publicKeyPath: {{ ssh_key }}{% endif %} + {% if private_networking %}privateNetworking: true{% endif %} + {% if efa_enabled %}efaEnabled: true{% endif %} diff --git a/fluxcloud/main/clouds/google/client.py b/fluxcloud/main/clouds/google/client.py index 07a3e03..0bdbe06 100644 --- a/fluxcloud/main/clouds/google/client.py +++ b/fluxcloud/main/clouds/google/client.py @@ -3,12 +3,6 @@ # # SPDX-License-Identifier: Apache-2.0 -import copy -import os -import shutil - -import fluxcloud.utils as utils -from fluxcloud.logger import logger from fluxcloud.main.client import ExperimentClient @@ -30,96 +24,6 @@ def __init__(self, **kwargs): "Please provide your Google Cloud project in your settings.yml or flux-cloud set google:project " ) - def apply(self, setup, experiment): - """ - Apply a CRD to run the experiment and wait for output. - - This is really just running the setup! - """ - # Here is where we need a template! - if not setup.template or not os.path.exists(setup.template): - logger.exit( - "You cannot run experiments without a minicluster-template.yaml" - ) - apply_script = self.get_script("minicluster-run") - - # One run per job (command) - jobs = experiment.get("jobs", []) - minicluster = setup.get_minicluster(experiment) - if not jobs: - logger.warning(f"Experiment {experiment} has no jobs, nothing to run.") - return - - # The experiment is defined by the machine type and size - experiment_dir = os.path.join(setup.outdir, experiment["id"]) - - # Jobname is used for output - for jobname, job in jobs.items(): - - # Job specific output directory - job_output = os.path.join(experiment_dir, jobname) - logfile = os.path.join(job_output, "log.out") - - # Do we have output? 
-            if os.path.exists(logfile) and not setup.force:
-                logger.warning(
-                    f"{logfile} already exists and force is False, skipping."
-                )
-                continue
-            elif os.path.exists(logfile) and setup.force:
-                logger.warning(f"Cleaning up previous run in {job_output}.")
-                shutil.rmtree(job_output)
-
-            # Create job directory anew
-            utils.mkdir_p(job_output)
-
-            # Generate the populated crd from the template
-            template = setup.generate_crd(experiment, job)
-
-            # Write to a temporary file
-            crd = utils.get_tmpfile(prefix="minicluster-", suffix=".yaml")
-            utils.write_file(template, crd)
-
-            # Apply the job, and save to output directory
-            cmd = [
-                apply_script,
-                "--apply",
-                crd,
-                "--logfile",
-                logfile,
-                "--namespace",
-                minicluster["namespace"],
-                "--job",
-                minicluster["name"],
-            ]
-            self.run_timed(f"{self.job_prefix}-{jobname}", cmd)
-
-            # Clean up temporary crd if we get here
-            if os.path.exists(crd):
-                os.remove(crd)
-
-        # Save times and experiment metadata to file
-        # TODO we could add cost estimation here - data from cloud select
-        meta = copy.deepcopy(experiment)
-        meta["times"] = self.times
-        meta_file = os.path.join(experiment_dir, "meta.json")
-        utils.write_json(meta, meta_file)
-        self.clear_minicluster_times()
-        return meta
-
-    def clear_minicluster_times(self):
-        """
-        Update times to not include jobs
-        """
-        times = {}
-        for key, value in self.times.items():
-
-            # Don't add back a job that was already saved
-            if key.startswith(self.job_prefix):
-                continue
-            times[key] = value
-        self.times = times
-
     def up(self, setup, experiment=None):
         """
         Bring up a cluster
diff --git a/fluxcloud/main/clouds/google/scripts/cluster-create b/fluxcloud/main/clouds/google/scripts/cluster-create
index a50843e..9159f6e 100755
--- a/fluxcloud/main/clouds/google/scripts/cluster-create
+++ b/fluxcloud/main/clouds/google/scripts/cluster-create
@@ -7,9 +7,10 @@ OPTS=$(getopt -a -n create --options $SHORT --longoptions $LONG -- "$@")
 eval set -- "$OPTS"
 
 HERE=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+ROOT=$(dirname $(dirname ${HERE}))
 
 # Source shared helper scripts
-. ${HERE}/helpers.sh
+. $ROOT/shared/scripts/helpers.sh
 
 # Defaults
 CLUSTER_NAME="flux-cluster"
@@ -62,6 +63,14 @@ do
             SIZE=$2
             shift 2
             ;;
+        -b | --branch)
+            BRANCH=$2
+            shift 2
+            ;;
+        -r | --repository)
+            REPOSITORY=$2
+            shift 2
+            ;;
         -h | --help)
             usage
             exit 2
diff --git a/fluxcloud/main/clouds/google/scripts/cluster-destroy b/fluxcloud/main/clouds/google/scripts/cluster-destroy
index 9204d84..bde38ed 100755
--- a/fluxcloud/main/clouds/google/scripts/cluster-destroy
+++ b/fluxcloud/main/clouds/google/scripts/cluster-destroy
@@ -7,9 +7,10 @@ OPTS=$(getopt -a -n create --options $SHORT --longoptions $LONG -- "$@")
 eval set -- "$OPTS"
 
 HERE=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+ROOT=$(dirname $(dirname ${HERE}))
 
 # Source shared helper scripts
-. ${HERE}/helpers.sh
+. $ROOT/shared/scripts/helpers.sh
 
 # Defaults
 CLUSTER_NAME="flux-cluster"
diff --git a/fluxcloud/main/clouds/shared/__init__.py b/fluxcloud/main/clouds/shared/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/fluxcloud/main/clouds/google/scripts/helpers.sh b/fluxcloud/main/clouds/shared/scripts/helpers.sh
similarity index 100%
rename from fluxcloud/main/clouds/google/scripts/helpers.sh
rename to fluxcloud/main/clouds/shared/scripts/helpers.sh
diff --git a/fluxcloud/main/clouds/google/scripts/minicluster-run b/fluxcloud/main/clouds/shared/scripts/minicluster-run
similarity index 95%
rename from fluxcloud/main/clouds/google/scripts/minicluster-run
rename to fluxcloud/main/clouds/shared/scripts/minicluster-run
index 1646287..d2431c4 100755
--- a/fluxcloud/main/clouds/google/scripts/minicluster-run
+++ b/fluxcloud/main/clouds/shared/scripts/minicluster-run
@@ -9,10 +9,10 @@ eval set -- "$OPTS"
 HERE=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
 
 # Source shared helper scripts
-. ${HERE}/helpers.sh
+. $HERE/helpers.sh
 
 function usage() {
-    echo "This is the Google Cloud Flux MiniCluster job runner, where we apply a custom resource definition (CRD)."
+    echo "This is the Flux Cloud MiniCluster job runner, where we apply a custom resource definition (CRD)."
     echo "usage: minicluster-run --apply <crd> --logfile /path/to/log.out --namespace <namespace>"
 }
diff --git a/fluxcloud/main/experiment.py b/fluxcloud/main/experiment.py
index d623e23..0a7f733 100644
--- a/fluxcloud/main/experiment.py
+++ b/fluxcloud/main/experiment.py
@@ -31,12 +31,16 @@ def __init__(
         An experiment setup.
         """
         self.experiment_file = os.path.abspath(experiments)
-        self.template = os.path.abspath(template) if template else None
+        self.template = os.path.abspath(template) if template is not None else None
         self._outdir = outdir
         self.test = test
         self.settings = settings.Settings
         self.quiet = quiet
 
+        # Show the user the template file
+        if template:
+            logger.debug(f"Using template {self.template}")
+
         # Rewrite existing outputs
         self.force = kwargs.get("force") or False
 
         # Don't ask for confirmation to create/destroy
@@ -136,11 +140,11 @@ def validate(self):
         Validate that all paths exist (create output if it does not)
         """
         if self.template is not None and not os.path.exists(self.template):
-            logger.exit(f"Template file {self.template} does not exist.")
+            raise ValueError(f"Template file {self.template} does not exist.")
 
         # This file must always be provided and exist
         if not os.path.exists(self.experiment_file):
-            logger.exit(f"Experiments file {self.experiment_file} does not exist.")
+            raise ValueError(f"Experiments file {self.experiment_file} does not exist.")
 
 
 def expand_experiments(experiments):
@@ -163,7 +167,7 @@ def expand_experiments(experiments):
     elif "experiment" in experiments:
         matrix = expand_single_experiment(experiments)
     elif "experiments" in experiments:
-        matrix = expand_single_experiment(experiments)
+        matrix = expand_experiment_list(experiments)
     else:
         raise ValueError(
             'The key "experiment" or "experiments" or "matrix" is required.'
@@ -182,6 +186,43 @@ def add_experiment_ids(matrix):
     return matrix
 
 
+def expand_jobs(jobs):
+    """
+    Expand out jobs based on repeats
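+
+    E.g., a sketch: {"lmp": {"command": "lmp", "repeats": 2}} expands to
+    {"lmp-1": {...}, "lmp-2": {...}}; jobs without repeats pass through.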
+    """
+    final = {}
+    for jobname, job in jobs.items():
+        if "repeats" in job:
+            repeats = job["repeats"]
+            if repeats < 1:
+                raise ValueError(
+                    f'"repeats" must be a positive number greater than 0. Found {repeats} for {job["command"]}'
+                )
+
+            # Start at 1 and not 0
+            for i in range(1, repeats + 1):
+                final[f"{jobname}-{i}"] = job
+        else:
+            final[jobname] = job
+    return final
+
+
+def expand_experiment_list(experiments):
+    """
+    Given a list of experiments, expand out jobs
+    """
+    listing = experiments["experiments"]
+    for entry in listing:
+        for key in experiments:
+            if key == "experiments":
+                continue
+            if key == "jobs":
+                entry[key] = expand_jobs(experiments[key])
+                continue
+            entry[key] = experiments[key]
+    return listing
+
+
 def expand_single_experiment(experiments):
     """
     Expand a single experiment, ensuring to add the rest of the config.
@@ -190,6 +231,9 @@ def expand_single_experiment(experiments):
     for key in experiments:
         if key == "experiment":
             continue
+        if key == "jobs":
+            experiment[key] = expand_jobs(experiments[key])
+            continue
         experiment[key] = experiments[key]
     return [experiment]
 
@@ -206,6 +250,9 @@ def expand_experiment_matrix(experiments):
         for key in experiments:
             if key == "matrix":
                 continue
+            if key == "jobs":
+                experiment[key] = expand_jobs(experiments[key])
+                continue
 
             # This is an ordered dict
             experiment[key] = experiments[key]
         matrix.append(experiment)
@@ -220,22 +267,3 @@ def validate_experiments(experiments):
 
     if jsonschema.validate(experiments, schema=schemas.experiment_schema) is not None:
         raise ValueError("Invalid experiments schema.")
-
-
-def run_experiment(experiment, outdir, args):
-    """
-    Given one or more experiments, run them.
-    """
-    print("RUN EXPERIMENT")
-    # First bring up the cluster
-    import IPython
-
-    IPython.embed()
-    # TODO vsoch, this should be a shared function
-
-    # template = Template(read_file(template_file))
-
-    # Run this many commands
-    # for command in experiment["commands"]:
-    #     experiment["command"] = command
-    #     render = template.render(**experiment)
diff --git a/fluxcloud/main/schemas.py b/fluxcloud/main/schemas.py
index dcd1b4a..869a5d7 100644
--- a/fluxcloud/main/schemas.py
+++ b/fluxcloud/main/schemas.py
@@ -23,6 +23,7 @@
         "type": "object",
         "properties": {
             "command": {"type": "string"},
+            "repeats": {"type": "number"},
             "workdir": {"type": "string"},
             "image": {"type": "string"},
         },
@@ -45,10 +46,21 @@
 cloud_properties = {"zone": {"type": "string"}, "machine": {"type": "string"}}
 google_cloud_properties = copy.deepcopy(cloud_properties)
 google_cloud_properties["project"] = {"type": ["null", "string"]}
+aws_cloud_properties = {
+    "region": {"type": "string"},
+    "machine": {"type": "string"},
+    "private_networking": {"type": ["null", "boolean"]},
+    "efa_enabled": {"type": ["null", "boolean"]},
+    "ssh_key": {"type": ["string", "null"]},
+}
 
 kubernetes_properties = {"version": {"type": "string"}}
 kubernetes_cluster_properties = {
     "tags": {"type": "array", "items": {"type": "string"}},
+    "version": {"type": "string"},
 }
 
 minicluster_properties = {
@@ -72,6 +84,12 @@
 settings_properties = {
     "default_cloud": {"type": "string"},
     "config_editor": {"type": "string"},
+    "aws": {
+        "type": "object",
+        "properties": aws_cloud_properties,
+        "additionalProperties": False,
+        "required": ["region", "machine"],
+    },
     "google": {
         "type": "object",
         "properties": google_cloud_properties,
@@ -145,6 +163,7 @@
         "minicluster",
         "operator",
         "clouds",
+        "aws",
         "google",
         "kubernetes",
     ],
diff --git a/fluxcloud/settings.yml b/fluxcloud/settings.yml
index ed26bb2..19fdd3e 100644
--- a/fluxcloud/settings.yml
+++ b/fluxcloud/settings.yml
@@ -1,13 +1,13 @@
 # Defaults for flux-cloud
 
 # clouds that are supported
-clouds: [google]
+clouds: [google, aws]
 
 # config editor
 config_editor: vim
 
 # Backend specific settings
-default_cloud: google
+default_cloud: aws
 
 # operator defaults
 operator:
@@ -26,3 +26,9 @@ google:
   zone: us-central1-a
   machine: n2-standard-1
   project: null
+
+aws:
+  region: us-east-1
+  machine: m5.large
+  private_networking: false
+  efa_enabled: false
diff --git a/fluxcloud/version.py b/fluxcloud/version.py
index 7586430..2dca194 100644
--- a/fluxcloud/version.py
+++ b/fluxcloud/version.py
@@ -1,7 +1,7 @@
 # Copyright 2022 Lawrence Livermore National Security, LLC
 # SPDX-License-Identifier: Apache-2.0
 
-__version__ = "0.0.11"
+__version__ = "0.0.12"
 AUTHOR = "Vanessa Sochat"
 EMAIL = "vsoch@users.noreply.github.com"
 NAME = "flux-cloud"
diff --git a/setup.cfg b/setup.cfg
index f8f009f..336d0c9 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -6,4 +6,5 @@ per-file-ignores =
     fluxcloud/utils/__init__.py:F401
     fluxcloud/main/__init__.py:F401
     fluxcloud/main/clouds/__init__.py:F401
+    fluxcloud/main/clouds/aws/__init__.py:F401
     fluxcloud/main/clouds/google/__init__.py:F401