small tweaks to submit (#24)
* small tweaks to submit so that variables like requests/limits are carried through
* do not check flux-operator.yaml file
* adding support for saving additional metadata about cluster nodes
* fix data formatting
* skip check of metadata files, clusters likely to be different

Signed-off-by: vsoch <[email protected]>
vsoch authored Jan 29, 2023
1 parent 26100d4 commit e808b5c
Showing 25 changed files with 1,330 additions and 69 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@ and **Merged pull requests**. Critical items to know are:
The versions coincide with releases on pip. Only major versions will be released as tags on Github.

## [0.0.x](https://github.com/converged-computing/flux-cloud/tree/main) (0.0.x)
- data should be namespaced by cloud type (so multiple experiments can be run alongside) (0.1.17)
- add flux-cloud ui to just bring up (and down) a user interface (0.1.16)
- support for submit and batch, to run jobs on the same MiniCluster (0.1.15)
- minikube docker pull needs message, update tests and typo (0.1.14)
35 changes: 22 additions & 13 deletions docs/getting_started/commands.md
@@ -295,21 +295,30 @@ managedNodeGroups:
By default, flux cloud keeps all scripts that the job renders in the experiment output directory under `.scripts`. If you
want to clean up instead, you can add the `--cleanup` flag. We do this so you can inspect a script to debug, or if you
just want to keep them for reproducibility. As an example, here is output from a run with multiple repeats of the
same command, across two MiniCluster cluster sizes (2 and 4):
same command, across two MiniCluster cluster sizes (2 and 4). As of version `0.1.17` the data is also organized
by the runner (e.g., minikube vs google) so you can run the experiments across multiple clouds without conflict.

```console
$ tree data/k8s-size-4-n1-standard-1/.scripts/
├── cluster-create.sh
├── cluster-destroy.sh
├── eksctl-config.yaml
├── flux-operator.yaml
├── minicluster-run-lmp-16-10-minicluster-size-16.sh
├── minicluster-run-lmp-16-11-minicluster-size-16.sh
├── minicluster-run-lmp-16-12-minicluster-size-16.sh
...
├── minicluster-run-lmp-64-8-minicluster-size-64.sh
├── minicluster-run-lmp-64-9-minicluster-size-64.sh
└── minicluster.yaml
$ tree -a ./data/
./data/
└── minikube
└── k8s-size-4-local
├── lmp-size-2-minicluster-size-2
│ └── log.out
├── lmp-size-4-minicluster-size-4
│ └── log.out
├── meta.json
└── .scripts
├── cluster-create-minikube.sh
├── flux-operator.yaml
├── kubectl-version.yaml
├── minicluster-run-lmp-size-2-minicluster-size-2.sh
├── minicluster-run-lmp-size-4-minicluster-size-4.sh
├── minicluster-size-2.yaml
├── minicluster-size-4.yaml
├── minikube-version.json
├── nodes-size-4.json
└── nodes-size-4.txt
```

And that's it! I think there might be a more elegant way to determine what cluster is running,
5 changes: 4 additions & 1 deletion fluxcloud/client/helpers.py
@@ -3,6 +3,8 @@
#
# SPDX-License-Identifier: Apache-2.0

import os

import fluxcloud.utils as utils
from fluxcloud.logger import logger
from fluxcloud.main import get_experiment_client
@@ -21,7 +23,8 @@ def prepare_client(args, extra):
force_cluster=args.force_cluster,
template=args.template,
cleanup=args.cleanup,
outdir=args.output_dir,
# Ensure the output directory is namespaced by the cloud name
outdir=os.path.join(args.output_dir, cli.name),
test=args.test,
quiet=True,
)
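
The effect of the `outdir` change in this hunk can be sketched in isolation. This is a minimal illustration and not the project's code: `namespaced_outdir` and `experiment_root` are hypothetical helpers, and the experiment id is borrowed from the example tree in the docs. It assumes, as the `root_dir` docstring in this commit suggests, that the experiment id is joined onto the namespaced output directory.

```python
import os


def namespaced_outdir(output_dir, cloud_name):
    """Return the output directory namespaced by the cloud (runner) name."""
    return os.path.join(output_dir, cloud_name)


def experiment_root(output_dir, cloud_name, expid):
    """Join the experiment id onto the namespaced output directory."""
    return os.path.join(namespaced_outdir(output_dir, cloud_name), expid)


if __name__ == "__main__":
    # The same experiment id under two runners lands in separate directories
    print(experiment_root("data", "minikube", "k8s-size-4-local"))
    print(experiment_root("data", "google", "k8s-size-4-local"))
```

Because the cloud name sits between the output directory and the experiment id, two runners that happen to share an experiment id no longer write to the same path.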
11 changes: 7 additions & 4 deletions fluxcloud/main/client.py
@@ -108,10 +108,13 @@ def open_ui(self, setup, experiment, size, api=None, persistent=False):

logger.info(f"\n🌀 Bringing up MiniCluster of size {size}")

# Get the global "job" for the size (and validate only one image)
# This will raise error if > 1 image, or no image.
image = experiment.get_persistent_image(size)
job = {"image": image, "token": api.token, "user": api.user}
# Get persistent variables for this job size, image is required
job = experiment.get_persistent_variables(size, required=["image"])
job.update({"token": api.token, "user": api.user})

# We can't have a command
if "command" in job:
del job["command"]
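
As a rough sketch of what the new lines do to the job spec, the function below merges the API credentials into the persistent variables and drops any `command`. The name `build_ui_job` is hypothetical and the inputs are plain dictionaries, so this shows only the dictionary handling, not the real `open_ui` method.

```python
def build_ui_job(jobvars, token, user):
    """Build the job spec for an interactive MiniCluster, which has no command."""
    # Copy so the caller's variables are not mutated
    job = dict(jobvars)
    job.update({"token": token, "user": user})

    # Equivalent to the "if in / del" pair: remove the key when present
    job.pop("command", None)
    return job


if __name__ == "__main__":
    print(build_ui_job({"image": "example-image", "command": "lmp"}, "secret", "flux"))
```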

# Pre-pull containers, etc.
if hasattr(self, "pre_apply"):
7 changes: 4 additions & 3 deletions fluxcloud/main/clouds/aws/scripts/cluster-create
@@ -62,10 +62,11 @@ fi

run_echo eksctl create cluster -f ${CONFIG_FILE}

# Show nodes
run_echo kubectl get nodes

# Deploy the operator TODO should be variables here
install_operator ${SCRIPT_DIR} ${REPOSITORY} ${BRANCH}
run_echo kubectl get namespace
run_echo kubectl describe namespace operator-system

# Save versions of kubectl, eksctl
run_echo_save "${SCRIPT_DIR}/eksctl-version.json" eksctl version --output=json -d --verbose 5
save_common_metadata ${SCRIPT_DIR} ${SIZE}
4 changes: 4 additions & 0 deletions fluxcloud/main/clouds/google/scripts/cluster-create
@@ -77,3 +77,7 @@ install_operator ${SCRIPT_DIR} ${REPOSITORY} ${BRANCH}

run_echo kubectl get namespace
run_echo kubectl describe namespace operator-system

# Save versions of kubectl, gcloud
run_echo_save "${SCRIPT_DIR}/gcloud-version.json" gcloud version --format=json
save_common_metadata ${SCRIPT_DIR} ${SIZE}
11 changes: 11 additions & 0 deletions fluxcloud/main/clouds/local/scripts/cluster-create-minikube
@@ -21,12 +21,22 @@ print_magenta " branch : ${BRANCH}"
is_installed minikube
is_installed wget

function save_versions () {

SCRIPT_DIR=${1}
SIZE=${2}

run_echo_save "${SCRIPT_DIR}/minikube-version.yaml" minikube version --output=yaml --components=true
save_common_metadata ${SCRIPT_DIR} ${SIZE}
}

# Check if it already exists
minikube status
retval=$?
if [[ "${retval}" == "0" ]]; then
print_blue "A MiniKube cluster already exists."
install_operator ${SCRIPT_DIR} ${REPOSITORY} ${BRANCH}
save_versions ${SCRIPT_DIR} ${SIZE}
echo
exit 0
fi
@@ -44,3 +54,4 @@ run_echo kubectl get nodes

run_echo kubectl get namespace
run_echo kubectl describe namespace operator-system
save_versions ${SCRIPT_DIR} ${SIZE}
22 changes: 22 additions & 0 deletions fluxcloud/main/clouds/shared/scripts/helpers.sh
@@ -47,6 +47,20 @@ function install_operator() {
kubectl apply -f $tmpfile
}

function save_common_metadata() {
# Save common versions across clouds for kubectl and the cluster nodes
SCRIPT_DIR="${1}"
SIZE="${2}"

run_echo_save "${SCRIPT_DIR}/kubectl-version.yaml" kubectl version --output=yaml

# Show nodes and save metadata to script directory
run_echo kubectl get nodes
run_echo_save "${SCRIPT_DIR}/nodes-size-${SIZE}.json" kubectl get nodes -o json
run_echo_save "${SCRIPT_DIR}/nodes-size-${SIZE}.txt" kubectl describe nodes
}
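
Once `nodes-size-${SIZE}.json` has been saved, it can be read back for analysis. The sketch below is one hedged way to do that: `node_names` is a hypothetical helper, and it relies only on the standard shape of `kubectl get nodes -o json` output (a list object whose `items` each carry `metadata.name`). The sample document is made up for illustration.

```python
import json


def node_names(nodes_json_text):
    """Return node names from the JSON printed by `kubectl get nodes -o json`."""
    data = json.loads(nodes_json_text)
    return [item["metadata"]["name"] for item in data.get("items", [])]


if __name__ == "__main__":
    # Illustrative sample only; real output carries many more fields per node
    sample = json.dumps(
        {
            "kind": "List",
            "items": [
                {"metadata": {"name": "minikube"}},
                {"metadata": {"name": "minikube-m02"}},
            ],
        }
    )
    print(node_names(sample))
```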



function run_echo() {
# Show the user the command then run it
@@ -55,6 +69,14 @@ function run_echo() {
retry $@
}

function run_echo_save() {
echo
save_to="${1}"
shift
print_green "$@ > ${save_to}"
$@ > ${save_to}
}
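
For readers more at home in Python, the behaviour of `run_echo_save` (show the command, then redirect its standard output to a file) can be approximated as below. This is an analogue written for illustration, not part of the commit: `run_and_save` is a hypothetical name, and unlike the shell helper it takes the command as a list of arguments.

```python
import os
import subprocess
import sys
import tempfile


def run_and_save(save_to, command):
    """Print the command, run it, and write its stdout to save_to."""
    print(f"{' '.join(command)} > {save_to}")
    with open(save_to, "w") as fd:
        result = subprocess.run(command, stdout=fd, check=False)
    return result.returncode


if __name__ == "__main__":
    # Use the current interpreter as a stand-in for kubectl or minikube
    path = os.path.join(tempfile.mkdtemp(), "version.txt")
    run_and_save(path, [sys.executable, "--version"])
    with open(path) as fd:
        print(fd.read().strip())
```

Passing the command as a list avoids the word-splitting that an unquoted `$@` performs in the shell version.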

function run_echo_allow_fail() {
echo
print_green "$@"
31 changes: 16 additions & 15 deletions fluxcloud/main/experiment.py
@@ -161,7 +161,7 @@ def variables(self):
@property
def root_dir(self):
"""
Consistent means to get experiment.
Consistent means to get experiment, also namespaced to cloud/runner.
"""
return os.path.join(self.outdir, self.expid)

@@ -193,31 +193,32 @@ def iter_jobs(self):

yield size, jobname, job

def get_persistent_image(self, size):
def get_persistent_variables(self, size, required=None):
"""
A persistent image is a job image used across a size of MiniCluster
Get persistent variables that should be used across the MiniCluster
"""
image = None
jobvars = {}
for _, job in self.jobs.items():

# Skip jobs targeted for a different size
if "size" in job and job["size"] != size:
continue

if "image" in job and not image:
image = job["image"]
continue
if "image" in job and image != job["image"]:
raise ValueError(
f"Submit uses a consistent container image, but found two images under size {size}: {image} and {job['image']}"
for key, value in job.items():
if key not in jobvars or (key in jobvars and jobvars[key] == value):
jobvars[key] = value
continue
logger.warning(
f'Inconsistent job variable "{key}" between MiniCluster jobs: {value} vs. {jobvars[key]}'
)

# If we get here and we don't have an image
if not image:
raise ValueError(
'Submit requires a container "image" under at least one job spec to create the MiniCluster.'
)
return image
for req in required or []:
if req not in jobvars:
raise ValueError(
f'Submit requires a "{req}" field under at least one job spec to create the MiniCluster.'
)
return jobvars
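
Since the diff view mixes removed and added lines, here is a standalone sketch of the new merging logic. It is a free function over a plain `jobs` dictionary rather than the `Experiment` method, the logger name is made up, and the warning looks up the conflicting entry by `key`. The first value seen for a key is kept and later conflicting values only trigger a warning.

```python
import logging

logger = logging.getLogger("fluxcloud-sketch")


def get_persistent_variables(jobs, size, required=None):
    """Collect variables shared by all jobs that target a MiniCluster size."""
    jobvars = {}
    for job in jobs.values():

        # Skip jobs targeted for a different size
        if "size" in job and job["size"] != size:
            continue

        for key, value in job.items():
            if key not in jobvars:
                jobvars[key] = value
            elif jobvars[key] != value:
                # Keep the first value seen and report the conflict
                logger.warning(
                    f'Inconsistent job variable "{key}": {value} vs. {jobvars[key]}'
                )

    for req in required or []:
        if req not in jobvars:
            raise ValueError(f'Submit requires a "{req}" field under at least one job spec.')
    return jobvars


if __name__ == "__main__":
    jobs = {
        "lmp-a": {"image": "example-image", "command": "lmp -in a", "size": 2},
        "lmp-b": {"image": "example-image", "command": "lmp -in b", "size": 2},
    }
    print(get_persistent_variables(jobs, 2, required=["image"]))
```

Per-job fields such as `command` will usually differ between jobs, which is why the caller in `open_ui` removes `command` afterwards.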

@property
def script_dir(self):
2 changes: 1 addition & 1 deletion fluxcloud/version.py
@@ -1,7 +1,7 @@
# Copyright 2022-2023 Lawrence Livermore National Security, LLC
# SPDX-License-Identifier: Apache-2.0

__version__ = "0.1.16"
__version__ = "0.1.17"
AUTHOR = "Vanessa Sochat"
EMAIL = "[email protected]"
NAME = "flux-cloud"
Original file line number Diff line number Diff line change
@@ -50,6 +50,20 @@ function install_operator() {
kubectl apply -f $tmpfile
}

function save_common_metadata() {
# Save common versions across clouds for kubectl and the cluster nodes
SCRIPT_DIR="${1}"
SIZE="${2}"

run_echo_save "${SCRIPT_DIR}/kubectl-version.yaml" kubectl version --output=yaml

# Show nodes and save metadata to script directory
run_echo kubectl get nodes
run_echo_save "${SCRIPT_DIR}/nodes-size-${SIZE}.json" kubectl get nodes -o json
run_echo_save "${SCRIPT_DIR}/nodes-size-${SIZE}.txt" kubectl describe nodes
}



function run_echo() {
# Show the user the command then run it
@@ -58,6 +72,14 @@ function run_echo() {
retry $@
}

function run_echo_save() {
echo
save_to="${1}"
shift
print_green "$@ > ${save_to}"
$@ > ${save_to}
}

function run_echo_allow_fail() {
echo
print_green "$@"
@@ -131,11 +153,11 @@ function with_exponential_backoff {
# Defaults - these are in the config but left here for information
CLUSTER_NAME="flux-cluster"
CLUSTER_VERSION="1.23"
FORCE_CLUSTER="true"
FORCE_CLUSTER="false"
SIZE=4
REPOSITORY="flux-framework/flux-operator"
BRANCH="main"
SCRIPT_DIR="/tmp/lammps-data-PeHJF2/k8s-size-4-local/.scripts"
SCRIPT_DIR="/home/vanessa/Desktop/Code/flux/flux-cloud/tests/lammps/data/minikube/k8s-size-4-local/.scripts"

print_magenta " cluster : ${CLUSTER_NAME}"
print_magenta " version : ${CLUSTER_VERSION}"
@@ -146,12 +168,22 @@ print_magenta " branch : ${BRANCH}"
is_installed minikube
is_installed wget

function save_versions () {

SCRIPT_DIR=${1}
SIZE=${2}

run_echo_save "${SCRIPT_DIR}/minikube-version.yaml" minikube version --output=yaml --components=true
save_common_metadata ${SCRIPT_DIR} ${SIZE}
}

# Check if it already exists
minikube status
retval=$?
if [[ "${retval}" == "0" ]]; then
print_blue "A MiniKube cluster already exists."
install_operator ${SCRIPT_DIR} ${REPOSITORY} ${BRANCH}
save_versions ${SCRIPT_DIR} ${SIZE}
echo
exit 0
fi
@@ -168,4 +200,5 @@ install_operator ${SCRIPT_DIR} ${REPOSITORY} ${BRANCH}
run_echo kubectl get nodes

run_echo kubectl get namespace
run_echo kubectl describe namespace operator-system
run_echo kubectl describe namespace operator-system
save_versions ${SCRIPT_DIR} ${SIZE}
Original file line number Diff line number Diff line change
@@ -50,6 +50,20 @@ function install_operator() {
kubectl apply -f $tmpfile
}

function save_common_metadata() {
# Save common versions across clouds for kubectl and the cluster nodes
SCRIPT_DIR="${1}"
SIZE="${2}"

run_echo_save "${SCRIPT_DIR}/kubectl-version.yaml" kubectl version --output=yaml

# Show nodes and save metadata to script directory
run_echo kubectl get nodes
run_echo_save "${SCRIPT_DIR}/nodes-size-${SIZE}.json" kubectl get nodes -o json
run_echo_save "${SCRIPT_DIR}/nodes-size-${SIZE}.txt" kubectl describe nodes
}



function run_echo() {
# Show the user the command then run it
@@ -58,6 +72,14 @@ function run_echo() {
retry $@
}

function run_echo_save() {
echo
save_to="${1}"
shift
print_green "$@ > ${save_to}"
$@ > ${save_to}
}

function run_echo_allow_fail() {
echo
print_green "$@"
@@ -129,7 +151,7 @@ function with_exponential_backoff {
}

# Defaults - these are in the config but left here for information
FORCE_CLUSTER="true"
FORCE_CLUSTER="false"

is_installed minikube
is_installed yes
Original file line number Diff line number Diff line change
@@ -202,6 +202,10 @@ spec:
description: Logging modes determine the output you see in the job
log
properties:
debug:
default: false
description: Debug mode adds extra verbosity to Flux
type: boolean
quiet:
default: false
description: Quiet mode silences all output so the job only shows