start of work to test adding efa support (#8)
* start of work to test adding efa support
* ensure schemas can handle boolean, do not overwrite metadata
* ensure we save up/down meta times too
* output directory should be defined for up/down to write meta.json times
* do not overwrite create/destroy cluster keys

Signed-off-by: vsoch <[email protected]>
vsoch authored Jan 5, 2023
1 parent 7fd2e0a commit df5a46e
Showing 13 changed files with 117 additions and 20 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@ and **Merged pull requests**. Critical items to know are:
The versions coincide with releases on pip. Only major versions will be released as tags on Github.

## [0.0.x](https://github.com/converged-computing/flux-cloud/tree/main) (0.0.x)
- support for custom cloud variables in the experiments config (0.0.13)
- support for Amazon EKS and running commands over iterations (0.0.12)
- better control of exit codes, addition of force cluster (0.0.11)
- support for experiment id selection, addition of osu-benchmarks example (0.0.1)
18 changes: 17 additions & 1 deletion docs/getting_started/aws.md
@@ -45,6 +45,22 @@ This is used so you can ssh (connect) to your workers!
Finally, ensure that aws is either your default cloud (the `default_cloud` in your settings.yml)
or you specify it with `--cloud` when you do run.

## Custom Variables

The following custom variables are supported in the "variables" section (key-value pairs)
for Amazon in an `experiments.yaml`:

```yaml
variables:
# Enable private networking
private_networking: true

# Enable efa (requires efa also set under the container limits)
efa_enabled: true
```
Note that we currently take a simple approach for boolean values: if the key is present (e.g., the examples
above), it will be rendered as true. Don't set a key to False; delete the key instead.
## Run Experiments
**IMPORTANT** for any experiment when you choose an instance type, you absolutely
@@ -54,7 +70,7 @@ true. E.g., `m5.large` has it set to true so it would work.
Each experiment is defined by the matrix and variables in an `experiment.yaml` that is used to
populate a `minicluster-template.yaml` that you can either provide, or use a template provided by the
library. One of the goals of the Flux Cloud Experiment runner is not just to run things, but to
-provide this library for you to easily edit and use! Take a look at the [examples](../examples)
+provide this library for you to easily edit and use! Take a look at the [examples](https://github.com/converged-computing/flux-cloud/tree/main/examples)
directory for a few that we provide. We will walk through a generic one here to launch
an experiment on a Kubernetes cluster. Note that before doing this step you should
have installed flux-cloud, along with kubectl and gcloud, and set your defaults (e.g., project zone)
14 changes: 14 additions & 0 deletions docs/getting_started/experiments.md
@@ -58,6 +58,20 @@ experiments:
Note that it's a yaml list.
### Custom Variables
Each cloud provider is allowed to specify any number of custom variables, and these
are available in the "variables" section. As an example, let's say we want to customize networking
arguments for aws:
```yaml
variables:
private_networking: true
efa_enabled: true
```
You can look at each cloud's page to see which variables are supported.
### MiniCluster Definition
We suggest defining the minicluster, although it is not required (by default we will use the name and namespace in your settings).
2 changes: 1 addition & 1 deletion docs/getting_started/google.md
@@ -42,7 +42,7 @@ or you specify it with `--cloud` when you do run.
Each experiment is defined by the matrix and variables in an `experiment.yaml` that is used to
populate a `minicluster-template.yaml` that you can either provide, or use a template provided by the
library. One of the goals of the Flux Cloud Experiment runner is not just to run things, but to
-provide this library for you to easily edit and use! Take a look at the [examples](../examples)
+provide this library for you to easily edit and use! Take a look at the [examples](https://github.com/converged-computing/flux-cloud/tree/main/examples)
directory for a few that we provide. We will walk through a generic one here to launch
an experiment on a Kubernetes cluster. Note that before doing this step you should
have installed flux-cloud, along with kubectl and gcloud, and set your defaults (e.g., project zone)
2 changes: 1 addition & 1 deletion fluxcloud/client/__init__.py
@@ -189,7 +189,7 @@ def get_parser():
help="experiment ID to apply to (<machine>-<size>)",
)

-for command in run, apply:
+for command in run, apply, up, down:
command.add_argument(
"-o",
"--output-dir",
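The hunk above registers the same `-o/--output-dir` option on `up` and `down` as well as `run` and `apply`. A minimal standalone sketch of that pattern with `argparse` follows; the program name, the subparser setup, and the help text are assumptions for illustration, not the project's actual parser.

```python
import argparse


def get_parser():
    # Hypothetical, cut-down parser: only the subcommands named in the diff
    parser = argparse.ArgumentParser(prog="flux-cloud-sketch")
    subparsers = parser.add_subparsers(dest="command")

    run = subparsers.add_parser("run")
    apply = subparsers.add_parser("apply")
    up = subparsers.add_parser("up")
    down = subparsers.add_parser("down")

    # Register the same option on every subcommand that needs an output directory
    for command in run, apply, up, down:
        command.add_argument(
            "-o",
            "--output-dir",
            dest="output_dir",
            help="directory to write results to (assumed help text)",
        )
    return parser


args = get_parser().parse_args(["up", "-o", "./data"])
print(args.command, args.output_dir)
```

Looping over the subparsers keeps the flag definition in one place, so `args.output_dir` exists (possibly as `None`) for all four commands.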
5 changes: 4 additions & 1 deletion fluxcloud/client/down.py
@@ -16,7 +16,10 @@ def main(args, parser, extra, subparser):

cli = get_experiment_client(args.cloud)
setup = ExperimentSetup(
-args.experiments, quiet=True, force_cluster=args.force_cluster
+args.experiments,
+quiet=True,
+force_cluster=args.force_cluster,
+outdir=args.output_dir,
)

# Update config settings on the fly
5 changes: 4 additions & 1 deletion fluxcloud/client/up.py
@@ -16,7 +16,10 @@ def main(args, parser, extra, subparser):

cli = get_experiment_client(args.cloud)
setup = ExperimentSetup(
-args.experiments, quiet=True, force_cluster=args.force_cluster
+args.experiments,
+quiet=True,
+force_cluster=args.force_cluster,
+outdir=args.output_dir,
)

# Update config settings on the fly
33 changes: 27 additions & 6 deletions fluxcloud/main/client.py
@@ -3,13 +3,12 @@
#
# SPDX-License-Identifier: Apache-2.0

-import copy
import os
import shutil

import fluxcloud.utils as utils
from fluxcloud.logger import logger
-from fluxcloud.main.decorator import timed
+from fluxcloud.main.decorator import save_meta, timed

here = os.path.dirname(os.path.abspath(__file__))

@@ -116,6 +115,7 @@ def down(self, *args, **kwargs):
"""
raise NotImplementedError

@save_meta
def apply(self, setup, experiment):
"""
Apply a CRD to run the experiment and wait for output.
@@ -182,11 +182,32 @@ def apply(self, setup, experiment):
if os.path.exists(crd):
os.remove(crd)

-# Save times and experiment metadata to file
-# TODO we could add cost estimation here - data from cloud select
-meta = copy.deepcopy(experiment)
-meta["times"] = self.times
def save_experiment_metadata(self, setup, experiment):
"""
Save experiment metadata, loading an existing meta.json, if present.
"""
# The experiment is defined by the machine type and size
experiment_dir = os.path.join(setup.outdir, experiment["id"])
if not os.path.exists(experiment_dir):
utils.mkdir_p(experiment_dir)

meta_file = os.path.join(experiment_dir, "meta.json")

# Load existing metadata, if we have it
meta = {"times": self.times}
if os.path.exists(meta_file):
meta = utils.read_json(meta_file)

# Don't update cluster-up/down if already here
frozen_keys = ["create-cluster", "destroy-cluster"]
for timekey, timevalue in self.times.items():
if timekey in meta.get("times", {}) and timekey in frozen_keys:
continue
meta["times"][timekey] = timevalue

# TODO we could add cost estimation here - data from cloud select
for key, value in experiment.items():
meta[key] = value
utils.write_json(meta, meta_file)
self.clear_minicluster_times()
return meta
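The merge that `save_experiment_metadata` performs can be sketched in isolation: load what was saved before, keep the first recorded value for the frozen cluster keys, and overwrite everything else. The function name `merge_meta`, the sample timings, and the experiment id below are hypothetical, and the sketch works on plain dicts rather than reading or writing `meta.json`.

```python
import copy

FROZEN_KEYS = ("create-cluster", "destroy-cluster")


def merge_meta(existing, new_times, experiment, frozen_keys=FROZEN_KEYS):
    """
    Merge new timings and experiment fields into previously saved metadata.
    """
    # Work on a copy so the caller's saved metadata is not mutated
    meta = copy.deepcopy(existing) if existing else {}
    times = meta.setdefault("times", {})

    for key, value in new_times.items():
        # Frozen keys keep their first recorded value
        if key in frozen_keys and key in times:
            continue
        times[key] = value

    # Experiment fields replace any stale copies
    for key, value in experiment.items():
        meta[key] = value
    return meta


saved = {"times": {"create-cluster": 300.0}}
merged = merge_meta(
    saved,
    {"create-cluster": 1.5, "minicluster-run": 42.0},
    {"id": "m5.large-2"},
)
print(merged)
```

Note that the frozen-key check looks inside `meta["times"]`, since that is where the timing keys are stored.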
9 changes: 8 additions & 1 deletion fluxcloud/main/clouds/aws/client.py
@@ -10,6 +10,7 @@
import fluxcloud.utils as utils
from fluxcloud.logger import logger
from fluxcloud.main.client import ExperimentClient
from fluxcloud.main.decorator import save_meta

here = os.path.dirname(os.path.abspath(__file__))

@@ -28,6 +29,7 @@ def __init__(self, **kwargs):
# This could eventually just be provided
self.config_template = os.path.join(here, "templates", "cluster-config.yaml")

@save_meta
def up(self, setup, experiment=None):
"""
Bring up a cluster
@@ -108,6 +110,9 @@ def generate_config(self, setup, experiment):
values["size"] = setup.get_size(experiment)
values["ssh_key"] = self.settings.aws.get("ssh_key")

# All extra custom variables
values["variables"] = experiment.get("variables", {})

# Optional booleans
for key in ["private_networking", "efa_enabled"]:
value = self.settings.aws.get(key)
@@ -118,6 +123,7 @@ def generate_config(self, setup, experiment):
logger.debug(result)
return result

@save_meta
def down(self, setup, experiment=None):
"""
Destroy a cluster
@@ -135,4 +141,5 @@ def down(self, setup, experiment=None):
]
if setup.force_cluster:
cmd.append("--force-cluster")
-return self.run_timed("destroy-cluster", cmd)
+self.run_timed("destroy-cluster", cmd)
+return self.save_experiment_metadata(setup, experiment)
4 changes: 2 additions & 2 deletions fluxcloud/main/clouds/aws/templates/cluster-config.yaml
@@ -19,5 +19,5 @@ managedNodeGroups:
{% if ssh_key %}ssh:
allow: true
publicKeyPath: {{ ssh_key }}{% endif %}
-{% if private_networking %}privateNetworking: true{% endif %}
-{% if efa_enabled %}efaEnabled: true{% endif %}
+{% if variables["private_networking"] %}privateNetworking: true{% endif %}
+{% if variables["efa_enabled"] %}efaEnabled: true{% endif %}
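The two template conditionals are truthiness tests on entries of `variables`. A plain-Python approximation (not Jinja2 itself) is sketched below; the function name `render_flags` is hypothetical, and it is an assumption that the values reach the template exactly as written in the experiments file.

```python
def render_flags(variables):
    """Plain-Python stand-in for the two template conditionals (not Jinja2)."""
    lines = []
    # Approximates: {% if variables["private_networking"] %}privateNetworking: true{% endif %}
    if variables.get("private_networking"):
        lines.append("privateNetworking: true")
    # Approximates: {% if variables["efa_enabled"] %}efaEnabled: true{% endif %}
    if variables.get("efa_enabled"):
        lines.append("efaEnabled: true")
    return lines


print(render_flags({"efa_enabled": True}))
print(render_flags({"efa_enabled": "false"}))
print(render_flags({}))
```

Under a truthiness test a non-empty string such as `"false"` still enables the flag, which is one reason the docs recommend deleting the key rather than setting it to a false value.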
3 changes: 3 additions & 0 deletions fluxcloud/main/clouds/google/client.py
@@ -4,6 +4,7 @@
# SPDX-License-Identifier: Apache-2.0

from fluxcloud.main.client import ExperimentClient
from fluxcloud.main.decorator import save_meta


class GoogleCloud(ExperimentClient):
@@ -24,6 +25,7 @@ def __init__(self, **kwargs):
"Please provide your Google Cloud project in your settings.yml or flux-cloud set google:project <project>"
)

@save_meta
def up(self, setup, experiment=None):
"""
Bring up a cluster
@@ -54,6 +56,7 @@ def up(self, setup, experiment=None):
cmd += ["--tags", ",".join(tags)]
return self.run_timed("create-cluster", cmd)

@save_meta
def down(self, setup, experiment=None):
"""
Destroy a cluster
39 changes: 34 additions & 5 deletions fluxcloud/main/decorator.py
@@ -7,18 +7,47 @@
from functools import partial, update_wrapper


-class timed:
-"""
-Time the length of the run, add to times
-"""

class Decorator:
def __init__(self, func):
update_wrapper(self, func)
self.func = func

def __get__(self, obj, objtype):
return partial(self.__call__, obj)


class save_meta(Decorator):
"""
Call to save metadata on the class with setup and experiment
"""

def __call__(self, cls, *args, **kwargs):

# setup is either the first argument or a kwarg
idx = 0
if "setup" in kwargs:
setup = kwargs["setup"]
else:
setup = args[idx]
idx += 1

# experiment is either the second argument or a kwarg
if "experiment" in kwargs:
experiment = kwargs["experiment"]
else:
experiment = args[idx] if len(args) > idx else None

res = self.func(cls, *args, **kwargs)
experiment = experiment or setup.get_single_experiment()
cls.save_experiment_metadata(setup, experiment)
return res
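The `Decorator` base class relies on the descriptor protocol: `__get__` binds the instance so a class-based decorator can wrap a method and still receive `self`. A self-contained sketch of that mechanism follows; `record_call`, `Client`, and the `after` hook are hypothetical names standing in for `save_meta`, the experiment client, and `save_experiment_metadata`.

```python
from functools import partial, update_wrapper


class Decorator:
    def __init__(self, func):
        # Copy the wrapped function's name and docstring onto the decorator object
        update_wrapper(self, func)
        self.func = func

    def __get__(self, obj, objtype=None):
        # Bind the instance as the first argument of __call__
        return partial(self.__call__, obj)


class record_call(Decorator):
    """Hypothetical decorator: run the method, then call a hook on the instance."""

    def __call__(self, instance, *args, **kwargs):
        result = self.func(instance, *args, **kwargs)
        instance.after(self.func.__name__)
        return result


class Client:
    def __init__(self):
        self.calls = []

    def after(self, name):
        self.calls.append(name)

    @record_call
    def up(self, setup, experiment=None):
        return f"up:{setup}"


client = Client()
print(client.up("my-setup"))
print(client.calls)
```

Because the hook runs after the wrapped method returns, the post-processing step does not run if the method raises.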


class timed(Decorator):
"""
Time the length of the run, add to times
"""

def __call__(self, cls, *args, **kwargs):

# Name of the key is after command
2 changes: 1 addition & 1 deletion fluxcloud/main/schemas.py
@@ -15,7 +15,7 @@
keyvals = {
"type": "object",
"patternProperties": {
-"\\w[\\w-]*": {"type": "string"},
+"\\w[\\w-]*": {"type": ["string", "number", "integer", "array", "boolean"]},
},
}
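The widened `keyvals` schema lets a variable be a string, number, integer, array, or boolean, which still leaves out null and nested objects. The sketch below is a rough standard-library approximation of that check, not the JSON Schema validator the project uses; `check_variables` and the sample values are hypothetical.

```python
import re

# Same key pattern as the schema; patternProperties uses an unanchored search
KEY_PATTERN = re.compile(r"\w[\w-]*")

# Python counterparts of: string, number, integer, array, boolean
ALLOWED_TYPES = (str, int, float, list, bool)


def check_variables(variables):
    """Return a list of error strings for values of a disallowed type."""
    errors = []
    for key, value in variables.items():
        if KEY_PATTERN.search(key) and not isinstance(value, ALLOWED_TYPES):
            errors.append(f"{key}: {type(value).__name__} is not allowed")
    return errors


print(check_variables({"efa_enabled": True, "zone": "us-east-1a", "nodes": 4}))
print(check_variables({"efa_enabled": None, "extra": {"a": 1}}))
```

Before this change only strings passed, so `efa_enabled: true` (a YAML boolean) would have been rejected.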