start of work to test adding efa support (#8)

* start of work to test adding efa support
* ensure schemas can handle boolean, do not overwrite metadata
* ensure we save up/down meta times too
* output directory should be defined for up/down to write meta.json times
* do not overwrite create/destroy cluster keys

Signed-off-by: vsoch <[email protected]>
converged-computing · Jan 5, 2023 · df5a46e
1 parent 7fd2e0a

Showing 13 changed files with 117 additions and 20 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -14,6 +14,7 @@ and **Merged pull requests**. Critical items to know are:
 The versions coincide with releases on pip. Only major versions will be released as tags on Github.
 
 ## [0.0.x](https://github.com/converged-computing/flux-cloud/tree/main) (0.0.x)
+ - support for custom cloud variables in the experiments config (0.0.13)
  - support for Amazon EKS and running commands over iterations (0.0.12)
  - better control of exit codes, addition of force cluster (0.0.11)
  - support for experiment id selection, addition of osu-benchmarks example (0.0.1)

diff --git a/docs/getting_started/aws.md b/docs/getting_started/aws.md
@@ -45,6 +45,22 @@ This is used so you can ssh (connect) to your workers!
 Finally, ensure that aws is either your default cloud (the `default_cloud` in your settings.yml)
 or you specify it with `--cloud` when you do run.
 
+## Custom Variables
+
+The following custom variables are supported in the "variables" section (key-value pairs)
+for Amazon in an `experiments.yaml`:
+
+```yaml
+variables:
+    # Enable private networking
+    private_networking: true
+
+    # Enable efa (requires efa also set under the container limits)
+    efa_enabled: true
+```
+Note that we currently take a simple approach for boolean values: if a key is present (e.g., as in the examples
+above) it will be rendered as true. Don't set a key to False; delete the key instead.
+
 ## Run Experiments
 
 **IMPORTANT** for any experiment when you choose an instance type, you absolutely
@@ -54,7 +70,7 @@ true. E.g., `m5.large` has it set to true so it would work.
 Each experiment is defined by the matrix and variables in an `experiment.yaml` that is used to
 populate a `minicluster-template.yaml` that you can either provide, or use a template provided by the
 library. One of the goals of the Flux Cloud Experiment runner is not just to run things, but to
-provide this library for you to easily edit and use! Take a look at the [examples](../examples)
+provide this library for you to easily edit and use! Take a look at the [examples](https://github.com/converged-computing/flux-cloud/tree/main/examples)
 directory for a few that we provide. We will walk through a generic one here to launch
 an experiment on a Kubernetes cluster. Note that before doing this step you should
 have installed flux-cloud, along with kubectl and gcloud, and set your defaults (e.g., project zone)

diff --git a/docs/getting_started/experiments.md b/docs/getting_started/experiments.md
@@ -58,6 +58,20 @@ experiments:
 
 Note that it's a yaml list.
 
+### Custom Variables
+
+Each cloud provider is allowed to specify any number of custom variables, and these
+are available in the "variables" section. As an example, let's say we want to customize networking
+arguments for aws:
+
+```yaml
+variables:
+    private_networking: true
+    efa_enabled: true
+```
+
+You can look at each cloud page here to see what variables are known.
+
 ### MiniCluster Definition
 
 The minicluster is suggested to be defined, although it's not required (by default we will use the name and namespace in your settings).

diff --git a/docs/getting_started/google.md b/docs/getting_started/google.md
@@ -42,7 +42,7 @@ or you specify it with `--cloud` when you do run.
 Each experiment is defined by the matrix and variables in an `experiment.yaml` that is used to
 populate a `minicluster-template.yaml` that you can either provide, or use a template provided by the
 library. One of the goals of the Flux Cloud Experiment runner is not just to run things, but to
-provide this library for you to easily edit and use! Take a look at the [examples](../examples)
+provide this library for you to easily edit and use! Take a look at the [examples](https://github.com/converged-computing/flux-cloud/tree/main/examples)
 directory for a few that we provide. We will walk through a generic one here to launch
 an experiment on a Kubernetes cluster. Note that before doing this step you should
 have installed flux-cloud, along with kubectl and gcloud, and set your defaults (e.g., project zone)

diff --git a/fluxcloud/client/__init__.py b/fluxcloud/client/__init__.py
@@ -189,7 +189,7 @@ def get_parser():
             help="experiment ID to apply to (<machine>-<size>)",
         )
 
-    for command in run, apply:
+    for command in run, apply, up, down:
         command.add_argument(
             "-o",
             "--output-dir",

diff --git a/fluxcloud/client/down.py b/fluxcloud/client/down.py
@@ -16,7 +16,10 @@ def main(args, parser, extra, subparser):
 
     cli = get_experiment_client(args.cloud)
     setup = ExperimentSetup(
-        args.experiments, quiet=True, force_cluster=args.force_cluster
+        args.experiments,
+        quiet=True,
+        force_cluster=args.force_cluster,
+        outdir=args.output_dir,
     )
 
     # Update config settings on the fly

diff --git a/fluxcloud/client/up.py b/fluxcloud/client/up.py
@@ -16,7 +16,10 @@ def main(args, parser, extra, subparser):
 
     cli = get_experiment_client(args.cloud)
     setup = ExperimentSetup(
-        args.experiments, quiet=True, force_cluster=args.force_cluster
+        args.experiments,
+        quiet=True,
+        force_cluster=args.force_cluster,
+        outdir=args.output_dir,
     )
 
     # Update config settings on the fly
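
Both `up.py` and `down.py` now read `args.output_dir`, which only exists because the parser loop registers `-o/--output-dir` on the `up` and `down` subcommands as well. A minimal sketch of that pattern with `argparse` follows. The subcommand names come from the diff; the program name, `dest`, default, and help text are assumptions for illustration, not the project's actual values.

```python
import argparse


def get_parser():
    parser = argparse.ArgumentParser(prog="flux-cloud-sketch")
    subparsers = parser.add_subparsers(dest="command")

    # Subcommand names follow the diff; everything else here is illustrative
    run = subparsers.add_parser("run")
    apply = subparsers.add_parser("apply")
    up = subparsers.add_parser("up")
    down = subparsers.add_parser("down")

    # Register the shared flag on each subcommand so all four accept it
    for command in run, apply, up, down:
        command.add_argument(
            "-o",
            "--output-dir",
            dest="output_dir",
            default=None,  # assumed default, not taken from the source
            help="directory to write meta.json and other output to",
        )
    return parser


args = get_parser().parse_args(["up", "-o", "/tmp/results"])
print(args.command, args.output_dir)
```

Without the flag on `up` and `down`, accessing `args.output_dir` in those commands would raise an `AttributeError`, which is why the loop change and the `outdir=args.output_dir` change belong together.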

diff --git a/fluxcloud/main/client.py b/fluxcloud/main/client.py
@@ -3,13 +3,12 @@
 #
 # SPDX-License-Identifier: Apache-2.0
 
-import copy
 import os
 import shutil
 
 import fluxcloud.utils as utils
 from fluxcloud.logger import logger
-from fluxcloud.main.decorator import timed
+from fluxcloud.main.decorator import save_meta, timed
 
 here = os.path.dirname(os.path.abspath(__file__))
 
@@ -116,6 +115,7 @@ def down(self, *args, **kwargs):
         """
         raise NotImplementedError
 
+    @save_meta
     def apply(self, setup, experiment):
         """
         Apply a CRD to run the experiment and wait for output.
@@ -182,11 +182,32 @@ def apply(self, setup, experiment):
             if os.path.exists(crd):
                 os.remove(crd)
 
-        # Save times and experiment metadata to file
-        # TODO we could add cost estimation here - data from cloud select
-        meta = copy.deepcopy(experiment)
-        meta["times"] = self.times
+    def save_experiment_metadata(self, setup, experiment):
+        """
+        Save experiment metadata, loading an existing meta.json, if present.
+        """
+        # The experiment is defined by the machine type and size
+        experiment_dir = os.path.join(setup.outdir, experiment["id"])
+        if not os.path.exists(experiment_dir):
+            utils.mkdir_p(experiment_dir)
+
         meta_file = os.path.join(experiment_dir, "meta.json")
+
+        # Load existing metadata, if we have it
+        meta = {"times": self.times}
+        if os.path.exists(meta_file):
+            meta = utils.read_json(meta_file)
+
+            # Don't update cluster-up/down if already here
+            frozen_keys = ["create-cluster", "destroy-cluster"]
+            for timekey, timevalue in self.times.items():
+                if timekey in meta and timekey in frozen_keys:
+                    continue
+                meta["times"][timekey] = timevalue
+
+        # TODO we could add cost estimation here - data from cloud select
+        for key, value in experiment.items():
+            meta[key] = value
         utils.write_json(meta, meta_file)
         self.clear_minicluster_times()
         return meta
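
The merge behavior described in `save_experiment_metadata` (load an existing `meta.json`, add new times, leave `create-cluster` and `destroy-cluster` alone once recorded) can be sketched in isolation as below. This is a simplified stand-in using only the standard library, not the method itself: `merge_times` and `save_metadata` are hypothetical names, and the sketch checks membership against the nested `times` mapping, which is an assumption about the intent (the diff tests `timekey in meta`).

```python
import json
import os
import tempfile

FROZEN_KEYS = ("create-cluster", "destroy-cluster")


def merge_times(existing, new, frozen=FROZEN_KEYS):
    """Return a copy of `existing` updated with `new`, keeping frozen keys already recorded."""
    merged = dict(existing)
    for key, value in new.items():
        if key in merged and key in frozen:
            continue
        merged[key] = value
    return merged


def save_metadata(meta_file, experiment, times):
    """Load meta.json if present, merge times, then write experiment fields on top."""
    meta = {"times": {}}
    if os.path.exists(meta_file):
        with open(meta_file) as fd:
            meta = json.load(fd)
    meta["times"] = merge_times(meta.get("times", {}), times)
    meta.update(experiment)
    with open(meta_file, "w") as fd:
        json.dump(meta, fd, indent=4)
    return meta


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "meta.json")
    # First call records the cluster creation time
    first = save_metadata(path, {"id": "m5.large-2"}, {"create-cluster": 100.0})
    # Second call must not overwrite it, but may add new keys
    second = save_metadata(
        path, {"id": "m5.large-2"}, {"create-cluster": 5.0, "apply": 12.5}
    )
```

Keeping the merge in a pure function makes the frozen-key rule easy to test without touching the filesystem.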

diff --git a/fluxcloud/main/clouds/aws/client.py b/fluxcloud/main/clouds/aws/client.py
@@ -10,6 +10,7 @@
 import fluxcloud.utils as utils
 from fluxcloud.logger import logger
 from fluxcloud.main.client import ExperimentClient
+from fluxcloud.main.decorator import save_meta
 
 here = os.path.dirname(os.path.abspath(__file__))
 
@@ -28,6 +29,7 @@ def __init__(self, **kwargs):
         # This could eventually just be provided
         self.config_template = os.path.join(here, "templates", "cluster-config.yaml")
 
+    @save_meta
     def up(self, setup, experiment=None):
         """
         Bring up a cluster
@@ -108,6 +110,9 @@ def generate_config(self, setup, experiment):
         values["size"] = setup.get_size(experiment)
         values["ssh_key"] = self.settings.aws.get("ssh_key")
 
+        # All extra custom variables
+        values["variables"] = experiment.get("variables", {})
+
         # Optional booleans
         for key in ["private_networking", "efa_enabled"]:
             value = self.settings.aws.get("private_networking")
@@ -118,6 +123,7 @@ def generate_config(self, setup, experiment):
         logger.debug(result)
         return result
 
+    @save_meta
     def down(self, setup, experiment=None):
         """
         Destroy a cluster
@@ -135,4 +141,5 @@ def down(self, setup, experiment=None):
         ]
         if setup.force_cluster:
             cmd.append("--force-cluster")
-        return self.run_timed("destroy-cluster", cmd)
+        self.run_timed("destroy-cluster", cmd)
+        return self.save_experiment_metadata(setup, experiment)
diff --git a/fluxcloud/main/clouds/aws/templates/cluster-config.yaml b/fluxcloud/main/clouds/aws/templates/cluster-config.yaml
@@ -19,5 +19,5 @@ managedNodeGroups:
     {% if ssh_key %}ssh:
       allow: true
       publicKeyPath: {{ ssh_key }}{% endif %}
-    {% if private_networking %}privateNetworking: true{% endif %}
-    {% if efa_enabled %}efaEnabled: true{% endif %}
+    {% if variables["private_networking"] %}privateNetworking: true{% endif %}
+    {% if variables["efa_enabled"] %}efaEnabled: true{% endif %}
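
The template now gates each flag on the truthiness of an entry in `variables`. The plain-Python sketch below mimics those two `{% if %}` checks to show which inputs produce output. It is an emulation, not the template engine: `render_node_group_flags` is a hypothetical helper, and it assumes the engine treats a missing key as false and follows Python-style truthiness for present values.

```python
def render_node_group_flags(variables):
    """Mimic the template's `{% if variables[...] %}` checks with Python truthiness."""
    lines = []
    if variables.get("private_networking"):
        lines.append("privateNetworking: true")
    if variables.get("efa_enabled"):
        lines.append("efaEnabled: true")
    return lines


print(render_node_group_flags({"efa_enabled": True}))
print(render_node_group_flags({"private_networking": "false"}))
```

Under these assumptions a quoted string such as `"false"` is non-empty and therefore still emits the flag, which is one reason to delete a key rather than set it to a false-looking value.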
diff --git a/fluxcloud/main/clouds/google/client.py b/fluxcloud/main/clouds/google/client.py
@@ -4,6 +4,7 @@
 # SPDX-License-Identifier: Apache-2.0
 
 from fluxcloud.main.client import ExperimentClient
+from fluxcloud.main.decorator import save_meta
 
 
 class GoogleCloud(ExperimentClient):
@@ -24,6 +25,7 @@ def __init__(self, **kwargs):
                 "Please provide your Google Cloud project in your settings.yml or flux-cloud set google:project <project>"
             )
 
+    @save_meta
     def up(self, setup, experiment=None):
         """
         Bring up a cluster
@@ -54,6 +56,7 @@ def up(self, setup, experiment=None):
             cmd += ["--tags", ",".join(tags)]
         return self.run_timed("create-cluster", cmd)
 
+    @save_meta
     def down(self, setup, experiment=None):
         """
         Destroy a cluster

diff --git a/fluxcloud/main/decorator.py b/fluxcloud/main/decorator.py
@@ -7,18 +7,47 @@
 from functools import partial, update_wrapper
 
 
-class timed:
-    """
-    Time the length of the run, add to times
-    """
-
+class Decorator:
     def __init__(self, func):
         update_wrapper(self, func)
         self.func = func
 
     def __get__(self, obj, objtype):
         return partial(self.__call__, obj)
 
+
+class save_meta(Decorator):
+    """
+    Call to save metadata on the class with setup and experiment
+    """
+
+    def __call__(self, cls, *args, **kwargs):
+
+        # Name of the key is after command
+        idx = 0
+        if "setup" in kwargs:
+            setup = kwargs["setup"]
+        else:
+            setup = args[idx]
+            idx += 1
+
+        # experiment is either the second argument or a kwarg
+        if "experiment" in kwargs:
+            experiment = kwargs["experiment"]
+        else:
+            experiment = args[idx]
+
+        res = self.func(cls, *args, **kwargs)
+        experiment = experiment or setup.get_single_experiment()
+        cls.save_experiment_metadata(setup, experiment)
+        return res
+
+
+class timed(Decorator):
+    """
+    Time the length of the run, add to times
+    """
+
     def __call__(self, cls, *args, **kwargs):
 
         # Name of the key is after command
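
The shared `Decorator` base works on methods because `__get__` makes each decorator instance a descriptor that binds the owning object as the first argument to `__call__`. A self-contained sketch of the same pattern follows. The base class mirrors the diff; `record_call`, `Client`, and `after_call` are hypothetical names standing in for `save_meta`, the experiment client, and `save_experiment_metadata`.

```python
from functools import partial, update_wrapper


class Decorator:
    def __init__(self, func):
        update_wrapper(self, func)
        self.func = func

    def __get__(self, obj, objtype=None):
        # Bind the instance so the decorated attribute behaves like a method
        return partial(self.__call__, obj)


class record_call(Decorator):
    """Hypothetical decorator: run the method, then call a hook on the instance."""

    def __call__(self, instance, *args, **kwargs):
        result = self.func(instance, *args, **kwargs)
        instance.after_call(self.func.__name__)
        return result


class Client:
    def __init__(self):
        self.calls = []

    def after_call(self, name):
        self.calls.append(name)

    @record_call
    def up(self, size):
        return f"cluster of size {size}"


client = Client()
message = client.up(4)
print(message, client.calls)
```

Because the hook runs after `self.func` returns, it is skipped when the wrapped method raises, so a post-call step such as writing metadata would only happen on success in this arrangement.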

diff --git a/fluxcloud/main/schemas.py b/fluxcloud/main/schemas.py
@@ -15,7 +15,7 @@
 keyvals = {
     "type": "object",
     "patternProperties": {
-        "\\w[\\w-]*": {"type": "string"},
+        "\\w[\\w-]*": {"type": ["string", "number", "integer", "array", "boolean"]},
     },
 }