Add the documentation for hyperparameter tuning
tautomer committed Sep 18, 2024
1 parent 8e99f0e commit 541028e
Showing 2 changed files with 358 additions and 0 deletions.
docs/source/examples/hyperopt.rst (357 additions, 0 deletions)
@@ -0,0 +1,357 @@
Hyperparameter optimization with Ax and Ray
===========================================

Here is an example of how you can perform hyperparameter optimization
sequentially (with Ax) or in parallel (with Ax and Ray).

Prerequisites
--------------

The packages required to perform this task are `Ax`_ and `ray`_::

conda install -c conda-forge "ray < 2.7.0"
pip install ax-platform

.. note::
The scripts have been tested with `ax-platform 0.3.1` and `ray 2.3.0`, and
some previous versions of the two packages. Unfortunately, several changes
made in recent versions of `ray` will break this script. You should install
`ray < 2.7.0`. ``pip install`` is recommended by the Ax developers even if
a conda environment is used.

.. note::
If you can update this example and scripts to accommodate the changes in the
latest Ray package, feel free to submit a pull request.

How it works
--------------

Ax is a package that can perform Bayesian optimization. Given the parameter
ranges, a set of initial trials is generated. Then, based on the metrics
returned from these trials, new trial parameters are generated. By default,
this Ax workflow can only run sequentially. We can combine Ray and Ax to
utilize multiple GPUs on the same node. Ray interfaces with Ax to pull trial
parameters and then automatically distributes the trials to the available
resources. With this, we can perform asynchronous, parallelized hyperparameter
optimization.


Ax experiments
^^^^^^^^^^^^^^

You can create a basic Ax experiment this way::

from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
name="hyper_opt",
parameters=[
{
"name": "parameter_a",
"type": "fixed",
"value_type": "float",
"value": 0.6,
},
{
"name": "parameter_b",
"type": "range",
"value_type": "int",
"bounds": [20, 40],
},
{
"name": "parameter_c",
"type": "range",
"value_type": "float",
"bounds": [30.0, 60.0],
},
{
"name": "parameter_d",
"type": "range",
"value_type": "float",
"bounds": [0.001, 1],
"log_scale": True,
},
],
objectives={
"Metric": ObjectiveProperties(minimize=True),
},
parameter_constraints=[
"parameter_b <= parameter_c",
],
)

Here we create an Ax experiment called "hyper_opt", with 4 parameters,
`parameter_a`, `parameter_b`, `parameter_c`, and `parameter_d`. Our goal is to
minimize a metric called "Metric".

A few crucial things to note:

* You can give a range or a fixed value to each parameter. You might want to
specify the data type as well. A fixed parameter makes sense here because you
can run the optimization on a subset of the parameters without having to
modify your training function.
* Constraints can be applied to the search space as the example shows, but
there is no easy way to achieve a constraint that contains mathematical
expressions (for example, `parameter_a < 2 * parameter_b`). A possible
workaround is sketched below.
* For each trial, Ax will generate a dictionary as the input to the
training function. The dictionary will look like::

{
"parameter_a": 0.6,
"parameter_b": 30,
"parameter_c": 35.0,
"parameter_d": 0.2
}

As such, the training function must be able to accept these parameters
(either as a single dictionary or as keyword arguments) and use them to set
up the training.
* The `objectives` keyword argument takes a dictionary of variables. The keys of
the dictionary **MUST** exist in the dictionary returned from the training
function. In this example, the training function must return a dictionary
like::

return {
...
"Metric": metric,
...
}

The above two points will become clearer when we go through the training
function.
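
As noted above, constraints containing mathematical expressions are not
directly supported. One common workaround is to reparameterize the search
space so the constraint holds by construction. Below is a minimal sketch;
the `ratio_a` parameter, its bounds, and the experiment name are made up for
illustration::

# Hypothetical reparameterization: search a ratio instead of parameter_a
# itself, so that parameter_a < 2 * parameter_b always holds.
ax_client.create_experiment(
    name="hyper_opt_ratio",
    parameters=[
        {"name": "ratio_a", "type": "range", "value_type": "float", "bounds": [0.1, 1.9]},
        {"name": "parameter_b", "type": "range", "value_type": "int", "bounds": [20, 40]},
    ],
    objectives={"Metric": ObjectiveProperties(minimize=True)},
)

def training(ratio_a, parameter_b):
    # derive the dependent parameter inside the training function
    parameter_a = ratio_a * parameter_b
    ...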

Training function
^^^^^^^^^^^^^^^^^

You only need a minimal change to your existing training script to use it with
Ax. In most cases, you just have to wrap the whole script into a function::

def training(parameter_a, parameter_b, parameter_c, parameter_d):
# set up the network with the parameters
...
network_params = {
...
"parameter_a": parameter_a,
...
}
network = networks.Hipnn(
"hipnn_model", (species, positions), module_kwargs=network_params
)
# train the network
# `metric_tracker` contains the losses from HIPPYNN
metric_tracker = train_model(
training_modules,
database,
controller,
metric_tracker,
callbacks=None,
batch_callbacks=None,
)
# return the desired metric to Ax, for example, validation loss
return {
"Metric": metric_tracker.best_metric_values["valid"]["Loss"]
}

Note how we can utilize the parameters passed in and return **Metric** at the
end.
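
Before handing the function to Ax, it can be worth calling it once by hand
with a parameter dictionary of the same shape Ax will generate. This is only
a sanity check, not part of the Ax API, and the values below are made up for
illustration::

# hypothetical smoke test of the training function
params = {"parameter_a": 0.6, "parameter_b": 30, "parameter_c": 35.0, "parameter_d": 0.2}
result = training(**params)
# the key must match the name used in the `objectives` argument of `create_experiment`
assert "Metric" in result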

.. _run-sequential-experiments:

Run sequential experiments
^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, we can run the experiments::

for k in range(30):
    parameters, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(trial_index=trial_index, raw_data=training(**parameters))
    # save the experiment to a JSON file after every trial
    ax_client.save_to_json_file(filepath="hyperopt.json")
    data_frame = ax_client.get_trials_data_frame().sort_values("Metric")
    data_frame.to_csv("hyperopt.csv", header=True)

Here we run 30 trials and save the results into a JSON file and a CSV file.
The JSON file contains all the details of the trials and can be used to
restart the experiments or to add additional experiments. As it contains too
many details to be human-friendly, we also save a CSV file that only contains
the trial indices, parameters, and metrics.
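
To restart or extend a finished search later, the saved JSON file can be
loaded back into a new client. A minimal sketch, assuming the file name used
above::

from ax.service.ax_client import AxClient

# restore the client together with all previously completed trials
ax_client = AxClient.load_from_json_file(filepath="hyperopt.json")
# new trials will take the existing results into account
parameters, trial_index = ax_client.get_next_trial()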

Asynchronous parallelized optimization with Ray
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To use Ray to distribute the trials across GPUs in parallel, a small update is
needed in the training function::

from ray.air import session


def training(parameter_a, parameter_b, parameter_c, parameter_d):
# setup and train are the same
....
# instead of return, we use `session.report` to communicate with `ray`
session.report(
{
"Metric": metric_tracker.best_metric_values["valid"]["Loss"]
}
)

Instead of a simple `return`, we need the `report` method from `ray.air.session`
to report the final metric to `ray`.
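
Note that Ray Tune calls a function trainable with a single configuration
dictionary. If your training function is written with individual keyword
arguments as above, a thin wrapper along these lines can be used as the
trainable (a sketch; the name `training_wrapper` is made up)::

def training_wrapper(config: dict):
    # Ray passes the Ax-generated parameters as one dictionary;
    # unpack it into the keyword-argument form used above
    training(**config)

The wrapper, rather than `training` itself, would then be handed to
`tune.with_resources` below.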

Also, to run the trials, instead of the loop in :ref:`run-sequential-experiments`,
we have to use the interfaces between the two packages provided by `ray`::

import ray
from ray import air, tune
from ray.tune.experiment.trial import Trial
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.ax import AxSearch

# to make sure ray loads local packages correctly
ray.init(runtime_env={"working_dir": "."})

algo = AxSearch(ax_client=ax_client)
# 4 GPUs available
algo = ConcurrencyLimiter(algo, max_concurrent=4)
tuner = tune.Tuner(
    # assign 1 GPU to each trial
    tune.with_resources(training, resources={"gpu": 1}),
    # run 10 trials
    tune_config=tune.TuneConfig(search_alg=algo, num_samples=10),
    # configuration of ray
    run_config=air.RunConfig(
        # all results will be saved in a subfolder inside the "test" folder
        # of the current working directory
        local_dir="./test",
        verbose=0,
        log_to_file=True,
    ),
)
# run the trials
tuner.fit()
# save the results at the end
# to save the files after each trial, a callback is needed
# see advanced details
ax_client.save_to_json_file(filepath="hyperopt.json")
data_frame = ax_client.get_trials_data_frame().sort_values("Metric")
data_frame.to_csv("hyperopt.csv", header=True)

This is all you need. The results will be saved under
`./test/{training_function_name}_{timestamp}`. Each trial will be saved within
a subfolder named
`{training_function_name}_{random_id}_{index}_{truncated_parameters}`.
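
After the search finishes, the best configuration can also be queried directly
from the Ax client instead of reading the CSV file. A minimal sketch using the
Ax service API::

best_parameters, values = ax_client.get_best_parameters()
print("Best parameters:", best_parameters)
# `values` contains the model's estimate of the objective at that point
means, covariances = values
print("Best objective value:", means)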

Advanced details
^^^^^^^^^^^^^^^^

Relative import
""""

If you save the training function in a separate file and import it into the
Ray script, one line has to be added before the trials start::

ray.init(runtime_env={"working_dir": "."})

assuming the current directory (".") contains the training script and the Ray script.
Without this line, Ray will NOT be able to find the training script and import
the training function.

Callbacks for Ray
""""

When running `ray.tune`, a set of callback functions can be called during the
process. Ray has `documentation`_ on these callback functions, and you can
build your own for your convenience. Here is a callback class that saves the
JSON and CSV files at the end of each trial and handles failed trials, which
should cover the most basic functionality::

import shutil

import numpy as np
from ax.core.trial import Trial as AXTrial
from ax.service.ax_client import AxClient
from ray.tune.experiment.trial import Trial
from ray.tune.logger import JsonLoggerCallback, LoggerCallback


class AxLogger(LoggerCallback):
    def __init__(self, ax_client: AxClient, json_name: str, csv_name: str):
        """
        A logger callback to save the progress to a json file after every trial ends.
        Similar to running `ax_client.save_to_json_file` every iteration in sequential
        searches.

        Args:
            ax_client (AxClient): ax client to save
            json_name (str): name for the json file. Prepend a path if you want to save the \
                json file somewhere other than cwd.
            csv_name (str): name for the csv file. Prepend a path if you want to save the \
                csv file somewhere other than cwd.
        """
        self.ax_client = ax_client
        self.json = json_name
        self.csv = csv_name

    def log_trial_end(
        self, trial: Trial, id: int, metric: float, runtime: int, failed: bool = False
    ):
        self.ax_client.save_to_json_file(filepath=self.json)
        shutil.copy(self.json, f"{trial.local_dir}/{self.json}")
        try:
            data_frame = self.ax_client.get_trials_data_frame().sort_values("Metric")
            data_frame.to_csv(self.csv, header=True)
        except KeyError:
            pass
        shutil.copy(self.csv, f"{trial.local_dir}/{self.csv}")
        if failed:
            status = "failed"
        else:
            status = "finished"
        print(
            f"AX trial {id} {status}. Final loss: {metric}. Time taken"
            f" {runtime} seconds. Location directory: {trial.logdir}."
        )

    def on_trial_error(self, iteration: int, trials: list[Trial], trial: Trial, **info):
        id = int(trial.experiment_tag.split("_")[0]) - 1
        ax_trial = self.ax_client.get_trial(id)
        ax_trial.mark_abandoned(reason="Error encountered")
        self.log_trial_end(
            trial, id + 1, "not available", self.calculate_runtime(ax_trial), True
        )

    def on_trial_complete(
        self, iteration: int, trials: list[Trial], trial: Trial, **info
    ):
        # trial.trial_id is the random id generated by ray, not ax
        # the default experiment_tag starts with ax's trial index,
        # but this workaround is fragile, as users can
        # customize the tag or folder name
        id = int(trial.experiment_tag.split("_")[0]) - 1
        ax_trial = self.ax_client.get_trial(id)
        failed = False
        try:
            loss = ax_trial.objective_mean
        except ValueError:
            failed = True
            loss = "not available"
        else:
            if np.isnan(loss) or np.isinf(loss):
                failed = True
                loss = "not available"
        if failed:
            ax_trial.mark_failed()
        self.log_trial_end(
            trial, id + 1, loss, self.calculate_runtime(ax_trial), failed
        )

    @classmethod
    def calculate_runtime(cls, trial: AXTrial):
        delta = trial.time_completed - trial.time_run_started
        return int(delta.total_seconds())

To use callback functions, simply add them to `air.RunConfig`::

ax_logger = AxLogger(ax_client, "hyperopt_ray.json", "hyperopt.csv")
run_config=air.RunConfig(
local_dir="./test",
verbose=0,
callbacks=[ax_logger, JsonLoggerCallback()],
log_to_file=True,
)

A full example script is provided in the examples (WIP).

.. _ray: https://docs.ray.io/en/latest/
.. _Ax: https://github.com/facebook/Ax
.. _documentation: https://docs.ray.io/en/latest/tune/tutorials/tune-metrics.html
docs/source/examples/index.rst (1 addition, 0 deletions)
@@ -19,4 +19,5 @@ the examples are just snippets. For fully-fledged examples see the
ase_calculator
mliap_unified
excited_states
hyperopt
