Add the documentation for hyperparameter tuning
tautomer committed Sep 18, 2024
1 parent 8e99f0e commit 541028e
Showing 2 changed files with 358 additions and 0 deletions.
docs/source/examples/hyperopt.rst (357 additions, 0 deletions)
@@ -0,0 +1,357 @@
Hyperparameter optimization with Ax and Ray
===========================================

Here is an example of how you can perform hyperparameter optimization
sequentially (with Ax) or in parallel (with Ax and Ray).

Prerequisites
--------------

The packages required to perform this task are `Ax`_ and `ray`_::

conda install -c conda-forge "ray < 2.7.0"
pip install ax-platform

.. note::
The scripts have been tested with `ax-platform 0.3.1` and `ray 2.3.0`, and
some previous versions of the two packages. Unfortunately, several changes
made in recent versions of `ray` will break this script. You should install
`ray < 2.7.0`. ``pip install`` is recommended by the Ax developers even if
a conda environment is used.

.. note::
If you can update this example and scripts to accommodate the changes in the
latest Ray package, feel free to submit a pull request.

How it works
--------------

Ax is a package that can perform Bayesian optimization. Given the parameter
ranges, a set of initial trials is generated. Then, based on the metrics
returned from these trials, new trial parameters are generated. By default,
this Ax workflow can only run sequentially. We can combine Ray and Ax to
utilize multiple GPUs on the same node. Ray interfaces with Ax to pull trial
parameters and then automatically distributes the trials to the available
resources. With this, we can perform asynchronous, parallelized hyperparameter
optimization.


Ax experiments
^^^^^^^^^^^^^^

You can create a basic Ax experiment this way::

from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
name="hyper_opt",
parameters=[
{
"name": "parameter_a",
"type": "fixed",
"value_type": "float",
"value": 0.6,
},
{
"name": "parameter_b",
"type": "range",
"value_type": "int",
"bounds": [20, 40],
},
{
"name": "parameter_c",
"type": "range",
"value_type": "float",
"bounds": [30.0, 60.0],
},
{
"name": "parameter_d",
"type": "range",
"value_type": "float",
"bounds": [0.001, 1],
"log_scale": True,
},
],
objectives={
"Metric": ObjectiveProperties(minimize=True),
},
parameter_constraints=[
"parameter_b <= parameter_c",
],
)

Here we create an Ax experiment called "hyper_opt", with 4 parameters,
`parameter_a`, `parameter_b`, `parameter_c`, and `parameter_d`. Our goal is to
minimize a metric called "Metric".

A few crucial things to note:

* You can give a range or a fixed value to each parameter. You might want to
specify the data type as well. A fixed parameter makes sense here because you
can run the optimization on a subset of the parameters without having to
modify your training function.
* Constraints can be applied to the search space as the example shows, but
there is no easy way to achieve a constraint that contains mathematical
expressions (for example, `parameter_a < 2 * parameter_b`). A possible
workaround is sketched below.
* For each trial, Ax will generate a dictionary as the input to the
training function. The dictionary will look like::

{
"parameter_a": 0.6,
"parameter_b": 30,
"parameter_c": 35.0,
"parameter_d": 0.2
}

As such, the training function must be able to accept these parameters
(either as a single dictionary or as keyword arguments) and use them to set
up the training.
* The `objectives` keyword argument takes a dictionary of variables. The keys of
the dictionary **MUST** exist in the dictionary returned from the training
function. In this example, the training function must return a dictionary
like::

return {
...
"Metric": metric,
...
}

The above two points will become clearer when we go through the training
function.
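
As noted above, constraints containing mathematical expressions are not
directly supported. One common workaround is to reparameterize the search
space so the constraint holds by construction. Below is a minimal sketch;
the `ratio_a` parameter, its bounds, and the experiment name are made up for
illustration::

# Hypothetical reparameterization: search a ratio instead of parameter_a
# itself, so that parameter_a < 2 * parameter_b always holds.
ax_client.create_experiment(
    name="hyper_opt_ratio",
    parameters=[
        {"name": "ratio_a", "type": "range", "value_type": "float", "bounds": [0.1, 1.9]},
        {"name": "parameter_b", "type": "range", "value_type": "int", "bounds": [20, 40]},
    ],
    objectives={"Metric": ObjectiveProperties(minimize=True)},
)

def training(ratio_a, parameter_b):
    # derive the dependent parameter inside the training function
    parameter_a = ratio_a * parameter_b
    ...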

Training function
^^^^^^^^^^^^^^^^^

You only need a minimal change to your existing training script to use it with
Ax. In most cases, you just have to wrap the whole script into a function::

def training(parameter_a, parameter_b, parameter_c, parameter_d):
# set up the network with the parameters
...
network_params = {
...
"parameter_a": parameter_a,
...
}
network = networks.Hipnn(
"hipnn_model", (species, positions), module_kwargs=network_params
)
# train the network
# `metric_tracker` contains the losses from HIPPYNN
metric_tracker = train_model(
training_modules,
database,
controller,
metric_tracker,
callbacks=None,
batch_callbacks=None,
)
# return the desired metric to Ax, for example, validation loss
return {
"Metric": metric_tracker.best_metric_values["valid"]["Loss"]
}

Note how we can utilize the parameters passed in and return **Metric** at the
end.
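
Before handing the function to Ax, it can be worth calling it once by hand
with a parameter dictionary of the same shape Ax will generate. This is only
a sanity check, not part of the Ax API, and the values below are made up for
illustration::

# hypothetical smoke test of the training function
params = {"parameter_a": 0.6, "parameter_b": 30, "parameter_c": 35.0, "parameter_d": 0.2}
result = training(**params)
# the key must match the name used in the `objectives` argument of `create_experiment`
assert "Metric" in result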

.. _run-sequential-experiments:

Run sequential experiments
^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, we can run the experiments::

for k in range(30):
    parameters, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(trial_index=trial_index, raw_data=training(**parameters))
    # save the experiment to a JSON file after every trial
    ax_client.save_to_json_file(filepath="hyperopt.json")
    data_frame = ax_client.get_trials_data_frame().sort_values("Metric")
    data_frame.to_csv("hyperopt.csv", header=True)

Here we run 30 trials and save the results into a JSON file and a CSV file.
The JSON file contains all the details of the trials and can be used to
restart the experiments or to add additional experiments. As it contains too
many details to be human-friendly, we also save a CSV file that only contains
the trial indices, parameters, and metrics.
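
To restart or extend a finished search later, the saved JSON file can be
loaded back into a new client. A minimal sketch, assuming the file name used
above::

from ax.service.ax_client import AxClient

# restore the client together with all previously completed trials
ax_client = AxClient.load_from_json_file(filepath="hyperopt.json")
# new trials will take the existing results into account
parameters, trial_index = ax_client.get_next_trial()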

Asynchronous parallelized optimization with Ray
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To use Ray to distribute the trials across GPUs in parallel, a small update is
needed in the training function::

from ray.air import session


def training(parameter_a, parameter_b, parameter_c, parameter_d):
# setup and train are the same
....
# instead of return, we use `session.report` to communicate with `ray`
session.report(
{
"Metric": metric_tracker.best_metric_values["valid"]["Loss"]
}
)

Instead of a simple `return`, we need the `report` method from `ray.air.session`
to report the final metric to `ray`.
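
Note that Ray Tune calls a function trainable with a single configuration
dictionary. If your training function is written with individual keyword
arguments as above, a thin wrapper along these lines can be used as the
trainable (a sketch; the name `training_wrapper` is made up)::

def training_wrapper(config: dict):
    # Ray passes the Ax-generated parameters as one dictionary;
    # unpack it into the keyword-argument form used above
    training(**config)

The wrapper, rather than `training` itself, would then be handed to
`tune.with_resources` below.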

Also, to run the trials, instead of the loop in :ref:`run-sequential-experiments`,
we have to use the interfaces between the two packages provided by `ray`::

import ray
from ray import air, tune
from ray.tune.experiment.trial import Trial
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.ax import AxSearch

# to make sure ray loads local packages correctly
ray.init(runtime_env={"working_dir": "."})

algo = AxSearch(ax_client=ax_client)
# 4 GPUs available
algo = ConcurrencyLimiter(algo, max_concurrent=4)
tuner = tune.Tuner(
    # assign 1 GPU to each trial
    tune.with_resources(training, resources={"gpu": 1}),
    # run 10 trials
    tune_config=tune.TuneConfig(search_alg=algo, num_samples=10),
    # configuration of ray
    run_config=air.RunConfig(
        # all results will be saved in a subfolder inside the "test" folder
        # of the current working directory
        local_dir="./test",
        verbose=0,
        log_to_file=True,
    ),
)
# run the trials
tuner.fit()
# save the results at the end
# to save the files after each trial, a callback is needed
# see advanced details
ax_client.save_to_json_file(filepath="hyperopt.json")
data_frame = ax_client.get_trials_data_frame().sort_values("Metric")
data_frame.to_csv("hyperopt.csv", header=True)

This is all you need. The results will be saved under
`./test/{training_function_name}_{timestamp}`. Each trial will be saved within
a subfolder named
`{training_function_name}_{random_id}_{index}_{truncated_parameters}`.
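
After the search finishes, the best configuration can also be queried directly
from the Ax client instead of reading the CSV file. A minimal sketch using the
Ax service API::

best_parameters, values = ax_client.get_best_parameters()
print("Best parameters:", best_parameters)
# `values` contains the model's estimate of the objective at that point
means, covariances = values
print("Best objective value:", means)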

Advanced details
^^^^^^^^^^^^^^^^

Relative import
""""

If you save the training function in a separate file and import it into the
Ray script, one line has to be added before the trials start::

ray.init(runtime_env={"working_dir": "."})

assuming the current directory (".") contains the training script and the Ray script.
Without this line, Ray will NOT be able to find the training script and import
the training function.

Callbacks for Ray
""""

When running `ray.tune`, a set of callback functions can be called during the
process. Ray has `documentation`_ on these callback functions, and you can
build your own for your convenience. Here is a callback class that saves the
JSON and CSV files at the end of each trial and handles failed trials, which
should cover the most basic functionality::

import shutil

import numpy as np
from ax.core.trial import Trial as AXTrial
from ax.service.ax_client import AxClient
from ray.tune.experiment.trial import Trial
from ray.tune.logger import JsonLoggerCallback, LoggerCallback


class AxLogger(LoggerCallback):
    def __init__(self, ax_client: AxClient, json_name: str, csv_name: str):
        """
        A logger callback to save the progress to a json file after every trial ends.
        Similar to running `ax_client.save_to_json_file` every iteration in sequential
        searches.

        Args:
            ax_client (AxClient): ax client to save
            json_name (str): name for the json file. Prepend a path if you want to save the \
                json file somewhere other than cwd.
            csv_name (str): name for the csv file. Prepend a path if you want to save the \
                csv file somewhere other than cwd.
        """
        self.ax_client = ax_client
        self.json = json_name
        self.csv = csv_name

    def log_trial_end(
        self, trial: Trial, id: int, metric: float, runtime: int, failed: bool = False
    ):
        self.ax_client.save_to_json_file(filepath=self.json)
        shutil.copy(self.json, f"{trial.local_dir}/{self.json}")
        try:
            data_frame = self.ax_client.get_trials_data_frame().sort_values("Metric")
            data_frame.to_csv(self.csv, header=True)
        except KeyError:
            pass
        shutil.copy(self.csv, f"{trial.local_dir}/{self.csv}")
        if failed:
            status = "failed"
        else:
            status = "finished"
        print(
            f"AX trial {id} {status}. Final loss: {metric}. Time taken"
            f" {runtime} seconds. Location directory: {trial.logdir}."
        )

    def on_trial_error(self, iteration: int, trials: list[Trial], trial: Trial, **info):
        id = int(trial.experiment_tag.split("_")[0]) - 1
        ax_trial = self.ax_client.get_trial(id)
        ax_trial.mark_abandoned(reason="Error encountered")
        self.log_trial_end(
            trial, id + 1, "not available", self.calculate_runtime(ax_trial), True
        )

    def on_trial_complete(
        self, iteration: int, trials: list[Trial], trial: Trial, **info
    ):
        # trial.trial_id is the random id generated by ray, not ax
        # the default experiment_tag starts with ax's trial index,
        # but this workaround is fragile, as users can
        # customize the tag or folder name
        id = int(trial.experiment_tag.split("_")[0]) - 1
        ax_trial = self.ax_client.get_trial(id)
        failed = False
        try:
            loss = ax_trial.objective_mean
        except ValueError:
            failed = True
            loss = "not available"
        else:
            if np.isnan(loss) or np.isinf(loss):
                failed = True
                loss = "not available"
        if failed:
            ax_trial.mark_failed()
        self.log_trial_end(
            trial, id + 1, loss, self.calculate_runtime(ax_trial), failed
        )

    @classmethod
    def calculate_runtime(cls, trial: AXTrial):
        delta = trial.time_completed - trial.time_run_started
        return int(delta.total_seconds())

To use callback functions, simply add them to `air.RunConfig`::

ax_logger = AxLogger(ax_client, "hyperopt_ray.json", "hyperopt.csv")
run_config=air.RunConfig(
local_dir="./test",
verbose=0,
callbacks=[ax_logger, JsonLoggerCallback()],
log_to_file=True,
)

A full example script is provided in the examples (WIP).

.. _ray: https://docs.ray.io/en/latest/
.. _Ax: https://github.com/facebook/Ax
.. _documentation: https://docs.ray.io/en/latest/tune/tutorials/tune-metrics.html
docs/source/examples/index.rst (1 addition, 0 deletions)
@@ -19,4 +19,5 @@ the examples are just snippets. For fully-fledged examples see the
ase_calculator
mliap_unified
excited_states
hyperopt
