Hyperparameter optimization with Ax and Ray
===========================================

Here is an example of how you can perform hyperparameter optimization
sequentially (with Ax) or in parallel (with Ax and Ray).

Prerequisites
-------------

The packages required for this task are `Ax`_ and `ray`_::

    conda install -c conda-forge "ray < 2.7.0"
    pip install ax-platform

.. note::
    The scripts have been tested with `ax-platform 0.3.1` and `ray 2.3.0`, as
    well as some earlier versions of the two packages. Unfortunately, several
    changes made in recent versions of `ray` break this script, so you should
    install `ray < 2.7.0`. ``pip install`` is recommended by the Ax developers
    even if a conda environment is used.

.. note::
    If you can update this example and its scripts to accommodate the changes
    in the latest Ray package, feel free to submit a pull request.

How it works
------------

Ax is a package that performs Bayesian optimization. Given the parameter
ranges, a set of initial trials is generated. Then, based on the metrics
returned from these trials, new trial parameters are generated. By default,
this Ax workflow can only be performed sequentially. We can combine Ray and Ax
to utilize multiple GPUs on the same node: Ray interfaces with Ax to pull
trial parameters and then automatically distributes the trials to available
resources. With this, we can perform asynchronous parallelized hyperparameter
optimization.
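
Conceptually, each optimization step is an "ask and tell" exchange with the Ax
client. A minimal sketch of a single step (`evaluate` is a hypothetical
stand-in for your own training code)::

    # ask Ax for the next set of parameters to try
    parameters, trial_index = ax_client.get_next_trial()
    # run the training and compute the metric
    metric = evaluate(parameters)
    # tell Ax the result so it can propose better parameters
    ax_client.complete_trial(trial_index=trial_index, raw_data={"Metric": metric})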

Ax experiments
^^^^^^^^^^^^^^

You can create a basic Ax experiment this way::

    from ax.service.ax_client import AxClient
    from ax.service.utils.instantiation import ObjectiveProperties

    ax_client = AxClient()
    ax_client.create_experiment(
        name="hyper_opt",
        parameters=[
            {
                "name": "parameter_a",
                "type": "fixed",
                "value_type": "float",
                "value": 0.6,
            },
            {
                "name": "parameter_b",
                "type": "range",
                "value_type": "int",
                "bounds": [20, 40],
            },
            {
                "name": "parameter_c",
                "type": "range",
                "value_type": "float",
                "bounds": [30.0, 60.0],
            },
            {
                "name": "parameter_d",
                "type": "range",
                "value_type": "float",
                "bounds": [0.001, 1],
                "log_scale": True,
            },
        ],
        objectives={
            "Metric": ObjectiveProperties(minimize=True),
        },
        parameter_constraints=[
            "parameter_b <= parameter_c",
        ],
    )

Here we create an Ax experiment called "hyper_opt" with 4 parameters:
`parameter_a`, `parameter_b`, `parameter_c`, and `parameter_d`. Our goal is to
minimize a metric called "Metric".

A few crucial things to note:

* You can give each parameter a range or a fixed value, and you may want to
  specify the value type as well. A fixed parameter is useful because it lets
  you optimize over a subset of the parameters without modifying your training
  function.
* Constraints can be applied to the search space as the example shows, but
  there is no easy way to express a constraint that contains mathematical
  expressions (for example, `parameter_a < 2 * parameter_b`).
* For each trial, Ax will generate a dictionary as the input of the training
  function. The dictionary will look like::

      {
          "parameter_a": 0.6,
          "parameter_b": 30,
          "parameter_c": 35.0,
          "parameter_d": 0.2
      }

  As such, the training function must be able to take a dictionary as input
  (either as a single dictionary or as keyword arguments) and use these values
  to set up the training; see the sketch after this list.
* The `objectives` keyword argument takes a dictionary of variables. The keys
  of this dictionary **MUST** exist in the dictionary returned from the
  training function. In this example, the training function must return a
  dictionary like::

      return {
          ...
          "Metric": metric,
          ...
      }

The last two points will become clearer when we go through the training
function.
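
As a sketch of the single-dictionary style, the training function could look
like this (`run_training` is a hypothetical stand-in for your own setup and
training code)::

    def training(parameters):
        # Ax passes all generated values in one dictionary
        parameter_a = parameters["parameter_a"]
        parameter_b = parameters["parameter_b"]
        metric = run_training(parameter_a, parameter_b)
        # the key "Metric" must match the key used in `objectives`
        return {"Metric": metric}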

Training function
^^^^^^^^^^^^^^^^^

You only need minimal changes to your existing training script to use it with
Ax. In most cases, you just have to wrap the whole script into a function::

    def training(parameter_a, parameter_b, parameter_c, parameter_d):
        # set up the network with the parameters
        ...
        network_params = {
            ...
            "parameter_a": parameter_a,
            ...
        }
        network = networks.Hipnn(
            "hipnn_model", (species, positions), module_kwargs=network_params
        )
        # train the network
        # `metric_tracker` contains the losses from HIPPYNN
        metric_tracker = train_model(
            training_modules,
            database,
            controller,
            metric_tracker,
            callbacks=None,
            batch_callbacks=None,
        )
        # return the desired metric to Ax, for example, the validation loss
        return {
            "Metric": metric_tracker.best_metric_values["valid"]["Loss"]
        }

Note how we utilize the parameters passed in and return **Metric** at the end.

.. _run-sequential-experiments:

Run sequential experiments
^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, we can run the experiments::

    for k in range(30):
        parameters, trial_index = ax_client.get_next_trial()
        ax_client.complete_trial(
            trial_index=trial_index, raw_data=training(**parameters)
        )
        # save the experiment to a JSON file after every trial
        ax_client.save_to_json_file(filepath="hyperopt.json")
        data_frame = ax_client.get_trials_data_frame().sort_values("Metric")
        data_frame.to_csv("hyperopt.csv", header=True)

Here we run 30 trials, and the results are saved into a JSON file and a CSV
file. The JSON file contains all the details of the trials, which can be used
to restart the experiments or to add additional experiments, as the sketch
below shows. As it contains too many details to be human-friendly, we also
save a more human-friendly CSV file that only contains the trial indices,
parameters, and metrics.
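
For example, restarting from the saved state and adding more trials could look
like this (a minimal sketch; the file name matches the one saved above)::

    from ax.service.ax_client import AxClient

    # reload the experiment state saved by `save_to_json_file`
    ax_client = AxClient.load_from_json_file(filepath="hyperopt.json")
    # run 10 additional trials on top of the restored ones
    for k in range(10):
        parameters, trial_index = ax_client.get_next_trial()
        ax_client.complete_trial(
            trial_index=trial_index, raw_data=training(**parameters)
        )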

Asynchronous parallelized optimization with Ray
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To use Ray to distribute the trials across GPUs in parallel, a small update to
the training function is needed::

    from ray.air import session


    def training(parameters):
        # Ray calls the trainable with a single dictionary of parameters
        parameter_a = parameters["parameter_a"]
        ...
        # the setup and training code stays the same
        ...
        # instead of return, we use `session.report` to communicate with `ray`
        session.report(
            {
                "Metric": metric_tracker.best_metric_values["valid"]["Loss"]
            }
        )

Instead of a simple `return`, we need the `report` method from
`ray.air.session` to report the final metric to `ray`. Note also that Ray
passes the trial parameters as a single dictionary, so the training function
must take the single-dictionary form here.

Also, to run the trials, instead of the loop in
:ref:`run-sequential-experiments`, we have to use the interfaces between the
two packages provided by `ray`::

    import ray
    from ray import air, tune
    from ray.tune.search import ConcurrencyLimiter
    from ray.tune.search.ax import AxSearch

    # to make sure ray loads local packages correctly
    ray.init(runtime_env={"working_dir": "."})

    algo = AxSearch(ax_client=ax_client)
    # 4 GPUs available
    algo = ConcurrencyLimiter(algo, max_concurrent=4)
    tuner = tune.Tuner(
        # assign 1 GPU to each trial
        tune.with_resources(training, resources={"gpu": 1}),
        # run 10 trials
        tune_config=tune.TuneConfig(search_alg=algo, num_samples=10),
        # configuration of ray
        run_config=air.RunConfig(
            # all results will be saved in a subfolder inside the "test"
            # folder of the current working directory
            local_dir="./test",
            verbose=0,
            log_to_file=True,
        ),
    )
    # run the trials
    tuner.fit()
    # save the results at the end
    # to save the files after each trial, a callback is needed
    # (see the advanced details below)
    ax_client.save_to_json_file(filepath="hyperopt.json")
    data_frame = ax_client.get_trials_data_frame().sort_values("Metric")
    data_frame.to_csv("hyperopt.csv", header=True)

This is all you need. The results will be saved under
`./test/{trial_function_name}_{timestamp}`, and each trial will be saved
within a subfolder named
`{trial_function_name}_{random_id}_{index}_{truncated_parameters}`.

Advanced details
^^^^^^^^^^^^^^^^

Relative import
"""""""""""""""

If you save the training function in a separate file and import it into the
Ray script, one line has to be added before the trials start::

    ray.init(runtime_env={"working_dir": "."})

assuming the current directory (".") contains the training and Ray scripts.
Without this line, Ray will NOT be able to find the training script and import
the training function.
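
For instance, assuming a hypothetical layout where `training.py` defines the
`training` function and the Ray script lives next to it, the Ray script would
begin like this::

    # Ray script, located in the same directory as training.py
    import ray

    from training import training  # local import; needs the working_dir below

    ray.init(runtime_env={"working_dir": "."})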

Callbacks for Ray
^^^^^^^^^^^^^^^^^

When running `ray.tune`, a set of callback functions can be called during the
process. Ray has `documentation`_ on the callback functions, and you can build
your own for your convenience. However, here is a callback function that saves
the JSON and CSV files at the end of each trial and handles failed trials,
which should cover the most basic functionality::

    import shutil

    import numpy as np
    from ax.core.trial import Trial as AXTrial
    from ax.service.ax_client import AxClient
    from ray.tune.experiment.trial import Trial
    from ray.tune.logger import JsonLoggerCallback, LoggerCallback


    class AxLogger(LoggerCallback):
        def __init__(self, ax_client: AxClient, json_name: str, csv_name: str):
            """
            A logger callback to save the progress to a json file after every
            trial ends. Similar to running `ax_client.save_to_json_file` every
            iteration in sequential searches.

            Args:
                ax_client (AxClient): ax client to save
                json_name (str): name for the json file. Prepend a path if you
                    want to save the json file somewhere other than cwd.
                csv_name (str): name for the csv file. Prepend a path if you
                    want to save the csv file somewhere other than cwd.
            """
            self.ax_client = ax_client
            self.json = json_name
            self.csv = csv_name

        def log_trial_end(
            self, trial: Trial, id: int, metric: float, runtime: int, failed: bool = False
        ):
            # save the up-to-date experiment state and copy it into the trial folder
            self.ax_client.save_to_json_file(filepath=self.json)
            shutil.copy(self.json, f"{trial.local_dir}/{self.json}")
            try:
                data_frame = self.ax_client.get_trials_data_frame().sort_values("Metric")
                data_frame.to_csv(self.csv, header=True)
            except KeyError:
                pass
            shutil.copy(self.csv, f"{trial.local_dir}/{self.csv}")
            status = "failed" if failed else "finished"
            print(
                f"AX trial {id} {status}. Final loss: {metric}. Time taken"
                f" {runtime} seconds. Location directory: {trial.logdir}."
            )

        def on_trial_error(self, iteration: int, trials: list[Trial], trial: Trial, **info):
            id = int(trial.experiment_tag.split("_")[0]) - 1
            ax_trial = self.ax_client.get_trial(id)
            ax_trial.mark_abandoned(reason="Error encountered")
            self.log_trial_end(
                trial, id + 1, "not available", self.calculate_runtime(ax_trial), True
            )

        def on_trial_complete(
            self, iteration: int, trials: list[Trial], trial: Trial, **info
        ):
            # trial.trial_id is the random id generated by ray, not ax;
            # the default experiment_tag starts with ax's trial index,
            # but this workaround is fragile, as users can customize
            # the tag or the folder name
            id = int(trial.experiment_tag.split("_")[0]) - 1
            ax_trial = self.ax_client.get_trial(id)
            failed = False
            try:
                loss = ax_trial.objective_mean
            except ValueError:
                failed = True
                loss = "not available"
            else:
                if np.isnan(loss) or np.isinf(loss):
                    failed = True
                    loss = "not available"
            if failed:
                ax_trial.mark_failed()
            self.log_trial_end(
                trial, id + 1, loss, self.calculate_runtime(ax_trial), failed
            )

        @classmethod
        def calculate_runtime(cls, trial: AXTrial):
            delta = trial.time_completed - trial.time_run_started
            return int(delta.total_seconds())

To use the callback, simply create an instance and add it to `air.RunConfig`::

    ax_logger = AxLogger(ax_client, "hyperopt_ray.json", "hyperopt.csv")
    run_config = air.RunConfig(
        local_dir="./test",
        verbose=0,
        callbacks=[ax_logger, JsonLoggerCallback()],
        log_to_file=True,
    )

This `run_config` is then passed to `tune.Tuner` as shown earlier.

A full example script is provided in the examples (WIP).

.. _ray: https://docs.ray.io/en/latest/
.. _Ax: https://github.com/facebook/Ax
.. _documentation: https://docs.ray.io/en/latest/tune/tutorials/tune-metrics.html