diff --git a/docs/source/examples/hyperopt.rst b/docs/source/examples/hyperopt.rst
new file mode 100644
index 00000000..14c5dd19
--- /dev/null
+++ b/docs/source/examples/hyperopt.rst
@@ -0,0 +1,357 @@
+Hyperparameter optimization with Ax and Ray
+===========================================
+
+Here is an example of how to perform hyperparameter optimization sequentially
+(with Ax) or in parallel (with Ax and Ray).
+
+Prerequisites
+-------------
+
+The packages required for this task are `Ax`_ and `ray`_::
+
+    conda install -c conda-forge "ray < 2.7.0"
+    pip install ax-platform
+
+.. note::
+    The scripts have been tested with `ax-platform 0.3.1` and `ray 2.3.0`, as
+    well as some earlier versions of the two packages. Unfortunately, several
+    changes made in recent versions of `ray` break this script, so you should
+    install `ray < 2.7.0`. ``pip install`` is recommended by the Ax developers
+    even if a conda environment is used.
+
+.. note::
+    If you can update this example and the scripts to accommodate the changes
+    in the latest Ray package, feel free to submit a pull request.
+
+How it works
+------------
+
+Ax is a package that performs Bayesian optimization. Given the parameter
+ranges, a set of initial trials is generated. Based on the metrics returned
+from these trials, new trial parameters are generated. By default, this Ax
+workflow can only be performed sequentially. We can combine Ray and Ax to
+utilize multiple GPUs on the same node: Ray interfaces with Ax to pull trial
+parameters and then automatically distributes the trials to the available
+resources. With this, we can perform asynchronous, parallel hyperparameter
+optimization.
+
+
+Ax experiments
+^^^^^^^^^^^^^^
+
+You can create a basic Ax experiment this way::
+
+    from ax.service.ax_client import AxClient
+    from ax.service.utils.instantiation import ObjectiveProperties
+
+    ax_client = AxClient()
+    ax_client.create_experiment(
+        name="hyper_opt",
+        parameters=[
+            {
+                "name": "parameter_a",
+                "type": "fixed",
+                "value_type": "float",
+                "value": 0.6,
+            },
+            {
+                "name": "parameter_b",
+                "type": "range",
+                "value_type": "int",
+                "bounds": [20, 40],
+            },
+            {
+                "name": "parameter_c",
+                "type": "range",
+                "value_type": "float",
+                "bounds": [30.0, 60.0],
+            },
+            {
+                "name": "parameter_d",
+                "type": "range",
+                "value_type": "float",
+                "bounds": [0.001, 1],
+                "log_scale": True,
+            },
+        ],
+        objectives={
+            "Metric": ObjectiveProperties(minimize=True),
+        },
+        parameter_constraints=[
+            "parameter_b <= parameter_c",
+        ],
+    )
+
+Here we create an Ax experiment called "hyper_opt" with 4 parameters,
+`parameter_a`, `parameter_b`, `parameter_c`, and `parameter_d`. Our goal is to
+minimize a metric called "Metric".
+
+A few crucial things to note:
+
+* You can give each parameter a range or a fixed value, and you may want to
+  specify the data type as well. A fixed parameter is useful because it lets
+  you optimize over only a subset of parameters without modifying your
+  training function.
+* Constraints can be applied to the search space as the example shows, but
+  there is no easy way to express a constraint that contains mathematical
+  expressions (for example, `parameter_a < 2 * parameter_b`).
+* For each trial, Ax generates a dictionary to be used as the input of the
+  training function. The dictionary will look like::
+
+      {
+          "parameter_a": 0.6,
+          "parameter_b": 30,
+          "parameter_c": 35.0,
+          "parameter_d": 0.2
+      }
+
+  As such, the training function must be able to take a dictionary as its
+  input (as a single dictionary or as keyword arguments) and use these values
+  to set up the training.
+* The `objectives` keyword argument takes a dictionary that maps metric names
+  to `ObjectiveProperties`. The keys of this dictionary **MUST** exist in the
+  dictionary returned from the training function. In this example, the
+  training function must return a dictionary like::
+
+      return {
+          ...
+          "Metric": metric,
+          ...
+      }
+
+  The above two points will become clearer when we go through the training
+  function.
+
+Training function
+^^^^^^^^^^^^^^^^^
+
+You only need a minimal change to your existing training script to use it with
+Ax. In most cases, you just have to wrap the whole script into a function::
+
+    def training(parameter_a, parameter_b, parameter_c, parameter_d):
+        # set up the network with the parameters
+        ...
+        network_params = {
+            ...
+            "parameter_a": parameter_a,
+            ...
+        }
+        network = networks.Hipnn(
+            "hipnn_model", (species, positions), module_kwargs=network_params
+        )
+        # train the network
+        # `metric_tracker` contains the losses from HIPPYNN
+        metric_tracker = train_model(
+            training_modules,
+            database,
+            controller,
+            metric_tracker,
+            callbacks=None,
+            batch_callbacks=None,
+        )
+        # return the desired metric to Ax, for example, the validation loss
+        return {
+            "Metric": metric_tracker.best_metric_values["valid"]["Loss"]
+        }
+
+Note how we use the parameters passed in and return **Metric** at the end.
+
+.. _run-sequential-experiments:
+
+Run sequential experiments
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Next, we can run the experiments::
+
+    for k in range(30):
+        parameters, trial_index = ax_client.get_next_trial()
+        # unpack the parameter dictionary into the keyword arguments of `training`
+        ax_client.complete_trial(trial_index=trial_index, raw_data=training(**parameters))
+        # save the experiment to a JSON file
+        ax_client.save_to_json_file(filepath="hyperopt.json")
+        data_frame = ax_client.get_trials_data_frame().sort_values("Metric")
+        data_frame.to_csv("hyperopt.csv", header=True)
+
+Here we run 30 trials, and the results are saved to a JSON file and a CSV
+file. The JSON file contains all the details of the trials and can be used to
+restart or extend the experiment. Because it contains too many details to be
+human-friendly, we also save a CSV file that only contains the trial indices,
+parameters, and metrics.
+
+Asynchronous parallelized optimization with Ray
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To let Ray distribute the trials across GPUs in parallel, a small update to
+the training function is needed::
+
+    from ray.air import session
+
+
+    def training(parameter_a, parameter_b, parameter_c, parameter_d):
+        # setup and training are the same
+        ...
+        # instead of a return, we use `session.report` to communicate with `ray`
+        session.report(
+            {
+                "Metric": metric_tracker.best_metric_values["valid"]["Loss"]
+            }
+        )
+
+Instead of a simple `return`, we use the `report` method from
+`ray.air.session` to report the final metric to `ray`.
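+
+One detail worth knowing: Ray Tune calls a function trainable with a single
+dictionary containing the sampled parameters. If you would rather keep the
+keyword-argument version of `training` from the sequential example unchanged,
+a thin wrapper can unpack that dictionary and report the metric. This is only
+a sketch under that assumption, and the name `ray_training` is illustrative::
+
+    from ray.air import session
+
+
+    def ray_training(config: dict):
+        # Ray Tune passes all sampled Ax parameters as one dictionary,
+        # e.g. {"parameter_a": 0.6, "parameter_b": 30, ...}
+        metrics = training(**config)  # the keyword-argument version shown earlier
+        # forward the metric dictionary to Ray so the search algorithm
+        # can complete the corresponding Ax trial
+        session.report(metrics)
+
+If you use such a wrapper, pass `ray_training` instead of `training` to
+`tune.with_resources` below.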
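+
+If you want to resume or extend a previous run instead of starting from
+scratch, the `ax_client` handed to Ray can be restored from the JSON file
+saved earlier. A minimal sketch, assuming the file was saved as
+`hyperopt.json` as in the sequential example::
+
+    from ax.service.ax_client import AxClient
+
+    # restore the experiment state; new trials will continue from it
+    ax_client = AxClient.load_from_json_file(filepath="hyperopt.json")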
+
+Also, to run the trials, instead of the loop in
+:ref:`run-sequential-experiments`, we use the interface between the two
+packages provided by `ray`::
+
+    import ray
+    from ray import air, tune
+    from ray.tune.search import ConcurrencyLimiter
+    from ray.tune.search.ax import AxSearch
+
+    # to make sure ray loads local packages correctly
+    ray.init(runtime_env={"working_dir": "."})
+
+    algo = AxSearch(ax_client=ax_client)
+    # 4 GPUs available
+    algo = ConcurrencyLimiter(algo, max_concurrent=4)
+    tuner = tune.Tuner(
+        # assign 1 GPU to each trial
+        tune.with_resources(training, resources={"gpu": 1}),
+        # run 10 trials
+        tune_config=tune.TuneConfig(search_alg=algo, num_samples=10),
+        # configuration of ray
+        run_config=air.RunConfig(
+            # all results will be saved in a subfolder inside the "test" folder
+            # of the current working directory
+            local_dir="./test",
+            verbose=0,
+            log_to_file=True,
+        ),
+    )
+    # run the trials
+    tuner.fit()
+    # save the results at the end
+    # to save the files after each trial, a callback is needed
+    # (see the advanced details below)
+    ax_client.save_to_json_file(filepath="hyperopt.json")
+    data_frame = ax_client.get_trials_data_frame().sort_values("Metric")
+    data_frame.to_csv("hyperopt.csv", header=True)
+
+This is all you need. The results will be saved under
+`./test/{training_function_name}_{timestamp}`, and each trial will be saved in
+a subfolder named
+`{training_function_name}_{random_id}_{index}_{truncated_parameters}`.
+
+Advanced details
+^^^^^^^^^^^^^^^^
+
+Relative import
+"""""""""""""""
+
+If you save the training function in a separate file and import it into the
+Ray script, one line has to be added before the trials start::
+
+    ray.init(runtime_env={"working_dir": "."})
+
+assuming the current directory (".") contains the training and Ray scripts.
+Without this line, Ray will NOT be able to find the training script and import
+the training function.
+
+Callbacks for Ray
+"""""""""""""""""
+
+When running `ray.tune`, a set of callback functions can be called during the
+process. Ray has `documentation`_ on these callback functions, and you can
+build your own. The callback below saves the JSON and CSV files at the end of
+each trial and handles failed trials, which should cover the most basic
+needs::
+
+    import shutil
+
+    import numpy as np
+    from ax.core.trial import Trial as AXTrial
+    from ax.service.ax_client import AxClient
+    from ray.tune.experiment.trial import Trial
+    from ray.tune.logger import JsonLoggerCallback, LoggerCallback
+
+
+    class AxLogger(LoggerCallback):
+        def __init__(self, ax_client: AxClient, json_name: str, csv_name: str):
+            """
+            A logger callback to save the progress to a json file after every
+            trial ends. Similar to running `ax_client.save_to_json_file` every
+            iteration in sequential searches.
+
+            Args:
+                ax_client (AxClient): ax client to save
+                json_name (str): name for the json file. Append a path if you want to save the \
+                    json file to somewhere other than cwd.
+                csv_name (str): name for the csv file. Append a path if you want to save the \
+                    csv file to somewhere other than cwd.
+            """
+            self.ax_client = ax_client
+            self.json = json_name
+            self.csv = csv_name
+
+        def log_trial_end(
+            self, trial: Trial, id: int, metric: float, runtime: int, failed: bool = False
+        ):
+            self.ax_client.save_to_json_file(filepath=self.json)
+            shutil.copy(self.json, f"{trial.local_dir}/{self.json}")
+            try:
+                data_frame = self.ax_client.get_trials_data_frame().sort_values("Metric")
+                data_frame.to_csv(self.csv, header=True)
+            except KeyError:
+                pass
+            shutil.copy(self.csv, f"{trial.local_dir}/{self.csv}")
+            if failed:
+                status = "failed"
+            else:
+                status = "finished"
+            print(
+                f"AX trial {id} {status}. Final loss: {metric}. Time taken:"
+                f" {runtime} seconds. Log directory: {trial.logdir}."
+            )
+
+        def on_trial_error(self, iteration: int, trials: list[Trial], trial: Trial, **info):
+            id = int(trial.experiment_tag.split("_")[0]) - 1
+            ax_trial = self.ax_client.get_trial(id)
+            ax_trial.mark_abandoned(reason="Error encountered")
+            self.log_trial_end(
+                trial, id + 1, "not available", self.calculate_runtime(ax_trial), True
+            )
+
+        def on_trial_complete(
+            self, iteration: int, trials: list[Trial], trial: Trial, **info
+        ):
+            # trial.trial_id is the random id generated by ray, not ax
+            # the default experiment_tag starts with the ax trial index,
+            # but this workaround is fragile, as users can customize
+            # the tag or the folder name
+            id = int(trial.experiment_tag.split("_")[0]) - 1
+            ax_trial = self.ax_client.get_trial(id)
+            failed = False
+            try:
+                loss = ax_trial.objective_mean
+            except ValueError:
+                failed = True
+                loss = "not available"
+            else:
+                if np.isnan(loss) or np.isinf(loss):
+                    failed = True
+                    loss = "not available"
+            if failed:
+                ax_trial.mark_failed()
+            self.log_trial_end(
+                trial, id + 1, loss, self.calculate_runtime(ax_trial), failed
+            )
+
+        @classmethod
+        def calculate_runtime(cls, trial: AXTrial):
+            delta = trial.time_completed - trial.time_run_started
+            return int(delta.total_seconds())
+
+To use the callback, simply add it to `air.RunConfig`::
+
+    ax_logger = AxLogger(ax_client, "hyperopt_ray.json", "hyperopt.csv")
+    run_config = air.RunConfig(
+        local_dir="./test",
+        verbose=0,
+        callbacks=[ax_logger, JsonLoggerCallback()],
+        log_to_file=True,
+    )
+
+A full example script is provided in the examples (WIP).
+
+.. _ray: https://docs.ray.io/en/latest/
+.. _Ax: https://github.com/facebook/Ax
+.. _documentation: https://docs.ray.io/en/latest/tune/tutorials/tune-metrics.html
diff --git a/docs/source/examples/index.rst b/docs/source/examples/index.rst
index 548b884b..31e99d0d 100644
--- a/docs/source/examples/index.rst
+++ b/docs/source/examples/index.rst
@@ -19,4 +19,5 @@ the examples are just snippets. For fully-fledged examples see the
     ase_calculator
     mliap_unified
     excited_states
+    hyperopt