feat(doc): Add multi-node torchrun info #2304

52 changes: 46 additions & 6 deletions docs/multi-node.qmd
title: Multi Node
description: How to use Axolotl on multiple machines
---

Below are three ways to train multi-node in Axolotl.

::: {.callout-important}
Each machine needs a copy of Axolotl; we suggest using the same commit to ensure compatibility.

You will also need to have the same configuration file for your model on each machine.

Make sure the main machine is reachable by other machines (for accelerate, the port you set as `main_process_port` must be open for TCP).
:::
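
For example, one way to keep every machine in sync (the repository URL and editable install below are assumptions; follow the installation guide for your environment):

```bash
# Clone Axolotl and pin every machine to the same commit (the commit SHA is a placeholder)
git clone https://github.com/axolotl-ai-cloud/axolotl.git
cd axolotl
git checkout <commit-sha>
pip install -e .
```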

# Accelerate

You will need to create a configuration for accelerate, either by running `accelerate config` and following the instructions, or by using one of the presets below:

~/.cache/huggingface/accelerate/default_config.yaml
tpu_use_sudo: false
use_cpu: false
```

Configure your model to use FSDP in the Axolotl yaml. For example:
```yaml
fsdp:
- full_shard
fsdp_config:
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```

Now launch with accelerate on each machine as you usually would; the processes will start once accelerate has been launched on every machine.
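
For example, the multi-node values can also be passed as flags at launch time. This is only a sketch; the node count, IP, port, and GPU totals are illustrative, and `--machine_rank` must differ per machine (0 on the main machine, 1 on the next, and so on):

```bash
# Run on every machine; only --machine_rank changes per machine.
# --num_processes is the total number of GPUs across all machines.
accelerate launch \
  --num_machines 2 \
  --num_processes 16 \
  --machine_rank 0 \
  --main_process_ip 10.0.0.1 \
  --main_process_port 29500 \
  -m axolotl.cli.train config.yaml
```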

# Ray Train

Please see the Ray Train docs [here](ray-integration.qmd).

# Torchrun

If you are using InfiniBand, we recommend using torchrun to utilize the full bandwidth.

Set the following environment variables (adjust the buffer size and socket interface names for your system):

```bash
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond"
export NCCL_BUFFSIZE=2097152
```
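
To check whether NCCL actually selects the InfiniBand transport, its standard debug logging can be enabled for a test run:

```bash
# Optional: have NCCL log which transport (IB vs. sockets) it picks at startup
export NCCL_DEBUG=INFO
```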

Run the following on each node:

```bash
torchrun --nnodes $num_nodes --nproc_per_node $gpu_per_node --rdzv_id $rdzv_id --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:$head_node_port" -m axolotl.cli.train config.yaml
```

Please make sure to substitute the placeholder variables; a filled-in example follows the list below.

- `num_nodes`: Number of nodes (containing GPUs)
- `gpu_per_node`: Number of GPUs per node
- `head_node_ip`: IP of the head node (make sure other machines can connect to it)
- `head_node_port`: Port of the head node (make sure other machines can connect to it; default 29400)
- `rdzv_id`: A unique job ID shared by all nodes in the job.
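
For example, on two nodes with eight GPUs each (the job ID, IP, and port below are illustrative):

```bash
# Illustrative values: 2 nodes, 8 GPUs each, head node reachable at 10.0.0.1:29400
torchrun --nnodes 2 --nproc_per_node 8 \
  --rdzv_id axolotl_job --rdzv_backend c10d \
  --rdzv_endpoint "10.0.0.1:29400" \
  -m axolotl.cli.train config.yaml
```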

::: {.callout-note}
You need to call `axolotl.cli.train` instead of `axolotl train`, as the latter calls accelerate under the hood.
:::

More info on the available configuration options can be found in the PyTorch docs [here](https://pytorch.org/docs/stable/elastic/run.html).