Distributed testing #173

Open
hellangleZ opened this issue Dec 7, 2024 · 2 comments

Comments

@hellangleZ

Hi expert,

I can train on a single machine, but how do I run distributed, multi-region training with this script? Or is there a good sample that would help me understand?

Using 2 GPUs

ZERO_BAND_LOG_LEVEL=DEBUG ./scripts/simulate_multi_node_diloco.sh 2 1 src/zeroband/train.py @configs/debug/diloco.toml

Thanks

@Jackmin801
Member

Jackmin801 commented Jan 6, 2025

Sorry for seeing this late. We'll write better documentation around this in the future, but I'll quickly give a brief rundown here. Hope it's helpful.

1. You need to set these environment variables on all machines involved:

export GLOBAL_ADDR=192.168.0.2
export GLOBAL_PORT=1234
export GLOBAL_WORLD_SIZE=3

GLOBAL_ADDR is the IP of the first node. GLOBAL_PORT is any port that is open and reachable on the first node. GLOBAL_WORLD_SIZE is the total number of nodes.

2. You need to set these specific environment variables for each machine:

export GLOBAL_RANK=0 
export GLOBAL_UNIQUE_ID=A0 

GLOBAL_RANK is 0 for the first node, 1 for the second node, and so on. It only matters that the node at GLOBAL_ADDR has rank 0 and that the remaining ranks are unique and less than GLOBAL_WORLD_SIZE.
GLOBAL_UNIQUE_ID is any string, as long as it is different for each node.
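
The constraints from steps 1 and 2 can be checked on each node before launching. This is only a minimal sketch: `check_global_env` is a hypothetical helper written for illustration, not something the repository provides, and it only checks what is stated above (all five variables set, rank and world size are integers, rank less than world size). It cannot check that ranks and IDs are unique across nodes.

```shell
# Hypothetical pre-launch check (not part of the repository).
check_global_env() {
    # All five variables from steps 1 and 2 must be set and non-empty.
    for name in GLOBAL_ADDR GLOBAL_PORT GLOBAL_WORLD_SIZE GLOBAL_RANK GLOBAL_UNIQUE_ID; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            return 1
        fi
    done
    # World size and rank must be non-negative integers.
    case "${GLOBAL_WORLD_SIZE}${GLOBAL_RANK}" in
        *[!0-9]*)
            echo "error: GLOBAL_WORLD_SIZE and GLOBAL_RANK must be non-negative integers" >&2
            return 1
            ;;
    esac
    # Ranks run from 0 to GLOBAL_WORLD_SIZE - 1.
    if [ "$GLOBAL_RANK" -ge "$GLOBAL_WORLD_SIZE" ]; then
        echo "error: GLOBAL_RANK=$GLOBAL_RANK must be less than GLOBAL_WORLD_SIZE=$GLOBAL_WORLD_SIZE" >&2
        return 1
    fi
    echo "ok: node $GLOBAL_UNIQUE_ID has rank $GLOBAL_RANK of $GLOBAL_WORLD_SIZE, master at $GLOBAL_ADDR:$GLOBAL_PORT"
}
```

Calling `check_global_env` after the `export` lines and before `torchrun` would catch a missing variable or an out-of-range rank early.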

3. Your final launch script will look something like this:

First Node (Master):

export GLOBAL_ADDR=192.168.100.1
export GLOBAL_PORT=1234
export GLOBAL_WORLD_SIZE=3

export GLOBAL_UNIQUE_ID=A0 
export GLOBAL_RANK=0

uv run torchrun --nproc_per_node=2 \
	--rdzv-endpoint localhost:10001 \
	src/zeroband/train.py \
	@configs/150M/3090.toml \
	--no-wandb-resume

Second Node (Worker):

export GLOBAL_ADDR=192.168.100.1
export GLOBAL_PORT=1234
export GLOBAL_WORLD_SIZE=3

export GLOBAL_UNIQUE_ID=B1 
export GLOBAL_RANK=1 

uv run torchrun --nproc_per_node=2 \
	--rdzv-endpoint localhost:10001 \
	src/zeroband/train.py \
	@configs/150M/3090.toml \
	--no-wandb-resume

Third Node (Worker):

export GLOBAL_ADDR=192.168.100.1
export GLOBAL_PORT=1234
export GLOBAL_WORLD_SIZE=3

export GLOBAL_UNIQUE_ID=CCCCCC 
export GLOBAL_RANK=2 

uv run torchrun --nproc_per_node=2 \
	--rdzv-endpoint localhost:10001 \
	src/zeroband/train.py \
	@configs/150M/3090.toml \
	--no-wandb-resume
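
The three scripts above differ only in GLOBAL_RANK and GLOBAL_UNIQUE_ID, so the shared part can be factored out. A minimal sketch, assuming the example values above; `node_env` is a hypothetical helper name, not part of the repository, and it assumes the unique ID is a simple string without spaces or shell metacharacters:

```shell
# Hypothetical helper: print the export lines for one node.
# The shared values are copied from the example above; adjust them for your cluster.
node_env() {
    rank=$1
    unique_id=$2
    printf 'export GLOBAL_ADDR=%s\n' "192.168.100.1"
    printf 'export GLOBAL_PORT=%s\n' "1234"
    printf 'export GLOBAL_WORLD_SIZE=%s\n' "3"
    printf 'export GLOBAL_UNIQUE_ID=%s\n' "$unique_id"
    printf 'export GLOBAL_RANK=%s\n' "$rank"
}

# On the second node, load that node's environment ...
eval "$(node_env 1 B1)"

# ... then launch with the same command as in the scripts above:
# uv run torchrun --nproc_per_node=2 \
#     --rdzv-endpoint localhost:10001 \
#     src/zeroband/train.py \
#     @configs/150M/3090.toml \
#     --no-wandb-resume
```

On the first node this would be `eval "$(node_env 0 A0)"`, and on the third `eval "$(node_env 2 CCCCCC)"`.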

@lizdongkun

Hi Expert,

I have a question.
It looks like each DiLoCo peer has to be a single server, since the rdzv endpoint is localhost for all participating GPUs within one peer (the same host).
Can one DiLoCo worker be made up of two or more physical servers, with the FSDP sharding for that peer split across those servers, in the current prime framework? Is that supported?

Thanks!!

Regards,
Kun
