Mainly, two places have been modified:
1. In accelerate multi-GPU mode, this sd-scripts training has to make sure all child processes use the same random seed, because the training loop uses torch.randn_like to generate the noisy latents. If, within one batch, the processes compute the loss against different noisy_latents, the loss cannot converge (see the sketch below).
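A minimal sketch of the seeding idea, assuming Hugging Face accelerate's set_seed utility and a diffusers DDPMScheduler; the latents/timesteps here are placeholders, and the exact wiring inside sd-scripts may differ:

```python
# Sketch only, not the actual sd-scripts code: seed every rank identically so
# the noise drawn inside the training loop matches across child processes.
import torch
from accelerate import Accelerator
from accelerate.utils import set_seed
from diffusers import DDPMScheduler

accelerator = Accelerator()

# With the default device_specific=False, every child process gets the same
# seed, so their RNG streams stay in sync.
set_seed(1234)

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

# Placeholder latents; in training these would be the (cached) VAE latents.
latents = torch.zeros(4, 4, 64, 64, device=accelerator.device)

# Because the seeds agree, torch.randn_like yields identical noise on every
# rank, so all ranks compute the loss against the same noisy_latents.
noise = torch.randn_like(latents)
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps,
    (latents.shape[0],), device=latents.device,
)
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
```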
My training script parameters are below:
accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --train_data_dir="/proj/suchka/image-generation/lora/jixian/img" --resolution=512,512 --output_dir="/proj/suchka/image-generation/lora/jixian/model" --logging_dir="/proj/suchka/image-generation/lora/jixian/log" --network_alpha="128" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-7 --unet_lr=5e-6 --network_dim=128 --output_name="jixianwang_v3" --lr_scheduler_num_cycles="10" --learning_rate="5e-6" --lr_scheduler="constant" --train_batch_size="24" --max_train_steps="2000" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --bucket_no_upscale --noise_offset=0.1 --seed=1234
With these changes, switching from a single-GPU machine to a multi-GPU machine simply reduced the training time, without needing to change any hyperparameters, so it works as expected :-)
Testing environment:
torch==2.0
xformers==0.0.19
transformers==4.28.1
diffusers==0.16.1