
Bugfix multi-GPU training #472

Open
wants to merge 9 commits into
base: main

Conversation

hellojixian

Two places have mainly been modified:

1. In Accelerate multi-GPU mode, this sd-scripts training has to ensure that all child processes use the same random seed, because the training loop uses torch.randn_like to generate the noisy latents. If, within one batch, the processes compute the loss against different noisy_latents, the loss cannot converge. A seeding sketch is shown after this list.

2. Re-implemented the dataloader to support per-GPU mini-batches while staying compatible with the current scripts' epoch and step calculation. With this modification, users do not need to modify their training parameters when switching from a single GPU to multiple GPUs; otherwise, they would have to reduce train_batch_size themselves so that the effective batch size is calculated correctly. A sketch of the batch split follows the seeding example.
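
A minimal sketch of the first change, assuming Accelerate's set_seed helper and a hard-coded seed value for illustration; the PR's actual implementation may differ:

```python
import torch
from accelerate import Accelerator
from accelerate.utils import set_seed

accelerator = Accelerator()

# Give every process the same seed (no per-rank offset) so that
# torch.randn_like draws identical noise on each GPU and the loss
# is computed against consistent noisy_latents.
set_seed(1234)  # same value as the --seed command-line argument below

# In the training loop the noise comes from this shared RNG state:
latents = torch.zeros(2, 4, 64, 64, device=accelerator.device)  # placeholder latents
noise = torch.randn_like(latents)
noisy_latents = latents + noise  # stands in for the scheduler's add_noise step
```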
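
A sketch of the second change, assuming the per-process batch is obtained by dividing the user-facing train_batch_size by the number of processes; the dataset and variable names here are illustrative, not the PR's actual code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

train_batch_size = 24                                             # value the user already passes on the CLI
per_device_batch = train_batch_size // accelerator.num_processes  # each GPU takes a slice of the batch

dataset = TensorDataset(torch.arange(240, dtype=torch.float32))   # placeholder dataset

# Each process loads only its shard of every global batch, so the effective
# batch size stays equal to train_batch_size and the single-GPU epoch/step
# accounting still holds.
loader = DataLoader(dataset, batch_size=per_device_batch, shuffle=True)
loader = accelerator.prepare(loader)

# Steps per epoch are still counted against the global batch size, so
# max_train_steps and save_every_n_epochs do not need to change.
steps_per_epoch = len(dataset) // train_batch_size
```

An alternative with the same effect would be Accelerate's split_batches option, which hands each process a slice of a full-size batch instead of building per-device loaders.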

My training script parameters are below:
accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --train_data_dir="/proj/suchka/image-generation/lora/jixian/img" --resolution=512,512 --output_dir="/proj/suchka/image-generation/lora/jixian/model" --logging_dir="/proj/suchka/image-generation/lora/jixian/log" --network_alpha="128" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-7 --unet_lr=5e-6 --network_dim=128 --output_name="jixianwang_v3" --lr_scheduler_num_cycles="10" --learning_rate="5e-6" --lr_scheduler="constant" --train_batch_size="24" --max_train_steps="2000" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --bucket_no_upscale --noise_offset=0.1 --seed=1234

With these changes, switching the machine from single GPU to multi-GPU simply reduced the training time, with no need to change any hyperparameters, so it works as expected :-)

Testing environment:
torch==2.0
xformers==0.0.19
transformers==4.28.1
diffusers==0.16.1

@FurkanGozukara

Multi-GPU Kaggle training is broken. Does anyone have a guess how to fix it?

#1272
