Mainly, two places have been modified:
1. In accelerate multi-GPU mode, this sd-scripts training has to make sure all child processes use the same random seed, because the training loop uses torch.randn_like to generate the noisy latents. If, within one batch, the processes compute the loss against different noisy_latents, the loss cannot converge (see the sketch below).
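A minimal sketch of the seeding idea, assuming Hugging Face accelerate's set_seed utility and a diffusers DDPMScheduler; the latents/timesteps here are placeholders, and the exact wiring inside sd-scripts may differ:

```python
# Sketch only, not the actual sd-scripts code: seed every rank identically so
# the noise drawn inside the training loop matches across child processes.
import torch
from accelerate import Accelerator
from accelerate.utils import set_seed
from diffusers import DDPMScheduler

accelerator = Accelerator()

# With the default device_specific=False, every child process gets the same
# seed, so their RNG streams stay in sync.
set_seed(1234)

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

# Placeholder latents; in training these would be the (cached) VAE latents.
latents = torch.zeros(4, 4, 64, 64, device=accelerator.device)

# Because the seeds agree, torch.randn_like yields identical noise on every
# rank, so all ranks compute the loss against the same noisy_latents.
noise = torch.randn_like(latents)
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps,
    (latents.shape[0],), device=latents.device,
)
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
```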
My training script parameters are below:
accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --train_data_dir="/proj/suchka/image-generation/lora/jixian/img" --resolution=512,512 --output_dir="/proj/suchka/image-generation/lora/jixian/model" --logging_dir="/proj/suchka/image-generation/lora/jixian/log" --network_alpha="128" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-7 --unet_lr=5e-6 --network_dim=128 --output_name="jixianwang_v3" --lr_scheduler_num_cycles="10" --learning_rate="5e-6" --lr_scheduler="constant" --train_batch_size="24" --max_train_steps="2000" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --bucket_no_upscale --noise_offset=0.1 --seed=1234
With these changes, switching from a single-GPU machine to a multi-GPU machine simply reduced the training time, without needing to change any hyperparameters, so it works as expected :-)
Testing environment:
torch==2.0
xformers==0.0.19
transformers==4.28.1
diffusers==0.16.1