Fix multi-gpu SDXL training #1000
Conversation
Are you sure the last layer of text_encoder1 is not trained? Because I don't want single-GPU training to be broken.
SDXL uses the output of the penultimate layer of Text Encoder 1 instead of the last layer. As a result, the last layer doesn't participate in the loss calculation, but it raises a RuntimeError during the backward pass under DDP because it received no grad. The grad of text_encoder1's last layer should be None in both single-GPU and multi-GPU training; that's what I saw when I reproduced the RuntimeError. Can you check the grad of text_encoder1's last layer, or compare the grads between devices? It seems that the grads between GPUs are not synced correctly.
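For context, here is a minimal sketch (not the PR's code) of why the last layer ends up without a gradient; it assumes Text Encoder 1 is the standard CLIP ViT-L text model from transformers, as in SDXL:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# SDXL's Text Encoder 1 is CLIP ViT-L/14; SDXL conditions on its penultimate hidden state.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder1 = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a photo of a cat", return_tensors="pt")
out = text_encoder1(**tokens, output_hidden_states=True)

# hidden_states[-2] is the penultimate layer's output; the last layer's output is never used.
cond = out.hidden_states[-2]

# A dummy loss built only from the penultimate state leaves the last encoder layer
# (and final_layer_norm) with grad=None after backward.
cond.mean().backward()
print([k for k, v in text_encoder1.named_parameters() if v.grad is None])
```

Since DDP expects every parameter with requires_grad=True to receive a gradient, these None grads are what trigger the "parameter indices which did not receive grad" RuntimeError unless the layer is frozen or find_unused_parameters is enabled.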
How can I check the last layer difference? I have a trained model right now that I can compare.
You can just add a print after the backward pass: print([k for k, v in text_encoder1.named_parameters() if v.grad is None]). It should output the last layer's parameters in both single-GPU and multi-GPU training.
Or you can print and compare between devices (this will print different weight/grad values per device if you are using the main branch): print(accelerator.device, THE WEIGHT/GRAD YOU WANT TO COMPARE)
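For example, the two checks could sit right after the backward call in the training loop; the names accelerator, loss, and text_encoder1 follow sdxl_train.py, the rest is a hypothetical sketch:

```python
accelerator.backward(loss)

# 1) Parameters that received no gradient; the last layer of text_encoder1 is expected
#    to show up here in both single-GPU and multi-GPU runs.
print([k for k, v in text_encoder1.named_parameters() if v.grad is None])

# 2) Compare a concrete gradient across devices; if DDP is not syncing correctly,
#    each rank prints a different value.
param = next(p for p in text_encoder1.parameters() if p.grad is not None)
print(accelerator.device, param.grad.abs().sum().item())
```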
Thank you so much! I hope this will finally solve the DDP training issue. I've changed to set
Thank you very much! I'm still running the SDXL training script, but the output images so far are very promising. Great improvements in detail and texture.
Multi-GPU Kaggle training is broken. Does anyone have a guess how to fix it?
Fix : dev SDXL:multi-GPUs train #994
Fix: "Parameter indices which did not receive grad for rank x", Multi-GPU SDXL Training (unet + both text encoders) #997
- Add --gradient_as_bucket_view to reduce VRAM usage in DDP training.
- Add --static_graph to resolve the conflict between DDP and gradient checkpointing.
- Freeze the last layer of text_encoder1 in sdxl_train.py, since it doesn't participate in the loss calculation; this prevents the RuntimeError in DDP training (see the sketch after this list).

These changes are related to text encoder training, so I tested them by training the text encoders on 2 GPUs due to limited VRAM.
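A minimal sketch of what the three changes amount to, assuming the new flags are forwarded to DDP through accelerate's DistributedDataParallelKwargs and that text_encoder1 is the usual transformers CLIPTextModel; the actual wiring in sdxl_train.py may differ, and static_graph needs a reasonably recent accelerate/PyTorch:

```python
from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs
from transformers import CLIPTextModel

# --gradient_as_bucket_view / --static_graph map onto DDP options via accelerate.
ddp_kwargs = DistributedDataParallelKwargs(
    gradient_as_bucket_view=True,  # gradients share storage with DDP buckets -> less VRAM
    static_graph=True,             # lets DDP cooperate with gradient checkpointing
)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

# Text Encoder 1; in sdxl_train.py it is loaded from the SDXL checkpoint instead.
text_encoder1 = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Only the penultimate hidden state feeds the SDXL loss, so the last transformer layer
# and final_layer_norm are frozen to stop DDP from waiting for gradients that never arrive.
text_encoder1.text_model.encoder.layers[-1].requires_grad_(False)
text_encoder1.text_model.final_layer_norm.requires_grad_(False)
```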