Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dev branch sample_images crashes during multi-GPU training #1002

Closed
ngitnenlim opened this issue Dec 13, 2023 · 2 comments
Closed

Dev branch sample_images crashes during multi-GPU training #1002

ngitnenlim opened this issue Dec 13, 2023 · 2 comments

Comments

@ngitnenlim
Copy link

ngitnenlim commented Dec 13, 2023

Similar to issue #994, but happened while attempting to generate sample images in sdxl_train.py.

The error message:
AttributeError: 'DistributedDataParallel' object has no attribute 'text_projection'

The issue can be fixed by unwrapping the model, i.e., change

sdxl_train_util.sample_images(
    ...,
    [text_encoder1, text_encoder2],
    unet,
)

to

sdxl_train_util.sample_images(
    ...,
    [accelerator.unwrap_model(text_encoder1), accelerator.unwrap_model(text_encoder2)],
    accelerator.unwrap_model(unet),
)

But I'm not sure if this workaround would potentially break the gradient synchronization issue mentioned in #994.
@Isotr0py I'd be very grateful if you could run a sanity check on this.

@Isotr0py
Copy link
Contributor

Yes, this is the correct way to fix this issue.

I've tested this, and gradients are synchronized when nccl:all_reduce is called in accelerator.backward(loss).

@ngitnenlim
Copy link
Author

Alright, so after merging #1000 and fixing issue #1002, I believe the gradient asynchrony issue has been solved. Here's what I found for my run, compared with the previous run with the same settings:

  1. Lower loss for every step
  2. The output sample images got way better. The sample images in the previous run look broken now to me. The details were cranky and the texture was crude.

Great job, @Isotr0py!

ngitnenlim added a commit to ngitnenlim/sd-scripts that referenced this issue Dec 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants