
Multi GPU finetuning on SD1.5 with CLIP Skip 2 fails #1099

Open
thojmr opened this issue Feb 2, 2024 · 7 comments
Labels
bug Something isn't working

Comments


thojmr commented Feb 2, 2024

Edit: I solved it 2 posts down.

Multi GPU training fails with the below error when using CLIP skip 2 with finetune.py (SD1.5).

File "/Desktop/code/kohya-sd-scripts/fine_tune.py", line 344, in train
    encoder_hidden_states = train_util.get_hidden_states(
File "/Desktop/code/kohya-sd-scripts/library/train_util.py", line 4139, in get_hidden_states
    encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
File "/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'

It fails here in the code:

    if args.clip_skip is None:
        encoder_hidden_states = text_encoder(input_ids)[0]
    else:
        enc_out = text_encoder(input_ids, output_hidden_states=True, return_dict=True)
        encoder_hidden_states = enc_out["hidden_states"][-args.clip_skip]
        encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)   #<-  Fails here
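
For context, here is a minimal, self-contained illustration (hypothetical stand-in classes, not kohya's actual code) of why the attribute lookup fails once Accelerate wraps the text encoder in DistributedDataParallel:

    # Hypothetical stand-ins for the CLIP text encoder, just to show the wrapping issue.
    import torch.nn as nn

    class TextModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.final_layer_norm = nn.LayerNorm(768)

        def forward(self, x):
            return self.final_layer_norm(x)

    class TextEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.text_model = TextModel()

        def forward(self, x):
            return self.text_model(x)

    text_encoder = TextEncoder()
    # After accelerator.prepare(text_encoder) on multiple GPUs, text_encoder becomes a
    # DistributedDataParallel wrapper, and attribute lookup on the wrapper cannot find
    # `text_model`:
    #   text_encoder.text_model                            -> AttributeError
    #   text_encoder.module.text_model                     -> works
    #   accelerator.unwrap_model(text_encoder).text_model  -> works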

This seems similar to these tickets, but I'm not sure which objects need to be unwrapped or where to do it.
#1000
#1019

Any guidance or things to try would be helpful.
Thanks!

Edit: a minimal reproduction script using the latest repo version:

accelerate launch --num_cpu_threads_per_process=4 fine_tune.py \
    --pretrained_model_name_or_path="${pretrained_model_name_or_path}" \
    --in_json $metadata_dir"/meta_cap.json" \
    --train_data_dir="../../datasets/${raw_img_folder}" \
    --output_dir="./output/${short_name}" \
    --resolution="512,512" \
    --train_batch_size=1 \
    --learning_rate=2e-6 \
    --learning_rate_te=1e-6 \
    --lr_scheduler="cosine" \
    --max_train_epochs 10 \
    --mixed_precision="bf16" \
    --save_precision="fp16" \
    --save_every_n_steps=10000000 \
    --enable_bucket \
    --clip_skip=2 \
    --logging_dir=logs \
    --save_model_as="safetensors" \
    --output_name="test" \
    --caption_extension=".txt" \
    --train_text_encoder

thojmr commented Feb 3, 2024

Some findings so far.

I attempted to call accelerator.unwrap_model(text_encoder).text_model.final_layer_norm(encoder_hidden_states) to bypass the error above. However, doing so causes part of the gradient to be skipped, specifically for the layers under text_model.encoder.layers.11, as shown in the error below.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 1: text_model.encoder.layers.11.layer_norm2.bias, text_model.encoder.layers.11.layer_norm2.weight, text_model.encoder.layers.11.mlp.fc2.bias, text_model.encoder.layers.11.mlp.fc2.weight, text_model.encoder.layers.11.mlp.fc1.bias, text_model.encoder.layers.11.mlp.fc1.weight, text_model.encoder.layers.11.layer_norm1.bias, text_model.encoder.layers.11.layer_norm1.weight, text_model.encoder.layers.11.self_attn.out_proj.bias, text_model.encoder.layers.11.self_attn.out_proj.weight, text_model.encoder.layers.11.self_attn.q_proj.bias, text_model.encoder.layers.11.self_attn.q_proj.weight, text_model.encoder.layers.11.self_attn.v_proj.bias, text_model.encoder.layers.11.self_attn.v_proj.weight, text_model.encoder.layers.11.self_attn.k_proj.bias, text_model.encoder.layers.11.self_attn.k_proj.weight
Parameter indices which did not receive grad for rank 1: 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193

It's complaining about the last layer, which makes sense with clip_skip=2. I'm currently trying to figure out how to have the script skip gradient computation for that layer (similar to SDXL), but the snippet below does not seem to be working, so I'm stuck.

    if args.clip_skip == 2:
        print("freezing last layer")
        text_encoder.text_model.encoder.layers[-1].requires_grad_(False)
        text_encoder.text_model.final_layer_norm.requires_grad_(False)

Am I heading in the right direction?
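
As an aside, the DDP error above also hints at a different, untested workaround: letting DDP tolerate unused parameters instead of freezing them. With Accelerate that would look roughly like this (a sketch only, not the fix adopted later in this thread):

    # Sketch: build the Accelerator with find_unused_parameters=True so DDP does not
    # error out when clip_skip leaves the last text-encoder layer out of the loss.
    # Note this adds some per-step overhead.
    from accelerate import Accelerator
    from accelerate.utils import DistributedDataParallelKwargs

    ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
    accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])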


thojmr commented Feb 3, 2024

I think I finally got it working. Here are my changes. The line numbers are probably off a bit, and the changes only cover my specific use case (the happy path).

The fix was unwrapping the text encoder with accelerator when using clip_skip=2 during distributed training.

Edit: I realized that my local changes might make the line numbers off by up to 30 lines, so I added more details on where to place the changes.

fine_tune.py:~176

    # Add this code to freeze the last CLIP layer so gradient is not computed for those layers (similar to SDXL_train.py)
    # Place just before 'if not cache_latents:'
    if args.clip_skip == 2:
        print("freezing last layer")
        text_encoder.text_model.encoder.layers[-1].requires_grad_(False)
        text_encoder.text_model.final_layer_norm.requires_grad_(False)

fine_tune.py:~195

    # for m in training_models:
    #     m.requires_grad_(True)
    # We replace the lines above with the line below so the text encoder's requires_grad setting is not overridden
    training_models[0].requires_grad_(True)

train_util.py:~4139

    # Unwrap the text encoder when clip skip is 2.  Add "accelerator" as a param to the parent method
    # Replace this line with below: encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
    encoder_hidden_states = accelerator.unwrap_model(text_encoder).text_model.final_layer_norm(encoder_hidden_states) if accelerator else text_encoder.text_model.final_layer_norm(encoder_hidden_states)
    
    # also add `accelerator` param to the get_hidden_states method, and any calls to this method
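
Put together, the relevant part of get_hidden_states might end up looking roughly like this (a sketch with a simplified signature; the real function in train_util.py takes more arguments and also handles pad tokens and SD2.x text encoders):

    # Sketch only: simplified signature, assuming `accelerator` is passed down from
    # the training loop (None for single-GPU runs).
    def get_hidden_states(args, input_ids, text_encoder, accelerator=None):
        if args.clip_skip is None:
            encoder_hidden_states = text_encoder(input_ids)[0]
        else:
            enc_out = text_encoder(input_ids, output_hidden_states=True, return_dict=True)
            encoder_hidden_states = enc_out["hidden_states"][-args.clip_skip]
            # Under multi-GPU training text_encoder is a DistributedDataParallel wrapper,
            # so unwrap it before reaching into .text_model.
            unwrapped = accelerator.unwrap_model(text_encoder) if accelerator is not None else text_encoder
            encoder_hidden_states = unwrapped.text_model.final_layer_norm(encoder_hidden_states)
        return encoder_hidden_states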

I'll try training with it soon, that was enough adventure for one day. (It worked for me.)

@kohya-ss kohya-ss added the bug Something isn't working label Feb 29, 2024

fschiro commented Apr 17, 2024

Same problem on a dual 4090 system. Training failed unless I set clip skip = 1.

zdoek001 commented

# unwrap the text encoder when clip skip is 2.  Add "accelerator" as a param to the parent method
encoder_hidden_states = accelerator.unwrap_model(text_encoder).text_model.final_layer_norm(encoder_hidden_states) if accelerator else text_encoder.text_model.final_layer_norm(encoder_hidden_states)

How do I add this? I can't find the specified location, and `accelerator` is not available there.


thojmr commented Jun 4, 2024

How do I add this? I can't find the specified location, and `accelerator` is not available there.

Hey, sorry I don't log in very often. I've updated my fix above with more details on what changes go where.

Nice-Zhang66 commented

I had the same problem as you; may I ask how you eventually solved it?
I followed your instructions and made the changes in fine_tune.py and train_util.py, but it didn't work.


thojmr commented Sep 6, 2024

how you eventually solved it.

A lot of Google searching about handling distributed models in Accelerate. I think someone had a similar problem in another repo that I used as a reference, but it's been too long now to remember.
