
New training broken on Kaggle due to DistributedDataParallel and torch.distributed.elastic.multiprocessing.api #1272

Open
FurkanGozukara opened this issue Apr 18, 2024 · 7 comments


@FurkanGozukara

I am trying to do multi-GPU training on Kaggle.

Previously it was working great.

But after all these new changes I am getting the error below:

Traceback (most recent call last):
  File "/kaggle/working/kohya_ss/sd-scripts/train_db.py", line 529, in <module>
    train(args)
  File "/kaggle/working/kohya_ss/sd-scripts/train_db.py", line 343, in train
    encoder_hidden_states = train_util.get_hidden_states(
  File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 4427, in get_hidden_states
    encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'

(the second process prints an identical traceback)
steps:   0%|                                           | 0/3000 [00:00<?, ?it/s]
[2024-04-18 00:21:49,711] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1114) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

The train command is like this:

 Executing command: "/opt/conda/bin/accelerate" launch \
     --dynamo_backend no --dynamo_mode default --gpu_ids 0,1 \
     --mixed_precision no --multi_gpu --num_processes 2 \
     --num_machines 1 --num_cpu_threads_per_process 4 \
     "/kaggle/working/kohya_ss/sd-scripts/train_db.py" \
     --config_file "./outputs/tmpfiledbooth.toml" \
     --max_grad_norm=0.0 --no_half_vae \
     --ddp_timeout=10000000 --ddp_gradient_as_bucket_view
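For context, torch.nn.parallel.DistributedDataParallel does not forward attribute lookups to the module it wraps, which is exactly what the traceback shows. A minimal single-process sketch (illustrative only; it assumes the CPU-only gloo backend and a toy module, not the actual sd-scripts code) reproduces the same AttributeError:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One-process "cluster" just so DDP can wrap a module (gloo needs no GPU).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.text_model = torch.nn.Linear(4, 4)  # stands in for CLIP's text_model

    def forward(self, x):
        return self.text_model(x)

wrapped = DDP(Toy())
print(type(wrapped.module.text_model))  # OK: the inner module is reachable via .module
try:
    wrapped.text_model                  # DDP does not proxy this attribute
except AttributeError as e:
    print(e)  # 'DistributedDataParallel' object has no attribute 'text_model'

dist.destroy_process_group()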
@FurkanGozukara
Author

FurkanGozukara commented Apr 18, 2024

Even single-GPU training fails on Kaggle now:

[2024-04-18 00:29:47,958] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1187 closing signal SIGTERM
[2024-04-18 00:29:48,123] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1188) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/kaggle/working/kohya_ss/sd-scripts/train_db.py FAILED

@FurkanGozukara FurkanGozukara changed the title Distributed training is broken AttributeError: 'DistributedDataParallel' object has no attribute 'text_model' New training broken on Kaggle due to DistributedDataParallel and torch.distributed.elastic.multiprocessing.api Apr 18, 2024
@kohya-ss
Owner

DDP training for finetune.py or train_db.py (SD1.5/2.0) with clip_skip >= 2 seems to cause this issue. Could you try without clip_skip?

If it works without clip_skip, the issue is caused by accessing the inner layers of the model directly while the model is wrapped by accelerate. It may need some investigation to solve...
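If that is the cause, a minimal sketch of one possible fix (a hypothetical helper, not the actual sd-scripts code) is to unwrap the model before touching its inner layers; accelerator.unwrap_model(text_encoder) does the same job where an Accelerator instance is in scope:

from torch.nn.parallel import DistributedDataParallel as DDP

def unwrap(model):
    # DDP keeps the original module under .module; pass non-DDP models through.
    return model.module if isinstance(model, DDP) else model

# In get_hidden_states, before the failing line, something like:
#   text_encoder = unwrap(text_encoder)
#   encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)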

@kohya-ss
Owner

Possibly a duplicate of #1099.

@FurkanGozukara
Author

FurkanGozukara commented Apr 21, 2024

DDP training for finetune.py or train_db.py (SD1.5/2.0) with clip_skip >= 2 seems to cause this issue. Could you try without clip_skip?

If it works without clip_skip, the issue is caused by accessing the inner layers of the model directly while the model is wrapped by accelerate. It may need some investigation to solve...

I didn't set clip skip; I use the default value. After I selected a single P100 GPU it worked, but with dual T4 GPUs it always failed.

Yes, my config has "clip_skip": 1,

I train only text encoder 1 and not text encoder 2.

@Nice-Zhang66

I had the same problem as you; may I ask how you eventually solved it?

@FurkanGozukara
Author

I had the same problem as you; may I ask how you eventually solved it?

No, I didn't.

I used only a single GPU to work around the issue.

Before these changes it was working perfectly.

After this topic I didn't try again either.
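For reference, the single-GPU workaround amounts to dropping the --multi_gpu and --num_processes 2 flags from the launch command quoted earlier, roughly like this (a sketch, assuming the same config file and otherwise unchanged flags; the DDP-specific options are no longer needed):

 "/opt/conda/bin/accelerate" launch \
     --dynamo_backend no --dynamo_mode default --gpu_ids 0 \
     --mixed_precision no --num_processes 1 --num_machines 1 \
     --num_cpu_threads_per_process 4 \
     "/kaggle/working/kohya_ss/sd-scripts/train_db.py" \
     --config_file "./outputs/tmpfiledbooth.toml" \
     --max_grad_norm=0.0 --no_half_vae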

@Nice-Zhang66

Thank you for your reply; I will continue to look for a solution.
