
New training broken on Kaggle due to DistributedDataParallel and torch.distributed.elastic.multiprocessing.api #1272

Open
FurkanGozukara opened this issue Apr 18, 2024 · 7 comments


@FurkanGozukara

I am trying to do multi-GPU training on Kaggle.

Previously it was working great.

But after all these new changes I am getting the error below:

Traceback (most recent call last):
  File "/kaggle/working/kohya_ss/sd-scripts/train_db.py", line 529, in <module>
    train(args)
  File "/kaggle/working/kohya_ss/sd-scripts/train_db.py", line 343, in train
    encoder_hidden_states = train_util.get_hidden_states(
  File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 4427, in get_hidden_states
    encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'

(the second process prints an identical traceback)
steps:   0%|                                           | 0/3000 [00:00<?, ?it/s]
[2024-04-18 00:21:49,711] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1114) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

The train command is like this:

 Executing command: "/opt/conda/bin/accelerate" launch \
     --dynamo_backend no --dynamo_mode default --gpu_ids 0,1 \
     --mixed_precision no --multi_gpu --num_processes 2 \
     --num_machines 1 --num_cpu_threads_per_process 4 \
     "/kaggle/working/kohya_ss/sd-scripts/train_db.py" \
     --config_file "./outputs/tmpfiledbooth.toml" \
     --max_grad_norm=0.0 --no_half_vae \
     --ddp_timeout=10000000 --ddp_gradient_as_bucket_view
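For context, torch.nn.parallel.DistributedDataParallel does not forward attribute lookups to the module it wraps, which is exactly what the traceback shows. A minimal single-process sketch (illustrative only; it assumes the CPU-only gloo backend and a toy module, not the actual sd-scripts code) reproduces the same AttributeError:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One-process "cluster" just so DDP can wrap a module (gloo needs no GPU).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.text_model = torch.nn.Linear(4, 4)  # stands in for CLIP's text_model

    def forward(self, x):
        return self.text_model(x)

wrapped = DDP(Toy())
print(type(wrapped.module.text_model))  # OK: the inner module is reachable via .module
try:
    wrapped.text_model                  # DDP does not proxy this attribute
except AttributeError as e:
    print(e)  # 'DistributedDataParallel' object has no attribute 'text_model'

dist.destroy_process_group()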
@FurkanGozukara
Author

FurkanGozukara commented Apr 18, 2024

Even single-GPU training fails on Kaggle now:

[2024-04-18 00:29:47,958] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1187 closing signal SIGTERM
[2024-04-18 00:29:48,123] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1188) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/kaggle/working/kohya_ss/sd-scripts/train_db.py FAILED

@FurkanGozukara FurkanGozukara changed the title Distributed training is broken AttributeError: 'DistributedDataParallel' object has no attribute 'text_model' New training broken on Kaggle due to DistributedDataParallel and torch.distributed.elastic.multiprocessing.api Apr 18, 2024
@kohya-ss
Owner

DDP training for finetune.py or train_db.py (SD1.5/2.0) with clip_skip >= 2 seems to cause this issue. Could you try without clip_skip?

If it works without clip_skip, the issue is caused by accessing the inner layers of the model directly while the model is wrapped by accelerate. It may need some investigation to solve...
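If that is the cause, a minimal sketch of one possible fix (a hypothetical helper, not the actual sd-scripts code) is to unwrap the model before touching its inner layers; accelerator.unwrap_model(text_encoder) does the same job where an Accelerator instance is in scope:

from torch.nn.parallel import DistributedDataParallel as DDP

def unwrap(model):
    # DDP keeps the original module under .module; pass non-DDP models through.
    return model.module if isinstance(model, DDP) else model

# In get_hidden_states, before the failing line, something like:
#   text_encoder = unwrap(text_encoder)
#   encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)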

@kohya-ss
Owner

Possibly a duplicate of #1099.

@FurkanGozukara
Author

FurkanGozukara commented Apr 21, 2024

DDP training for finetune.py or train_db.py (SD1.5/2.0) with clip_skip >= 2 seems to cause this issue. Could you try without clip_skip?

If it works without clip_skip, the issue is caused by accessing the inner layers of the model directly while the model is wrapped by accelerate. It may need some investigation to solve...

I didn't set clip skip; I use the default value. After I selected a single P100 GPU it worked, but with dual T4 GPUs it always failed.

Yes, my config has "clip_skip": 1,

I train only text encoder 1 and not text encoder 2.

@Nice-Zhang66

I had the same problem as you; may I ask how you eventually solved it?

@FurkanGozukara
Author

I had the same problem as you; may I ask how you eventually solved it?

No, I didn't.

I used only a single GPU to work around the issue.

Before these changes it was working perfectly.

After this topic I didn't try again either.
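For reference, the single-GPU workaround amounts to dropping the --multi_gpu and --num_processes 2 flags from the launch command quoted earlier, roughly like this (a sketch, assuming the same config file and otherwise unchanged flags; the DDP-specific options are no longer needed):

 "/opt/conda/bin/accelerate" launch \
     --dynamo_backend no --dynamo_mode default --gpu_ids 0 \
     --mixed_precision no --num_processes 1 --num_machines 1 \
     --num_cpu_threads_per_process 4 \
     "/kaggle/working/kohya_ss/sd-scripts/train_db.py" \
     --config_file "./outputs/tmpfiledbooth.toml" \
     --max_grad_norm=0.0 --no_half_vae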

@Nice-Zhang66

Thank you for your reply; I will continue to look for a solution.
