Multi-GPU training sample generation problem - FLUX #1578

FurkanGozukara · 2024-09-08T21:45:24Z

One of my followers got the error below while generating samples during 2x GPU training @kohya-ss

Encoding prompt: Style of EC$, a heart shaped character with pink arms and legs, long eyes, and small pink lips. The character is making a V peace sign with one oh his hands. The character is wearing black boots. The background is light blue.
[torch.Size([1, 768]), None, None, None]
  0%|                                                                    | 0/25 [00:00<?, ?it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train_network.py", line 519, in <module>
[rank1]:     trainer.train(args)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/train_network.py", line 1253, in train
[rank1]:     self.sample_images(accelerator, args, epoch + 1, global_step, accelerator.device, vae, tokenizers, text_encoder, unet)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train_network.py", line 291, in sample_images
[rank1]:     flux_train_utils.sample_images(
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_train_utils.py", line 113, in sample_images
[rank1]:     sample_image_inference(
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_train_utils.py", line 229, in sample_image_inference
[rank1]:     x = denoise(flux, noise, img_ids, t5_out, txt_ids, l_pooled, timesteps=timesteps, guidance=scale, t5_attn_mask=t5_attn_mask)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_train_utils.py", line 314, in denoise
[rank1]:     pred = model(
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 819, in forward
[rank1]:     return model_forward(*args, **kwargs)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 807, in __call__
[rank1]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_models.py", line 1004, in forward
[rank1]:     if img.ndim != 3 or txt.ndim != 3:
[rank1]: AttributeError: 'NoneType' object has no attribute 'ndim'
W0908 17:26:18.982000 132414536454144 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 12066 closing signal SIGTERM
E0908 17:26:20.101000 132414536454144 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 12067) of binary: /home/Ubuntu/apps/kohya_ss/venv/bin/python
Traceback (most recent call last):
  File "/home/Ubuntu/apps/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    multi_gpu_launcher(args)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train_network.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-08_17:26:18
  host      : 0053-kci-prxmx10033
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 12067)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
17:26:21-518520 INFO     Training has ended.                                                    
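
Reading the log: the list printed right after the prompt, `[torch.Size([1, 768]), None, None, None]`, suggests that only the first conditioning entry was populated on rank 1, and the traceback ends when `flux_models.py` reads `.ndim` on a `None`. Below is a minimal sketch of a guard that would fail earlier with a clearer message. The helper names `missing_conditioning` and `check_conditioning` are hypothetical and not part of sd-scripts, and the entry order (`l_pooled`, `t5_out`, `txt_ids`, `t5_attn_mask`) is an assumption inferred from the arguments of the `denoise(...)` call in the traceback.

```python
from typing import Any, List, Optional, Sequence

# Assumed order of the four conditioning entries (not confirmed by the log itself).
ASSUMED_NAMES = ("l_pooled", "t5_out", "txt_ids", "t5_attn_mask")

# Entries the forward pass cannot do without; the attention mask is treated as optional here.
REQUIRED = ("l_pooled", "t5_out", "txt_ids")


def missing_conditioning(conds: Sequence[Optional[Any]]) -> List[str]:
    """Return the names of required conditioning entries that are None."""
    named = dict(zip(ASSUMED_NAMES, conds))
    return [name for name in REQUIRED if named.get(name) is None]


def check_conditioning(conds: Sequence[Optional[Any]]) -> None:
    """Raise a descriptive error instead of letting a None reach the model."""
    missing = missing_conditioning(conds)
    if missing:
        raise ValueError("missing text-encoder outputs: " + ", ".join(missing))


if __name__ == "__main__":
    # Mirrors the shape of the list printed in the log (a tuple stands in for torch.Size).
    try:
        check_conditioning([(1, 768), None, None, None])
    except ValueError as err:
        print(err)
```

Calling such a check before the denoising loop would not fix the underlying problem, but it would show which text-encoder outputs are absent on the failing rank.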
