Multi-GPU training sample generation problem - FLUX #1578

Open
FurkanGozukara opened this issue Sep 8, 2024 · 0 comments

One of my followers hit the error below while generating samples during 2x GPU training. The traceback comes from rank 1. @kohya-ss

Encoding prompt: Style of EC$, a heart shaped character with pink arms and legs, long eyes, and small pink lips. The character is making a V peace sign with one oh his hands. The character is wearing black boots. The background is light blue.
[torch.Size([1, 768]), None, None, None]
  0%|                                                                    | 0/25 [00:00<?, ?it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train_network.py", line 519, in <module>
[rank1]:     trainer.train(args)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/train_network.py", line 1253, in train
[rank1]:     self.sample_images(accelerator, args, epoch + 1, global_step, accelerator.device, vae, tokenizers, text_encoder, unet)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train_network.py", line 291, in sample_images
[rank1]:     flux_train_utils.sample_images(
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_train_utils.py", line 113, in sample_images
[rank1]:     sample_image_inference(
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_train_utils.py", line 229, in sample_image_inference
[rank1]:     x = denoise(flux, noise, img_ids, t5_out, txt_ids, l_pooled, timesteps=timesteps, guidance=scale, t5_attn_mask=t5_attn_mask)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_train_utils.py", line 314, in denoise
[rank1]:     pred = model(
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 819, in forward
[rank1]:     return model_forward(*args, **kwargs)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 807, in __call__
[rank1]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_models.py", line 1004, in forward
[rank1]:     if img.ndim != 3 or txt.ndim != 3:
[rank1]: AttributeError: 'NoneType' object has no attribute 'ndim'
W0908 17:26:18.982000 132414536454144 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 12066 closing signal SIGTERM
E0908 17:26:20.101000 132414536454144 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 12067) of binary: /home/Ubuntu/apps/kohya_ss/venv/bin/python
Traceback (most recent call last):
  File "/home/Ubuntu/apps/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    multi_gpu_launcher(args)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train_network.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-08_17:26:18
  host      : 0053-kci-prxmx10033
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 12067)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
17:26:21-518520 INFO     Training has ended.                                                    
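
To make the failure easier to read: the list printed right before the progress bar shows only one tensor of shape [1, 768] followed by three None entries, and the crash is a None reaching `img.ndim` / `txt.ndim` in `flux_models.py`. Below is a minimal sketch of a check that would report this earlier with a clearer message. It is not code from sd-scripts. The helper names `find_missing` and `require_conditioning` are hypothetical, and the mapping of the four printed entries to `l_pooled`, `t5_out`, `txt_ids`, `t5_attn_mask` is an assumption based only on the argument names in the `denoise(...)` call in the traceback.

```python
from typing import Optional, Sequence


def find_missing(names: Sequence[str], values: Sequence[Optional[object]]) -> list[str]:
    """Return the names whose corresponding value is None."""
    return [name for name, value in zip(names, values) if value is None]


def require_conditioning(names: Sequence[str], values: Sequence[Optional[object]]) -> None:
    """Raise a descriptive error instead of failing later on `.ndim`."""
    missing = find_missing(names, values)
    if missing:
        raise ValueError("missing text-encoder outputs: " + ", ".join(missing))


# ASSUMPTION: this order is a guess that mirrors the denoise() arguments;
# the log itself does not label the four entries.
names = ["l_pooled", "t5_out", "txt_ids", "t5_attn_mask"]

# Stand-in for the state in the log: one value present, three None.
values = [object(), None, None, None]

missing = find_missing(names, values)
print(missing)
```

Calling something like `require_conditioning(names, values)` on each rank before sampling would show which outputs are absent on rank 1, which may help narrow down whether the text encoder outputs are only prepared on the main process.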
