One of my followers got the error below while generating sample images during 2x GPU training. @kohya-ss
Encoding prompt: Style of EC$, a heart shaped character with pink arms and legs, long eyes, and small pink lips. The character is making a V peace sign with one oh his hands. The character is wearing black boots. The background is light blue.
[torch.Size([1, 768]), None, None, None]
0%| | 0/25 [00:00<?, ?it/s]
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train_network.py", line 519, in <module>
[rank1]: trainer.train(args)
[rank1]: File "/home/Ubuntu/apps/kohya_ss/sd-scripts/train_network.py", line 1253, in train
[rank1]: self.sample_images(accelerator, args, epoch + 1, global_step, accelerator.device, vae, tokenizers, text_encoder, unet)
[rank1]: File "/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train_network.py", line 291, in sample_images
[rank1]: flux_train_utils.sample_images(
[rank1]: File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_train_utils.py", line 113, in sample_images
[rank1]: sample_image_inference(
[rank1]: File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_train_utils.py", line 229, in sample_image_inference
[rank1]: x = denoise(flux, noise, img_ids, t5_out, txt_ids, l_pooled, timesteps=timesteps, guidance=scale, t5_attn_mask=t5_attn_mask)
[rank1]: File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_train_utils.py", line 314, in denoise
[rank1]: pred = model(
[rank1]: File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 819, in forward
[rank1]: return model_forward(*args, **kwargs)
[rank1]: File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 807, in __call__
[rank1]: return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank1]: File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/Ubuntu/apps/kohya_ss/sd-scripts/library/flux_models.py", line 1004, in forward
[rank1]: if img.ndim != 3 or txt.ndim != 3:
[rank1]: AttributeError: 'NoneType' object has no attribute 'ndim'
W0908 17:26:18.982000 132414536454144 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 12066 closing signal SIGTERM
E0908 17:26:20.101000 132414536454144 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 12067) of binary: /home/Ubuntu/apps/kohya_ss/venv/bin/python
Traceback (most recent call last):
File "/home/Ubuntu/apps/kohya_ss/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
multi_gpu_launcher(args)
File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
distrib_run.run(args)
File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train_network.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-08_17:26:18
host : 0053-kci-prxmx10033
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 12067)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
17:26:21-518520 INFO Training has ended.
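A possible reading of the log, not a confirmed diagnosis: the list printed right after "Encoding prompt" holds one tensor shape and three None entries, and the crash on rank 1 is a None reaching the `ndim` check in `flux_models.py`. The sketch below is a minimal guard that would report such a gap before the model call. It assumes the four entries correspond, in order, to `l_pooled`, `t5_out`, `txt_ids` and `t5_attn_mask` (names taken from the `denoise` call in the traceback; the ordering is a guess). The helpers `find_missing_outputs` and `require_outputs` are hypothetical and not part of sd-scripts.

```python
from types import SimpleNamespace


def find_missing_outputs(outputs, labels):
    """Return the labels whose corresponding encoder output is None."""
    return [label for label, out in zip(labels, outputs) if out is None]


def require_outputs(outputs, labels, required):
    """Raise a descriptive error if any required encoder output is None."""
    missing = [
        label for label in find_missing_outputs(outputs, labels) if label in required
    ]
    if missing:
        raise ValueError("missing text encoder outputs: " + ", ".join(missing))


# Assumed ordering of the logged list; only the first entry was a tensor.
labels = ["l_pooled", "t5_out", "txt_ids", "t5_attn_mask"]

# Stand-in for the logged state: an object with ndim, followed by three Nones.
logged = [SimpleNamespace(ndim=2), None, None, None]

try:
    # t5_out and txt_ids are what the failing forward() would need as text input.
    require_outputs(logged, labels, required={"t5_out", "txt_ids"})
except ValueError as err:
    print(err)
```

With a check like this placed before `denoise`, the failure would name the missing inputs on the affected rank instead of surfacing as an AttributeError inside the model's forward pass.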