fix for multi gpu training #247
Conversation
Thank you for this! I had forgotten about it.
I have 2 4090s, so I can test this to train a LoRA. Do I have to add a parameter to the training command, or do I have to select multi-GPU in the accelerate config?
@Panchvzluck Other than that, the default settings are fine. What I haven't been able to test is whether the model output during training is partially corrupted. https://github.com/kohya-ss/sd-scripts/pull/247/files#diff-62cf7de156b588b9acd7af26941d7bb189368221946c8b5e63f69df5cda56f39R581-R584
When trying to train, I get this issue (I'm on Windows 10)
My accelerate configs were:
Also, as another question in the meantime: is it possible to run training on separate cards at the same time? For example, train LoRA 1 on GPU 1 and LoRA 2 on GPU 2. I tried with separate venvs/folders, but it seems both use the same accelerate config.

EDIT: Just found you can use `--multi_gpu` to use multiple GPUs without needing to change the accelerate config. Now it gives the same issue mentioned above.

EDIT2: You can specify which GPU to use with `--gpu_ids`.

EDIT3: It seems on Windows you have to use gloo instead of NCCL, but I'm not sure how to do that.
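Putting the flags from the edits above together, the launch commands might look like this (a sketch only; the script arguments and config paths are placeholders, not from this thread):

```shell
# Run one training job across both GPUs, overriding the saved accelerate
# config on the command line (--multi_gpu, --num_processes, --gpu_ids are
# standard `accelerate launch` flags).
accelerate launch --multi_gpu --num_processes 2 --gpu_ids "0,1" \
    train_network.py <your-training-args>

# Or pin a second, independent job to a single card (e.g. LoRA 2 on GPU 1).
accelerate launch --num_processes 1 --gpu_ids "1" \
    train_network.py <your-other-training-args>
```

This is how two LoRAs could be trained in parallel on separate cards without editing the shared accelerate config file.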
After looking into it, it seems to be an error that occurs in the Windows environment.

```python
import os
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
```
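A more general version of that workaround could pick the backend by platform; note the OS check is my assumption — the thread only establishes that gloo works on Windows where NCCL does not:

```python
import os
import platform

# NCCL has no Windows support, so fall back to gloo there;
# keep nccl on Linux, where it is the faster multi-GPU backend.
backend = "gloo" if platform.system() == "Windows" else "nccl"

# This must be set before torch.distributed is initialized.
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = backend
```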
@ddPn08 In the end, I managed to fix it by replacing all "nccl" string values with "gloo" in
I tried that before, and sadly it didn't fix it. Then I started to train a LoRA, and for some reason the output seems to be duplicated? Is that expected behaviour? Also, my settings were
But then it does 4 epochs, and the output looks like this
Both cards seem to use ~20GB of VRAM each, so it would effectively be "faster" to run one training instance per GPU for different LoRAs than to use both cards for one LoRA. Now, I checked the 4 epochs with
And it does give the expected results. So the output is not corrupted!
Thank you! Some duplication of output is normal.
I got it working, so I opened a PR.
I am using 4 GPUs on Windows 10. Although the training process runs smoothly, I encounter an error when attempting to use the generated safetensors checkpoint for txt2img.
@HayashiItsuki |
Thank you for opening the PR! It looks good. I will wait a little while for a comment from @HayashiItsuki.
Sorry for the delay. I had to redo it 3 times, but all were successful and I was able to use the model without any problems. (I think the reason the first one didn't succeed may be my environment.)
@ddPn08 I still can't figure out how to run training on multiple video cards. Where do I specify the parameters, and which ones?
Sorry to bump an old thread, but since SDXL came out, multi-GPU training is a really big plus. I have 2x4090, and using `--multi_gpu` seems to work, but I'm not sure if I should set the epoch count to half of what single-GPU training would use. Assuming batch size 1: for example, if I set the epoch count to 6 and total steps to 5000, it effectively does 6 epochs in 5000 steps. If I set the epoch count to 6, total steps to 5000, and multi-GPU (2 GPUs), I get 12 epochs but the same 5000 total steps. Is it training 2 batches per step when using 2 GPUs?
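If that interpretation is right (with data parallelism, each optimizer step consumes one batch per GPU), the epoch doubling follows from simple arithmetic. A sketch with hypothetical numbers — the 1000-image dataset below is an illustration, not from this thread:

```python
def epochs_for_steps(total_steps, dataset_size, batch_size, num_gpus):
    # Each optimizer step consumes one batch on every GPU,
    # so samples per step scale with the number of GPUs.
    samples_per_step = batch_size * num_gpus
    steps_per_epoch = dataset_size // samples_per_step
    return total_steps // steps_per_epoch

# Hypothetical 1000-image dataset, batch size 1:
print(epochs_for_steps(5000, 1000, 1, 1))  # 5 epochs on one GPU
print(epochs_for_steps(5000, 1000, 1, 2))  # 10 epochs on two GPUs
```

This matches the observation above: a fixed 5000-step budget covers the dataset twice as many times with 2 GPUs, so halving either the step budget or the epoch count recovers the single-GPU amount of training.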
20:29:46-877134 INFO accelerate launch --gpu_ids="0,1" --multi_gpu --num_cpu_threads_per_process=4 Traceback (most recent call last):
Multi-GPU Kaggle training is broken. Does anyone have any guess how to fix it?
Fixes for multi-GPU training with train_network.py.
I haven't fully tested it yet, so I'm submitting it as a draft. I'll open it once everything works correctly.
It would be helpful if you could let me know if you notice anything.