Training on 2x H100 on Ubuntu and speed is the same as 1x H100, what are we doing wrong? #1434
Comments
You could probably provide a copy of the toml, as this is what sd-scripts ultimately consumes, and it should make it easier for @kohya-ss to troubleshoot without being concerned with the GUI config. Many users have been complaining about issues with multiple GPUs, so I am curious to learn if perhaps it is something I am doing wrong with the GUI… like not handling parameters properly, or not allowing needed parameters to be entered.
here it is
@aria1th @BootsofLagrangian any ideas?
AFAIK batch size is per device, so the effective batch size is 4 × 2 = 8, which is why it's about half as fast per step.
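For reference, a minimal sketch of what "per device" means at launch time. The script name and flags below are illustrative (sd-scripts' `--train_batch_size` is per process, and `accelerate` replicates it across processes); they are not the exact command from this report:

```bash
# Hypothetical comparison, assuming sd-scripts' sdxl_train.py and standard accelerate flags.
# Per-device batch 4 on 2 GPUs -> effective global batch 4 x 2 = 8 per optimizer step.
accelerate launch --multi_gpu --num_processes=2 sdxl_train.py --train_batch_size=4 ...

# A like-for-like single-GPU run at the same global batch would be:
accelerate launch --num_processes=1 sdxl_train.py --train_batch_size=8 ...
```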
I know it is. Each GPU could go up to a maximum batch size of 7, I tested. It still wouldn't make a difference since the communication overhead is just crazy. Before this new multi-GPU training system it was way faster. I was doing dual T4 GPU training on Kaggle and there was almost no such communication delay. Moreover, with the new system I could never make it work on Kaggle either.
The slight performance degradation is expected due to communication overhead, it's normal. It's more bottlenecked by the system hardware itself - which is why everyone is trying to build a "less communication-bottlenecked system", and even B100 / B200 / etc., as NVIDIA says. GCP always knew that hardware is the most important part - you would never get a bottleneck from that, but if you're using another service provider, you should check those factors... But if it's 'version dependent' then uhh..... the kohya script does not handle communication, accelerate does...
@aria1th this was on the same machine rented on Massed Compute. What hardware do I have to check? This speed loss is just huge. Maybe I am doing something wrong?
Mainboard, storage, RAM, CPU... a bottleneck can happen from various causes, and you have to check them all first.
I doubt that any of them is the cause. You get a very powerful VM, and the single-GPU speed looks right. But let's say one of them is the cause, how would I debug it?
Have you recently tried the version that used to work fine on the same system? It is possible the hosting has changed the type of machine and that is resulting in this issue. If the speed is back up, then you could provide kohya with information about which sd-scripts code base used to work best, and he might be able to pinpoint where the speed issue is coming from.
It was a very long time ago that I used dual GPUs successfully. I have a video from 7 months ago :D I can try, maybe.
Do your H100s connect via NVLink, or just PCIe? If it is PCIe, speed degradation occurs due to the PCIe communication bottleneck.
Just asked them, let's see what they say. Can we see it somehow on the machine with a command etc.?
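For reference, one way to check this from inside the VM, assuming a standard NVIDIA driver install (the exact output varies by machine):

```bash
# Show how the GPUs are connected to each other.
# "NV1"/"NV2"/... in the matrix means an NVLink connection; "PIX", "PHB" or "SYS" means the path goes over PCIe.
nvidia-smi topo -m

# Show NVLink status per GPU; on PCIe-only boards this typically reports nothing or that NVLink is unsupported.
nvidia-smi nvlink --status
```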
OK, it turns out they are all PCIe. So I assume we can't get any better, right?
Okay, there is a hardware bottleneck. And I think you can still get a faster total training time using two H100s, just not a faster time per step, i.e. one H100: 1.27 s/it vs. two H100s: 2.07 s/it, but each two-GPU step processes twice the batch. If you have the budget to buy NVLink, it is the faster way to speed up your H100s. If you don't want to buy it, XD. Additionally, the speed degradation due to communication is not your fault. It is just that the H100 has far higher memory bandwidth than PCIe, e.g. H100 HBM (~2 TB/s) vs. PCIe 4.0 (16 GT/s per lane, roughly 32 GB/s for an x16 link).
@BootsofLagrangian it is not like I purchased them, I am using Massed Compute :) They said they have SXM4 A100s. I will test the script there. It is not supposed to get degraded speed like this. We will see :)
Most SXM4 systems run with an interconnect (NVLink, NVSwitch), so little to no degradation is natural there, but most PCIe systems do not. PCIe-powered GPUs need an external interconnect device.
@kohya-ss the training fails on an SXM4 machine :( When 1 GPU is used it works, here is the batch size 7 speed. When I try 2 GPUs like below, it fails. I tested all of the dynamo backends, all failed.
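As a sanity check, one option is to rule dynamo out entirely and launch with explicit, minimal accelerate settings instead of a saved config. This is an illustrative command only, not the exact one from this report; the script name and trailing arguments are placeholders:

```bash
# Minimal multi-GPU launch with dynamo disabled, bypassing any saved accelerate config.
# Replace sdxl_train.py and the trailing arguments with the actual script and options used.
accelerate launch \
  --multi_gpu \
  --num_processes=2 \
  --num_machines=1 \
  --mixed_precision=bf16 \
  --dynamo_backend=no \
  sdxl_train.py --config_file your_config.toml
```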
I've been training multi-GPU for months using both the GUI and the CLI. I think this issue might be related to the CUDA version itself more than kohya. I've had this happen to me once in the past, where it couldn't register some specific CUDA services.
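For anyone comparing setups, a quick way to capture the relevant versions on both machines (standard commands, offered here as a suggestion):

```bash
# Driver and runtime visible to the system
nvidia-smi
# CUDA toolkit version, if installed
nvcc --version
# PyTorch build plus the CUDA / NCCL versions it was built against
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"
```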
The lack of FlashAttention 3 is rearing its ugly head; we don't even have TMA support for the H100 in kohya, among other things.
Multi-GPU training worked on the PCIe machine on Massed Compute, but with SXM I got the above error. Do you know how to fix it? How do you set up your accelerator? What CUDA version do you have on your SXM machine?
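For comparing accelerator setups between machines, `accelerate` ships commands that dump the current configuration (shown here as a suggestion; output differs per machine):

```bash
# Print the accelerate version, platform info, and the saved launch configuration
accelerate env

# Re-run the interactive setup if the saved config needs to change (e.g. number of GPUs)
accelerate config
```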
@bmaltais there is nothing wrong with your interface or kohya's script; you've done a great job, although some descriptions you have in there are not totally accurate, but that's not your fault.
This is a hardware error. You should contact the compute provider because you've got a faulty node.
Thanks, I did. It could be the reason.
When training with batch size 4 on 1x H100, the speed is 1.27 s/it
When training with batch size 4 on 2x H100, the speed is 2.05 s/it
So basically we got almost no speed boost from multi-GPU training
Is this expected? I am training the SDXL RealVis XL model at 1024 resolution with no bucketing
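A rough throughput comparison, under the assumption that batch size 4 is per device: one H100 processes 4 images / 1.27 s ≈ 3.1 images/s, while two H100s process 8 images / 2.05 s ≈ 3.9 images/s, so two GPUs deliver only about a 1.24x throughput gain rather than the ~2x one might hope for.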
We are using the latest bmaltais Kohya GUI on Ubuntu with the below multi-GPU configuration
@kohya-ss @bmaltais
Below is the training JSON config
TOML file