New training broken on Kaggle due to DistributedDataParallel and torch.distributed.elastic.multiprocessing.api #1272
Comments
Even single-GPU training fails on Kaggle now.
If it works without clip_skip, the problem is caused by accessing the inner layers of the model directly, while the model has been wrapped by the accelerator for DDP training. It may need some investigation to solve the issue...
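For context, here is a minimal sketch (not the sd-scripts code) of why direct attribute access breaks once a model is wrapped by `DistributedDataParallel`, and why unwrapping (`model.module`, or `accelerator.unwrap_model`) is needed. The `TextEncoderLike` class is a hypothetical stand-in for a text encoder with inner layers:

```python
# Minimal single-process sketch using the gloo backend so it runs on CPU.
import os
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
torch.distributed.init_process_group(backend="gloo", rank=0, world_size=1)

class TextEncoderLike(nn.Module):
    # Hypothetical stand-in for a text encoder with inner transformer layers.
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = DDP(TextEncoderLike())

# Direct access to an inner attribute fails on the DDP-wrapped model:
try:
    _ = model.layers[-2]  # the kind of access clip_skip-style code may do
except AttributeError as e:
    print("direct access failed:", e)

# Unwrapping first (model.module, or accelerator.unwrap_model) works:
print("unwrapped access ok:", model.module.layers[-2])

torch.distributed.destroy_process_group()
```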
Possibly a duplicate of #1099.
I didn't set clip skip; I use the default value. After I selected a single P100 GPU it worked, but with dual T4 GPUs it always failed. Yes, my config has "clip_skip": 1, and I train only text encoder 1, not text encoder 2.
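For context, here is a hedged sketch of how a clip_skip-style option is commonly implemented with Hugging Face transformers: an earlier hidden state of the text encoder is selected instead of the final output, which is why such code may need to reach into the encoder's inner modules (the step that fails when the encoder is DDP-wrapped). The model name below is only an example, not the reporter's checkpoint:

```python
# Hedged sketch (not sd-scripts' actual code) of a clip_skip-style lookup.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["a photo of a cat"], return_tensors="pt")
with torch.no_grad():
    out = text_encoder(**tokens, output_hidden_states=True)

clip_skip = 1  # 1 -> last encoder layer, 2 -> second-to-last, etc.
hidden = out.hidden_states[-clip_skip]

# Many implementations re-apply the final layer norm to the skipped state;
# this reaches into the encoder's inner modules and breaks if the encoder
# is still wrapped by DistributedDataParallel.
hidden = text_encoder.text_model.final_layer_norm(hidden)
print(hidden.shape)
```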
I had the same problem as you; may I ask how you eventually solved it?
No, I didn't. I only used a single GPU to work around the issue. Before these changes it was working perfectly; after this topic I didn't try again either.
Thank you for your reply, I will continue to look for a solution. |
I am trying to do multi-GPU training on Kaggle. Previously it was working great, but after all these new changes I am getting the error below.

My train command is like this:
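(The actual command and error output are missing from this excerpt. Purely as an illustration of how multi-GPU runs are often launched from a Kaggle notebook, here is a minimal sketch using accelerate's `notebook_launcher`; the training function and arguments are hypothetical placeholders, not the reporter's setup.)

```python
# Illustrative sketch only: launch one process per GPU from a notebook.
from accelerate import notebook_launcher

def training_function():
    # Hypothetical placeholder for the training entry point; each spawned
    # process runs this on its own GPU.
    pass

# Kaggle's dual-T4 instances expose two GPUs, so two processes are spawned.
notebook_launcher(training_function, args=(), num_processes=2)
```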