Zipformer training with pytorch2.1.0 fails using DDP #1390
Comments
Could you post the output of
I get the following logs after starting the docker container:
After that I don't see any further entries, although I have started the training and it has crashed again. Edit: when using a working image like
Please post the logs immediately after the error, not after the start.
There are no new entries after the training crashed …
Can you try a larger value?
I tried with
I have just done the experiment on another machine, with the same results: running zipformer for minilibrispeech using the
There are still no entries in the logs.
System under test:
Is it working for anyone?
From my past experiments with the 'official' Docker images we know:
So my next idea was to find out whether the bug comes from
Because I couldn't find a base image with torch2.0.0 and cuda11.8, I switched back to my custom nix workflow for further debugging. At first I created a nix devShell with
After that I bumped cuda to
So I bumped torch to
Does anyone have any idea what could be going wrong here? Is it my environment, or is it a general problem of k2/icefall with pytorch 2.1.0? (BTW: still no information in
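One way to narrow this down is a bare DDP smoke test that doesn't touch k2 or icefall at all. Below is a minimal sketch (the toy model, master port, and step count are arbitrary assumptions, not taken from this thread); if it also hangs or crashes inside the torch 2.1.0 container, the problem is in the environment rather than in icefall.

```python
# Minimal DDP smoke test, independent of k2/icefall (sketch only; the
# master port, toy model, and step count are illustrative assumptions).
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"  # arbitrary free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(10, 10).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        opt.zero_grad()
        out = model(torch.randn(8, 10, device=f"cuda:{rank}"))
        out.sum().backward()  # exercises the NCCL all-reduce
        opt.step()

    print(f"rank {rank}: ok, torch {torch.__version__}, cuda {torch.version.cuda}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```

If this finishes cleanly on all ranks, plain PyTorch DDP with NCCL is fine in the container and the failure is more likely in the icefall/k2 side.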
Looks like this could be related to #1395.
This issue should be fixed in #1424. Please use the latest master. (Note: you can reuse your existing docker image. You only need to run
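The exact command is missing above; purely as an illustration (the checkout path is an assumption, not the maintainers' instruction), refreshing the icefall sources inside the existing container could look roughly like this:

```python
# Hypothetical sketch: update the icefall checkout inside the existing
# container rather than pulling a new image, then show the latest commit
# to confirm the fix is included. The path is an assumed location.
import subprocess

ICEFALL_DIR = "/workspace/icefall"  # assumption, adjust to your image's layout

subprocess.run(["git", "-C", ICEFALL_DIR, "pull"], check=True)
print(
    subprocess.run(
        ["git", "-C", ICEFALL_DIR, "log", "-1", "--oneline"],
        capture_output=True, text=True, check=True,
    ).stdout
)
```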
Thanks, works now. |
We are currently trying to train a zipformer model on a new machine.
However, with the latest docker image it fails with the following traceback:
Training e.g. the yesno example on GPUs works just fine.
This can be reproduced using an official image without any customizations:
I tried the above command with other images and some did work:
The system under test:
Any idea what's going on?
Is this a known issue with the recent pytorch 2.1.0?
I first thought that it could be due to the newer CUDA, but I don't understand why the yesno example would run if the problem were related to CUDA.
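To make the PyTorch-vs-CUDA question easier to answer, the relevant versions can be printed from inside the failing container with standard torch calls; this is just a generic sketch, nothing icefall-specific:

```python
# Quick environment report: the torch / CUDA / cuDNN / NCCL combination and
# the visible GPUs. Uses standard PyTorch APIs only.
import torch

print("torch          :", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuDNN          :", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("NCCL           :", ".".join(map(str, torch.cuda.nccl.version())))
    for i in range(torch.cuda.device_count()):
        print(f"cuda:{i}         :", torch.cuda.get_device_name(i))
```

Running `python -m torch.utils.collect_env` gives an even fuller report if needed.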
I am very grateful for any advice.