Zipformer training with pytorch2.1.0 fails using DDP #1390

Closed

matthiasdotsh opened this issue Nov 23, 2023 · 11 comments

matthiasdotsh commented Nov 23, 2023

We are currently trying to train a zipformer model on a new machine.

However (using the latest docker image) it fails with the following traceback:

2023-11-23 13:20:46,754 INFO [train.py:1124] (0/2) About to create model
2023-11-23 13:20:47,019 INFO [train.py:1128] (1/2) Number of model parameters: 27596865
2023-11-23 13:20:47,025 INFO [train.py:1128] (0/2) Number of model parameters: 27596865
2023-11-23 13:20:47,183 INFO [train.py:1143] (1/2) Using DDP
2023-11-23 13:20:47,404 INFO [train.py:1143] (0/2) Using DDP
Traceback (most recent call last):
 File "/local/path/to/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1389, in <module>
   main()
 File "/local/path/to/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1380, in main
   mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
 File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
   return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
 File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
   while not context.join():
 File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 145, in join
   raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT

Training e.g. the yesno example on GPUs works just fine.

This can be reproduced using an official image without any customizations:

docker run -it --runtime=nvidia --shm-size=2gb --name=icefall --gpus all k2fsa/icefall:torch2.1.0-cuda12.1 /bin/bash
cd /workspace/icefall/egs/librispeech/ASR/
./prepare.sh
CUDA_VISIBLE_DEVICES="0,1" ./zipformer/train.py \
 --world-size 2 \
 --num-epochs 30 \
 --start-epoch 1 \
 --use-fp16 1 \
 --exp-dir zipformer/exp \
 --causal 1 \
 --full-libri 0 \
 --max-duration 200 \
 --num-encoder-layers 2,2,3,3,2,2 \
 --feedforward-dim 512,768,768,768,768,768 \
 --encoder-dim 192,256,256,256,256,256 \
 --encoder-unmasked-dim 192,192,192,192,192,192

I tried the above command with other images and some did work:

  • custom nix setup: torch.2.1.0-cuda12.1 (python 3.11.6) -> not working
  • official k2fsa/icefall:torch2.1.0-cuda12.1 (python 3.10.13) -> not working
  • official k2fsa/icefall:torch2.1.0-cuda11.8 (python 3.10.13) -> not working
  • official k2fsa/icefall:torch2.0.0-cuda11.7 (python 3.10.0) -> working
  • official k2fsa/icefall:torch1.9.0-cuda10.2 (python 3.7.10) -> working
  • official k2fsa/icefall:torch1.13.0-cuda11.6 (python 3.9.12) -> working
  • official k2fsa/icefall:torch1.12.1-cuda11.3 (python 3.7.13) -> working

The system under test:

  • runs AlmaLinux 9.2
  • 2x NVIDIA GeForce RTX 2080 SUPER
  • Driver Version: 535.104.12
  • CUDA Version: 12.2

Any idea what's going on?

Is this a known issue with the recent pytorch 2.1.0?

I first thought it could be due to the new CUDA version, but I don't understand why the yesno example would run if it were related to CUDA.

I would be very grateful for any advice.

@csukuangfj (Collaborator)

torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT

Could you post the output of dmesg immediately after you get the above error?
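
For example (reading the kernel ring buffer may require root, depending on the kernel.dmesg_restrict setting):

sudo dmesg -T | tail -n 100   # kernel messages with human-readable timestamps, most recent last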

matthiasdotsh commented Nov 23, 2023

I get the following logs after starting the docker container:

[Thu Nov 23 15:53:52 2023] docker0: port 2(veth7e3bc5c) entered disabled state
[Thu Nov 23 15:53:52 2023] veth766da7b: renamed from eth0
[Thu Nov 23 15:53:52 2023] docker0: port 2(veth7e3bc5c) entered disabled state
[Thu Nov 23 15:53:52 2023] device veth7e3bc5c left promiscuous mode
[Thu Nov 23 15:53:52 2023] docker0: port 2(veth7e3bc5c) entered disabled state
[Thu Nov 23 15:53:59 2023] docker0: port 2(vethcdb2b28) entered blocking state
[Thu Nov 23 15:53:59 2023] docker0: port 2(vethcdb2b28) entered disabled state
[Thu Nov 23 15:53:59 2023] device vethcdb2b28 entered promiscuous mode
[Thu Nov 23 15:53:59 2023] eth0: renamed from veth6f5933d
[Thu Nov 23 15:53:59 2023] IPv6: ADDRCONF(NETDEV_CHANGE): vethcdb2b28: link becomes ready
[Thu Nov 23 15:53:59 2023] docker0: port 2(vethcdb2b28) entered blocking state
[Thu Nov 23 15:53:59 2023] docker0: port 2(vethcdb2b28) entered forwarding state

After that I don't see any further entries, although I have started the training and it has crashed again.
I'm not sure if it makes a difference here, but I call dmesg as a non-root user.

Edit: When using a working image like k2fsa/icefall:torch1.12.1-cuda11.3, the output of dmesg -T looks similar.

@csukuangfj (Collaborator)

I get the following logs after starting the docker container:

Please post the logs immediately after the error, not after the start.

@matthiasdotsh (Author)

Please post the logs immediately after the error, not after the start.

There are no new entries after the training crashed …

@csukuangfj (Collaborator)

--shm-size=2gb

Can you try a larger value?
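
For example, recreate the container with a much larger value (16 GB below is just an arbitrary test value):

docker rm -f icefall   # remove the old container first
docker run -it --runtime=nvidia --shm-size=16gb --name=icefall --gpus all k2fsa/icefall:torch2.1.0-cuda12.1 /bin/bash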

matthiasdotsh commented Nov 24, 2023

--shm-size=2gb
Can you try a larger value?

I tried with --shm-size=20gb, but the result is the same as before: it works with the old images but not with the newer ones, and there are no entries in dmesg.

matthiasdotsh commented Nov 24, 2023

I have just run the experiment on another machine, with the same results:

Running zipformer on minilibrispeech with the k2fsa/icefall:torch1.12.1-cuda11.3 docker image works fine; running it with k2fsa/icefall:torch2.1.0-cuda12.1 gives

torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT

There are still no entries in dmesg

System under test:

  • CentOS 7
  • 2x NVIDIA A40
  • Driver Version: 545.23.06
  • CUDA Version: 12.3

Is it working for anyone?

matthiasdotsh commented Dec 1, 2023

From my past experiments with the 'official' Docker images, we know:

  • official k2fsa/icefall:torch2.0.0-cuda11.7 -> working
    • python3.10.0
  • official k2fsa/icefall:torch2.1.0-cuda11.8 -> not working
    • python3.10.13

So my next idea was to find out whether the bug comes from cuda11.7 -> cuda11.8 or from torch2.0.0 -> torch2.1.0.

Because I couldn't find a base image with torch2.0.0 and cuda11.8, I switched back to my custom nix workflow for further debugging.

At first I created a nix devShell with torch2.0.0 and cuda11.7, and, as with the official docker image, training works fine.

After that I bumped cuda to cuda11.8 (+torch2.0.0 and python310) and still everything works fine.
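
A quick way to confirm which torch/CUDA build such a shell actually loads (just a sanity check, not part of the reproduction):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"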

So I bumped torch to torch2.1.0 (+cuda11.8 and python310) and I can reproduce the known error again (but with a longer traceback):

2023-12-01 11:16:57,596 INFO [train.py:1124] (1/2) About to create model
2023-12-01 11:16:57,886 INFO [train.py:1128] (0/2) Number of model parameters: 27596865
2023-12-01 11:16:57,893 INFO [train.py:1128] (1/2) Number of model parameters: 27596865
2023-12-01 11:16:58,040 INFO [train.py:1143] (1/2) Using DDP
2023-12-01 11:16:58,206 INFO [train.py:1143] (0/2) Using DDP
terminate called after throwing an instance of 'c10::Error'
  what():  Error: cannot set number of interop threads after parallel work has started or set_num_interop_threads called
Exception raised from set_num_interop_threads at ../aten/src/ATen/ParallelThreadPoolNative.cpp:54 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fffccc87617 in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7fffccc42a56 in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1820f2f (0x7fffafa20f2f in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x706cba (0x7fffc5f06cba in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>

terminate called after throwing an instance of 'c10::Error'
  what():  Error: cannot set number of interop threads after parallel work has started or set_num_interop_threads called
Exception raised from set_num_interop_threads at ../aten/src/ATen/ParallelThreadPoolNative.cpp:54 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fffccc87617 in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7fffccc42a56 in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1820f2f (0x7fffafa20f2f in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x706cba (0x7fffc5f06cba in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>

Traceback (most recent call last):
  File "/path/to/local/workspace/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1389, in <module>
    main()
  File "/path/to/local/workspace/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1380, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 145, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT

Does anyone have any idea what could be going wrong here?

Is it my environment, or is it a general problem of k2/icefall with pytorch2.1.0?

(BTW: Still no information in dmesg)
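
For reference: torch.set_num_interop_threads() may only be called once, and only before any inter-op parallel work has started; a second call raises a RuntimeError with the same message as in the trace above. A minimal sketch (not the icefall code) that reproduces the error text:

python3 -c "import torch; torch.set_num_interop_threads(1); torch.set_num_interop_threads(1)"   # second call fails

The "terminate called after throwing an instance of 'c10::Error'" lines suggest that in our case the exception is thrown outside of Python right after the "Using DDP" step and aborts the worker, which would explain the SIGABRT.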

matthiasdotsh changed the title from "Zipformer training with recent Dockerimage fails using DDP" to "Zipformer training with pytorch2.1.0 fails using DDP" on Dec 1, 2023

@matthiasdotsh (Author)

Looks like this could be related to #1395

@csukuangfj (Collaborator)

This issue should be fixed in #1424

Please use the latest master.

(Note that you can reuse your existing docker image; you only need to run git pull inside your container to use the latest code.)
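
For example (assuming the container named icefall and the /workspace/icefall checkout from the reproduction steps above):

docker start -ai icefall        # or: docker exec -it icefall /bin/bash
cd /workspace/icefall
git pull                        # picks up the fix from #1424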

@matthiasdotsh (Author)

Thanks, works now.
