Zipformer training with pytorch2.1.0 fails using DDP #1390

Closed

matthiasdotsh opened this issue Nov 23, 2023 · 11 comments

matthiasdotsh commented Nov 23, 2023

We are currently trying to train a zipformer model on a new machine.

However (using the latest docker image) it fails with the following traceback:

2023-11-23 13:20:46,754 INFO [train.py:1124] (0/2) About to create model
2023-11-23 13:20:47,019 INFO [train.py:1128] (1/2) Number of model parameters: 27596865
2023-11-23 13:20:47,025 INFO [train.py:1128] (0/2) Number of model parameters: 27596865
2023-11-23 13:20:47,183 INFO [train.py:1143] (1/2) Using DDP
2023-11-23 13:20:47,404 INFO [train.py:1143] (0/2) Using DDP
Traceback (most recent call last):
 File "/local/path/to/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1389, in <module>
   main()
 File "/local/path/to/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1380, in main
   mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
 File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
   return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
 File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
   while not context.join():
 File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 145, in join
   raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT

Training e.g. the yesno example on GPUs works just fine.

This can be reproduced using an official image without any customizations:

docker run -it --runtime=nvidia --shm-size=2gb --name=icefall --gpus all k2fsa/icefall:torch2.1.0-cuda12.1 /bin/bash
cd /workspace/icefall/egs/librispeech/ASR/
./prepare.sh
CUDA_VISIBLE_DEVICES="0,1" ./zipformer/train.py \
 --world-size 2 \
 --num-epochs 30 \
 --start-epoch 1 \
 --use-fp16 1 \
 --exp-dir zipformer/exp \
 --causal 1 \
 --full-libri 0 \
 --max-duration 200 \
 --num-encoder-layers 2,2,3,3,2,2 \
 --feedforward-dim 512,768,768,768,768,768 \
 --encoder-dim 192,256,256,256,256,256 \
 --encoder-unmasked-dim 192,192,192,192,192,192

I tried the above command with other images and some did work:

  • custom nix setup: torch.2.1.0-cuda12.1 (python 3.11.6) -> not working
  • official k2fsa/icefall:torch2.1.0-cuda12.1 (python 3.10.13) -> not working
  • official k2fsa/icefall:torch2.1.0-cuda11.8 (python 3.10.13) -> not working
  • official k2fsa/icefall:torch2.0.0-cuda11.7 (python 3.10.0) -> working
  • official k2fsa/icefall:torch1.9.0-cuda10.2 (python 3.7.10) -> working
  • official k2fsa/icefall:torch1.13.0-cuda11.6 (python 3.9.12) -> working
  • official k2fsa/icefall:torch1.12.1-cuda11.3 (python 3.7.13) -> working

The system under test:

  • runs AlmaLinux 9.2
  • 2x NVIDIA GeForce RTX 2080 SUPER
  • Driver Version: 535.104.12
  • CUDA Version: 12.2

Any idea what's going on?

Is this a known issue with the recent pytorch 2.1.0?

I first thought it could be due to the new CUDA version, but I don't understand why the yesno example would run if it were related to CUDA.

I would be very grateful for any advice.

@csukuangfj (Collaborator)

torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT

Could you post the output of dmesg immediately after you get the above error?
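
For example (reading the kernel ring buffer may require root, depending on the kernel.dmesg_restrict setting):

sudo dmesg -T | tail -n 100   # kernel messages with human-readable timestamps, most recent last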

matthiasdotsh commented Nov 23, 2023

I get the following logs after starting the docker container:

[Thu Nov 23 15:53:52 2023] docker0: port 2(veth7e3bc5c) entered disabled state
[Thu Nov 23 15:53:52 2023] veth766da7b: renamed from eth0
[Thu Nov 23 15:53:52 2023] docker0: port 2(veth7e3bc5c) entered disabled state
[Thu Nov 23 15:53:52 2023] device veth7e3bc5c left promiscuous mode
[Thu Nov 23 15:53:52 2023] docker0: port 2(veth7e3bc5c) entered disabled state
[Thu Nov 23 15:53:59 2023] docker0: port 2(vethcdb2b28) entered blocking state
[Thu Nov 23 15:53:59 2023] docker0: port 2(vethcdb2b28) entered disabled state
[Thu Nov 23 15:53:59 2023] device vethcdb2b28 entered promiscuous mode
[Thu Nov 23 15:53:59 2023] eth0: renamed from veth6f5933d
[Thu Nov 23 15:53:59 2023] IPv6: ADDRCONF(NETDEV_CHANGE): vethcdb2b28: link becomes ready
[Thu Nov 23 15:53:59 2023] docker0: port 2(vethcdb2b28) entered blocking state
[Thu Nov 23 15:53:59 2023] docker0: port 2(vethcdb2b28) entered forwarding state

After that I don't see any further entries, although I have started the training and it has crashed again.
I'm not sure if it makes a difference here, but I call dmesg as a non-root user.

Edit: When using a working image like k2fsa/icefall:torch1.12.1-cuda11.3, the output of dmesg -T looks similar.

@csukuangfj (Collaborator)

I get the following logs after starting the docker container:

Please post the logs immediately after the error, not after the start.

@matthiasdotsh (Author)

Please post the logs immediately after the error, not after the start.

There are no new entries after the training crashed …

@csukuangfj (Collaborator)

--shm-size=2gb

Can you try a larger value?
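
For example, recreate the container with a much larger value (16 GB below is just an arbitrary test value):

docker rm -f icefall   # remove the old container first
docker run -it --runtime=nvidia --shm-size=16gb --name=icefall --gpus all k2fsa/icefall:torch2.1.0-cuda12.1 /bin/bash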

matthiasdotsh commented Nov 24, 2023

--shm-size=2gb
Can you try a larger value?

I tried with --shm-size=20gb, but the result is the same as before: it works with the old images but not with the newer ones, and there are no entries in dmesg.

matthiasdotsh commented Nov 24, 2023

I have just run the experiment on another machine, with the same results:

Running zipformer on minilibrispeech with the k2fsa/icefall:torch1.12.1-cuda11.3 docker image works fine; running it with k2fsa/icefall:torch2.1.0-cuda12.1 gives

torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT

There are still no entries in dmesg

System under test:

  • CentOS 7
  • 2x NVIDIA A40
  • Driver Version: 545.23.06
  • CUDA Version: 12.3

Is it working for anyone?

matthiasdotsh commented Dec 1, 2023

From my past experiments with the 'official' Docker images, we know:

  • official k2fsa/icefall:torch2.0.0-cuda11.7 -> working
    • python3.10.0
  • official k2fsa/icefall:torch2.1.0-cuda11.8 -> not working
    • python3.10.13

So my next idea was to find out whether the bug comes from cuda11.7 -> cuda11.8 or from torch2.0.0 -> torch2.1.0.

Because I couldn't find a base image with torch2.0.0 and cuda11.8, I switched back to my custom nix workflow for further debugging.

At first I created a nix devShell with torch2.0.0 and cuda11.7, and, as with the official docker image, training works fine.

After that I bumped cuda to cuda11.8 (+torch2.0.0 and python310) and still everything works fine.
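
A quick way to confirm which torch/CUDA build such a shell actually loads (just a sanity check, not part of the reproduction):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"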

So I bumped torch to torch2.1.0 (+cuda11.8 and python310) and I can reproduce the known error again (but with a longer traceback):

2023-12-01 11:16:57,596 INFO [train.py:1124] (1/2) About to create model
2023-12-01 11:16:57,886 INFO [train.py:1128] (0/2) Number of model parameters: 27596865
2023-12-01 11:16:57,893 INFO [train.py:1128] (1/2) Number of model parameters: 27596865
2023-12-01 11:16:58,040 INFO [train.py:1143] (1/2) Using DDP
2023-12-01 11:16:58,206 INFO [train.py:1143] (0/2) Using DDP
terminate called after throwing an instance of 'c10::Error'
  what():  Error: cannot set number of interop threads after parallel work has started or set_num_interop_threads called
Exception raised from set_num_interop_threads at ../aten/src/ATen/ParallelThreadPoolNative.cpp:54 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fffccc87617 in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7fffccc42a56 in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1820f2f (0x7fffafa20f2f in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x706cba (0x7fffc5f06cba in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>

terminate called after throwing an instance of 'c10::Error'
  what():  Error: cannot set number of interop threads after parallel work has started or set_num_interop_threads called
Exception raised from set_num_interop_threads at ../aten/src/ATen/ParallelThreadPoolNative.cpp:54 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fffccc87617 in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7fffccc42a56 in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1820f2f (0x7fffafa20f2f in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x706cba (0x7fffc5f06cba in /path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>

Traceback (most recent call last):
  File "/path/to/local/workspace/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1389, in <module>
    main()
  File "/path/to/local/workspace/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1380, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/path/to/local/workspace/.nix/_build/pip_packages/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 145, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT

Does anyone have any idea what could be going wrong here?

Is it my environment, or is it a general problem of k2/icefall with pytorch2.1.0?

(BTW: Still no information in dmesg)
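
For reference: torch.set_num_interop_threads() may only be called once, and only before any inter-op parallel work has started; a second call raises a RuntimeError with the same message as in the trace above. A minimal sketch (not the icefall code) that reproduces the error text:

python3 -c "import torch; torch.set_num_interop_threads(1); torch.set_num_interop_threads(1)"   # second call fails

The "terminate called after throwing an instance of 'c10::Error'" lines suggest that in our case the exception is thrown outside of Python right after the "Using DDP" step and aborts the worker, which would explain the SIGABRT.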

matthiasdotsh changed the title from "Zipformer training with recent Dockerimage fails using DDP" to "Zipformer training with pytorch2.1.0 fails using DDP" on Dec 1, 2023

@matthiasdotsh (Author)

Looks like this could be related to #1395

@csukuangfj (Collaborator)

This issue should be fixed in #1424

Please use the latest master.

(Note that you can reuse your existing docker image; you only need to run git pull inside your container to use the latest code.)
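
For example (assuming the container named icefall and the /workspace/icefall checkout from the reproduction steps above):

docker start -ai icefall        # or: docker exec -it icefall /bin/bash
cd /workspace/icefall
git pull                        # picks up the fix from #1424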

@matthiasdotsh (Author)

Thanks, works now.
