
Zipformer training crash: 'cannot set number of interop threads' ... #1395

Closed
iggygeek opened this issue Nov 29, 2023 · 15 comments

@iggygeek

Training a zipformer with a recent icefall/k2 install results in a crash:

2023-11-29 13:02:22,614 INFO [train.py:1138] About to create model
2023-11-29 13:02:22,996 INFO [train.py:1142] Number of model parameters: 65549011
terminate called after throwing an instance of 'c10::Error'
what(): Error: cannot set number of interop threads after parallel work has started or set_num_interop_threads called
Exception raised from set_num_interop_threads at ../aten/src/ATen/ParallelThreadPoolNative.cpp:54 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x154f981a5617 in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x154f98160a56 in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: + 0x1826cbf (0x154f59d7acbf in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x70c26a (0x154f7028526a in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #4: python3() [0x52422b]

frame #7: python3() [0x5c82ce]

My env:
'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'd12eec7521aaa26f49ca0c11c94ea42879a8e71d', 'k2-git-date': 'Mon Oct 23 11:54:42 2023', 'lhotse-version': '1.17.0.dev+git.3c0574f.clean', 'torch-version': '2.1.0+cu121', 'torch-cuda-available': True, 'torch-cuda-version': '12.1', 'python-version': '3.1', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'ae67f75-clean', 'icefall-git-date': 'Sun Nov 26 03:04:15 2023', 'icefall-path': '/home/user/git_projects/icefall1', 'k2-path': '/home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/k2/__init__.py', 'lhotse-path': '/home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/lhotse/__init__.py', 'hostname': 'gpu3', 'IP address': '127.0.1.1'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'trnmanifest': PosixPath('data/fbank/cuts_trn.jsonl.gz'), 'devmanifest': PosixPath('data/fbank/cuts_dev.jsonl.gz'), 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 500, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}

@csukuangfj
Collaborator

Could you tell us which script you are using?

Also, have you changed any code?

@iggygeek
Author

I am using zipformer/train.py from the LibriSpeech recipe, with minor changes in the data machinery to adapt to my dataset.

export CUDA_VISIBLE_DEVICES="0"
./zipformer/train.py \
  --world-size 1 \
  --num-epochs 50 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --causal 0 \
  --max-duration 500

Note that the yesno recipe works both on CPU and GPU, and that data preparation was very slow compared to my previous install of k2/icefall, as if parallel data processing was not working correctly...

@csukuangfj
Collaborator

with minor changes in the data machinery to adapt to my dataset.

Is there anything related to threads in your changes?

@iggygeek
Author

No, they are just related to data formatting.
In addition, I tried to run data preparation for LibriSpeech and it was very slow...

@csukuangfj
Collaborator

torch.set_num_threads(1)
torch.set_num_interop_threads(1)

The above two lines can only be executed once.

Could you use pdb to run your code step by step and check at which location the program crashes?
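
As a standalone illustration of that constraint (a sketch, not icefall code), a second call to torch.set_num_interop_threads in the same process raises exactly the error shown in the log:

# sketch: the interop thread count can only be set once per process
import torch

torch.set_num_threads(1)
torch.set_num_interop_threads(1)  # first call succeeds

try:
    torch.set_num_interop_threads(1)  # second call in the same process
except RuntimeError as e:
    # "cannot set number of interop threads after parallel work has started
    #  or set_num_interop_threads called"
    print(e)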

@iggygeek
Author

I followed the debugging until a torch._dynamo module import, which fails...

/home/user/git/icefall1/egs/ber1/zipformer/train.py(1161)run()
-> optimizer = ScaledAdam(
/home/user/git/icefall1/egs/ber1/zipformer/optim.py(41)__init__()
-> super(BatchedOptimizer, self).__init__(params, defaults)
/home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/optim/optimizer.py(266)__init__()
-> self.add_param_group(cast(dict, param_group))
/home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/_compile.py(22)inner()
-> import torch._dynamo
(Pdb) s
--Call--

<frozen importlib._bootstrap>(1165)_find_and_load()
(Pdb) n
<frozen importlib._bootstrap>(1170)_find_and_load()
(Pdb) n
<frozen importlib._bootstrap>(1171)_find_and_load()
(Pdb) n
<frozen importlib._bootstrap>(1173)_find_and_load()
(Pdb) n
<frozen importlib._bootstrap>(1174)_find_and_load()
(Pdb) n
<frozen importlib._bootstrap>(1175)_find_and_load()
(Pdb) n
<frozen importlib._bootstrap>(1176)_find_and_load()
(Pdb) n
terminate called after throwing an instance of 'c10::Error'
what(): Error: cannot set number of interop threads after parallel work has started or set_num_interop_threads called
Exception raised from set_num_interop_threads at ../aten/src/ATen/ParallelThreadPoolNative.cpp:54 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f92d2e87617 in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f92d2e42a56 in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libc10.so)

@csukuangfj
Collaborator

Can you remove the following code block and re-try?

if __name__ == "__main__":

@iggygeek
Author

Yes, but the exact same error occurs.

@MarcoMultichannel

Hello, I'm having the same issue, and in my case too the two lines that set the threads appear only once in train.py.

@csukuangfj
Collaborator

Yes, but the exact same error occurs.

Are you able to train using the librispeech zipformer recipe with the master branch?

Also, please check that there is no unguarded call to torch.set_num_interop_threads(1) in any script that is imported, directly or indirectly, by train.py.

For instance, compute_fbank_librispeech.py has an unprotected call:

torch.set_num_threads(1)
torch.set_num_interop_threads(1)

so it should not be imported, directly or indirectly, by train.py, because train.py contains the same unprotected call:

torch.set_num_threads(1)
torch.set_num_interop_threads(1)
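
As a sketch of the kind of guarding meant here (a hypothetical script layout, not the actual icefall files), keeping the calls under the main guard means they run only when the script is executed directly, never when it is imported:

# hypothetical_compute_fbank.py -- illustration only, not the actual icefall file
import torch

def compute_fbank():
    # feature-extraction work would go here
    pass

if __name__ == "__main__":
    # Runs only when this script is executed directly, never on import,
    # so importing compute_fbank() from train.py does not trigger a second
    # torch.set_num_interop_threads() call.
    torch.set_num_threads(1)
    torch.set_num_interop_threads(1)
    compute_fbank()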

@csukuangfj
Collaborator

Our team member has tested the latest master and it works fine on our server.

@matthiasdotsh

Our team member has tested the latest master and it works fine on our server.

Do you know which CUDA and PyTorch versions were used for testing?

@joazoa

joazoa commented Dec 5, 2023

I have the same problem :/
My training code has not changed, but recent builds give me the same error.

@joazoa

joazoa commented Dec 6, 2023

Downgrading to PyTorch 2.0.0 resolved the issue for me.

@csukuangfj
Collaborator

This issue should be fixed in #1424

Please use the latest master.
