
Zipformer training crash: 'cannot set number of interop threads' ... #1395

Closed
iggygeek opened this issue Nov 29, 2023 · 15 comments

@iggygeek

Training a zipformer with a recent icefall/k2 install results in a crash:

2023-11-29 13:02:22,614 INFO [train.py:1138] About to create model
2023-11-29 13:02:22,996 INFO [train.py:1142] Number of model parameters: 65549011
terminate called after throwing an instance of 'c10::Error'
what(): Error: cannot set number of interop threads after parallel work has started or set_num_interop_threads called
Exception raised from set_num_interop_threads at ../aten/src/ATen/ParallelThreadPoolNative.cpp:54 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x154f981a5617 in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x154f98160a56 in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: + 0x1826cbf (0x154f59d7acbf in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x70c26a (0x154f7028526a in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #4: python3() [0x52422b]

frame #7: python3() [0x5c82ce]

My env:
'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'd12eec7521aaa26f49ca0c11c94ea42879a8e71d', 'k2-git-date': 'Mon Oct 23 11:54:42 2023', 'lhotse-version': '1.17.0.dev+git.3c0574f.clean', 'torch-version': '2.1.0+cu121', 'torch-cuda-available': True, 'torch-cuda-version': '12.1', 'python-version': '3.1', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'ae67f75-clean', 'icefall-git-date': 'Sun Nov 26 03:04:15 2023', 'icefall-path': '/home/user/git_projects/icefall1', 'k2-path': '/home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/k2/__init__.py', 'lhotse-path': '/home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/lhotse/__init__.py', 'hostname': 'gpu3', 'IP address': '127.0.1.1'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'trnmanifest': PosixPath('data/fbank/cuts_trn.jsonl.gz'), 'devmanifest': PosixPath('data/fbank/cuts_dev.jsonl.gz'), 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 500, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}

@csukuangfj
Collaborator

Could you tell us which script you are using?

Also, have you changed any code?

@iggygeek
Author

I am using zipformer/train.py from the LibriSpeech recipe, with minor changes in the data machinery to adapt to my dataset.

export CUDA_VISIBLE_DEVICES="0"
./zipformer/train.py \
  --world-size 1 \
  --num-epochs 50 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --causal 0 \
  --max-duration 500

Note that the yesno recipe works both on CPU and GPU, and that data preparation was very slow compared to my previous install of k2/icefall, as if parallel data processing was not working correctly...

@csukuangfj
Collaborator

with minor changes in the data machinery to adapt to my dataset.

Is there anything related to threads in your changes?

@iggygeek
Author

No, they are just related to data formatting.
In addition, I tried to run data preparation for LibriSpeech and it was very slow...

@csukuangfj
Collaborator

torch.set_num_threads(1)
torch.set_num_interop_threads(1)

The above two lines can only be executed once.

Could you use pdb to run your code step by step and check at which location the program crashes?
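
As a standalone illustration of that constraint (a sketch, not icefall code), a second call to torch.set_num_interop_threads in the same process raises exactly the error shown in the log:

# sketch: the interop thread count can only be set once per process
import torch

torch.set_num_threads(1)
torch.set_num_interop_threads(1)  # first call succeeds

try:
    torch.set_num_interop_threads(1)  # second call in the same process
except RuntimeError as e:
    # "cannot set number of interop threads after parallel work has started
    #  or set_num_interop_threads called"
    print(e)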

@iggygeek
Author

I followed the debugging until a torch._dynamo module import, which fails...

/home/user/git/icefall1/egs/ber1/zipformer/train.py(1161)run()
-> optimizer = ScaledAdam(
/home/user/git/icefall1/egs/ber1/zipformer/optim.py(41)__init__()
-> super(BatchedOptimizer, self).__init__(params, defaults)
/home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/optim/optimizer.py(266)__init__()
-> self.add_param_group(cast(dict, param_group))
/home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/_compile.py(22)inner()
-> import torch._dynamo
(Pdb) s
--Call--

<frozen importlib._bootstrap>(1165)_find_and_load()
(Pdb) n
<frozen importlib._bootstrap>(1170)_find_and_load()
(Pdb) n
<frozen importlib._bootstrap>(1171)_find_and_load()
(Pdb) n
<frozen importlib._bootstrap>(1173)_find_and_load()
(Pdb) n
<frozen importlib._bootstrap>(1174)_find_and_load()
(Pdb) n
<frozen importlib._bootstrap>(1175)_find_and_load()
(Pdb) n
<frozen importlib._bootstrap>(1176)_find_and_load()
(Pdb) n
terminate called after throwing an instance of 'c10::Error'
what(): Error: cannot set number of interop threads after parallel work has started or set_num_interop_threads called
Exception raised from set_num_interop_threads at ../aten/src/ATen/ParallelThreadPoolNative.cpp:54 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f92d2e87617 in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f92d2e42a56 in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libc10.so)

@csukuangfj
Collaborator

Can you remove the following code block and re-try?

if __name__ == "__main__":

@iggygeek
Author

Yes, but the exact same error occurs.

@MarcoMultichannel

Hello, I'm having the same issue, and in my case too the two lines that set the threads appear only once in train.py.

@csukuangfj
Collaborator

Yes, but the exact same error occurs.

Are you able to train using the librispeech zipformer recipe with the master branch?

Also, please check that there is no unguarded call to torch.set_num_interop_threads(1) in any script that is imported, directly or indirectly, by train.py.

For instance, compute_fbank_librispeech.py has an unprotected call:

torch.set_num_threads(1)
torch.set_num_interop_threads(1)

so it should not be imported, directly or indirectly, by train.py, because train.py contains the same unprotected call:

torch.set_num_threads(1)
torch.set_num_interop_threads(1)
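
As a sketch of the kind of guarding meant here (a hypothetical script layout, not the actual icefall files), keeping the calls under the main guard means they run only when the script is executed directly, never when it is imported:

# hypothetical_compute_fbank.py -- illustration only, not the actual icefall file
import torch

def compute_fbank():
    # feature-extraction work would go here
    pass

if __name__ == "__main__":
    # Runs only when this script is executed directly, never on import,
    # so importing compute_fbank() from train.py does not trigger a second
    # torch.set_num_interop_threads() call.
    torch.set_num_threads(1)
    torch.set_num_interop_threads(1)
    compute_fbank()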

@csukuangfj
Collaborator

Our team member has tested the latest master and it works fine on our server.

@matthiasdotsh

Our team member has tested the latest master and it works fine on our server.

Do you know which CUDA and PyTorch versions were used for testing?

@joazoa

joazoa commented Dec 5, 2023

I have the same problem :/
My training code has not changed, but recent builds give me the same error.

@joazoa

joazoa commented Dec 6, 2023

Downgrading to PyTorch 2.0.0 resolved the issue for me.

@csukuangfj
Collaborator

This issue should be fixed in #1424

Please use the latest master.
