Zipformer training crash: 'cannot set number of interop threads' ... #1395
Comments
Could you tell us which script you are using? Also, have you changed any code?
I am using zipformer/train.py from LibriSpeech, with minor changes in the data machinery to adapt it to my dataset, launched with export CUDA_VISIBLE_DEVICES="0". Note that the yesno recipe works on both CPU and GPU, and that data preparation was very slow compared to my previous install of k2/icefall, as if parallel data processing was not working correctly...
Is there anything related to ...?
No, they are just related to data formatting...
icefall/egs/librispeech/ASR/zipformer/train.py, lines 1385 to 1386 in 0622dea
The above two lines can only be executed once. Could you use pdb to run your code step by step and check at which location the program crashes?
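The "can only be executed once" constraint can be sketched as a small guard. This is a hypothetical helper, not code from the recipe; torch is imported lazily so the sketch degrades gracefully when it is absent:

```python
# Hypothetical helper illustrating the run-once rule:
# torch.set_num_interop_threads() raises a RuntimeError if it is called
# a second time, or after parallel work has already started.

_THREADS_CONFIGURED = False

def configure_torch_threads(num_threads: int = 1) -> bool:
    """Apply thread settings at most once; return True if applied now."""
    global _THREADS_CONFIGURED
    if _THREADS_CONFIGURED:
        return False
    try:
        import torch
        torch.set_num_threads(num_threads)
        torch.set_num_interop_threads(num_threads)
    except (ImportError, RuntimeError):
        # torch not installed, or the interop pool is already fixed;
        # either way there is nothing more this guard can do.
        pass
    _THREADS_CONFIGURED = True
    return True
```

With such a guard, a second entry into the setup code becomes a harmless no-op instead of a crash.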
I followed the debugging until a _dynamo module import, which fails at: /home/user/git/icefall1/egs/ber1/zipformer/train.py(1161)run()
Can you remove the following code block and re-try? icefall/egs/librispeech/ASR/zipformer/optim.py, line 1230 in 0622dea
Yes, but the exact same error occurs.
Hello, I'm having the same issue, and in my case too the two lines that set the threads appear only once in train.py.
Are you able to train using the librispeech zipformer recipe with the master branch? Also, please check that there is no unguarded code. For instance, icefall/egs/librispeech/ASR/local/compute_fbank_librispeech.py (lines 44 to 45 in 0622dea) has an unprotected call at module level; compare with icefall/egs/librispeech/ASR/zipformer/train.py (lines 1385 to 1386 in 0622dea).
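The "unguarded code" concern above refers to statements that run at import time. A minimal sketch of the recommended pattern (script structure and argument names are illustrative, not taken from the recipe):

```python
# Keep side effects like thread-count calls inside main(), behind the
# __main__ guard, so that merely importing the module never re-runs them.
import argparse

def main(argv=None):
    parser = argparse.ArgumentParser(description="illustrative script")
    parser.add_argument("--num-workers", type=int, default=2)
    args = parser.parse_args(argv)
    # Thread settings belong here, not at module level, e.g.:
    #   torch.set_num_threads(1)
    #   torch.set_num_interop_threads(1)
    return args.num_workers

if __name__ == "__main__":
    main()
```

An unguarded module-level call, by contrast, fires every time the file is imported, e.g. by a worker process.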
Our team member has tested the latest master and it works fine on our server. |
Do you know which CUDA and PyTorch versions were used for testing?
I have the same problem :/ |
Downgrading to PyTorch 2.0.0 resolved the issue for me.
This issue should be fixed in #1424. Please use the latest master.
Training a zipformer with a recent icefall/k2 install results in a crash:
2023-11-29 13:02:22,614 INFO [train.py:1138] About to create model
2023-11-29 13:02:22,996 INFO [train.py:1142] Number of model parameters: 65549011
terminate called after throwing an instance of 'c10::Error'
what(): Error: cannot set number of interop threads after parallel work has started or set_num_interop_threads called
Exception raised from set_num_interop_threads at ../aten/src/ATen/ParallelThreadPoolNative.cpp:54 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x154f981a5617 in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x154f98160a56 in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: + 0x1826cbf (0x154f59d7acbf in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x70c26a (0x154f7028526a in /home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #4: python3() [0x52422b]
frame #7: python3() [0x5c82ce]
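The check that produces this message lives in ATen (ParallelThreadPoolNative.cpp, as the trace shows). Its invariant can be modeled with a toy sketch; the class below is illustrative, not the real torch internals:

```python
# Toy model of the interop-pool invariant: the pool size may be set at
# most once, and only before any parallel work has started.

class InteropPool:
    def __init__(self):
        self._size = None
        self._started = False

    def set_num_interop_threads(self, n: int) -> None:
        if self._started or self._size is not None:
            raise RuntimeError(
                "cannot set number of interop threads after parallel "
                "work has started or set_num_interop_threads called"
            )
        self._size = n

    def start_parallel_work(self) -> None:
        # Once work starts, the pool size is frozen forever.
        self._started = True
```

This is why the two thread-setting calls in train.py must execute exactly once per process, before anything else has kicked off parallel work.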
My env:
'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'd12eec7521aaa26f49ca0c11c94ea42879a8e71d', 'k2-git-date': 'Mon Oct 23 11:54:42 2023', 'lhotse-version': '1.17.0.dev+git.3c0574f.clean', 'torch-version': '2.1.0+cu121', 'torch-cuda-available': True, 'torch-cuda-version': '12.1', 'python-version': '3.1', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'ae67f75-clean', 'icefall-git-date': 'Sun Nov 26 03:04:15 2023', 'icefall-path': '/home/user/git_projects/icefall1', 'k2-path': '/home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/k2/init.py', 'lhotse-path': '/home/user/miniconda3/envs/icefall1/lib/python3.11/site-packages/lhotse/init.py', 'hostname': 'gpu3', 'IP address': '127.0.1.1'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'trnmanifest': PosixPath('data/fbank/cuts_trn.jsonl.gz'), 'devmanifest': PosixPath('data/fbank/cuts_dev.jsonl.gz'), 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 
'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 500, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}