Illegal memory access during zipformer training #1764

Closed
ngoel17 opened this issue Sep 30, 2024 · 26 comments

Comments

@ngoel17
Contributor

ngoel17 commented Sep 30, 2024

I am getting the following error during Zipformer training.
Initially, I got exactly the same error with an older version of CUDA/drivers and older versions of k2fsa and icefall. I re-ran after upgrading everything (including PyTorch) and still got the same error. The error does not happen consistently at the same point. Any pointers will be greatly appreciated.

2024-09-30 11:09:40,620 INFO [train.py:1190] (0/2) Training started
2024-09-30 11:09:40,620 INFO [train.py:1200] (0/2) Device: cuda:0
2024-09-30 11:09:40,621 INFO [train.py:1231] (0/2) Using dtype=torch.float16
2024-09-30 11:09:40,621 INFO [train.py:1232] (0/2) Use AMP=True
2024-09-30 11:09:40,621 INFO [train.py:1234] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'ignore_id': -1, 'label_smoothing': 0.1, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '21302dae6cdbaa25c5b851f35329e592f5bf12d5', 'k2-git-date': 'Sat Sep 7 05:29:18 2024', 'lhotse-version': '1.28.0.dev+git.bc2c0a29.clean', 'torch-version': '2.6.0a0+gitc9653bf', 'torch-cuda-available': True, 'torch-cuda-version': '12.6', 'python-version': '3.10', 'icefall-git-branch': 'master', 'icefall-git-sha1': '5c04c312-clean', 'icefall-git-date': 'Fri Sep 20 00:38:52 2024', 'icefall-path': '/mnt/dsk1/home/ngoel/icefall', 'k2-path': '/mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/k2-1.24.4.dev20240930+cpu.torch2.6.0a0-py3.10-linux-x86_64.egg/k2/init.py', 'lhotse-path': '/mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/lhotse-1.28.0.dev0+git.bc2c0a29.clean-py3.10.egg/lhotse/init.py', 'hostname': 'rahim', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 8000, 'exp_dir': PosixPath('exp/zipformer/v6'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.025, 'lr_batches': 5000.0, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'attention_decoder_loss_scale': 0.8, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 200, 'average_period': 200, 'use_fp16': True, 'use_bf16': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': 
'4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'attention_decoder_dim': 512, 'attention_decoder_num_layers': 6, 'attention_decoder_attention_dim': 512, 'attention_decoder_num_heads': 8, 'attention_decoder_feedforward_dim': 2048, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'use_attention_decoder': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 200, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'sos_id': 1, 'eos_id': 1, 'vocab_size': 500, 'dtype': torch.float16, 'use_autocast': True}
2024-09-30 11:09:40,621 INFO [train.py:1236] (0/2) About to create model

....

2024-09-30 11:15:42,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=5666.666666666667, ans=0.234375
2024-09-30 11:15:43,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=5666.666666666667, ans=0.00963768115942029
2024-09-30 11:15:43,257 INFO [scaling.py:1024] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.37 vs. limit=9.625
2024-09-30 11:15:44,083 INFO [scaling.py:1024] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.85 vs. limit=9.625
2024-09-30 11:15:47,973 INFO [scaling.py:1024] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=24.01 vs. limit=11.754999999999999
[F] /home/ngoel/k2/k2/csrc/eval.h:147:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<at::Tensor ()(torch::autograd::AutogradContext, at::Tensor, float), k2::SwooshFunctionk2::SwooshRConstants::forward, void, 1>, const float*, float, float, float, float, const float*, float*, const float*, unsigned char*>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (700 vs. 0) Error: an illegal memory access was encountered.
[rank1]:[E930 11:15:51.984368626 ProcessGroupNCCL.cpp:1598] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /mnt/dsk1/home/ngoel/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fb2efd3ba79 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fb30bdaeec2 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fb2f0f41d1e in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xb0 (0x7fb2f0f46db0 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1ca (0x7fb2f0f512da in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x166 (0x7fb2f0f52e76 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /mnt/dsk1/home/ngoel/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fb2efd3ba79 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fb30bdaeec2 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fb2f0f41d1e in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xb0 (0x7fb2f0f46db0 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1ca (0x7fb2f0f512da in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x166 (0x7fb2f0f52e76 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /mnt/dsk1/home/ngoel/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1604 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0x1125d62 (0x7fb2f0f2fd62 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdaffe4 (0x7fb2f0bb9fe4 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0930 11:15:52.358000 1791209 /mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1791247 via signal SIGTERM
Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1651, in
main()
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1642, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
while not context.join():
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT
(icefall-sep-24) ngoel@rahim:~/icefall/egs/multien/ASR13$ /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 28 leaked semaphore objects to clean up at shutdown

@ngoel17
Contributor Author

ngoel17 commented Sep 30, 2024

Here is a different error that I get when running the same script again. These errors seem to happen at random times during training, but within a few checkpoints.

2024-09-30 15:06:46,412 INFO [train.py:1122] (0/2) Epoch 1, batch 400, loss[loss=1.346, simple_loss=0.8059, pruned_loss=0.9432, over 1061.00 frames. ], tot_loss[loss=1.241, simple_loss=0.7526, pruned_loss=0.8645, over 191645.04 frames. ], batch size: 4, lr: 1.79e-02, grad_scale: 0.125
2024-09-30 15:06:46,524 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7000.0, ans=0.22999999999999998
2024-09-30 15:06:46,525 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=7000.0, ans=0.655
2024-09-30 15:06:48,441 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=7000.0, ans=0.171875
2024-09-30 15:06:48,897 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=7000.0, ans=0.037500000000000006
2024-09-30 15:06:48,898 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=7000.0, ans=0.171875
2024-09-30 15:06:50,010 INFO [scaling.py:1024] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.44 vs. limit=12.75
2024-09-30 15:06:52,242 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7008.333333333333, ans=0.22991666666666666
2024-09-30 15:06:52,327 INFO [scaling.py:1024] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=18.81 vs. limit=6.752083333333333
2024-09-30 15:06:57,455 INFO [scaling.py:1120] (0/2) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-09-30 15:06:59,827 INFO [scaling.py:1120] (1/2) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=7.720e+00
2024-09-30 15:07:02,352 INFO [scaling.py:1024] (0/2) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.81 vs. limit=10.134375
2024-09-30 15:07:03,453 INFO [scaling.py:1024] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.59 vs. limit=12.76875
2024-09-30 15:07:03,507 INFO [train.py:1060] (1/2) Caught exception: CUDA error: misaligned address
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
.
2024-09-30 15:07:03,508 INFO [checkpoint.py:75] (1/2) Saving checkpoint to exp/zipformer/v6/bad-model-1.pt
[rank1]:[E930 15:07:03.771530808 ProcessGroupNCCL.cpp:1598] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: misaligned address
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /mnt/dsk1/home/ngoel/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7f2d9c0ef78c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7f2d9c093a79 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7f2d9c1b8ec2 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7f2d9d341d1e in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xb0 (0x7f2d9d346db0 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1ca (0x7f2d9d3512da in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x166 (0x7f2d9d352e76 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7f2db7cb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7f2dc0164ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7f2dc01f6850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: misaligned address

@ngoel17
Contributor Author

ngoel17 commented Oct 1, 2024

I can now confirm that the error is specific to the zipformer recipe. For example, pruned-transducer-stateless7 runs fine. If someone can guide me towards steps equivalent to what's described here, I will happily do those and report back.

@sangeet2020

I believe this has something to do with torch's CUDA not being compatible with the CUDA installed in /usr/local/cuda-XX.X. But it is surprising that the error is specific to the zipformer recipe. Perhaps there is improper synchronization during distributed training. Have you tried the zipformer recipe on a single GPU?
My recommendation would be to lower your torch version to 2.4 and CUDA to 12.4.
Thanks

@danpovey
Collaborator

danpovey commented Oct 2, 2024

I think the possible mismatches would be torch's CUDA not being compatible with the driver (although this should refuse to run), or not being compatible with the CUDA version that was used to compile k2 (although this should be detected some other way, I think).
Do
export CUDA_LAUNCH_BLOCKING=1
export K2_SYNC_KERNELS=1
before running. These should make sure any errors are caught as they happen. It will help to confirm that the issue is in that SwooshR kernel.
That is a very simple kernel that should not be capable of generating an error by itself, though.
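
If it is more convenient to set these two variables from inside the training script than in the shell, a minimal sketch follows. It assumes the variables must be in place before CUDA is initialized, so the safest spot is before importing torch or k2. The helper name `enable_sync_debugging` is hypothetical and not part of icefall or k2.

```python
import os


def enable_sync_debugging(env=os.environ):
    """Set the two variables suggested above so errors surface at the failing call.

    Hypothetical helper, not part of icefall or k2. Call it before importing
    torch / k2 so the settings are in place before CUDA is initialized.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # make CUDA kernel launches synchronous
    env["K2_SYNC_KERNELS"] = "1"       # k2 debugging switch named in the comment above
    return env


if __name__ == "__main__":
    enable_sync_debugging()
    # import torch, k2  # import only after the variables are set
    print({k: os.environ[k] for k in ("CUDA_LAUNCH_BLOCKING", "K2_SYNC_KERNELS")})
```

Synchronous launches slow training down noticeably, so this is only meant for the debugging runs.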

@ngoel17
Contributor Author

ngoel17 commented Oct 2, 2024

Thanks for the various suggestions. I have multiple experiments to run here to narrow down the cause. I'll get back to you. In the meantime, I would like to know about these CUDA versions. My understanding is that if I have (for example) a CUDA 12.xx driver (that comes with that version of CUDA), it should also be compatible with an earlier version of CUDA such as 11.xx. Is that your understanding also, or is there a compatibility matrix I can find somewhere?
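
The rule of thumb being asked about can be sketched as below. It encodes only the simplified assumption that a driver supporting up to some CUDA version can run code built against the same or an older CUDA runtime. It ignores minor-version and forward-compatibility details, for which NVIDIA's CUDA compatibility documentation is the authoritative source, and it says nothing about whether torch and k2 were built against matching CUDA versions. The function names are hypothetical.

```python
def parse_version(text):
    """Turn a version string such as '12.6' into the tuple (12, 6)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def driver_can_run(driver_cuda, runtime_cuda):
    """True if a driver supporting up to `driver_cuda` can run `runtime_cuda` code.

    Simplified assumption: the driver must support a CUDA version at least
    as new as the runtime the application was built with.
    """
    return parse_version(driver_cuda) >= parse_version(runtime_cuda)


if __name__ == "__main__":
    # Example values only: a 12.6-capable driver with an 11.8 runtime.
    print(driver_can_run("12.6", "11.8"))
```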

@ngoel17
Contributor Author

ngoel17 commented Oct 2, 2024

On a single GPU, I got this assertion error.

2024-10-02 11:19:36,387 INFO [train.py:1122] Epoch 1, batch 6000, loss[loss=0.4327, simple_loss=0.4287, pruned_loss=0.2184, over 8488.00 frames. ], tot_loss[loss=0.4947, simple_loss=0.4712, pruned_loss=0.2591, over 1707838.56 frames. ], batch size: 35, lr: 2.00e-02, grad_scale: 64.0
2024-10-02 11:19:36,387 INFO [train.py:1145] Computing validation loss
Traceback (most recent call last):
File "./zipformer/train.py", line 1651, in
main()
File "./zipformer/train.py", line 1644, in main
run(rank=0, world_size=1, args=args)
File "./zipformer/train.py", line 1520, in run
train_one_epoch(
File "./zipformer/train.py", line 1146, in train_one_epoch
valid_info = compute_validation_loss(
File "./zipformer/train.py", line 945, in compute_validation_loss
loss, loss_info = compute_loss(
File "./zipformer/train.py", line 879, in compute_loss
simple_loss, pruned_loss, ctc_loss, attention_decoder_loss = model(
File "/home/mousmita/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/dsk4/mousmita/icefall_24spt/icefall/egs/librispeech/ASR/zipformer/model.py", line 338, in forward
encoder_out, encoder_out_lens = self.forward_encoder(x, x_lens)
File "/mnt/dsk4/mousmita/icefall_24spt/icefall/egs/librispeech/ASR/zipformer/model.py", line 142, in forward_encoder
x, x_lens = self.encoder_embed(x, x_lens)
File "/home/mousmita/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/dsk4/mousmita/icefall_24spt/icefall/egs/librispeech/ASR/zipformer/subsampling.py", line 330, in forward
assert x.size(1) == x_lens.max().item(), (x.size(1), x_lens.max())
AssertionError: (750, tensor(1, device='cuda:0', dtype=torch.int32))

@ngoel17
Contributor Author

ngoel17 commented Oct 2, 2024

After enabling
export CUDA_LAUNCH_BLOCKING=1
export K2_SYNC_KERNELS=1
the log looks as follows:
2024-10-02 12:26:37,597 INFO [train.py:1155] Maximum memory allocated so far is 5429MB
2024-10-02 12:26:38,978 INFO [scaling.py:214] ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=0.0, ans=0.5
2024-10-02 12:26:43,654 INFO [scaling.py:1024] Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=37.57 vs. limit=7.504375
2024-10-02 12:26:43,794 INFO [scaling.py:1024] Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.36 vs. limit=7.504375
2024-10-02 12:26:45,420 WARNING [optim.py:487] Clipping_scale=2.0, grad-norm quartiles 1.967e+03 2.279e+03 2.394e+03 2.565e+03 2.653e+03, threshold=9.575e+03, percent-clipped=0.0
2024-10-02 12:26:46,495 INFO [scaling.py:1024] Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.69 vs. limit=5.002916666666667
2024-10-02 12:26:46,523 INFO [scaling.py:1024] Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=7.67 vs. limit=4.0023333333333335
2024-10-02 12:26:50,261 INFO [scaling.py:1024] Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=39.57 vs. limit=7.504375
2024-10-02 12:26:50,426 INFO [scaling.py:1024] Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.77 vs. limit=5.002916666666667
2024-10-02 12:26:52,609 INFO [scaling.py:1024] Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.04 vs. limit=5.005833333333333
2024-10-02 12:26:52,746 WARNING [optim.py:487] Clipping_scale=2.0, grad-norm quartiles 1.008e+03 1.618e+03 1.967e+03 2.559e+03 2.712e+03, threshold=7.868e+03, percent-clipped=0.0
2024-10-02 12:26:55,027 INFO [scaling.py:214] ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=11.666666666666666, ans=0.499453125
2024-10-02 12:26:55,093 INFO [scaling.py:214] ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.const_attention_rate, batch_count=11.666666666666666, ans=0.24934375
2024-10-02 12:26:58,164 INFO [scaling.py:1024] Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=135.15 vs. limit=7.5065625
2024-10-02 12:26:58,556 INFO [train.py:1060] Caught exception: CUDA error: misaligned address
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
.
2024-10-02 12:26:58,556 INFO [checkpoint.py:75] Saving checkpoint to exp/zipformer/v6/bad-model-0.pt
Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1041, in train_one_epoch
loss, loss_info = compute_loss(
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 879, in compute_loss
simple_loss, pruned_loss, ctc_loss, attention_decoder_loss = model(
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/zipformer/model.py", line 338, in forward
encoder_out, encoder_out_lens = self.forward_encoder(x, x_lens)
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/zipformer/model.py", line 148, in forward_encoder
encoder_out, encoder_out_lens = self.encoder(x, x_lens, src_key_padding_mask)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/zipformer/zipformer.py", line 338, in forward
x = module(
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/zipformer/zipformer.py", line 1267, in forward
src = self.encoder(
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/zipformer/zipformer.py", line 1078, in forward
output = mod(
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/zipformer/zipformer.py", line 779, in forward
attn_weights = self.self_attn_weights(
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/zipformer/zipformer.py", line 1652, in forward
attn_scores = torch.matmul(q, k)
RuntimeError: CUDA error: misaligned address
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1651, in
main()
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1644, in main
run(rank=0, world_size=1, args=args)
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1520, in run
train_one_epoch(
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1061, in train_one_epoch
save_bad_model()
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1018, in save_bad_model
save_checkpoint_impl(
File "/home/ngoel/icefall/icefall/checkpoint.py", line 84, in save_checkpoint
"grad_scaler": scaler.state_dict() if scaler is not None else None,
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 618, in state_dict
"scale": self.get_scale(),
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 546, in get_scale
else cast(float, scale.item())
RuntimeError: CUDA error: misaligned address
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@ngoel17
Contributor Author

ngoel17 commented Oct 2, 2024

After adding .contiguous() to q and k before the matmul above, that error did not occur in the last 2 runs. Now the error that happens every time is this:
r 1701700.22 frames. ], batch size: 37, lr: 1.76e-02, grad_scale: 32.0
2024-10-02 13:48:27,900 INFO [scaling.py:214] ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2362.5, ans=0.178146875
2024-10-02 13:48:31,956 INFO [scaling.py:214] ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2368.3333333333335, ans=0.8171083333333333
2024-10-02 13:48:32,632 INFO [scaling.py:214] ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2368.3333333333335, ans=0.1111875
2024-10-02 13:48:37,586 INFO [scaling.py:1024] Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=9.50 vs. limit=9.280625
2024-10-02 13:48:44,937 INFO [scaling.py:214] ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=2380.0, ans=6.1899999999999995
2024-10-02 13:48:58,201 INFO [scaling.py:214] ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2380.0, ans=0.2762
2024-10-02 13:48:58,334 INFO [scaling.py:214] ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=2380.0, ans=0.2762
2024-10-02 13:48:58,435 INFO [train.py:1060] Caught exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
.
2024-10-02 13:48:58,435 INFO [checkpoint.py:75] Saving checkpoint to exp/zipformer/v6/bad-model-0.pt
Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1053, in train_one_epoch
scaler.scale(loss).backward()
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
torch.autograd.backward(
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
_engine_run_backward(
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1651, in
main()
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1644, in main
run(rank=0, world_size=1, args=args)
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1520, in run
train_one_epoch(
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1061, in train_one_epoch
save_bad_model()
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1018, in save_bad_model
save_checkpoint_impl(
File "/home/ngoel/icefall/icefall/checkpoint.py", line 84, in save_checkpoint
"grad_scaler": scaler.state_dict() if scaler is not None else None,
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 618, in state_dict
"scale": self.get_scale(),
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 546, in get_scale
else cast(float, scale.item())
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@ngoel17
Contributor Author

ngoel17 commented Oct 2, 2024

I have downgraded PyTorch to 2.4. This time I got the following error:
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/zipformer/zipformer.py", line 2345, in forward
x = x * s
RuntimeError: CUDA error: misaligned address
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I am now calling .contiguous() on x and s. With that change it got through more batches than before.

@sangeet2020

assert x.size(1) == x_lens.max().item(), (x.size(1), x_lens.max())

I think the reason for this error is that there is a sample that has a transcript but no features. To confirm, could you try this: #1733 (comment)
You can just use pdb.
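
As a rough illustration of the check suggested above, the sketch below scans a list of samples for entries that carry a transcript but have no feature frames. The `Sample` class and its fields are hypothetical stand-ins, not lhotse or icefall types; in a real recipe the same test would have to be applied to whatever objects the data loader yields.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical stand-in for one training utterance."""
    sample_id: str
    num_frames: int  # number of feature frames stored for the utterance
    text: str        # transcript


def find_featureless(samples):
    """Return the ids of samples that have a transcript but no feature frames."""
    return [
        s.sample_id
        for s in samples
        if s.text.strip() and s.num_frames <= 0
    ]


samples = [
    Sample("utt1", 523, "hello world"),
    Sample("utt2", 0, "transcript without features"),
    Sample("utt3", 0, ""),
]
print(find_featureless(samples))
```

Any id reported this way is a candidate for removal from the training set before re-running.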

@sangeet2020

I have reduced the pytorch version to 2.4. This time i got the error as follows .... File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/zipformer/zipformer.py", line 2345, in forward x = x * s RuntimeError: CUDA error: misaligned address Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I am putting .contiguous() around x and s. It did more batches than before.....

Could you please share your CUDA version? I will try to replicate the recipe and see if the error pops up.

@ngoel17
Contributor Author

ngoel17 commented Oct 2, 2024

I haven't changed the CUDA version, only PyTorch. It is the same as in the starting message: 12.6. I think minor versions don't change the API much. I had CUDA 11 earlier, with similar errors. I don't think the zipformer recipe is that new. I am thinking something wrong in my setup / hardware but not sure. Earlier I used a cloud-compiled PyTorch build; later I compiled it myself, matching the CUDA version and GPU architecture, to avoid any conflicts. The latest error was around the backward pass, where it was complaining about a misaligned address of loss(), as if CUDA is no longer ensuring that the outputs remain aligned just like the inputs.

@ngoel17
Contributor Author

ngoel17 commented Oct 2, 2024

I am now mostly getting this error (last 3 times)

2024-10-02 17:24:50,762 INFO [scaling.py:1024] Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=20.01 vs. limit=7.75875
2024-10-02 17:24:53,118 INFO [scaling.py:1043] Caught exception in Whiten backward: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
, size=[166, 67, 384], will continue.
2024-10-02 17:24:53,119 INFO [scaling.py:801] Caught exception in Balancer backward: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
, size=[166, 67, 384], will continue.
2024-10-02 17:24:53,119 INFO [train.py:1060] Caught exception: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
.
2024-10-02 17:24:53,120 INFO [checkpoint.py:75] Saving checkpoint to exp/zipformer/v6/bad-model-0.pt
Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1053, in train_one_epoch
scaler.scale(loss.contiguous()).backward()
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
torch.autograd.backward(
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward
_engine_run_backward(
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@sangeet2020

sangeet2020 commented Oct 3, 2024

Hi @ngoel17 ,

So, I replicated the librispeech zipformer recipe on my own small data, and it ran fine. Here is how I installed k2 and torch:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install k2==1.24.4.dev20240905+cuda12.4.torch2.4.1 -f https://k2-fsa.github.io/k2/cuda.html 

Writing the env details for better readability

{
  "env_info": {
    "k2-version": "1.24.4",
    "k2-build-type": "Release",
    "k2-with-cuda": true,
    "k2-git-sha1": "cf664841c6d93e21e59b40aade84869b76c919c1",
    "k2-git-date": "Thu Sep 5 19:25:17 2024",
    "lhotse-version": "1.28.0.dev+git.c8ba6d01.clean",
    "torch-version": "2.4.1+cu124",
    "torch-cuda-available": true,
    "torch-cuda-version": "12.4",
    "python-version": "3.10",
    "icefall-git-branch": "master",
    "icefall-git-sha1": "5c04c312-dirty",
    "icefall-git-date": "Fri Sep 20 06:38:52 2024",
    "icefall-path": "/mnt/local/sangeet/workncode/k2-fsa/icefall",
    "k2-path": "/tmp/test/test/lib/python3.10/site-packages/k2/__init__.py",
    "lhotse-path": "/tmp/test/lhotse/lhotse/__init__.py",
    "hostname": "emlgpu04",
    "IP address": "127.0.1.1"
  }
}
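
For anyone comparing their setup against the one above, here is a minimal sketch that checks whether a k2 wheel version string agrees with the torch fields of `env_info`. It assumes the version layout seen in the pip command above (`<version>+cuda<X.Y>.torch<A.B.C>`, or `+cpu.torch<A.B.C>` for CPU-only builds); the function names are made up for this example and are not part of k2 or icefall.

```python
import re


def parse_k2_build_tags(k2_version):
    """Split e.g. '1.24.4.dev20240905+cuda12.4.torch2.4.1' into its build tags.

    Returns {'cuda': '12.4', 'torch': '2.4.1'}; 'cuda' is None for a CPU build.
    """
    _, _, local = k2_version.partition("+")
    m = re.fullmatch(r"(?:cpu|cuda(?P<cuda>[\d.]+))\.torch(?P<torch>.+)", local)
    if m is None:
        raise ValueError(f"unrecognized k2 version string: {k2_version!r}")
    return {"cuda": m.group("cuda"), "torch": m.group("torch")}


def k2_matches_env(k2_version, env_info):
    """True if the k2 build tags agree with the torch fields of env_info."""
    tags = parse_k2_build_tags(k2_version)
    # Drop the local suffix of the torch version, e.g. '2.4.1+cu124' -> '2.4.1'.
    torch_base = env_info["torch-version"].split("+")[0]
    return (
        tags["torch"] == torch_base
        and tags["cuda"] == env_info["torch-cuda-version"]
    )


env_info = {"torch-version": "2.4.1+cu124", "torch-cuda-version": "12.4"}
print(k2_matches_env("1.24.4.dev20240905+cuda12.4.torch2.4.1", env_info))
```

A mismatch reported here does not prove it is the cause of a crash, but it is a cheap thing to rule out before digging further.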

Here are the experiment logs:

python zipformer/train.py --world-size 2 --num-epochs 1 --start-epoch 1 --use-fp16 1 --exp-dir zipformer/exp --full-libri 1 --max-duration 200 --manifest-dir data/En_CV/all_data/fbank/ --bpe-model data/En_CV/all_data/lang_bpe_500/bpe.model --master-port 12344
[W1003 01:59:54.225740073 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1003 01:59:54.230960963 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
2024-10-03 01:59:54,816 INFO [train.py:1194] (0/2) Training started
2024-10-03 01:59:54,817 INFO [train.py:1204] (0/2) Device: cuda:0
2024-10-03 01:59:54,820 INFO [train.py:1235] (0/2) Using dtype=torch.float16
2024-10-03 01:59:54,820 INFO [train.py:1236] (0/2) Use AMP=True
2024-10-03 01:59:54,820 INFO [train.py:1238] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'ignore_id': -1, 'label_smoothing': 0.1, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'cf664841c6d93e21e59b40aade84869b76c919c1', 'k2-git-date': 'Thu Sep 5 19:25:17 2024', 'lhotse-version': '1.28.0.dev+git.c8ba6d01.clean', 'torch-version': '2.4.1+cu124', 'torch-cuda-available': True, 'torch-cuda-version': '12.4', 'python-version': '3.10', 'icefall-git-branch': 'master', 'icefall-git-sha1': '5c04c312-dirty', 'icefall-git-date': 'Fri Sep 20 06:38:52 2024', 'icefall-path': '/mnt/local/sangeet/workncode/k2-fsa/icefall', 'k2-path': '/tmp/test/test/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/tmp/test/lhotse/lhotse/__init__.py', 'hostname': 'emlgpu04', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12344, 'tensorboard': True, 'num_epochs': 1, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/En_CV/all_data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'attention_decoder_loss_scale': 0.8, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'use_bf16': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 
'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'attention_decoder_dim': 512, 'attention_decoder_num_layers': 6, 'attention_decoder_attention_dim': 512, 'attention_decoder_num_heads': 8, 'attention_decoder_feedforward_dim': 2048, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'use_attention_decoder': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/En_CV/all_data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'sos_id': 1, 'eos_id': 1, 'vocab_size': 500, 'dtype': torch.float16, 'use_autocast': True}
2024-10-03 01:59:54,820 INFO [train.py:1240] (0/2) About to create model
2024-10-03 01:59:54,914 INFO [train.py:1194] (1/2) Training started
2024-10-03 01:59:54,915 INFO [train.py:1204] (1/2) Device: cuda:1
2024-10-03 01:59:54,916 INFO [train.py:1235] (1/2) Using dtype=torch.float16
2024-10-03 01:59:54,917 INFO [train.py:1236] (1/2) Use AMP=True
2024-10-03 01:59:54,917 INFO [train.py:1238] (1/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'ignore_id': -1, 'label_smoothing': 0.1, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'cf664841c6d93e21e59b40aade84869b76c919c1', 'k2-git-date': 'Thu Sep 5 19:25:17 2024', 'lhotse-version': '1.28.0.dev+git.c8ba6d01.clean', 'torch-version': '2.4.1+cu124', 'torch-cuda-available': True, 'torch-cuda-version': '12.4', 'python-version': '3.10', 'icefall-git-branch': 'master', 'icefall-git-sha1': '5c04c312-dirty', 'icefall-git-date': 'Fri Sep 20 06:38:52 2024', 'icefall-path': '/mnt/local/sangeet/workncode/k2-fsa/icefall', 'k2-path': '/tmp/test/test/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/tmp/test/lhotse/lhotse/__init__.py', 'hostname': 'emlgpu04', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12344, 'tensorboard': True, 'num_epochs': 1, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/En_CV/all_data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'attention_decoder_loss_scale': 0.8, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'use_bf16': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 
'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'attention_decoder_dim': 512, 'attention_decoder_num_layers': 6, 'attention_decoder_attention_dim': 512, 'attention_decoder_num_heads': 8, 'attention_decoder_feedforward_dim': 2048, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'use_attention_decoder': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/En_CV/all_data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'sos_id': 1, 'eos_id': 1, 'vocab_size': 500, 'dtype': torch.float16, 'use_autocast': True}
2024-10-03 01:59:54,917 INFO [train.py:1240] (1/2) About to create model
2024-10-03 01:59:55,339 INFO [train.py:1244] (0/2) Number of model parameters: 65549011
2024-10-03 01:59:55,391 INFO [train.py:1244] (1/2) Number of model parameters: 65549011
2024-10-03 01:59:55,500 INFO [train.py:1259] (1/2) Using DDP
2024-10-03 01:59:56,316 INFO [train.py:1259] (0/2) Using DDP
2024-10-03 01:59:57,341 INFO [asr_datamodule.py:436] (0/2) About to get the shuffled train-clean-100,             train-clean-360 and train-other-500 cuts
2024-10-03 01:59:57,450 INFO [asr_datamodule.py:232] (0/2) Enable MUSAN
2024-10-03 01:59:57,450 INFO [asr_datamodule.py:233] (0/2) About to get Musan cuts
2024-10-03 01:59:57,468 INFO [asr_datamodule.py:436] (1/2) About to get the shuffled train-clean-100,             train-clean-360 and train-other-500 cuts
2024-10-03 01:59:57,581 INFO [asr_datamodule.py:232] (1/2) Enable MUSAN
2024-10-03 01:59:57,581 INFO [asr_datamodule.py:233] (1/2) About to get Musan cuts
2024-10-03 01:59:59,222 INFO [asr_datamodule.py:257] (0/2) Enable SpecAugment
2024-10-03 01:59:59,222 INFO [asr_datamodule.py:258] (0/2) Time warp factor: 80
2024-10-03 01:59:59,223 INFO [asr_datamodule.py:268] (0/2) Num frame mask: 10
2024-10-03 01:59:59,223 INFO [asr_datamodule.py:281] (0/2) About to create train dataset
2024-10-03 01:59:59,223 INFO [asr_datamodule.py:308] (0/2) Using DynamicBucketingSampler.
2024-10-03 01:59:59,322 INFO [asr_datamodule.py:257] (1/2) Enable SpecAugment
2024-10-03 01:59:59,323 INFO [asr_datamodule.py:258] (1/2) Time warp factor: 80
2024-10-03 01:59:59,323 INFO [asr_datamodule.py:268] (1/2) Num frame mask: 10
2024-10-03 01:59:59,323 INFO [asr_datamodule.py:281] (1/2) About to create train dataset
2024-10-03 01:59:59,323 INFO [asr_datamodule.py:308] (1/2) Using DynamicBucketingSampler.
2024-10-03 02:00:00,146 INFO [asr_datamodule.py:325] (0/2) About to create train dataloader
2024-10-03 02:00:00,147 INFO [asr_datamodule.py:453] (0/2) About to get dev-clean cuts
2024-10-03 02:00:00,148 INFO [asr_datamodule.py:460] (0/2) About to get dev-other cuts
2024-10-03 02:00:00,148 INFO [asr_datamodule.py:356] (0/2) About to create dev dataset
2024-10-03 02:00:00,281 INFO [asr_datamodule.py:325] (1/2) About to create train dataloader
2024-10-03 02:00:00,281 INFO [asr_datamodule.py:453] (1/2) About to get dev-clean cuts
2024-10-03 02:00:00,282 INFO [asr_datamodule.py:460] (1/2) About to get dev-other cuts
2024-10-03 02:00:00,283 INFO [asr_datamodule.py:356] (1/2) About to create dev dataset
2024-10-03 02:00:00,707 INFO [asr_datamodule.py:373] (0/2) About to create dev dataloader
2024-10-03 02:00:00,707 INFO [train.py:1463] (0/2) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2024-10-03 02:00:00,844 INFO [asr_datamodule.py:373] (1/2) About to create dev dataloader
2024-10-03 02:00:00,845 INFO [train.py:1463] (1/2) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2024-10-03 02:00:07,072 INFO [scaling.py:1025] (1/2) Whitening: name=None, num_groups=1, num_channels=192, metric=45.78 vs. limit=7.5
2024-10-03 02:00:07,224 INFO [train.py:1493] (0/2) Maximum memory allocated so far is 3315MB
2024-10-03 02:00:07,226 INFO [train.py:1493] (1/2) Maximum memory allocated so far is 3329MB
2024-10-03 02:00:07,557 INFO [scaling.py:1025] (0/2) Whitening: name=None, num_groups=1, num_channels=256, metric=87.59 vs. limit=4.0
2024-10-03 02:00:08,065 INFO [train.py:1493] (1/2) Maximum memory allocated so far is 3329MB
2024-10-03 02:00:08,065 INFO [train.py:1493] (0/2) Maximum memory allocated so far is 3315MB
2024-10-03 02:00:08,950 INFO [train.py:1493] (0/2) Maximum memory allocated so far is 3315MB
2024-10-03 02:00:08,950 INFO [train.py:1493] (1/2) Maximum memory allocated so far is 3329MB
2024-10-03 02:00:09,709 INFO [scaling.py:1025] (1/2) Whitening: name=None, num_groups=4, num_channels=128, metric=9.13 vs. limit=3.0
2024-10-03 02:00:09,828 INFO [train.py:1493] (1/2) Maximum memory allocated so far is 3329MB
2024-10-03 02:00:09,829 INFO [train.py:1493] (0/2) Maximum memory allocated so far is 3315MB
2024-10-03 02:00:10,738 INFO [train.py:1493] (0/2) Maximum memory allocated so far is 3315MB
2024-10-03 02:00:10,738 INFO [train.py:1493] (1/2) Maximum memory allocated so far is 3329MB
2024-10-03 02:00:11,141 INFO [scaling.py:1025] (1/2) Whitening: name=None, num_groups=1, num_channels=256, metric=38.92 vs. limit=7.5
2024-10-03 02:00:11,362 INFO [scaling.py:1025] (0/2) Whitening: name=None, num_groups=4, num_channels=128, metric=9.64 vs. limit=3.0
2024-10-03 02:00:11,587 INFO [train.py:1493] (0/2) Maximum memory allocated so far is 3315MB
2024-10-03 02:00:11,588 INFO [train.py:1493] (1/2) Maximum memory allocated so far is 3329MB
/tmp/test/icefall/egs/librispeech/ASR/zipformer/train.py:1370: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = GradScaler(enabled=params.use_autocast, init_scale=1.0)
/tmp/test/icefall/egs/librispeech/ASR/zipformer/train.py:1370: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = GradScaler(enabled=params.use_autocast, init_scale=1.0)
2024-10-03 02:00:22,446 INFO [scaling.py:1025] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.03 vs. limit=7.5
2024-10-03 02:00:22,798 INFO [train.py:1126] (0/2) Epoch 1, batch 0, loss[loss=7.436, simple_loss=6.775, pruned_loss=6.596, over 4731.00 frames. ], tot_loss[loss=7.436, simple_loss=6.775, pruned_loss=6.596, over 4731.00 frames. ], batch size: 23, lr: 2.25e-02, grad_scale: 2.0
2024-10-03 02:00:22,799 INFO [train.py:1149] (0/2) Computing validation loss
2024-10-03 02:00:22,802 INFO [train.py:1126] (1/2) Epoch 1, batch 0, loss[loss=7.43, simple_loss=6.767, pruned_loss=6.622, over 4733.00 frames. ], tot_loss[loss=7.43, simple_loss=6.767, pruned_loss=6.622, over 4733.00 frames. ], batch size: 23, lr: 2.25e-02, grad_scale: 2.0
2024-10-03 02:00:22,803 INFO [train.py:1149] (1/2) Computing validation loss
2024-10-03 02:00:39,860 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.3075, 5.4333, 5.3807, 5.4160], device='cuda:0')
2024-10-03 02:00:40,104 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([3.3353, 3.4979, 3.4713, 3.5922, 3.4526, 3.5167, 3.4543, 3.5035],
       device='cuda:1')
2024-10-03 02:00:43,322 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.4515, 3.5168, 3.4991, 3.5555, 3.4853, 3.5217, 3.4923, 3.5199],
       device='cuda:1')
2024-10-03 02:00:43,428 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.2810, 5.4639, 5.4801, 5.3358], device='cuda:0')
2024-10-03 02:00:48,377 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.5395, 3.6418, 3.6039, 3.6943, 3.5954, 3.6408, 3.6137, 3.6393],
       device='cuda:1')
2024-10-03 02:00:48,502 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.0970, 5.1419, 5.1674, 5.2175], device='cuda:0')
2024-10-03 02:00:53,323 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.4807, 5.6428, 5.6575, 5.5311], device='cuda:1')
2024-10-03 02:00:53,392 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([3.3341, 3.5283, 3.4634, 3.6020, 3.4913, 3.5387, 3.4652, 3.5156],
       device='cuda:0')
2024-10-03 02:00:55,341 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.3.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([3.4074, 3.5270, 3.4807, 3.6096, 3.4510, 3.5521, 3.4889, 3.5356],
       device='cuda:1')
2024-10-03 02:00:55,601 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.9412, 4.9748, 5.0364, 5.0812], device='cuda:0')
2024-10-03 02:00:56,504 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.3655, 4.2595, 4.2836, 4.3864], device='cuda:1')
2024-10-03 02:00:56,776 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.2.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([4.2437, 4.3846, 4.2819, 4.2082], device='cuda:0')
2024-10-03 02:01:02,899 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.2188, 5.4193, 5.4720, 5.2883], device='cuda:0')
2024-10-03 02:01:03,419 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.8552, 3.9198, 3.8987, 3.9552, 3.8833, 3.9226, 3.9035, 3.9127],
       device='cuda:1')
2024-10-03 02:01:19,152 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.9902, 5.2923, 5.3694, 5.1818], device='cuda:0')
2024-10-03 02:01:20,277 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.2.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([4.0022, 4.1509, 4.0052, 3.9495], device='cuda:1')
2024-10-03 02:01:29,367 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.3.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([3.7156, 3.8265, 3.8094, 3.9157, 3.7977, 3.8325, 3.8000, 3.8370],
       device='cuda:0')
2024-10-03 02:01:29,570 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.4350, 4.7387, 4.5909, 4.2373], device='cuda:1')
2024-10-03 02:01:41,441 INFO [train.py:1158] (1/2) Epoch 1, validation: loss=7.282, simple_loss=6.63, pruned_loss=6.509, over 4897180.00 frames. 
2024-10-03 02:01:41,442 INFO [train.py:1159] (1/2) Maximum memory allocated so far is 14153MB
2024-10-03 02:01:41,444 INFO [train.py:1158] (0/2) Epoch 1, validation: loss=7.282, simple_loss=6.63, pruned_loss=6.509, over 4897180.00 frames. 
2024-10-03 02:01:41,445 INFO [train.py:1159] (0/2) Maximum memory allocated so far is 14152MB
2024-10-03 02:01:45,218 INFO [scaling.py:215] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=0.0, ans=0.2
2024-10-03 02:01:45,222 INFO [scaling.py:215] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=0.0, ans=0.2
2024-10-03 02:01:45,440 INFO [scaling.py:1025] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=34.05 vs. limit=4.0
2024-10-03 02:01:45,499 INFO [scaling.py:1025] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.76 vs. limit=7.5
2024-10-03 02:01:46,157 INFO [scaling.py:1025] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=32.18 vs. limit=5.0
2024-10-03 02:01:46,662 INFO [scaling.py:215] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=0.0, ans=0.5
2024-10-03 02:01:46,792 INFO [scaling.py:1025] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=4.0
2024-10-03 02:01:47,392 INFO [scaling.py:215] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=0.0, ans=0.5

I am thinking something wrong in my setup / hardware but not sure.

I agree with you on this one. Maybe you could try installing torch and k2 using the setup that worked for me.

thanks

@ngoel17
Contributor Author

ngoel17 commented Oct 3, 2024

Sure thing. I'll try this as well. Currently, the training hasn't crashed and progressed to epoch 2. I believe it's due to all the ideas proposed by you/Dan, but fingers crossed. Will need to go back and hash out what helped if we determine that everything is resolved.

@danpovey
Collaborator

danpovey commented Oct 3, 2024

Nagendra, I suspect the kernel syncing was not working, i.e. the export CUDA_LAUNCH_BLOCKING=1 had no effect. (If it did work, it should affect the time taken a bit, i.e. slow things down.) I think the error had nothing to do with the matmul of q and k and was actually an error from a previous kernel launch that showed up later as it wasn't synchronized; you could revert that assertion, I think.

Also, the error that you got in validation when you used 1 GPU: I think this was a totally different error where in the validation code you were passing in the wrong thing. Probably you never reached batch 6000 with 4 GPUs so it never hit the validation code on batch 1.

@ngoel17
Contributor Author

ngoel17 commented Oct 3, 2024

@danpovey export CUDA_LAUNCH_BLOCKING=1 at least made that message about the option to use CUDA_LAUNCH_BLOCKING in the error messages go away. So the only message left was the one about TORCH_USE_CUDA_DSA. I tried this and also tried export USE_GPU=1 and export TORCH_USE_CUDA_DSA=1 before compiling PyTorch, but that did not affect the TORCH_USE_CUDA_DSA message. Interestingly, editing ./torch/include/c10/cuda/CUDADeviceAssertionHost.h to add an explicit #define led to a "previously defined" warning, and I could not figure out where it was previously defined. This header has #pragma once.
I am not sure what export K2_SYNC_KERNELS=1 did. I think the execution is slower than normal. A bunch of data did not get loaded, so what I called an epoch was actually 1/10 of an epoch, and the training did crash soon after my last message.

@sangeet2020 I did move to CUDA 12.4 and then pip-installed the pre-compiled version. Unfortunately, that has not helped either. The latest error message isn't even about memory alignment:

2024-10-03 14:10:47,940 INFO [train.py:1060] Caught exception: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasGemmStridedBatchedEx( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP).
2024-10-03 14:10:47,941 INFO [checkpoint.py:75] Saving checkpoint to exp/zipformer/v6/bad-model-0.pt
Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1053, in train_one_epoch
scaler.scale(loss.contiguous()).backward()
File "/home/ngoel/.local/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
torch.autograd.backward(
File "/home/ngoel/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward
_engine_run_backward(
File "/home/ngoel/.local/lib/python3.10/site-packages/torch/autograd/graph.py", line 769, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasGemmStridedBatchedEx( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

Interestingly, gpu_burn runs for hours without any problem, and the CPU hasn't thrown any errors. In the coming days, I will try to use the system for an NLP training task, just to rule out general hardware issues.

@danpovey
Collaborator

danpovey commented Oct 4, 2024

You said:
"export CUDA_LAUNCH_BLOCKING=1 at least made that message about the option to use CUDA_LAUNCH_BLOCKING in the error messages go away. "
I didn't see anything like that message in what you pasted before.
In general these kernels can die due to errors from previous kernels, although in theory export CUDA_LAUNCH_BLOCKING=1 should prevent this, at least for kernels launched from PyTorch. export K2_SYNC_KERNELS=1 should do the same for k2, i.e. force it to sync kernels each time.
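
If exporting the variables in the shell is inconvenient, the same two settings can be applied from Python, as sketched below. `CUDA_LAUNCH_BLOCKING` and `K2_SYNC_KERNELS` are the variables named in this thread; the helper name is made up for this example, and the sketch assumes the variables must be set before torch or k2 are imported, since it is not guaranteed that either library picks them up afterwards.

```python
import os


def enable_sync_debugging(env=os.environ):
    """Request synchronous kernel launches for debugging.

    Assumption: this must run before torch or k2 are imported, so that
    the CUDA context is created with the variables already in place.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # PyTorch/CUDA: block on each kernel launch
    env["K2_SYNC_KERNELS"] = "1"       # k2: sync after each kernel, per the comment above
    return {k: env[k] for k in ("CUDA_LAUNCH_BLOCKING", "K2_SYNC_KERNELS")}


print(enable_sync_debugging())
# import torch, k2  # only import these after the variables are set
```

As noted earlier in the thread, a visible slowdown in training is one sign that the settings actually took effect.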

@ngoel17
Contributor Author

ngoel17 commented Oct 4, 2024

@danpovey - I may have exported that before opening the ticket because it was an easier thing to do. There is probably some progress. The past 4 times, I have had error messages from only one "primary backtrace": File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1053, in train_one_epoch
scaler.scale(loss).backward()

The message this time is slightly different, but then I had edited the compute_loss function to add .contiguous() to loss (not the metrics info). Here is the complete log.

2024-10-03 22:19:39,729 INFO [scaling.py:214] ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=36600.0, ans=0.125
2024-10-03 22:21:06,619 WARNING [optim.py:487] Clipping_scale=2.0, grad-norm quartiles 1.322e+02 1.984e+02 2.248e+02 2.536e+02 6.445e+02, threshold=4.496e+02, percent-clipped=3.0
2024-10-03 22:21:20,148 INFO [train.py:1122] Epoch 2, batch 450, loss[loss=0.07262, simple_loss=0.1017, pruned_loss=0.02175, over 22242.00 frames. ], tot_loss[loss=0.1312, simple_loss=0.1391, pruned_loss=0.06165, over 3983925.31 frames. ], batch size: 121, lr: 8.78e-03, grad_scale: 32.0
2024-10-03 22:21:23,227 INFO [scaling.py:214] ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=36675.0, ans=0.0
[F] /var/www/k2/csrc/eval.h:147:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<std::vector<at::Tensor> (*)(torch::autograd::AutogradContext*, std::vector<at::Tensor>), k2::SwooshFunction<k2::SwooshRConstants>::backward, void, 1>, const float*, int, const unsigned char*, float, float, float*>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (700 vs. 0) Error: an illegal memory access was encountered.

[ Stack-Trace: ]
/home/ngoel/.local/lib/python3.10/site-packages/k2/lib64/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x7f825602e9b4]
/home/ngoel/.local/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x4ae9a) [0x7f825c36fe9a]
/home/ngoel/.local/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x17db1a) [0x7f825c4a2b1a]
/home/ngoel/.local/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x1879df) [0x7f825c4ac9df]
/home/ngoel/.local/lib/python3.10/site-packages/_k2.cpython-310-x86_64-linux-gnu.so(+0x18a437) [0x7f825c4af437]
/home/ngoel/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so(+0x52e992b) [0x7f834f6df92b]
/home/ngoel/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so(torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)+0x14e6) [0x7f834f6d99e6]
/home/ngoel/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so(torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)+0x698) [0x7f834f6da658]
/home/ngoel/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so(torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)+0x13f) [0x7f834f6d15df]
/home/ngoel/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so(torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)+0x5c) [0x7f8362e46acc]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f8363ab0253]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f83652c0ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f8365352850]

2024-10-03 22:21:23,924 INFO [train.py:1060] Caught exception:
Some bad things happened. Please read the above error messages and stack
trace. If you are using Python, the following command may be helpful:

  gdb --args python /path/to/your/code.py

(You can use `gdb` to debug the code. Please consider compiling
a debug version of k2.).

If you are unable to fix it, please open an issue at:

  https://github.com/k2-fsa/k2/issues/new
.

2024-10-03 22:21:23,925 INFO [checkpoint.py:75] Saving checkpoint to exp/zipformer/v6/bad-model-0.pt
Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1053, in train_one_epoch
scaler.scale(loss).backward()
File "/home/ngoel/.local/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
torch.autograd.backward(
File "/home/ngoel/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 290, in backward
_engine_run_backward(
File "/home/ngoel/.local/lib/python3.10/site-packages/torch/autograd/graph.py", line 769, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError:
Some bad things happened. Please read the above error messages and stack
trace. If you are using Python, the following command may be helpful:

  gdb --args python /path/to/your/code.py

(You can use `gdb` to debug the code. Please consider compiling
a debug version of k2.).

If you are unable to fix it, please open an issue at:

  https://github.com/k2-fsa/k2/issues/new

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1651, in <module>
main()
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1644, in main
run(rank=0, world_size=1, args=args)
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1520, in run
train_one_epoch(
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1061, in train_one_epoch
save_bad_model()
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1018, in save_bad_model
save_checkpoint_impl(
File "/home/ngoel/icefall/icefall/checkpoint.py", line 84, in save_checkpoint
"grad_scaler": scaler.state_dict() if scaler is not None else None,
File "/home/ngoel/.local/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 615, in state_dict
"scale": self.get_scale(),
File "/home/ngoel/.local/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 543, in get_scale
else cast(float, scale.item())
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@danpovey
Collaborator

danpovey commented Oct 4, 2024

Can you try the following change in k2's swoosh.cu? After
out_grad = out_grad.contiguous();
add:
out_grad = out_grad.to(torch::kFloat32);
I suspect the grad is actually float16, while we get the pointer as if it were float32.
It would be great if you could submit a patch for k2 if this resolves it. The odd thing is that this error doesn't happen immediately, and that it doesn't just completely ruin the training.
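To make the suspected failure mode concrete, here is a small CPU-only sketch using Python's `struct` module. It is not k2 code: the buffer, the element count `n`, and the values are made up for illustration. It shows that a buffer holding `n` fp16 values is only half as long as a reader expecting `n` fp32 values assumes, and that the in-bounds part decodes to unrelated numbers.

```python
import struct

n = 4  # number of gradient elements (illustrative)
grad_fp16 = struct.pack(f"<{n}e", 0.5, -1.0, 2.0, 3.5)  # n half-precision values

bytes_present = len(grad_fp16)                   # 2 bytes per fp16 element
bytes_read_as_fp32 = n * struct.calcsize("<f")   # 4 bytes per fp32 element
print(bytes_present, bytes_read_as_fp32)

# Reading n float32 values from the fp16 buffer would run past its end.
# struct refuses; a GPU kernel with a raw pointer would simply read on.
try:
    struct.unpack(f"<{n}f", grad_fp16)
    overran = False
except struct.error:
    overran = True

# Even the in-bounds half is wrong: each float32 fuses two adjacent fp16 values.
fused = struct.unpack(f"<{n // 2}f", grad_fp16)
print(overran, fused)
```

On a GPU the overrun may often land in memory that happens to be mapped, so an illegal access would not necessarily be raised on every step. That is only one possible reading of the intermittent behaviour, not something established here.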

@ngoel17
Contributor Author

ngoel17 commented Oct 8, 2024

It turned out to be bad hardware. Hopefully, this thread will remind others that not all problems are software problems. The error was not frequent, and I was probably in too much of a rush to draw conclusions.

@ngoel17 ngoel17 closed this as completed Oct 8, 2024
@danpovey
Collaborator

danpovey commented Oct 8, 2024

Oh. How did you determine that it was bad hardware?

@ngoel17
Contributor Author

ngoel17 commented Oct 8, 2024 via email

@danpovey
Collaborator

danpovey commented Oct 9, 2024

OK, but sometimes on different GPU types it might run a different kernel for some reason.
In the case where the gradient was in fp16, I think there might actually be a bug in that code: it would treat the data as fp32, and you could potentially get misaligned-address errors, at least in principle. I just don't know whether that would actually happen in practice. It would be good if you could apply that change, recompile k2, and test.
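The misalignment concern can be illustrated with plain address arithmetic. This is a sketch under assumptions: the base address is hypothetical, and `is_aligned_for_fp32` is a made-up helper. CUDA expects a 4-byte load to use an address that is a multiple of 4, while fp16 elements sit on 2-byte boundaries, so an fp16 pointer that starts at an odd element offset is not a valid fp32 address.

```python
FP16_BYTES = 2
FP32_BYTES = 4


def is_aligned_for_fp32(base_address, fp16_offset):
    """True if the address of an fp16 element is usable for a 4-byte load."""
    return (base_address + fp16_offset * FP16_BYTES) % FP32_BYTES == 0


base = 0x7F00_0000_0000  # hypothetical, well-aligned start of an allocation
aligned = [is_aligned_for_fp32(base, k) for k in range(4)]
print(aligned)
```

Only every other fp16 element offset passes the check. Whether a pointer with an odd offset can actually reach the kernel after the `.contiguous()` call is not established in this thread.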

@ngoel17
Contributor Author

ngoel17 commented Oct 9, 2024 via email

@ngoel17
Contributor Author

ngoel17 commented Oct 10, 2024

According to this PyTorch documentation, under autocast the dtype will be fp32, because swoosh is not in the list. That should be fully deterministic, so there is nothing to worry about.
