LF-MMI GPU OOM #196
Comments
What's your training command? What's the value of --max-duration?
It would be helpful to see the traceback from when it dies.
This is the error log. (When the number of phones is 220, it runs normally.) Killing subprocess 3803024
Hm, there should be a max_arcs option to MultiGraphDenseIntersectPruned() (I forget the Python-level wrapper; probably intersect_dense_pruned()). Setting that to, e.g., 1000 may resolve the issue. Early in training you can get too many arcs active, and if you are using the "normal" topology (not the modified topology), the LF-MMI denominator graph size is quadratic in the number of symbols.
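To put rough numbers on the quadratic growth described above, here is a back-of-the-envelope sketch. It assumes the denominator graph behaves like a fully connected phone bigram, so the arc count scales as roughly the square of the phone count. That is a simplification, not a measurement of the real graph, and `approx_bigram_arcs` is a hypothetical helper, not part of k2.

```python
# Rough arc-count estimate for a phone-bigram LF-MMI denominator graph.
# Assumption (not from the thread): every phone can follow every other
# phone, so the number of arcs grows roughly as num_phones ** 2.


def approx_bigram_arcs(num_phones: int) -> int:
    """Estimate arcs as one arc per ordered phone pair (hypothetical helper)."""
    return num_phones * num_phones


small = approx_bigram_arcs(220)   # phone count reported to train normally
large = approx_bigram_arcs(1252)  # phone count reported in the OOM log
growth = large / small

print(f"220 phones:  ~{small:,} arcs")
print(f"1252 phones: ~{large:,} arcs")
print(f"growth factor: ~{growth:.1f}x")
```

Under that assumption, going from 220 to 1252 phones multiplies the arc count by roughly 32, which would be consistent with the smaller setup fitting in GPU memory while the larger one does not.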
On Sun, Jan 30, 2022 at 1:51 PM abner wrote:
This is the error log. (When the number of phones is 220, it runs normally.)
`2022-01-30 05:34:59,582 INFO Loading L.fst
INFO from MMI module:
device: cuda
use pruned_intersect: True
use segment info: True
self.lo Sequential(
(0): Dropout(p=0.1, inplace=False)
(1): Linear(in_features=256, out_features=1253, bias=True)
)
number of phones 1252
2022-01-30 05:35:05,540 INFO Epoch 0 TRAIN info lr 4e-08
2022-01-30 05:35:05,542 INFO using accumulate grad, new batch size is 4 timeslarger than before
2022-01-30 05:35:06,842 DEBUG TRAIN Batch 0/15013 loss 247.649350 loss_att 77.322586 loss_mmi 110.531494 lr 0.00000004 rank 0
2022-01-30 05:36:13,933 DEBUG TRAIN Batch 100/15013 loss 338.543274 loss_att 106.091759 loss_mmi 123.042969 lr 0.00000104 rank 0
terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError'
what(): CUDA out of memory. Tried to allocate 1.73 GiB (GPU 0; 23.70 GiB total capacity; 19.65 GiB already allocated; 1.06 GiB free; 21.29 GiB reserved in total by PyTorch)
Exception raised from malloc at /opt/conda/conda-bld/pytorch_1616554793803/work/c10/cuda/CUDACachingAllocator.cpp:288 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f71382b72f2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1bc21 (0x7f7138516c21 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1c944 (0x7f7138517944 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1cf63 (0x7f7138517f63 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::Allocator::raw_allocate(unsigned long) + 0x2f (0x7f709044afaf in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #5: k2::PytorchCudaContext::Allocate(unsigned long, void**) + 0x5f (0x7f709044b65f in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #6: k2::NewRegion(std::shared_ptr<k2::Context>, unsigned long) + 0x175 (0x7f709016b015 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #7: k2::Renumbering::ComputeOld2New() + 0x96 (0x7f70901288f6 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #8: k2::Renumbering::Old2New(bool) + 0xc8 (0x7f70902b5b78 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #9: k2::MultiGraphDenseIntersectPruned::PruneTimeRange(int, int) + 0x907 (0x7f70902c7547 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #10: std::_Function_handler<void (), k2::MultiGraphDenseIntersectPruned::Intersect()::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 0x26e (0x7f70902ca58e in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #11: k2::ThreadPool::ProcessTasks() + 0x16d (0x7f709041027d in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #12: <unknown function> + 0xc9039 (0x7f719dc49039 in /opt/conda/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #13: <unknown function> + 0x76db (0x7f71c00216db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #14: clone + 0x3f (0x7f71bfd4a71f in /lib/x86_64-linux-gnu/libc.so.6)
Killing subprocess 3803024
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
`
Thanks. Is it max_active_states? Will lowering this parameter hurt training accuracy?
It's better to set max_active_arcs, though it may only be present in newer versions of k2. max_active_states is a bit less precise because some states can have many arcs leaving them.
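A toy illustration of why a state cap bounds memory less tightly than an arc cap. This is not k2's pruning algorithm: the out-degree lists, the cap values, and both helper functions are made up for the example, and it assumes states are already sorted best-first.

```python
# Toy comparison of a state cap versus an arc cap.
# Everything here is hypothetical; it only shows that the same number of
# active states can correspond to very different numbers of active arcs.


def arcs_under_state_cap(out_degrees, max_active_states):
    """Arcs expanded when only the first max_active_states states are kept."""
    return sum(out_degrees[:max_active_states])


def states_under_arc_cap(out_degrees, max_active_arcs):
    """Leading states (and their arcs) that fit within an arc budget."""
    total_arcs = 0
    kept_states = 0
    for degree in out_degrees:
        if total_arcs + degree > max_active_arcs:
            break
        total_arcs += degree
        kept_states += 1
    return kept_states, total_arcs


# Two frames with the same number of active states but different fan-out.
sparse_frame = [2] * 100     # low fan-out states
dense_frame = [1252] * 100   # states with one arc per phone

sparse_arcs = arcs_under_state_cap(sparse_frame, 50)
dense_arcs = arcs_under_state_cap(dense_frame, 50)
kept_states, kept_arcs = states_under_arc_cap(dense_frame, 10_000)

print(f"state cap 50, sparse frame: {sparse_arcs} arcs")
print(f"state cap 50, dense frame:  {dense_arcs} arcs")
print(f"arc cap 10000, dense frame: {kept_states} states, {kept_arcs} arcs")
```

With the same state cap of 50, the dense frame expands several hundred times more arcs than the sparse one, while an arc budget keeps the total bounded regardless of fan-out.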
I get a GPU OOM error when I use LF-MMI for training. My token set has about 1300 entries. How can I avoid this problem?