
CuDNN error: CUDNN_STATUS_EXECUTION_FAILED #39

Open
taomrzhang opened this issue Dec 4, 2018 · 10 comments

Comments

@taomrzhang

Hello, I want to train on my own dataset. However, when I try to run the code, the following error occurs:
Namespace(backbone='resnet', base_size=513, batch_size=8, checkname='deeplab-resnet', crop_size=513, cuda=True, dataset='pascal', epochs=50, eval_interval=1, freeze_bn=False, ft=False, gpu_ids=[0], loss_type='ce', lr=0.007, lr_scheduler='poly', momentum=0.9, nesterov=False, no_cuda=False, no_val=False, out_stride=16, resume=None, seed=1, start_epoch=0, sync_bn=False, test_batch_size=8, use_balanced_weights=False, use_sbd=False, weight_decay=0.0005, workers=4)
Number of images in train: 3184
Number of images in val: 797
Using poly LR Scheduler!
Starting Epoch: 0
Total Epoches: 50
  0%|          | 0/398 [00:00<?, ?it/s]
=>Epoches 0, learning rate = 0.0070, previous best = 0.0000
/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
  warnings.warn(warning.format(ret))
Train loss: 0.288:   1%|▏ | 3/398 [00:03<07:59, 1.21s/it]
Traceback (most recent call last):
  File "train.py", line 305, in <module>
    main()
  File "train.py", line 298, in main
    trainer.training(epoch)
  File "train.py", line 109, in training
    loss.backward()
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED
/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [13,0,0], thread: [457,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [13,0,0], thread: [458,0,0] Assertion `t >= 0 && t < n_classes` failed.

@jfzhang95
Owner

It seems like the error is in your labels. The failing assertion `t >= 0 && t < n_classes` means at least one label value is outside the valid class range. Maybe you should check your labels, or provide more details about how this error comes up.
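A minimal sketch of what checking the labels can look like, assuming a VOC-style layout where ground-truth masks are single-channel PNGs; the directory path, file pattern, class count, and ignore index below are placeholders, not something from this repo:

import glob
import numpy as np
from PIL import Image

mask_dir = "path/to/SegmentationClass"   # placeholder: your mask directory
num_classes = 21                          # placeholder: your class count
ignore_index = 255                        # VOC-style "ignore" value, if you use one

for path in glob.glob(mask_dir + "/*.png"):
    mask = np.array(Image.open(path))
    values = np.unique(mask)
    # Anything >= num_classes (other than the ignore index) will trigger the
    # t >= 0 && t < n_classes assertion inside the NLL loss kernel.
    bad = values[(values >= num_classes) & (values != ignore_index)]
    if bad.size > 0:
        print(path, "contains unexpected label values:", bad)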

@taomrzhang
Author

Thanks! I have changed the number of labels, but the error is the same.

Train loss: 0.193:   2%|▍ | 7/398 [00:06<06:18, 1.03it/s]
Traceback (most recent call last):
  File "train.py", line 305, in <module>
    main()
  File "train.py", line 298, in main
    trainer.training(epoch)
  File "train.py", line 109, in training
    loss.backward()
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED

@jfzhang95
Owner

jfzhang95 commented Dec 4, 2018

I have not encountered this problem before. Can you successfully run my default training code on the VOC dataset?

@719637146

I have the same problem when I run the default training code on the VOC dataset. Have you solved it?

@krishnadusad

krishnadusad commented Feb 24, 2019

Have the same issue.
Using poly LR Scheduler!
Starting Epoch: 0
Total Epoches: 50
  0%|          | 0/4179 [00:00<?, ?it/s]
=>Epoches 0, learning rate = 0.0070, previous best = 0.0000
/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
  warnings.warn(warning.format(ret))
Traceback (most recent call last):
  File "train.py", line 301, in <module>
    main()
  File "train.py", line 294, in main
    trainer.training(epoch)
  File "train.py", line 106, in training
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Any suggestions?

@Pyten

Pyten commented Jun 19, 2019

Maybe try a smaller batch size if your GPU memory is not enough.
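For example, something like the following, assuming the flag matches the batch_size entry in the Namespace printout above:

python train.py --backbone resnet --dataset pascal --batch-size 4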

@ghost

ghost commented Dec 3, 2019

I recently met the same issue. Any suggestions?

@ghost

ghost commented Dec 3, 2019

Thanks! I have changed the number of labels, but the error is the same.
Train loss: 0.193:   2%|▍ | 7/398 [00:06<06:18, 1.03it/s]
[...]
RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Did you solve this problem? Can you give some suggestions? Thank you.

@coordxyz

In my case, I solved the same issue by fixing the erroneous labels in my own dataset.
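In case it helps, a rough sketch of the kind of fix I mean, assuming single-channel PNG masks and a VOC-style ignore index of 255; the path and class count are placeholders for your own dataset:

import glob
import numpy as np
from PIL import Image

mask_dir = "path/to/my_masks"   # placeholder: your mask directory
num_classes = 21                # placeholder: your class count
ignore_index = 255

for path in glob.glob(mask_dir + "/*.png"):
    mask = np.array(Image.open(path))
    # Map any value outside [0, num_classes) to the ignore index so the
    # NLL loss assertion t >= 0 && t < n_classes can no longer fire.
    invalid = (mask >= num_classes) & (mask != ignore_index)
    if invalid.any():
        mask[invalid] = ignore_index
        # Note: this rewrites the mask as a plain grayscale PNG.
        Image.fromarray(mask.astype(np.uint8)).save(path)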

@parthkvv

parthkvv commented Jun 4, 2022

In my case, I solved the same issue by fixing the erroneous labels in my own dataset.

Can you be more specific about where you made those changes?
