
CuDNN error: CUDNN_STATUS_EXECUTION_FAILED #39

Open
taomrzhang opened this issue Dec 4, 2018 · 10 comments

Comments

@taomrzhang

Hello, I want to train on my own dataset. However, when I try to run the code, the following error occurs:
Namespace(backbone='resnet', base_size=513, batch_size=8, checkname='deeplab-resnet', crop_size=513, cuda=True, dataset='pascal', epochs=50, eval_interval=1, freeze_bn=False, ft=False, gpu_ids=[0], loss_type='ce', lr=0.007, lr_scheduler='poly', momentum=0.9, nesterov=False, no_cuda=False, no_val=False, out_stride=16, resume=None, seed=1, start_epoch=0, sync_bn=False, test_batch_size=8, use_balanced_weights=False, use_sbd=False, weight_decay=0.0005, workers=4)
Number of images in train: 3184
Number of images in val: 797
Using poly LR Scheduler!
Starting Epoch: 0
Total Epoches: 50
  0%|          | 0/398 [00:00<?, ?it/s]
=>Epoches 0, learning rate = 0.0070, previous best = 0.0000
/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
  warnings.warn(warning.format(ret))
Train loss: 0.288:   1%|▏ | 3/398 [00:03<07:59, 1.21s/it]
Traceback (most recent call last):
  File "train.py", line 305, in <module>
    main()
  File "train.py", line 298, in main
    trainer.training(epoch)
  File "train.py", line 109, in training
    loss.backward()
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED
/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [13,0,0], thread: [457,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [13,0,0], thread: [458,0,0] Assertion `t >= 0 && t < n_classes` failed.

@jfzhang95
Owner

It seems like the error is in your labels. The failing assertion `t >= 0 && t < n_classes` means at least one label value is outside the valid class range. Maybe you should check your labels, or provide more details about how this error comes up.
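A minimal sketch of what checking the labels can look like, assuming a VOC-style layout where ground-truth masks are single-channel PNGs; the directory path, file pattern, class count, and ignore index below are placeholders, not something from this repo:

import glob
import numpy as np
from PIL import Image

mask_dir = "path/to/SegmentationClass"   # placeholder: your mask directory
num_classes = 21                          # placeholder: your class count
ignore_index = 255                        # VOC-style "ignore" value, if you use one

for path in glob.glob(mask_dir + "/*.png"):
    mask = np.array(Image.open(path))
    values = np.unique(mask)
    # Anything >= num_classes (other than the ignore index) will trigger the
    # t >= 0 && t < n_classes assertion inside the NLL loss kernel.
    bad = values[(values >= num_classes) & (values != ignore_index)]
    if bad.size > 0:
        print(path, "contains unexpected label values:", bad)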

@taomrzhang
Author

Thanks! I have changed the number of labels, but the error is the same.

Train loss: 0.193:   2%|▍ | 7/398 [00:06<06:18, 1.03it/s]
Traceback (most recent call last):
  File "train.py", line 305, in <module>
    main()
  File "train.py", line 298, in main
    trainer.training(epoch)
  File "train.py", line 109, in training
    loss.backward()
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED

@jfzhang95
Owner

jfzhang95 commented Dec 4, 2018

I have not encountered this problem before. Can you successfully run my default training code on the VOC dataset?

@719637146

I have the same problem when I run the default training code on the VOC dataset. Have you solved it?

@krishnadusad

krishnadusad commented Feb 24, 2019

Have the same issue.
Using poly LR Scheduler!
Starting Epoch: 0
Total Epoches: 50
  0%|          | 0/4179 [00:00<?, ?it/s]
=>Epoches 0, learning rate = 0.0070, previous best = 0.0000
/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
  warnings.warn(warning.format(ret))
Traceback (most recent call last):
  File "train.py", line 301, in <module>
    main()
  File "train.py", line 294, in main
    trainer.training(epoch)
  File "train.py", line 106, in training
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Any suggestions?

@Pyten

Pyten commented Jun 19, 2019

Maybe try a smaller batch size if your GPU memory is not enough.
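For example, something like the following, assuming the flag matches the batch_size entry in the Namespace printout above:

python train.py --backbone resnet --dataset pascal --batch-size 4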

@ghost

ghost commented Dec 3, 2019

I recently met the same issue. Any suggestions?

@ghost

ghost commented Dec 3, 2019

Thanks! I have changed the number of labels, but the error is the same.
Train loss: 0.193:   2%|▍ | 7/398 [00:06<06:18, 1.03it/s]
[...]
RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Did you solve this problem? Can you give some suggestions? Thank you.

@coordxyz

In my case, I solved the same issue by fixing the erroneous labels in my own dataset.
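In case it helps, a rough sketch of the kind of fix I mean, assuming single-channel PNG masks and a VOC-style ignore index of 255; the path and class count are placeholders for your own dataset:

import glob
import numpy as np
from PIL import Image

mask_dir = "path/to/my_masks"   # placeholder: your mask directory
num_classes = 21                # placeholder: your class count
ignore_index = 255

for path in glob.glob(mask_dir + "/*.png"):
    mask = np.array(Image.open(path))
    # Map any value outside [0, num_classes) to the ignore index so the
    # NLL loss assertion t >= 0 && t < n_classes can no longer fire.
    invalid = (mask >= num_classes) & (mask != ignore_index)
    if invalid.any():
        mask[invalid] = ignore_index
        # Note: this rewrites the mask as a plain grayscale PNG.
        Image.fromarray(mask.astype(np.uint8)).save(path)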

@parthkvv

parthkvv commented Jun 4, 2022

In my case, I solved the same issue by fixing the erroneous labels in my own dataset.

Can you be more specific about where you made those changes?
