CuDNN error: CUDNN_STATUS_EXECUTION_FAILED #39
Comments
The error seems to come from your labels. Please check them, or provide more detail on how this error arises.
Thanks! I changed the number of labels, but the error is the same.
I have not encountered this problem before. Can you successfully run my default training code on the VOC dataset?
I have the same problem when I run the default training code on the VOC dataset. Have you solved it?
I have the same issue.
Maybe try a smaller batch size if your GPU does not have enough memory.
I recently ran into the same issue. Any suggestions?
Did you solve this problem? Can you give some suggestions? Thank you.
In my case, I solved the same issue by fixing the erroneous labels in my own dataset.
Can you be more specific about where you made those changes?
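One way labels can end up invalid in a custom dataset is that the masks store arbitrary pixel values (for example 0 and 128) instead of contiguous class ids. The sketch below shows one possible remapping fix; it is not necessarily what the commenter above did. `remap_labels`, the example mapping, and the choice of 255 as the ignored value are all assumptions made for illustration.

```python
def remap_labels(values, mapping, ignore_index=255):
    """Map raw mask values to class ids.

    Values missing from `mapping` are replaced by `ignore_index`
    (an assumed "void" value; it must match the loss configuration).
    """
    return [mapping.get(v, ignore_index) for v in values]


# Hypothetical raw mask: 0 = background, 128 = object, 7 = stray value.
raw = [0, 128, 128, 7, 0]
print(remap_labels(raw, {0: 0, 128: 1}))
```

Sending unknown values to `ignore_index` only helps if the loss is built to skip that value, e.g. `torch.nn.CrossEntropyLoss(ignore_index=255)`. Otherwise the remapped value is itself out of range.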
Hello, I want to train on my own dataset. However, when I run the code, the following error occurs:
Namespace(backbone='resnet', base_size=513, batch_size=8, checkname='deeplab-resnet', crop_size=513, cuda=True, dataset='pascal', epochs=50, eval_interval=1, freeze_bn=False, ft=False, gpu_ids=[0], loss_type='ce', lr=0.007, lr_scheduler='poly', momentum=0.9, nesterov=False, no_cuda=False, no_val=False, out_stride=16, resume=None, seed=1, start_epoch=0, sync_bn=False, test_batch_size=8, use_balanced_weights=False, use_sbd=False, weight_decay=0.0005, workers=4)
Number of images in train: 3184
Number of images in val: 797
Using poly LR Scheduler!
Starting Epoch: 0
Total Epoches: 50
0%| | 0/398 [00:00<?, ?it/s]
=>Epoches 0, learning rate = 0.0070, previous best = 0.0000
/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
  warnings.warn(warning.format(ret))
Train loss: 0.288: 1%|▏ | 3/398 [00:03<07:59, 1.21s/it]
Traceback (most recent call last):
  File "train.py", line 305, in <module>
    main()
  File "train.py", line 298, in main
    trainer.training(epoch)
  File "train.py", line 109, in training
    loss.backward()
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED
/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [13,0,0], thread: [457,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [13,0,0], thread: [458,0,0] Assertion `t >= 0 && t < n_classes` failed.
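The assertion in the log above, `t >= 0 && t < n_classes`, fails when some target pixel holds a value outside the valid class range. A minimal sketch for finding such values before training is shown below. It assumes the label mask can be flattened into plain integers (for a NumPy array, something like `mask.ravel().tolist()`), and it assumes 255 is the value the loss ignores, as is conventional for Pascal VOC "void" pixels. `find_invalid_labels` is a hypothetical helper, not part of this repository.

```python
from collections import Counter


def find_invalid_labels(values, n_classes, ignore_index=255):
    """Count label values that are outside [0, n_classes) and not ignored.

    Pass ignore_index=None if the loss does not skip any value.
    """
    invalid = Counter()
    for t in values:
        if t == ignore_index:
            continue  # assumed to be skipped by the loss
        if not (0 <= t < n_classes):
            invalid[t] += 1
    return invalid


# Toy flattened mask for a 21-class setup (valid ids are 0..20).
flat_mask = [0, 1, 20, 21, 255, -1, 21]
print(find_invalid_labels(flat_mask, n_classes=21))
```

Running a check like this over every mask in the dataset shows which files contain out-of-range values, which is usually easier to act on than the asynchronous CUDA assertion.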