I don't know how to deal with the error below. I've tried using different GPUs, but that didn't help. Training ViT, Swin, and DeepLabV3+ works smoothly; the error only shows up when training UNet.
Additionally, when I increased the batch size the error disappeared, but then I have to occupy more GPUs to meet the memory requirement of the larger batch. The full traceback and the repro output printed by PyTorch follow.
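For context, the batch size mentioned above is the one set in the dataloader section of the config. A minimal sketch of the kind of change involved, assuming an MMEngine-style mmsegmentation config; the base config name and the numbers are placeholders, not values taken from this report:

# Hypothetical override config, for illustration only.
# _base_ points at whatever UNet config is actually being trained.
_base_ = ['./my_unet_config.py']

train_dataloader = dict(
    batch_size=4,   # larger per-GPU batch; with a bigger batch the cuDNN error did not occur
    num_workers=4,
)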
Traceback (most recent call last):
  File "/root/autodl-tmp/mmsegmentation-main/tools/train.py", line 104, in <module>
    main()
  File "/root/autodl-tmp/mmsegmentation-main/tools/train.py", line 100, in main
    runner.train()
  File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/mmengine/runner/loops.py", line 287, in run
    self.run_iter(data_batch)
  File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/mmengine/runner/loops.py", line 311, in run_iter
    outputs = self.runner.model.train_step(
  File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/mmengine/model/base_model/base_model.py", line 116, in train_step
    optim_wrapper.update_params(parsed_losses)
  File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 196, in update_params
    self.backward(loss)
  File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 220, in backward
    loss.backward(**kwargs)
  File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([2, 16, 512, 512], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(16, 2, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
    memory_format = Contiguous
    data_type = CUDNN_DATA_FLOAT
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x117f345c0
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 2, 16, 512, 512,
    strideA = 4194304, 262144, 512, 1,
output: TensorDescriptor 0x7efe215f8960
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 2, 2, 512, 512,
    strideA = 524288, 262144, 512, 1,
weight: FilterDescriptor 0x117f38ce0
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 2, 16, 1, 1,
Pointer addresses:
    input: 0x7efea6400000
    output: 0x7efea6000000
    weight: 0x7f00277b0c00
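A variation of the generated snippet that may help narrow things down: the repro runs with torch.backends.cudnn.benchmark = True, so the sketch below runs the same convolution with the autotuner disabled. Whether that sidesteps the internal error is an assumption to test, not something established above.

import torch

# Same convolution as in the generated repro, but with cuDNN benchmarking
# disabled, so no autotuning / benchmark workspace allocation happens.
# If this runs cleanly while the original snippet fails, the problem is
# likely in the autotune path (an assumption to verify, not a confirmed fix).
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True

data = torch.randn([2, 16, 512, 512], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(16, 2, kernel_size=1).cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()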