
meeting error when using unet on custom dataset #3836

Open
liugd18 opened this issue Jan 17, 2025 · 1 comment
liugd18 commented Jan 17, 2025

Traceback (most recent call last):
File "/root/autodl-tmp/mmsegmentation-main/tools/train.py", line 104, in <module>
main()
File "/root/autodl-tmp/mmsegmentation-main/tools/train.py", line 100, in main
runner.train()
File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/mmengine/runner/runner.py", line 1777, in train
model = self.train_loop.run() # type: ignore
File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/mmengine/runner/loops.py", line 287, in run
self.run_iter(data_batch)
File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/mmengine/runner/loops.py", line 311, in run_iter
outputs = self.runner.model.train_step(
File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/mmengine/model/base_model/base_model.py", line 116, in train_step
optim_wrapper.update_params(parsed_losses)
File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 196, in update_params
self.backward(loss)
File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/mmengine/optim/optimizer/optimizer_wrapper.py", line 220, in backward
loss.backward(**kwargs)
File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/miniconda3/envs/swin_trans/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([2, 16, 512, 512], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(16, 2, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
memory_format = Contiguous
data_type = CUDNN_DATA_FLOAT
padding = [0, 0, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x117f345c0
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 2, 16, 512, 512,
strideA = 4194304, 262144, 512, 1,
output: TensorDescriptor 0x7efe215f8960
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 2, 2, 512, 512,
strideA = 524288, 262144, 512, 1,
weight: FilterDescriptor 0x117f38ce0
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 2, 16, 1, 1,
Pointer addresses:
input: 0x7efea6400000
output: 0x7efea6000000
weight: 0x7f00277b0c00

I don't know how to deal with this.
I've tried using different GPUs, but it didn't help.
ViT, Swin, and DeepLabV3+ all train smoothly; the error only appears when training UNet.
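Not part of the original report, but the repro above runs with `torch.backends.cudnn.benchmark = True`, and disabling cuDNN autotuning is a commonly suggested workaround for `CUDNN_STATUS_INTERNAL_ERROR` in the backward pass. A minimal sketch, assuming the flags are set before the model is built (e.g. near the top of tools/train.py); whether it fixes this particular setup is an assumption:

```python
import torch

# Workaround sketch (assumption, not a confirmed fix for this issue):
# disable the cuDNN autotuner so it does not probe algorithms/workspaces,
# and optionally bypass cuDNN entirely as a last resort.
torch.backends.cudnn.benchmark = False
# torch.backends.cudnn.enabled = False  # last resort: slower, but avoids cuDNN

# Re-run the failing convolution pattern from the error message.
device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.randn(2, 16, 512, 512, device=device, requires_grad=True)
net = torch.nn.Conv2d(16, 2, kernel_size=1).to(device)
out = net(data)
out.backward(torch.randn_like(out))
print(out.shape)  # torch.Size([2, 2, 512, 512])
```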

@liugd18
Copy link
Author

liugd18 commented Jan 17, 2025

Additionally, when I increased the batch size, the error disappeared,
but I have to occupy more GPUs to meet the memory requirement of the larger batch size.
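If a larger effective batch is really what avoids the error, gradient accumulation may emulate it without extra GPUs: run several small micro-batches and step the optimizer once. A plain-PyTorch sketch (the model and data here are stand-ins, not the issue's UNet config; mmengine's OptimWrapper also accepts an accumulative_counts setting for this, depending on your version):

```python
import torch

# Stand-in model/optimizer, assumed for illustration only.
model = torch.nn.Conv2d(16, 2, kernel_size=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # effective batch = accum_steps * micro-batch size

optimizer.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(2, 16, 32, 32)          # micro-batch of 2
    loss = model(x).pow(2).mean() / accum_steps  # scale so grads average
    loss.backward()                          # grads accumulate in .grad
optimizer.step()                             # one update for the whole group
optimizer.zero_grad()
```

Scaling each micro-loss by 1/accum_steps makes the accumulated gradient match the gradient of one large averaged batch, so the update is equivalent up to batch-statistics layers (BatchNorm still sees the small per-step batch).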
