[Bug]: EfficientAd - CUDA out of memory. #2531

Open
leemorton opened this issue Jan 21, 2025 · 5 comments

Describe the bug

Training an EfficientAd (small) model with all other parameters at their defaults produces a CUDA out-of-memory error.

Dataset

N/A

Model

N/A

Steps to reproduce the behavior

41 Good Training Images
30 Good Validation Images and 170 Bad Validation Images

EfficientAd(small) model with all other parameters at default

At around epoch 30 of 300 (no callbacks), roughly 8 minutes in on an RTX A5000 (24 GB), I hit this issue. The hardware should be more than sufficient. Watching memory usage via the nvidia-smi CLI, it swings back and forth but climbs steadily throughout training.

Setting the suggested environment variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True did not help either.
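
A minimal sketch of the setup (dataset name and paths are placeholders, and the exact Folder/Engine arguments may differ slightly in 2.0.0b2):

    import os

    from anomalib.data import Folder
    from anomalib.engine import Engine
    from anomalib.models import EfficientAd

    # Suggested allocator setting that did not help in this case.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    # Custom dataset: 41 good training images, 30 good / 170 bad validation images.
    datamodule = Folder(
        name="my_parts",           # placeholder name
        root="datasets/my_parts",  # placeholder path
        normal_dir="good",
        abnormal_dir="bad",
        train_batch_size=1,        # EfficientAd trains with batch size 1
    )

    model = EfficientAd()          # small variant by default; everything else at defaults
    engine = Engine(max_epochs=300)
    engine.fit(model=model, datamodule=datamodule)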

OS information


  • OS: Ubuntu 24.04.1 LTS
  • Python version: 3.10.10
  • Anomalib version: 2.0.0b2
  • PyTorch version: 2.5.0
  • CUDA/cuDNN version: 12.4
  • GPU models and configuration: RTX A5000 (24GB)
  • Using a custom dataset

Expected behavior

Training to complete

Screenshots

(screenshot of the error attached in the original issue)

Pip/GitHub

pip

What version/branch did you use?

2.0.0b2

Configuration YAML

(configuration shared as a screenshot in the original issue)

Logs

N/A

Code of Conduct

  • I agree to follow this project's Code of Conduct
@suahelen

I am running into the exact same issue while using Fastflow.

The jumps in allocated memory happen at the end of every validation epoch. I think I was able to track it down to the BinaryPrecisionRecallCurve metric: something in compute() is leaking memory. I have been tinkering around a bit but have not been able to make it go away so far. Maybe someone else has a good idea?
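
A standalone sketch with plain torchmetrics (outside anomalib; batch sizes are made up) illustrates the growth: with thresholds=None the metric appends every batch to its preds/target lists, and compute() alone never frees them:

    import torch
    from torchmetrics.classification import BinaryPrecisionRecallCurve

    # With thresholds=None the metric stores every batch of preds/targets
    # in Python lists instead of a fixed-size confusion matrix.
    metric = BinaryPrecisionRecallCurve(thresholds=None)

    for step in range(3):
        preds = torch.rand(10_000)
        target = torch.randint(0, 2, (10_000,))
        metric.update(preds, target)
        print(step, len(metric.preds), len(metric.target))  # lists grow every step

    metric.compute()
    print(len(metric.preds), len(metric.target))  # still 3: compute() does not free the state
    metric.reset()
    print(len(metric.preds), len(metric.target))  # 0: only reset() clears it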

@suahelen

I think I found it:
Since preds and targets are always appended in the update() call, self.preds and self.target grow continuously.

Adjusting the compute() call as follows solved it for me: after the torch.cat, the lists can be cleared.

    def compute(self) -> tuple[Tensor, Tensor, Tensor]:
        """Compute metric."""
        if self.thresholds is None:
            if not self.preds or not self.target:
                return torch.tensor([]), torch.tensor([]), torch.tensor([])
            state = (torch.cat(self.preds), torch.cat(self.target))

            # Clear the accumulated per-batch lists once they have been
            # concatenated, so the state stops growing across validation epochs.
            self.preds.clear()
            self.target.clear()
        else:
            # With fixed thresholds the state is a confusion matrix; reset it in place.
            state = self.confmat
            self.confmat.zero_()

        precision, recall, thresholds = _binary_precision_recall_curve_compute(state, self.thresholds)
        return precision, recall, thresholds if thresholds is not None else torch.tensor([])

@suahelen

I just saw that this class comes from torchmetrics, not anomalib, but I'll create an issue there and link it to this one.

@suahelen

FYI Lightning-AI/torchmetrics#2921

@samet-akcay
Contributor

Thanks for sharing, @suahelen
