[Bug]: EfficientAd - CUDA out of memory. #2531

Open
leemorton opened this issue Jan 21, 2025 · 5 comments

Describe the bug

Training an EfficientAd (small) model with all other parameters at their defaults produces a CUDA out-of-memory error.

Dataset

N/A

Model

N/A

Steps to reproduce the behavior

41 Good Training Images
30 Good Validation Images and 170 Bad Validation Images

EfficientAd(small) model with all other parameters at default

At around epoch 30 of 300 (no callbacks), roughly 8 minutes in on an RTX A5000 (24 GB), I hit this issue. The hardware should be more than sufficient. Watching memory usage via the nvidia-smi CLI, it swings back and forth but climbs steadily throughout training.

Setting the suggested environment variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True did not help either.
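
A minimal sketch of the setup (dataset name and paths are placeholders, and the exact Folder/Engine arguments may differ slightly in 2.0.0b2):

    import os

    from anomalib.data import Folder
    from anomalib.engine import Engine
    from anomalib.models import EfficientAd

    # Suggested allocator setting that did not help in this case.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    # Custom dataset: 41 good training images, 30 good / 170 bad validation images.
    datamodule = Folder(
        name="my_parts",           # placeholder name
        root="datasets/my_parts",  # placeholder path
        normal_dir="good",
        abnormal_dir="bad",
        train_batch_size=1,        # EfficientAd trains with batch size 1
    )

    model = EfficientAd()          # small variant by default; everything else at defaults
    engine = Engine(max_epochs=300)
    engine.fit(model=model, datamodule=datamodule)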

OS information


  • OS: Ubuntu 24.04.1 LTS
  • Python version: 3.10.10
  • Anomalib version: 2.0.0b2
  • PyTorch version: 2.5.0
  • CUDA/cuDNN version: 12.4
  • GPU models and configuration: RTX A5000 (24GB)
  • Using a custom dataset

Expected behavior

Training to complete

Screenshots

(screenshot of the error attached in the original issue)

Pip/GitHub

pip

What version/branch did you use?

2.0.0b2

Configuration YAML

(configuration shared as a screenshot in the original issue)

Logs

N/A

Code of Conduct

  • I agree to follow this project's Code of Conduct
@suahelen

I am running into the exact same issue while using Fastflow.

The jumps in allocated memory happen at the end of every validation epoch. I think I was able to track it down to the BinaryPrecisionRecallCurve metric: something in compute() is leaking memory. I have been tinkering around a bit but have not been able to make it go away so far. Maybe someone else has a good idea?
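
A standalone sketch with plain torchmetrics (outside anomalib; batch sizes are made up) illustrates the growth: with thresholds=None the metric appends every batch to its preds/target lists, and compute() alone never frees them:

    import torch
    from torchmetrics.classification import BinaryPrecisionRecallCurve

    # With thresholds=None the metric stores every batch of preds/targets
    # in Python lists instead of a fixed-size confusion matrix.
    metric = BinaryPrecisionRecallCurve(thresholds=None)

    for step in range(3):
        preds = torch.rand(10_000)
        target = torch.randint(0, 2, (10_000,))
        metric.update(preds, target)
        print(step, len(metric.preds), len(metric.target))  # lists grow every step

    metric.compute()
    print(len(metric.preds), len(metric.target))  # still 3: compute() does not free the state
    metric.reset()
    print(len(metric.preds), len(metric.target))  # 0: only reset() clears it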

@suahelen

I think I found it:
Since preds and targets are always appended in the update() call, self.preds and self.target grow continuously.

Adjusting the compute() call as follows solved it for me: after the torch.cat, the lists can be cleared.

    def compute(self) -> tuple[Tensor, Tensor, Tensor]:
        """Compute metric."""
        if self.thresholds is None:
            if not self.preds or not self.target:
                return torch.tensor([]), torch.tensor([]), torch.tensor([])
            state = (torch.cat(self.preds), torch.cat(self.target))

            # Clear the accumulated per-batch lists once they have been
            # concatenated, so the state stops growing across validation epochs.
            self.preds.clear()
            self.target.clear()
        else:
            # With fixed thresholds the state is a confusion matrix; reset it in place.
            state = self.confmat
            self.confmat.zero_()

        precision, recall, thresholds = _binary_precision_recall_curve_compute(state, self.thresholds)
        return precision, recall, thresholds if thresholds is not None else torch.tensor([])

@suahelen

I just saw that this class comes from torchmetrics, not anomalib, but I'll create an issue there and link it to this one.

@suahelen

FYI Lightning-AI/torchmetrics#2921

@samet-akcay
Contributor

Thanks for sharing, @suahelen
