Memory-mapped caching for image translation training #218

Draft · wants to merge 13 commits into base: segmentation-module
Conversation

@ziw-liu (Collaborator) commented on Jan 6, 2025

#195 implemented in-RAM caching for image translation training. However, it does not scale to datasets larger than the available system memory. This PR implements a node-local disk cache via tensordict's memory-mapped tensors.
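As a rough illustration of the idea (not the PR's actual API; the class name `MemmapPatchDataset` and the `cache_path` argument are hypothetical), a dataset can write a preprocessed patch stack once to node-local scratch via tensordict's `MemoryMappedTensor` and then read slices on demand:

```python
import torch
from tensordict import MemoryMappedTensor
from torch.utils.data import Dataset


class MemmapPatchDataset(Dataset):
    """Illustrative sketch: cache a preprocessed patch stack on node-local scratch."""

    def __init__(self, patches: torch.Tensor, cache_path: str):
        # Write the (N, C, D, H, W) stack to disk once; later reads are served
        # through the OS page cache instead of keeping everything resident in RAM.
        self.patches = MemoryMappedTensor.from_tensor(patches, filename=cache_path)

    def __len__(self) -> int:
        return self.patches.shape[0]

    def __getitem__(self, index: int) -> torch.Tensor:
        # Indexing the memory-mapped tensor only pages in the slice it touches.
        return self.patches[index].clone()
```

Pointing `cache_path` at node-local scratch (rather than networked storage) is what keeps the cache reads off the shared filesystem.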

@ziw-liu added the enhancement (New feature or request) and translation (Image translation (VS)) labels on Jan 6, 2025
@ziw-liu changed the base branch from main to segmentation-module on Jan 6, 2025
@ziw-liu (Collaborator, Author) commented on Jan 14, 2025

Progress summary: previously, training on 2.3 TB of data took 20 s/iter; now, training on 3 TB takes 10 s/iter.

Lessons learned:

  • The default local scratch configuration on our compute nodes was not optimal for sustained I/O (ZFS with a very small sector size). This has been fixed on select nodes and will be rolled out more broadly later.
  • Optimizations/mitigations made for in-memory caching can hurt performance in the mmap setup. Moving some transforms back to the CPU improved end-to-end timing. This might be because MONAI transforms are not batched (they execute in a loop), so CPU/GPU synchronization can take much longer than the actual compute.

To be investigated:

  • Moving augmentations back to the CPU recreated the CPU compute bottleneck (removing augmentations entirely further reduces end-to-end training time to 3 s/iter, roughly a 3× reduction). This is potentially fixable with batched augmentation or by distributing the compute better across devices.
  • Precomputing the normalization to simplify training-time logic (d2cd340); see the sketch after this list.
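A minimal sketch of what precomputed normalization could look like, assuming per-channel statistics are computed once at cache-build time (function and key names here are illustrative, not the implementation in d2cd340):

```python
import torch


def precompute_stats(patches: torch.Tensor) -> dict[str, torch.Tensor]:
    # patches: (N, C, D, H, W); reduce over everything except the channel axis,
    # so the expensive pass over the data happens once, before training.
    dims = (0, 2, 3, 4)
    return {"mean": patches.mean(dim=dims), "std": patches.std(dim=dims)}


def normalize(patch: torch.Tensor, stats: dict[str, torch.Tensor]) -> torch.Tensor:
    # patch: (C, D, H, W); training-time work is reduced to a cheap affine rescale.
    mean = stats["mean"].view(-1, 1, 1, 1)
    std = stats["std"].view(-1, 1, 1, 1)
    return (patch - mean) / std.clamp_min(1e-6)
```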

@ziw-liu (Collaborator, Author) commented on Jan 14, 2025

> This might be because MONAI transforms are not batched (executed in a loop), and CPU/GPU sync could be taking much longer than the actual compute.

Benchmark of 3D random affine in 6a88ec4 (10 runs, milliseconds):

| Device | MONAI (sequential) | Kornia (batched) | Speedup (MONAI / Kornia) |
| --- | --- | --- | --- |
| Zen 2 CPU (1 thread) | 9160 | 3800 | 2.4× |
| Zen 2 CPU (16 threads) | 7320 | 556 | 13.2× |
| A40 GPU | 2620 | 210 | 12.5× |
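The gap reflects per-sample versus batched execution. A rough sketch of the two call patterns (the augmentation parameters and tensor shapes here are illustrative, not those of the benchmark in 6a88ec4):

```python
import math

import torch
from kornia.augmentation import RandomAffine3D
from monai.transforms import RandAffine

batch = torch.rand(16, 1, 32, 128, 128)  # (B, C, D, H, W)

# MONAI: the transform is applied to one (C, D, H, W) sample at a time in a Python loop,
# which also forces repeated CPU/GPU synchronization when run on the GPU.
monai_affine = RandAffine(prob=1.0, rotate_range=(math.pi / 12,) * 3)
monai_out = torch.stack([monai_affine(sample) for sample in batch])

# Kornia: a single batched call over the whole (B, C, D, H, W) tensor.
kornia_affine = RandomAffine3D(degrees=15.0, p=1.0)
kornia_out = kornia_affine(batch)
```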
