Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
Medium
Please provide a clear description of the problem you would like to solve.
I use Modulus DistributedManager with SLURM. Right now, DistributedManager sets local_rank from the SLURM_LOCALID environment variable, i.e. the process's local index on the node:
local_rank = int(os.environ.get("SLURM_LOCALID"))
That local_rank is then used to set the device:
manager._device = torch.device(
f"cuda:{manager.local_rank}" if torch.cuda.is_available() else "cpu"
)
Notably, this breaks if SLURM_LOCALID is greater than or equal to torch.cuda.device_count().
In my use case, however, I need to use the SBATCH --gpu-bind=map_gpu:0,1,2,3 flag on a node with 4 GPUs. With 4 processes per node and 4 GPUs per node, each process only sees one device, named cuda:0, though that name refers to a different physical GPU in each process. (A forum post explains why I need to use this flag.)
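To make the failure concrete, here is a minimal standalone reproduction of the mismatch (not Modulus code; the exact error message depends on the PyTorch version):

import os
import torch

# Under --gpu-bind=map_gpu:0,1,2,3 each rank is bound to one GPU, so
# torch.cuda.device_count() == 1 even though SLURM_LOCALID runs 0-3.
local_rank = int(os.environ.get("SLURM_LOCALID", "0"))
print(torch.cuda.device_count())  # prints 1 on every rank

# Constructing the device object succeeds, but using it fails for
# local_rank >= 1 with a CUDA "invalid device ordinal" error.
device = torch.device(f"cuda:{local_rank}")
torch.cuda.set_device(device)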
There may be other use cases where the number of local processes specified through SLURM may not equal the number of GPUs accessible (e.g. running FourCastNet with 4 GPUs and 1 process per GPU, but analyzing the output with more processes).
My request would be to add a flag to DistributedManager through which I could specify that the behavior sketched below is applied for SLURM as well.
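A rough sketch of what I mean, written as standalone code (the wrapping itself would ideally live inside DistributedManager behind the new flag; the variable names here are only illustrative):

import os
import torch

# Sketch: wrap the SLURM local rank onto the devices this process can
# actually see, instead of using it verbatim as a device index.
local_rank = int(os.environ.get("SLURM_LOCALID", "0"))
if torch.cuda.is_available():
    device_id = local_rank % torch.cuda.device_count()
    device = torch.device(f"cuda:{device_id}")
else:
    device = torch.device("cpu")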
This ensures that torch.device is not called on a device that can't be accessed.
Describe any alternatives you have considered
Without a flag, DistributedManager.initialize() raises an error because torch.device is used to access a device that is not available. I could implement an equivalent for DistributedManager, or I could create a subclass of DistributedManager that overrides the initialize_slurm method. Let me know if that would be the preferred solution, and I can proceed with my fix on my end.
@ankurmahesh This is an interesting use case that I don't think we've encountered before. Am I understanding this correctly that you need this because you are using a TorchScript serialized model that has cuda:0 baked in as the device?
There are two solutions I can think of in this case:
1. Set the SLURM_LOCALID variable to 0 for all ranks before calling DistributedManager.initialize(), or
2. Add this feature so that the device ID is always derived from the local rank and the number of visible devices, along the lines sketched below:
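A sketch of that computation (not final code; manager here refers to the DistributedManager instance as in the snippet above):

# Wrap the local rank onto the visible devices so that an out-of-range
# SLURM_LOCALID still maps to a device this process can actually use.
device_id = manager.local_rank % torch.cuda.device_count()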
The second option would allow you to use the --gpu-bind argument in a SLURM environment, or you could also just set CUDA_VISIBLE_DEVICES=0 manually for all ranks.
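For the first option, the workaround would look something like this in the launch script, before the manager reads the SLURM environment (a sketch only; the import path is assumed from the Modulus docs):

import os

from modulus.distributed import DistributedManager

# Workaround sketch for option 1: every rank reports local ID 0, so the
# manager picks cuda:0, which is the one GPU each rank is bound to.
os.environ["SLURM_LOCALID"] = "0"

DistributedManager.initialize()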