🚀[FEA]: DistributedManager for SLURM when local processes can't access all GPUs #115

Open · ankurmahesh opened this issue on Aug 8, 2023 · 2 comments
Labels: enhancement (New feature or request), external (Issues/PR filed by people outside the team)

@ankurmahesh commented on Aug 8, 2023

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of the problem you would like to solve.

I use the Modulus DistributedManager with SLURM. Right now, DistributedManager sets the local_rank from the process's SLURM local ID on the node (this line):

local_rank = int(os.environ.get("SLURM_LOCALID"))

This line then sets the device based on the local_rank:

manager._device = torch.device(
    f"cuda:{manager.local_rank}" if torch.cuda.is_available() else "cpu"
)

Notably, this breaks if "SLURM_LOCALID" is greater than or equal to torch.cuda.device_count().

In my use case, however, I need to use the sbatch --gpu-bind=map_gpu:0,1,2,3 flag on a node with 4 GPUs. With 4 processes per node and 4 GPUs per node, each process only sees 1 device, called cuda:0, even though that name refers to a different physical GPU in each process. (This forum post explains why I need to use this flag.)
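To make the mismatch concrete, here is a minimal sketch of what each process observes under this binding (my illustration, assuming 4 tasks on a 4-GPU node):

import os

import torch

# Under --gpu-bind each srun task sees exactly one GPU, always named cuda:0,
# while SLURM_LOCALID still runs from 0 to 3 across the node.
local_id = int(os.environ.get("SLURM_LOCALID", "0"))
visible = torch.cuda.device_count()  # 1 in this scenario

if local_id >= visible:
    # This is the case DistributedManager currently trips over:
    # cuda:{local_id} does not exist from this process's point of view.
    print(f"SLURM_LOCALID={local_id}, but only {visible} visible device(s); cuda:{local_id} is invalid")
else:
    print(f"SLURM_LOCALID={local_id} maps cleanly to cuda:{local_id}")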

There may be other use cases where the number of local processes launched through SLURM does not equal the number of accessible GPUs (e.g. running FourCastNet with 4 GPUs and 1 process per GPU, but analyzing the output with more processes).

My request would be to add a flag to DistributedManager that lets me opt into the behavior below for SLURM as well:

manager._local_rank = rank % torch.cuda.device_count()

This ensures that torch.device is not called on a device that can't be accessed.
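As a rough sketch of how such an opt-in could look (the flag name and helper below are hypothetical, not existing Modulus API):

import os

import torch

def slurm_device_index(wrap_to_visible_devices: bool = False) -> int:
    # Hypothetical helper: choose the CUDA index for a SLURM-launched process.
    # With wrap_to_visible_devices=True, the SLURM local ID is folded back
    # into the range of devices this process can actually see.
    local_rank = int(os.environ.get("SLURM_LOCALID", "0"))
    if wrap_to_visible_devices and torch.cuda.is_available():
        local_rank %= torch.cuda.device_count()
    return local_rank

device = torch.device(
    f"cuda:{slurm_device_index(wrap_to_visible_devices=True)}"
    if torch.cuda.is_available()
    else "cpu"
)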

Describe any alternatives you have considered

Without such a flag, DistributedManager.initialize() raises an error because torch.device is asked for a device that is not available. I could maintain an equivalent of DistributedManager on my end, or create a subclass of DistributedManager that overrides the initialize_slurm method. Let me know if that would be the preferred solution, and I can continue with that fix locally.

ankurmahesh added the "? - Needs Triage" and "enhancement" labels on Aug 8, 2023
ankurmahesh changed the title from "🚀[FEA]:" to "🚀[FEA]: DistributedManager for SLURM when local processes can't access all GPUs" on Aug 8, 2023
@akshaysubr (Collaborator) commented:

@ankurmahesh This is an interesting use case that I don't think we've encountered before. Am I understanding this correctly that you need this because you are using a TorchScript serialized model that has cuda:0 baked in as the device?

There are two solutions I can think of in this case:

  1. Set the SLURM_LOCALID environment variable to 0 for all ranks before calling DistributedManager.initialize(), or
  2. Add this feature so that the device ID is always computed like this:
manager._device = torch.device(
    f"cuda:{manager.local_rank % torch.cuda.device_count()}"
    if torch.cuda.is_available()
    else "cpu"
)

The second option will allow you to use the --gpu-bind argument in a SLURM environment, or you could also just set CUDA_VISIBLE_DEVICES=0 manually for all ranks.
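For reference, a minimal sketch of option 1 (the modulus import path and singleton usage here are assumed, and under --gpu-bind every rank's single visible GPU shows up as cuda:0):

import os

from modulus.distributed import DistributedManager  # import path assumed

# Option 1 sketch: point every rank's SLURM local ID at device 0 before
# DistributedManager reads it; each rank's only visible GPU is cuda:0.
os.environ["SLURM_LOCALID"] = "0"

DistributedManager.initialize()
manager = DistributedManager()
print(manager._device)  # the attribute named in this issue; expected cuda:0 on every rank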

akshaysubr added the "external" and "good first issue" labels and removed the "? - Needs Triage" label on Aug 9, 2023
ktangsali added a commit that referenced this issue on Nov 3, 2023: …ample (#115) (Update config.yaml, Update train_era5.py)
@ktangsali (Collaborator) commented:

Hi @ankurmahesh @akshaysubr, does this issue still exist?

ktangsali self-assigned this on Jan 28, 2025