
Use torch.distributed as alternative communication backend for Heat #1772

Open
mrfh92 opened this issue Jan 28, 2025 · 2 comments
mrfh92 (Collaborator) commented Jan 28, 2025

Related
As a first step, one might separate the mpi4py wrappers more clearly from Heat's communication structures; see, e.g., #1243 and the draft PR #1265.

Feature functionality
Currently, communication in Heat is based on MPI via mpi4py; this is the standard in traditional HPC. In machine-learning/deep-learning HPC, NCCL/RCCL communication is more common and appears to be advantageous in particular for GPU-GPU communication. The easiest way to support this and, at the same time, improve interoperability with PyTorch would be to allow torch.distributed as an alternative backend for communication.
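For illustration, a minimal torch.distributed setup might look as follows. This is only a sketch: it assumes a torchrun-style launch (which sets RANK/WORLD_SIZE/MASTER_ADDR in the environment for the default env:// init method); "nccl" and "gloo" are the standard torch.distributed backend names.

```python
import torch
import torch.distributed as dist

# Sketch: pick NCCL for GPU-GPU communication, Gloo as the CPU fallback.
# With a torchrun launch, init_process_group() reads rank and world size
# from the environment (the default "env://" init method).
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)

rank = dist.get_rank()
world_size = dist.get_world_size()
```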

So far, a communicator in Heat is actually an mpi4py communicator, and Heat's communication routines are in fact the communication routines of mpi4py.
I would suggest keeping the API for Heat communication, but allowing for another backend. Fortunately, the communication model of torch.distributed is quite MPI-inspired. The main difference is that the variable-size ...v operations (Gatherv, Scatterv, Alltoallv, etc.) are not supported; workarounds need to be created for them (see the sketch below).
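To make the workaround idea concrete, here is a hedged sketch of how a v-style allgather could be emulated on top of the fixed-size all_gather: exchange the per-rank lengths first, pad to the maximum, gather, then trim. The helper name allgatherv_workaround is hypothetical, not an existing Heat or torch.distributed API. For Alltoallv-like patterns, torch.distributed also offers all_to_all_single with explicit input/output split sizes, which may already cover part of the gap.

```python
import torch
import torch.distributed as dist

def allgatherv_workaround(local: torch.Tensor, group=None) -> list:
    """Hypothetical helper: emulate MPI_Allgatherv. dist.all_gather requires
    equally sized tensors, so pad every contribution to the global maximum
    length, gather, and trim each piece back to its true length afterwards.
    Note: with the NCCL backend, all tensors (including the length tensors)
    must live on the GPU; keeping everything on local.device handles this."""
    world_size = dist.get_world_size(group)
    # Step 1: exchange the true chunk lengths.
    length = torch.tensor([local.shape[0]], dtype=torch.int64, device=local.device)
    lengths = [torch.empty(1, dtype=torch.int64, device=local.device)
               for _ in range(world_size)]
    dist.all_gather(lengths, length, group=group)
    max_len = int(max(int(l) for l in lengths))
    # Step 2: pad the local chunk to the maximum length and gather.
    padded = torch.zeros((max_len, *local.shape[1:]),
                         dtype=local.dtype, device=local.device)
    padded[: local.shape[0]] = local
    gathered = [torch.empty_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded, group=group)
    # Step 3: trim each received chunk back to its true length.
    return [t[: int(l)] for t, l in zip(gathered, lengths)]
```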

The overall idea would be that one can run a Heat script script.py either via mpirun -n 4 python script.py or via torchrun --nproc-per-node=4 script.py (or similar), with the required backend being chosen automatically.
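A hedged sketch of that automatic selection, based on environment variables the launchers are known to set (torchrun exports TORCHELASTIC_RUN_ID plus RANK/MASTER_ADDR; Open MPI's mpirun exports OMPI_COMM_WORLD_RANK; MPICH/Slurm set PMI_RANK or PMIX_RANK). detect_launcher is a hypothetical helper, not existing Heat API:

```python
import os

def detect_launcher() -> str:
    """Guess which launcher started this process so that Heat could pick
    the matching communication backend automatically (sketch only)."""
    # torchrun (torch elastic) sets TORCHELASTIC_RUN_ID as well as
    # RANK/WORLD_SIZE/MASTER_ADDR for the env:// init method.
    if "TORCHELASTIC_RUN_ID" in os.environ or (
        "RANK" in os.environ and "MASTER_ADDR" in os.environ
    ):
        return "torch.distributed"
    # Common MPI launchers export their own rank variables.
    if any(v in os.environ for v in ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "PMIX_RANK")):
        return "mpi4py"
    return "single-process"
```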

mrfh92 (Collaborator, Author) commented Jan 28, 2025

Some additional problems might arise, as isend and irecv seem to be the only non-blocking operations.
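For reference, isend/irecv in torch.distributed return request-like Work handles that must be waited on, much like mpi4py's Isend/Irecv returning a Request. A minimal ring-exchange sketch (ring_exchange is a hypothetical helper and assumes an initialized process group):

```python
import torch
import torch.distributed as dist

def ring_exchange(local: torch.Tensor) -> torch.Tensor:
    """Non-blocking ring shift: send to the right neighbour while receiving
    from the left one. Both isend and irecv return Work handles that must
    be waited on before the buffers are reused, analogous to mpi4py's
    Request.Wait()."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    recv_buf = torch.empty_like(local)
    send_req = dist.isend(local, dst=(rank + 1) % world_size)
    recv_req = dist.irecv(recv_buf, src=(rank - 1) % world_size)
    send_req.wait()
    recv_req.wait()
    return recv_buf
```

Note that the collectives also accept an async_op=True flag returning a comparable Work handle, which might mitigate this concern; whether that suffices for Heat's needs would have to be tested.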

Berkant03 self-assigned this Jan 29, 2025