Use torch.distributed as an alternative communication backend for Heat #1772
Related
As a first step, one might separate the mpi4py wrappers more clearly from the Heat communication structures; see, e.g., #1243 and the draft PR #1265.
Feature functionality
Currently, communication in Heat is based on MPI via mpi4py; this is very much the standard in traditional HPC. In machine learning / deep learning HPC, NCCL/RCCL communication is more common and appears to be advantageous in particular for GPU-to-GPU communication. The easiest way to support this as well and, at the same time, to improve interoperability with PyTorch would be to allow `torch.distributed` as an alternative backend for communication. So far, a communicator in Heat is actually an mpi4py communicator, and Heat's communication routines are actually mpi4py communication routines.
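For concreteness, a minimal sketch of how the same Heat-style call could be served by either backend; the class names and the single `Allreduce` method are hypothetical and heavily simplified, not Heat's actual communicator API:

```python
# Hypothetical, heavily simplified sketch; not Heat's actual communicator classes.
from mpi4py import MPI
import torch.distributed as dist


class MPICommunication:
    """Current situation: the communicator wraps an mpi4py communicator."""

    def __init__(self, comm=None):
        self.comm = comm if comm is not None else MPI.COMM_WORLD
        self.rank = self.comm.Get_rank()
        self.size = self.comm.Get_size()

    def Allreduce(self, tensor):
        # mpi4py operates in place on the buffer of a contiguous CPU tensor.
        self.comm.Allreduce(MPI.IN_PLACE, tensor.numpy(), op=MPI.SUM)
        return tensor


class TorchDistCommunication:
    """Possible alternative: the same interface backed by torch.distributed."""

    def __init__(self):
        self.rank = dist.get_rank()
        self.size = dist.get_world_size()

    def Allreduce(self, tensor):
        # Runs over NCCL (GPU) or Gloo (CPU), depending on the process group.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor
```

User code would keep calling `comm.Allreduce(...)` regardless of which of the two classes is behind `comm`.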
I would like to suggest keeping the API for Heat communication, but allowing for another backend. Fortunately, communication in torch.distributed is quite MPI-inspired. The main difference is that no `...v` operations (MPI's vector variants with per-rank counts, such as `Scatterv`) are supported; workarounds need to be created to deal with them.
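As an example of such a workaround, here is a minimal sketch of a Scatterv-like operation built from the fixed-size `scatter` by padding every chunk to a common length; the function name and interface are purely illustrative, and depending on the torch.distributed backend, `scatter` itself may only be available for certain device types:

```python
# Hypothetical sketch of a Scatterv-style workaround on top of torch.distributed:
# chunks with per-rank counts are padded to the largest count so that the
# fixed-size scatter primitive can be used, then trimmed on the receiving side.
import torch
import torch.distributed as dist


def scatterv_workaround(sendbuf, counts, recvbuf, src=0):
    """Scatter counts[i] elements of the 1-D tensor `sendbuf` (significant on
    rank `src` only) into `recvbuf` on rank i; `counts` is assumed to be known
    on every rank (in Heat it could be derived from the DNDarray metadata)."""
    rank = dist.get_rank()
    max_count = max(counts)

    # Receive into a padded buffer, then copy the valid part into recvbuf.
    padded = torch.empty(max_count, dtype=recvbuf.dtype, device=recvbuf.device)

    scatter_list = None
    if rank == src:
        # Split the send buffer according to counts and zero-pad each chunk.
        chunks = torch.split(sendbuf, counts)
        scatter_list = [
            torch.cat([c, c.new_zeros(max_count - c.numel())]) for c in chunks
        ]

    dist.scatter(padded, scatter_list, src=src)
    recvbuf.copy_(padded[: counts[rank]])
    return recvbuf
```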
The overall idea would be that one can run a Heat script `script.py` both via `mpirun -n 4 python script.py` or `torchrun --nproc-per-node=4 script.py` (or similar) and the required backend is chosen automatically.
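A rough sketch of how that automatic choice could be made from the launcher's environment; the environment variables checked here (RANK/MASTER_ADDR from torchrun, OMPI_COMM_WORLD_RANK from Open MPI's mpirun, PMI_RANK from MPICH/PMI launchers) are real, but the selection logic itself is only an assumption about how this could be done:

```python
# Sketch of automatic backend selection based on the launcher's environment.
# A real implementation would have to cover more launchers and edge cases.
import os

import torch


def detect_backend():
    # torchrun (and torch elastic) export RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    if "RANK" in os.environ and "MASTER_ADDR" in os.environ:
        return "torch.distributed"
    # Open MPI's mpirun exports OMPI_COMM_WORLD_RANK; MPICH/PMI launchers export PMI_RANK.
    if "OMPI_COMM_WORLD_RANK" in os.environ or "PMI_RANK" in os.environ:
        return "mpi"
    # Fall back to the current behaviour (mpi4py, possibly single-process).
    return "mpi"


def init_communication():
    """Initialise the detected backend and return its name (sketch only)."""
    backend = detect_backend()
    if backend == "torch.distributed":
        import torch.distributed as dist

        # init_method defaults to env://, which reads the variables set by torchrun.
        dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
        print(f"rank {dist.get_rank()} of {dist.get_world_size()} via torch.distributed")
    else:
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        print(f"rank {comm.Get_rank()} of {comm.Get_size()} via mpi4py")
    return backend
```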