Use torch.distributed as an alternative communication backend for Heat #1772
Related
As a first step, one might separate the mpi4py wrappers more clearly from the Heat communication structures; see, e.g., #1243 and the draft PR #1265.
Feature functionality
Currently, communication in Heat is based on MPI via mpi4py; this is very much the standard in traditional HPC. In machine learning / deep learning HPC, NCCL/RCCL communication is more common and appears to be advantageous in particular for GPU-to-GPU communication. The easiest way to support this as well and, at the same time, to improve interoperability with PyTorch would be to allow `torch.distributed` as an alternative backend for communication. So far, a communicator in Heat is actually an mpi4py communicator, and Heat's communication routines are actually mpi4py communication routines.
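For concreteness, a minimal sketch of how the same Heat-style call could be served by either backend; the class names and the single `Allreduce` method are hypothetical and heavily simplified, not Heat's actual communicator API:

```python
# Hypothetical, heavily simplified sketch; not Heat's actual communicator classes.
from mpi4py import MPI
import torch.distributed as dist


class MPICommunication:
    """Current situation: the communicator wraps an mpi4py communicator."""

    def __init__(self, comm=None):
        self.comm = comm if comm is not None else MPI.COMM_WORLD
        self.rank = self.comm.Get_rank()
        self.size = self.comm.Get_size()

    def Allreduce(self, tensor):
        # mpi4py operates in place on the buffer of a contiguous CPU tensor.
        self.comm.Allreduce(MPI.IN_PLACE, tensor.numpy(), op=MPI.SUM)
        return tensor


class TorchDistCommunication:
    """Possible alternative: the same interface backed by torch.distributed."""

    def __init__(self):
        self.rank = dist.get_rank()
        self.size = dist.get_world_size()

    def Allreduce(self, tensor):
        # Runs over NCCL (GPU) or Gloo (CPU), depending on the process group.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor
```

User code would keep calling `comm.Allreduce(...)` regardless of which of the two classes is behind `comm`.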
I would like to suggest keeping the API for Heat communication, but allowing for another backend. Fortunately, communication in torch.distributed is quite MPI-inspired. The main difference is that no `...v` operations (MPI's vector variants with per-rank counts, such as `Scatterv`) are supported; workarounds need to be created to deal with them.
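As an example of such a workaround, here is a minimal sketch of a Scatterv-like operation built from the fixed-size `scatter` by padding every chunk to a common length; the function name and interface are purely illustrative, and depending on the torch.distributed backend, `scatter` itself may only be available for certain device types:

```python
# Hypothetical sketch of a Scatterv-style workaround on top of torch.distributed:
# chunks with per-rank counts are padded to the largest count so that the
# fixed-size scatter primitive can be used, then trimmed on the receiving side.
import torch
import torch.distributed as dist


def scatterv_workaround(sendbuf, counts, recvbuf, src=0):
    """Scatter counts[i] elements of the 1-D tensor `sendbuf` (significant on
    rank `src` only) into `recvbuf` on rank i; `counts` is assumed to be known
    on every rank (in Heat it could be derived from the DNDarray metadata)."""
    rank = dist.get_rank()
    max_count = max(counts)

    # Receive into a padded buffer, then copy the valid part into recvbuf.
    padded = torch.empty(max_count, dtype=recvbuf.dtype, device=recvbuf.device)

    scatter_list = None
    if rank == src:
        # Split the send buffer according to counts and zero-pad each chunk.
        chunks = torch.split(sendbuf, counts)
        scatter_list = [
            torch.cat([c, c.new_zeros(max_count - c.numel())]) for c in chunks
        ]

    dist.scatter(padded, scatter_list, src=src)
    recvbuf.copy_(padded[: counts[rank]])
    return recvbuf
```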
The overall idea would be that one can run a Heat script `script.py` both via `mpirun -n 4 python script.py` or `torchrun --nproc-per-node=4 script.py` (or similar) and the required backend is chosen automatically.
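A rough sketch of how that automatic choice could be made from the launcher's environment; the environment variables checked here (RANK/MASTER_ADDR from torchrun, OMPI_COMM_WORLD_RANK from Open MPI's mpirun, PMI_RANK from MPICH/PMI launchers) are real, but the selection logic itself is only an assumption about how this could be done:

```python
# Sketch of automatic backend selection based on the launcher's environment.
# A real implementation would have to cover more launchers and edge cases.
import os

import torch


def detect_backend():
    # torchrun (and torch elastic) export RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    if "RANK" in os.environ and "MASTER_ADDR" in os.environ:
        return "torch.distributed"
    # Open MPI's mpirun exports OMPI_COMM_WORLD_RANK; MPICH/PMI launchers export PMI_RANK.
    if "OMPI_COMM_WORLD_RANK" in os.environ or "PMI_RANK" in os.environ:
        return "mpi"
    # Fall back to the current behaviour (mpi4py, possibly single-process).
    return "mpi"


def init_communication():
    """Initialise the detected backend and return its name (sketch only)."""
    backend = detect_backend()
    if backend == "torch.distributed":
        import torch.distributed as dist

        # init_method defaults to env://, which reads the variables set by torchrun.
        dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
        print(f"rank {dist.get_rank()} of {dist.get_world_size()} via torch.distributed")
    else:
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        print(f"rank {comm.Get_rank()} of {comm.Get_size()} via mpi4py")
    return backend
```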