From d4778e17d76036a9303f731a649e202237813fbd Mon Sep 17 00:00:00 2001
From: RaulPPealez
Date: Tue, 23 Jan 2024 12:48:19 +0100
Subject: [PATCH] Add note about parallel trainings hanging.

---
 README.md             | 1 +
 docs/source/usage.rst | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index f51e83431..f547cdb4d 100644
--- a/README.md
+++ b/README.md
@@ -79,6 +79,7 @@ CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf torchmd-net/examples/ET-QM9.yaml.y
 
 ### Known Limitations
 - Due to the way PyTorch Lightning calculates the number of required DDP processes, all nodes must use the same number of GPUs. Otherwise training will not start or crash.
 - We observe a 50x decrease in performance when mixing nodes with different GPU architectures (tested with RTX 2080 Ti and RTX 3090).
+- Some CUDA systems might hang during multi-GPU parallel training. Try `export NCCL_P2P_DISABLE=1`, which disables direct peer-to-peer GPU communication.
 
 ## Cite
diff --git a/docs/source/usage.rst b/docs/source/usage.rst
index cbc9209e0..0248f55c0 100644
--- a/docs/source/usage.rst
+++ b/docs/source/usage.rst
@@ -94,7 +94,7 @@ In order to train models on multiple nodes some environment variables have to be
 
 - Due to the way PyTorch Lightning calculates the number of required DDP processes, all nodes must use the same number of GPUs. Otherwise training will not start or crash.
 - We observe a 50x decrease in performance when mixing nodes with different GPU architectures (tested with RTX 2080 Ti and RTX 3090).
-
+- Some CUDA systems might hang during multi-GPU parallel training. Try ``export NCCL_P2P_DISABLE=1``, which disables direct peer-to-peer GPU communication.
 
 Developer Guide
 ---------------
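
Note for reviewers: a minimal sketch of how the workaround in the new note might be applied when launching a multi-GPU run. The `torchmd-train` invocation and the ET-QM9 config path are taken from the README example referenced in the hunk header above; the exact config filename on a given system is assumed here and may differ.

```bash
# Workaround sketch: disable NCCL's direct peer-to-peer GPU communication
# before launching, in case multi-GPU parallel training hangs on this system.
export NCCL_P2P_DISABLE=1

# Launch a two-GPU training run (config path assumed from the README example).
CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf torchmd-net/examples/ET-QM9.yaml
```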