
Add note about parallel trainings hanging.
RaulPPelaez committed Jan 23, 2024
1 parent fd9d19f commit d4778e1
Showing 2 changed files with 2 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -79,6 +79,7 @@ CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf torchmd-net/examples/ET-QM9.yaml.y
### Known Limitations
- Due to the way PyTorch Lightning calculates the number of required DDP processes, all nodes must use the same number of GPUs. Otherwise, training will either fail to start or crash.
- We observe a 50x decrease in performance when mixing nodes with different GPU architectures (tested with RTX 2080 Ti and RTX 3090).
- Some CUDA systems might hang during multi-GPU parallel training. Try `export NCCL_P2P_DISABLE=1`, which disables direct peer-to-peer GPU communication (see the sketch below).
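
For illustration, a minimal sketch of how the workaround combines with the multi-GPU launch shown in the README's own example; the config path here is a placeholder, not part of this commit:

```bash
# Disable direct peer-to-peer GPU communication; NCCL falls back to
# transfers staged through host memory, which avoids the hang on some systems.
export NCCL_P2P_DISABLE=1

# Launch multi-GPU training as usual (replace my-config.yaml with your config).
CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf my-config.yaml
```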


## Cite
2 changes: 1 addition & 1 deletion docs/source/usage.rst
@@ -94,7 +94,7 @@ In order to train models on multiple nodes some environment variables have to be

- Due to the way PyTorch Lightning calculates the number of required DDP processes, all nodes must use the same number of GPUs. Otherwise, training will either fail to start or crash.
- We observe a 50x decrease in performance when mixing nodes with different GPU architectures (tested with RTX 2080 Ti and RTX 3090).

- Some CUDA systems might hang during multi-GPU parallel training. Try ``export NCCL_P2P_DISABLE=1``, which disables direct peer-to-peer GPU communication.

Developer Guide
---------------
