diff --git a/README.md b/README.md
index f51e83431..f547cdb4d 100644
--- a/README.md
+++ b/README.md
@@ -79,6 +79,7 @@ CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf torchmd-net/examples/ET-QM9.yaml.y
 
 ### Known Limitations
 - Due to the way PyTorch Lightning calculates the number of required DDP processes, all nodes must use the same number of GPUs. Otherwise training will not start or crash.
 - We observe a 50x decrease in performance when mixing nodes with different GPU architectures (tested with RTX 2080 Ti and RTX 3090).
+- Some CUDA systems might hang during multi-GPU parallel training. Try `export NCCL_P2P_DISABLE=1`, which disables direct peer-to-peer GPU communication.
 
 ## Cite
diff --git a/docs/source/usage.rst b/docs/source/usage.rst
index cbc9209e0..0248f55c0 100644
--- a/docs/source/usage.rst
+++ b/docs/source/usage.rst
@@ -94,7 +94,7 @@ In order to train models on multiple nodes some environment variables have to be
 
 - Due to the way PyTorch Lightning calculates the number of required DDP processes, all nodes must use the same number of GPUs. Otherwise training will not start or crash.
 - We observe a 50x decrease in performance when mixing nodes with different GPU architectures (tested with RTX 2080 Ti and RTX 3090).
-
+- Some CUDA systems might hang during multi-GPU parallel training. Try ``export NCCL_P2P_DISABLE=1``, which disables direct peer-to-peer GPU communication.
 
 Developer Guide
 ---------------
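
As context for the note added above, the workaround is applied in the shell before launching training. This is a minimal sketch that reuses the multi-GPU launch command shown in the README's hunk header; the config path is a placeholder, not part of this change:

```bash
# Disable direct peer-to-peer GPU communication in NCCL; works around
# hangs seen on some CUDA systems during multi-GPU parallel training.
export NCCL_P2P_DISABLE=1

# Then launch training on two GPUs as usual (config path is a placeholder).
CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf examples/ET-QM9.yaml
```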