diff --git a/README.md b/README.md
index f51e83431..f547cdb4d 100644
--- a/README.md
+++ b/README.md
@@ -79,6 +79,7 @@ CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf torchmd-net/examples/ET-QM9.yaml.y
 
 ### Known Limitations
 - Due to the way PyTorch Lightning calculates the number of required DDP processes, all nodes must use the same number of GPUs. Otherwise training will not start or crash.
 - We observe a 50x decrease in performance when mixing nodes with different GPU architectures (tested with RTX 2080 Ti and RTX 3090).
+- Some CUDA systems might hang during multi-GPU parallel training. Try `export NCCL_P2P_DISABLE=1`, which disables direct peer-to-peer GPU communication.
 
 ## Cite
diff --git a/docs/source/usage.rst b/docs/source/usage.rst
index cbc9209e0..0248f55c0 100644
--- a/docs/source/usage.rst
+++ b/docs/source/usage.rst
@@ -94,7 +94,7 @@ In order to train models on multiple nodes some environment variables have to be
 
 - Due to the way PyTorch Lightning calculates the number of required DDP processes, all nodes must use the same number of GPUs. Otherwise training will not start or crash.
 - We observe a 50x decrease in performance when mixing nodes with different GPU architectures (tested with RTX 2080 Ti and RTX 3090).
-
+- Some CUDA systems might hang during multi-GPU parallel training. Try ``export NCCL_P2P_DISABLE=1``, which disables direct peer-to-peer GPU communication.
 
 Developer Guide
 ---------------
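
As context for the note added above, the workaround is applied in the shell before launching training. This is a minimal sketch that reuses the multi-GPU launch command shown in the README's hunk header; the config path is a placeholder, not part of this change:

```bash
# Disable direct peer-to-peer GPU communication in NCCL; works around
# hangs seen on some CUDA systems during multi-GPU parallel training.
export NCCL_P2P_DISABLE=1

# Then launch training on two GPUs as usual (config path is a placeholder).
CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf examples/ET-QM9.yaml
```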