
Add note about parallel trainings hanging.
RaulPPelaez committed Jan 23, 2024
1 parent fd9d19f commit d4778e1
Showing 2 changed files with 2 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -79,6 +79,7 @@ CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf torchmd-net/examples/ET-QM9.yaml.y
### Known Limitations
- Due to the way PyTorch Lightning calculates the number of required DDP processes, all nodes must use the same number of GPUs. Otherwise, training will either fail to start or crash.
- We observe a 50x decrease in performance when mixing nodes with different GPU architectures (tested with RTX 2080 Ti and RTX 3090).
- Some CUDA systems might hang during multi-GPU parallel training. Try `export NCCL_P2P_DISABLE=1`, which disables direct peer-to-peer GPU communication (see the sketch below).
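
For illustration, a minimal sketch of how the workaround combines with the multi-GPU launch shown in the README's own example; the config path here is a placeholder, not part of this commit:

```bash
# Disable direct peer-to-peer GPU communication; NCCL falls back to
# transfers staged through host memory, which avoids the hang on some systems.
export NCCL_P2P_DISABLE=1

# Launch multi-GPU training as usual (replace my-config.yaml with your config).
CUDA_VISIBLE_DEVICES=0,1 torchmd-train --conf my-config.yaml
```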


## Cite
2 changes: 1 addition & 1 deletion docs/source/usage.rst
@@ -94,7 +94,7 @@ In order to train models on multiple nodes some environment variables have to be

- Due to the way PyTorch Lightning calculates the number of required DDP processes, all nodes must use the same number of GPUs. Otherwise, training will either fail to start or crash.
- We observe a 50x decrease in performance when mixing nodes with different GPU architectures (tested with RTX 2080 Ti and RTX 3090).

- Some CUDA systems might hang during multi-GPU parallel training. Try ``export NCCL_P2P_DISABLE=1``, which disables direct peer-to-peer GPU communication.

Developer Guide
---------------
