Replies: 1 comment 1 reply
-
Hi @jjmqz,

For the first issue, it seems that the process already running on GPU 0 is occupying the default rendezvous port 29500, which is why the job fails immediately. You can work around this by specifying a free port explicitly, e.g.:

torchrun --nproc_per_node=4 --master_port=29501 --no-python dp --pt train input.json

For the second issue, it is unlikely to be a bug in DeePMD-kit, as the same code runs successfully on 4 RTX 3070s and does not contain hardware-specific parameters. So far I can only recommend verifying that your CUDA and cuDNN versions are compatible with your current setup, and ensuring that your environment is properly configured and all dependencies are correctly installed. (Maybe you can run some simple multi-GPU programs to verify this, e.g. training the example water input with torchrun on the 4 RTX 4090s; a minimal sanity-check sketch is given below.)

Hope this helps!
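For reference, the "simple multi-card program" mentioned above could also be a generic torch.distributed all-reduce script rather than a full DeePMD-kit run. The sketch below is only illustrative; the file name sanity_check.py is an assumption and is not part of DeePMD-kit.

# sanity_check.py -- minimal multi-GPU all-reduce test (illustrative only, not from DeePMD-kit).
# Launch with, e.g.: torchrun --nproc_per_node=4 --master_port=29501 sanity_check.py
# Each rank places its rank index on its own GPU and sums across all ranks;
# with 4 GPUs every rank should print the same total, 0+1+2+3 = 6.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # torchrun provides rank/world-size env vars
    local_rank = int(os.environ["LOCAL_RANK"])   # set per process by torchrun
    torch.cuda.set_device(local_rank)

    x = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)     # exercises NCCL communication across GPUs
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce sum = {x.item():.0f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If a script like this hangs or crashes on the RTX 4090 node but runs on the RTX 3070 node, the problem is most likely in the CUDA/NCCL/driver stack rather than in DeePMD-kit.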
-
Dear Developers,
I am using DeePMD-kit v3.0.0b3 for pretraining and fine-tuning with DPA-2. The software was installed offline against CUDA 11.8, which matches the CUDA version on my system. Pretraining completes successfully on a single GPU; however, parallel training on multiple GPUs fails. Could you please help me resolve this issue? Thank you for your assistance.
The command is: torchrun --nproc_per_node=4 --no-python dp --pt train input.json
The input.json is as follows:
input.json
When pretraining with the RTX 3070s, the job runs normally if GPU 0 is idle; the log file is as follows:
RTX3070_1.log
However, if GPU 0 is in use, submitting the job results in an immediate error:
RTX3070_2.log
When pretraining with the RTX 4090s, if GPU 0 is idle, the job runs for about 30 minutes with no output before encountering an error. The log is as follows:
RTX4090_1.log
If GPU 0 is in use, submitting the job fails immediately with the same error message as on the RTX 3070s with GPU 0 in use:
RTX4090_2.log