Replies: 1 comment 1 reply
-
Hi @jjmqz,

For the first issue, it seems that the process already running on GPU 0 is occupying the default rendezvous port 29500, which is why the job fails immediately. You can work around this by specifying a free port explicitly, e.g.:

torchrun --nproc_per_node=4 --master_port=29501 --no-python dp --pt train input.json

For the second issue, it is unlikely to be a bug in DeePMD-kit, as the same code runs successfully on 4 RTX 3070s and does not contain hardware-specific parameters. So far I can only recommend verifying that your CUDA and cuDNN versions are compatible with your current setup, and ensuring that your environment is properly configured and all dependencies are correctly installed. (Maybe you can run some simple multi-GPU programs to verify this, e.g. training the example water input with torchrun on the 4 RTX 4090s; a minimal sanity-check sketch is given below.)

Hope this helps!
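For reference, the "simple multi-card program" mentioned above could also be a generic torch.distributed all-reduce script rather than a full DeePMD-kit run. The sketch below is only illustrative; the file name sanity_check.py is an assumption and is not part of DeePMD-kit.

# sanity_check.py -- minimal multi-GPU all-reduce test (illustrative only, not from DeePMD-kit).
# Launch with, e.g.: torchrun --nproc_per_node=4 --master_port=29501 sanity_check.py
# Each rank places its rank index on its own GPU and sums across all ranks;
# with 4 GPUs every rank should print the same total, 0+1+2+3 = 6.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # torchrun provides rank/world-size env vars
    local_rank = int(os.environ["LOCAL_RANK"])   # set per process by torchrun
    torch.cuda.set_device(local_rank)

    x = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)     # exercises NCCL communication across GPUs
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce sum = {x.item():.0f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If a script like this hangs or crashes on the RTX 4090 node but runs on the RTX 3070 node, the problem is most likely in the CUDA/NCCL/driver stack rather than in DeePMD-kit.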
-
Dear Developers,
I am using DeePMD-kit v3.0.0b3 for pretraining and fine-tuning with DPA-2. The software was installed offline against CUDA 11.8, which matches the CUDA version on my system. Pretraining completes successfully on a single GPU; however, parallel training on multiple GPUs fails. Could you please help me resolve this issue? Thank you for your assistance.
The command is: torchrun --nproc_per_node=4 --no-python dp --pt train input.json
The input.json is as follows:
input.json
When pretraining with the RTX 3070s, the job runs normally if GPU 0 is idle; the log file is as follows:
RTX3070_1.log
However, if GPU 0 is in use, submitting the job results in an immediate error:
RTX3070_2.log
When pretraining with the RTX 4090s, if GPU 0 is idle, the job runs for about 30 minutes with no output before encountering an error. The log is as follows:
RTX4090_1.log
If GPU 0 is in use, submitting the job fails immediately with the same error message as on the RTX 3070s with GPU 0 in use:
RTX4090_2.log