
Perf: optimize the stream strategy in module_gint #5845

Merged — 3 commits into deepmodeling:develop on Jan 10, 2025

Conversation

@dzzz2001 (Collaborator) commented Jan 10, 2025

Background

While testing the LCAO GPU version of ABACUS on an A800 GPU, I noticed a significant performance difference between launch configurations on a machine with only 16 cores. Specifically, `OMP_NUM_THREADS=4 mpirun -n 4` and `OMP_NUM_THREADS=1 mpirun -n 16` behave very differently: the cal_gint routines can be approximately 8 times slower with the latter. Below are the runtime statistics I collected (the test case is si256):

| Command | cal_gint_vl | cal_gint_rho | cal_gint_force |
| --- | --- | --- | --- |
| `OMP_NUM_THREADS=4 mpirun -n 4` | 15.49 | 14.39 | 2.30 |
| `OMP_NUM_THREADS=1 mpirun -n 16` | 114.35 | 113.6 | 19.25 |

After reviewing the code, I discovered that the significant performance difference might be due to the OpenMP thread-setting strategy in the GPU code of module_gint:
[screenshot: the OpenMP thread-setting code in the GPU path of module_gint]
From the code, it is evident that the grid-integration code launches num_stream OpenMP threads (where num_stream is typically 4) regardless of whether the system has enough free cores. With several MPI ranks per node, the total thread count can then exceed the available cores, and the resulting oversubscription costs efficiency. I therefore modified the thread settings here to avoid oversubscription; a sketch of the idea follows.
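For illustration, here is a minimal sketch of the clamping idea in C++/OpenMP. The function name `choose_gint_threads` is hypothetical, not the actual code in module_gint:

```cpp
#include <algorithm>
#include <omp.h>

// Hypothetical helper illustrating the fix: clamp the per-rank thread count
// to what is actually available instead of always using num_streams.
// omp_get_max_threads() respects OMP_NUM_THREADS, so under
// OMP_NUM_THREADS=1 mpirun -n 16 each rank now runs 1 thread, not 4.
int choose_gint_threads(int num_streams)
{
    return std::min(num_streams, omp_get_max_threads());
}
```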
Additionally, the stream synchronization strategy in module_gint was previously rather coarse. I have reworked it to use CUDA events for finer-grained synchronization, which yields some additional performance gains (a sketch appears after the table below). After completing all modifications, I re-measured the runtime for the same test cases, with the following results:

| Command | cal_gint_vl | cal_gint_rho | cal_gint_force |
| --- | --- | --- | --- |
| `OMP_NUM_THREADS=4 mpirun -n 4` | 10.99 | 9.97 | 3.99 |
| `OMP_NUM_THREADS=1 mpirun -n 16` | 28.60 | 28.60 | 9.14 |
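As a rough illustration of event-based stream synchronization (assumed names and structure, not the exact PR code): each stream records an event after its kernel batch, and dependent work waits on that event on the device instead of blocking the host with a device-wide synchronization.

```cpp
#include <cuda_runtime.h>

// Illustrative only: per-stream events replace a coarse cudaDeviceSynchronize().
void sync_streams_with_events(cudaStream_t* streams, int num_streams)
{
    // One event per stream; timing is disabled since we only need ordering.
    cudaEvent_t* events = new cudaEvent_t[num_streams];
    for (int i = 0; i < num_streams; ++i)
    {
        cudaEventCreateWithFlags(&events[i], cudaEventDisableTiming);
        // ... enqueue this stream's grid-integration kernels here ...
        cudaEventRecord(events[i], streams[i]); // mark completion of this batch
    }
    // Downstream work queued on streams[0] waits on the events on the GPU;
    // the host thread is never blocked, unlike cudaStreamSynchronize.
    for (int i = 1; i < num_streams; ++i)
    {
        cudaStreamWaitEvent(streams[0], events[i], 0);
    }
    for (int i = 0; i < num_streams; ++i)
    {
        cudaEventDestroy(events[i]); // destruction is deferred until the event completes
    }
    delete[] events;
}
```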

@mohanchen added the GPU & DCU & HPC (GPU and DCU and HPC related issues) and Refactor (Refactor ABACUS codes) labels on Jan 10, 2025
@mohanchen merged commit 16714c6 into deepmodeling:develop on Jan 10, 2025
14 checks passed