Feature: enable multi-k calculation for CUDA version of module_gint (deepmodeling#4839)

* add find_matrix_offset function to Class hcontainer

* modify dm matrix in gint_rho_gpu.cu

* modify dm matrix in gint_force_gpu.cu

* remove GPU restriction in multi-k

* modify some function names

* modify hRGint in gint_vl_gpu.cu

* enable multi-k calculation in gint_vl_gpu

* add const

* modify related doc

* add two test cases for multi-k calculation

* fix an error

* replace a test case

* remove parameter "gamma_only" in get_device_flag

* modify cuda.md

* modify cuda.md again

* Update docs/advanced/acceleration/cuda.md

Co-authored-by: Chun Cai <[email protected]>

* Update docs/advanced/acceleration/cuda.md

Co-authored-by: Chun Cai <[email protected]>

* Update docs/advanced/input_files/input-main.md

Co-authored-by: Chun Cai <[email protected]>

* Update cuda.md

* Update cuda.md

---------

Co-authored-by: Chun Cai <[email protected]>
dzzz2001 and caic99 authored Aug 1, 2024
1 parent f7ca386 commit a83a3b6
Showing 33 changed files with 462 additions and 380 deletions.
13 changes: 5 additions & 8 deletions docs/advanced/acceleration/cuda.md
@@ -6,13 +6,13 @@ In ABACUS, we provide the option to use GPU devices to accelerate performance. T

- **Electronic state data**: (e.g. electronic density) are moved from the GPU to the CPU(s) every scf step.

- **Acclerated by the NVIDIA libraries**: `cuBLAS` for common linear algebra calculations, `cuSolver` for eigen values/vectors, and `cuFFT` for the conversions between the real and recip spaces.
- **Accelerated by the NVIDIA libraries**: `cuBLAS` for common linear algebra calculations, `cuSolver` for eigen values/vectors, and `cuFFT` for the conversions between the real and recip spaces.

- **Multi GPU supprted**: Using multiple MPI tasks will often give the best performance. Note each MPI task will be bind to a GPU device with automatically computing load balancing.

- **Parallel strategy**: K point parallel.

Unlike PW basis, only the grid integration module (module_gint) and the diagonalization of the Hamiltonian matrix (module_hsolver) have been implemented with GPU acceleration under LCAO basis, and the acceleration is limited to gamma only calculation. Additionally, LCAO basis does not support multi-GPU acceleration. Both the grid integration module and the Hamiltonian matrix solver only support acceleration on a single GPU.
Unlike PW basis, only the grid integration module (module_gint) and the diagonalization of the Hamiltonian matrix (module_hsolver) have been implemented with GPU acceleration under LCAO basis.

## Required hardware/software

Expand All @@ -31,17 +31,14 @@ Check the [Advanced Installation Options](https://abacus-rtd.readthedocs.io/en/l

## Run with the GPU support by editing the INPUT script:

In `INPUT` file we need to set the value keyword [device](../input_files/input-main.md#device) to be `gpu`.
In `INPUT` file we need to set the input parameter [device](../input_files/input-main.md#device) to `gpu`. If this parameter is not set, ABACUS will try to determine if there are available GPUs.
- Set `ks_solver`: For the PW basis, CG, BPCG and Davidson methods are supported on GPU; set the input parameter [ks_solver](../input_files/input-main.md#ks_solver) to `cg`, `bpcg` or `dav`. For the LCAO basis, `cusolver` is supported on GPU.
- **multi-card**: ABACUS allows for multi-GPU acceleration. If you have multiple GPU cards, you can run ABACUS with several MPI processes, and each process will utilize one GPU card. For example, the command `mpirun -n 2 abacus` will by default launch two GPUs for computation. If you only have one card, this command will only start one GPU.

## Examples
We provides [examples](https://github.com/deepmodeling/abacus-develop/tree/develop/examples/gpu) of gpu calculations.

## Known limitations
PW basis:
- CG, BPCG and Davidson methods are supported, so the input keyword `ks_solver` can take the values `cg`, `bpcg` or `dav`.
- Only k point parallelization is supported, so the input keyword `kpar` will be set to match the number of MPI tasks automatically.
- By default, CUDA architectures 60, 70, 75, 80, 86, and 89 are compiled (if supported). It can be overriden using the CMake variable [`CMAKE_CUDA_ARCHITECTURES`](https://cmake.org/cmake/help/latest/variable/CMAKE_CUDA_ARCHITECTURES.html) or the environmental variable [`CUDAARCHS`](https://cmake.org/cmake/help/latest/envvar/CUDAARCHS.html).

LCAO basis:
- Does not support multi-k calculation, so if the input keyword `device` is set to `gpu`, the input keyword `gamma_only` can only take the value `1`.
- Does not support multi-GPU acceleration.
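Putting the cuda.md changes above together, a minimal `INPUT` sketch for a multi-k LCAO run on GPU would look like the following; the specific values (the `cusolver` solver, the two-task `mpirun` line mentioned afterwards) are illustrative choices, not taken from this commit:

```
INPUT_PARAMETERS
# minimal sketch of a multi-k LCAO calculation on GPU (illustrative values)
basis_type   lcao
gamma_only   0          # multi-k on GPU is what this commit enables
device       gpu        # see docs/advanced/input_files/input-main.md#device
ks_solver    cusolver   # the LCAO solver supported on GPU per cuda.md
```

A `KPT` file with the desired k-point mesh accompanies the `INPUT` file as usual, and with several GPU cards one MPI task is launched per card, e.g. `mpirun -n 2 abacus`, as the updated multi-card paragraph above describes.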
2 changes: 1 addition & 1 deletion docs/advanced/input_files/input-main.md
@@ -629,7 +629,7 @@ If only one value is set (such as `kspacing 0.5`), then kspacing values of a/b/c
- cpu: for CPUs via Intel, AMD, or Other supported CPU devices
- gpu: for GPUs via CUDA or ROCm.

Known limitations: If using the pw basis, the ks_solver must be cg/bpcg/dav to support `gpu` acceleration. If using the lcao basis, `gamma_only` must be set to `1`, as multi-k calculation is currently not supported for `gpu`. lcao_in_pw also does not support `gpu`.
Known limitations: `ks_solver` must also be set to the algorithms supported. lcao_in_pw currently does not support `gpu`.

- **Default**: cpu

7 changes: 1 addition & 6 deletions source/module_base/module_device/device.cpp
@@ -149,8 +149,7 @@ int set_device_by_rank(const MPI_Comm mpi_comm) {

std::string get_device_flag(const std::string &device,
const std::string &ks_solver,
const std::string &basis_type,
const bool &gamma_only) {
const std::string &basis_type) {
if (device == "cpu") {
return "cpu"; // no extra checks required
}
@@ -181,10 +180,6 @@ if (basis_type == "lcao_in_pw") {
error_message +=
"The GPU currently does not support the basis type \"lcao_in_pw\"!";
}
if (basis_type == "lcao" && gamma_only == false) {
error_message += "The GPU currently does not support the basis type "
"\"lcao\" with \"gamma_only\" set to \"0\"!";
}
if(error_message.empty())
{
return "gpu"; // possibly automatically set to GPU
3 changes: 1 addition & 2 deletions source/module_base/module_device/device.h
@@ -48,8 +48,7 @@ int get_device_kpar(const int& kpar);
*/
std::string get_device_flag(const std::string& device,
const std::string& ks_solver,
const std::string& basis_type,
const bool& gamma_only);
const std::string& basis_type);

#if __MPI
int get_node_rank();
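Taken together, the device.cpp and device.h hunks above drop the `gamma_only` argument from `get_device_flag`. Below is a condensed sketch of the validation flow after this change — error text shortened, the `ks_solver` checks and MPI bookkeeping of the full function omitted, and the CPU fallback on error is an assumption rather than something shown in this diff:

```cpp
#include <string>

// Sketch (not the full ABACUS function): decide whether the GPU can be used.
// The former rejection of basis_type == "lcao" with gamma_only == false is
// gone, which is what allows multi-k LCAO runs on GPU.
std::string get_device_flag_sketch(const std::string& device,
                                   const std::string& ks_solver,
                                   const std::string& basis_type)
{
    if (device == "cpu") {
        return "cpu"; // no extra checks required
    }
    (void)ks_solver; // solver-specific checks omitted in this sketch

    std::string error_message;
    if (basis_type == "lcao_in_pw") {
        error_message += "The GPU currently does not support the basis type \"lcao_in_pw\"!";
    }
    if (error_message.empty()) {
        return "gpu"; // possibly automatically set to GPU
    }
    return "cpu"; // assumption: fall back to CPU when a check fails
}
```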
8 changes: 4 additions & 4 deletions source/module_hamilt_lcao/module_gint/gint.cpp
@@ -45,16 +45,16 @@ void Gint::cal_gint(Gint_inout* inout) {
}
if (max_size > 0) {
#ifdef __CUDA
if (GlobalV::device_flag == "gpu" && GlobalV::GAMMA_ONLY_LOCAL
if (GlobalV::device_flag == "gpu"
&& (inout->job == Gint_Tools::job_type::vlocal
|| inout->job == Gint_Tools::job_type::rho
|| inout->job == Gint_Tools::job_type::force)) {
if (inout->job == Gint_Tools::job_type::vlocal) {
gamma_gpu_vlocal_interface(inout);
gpu_vlocal_interface(inout);
} else if (inout->job == Gint_Tools::job_type::rho) {
gamma_gpu_rho_interface(inout);
gpu_rho_interface(inout);
} else if (inout->job == Gint_Tools::job_type::force) {
gamma_gpu_force_interface(inout);
gpu_force_interface(inout);
}
} else
#endif
6 changes: 3 additions & 3 deletions source/module_hamilt_lcao/module_gint/gint.h
@@ -71,11 +71,11 @@ class Gint {
int ny, nplane, startz_current; // from rhopw

// in cal_gint_gpu.cpp
void gamma_gpu_vlocal_interface(Gint_inout* inout);
void gpu_vlocal_interface(Gint_inout* inout);

void gamma_gpu_rho_interface(Gint_inout* inout);
void gpu_rho_interface(Gint_inout* inout);

void gamma_gpu_force_interface(Gint_inout* inout);
void gpu_force_interface(Gint_inout* inout);

// in cal_gint_cpu.cpp

41 changes: 9 additions & 32 deletions source/module_hamilt_lcao/module_gint/gint_force_gpu.cu
@@ -20,7 +20,7 @@ namespace GintKernel
* 6. force dot on the GPU.
* 7. Copy the results back to the host.
*/
void gint_fvl_gamma_gpu(hamilt::HContainer<double>* dm,
void gint_fvl_gpu(const hamilt::HContainer<double>* dm,
const double* vlocal,
double* force_in,
double* stress_in,
@@ -82,36 +82,12 @@ void gint_fvl_gamma_gpu(hamilt::HContainer<double>* dm,
Cuda_Mem_Wrapper<double> force(3 * nat, num_streams, true);
Cuda_Mem_Wrapper<double> stress(6, num_streams, true);

Cuda_Mem_Wrapper<double> dm_matrix(lgd * lgd, 1, true);
for (int iat1 = 0; iat1 < ucell.nat; iat1++)
{
for (int iat2 = 0; iat2 < ucell.nat; iat2++)
{
const int it1 = ucell.iat2it[iat1];
const int it2 = ucell.iat2it[iat2];
const int lo1
= gridt.trace_lo[ucell.itiaiw2iwt(it1, ucell.iat2ia[iat1], 0)];
const int lo2
= gridt.trace_lo[ucell.itiaiw2iwt(it2, ucell.iat2ia[iat2], 0)];

hamilt::AtomPair<double>* tmp_ap = dm->find_pair(iat1, iat2);
int orb_index = 0;
if (tmp_ap == NULL)
{
continue;
}
for (int orb_i = 0; orb_i < tmp_ap->get_row_size(); orb_i++)
{
for (int orb_j = 0; orb_j < tmp_ap->get_col_size(); orb_j++)
{
dm_matrix.get_host_pointer()[(lo1 + orb_i) * lgd + (lo2 + orb_j)]
= tmp_ap->get_pointer(0)[orb_index];
orb_index++;
}
}
}
}
dm_matrix.copy_host_to_device_sync();
Cuda_Mem_Wrapper<double> dm_matrix(dm->get_nnr(), 1, false);
// retrieve the density matrix on the host
checkCuda(cudaMemcpy(dm_matrix.get_device_pointer(),
dm->get_wrapper(),
dm->get_nnr() * sizeof(double),
cudaMemcpyHostToDevice));

#pragma omp parallel for num_threads(num_streams) collapse(2)
for (int i = 0; i < gridt.nbx; i++)
@@ -142,7 +118,8 @@ void gint_fvl_gamma_gpu(hamilt::HContainer<double>* dm,
dr_part.get_host_pointer(sid),
vldr3.get_host_pointer(sid));

alloc_mult_force(gridt,
alloc_mult_force(dm,
gridt,
ucell,
grid_index_ij,
max_atom,
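The key change in this file — mirrored below in gint_rho_gpu.cu — is that the density matrix is no longer scattered pair by pair into a dense lgd x lgd buffer; instead the HContainer's contiguous storage (`dm->get_wrapper()`, `dm->get_nnr()` doubles) is copied to the device in a single `cudaMemcpy`. A toy, self-contained illustration of why one flat copy is enough follows; this is not ABACUS code, and the `find_matrix_offset` helper named in the commit message is only assumed to play the role of the offset lookup shown here:

```cpp
#include <cstddef>
#include <vector>

// Toy model of an HContainer-like density matrix: every <atom pair, R vector>
// block is stored back to back in one contiguous array of nnr doubles, so a
// single host-to-device memcpy moves the whole matrix, multi-k included.
// Individual blocks are then located by an offset lookup.
struct FlatDensityMatrix
{
    std::vector<double> data;          // all blocks, get_nnr() doubles in total
    std::vector<std::size_t> offsets;  // offsets[block_index] -> first element

    const double* block(std::size_t block_index) const
    {
        return data.data() + offsets[block_index];
    }
};
```

This also appears to be why `alloc_mult_force` and `alloc_mult_dot_rho` now receive `dm` directly: presumably the task builders resolve each pair's offset into the flat device array rather than indexing a dense matrix.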
5 changes: 3 additions & 2 deletions source/module_hamilt_lcao/module_gint/gint_force_gpu.h
@@ -5,7 +5,7 @@
#include "module_hamilt_lcao/module_gint/grid_technique.h"
namespace GintKernel
{
void gint_fvl_gamma_gpu(hamilt::HContainer<double>* dm,
void gint_fvl_gpu(const hamilt::HContainer<double>* dm,
const double* vlocal,
double* force_in,
double* stress_in,
@@ -29,7 +29,8 @@ void gtask_force(const Grid_Technique& gridt,
double* dr_part,
double* vldr3);

void alloc_mult_force(const Grid_Technique& gridt,
void alloc_mult_force(const hamilt::HContainer<double>* dm,
const Grid_Technique& gridt,
const UnitCell& ucell,
const int grid_index_ij,
const int max_atom,
27 changes: 15 additions & 12 deletions source/module_hamilt_lcao/module_gint/gint_gpu_interface.cpp
@@ -5,7 +5,7 @@
#include "module_base/memory.h"
#include "module_base/timer.h"

void Gint::gamma_gpu_vlocal_interface(Gint_inout* inout) {
void Gint::gpu_vlocal_interface(Gint_inout* inout) {
ModuleBase::TITLE("Gint_interface", "cal_gint_vlocal");
ModuleBase::timer::tick("Gint_interface", "cal_gint_vlocal");

@@ -17,19 +17,22 @@ void Gint::gamma_gpu_vlocal_interface(Gint_inout* inout) {
ylmcoef[i] = ModuleBase::Ylm::ylmcoef[i];
}

GintKernel::gint_gamma_vl_gpu(this->hRGint,
inout->vl,
ylmcoef,
dr,
this->gridt->rcuts.data(),
*this->gridt,
ucell);
double* pvpR = GlobalV::GAMMA_ONLY_LOCAL ? nullptr : this->pvpR_reduced[inout->ispin];
GintKernel::gint_vl_gpu(this->hRGint,
inout->vl,
ylmcoef,
dr,
this->gridt->rcuts.data(),
*this->gridt,
ucell,
pvpR,
GlobalV::GAMMA_ONLY_LOCAL);

ModuleBase::TITLE("Gint_interface", "cal_gint_vlocal");
ModuleBase::timer::tick("Gint_interface", "cal_gint_vlocal");
}

void Gint::gamma_gpu_rho_interface(Gint_inout* inout) {
void Gint::gpu_rho_interface(Gint_inout* inout) {
ModuleBase::TITLE("Gint_interface", "cal_gint_rho");
ModuleBase::timer::tick("Gint_interface", "cal_gint_rho");

@@ -43,7 +46,7 @@ void Gint::gamma_gpu_rho_interface(Gint_inout* inout) {
int nrxx = this->gridt->ncx * this->gridt->ncy * this->nplane;
for (int is = 0; is < GlobalV::NSPIN; ++is) {
ModuleBase::GlobalFunc::ZEROS(inout->rho[is], nrxx);
GintKernel::gint_gamma_rho_gpu(this->DMRGint[is],
GintKernel::gint_rho_gpu(this->DMRGint[is],
ylmcoef,
dr,
this->gridt->rcuts.data(),
Expand All @@ -55,7 +58,7 @@ void Gint::gamma_gpu_rho_interface(Gint_inout* inout) {
ModuleBase::timer::tick("Gint_interface", "cal_gint_rho");
}

void Gint::gamma_gpu_force_interface(Gint_inout* inout) {
void Gint::gpu_force_interface(Gint_inout* inout) {
ModuleBase::TITLE("Gint_interface", "cal_gint_force");
ModuleBase::timer::tick("Gint_interface", "cal_gint_force");

@@ -74,7 +77,7 @@ void Gint::gamma_gpu_force_interface(Gint_inout* inout) {
if (isforce || isstress) {
std::vector<double> force(nat * 3, 0.0);
std::vector<double> stress(6, 0.0);
GintKernel::gint_fvl_gamma_gpu(this->DMRGint[inout->ispin],
GintKernel::gint_fvl_gpu(this->DMRGint[inout->ispin],
inout->vl,
force.data(),
stress.data(),
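One detail worth noting in the interface hunks above: `gpu_vlocal_interface` now passes both `hRGint` and a `pvpR` pointer (null for Γ-only, `pvpR_reduced[ispin]` otherwise) together with the `GAMMA_ONLY_LOCAL` flag, so the kernel can pick the right output container. The toy below sketches that null-means-gamma-only convention as it reads from the diff; it is a plausible reading and not ABACUS code:

```cpp
#include <cassert>
#include <cstddef>

// Toy sketch of the output-selection convention suggested by the new
// gint_vl_gpu call: a null pvpR goes with gamma-only runs (results kept in
// the hRGint container), a non-null pvpR with multi-k runs (results
// accumulated into the reduced real-space array of the current spin).
void accumulate_vl(double* hR_values, double* pvpR, bool is_gamma_only,
                   const double* contribution, std::size_t n)
{
    double* dst = is_gamma_only ? hR_values : pvpR;
    assert(dst != nullptr); // exactly one target is expected to be valid
    for (std::size_t i = 0; i < n; ++i)
    {
        dst[i] += contribution[i];
    }
}
```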
36 changes: 7 additions & 29 deletions source/module_hamilt_lcao/module_gint/gint_rho_gpu.cu
@@ -9,7 +9,7 @@
namespace GintKernel
{

void gint_gamma_rho_gpu(const hamilt::HContainer<double>* dm,
void gint_rho_gpu(const hamilt::HContainer<double>* dm,
const double* ylmcoef_now,
const double dr,
const double* rcut,
@@ -60,35 +60,12 @@ void gint_gamma_rho_gpu(const hamilt::HContainer<double>* dm,
Cuda_Mem_Wrapper<double> rho_g(num_mcell_on_proc, 1, false);
Cuda_Mem_Wrapper<double*> dot_product(nbzp * gridt.bxyz, num_streams, true);

Cuda_Mem_Wrapper<double> dm_matrix(lgd * lgd, 1, true);
Cuda_Mem_Wrapper<double> dm_matrix(dm->get_nnr(), 1, false);
// retrieve the density matrix on the host
for (int iat1 = 0; iat1 < ucell.nat; iat1++)
{
for (int iat2 = 0; iat2 < ucell.nat; iat2++)
{
const int it1 = ucell.iat2it[iat1];
const int it2 = ucell.iat2it[iat2];
const int lo1 = gridt.trace_lo[ucell.itiaiw2iwt(it1, ucell.iat2ia[iat1], 0)];
const int lo2 = gridt.trace_lo[ucell.itiaiw2iwt(it2, ucell.iat2ia[iat2], 0)];

hamilt::AtomPair<double>* tmp_ap = dm->find_pair(iat1, iat2);
int orb_index = 0;
if (tmp_ap == NULL)
{
continue;
}
for (int orb_i = 0; orb_i < tmp_ap->get_row_size(); orb_i++)
{
for (int orb_j = 0; orb_j < tmp_ap->get_col_size(); orb_j++)
{
dm_matrix.get_host_pointer()[(lo1 + orb_i) * lgd + (lo2 + orb_j)]
= tmp_ap->get_pointer(0)[orb_index];
orb_index++;
}
}
}
}
dm_matrix.copy_host_to_device_sync();
checkCuda(cudaMemcpy(dm_matrix.get_device_pointer(),
dm->get_wrapper(),
dm->get_nnr() * sizeof(double),
cudaMemcpyHostToDevice));

// calculate the rho for every nbzp bigcells
#pragma omp parallel for num_threads(num_streams) collapse(2)
@@ -120,6 +97,7 @@ void gint_gamma_rho_gpu(const hamilt::HContainer<double>* dm,
atoms_per_z);

alloc_mult_dot_rho(
dm,
gridt,
ucell,
grid_index_ij,
35 changes: 3 additions & 32 deletions source/module_hamilt_lcao/module_gint/gint_rho_gpu.h
@@ -21,7 +21,7 @@ namespace GintKernel
* @param ucell UnitCell.
* @param rho rho.
*/
void gint_gamma_rho_gpu(const hamilt::HContainer<double>* dm,
void gint_rho_gpu(const hamilt::HContainer<double>* dm,
const double* ylmcoef_now,
const double dr,
const double* rcut,
@@ -37,37 +37,8 @@ void gtask_rho(const Grid_Technique& gridt,
int* atoms_num_info,
int& atoms_per_z);

/**
* Allocate resources and perform matrix multiplication and vector dot products
* for computing the rho.
*
* @param gridt Grid_Technique object containing grid information.
* @param ucell UnitCell object containing unit cell information.
* @param gpu_mat_cal_flag Flags indicating which parts of the calculation will use the GPU.
* @param grid_index_ij Combined X and Y indices of the bigcell.
* @param max_size Maximum number of atoms in a meshcell.
* @param lgd lgd.
* @param nczp Number of meshcells along the z-axis on this processor.
* @param psir_ylm_g One-dimensional array storing psir.
* @param psir_dm_g One-dimensional array storing psir_dm.
* @param dm_matrix_g One-dimensional array storing mat_dm.
* @param mat_alpha Alpha values for matrix multiplication.
* @param mat_m Number of rows in mat_dm.
* @param mat_n Number of columns in mat_psir.
* @param mat_k Number of columns in mat_dm, which equals the number of rows in mat_psir.
* @param mat_lda Leading dimension of mat_dm.
* @param mat_ldb Leading dimension of mat_psir.
* @param mat_ldc Leading dimension of mat_psir_dm.
* @param mat_A Pointers to mat_dm.
* @param mat_B Pointers to mat_psir.
* @param mat_C Pointers to mat_psir_dm.
* @param max_m Maximum value of m.
* @param max_n Maximum value of n.
* @param atom_pair_num Total count of atom pairs, which is also the number of matrix multiplication operations.
* @param rho_g Rho.
* @param dot_product Pointers to the results of dot products.
*/
void alloc_mult_dot_rho(const Grid_Technique& gridt,
void alloc_mult_dot_rho(const hamilt::HContainer<double>* dm,
const Grid_Technique& gridt,
const UnitCell& ucell,
const int grid_index_ij,
const int max_atom,
7 changes: 0 additions & 7 deletions source/module_hamilt_lcao/module_gint/gint_tools.cpp
@@ -256,11 +256,4 @@ void cal_dpsirr_ylm(
return psir_vlbr3;
}


// calculating (psi_DMR)_mu = sum_nu DMR_mu,nu psi_nu
// note : there is a difference between rho and force
// in calculating rho, due to symmetry, the summation over mu,nu
// can be done as sum_mu,mu + 2 sum_mu<nu, saving some time
// but for force, we cannot exchange the index

} // namespace Gint_Tools
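For context on the stale comment deleted just above: it documented the symmetry shortcut used when contracting the density matrix with the basis functions on the rho path. In formula form it is a restatement of the removed note, nothing new:

$$(\psi\,\mathrm{DMR})_{\mu} = \sum_{\nu} \mathrm{DMR}_{\mu\nu}\,\psi_{\nu}, \qquad \sum_{\mu,\nu} = \sum_{\mu=\nu} + 2\sum_{\mu<\nu} \quad \text{(valid for } \rho \text{ by symmetry, not for forces).}$$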