Feature: enable ScaLAPACK solver for LR-TDDFT #5867

Open
wants to merge 4 commits into
base: develop

maki49
Collaborator

@maki49 maki49 commented Jan 16, 2025

No description provided.

@maki49
Collaborator Author

maki49 commented Jan 16, 2025

The bug that occurred in the integration test seems strange:
image

I tried to debug it on my machine and found that:

  • It happens at the second call of the complex eigensolver pzheev.
  • Compiling with Intel oneAPI has no such problem; it only happens with the GNU compiler.
  • Running with 1 processor hits the same problem.
  • Both jobtype = 'N' and 'V' have this problem.
  • pzheevd and pzheevx also have this problem.
  • Enlarging work and rwork to 100 times their queried sizes does not help.
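
For reference, the workspace query mentioned in the last bullet follows the usual two-pass pattern: call once with lwork = lrwork = -1 to get the sizes, then allocate and call again. Below is a minimal sketch of that pattern, not the code in this PR. The helper name solve_with_queried_workspace and the solver callable are hypothetical; solver is assumed to wrap pzheev_ with the matrix and descriptor arguments already bound.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Hypothetical helper illustrating the two-pass workspace-query protocol.
// `solver` is an assumed wrapper with the signature
//   void(std::complex<double>* work, int lwork,
//        double* rwork, int lrwork, int* info)
// that forwards to the real eigensolver.
template <typename Solver>
int solve_with_queried_workspace(Solver&& solver)
{
    int info = 0;

    // Pass 1: lwork = lrwork = -1 asks the routine to report sizes only.
    std::complex<double> work_query(0.0, 0.0);
    double rwork_query = 0.0;
    solver(&work_query, -1, &rwork_query, -1, &info);
    if (info != 0) { return info; }

    // The sizes come back as floating-point values in work[0] and rwork[0].
    const int lwork = static_cast<int>(work_query.real());
    const int lrwork = static_cast<int>(rwork_query);
    assert(lwork > 0 && lrwork > 0);

    // Pass 2: allocate exactly what was reported and run the computation.
    std::vector<std::complex<double>> work(lwork);
    std::vector<double> rwork(lrwork);
    solver(work.data(), lwork, rwork.data(), lrwork, &info);
    return info;
}
```

Since oversizing these buffers by 100x did not change the outcome, the workspace itself is probably not the cause, but keeping the query in one place makes that easy to rule out for each solver variant.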

gdb info:

Thread 1 "abacus" received signal SIGSEGV, Segmentation fault.
0x00001555447c8cbb in mkl_lapack_zladiv () from /opt/intel/oneapi/mkl/2024.2/lib/libmkl_core.so.2
(gdb) bt
#0  0x00001555447c8cbb in mkl_lapack_zladiv ()
   from /opt/intel/oneapi/mkl/2024.2/lib/libmkl_core.so.2
#1  0x0000555555f5bb2a in pzlarfg_.constprop ()
#2  0x0000555555f6736b in pzlatrd_.constprop ()
#3  0x0000555555f5aa8c in pzhetrd_.constprop ()
#4  0x0000555555f0637a in pzheev_ ()
#5  0x0000555555e3e5f0 in LR_Util::diag_scalapack (
    n=@0x7fffffff89a0: 80, mat=mat@entry=0x555559120830, 
    eigval=0x555558a4a560, eigvec=eigvec@entry=0x555559139840, 
    desc=...)
    at /home/fortneu49/abacus-fix/abacus-develop/source/module_lr/utils/lr_util.cpp:222

@caic99 @dyzheng do you have any idea?

@mohanchen added the "EXX and lr-TDDFT" (Related to EXX or lr-TDDFT) and "Refactor" (Refactor ABACUS codes) labels Jan 17, 2025
@caic99
Member

caic99 commented Jan 23, 2025

Hi @maki49 ,
Would you check all the input params (and the shapes of the arrays) with extra care? For example, do the pointer and ld match?
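
One way to do the pointer/ld check is to recompute the local extent on each process and compare it with the leading dimension stored in the descriptor. The sketch below assumes the standard 9-element ScaLAPACK descriptor layout (M at index 2, MB at 4, RSRC at 6, LLD at 8) and reproduces the arithmetic of NUMROC. The function names local_extent and lld_is_consistent are made up for illustration and are not part of ABACUS.

```cpp
#include <algorithm>
#include <cassert>

// Same arithmetic as ScaLAPACK's NUMROC: how many rows (or columns) of a
// block-cyclically distributed dimension of size n land on process iproc.
int local_extent(int n, int nb, int iproc, int isrcproc, int nprocs)
{
    assert(n >= 0 && nb > 0 && nprocs > 0);
    const int mydist = (nprocs + iproc - isrcproc) % nprocs;
    const int nblocks = n / nb;               // number of full blocks
    int count = (nblocks / nprocs) * nb;      // full rounds every process gets
    const int extra = nblocks % nprocs;       // leftover full blocks
    if (mydist < extra) {
        count += nb;                          // one extra full block
    } else if (mydist == extra) {
        count += n % nb;                      // the trailing partial block
    }
    return count;
}

// Hypothetical check: the descriptor's LLD must cover the local row count
// on this process row (and be at least 1).
bool lld_is_consistent(const int* desc, int myprow, int nprow)
{
    const int m = desc[2];
    const int mb = desc[4];
    const int rsrc = desc[6];
    const int lld = desc[8];
    return lld >= std::max(1, local_extent(m, mb, myprow, rsrc, nprow));
}
```

For the n = 80 case in the backtrace and an assumed block size of 32 on 2 process rows, this gives 48 local rows on row 0 and 32 on row 1, so an LLD of 32 would be too small on row 0. The local buffer behind each pointer also has to hold LLD times the local column count, which can be computed the same way from N, NB and CSRC.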

@mohanchen
Collaborator

It seems that all tests have passed. If the issue has been solved and the PR can be accepted, let me know.

@maki49
Collaborator Author

maki49 commented Jan 23, 2025

It hasn't been solved yet; I'm still debugging. Here are some test results:

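| compiler | solver  | 1 processor | multi-processor                     |
|----------|---------|-------------|-------------------------------------|
| intel    | pzheevx | OK          | segfault                            |
| gnu      | pzheevx | OK          | segfault                            |
| intel    | pzhegvx | OK          | segfault                            |
| gnu      | pzhegvx | OK          | segfault                            |
| intel    | pzheev  | OK          | OK                                  |
| gnu      | pzheev  | segfault    | segfault                            |
| intel    | pzheevd | OK          | OK                                  |
| gnu      | pzheevd | OK          | usually OK, but occasional segfault |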
