Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

back_spawn tests failing with SEGFAULT #18

Open
boegel opened this issue Nov 13, 2024 · 8 comments
Open

back_spawn tests failing with SEGFAULT #18

boegel opened this issue Nov 13, 2024 · 8 comments

Comments

@boegel
Copy link

boegel commented Nov 13, 2024

When building the latest version of NECI (commit 558e88c) with GCC 12.3.0, we're consistently seeing segfaults in the tests, even when a lot of memory is available on the system (~185GB, with 36 cores).

94% tests passed, 6 tests failed out of 96

Total Test time (real) = 372.92 sec

The following tests FAILED:
         22 - test_neci_back_spawn_excit_gen (SEGFAULT)
         23 - test_kneci_back_spawn_excit_gen (SEGFAULT)
         24 - test_dneci_back_spawn_excit_gen (SEGFAULT)
         25 - test_mneci_back_spawn_excit_gen (SEGFAULT)
         26 - test_kdneci_back_spawn_excit_gen (SEGFAULT)
         27 - test_kmneci_back_spawn_excit_gen (SEGFAULT)

More details on test_neci_back_spawn_excit_gen below incl. backtrace, the actual full error is "Program received signal SIGSEGV: Segmentation fault - invalid memory reference." (other tests fail in a very similar way, and always in back_spawn.F90:593):

      Start 22: test_neci_back_spawn_excit_gen
22/96 Test #22: test_neci_back_spawn_excit_gen .............***Exception: SegFault  0.21 sec

 Test module initialized

    . : successful assert,   F : failed assert


 testing: calc_pgen_back_spawn_ueg
 with necessary global data:
 nel:            2
 nOccBeta:            1
 nOccAlpha:            1
 tUEG:  T
 tNoFailAb:  F
 projedet:            1           2
 dSFMT_init()
 niftot:            2
 n_int:            8
 set_flag()
 get_initiator_flag()
 flag:            1
 nBasis:            4
.........
 testing: calc_pgen_back_spawn_hubbard

 testing: calc_pgen_back_spawn_hubbard
 with necessary global data:
 nel:            2
 nOccBeta:            1
 nOccAlpha:            1
 tHub:  T
 projedet:            1           2
 dSFMT_init()
 niftot:            2
 n_int:            8
 set_flag()
 get_initiator_flag()
 IlutBits%ind_flag:            1
 nBasis:            4
.......
 testing: calc_pgen_back_spawn_ueg
 with necessary global data:
 nel:            2
 nBasis:            4
 niftot:            1
 encodebitdet()
 dSFMT_init
......
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x147aa063e6ef in ???
#1  0x420ebe in __back_spawn_MOD_pick_occupied_orbital_ueg
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/src/back_spawn.F90:593
#2  0x425a68 in __back_spawn_excit_gen_MOD_calc_pgen_back_spawn_ueg_new
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/src/back_spawn_excit_gen.F90:284
#3  0x407c4e in calc_pgen_back_spawn_ueg_new_test
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:487
#4  0x785d3d in __fruit_MOD_run_test_case_named_
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/fruit_src/fruit.f90:909
#5  0x407108 in back_spawn_excit_gen_test_driver
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:29
#6  0x407108 in test_back_spawn_excit_gen
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:16
#7  0x407108 in main
        at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:5

Are there any known problems when NECI is built with a recent compiler?
Any suggestions on how to get to the bottom of this, could it be a bug in back_spawn.F90?

Although we've seen very similar problems before with an older version of NECI and GCC 11.3.0 (see easybuilders/easybuild-easyconfigs#17164), we didn't observe these problems when using GCC 12.2.0 to build NECI commit 558e88c, which seems strange to me...

@demetriovilardi
Copy link
Contributor

Thank you very much for taking the time for writing this. I tested NECI, commit 558e88c, by compiling it with GCC 12.3.0, OpenMPI 4.2.5, on both Intel and AMD hardwares. I could not reproduce the issue.

I compile the program by using:
cmake -DCMAKE_Fortran_COMPILER=mpifort -DCMAKE_CXX_COMPILER=mpic++ -DCMAKE_C_COMPILER=mpicc /path/to/source

Could please tell me the following info?

  • How you compile NECI
  • MPI version
  • OS and libgfortran version

@boegel
Copy link
Author

boegel commented Nov 16, 2024

@demetriovilardi Thank you for looking into this.

We saw the segfaults happen on both RHEL 8.8 and 9.4 systems, across a range of CPU types (different generations of Intel & AMD).
OpenMPI version we're using for this is v4.1.5.
gfortran version is 12.3.0 (we install our own GCC)

We build NECI via EasyBuild, which uses the following cmake command:

cmake  -DCMAKE_INSTALL_PREFIX=/software/NECI/20230620-foss-2023a -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DPYTHON_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DPython_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DPython3_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DENABLE_HDF5=ON

In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)

@mcocdawc
Copy link
Collaborator

In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)

Just a quick comment on this one.¹ NECI anyway uses -O3 -march=native for Release builds and -ftree-vectorize is implied with O3; is there a specific reason you want to avoid O3?

@demetriovilardi it seems like a good idea to also add -fno-math-errno to Release builds, isn't it?

¹ I changed groups and am not anymore one of the maintainers.

@boegel
Copy link
Author

boegel commented Nov 17, 2024

In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)

Just a quick comment on this one.¹ NECI anyway uses -O3 -march=native for Release builds and -ftree-vectorize is implied with O3; is there a specific reason you want to avoid O3?

This has grown historically I guess: -O2 used to be a lot more stable in terms of generating correct machine instructions than -O3, and at some point we noticed a significant benefit of enabling -ftree-vectorize.
There may still be some compiler optimizations enabled with -O3 (and not in -O2) that could have a negative side effect, like on floating-point accuracy for example.

@demetriovilardi
Copy link
Contributor

After analysing the test with valgrind, we noticed an issue about a missing initialisation of a variable inside the test back_spawn_excit_gen.F90, that caused a memory problem at the line you specified. This issue doesn't affect the main program and we will fix it soon on our internal repository.

@boegel
Copy link
Author

boegel commented Nov 21, 2024

@demetriovilardi Can you point us to the details of the fix, so we can apply a patch for this and have a test suite that passes in full?

@demetriovilardi
Copy link
Contributor

Essentially, inside unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90, the array KPointToBasisFn is not initialised after allocation, in test_back_spawn_excit_gen.F90:66 and test_back_spawn_excit_gen.F90:405. One can add

KPointToBasisFn = -1

one line after allocation in both cases. After this change I get no further valgrind warnings or errors.

@boegel
Copy link
Author

boegel commented Nov 21, 2024

@PetrKralCZ Can you look into making a patch file for NECI based on this info, and see if that fixes the segfaults we were seeing in the test suite?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants