
failure running develop code on hercules #820

Open
jack-woollen opened this issue Dec 26, 2024 · 15 comments

@jack-woollen
Contributor

jack-woollen commented Dec 26, 2024

@DavidHuber-NOAA @RussTreadon-NOAA I'm testing my current reanalysis fork on Hercules and it runs okay. When I update the fork from GSI develop, it fails in what looks like netCDF reading. I think there are some other active issues like this. Is there a status on this problem? Thanks!

libc.so.6 00001530CB738D90 Unknown Unknown Unknown
gsi.x 0000000001D92F13 module_ncio_mp_re 50 read_vardata_code_3d.f90
gsi.x 000000000115E783 general_read_gfsa 3603 general_read_gfsatm.f90
gsi.x 00000000013730B4 netcdfgfs_io_mp_r 225 netcdfgfs_io.f90
gsi.x 000000000092FFE4 read_guess_ 201 read_guess.F90
gsi.x 000000000085D849 observermod_mp_in 165 observer.F90

@RussTreadon-NOAA
Contributor

@jack-woollen , thank you for reporting this error.

The following test was conducted on Hercules

  1. build gsi.x and enkf.x from GSI develop at 737c6b8
  2. run ctests using develop as the updat and contrl
  3. all tests passed; the ctest results are shown below
Test project /work/noaa/da/rtreadon/git/gsi/develop/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  486.32 sec
2/6 Test #6: global_enkf ......................   Passed  1026.59 sec
3/6 Test #2: rtma .............................   Passed  1505.64 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1574.13 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1643.89 sec
6/6 Test #1: global_4denvar ...................   Passed  2221.92 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 2221.93 sec

GSI develop works on Hercules ... at least the cases represented by the ctests work.

Perhaps something didn't get merged correctly when you brought develop into your fork, or additional changes are needed in your fork.

@jack-woollen
Contributor Author

@RussTreadon-NOAA Thanks for the info. As a next step, I cloned and built develop directly, reran, and got the same outcome. The output is in /work2/noaa/da/jwoollen/RAEXPS/2023exp2/2023060106/logs/run_gsiobserver.out.

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libc.so.6 0000153DDE225D90 Unknown Unknown Unknown
gsi.x 0000000001D93463 module_ncio_mp_re 50 read_vardata_code_3d.f90
gsi.x 000000000115EC73 general_read_gfsa 3603 general_read_gfsatm.f90
gsi.x 0000000001373554 netcdfgfs_io_mp_r 225 netcdfgfs_io.f90
gsi.x 000000000092FCD4 read_guess_ 201 read_guess.F90
gsi.x 000000000085D849 observermod_mp_in 165 observer.F90
gsi.x 00000000011C88EF glbsoi_ 198 glbsoi.f90
gsi.x 00000000006879A7 gsisub_ 200 gsisub.F90
gsi.x 000000000041639D gsimod_mp_gsimain 2432 gsimod.F90
gsi.x 00000000004162DF MAIN__ 633 gsimain.f90
gsi.x 000000000041627D Unknown Unknown Unknown
libc.so.6 0000153DDE210EB0 Unknown Unknown Unknown
libc.so.6 0000153DDE210F60 __libc_start_main Unknown Unknown
gsi.x 0000000000416195 Unknown Unknown Unknown

@RussTreadon-NOAA
Contributor

@jack-woollen , since ctests work and no one else has reported similar problems on Hercules, we will need to debug this case. Where is the script used to run this case?

@jack-woollen
Contributor Author

@RussTreadon-NOAA This is the structure @jswhit made to run reanalysis scout runs since the workflow for RA isn't worked out yet. I also made some changes for my testing. It was working well until the merge with current develop.

The directory with scripts is /work2/noaa/da/jwoollen/RAEXPS/scripts/2023exp2.
The calling sequence for the observer is:
job.sh
main3dvar.sh
run_gsiobserver.sh
run_gsi_4densvar.sh

There is a change in the merged version of netcdfgfs_io.f90 defining num_fields. Can this be a problem?

(screenshot: the num_fields definition in the merged netcdfgfs_io.f90)

@RussTreadon-NOAA
Contributor

@jack-woollen , I agree. This looks like a configuration error.

One caution: we no longer have GSI code support given the transition to JEDI. The only global GSI development these days is what is required for GFS v16.4 and v17. We will stop using GSI for global atmospheric DA with GFS v18. The longer you use GSI, the more likely you are to encounter issues.

An ncdump of /work2/noaa/da/jwoollen/RAEXPS/2023exp2/2023060106/sfg_2023060106_fhr03_control returns the following fields

        float pfull(pfull) ;
        float phalf(phalf) ;
        float clwmr(time, pfull, grid_yt, grid_xt) ;
        float delz(time, pfull, grid_yt, grid_xt) ;
        float dpres(time, pfull, grid_yt, grid_xt) ;
        float dzdt(time, pfull, grid_yt, grid_xt) ;
        float grle(time, pfull, grid_yt, grid_xt) ;
        float hgtsfc(time, grid_yt, grid_xt) ;
        float icmr(time, pfull, grid_yt, grid_xt) ;
        float nicp(time, pfull, grid_yt, grid_xt) ;
        float ntrnc(time, pfull, grid_yt, grid_xt) ;
        float o3mr(time, pfull, grid_yt, grid_xt) ;
        float pressfc(time, grid_yt, grid_xt) ;
        float rwmr(time, pfull, grid_yt, grid_xt) ;
        float snmr(time, pfull, grid_yt, grid_xt) ;
        float spfh(time, pfull, grid_yt, grid_xt) ;
        float tmp(time, pfull, grid_yt, grid_xt) ;
        float ugrd(time, pfull, grid_yt, grid_xt) ;
        float vgrd(time, pfull, grid_yt, grid_xt) ;

In contrast, the global_4denvar ctest sigf06 contains

        float pfull(pfull) ;
        float phalf(phalf) ;
        float clwmr(time, pfull, grid_yt, grid_xt) ;
        float delz(time, pfull, grid_yt, grid_xt) ;
        float dpres(time, pfull, grid_yt, grid_xt) ;
        float dzdt(time, pfull, grid_yt, grid_xt) ;
        float grle(time, pfull, grid_yt, grid_xt) ;
        float hgtsfc(time, grid_yt, grid_xt) ;
        float icmr(time, pfull, grid_yt, grid_xt) ;
        float nccice(time, pfull, grid_yt, grid_xt) ;
        float nconrd(time, pfull, grid_yt, grid_xt) ;
        float o3mr(time, pfull, grid_yt, grid_xt) ;
        float pressfc(time, grid_yt, grid_xt) ;
        float rwmr(time, pfull, grid_yt, grid_xt) ;
        float snmr(time, pfull, grid_yt, grid_xt) ;
        float spfh(time, pfull, grid_yt, grid_xt) ;
        float tmp(time, pfull, grid_yt, grid_xt) ;
        float ugrd(time, pfull, grid_yt, grid_xt) ;
        float vgrd(time, pfull, grid_yt, grid_xt) ;

Note that the fields you identified are named differently between the two sets of backgrounds. Your sigf06 has nicp and ntrnc, whereas the ctest sigf06 has nccice and nconrd.

I'm not sure whether we need to change the variable names in the background fields, the entries in anavinfo, and/or the setting of GSI namelist variable(s).

It seems we need to do some old-fashioned debugging by adding print statements to your older pre-merge code that works and the updated post-merge code that does not.
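
Before instrumenting GSI itself, a quick standalone check can tell which naming convention a given background file follows. Below is a minimal sketch (not GSI code) that assumes module_ncio's open_dataset, has_var, and close_dataset interfaces; the file name is a placeholder.

program check_bg_varnames
   ! Sketch only: report which hydrometeor number-concentration names a
   ! background file contains (nccice/nconrd vs nicp/ntrnc).
   use module_ncio, only: Dataset, open_dataset, close_dataset, has_var
   implicit none
   type(Dataset) :: ds
   character(len=*), parameter :: fname = 'sfg_2023060106_fhr03_control'  ! placeholder path
   ds = open_dataset(fname)
   if (has_var(ds, 'nccice') .and. has_var(ds, 'nconrd')) then
      print *, trim(fname)//': ctest-style names (nccice/nconrd)'
   else if (has_var(ds, 'nicp') .and. has_var(ds, 'ntrnc')) then
      print *, trim(fname)//': scout-run names (nicp/ntrnc)'
   else
      print *, trim(fname)//': neither naming convention found'
   end if
   call close_dataset(ds)
end program check_bg_varnames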

@RussTreadon-NOAA
Contributor

@jack-woollen , I modified a stand-alone rungsi script to use what I think is the input for your 20230601 06Z case. The script ran develop gsi.x in 3dvar mode with only the 6-hour backgrounds. The job ran to completion.

Here are key files and directories

  • run script: /work/noaa/da/rtreadon/git/gsi/scripts/rungsi_debug.sh
  • run directory: /work/noaa/stmp/rtreadon/tmp94/debug_gdas.2023060106
  • gsiexec: /work/noaa/da/Russ.Treadon/git/gsi/develop/install/bin/gsi.x
  • fixgsi: /work/noaa/da/Russ.Treadon/git/gsi/develop/fix

I see that my gsiparm.anl has imp_physics=11. Perhaps your imp_physics and anavinfo are not consistent with the background atmospheric guess.
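
For reference, the switch lives in gsiparm.anl; a hypothetical fragment is shown below, assuming imp_physics is read from the GRIDOPTS namelist group as in the operational templates (the rest of the group is elided). Per the FV3GFS convention, 11 selects GFDL microphysics and 8 selects Thompson.

 &GRIDOPTS
   imp_physics=11,
 /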

@jack-woollen
Contributor Author

@RussTreadon-NOAA Thanks for your time on this. Given what you found, I think I can track down the seg fault and move on.

@jack-woollen
Contributor Author

@RussTreadon-NOAA After some more debugging, it turns out that setting imp_physics=11 (GFDL) instead of imp_physics=8 (Thompson) allows the merged code and scripts to run as-is. Some relatively recent changes to general_read_gfsatm added an if block for imp_physics=8 which tries to read netCDF variable name(s) not defined in the scout-run forecast records. I'm not sure what the implication of switching imp_physics from 8 to 11 is. Maybe @jswhit2 has some insight about this setting in the GSI for reanalysis use.

A note about the newly merged code: it has a refactored version of read_satwnd which enables the use of all platforms back to 1979, speeds up execution a fair amount compared to current develop, and gives identical satwnd counts in GSI observer testing with 2023 data.

@RussTreadon-NOAA
Contributor

Good to hear, @jack-woollen , that you got gsi.x to run to completion. I don't think we want to use imp_physics=11 if the model was not run with GFDL microphysics. In addition to @jswhit , we can ask @emilyhcliu about this. It may be better to modify the GSI to work without the expected fields when imp_physics=8.

The satwind speed-up you mention sounds interesting. Do you have your changes in a branch? Processing the satwnd file takes a long time. @BrettHoover-NOAA was testing splitting the satwnd file up by subset and processing the subset files in parallel.

@jack-woollen
Contributor Author

@RussTreadon-NOAA Thanks. It's good to get it right. The merged fork is found at https://github.com/jack-woollen/GSI.

@BrettHoover-NOAA The refactored read_satwnd is about 15% faster. Parallel reads could speed it up further. Maybe something like read_bufrtovs would work.
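
One possible pattern is sketched below: every MPI rank scans the same BUFR file but decodes only the messages assigned to it round-robin. This is a rough sketch, not read_bufrtovs or the refactored read_satwnd; it assumes NCEPLIBS-bufr's openbf/ireadmg/ireadsb/ufbint, and the file name, unit number, and mnemonics are placeholders.

program satwnd_parallel_sketch
   ! Conceptual sketch: distribute BUFR messages across MPI ranks round-robin.
   use mpi
   implicit none
   integer, parameter :: lunin = 11        ! placeholder unit number
   integer :: ierr, myrank, nranks, idate, imsg, nrep, nlv
   integer, external :: ireadmg, ireadsb
   character(len=8) :: subset
   real(8) :: hdr(4)

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, myrank, ierr)
   call mpi_comm_size(mpi_comm_world, nranks, ierr)

   open(lunin, file='satwndbufr', form='unformatted', status='old')  ! placeholder file
   call openbf(lunin, 'IN', lunin)          ! BUFR tables embedded in the file

   imsg = 0
   nrep = 0
   do while (ireadmg(lunin, subset, idate) == 0)
      imsg = imsg + 1
      if (mod(imsg-1, nranks) /= myrank) cycle    ! message owned by another rank
      do while (ireadsb(lunin) == 0)
         call ufbint(lunin, hdr, 4, 1, nlv, 'SAID CLAT CLON SWCM')  ! placeholder mnemonics
         nrep = nrep + 1                    ! a real reader would store and QC the report here
      end do
   end do

   call closbf(lunin)
   print '(a,i5,a,i9,a)', 'rank', myrank, ' decoded', nrep, ' satwnd reports'
   call mpi_finalize(ierr)
end program satwnd_parallel_sketch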

@jswhit2
Contributor

jswhit2 commented Jan 29, 2025

I think this patch to general_read_gfsatm.f90 might work (it tells the code to try using the old variable names if it can't read the new ones):

--- a/src/gsi/general_read_gfsatm.f90
+++ b/src/gsi/general_read_gfsatm.f90
@@ -3600,7 +3600,10 @@ subroutine general_read_gfsatm_allhydro_nc(grd,sp_a,filename,uvflag,vordivflag,z
              kr = levs+1-k ! netcdf is top to bottom, need to flip

              if (mype==mype_use(icount)) then
-                call read_vardata(filges, 'nccice', rwork3d0, nslice=kr, slicedim=3)
+                call read_vardata(filges, 'nccice', rwork3d0, nslice=kr, slicedim=3, errcode=iret)
+                if (iret .ne. zero) then
+                    call read_vardata(filges, 'nicp', rwork3d0, nslice=kr, slicedim=3)
+                endif
                 ! cloud ice water number concentration.
                 if ( diff_res ) then
                    grid_b=rwork3d0(:,:,1)
@@ -3630,7 +3633,10 @@ subroutine general_read_gfsatm_allhydro_nc(grd,sp_a,filename,uvflag,vordivflag,z
              kr = levs+1-k ! netcdf is top to bottom, need to flip

              if (mype==mype_use(icount)) then
-                call read_vardata(filges, 'nconrd', rwork3d0, nslice=kr, slicedim=3)
+                call read_vardata(filges, 'nconrd', rwork3d0, nslice=kr, slicedim=3, errcode=iret)
+                if (iret .ne. zero) then
+                    call read_vardata(filges, 'ntrnc', rwork3d0, nslice=kr, slicedim=3, errcode=iret)
+                endif
                 ! rain number concentration.
                 if ( diff_res ) then
                    grid_b=rwork3d0(:,:,1)

@jswhit2
Contributor

jswhit2 commented Jan 30, 2025

Apparently there is a problem with module_ncio error handling in MPI codes, and this patch should work around that:

diff --git a/src/gsi/general_read_gfsatm.f90 b/src/gsi/general_read_gfsatm.f90
index e1a5406a1..b93893d8c 100755
--- a/src/gsi/general_read_gfsatm.f90
+++ b/src/gsi/general_read_gfsatm.f90
@@ -2824,7 +2824,7 @@ subroutine general_read_gfsatm_allhydro_nc(grd,sp_a,filename,uvflag,vordivflag,z
    use constants, only: two,pi,half,deg2rad,r60,r3600
    use gsi_bundlemod, only: gsi_bundle
    use gsi_bundlemod, only: gsi_bundlegetpointer
-   use module_ncio, only: Dataset, Variable, Dimension, open_dataset,&
+   use module_ncio, only: Dataset, Variable, Dimension, open_dataset, has_var,&
                           close_dataset, get_dim, read_vardata,get_idate_from_time_units
    use gfsreadmod, only: general_reload2, general_reload_sfc
    use ncepnems_io, only: imp_physics
@@ -3600,7 +3600,11 @@ subroutine general_read_gfsatm_allhydro_nc(grd,sp_a,filename,uvflag,vordivflag,z
              kr = levs+1-k ! netcdf is top to bottom, need to flip

              if (mype==mype_use(icount)) then
-                call read_vardata(filges, 'nccice', rwork3d0, nslice=kr, slicedim=3)
+                if (has_var(filges, 'nccice')) then
+                   call read_vardata(filges, 'nccice', rwork3d0, nslice=kr, slicedim=3)
+                else
+                   call read_vardata(filges, 'nicp', rwork3d0, nslice=kr, slicedim=3)
+                endif
                 ! cloud ice water number concentration.
                 if ( diff_res ) then
                    grid_b=rwork3d0(:,:,1)
@@ -3630,7 +3634,11 @@ subroutine general_read_gfsatm_allhydro_nc(grd,sp_a,filename,uvflag,vordivflag,z
              kr = levs+1-k ! netcdf is top to bottom, need to flip

              if (mype==mype_use(icount)) then
-                call read_vardata(filges, 'nconrd', rwork3d0, nslice=kr, slicedim=3)
+                if (has_var(filges, 'nconrd')) then
+                   call read_vardata(filges, 'nconrd', rwork3d0, nslice=kr, slicedim=3)
+                else
+                   call read_vardata(filges, 'ntrnc', rwork3d0, nslice=kr, slicedim=3)
+                endif
                 ! rain number concentration.
                 if ( diff_res ) then
                    grid_b=rwork3d0(:,:,1)

@jack-woollen
Contributor Author

@jswhit2 Yup, that work around does work around.

@RussTreadon-NOAA @BrettHoover-NOAA I could use this issue to make a new pull request. Comments before I do?

@RussTreadon-NOAA
Contributor

@jack-woollen and @jswhit2 : Thank you for developing and testing a workaround change for src/gsi/general_read_gfsatm.f90.

Some comments and questions:

  1. The patch to general_read_gfsatm.f90 does not address the underlying problem with module_ncio error handling in MPI codes. It seems we need to open an issue in NOAA-EMC/NCEPLIBS-ncio to fix this problem, right?
  2. Is it true that, independent of item 1, we need the proposed patch to general_read_gfsatm.f90 so gsi.x can work in the scout run? I'm not keen on committing to develop a specialized change to address a non-operational need.
  3. Do you want to bundle the satwnd speed-up change along with the general_read_gfsatm.f90 change in a single PR? Since they are separate items, it is preferable to open two PRs. The reduced gsi.x wall time associated with the satwnd change would benefit operations.

@jack-woollen
Contributor Author

@RussTreadon-NOAA Regarding 3: I'm happy to PR just the read_satwnd code. Then we can think more about 1 and 2.
