C768 analysis tasks Fail on Hera #2498

Closed
spanNOAA opened this issue Apr 16, 2024 · 43 comments · Fixed by #2819

@spanNOAA

What is wrong?

The gdassfcanl, gfssfcanl, and gdasanalcalc tasks fail starting with the second cycle. Regardless of the wall-clock limit set for the job, the tasks consistently exceed it.

I am attempting to run the simulations starting from 2023021018 and ending at 2023022618.

Brief snippet of error from gdassfcanl.log and gfssfcanl.log file for 2023021100 forecast cycle:
0: update OUTPUT SFC DATA TO: ./fnbgso.001
0:
0: CYCLE PROGRAM COMPLETED NORMALLY ON RANK: 0
0: slurmstepd: error: *** STEP 58349057.0 ON h34m13 CANCELLED AT 2024-04-16T21:54:15 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 58349057 ON h34m13 CANCELLED AT 2024-04-16T21:54:15 DUE TO TIME LIMIT ***


Start Epilog on node h34m13 for job 58349057 :: Tue Apr 16 21:54:17 UTC 2024
Job 58349057 finished for user Sijie.Pan in partition hera with exit code 0:0


End Epilogue Tue Apr 16 21:54:17 UTC 2024
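
One way to confirm from a job log that Slurm cancelled the job at its wall-clock limit, rather than the program exiting with an error, is to search for the cancellation message shown in the excerpt above. This is only a minimal sketch: the function name `hit_time_limit` is hypothetical, and the pattern is copied from the log excerpt rather than from any Slurm specification.

```shell
# Return success (0) if the given log contains a Slurm time-limit cancellation.
# The message pattern comes from the log excerpt in this issue;
# the function name is illustrative only.
hit_time_limit() {
  grep -q 'CANCELLED AT .* DUE TO TIME LIMIT' "$1"
}

# Example use (log file name as reported in this issue):
#   hit_time_limit gdassfcanl.log && echo "job hit its wall-clock limit"
```

Seeing this message together with "CYCLE PROGRAM COMPLETED NORMALLY" for only some ranks suggests a hang in the remaining ranks rather than a crash.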

Brief snippet of error from gdasanalcalc.log file for 2023021100 forecast cycle:

* . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
PROGRAM INTERP_INC HAS BEGUN. COMPILED 2019100.00 ORG: EMC
STARTING DATE-TIME APR 15,2024 17:16:27.299 106 MON 2460416
READ SETUP NAMELIST
WILL INTERPOLATE TO GAUSSIAN GRID OF DIMENSION 3072 1536
OPEN OUTPUT FILE: inc.fullres.03
OPEN INPUT FILE: siginc.nc.03
PROCESS RECORD: u_inc
PROCESS RECORD: T_inc
PROCESS RECORD: sphum_inc
PROCESS RECORD: delp_inc
PROCESS RECORD: delz_inc
PROCESS RECORD: liq_wat_inc
PROCESS RECORD: o3mr_inc
PROCESS RECORD: icmr_inc
srun: Complete StepId=58250207.0 received
slurmstepd: error: *** STEP 58250207.0 ON h1m01 CANCELLED AT 2024-04-15T17:36:15 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 58250207 ON h1m01 CANCELLED AT 2024-04-15T17:36:15 DUE TO TIME LIMIT ***

Start Epilog on node h1m01 for job 58250207 :: Mon Apr 15 17:36:18 UTC 2024
Job 58250207 finished for user Sijie.Pan in partition bigmem with exit code 0:0


End Epilogue Mon Apr 15 17:36:18 UTC 2024
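
To see where the increment interpolation stopped, it can help to pull out the last record the log reports before the cancellation. The sketch below assumes the "PROCESS RECORD:" line format shown in the excerpt above; the function name `last_record` is hypothetical.

```shell
# Print the name of the last increment record the log reports as being processed.
# Assumes lines of the form "PROCESS RECORD: <name>", as in the excerpt above.
last_record() {
  awk '/PROCESS RECORD:/ { rec = $NF } END { if (rec != "") print rec }' "$1"
}

# Example use (log file name as reported in this issue):
#   last_record gdasanalcalc.log
```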

What should have happened?

The gdassfcanl, gfssfcanl, and gdasanalcalc tasks should complete and generate the files that the rest of the workflow requires.

What machines are impacted?

Hera

Steps to reproduce

  1. Set up experiment and generate xml file.
    ./setup_expt.py gfs cycled --app ATM --pslot C768_6hourly_0210 --nens 80 --idate 2023021018 --edate 2023022618 --start cold --gfs_cyc 4 --resdetatmos 768 --resensatmos 384 --configdir /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/parm/config/gfs --comroot ${COMROOT} --expdir ${EXPDIR} --icsdir /scratch2/BMC/wrfruc/Guoqing.Ge/ufs-ar/ICS/2023021018C768C384L128/output
  2. Change the wall-clock limit for the gdassfcanl, gfssfcanl, and gdasanalcalc tasks.
  3. Use rocoto to start the workflow.

Additional information

You can find gdassfcanl.log, gfssfcanl.log and gdasanalcalc.log in the following directory:
/scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/comroot/C768_6hourly_0210/logs/2023021100

Do you have a proposed solution?

No response

spanNOAA added the bug and triage labels on Apr 16, 2024

Contributor

@spanNOAA Have you compiled UFS with the unstructured wave grids (option -w)?

@spanNOAA
Author

No, I compiled the global workflow only using the '-g' option.

@HenryRWinterbottom
Contributor

@spanNOAA Are you using the top of the develop branch for the g-w?

@spanNOAA
Author

Yes, I'm using the develop branch.

@spanNOAA
Author

FYI, this problem was only observed with C768. I have no issue with C384.

@HenryRWinterbottom
Contributor

@spanNOAA Can you please point me to your g-w develop branch path on RDHPCS Hera?

@spanNOAA
Author

The local repo is at: /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs.

@HenryRWinterbottom
Contributor

@spanNOAA Thank you.

Can you please check out and/or update your current develop branch and recompile the UFS model? You can do so as follows.

user@host:$ cd sorc
user@host:$ ./build_ufs.sh -w

That will ensure that the executable is both up to date and able to use the unstructured wave grids. Can you then rerun your C768 experiment to see if the same exceptions are raised?

@spanNOAA
Author

Certainly. But before doing so, may I ask two questions:

  1. Will utilizing the -w option have any impact on the analysis or forecast outcomes?
  2. Has the latest develop branch been updated with the kchunk3d bug fixes for the ufs model that were merged yesterday?

@HenryRWinterbottom
Contributor

  1. Yes, I would assume there to be some differences between the use of a structured versus unstructured wave grid;
  2. Can you send me the tag for that branch? It likely has not been updated. However, you can clone the branch you are referencing into sorc/ufs_model.fd and then build. Make sure to run sorc/link_workflow.sh once you make the updates.

@WalterKolczynski-NOAA
Contributor

These are analysis jobs and have nothing to do with -w, don't worry about it.

C768 is not a resolution we test regularly, and we tend to discourage people from running C768 on Hera anyway because the machine is small.

How much larger did you try making the wallclock? Have you tried increasing the number of cores instead/as well?

@spanNOAA
Author

When you mention checking out and/or updating my current develop branch, do you mean that the entire global workflow needs updating, or only the ufs model?
The version I mentioned is 281b32f.

WalterKolczynski-NOAA changed the title from "gdassfcanl, gfssfcanl, and gdasanalcalc Tasks Fail on Hera (Rocky 8)" to "C768 analysis tasks Fail on Hera" on Apr 16, 2024
WalterKolczynski-NOAA removed the triage label on Apr 16, 2024

@HenryRWinterbottom
Contributor

@spanNOAA Thank you for the tag. We are currently testing hash 281b32fb but encountering errors when executing the forecast (e.g., ufs_model.exe) for C768 resolutions. As a result, the referenced UFSWM tag will not work at the moment. Please see issue #2490.

@WalterKolczynski-NOAA
Contributor

Additionally, when you tried increasing the wallclock, did you regenerate your rocoto XML afterwards?

@WalterKolczynski-NOAA
Contributor

WalterKolczynski-NOAA commented Apr 16, 2024

> @spanNOAA Thank you for the tag. We are currently testing hash 281b32fb but encountering errors when executing the forecast (e.g., ufs_model.exe) for C768 resolutions. As a result, the referenced UFSWM tag will not work at the moment. Please see issue #2490.

These failures are in the analysis job. It is unlikely anything with UFS or its build is the problem here.

@spanNOAA
Author

> These are analysis jobs and have nothing to do with -w, don't worry about it.
>
> C768 is not a resolution we test regularly, and we tend to discourage people from running C768 on Hera anyway because the machine is small.
>
> How much larger did you try making the wallclock? Have you tried increasing the number of cores instead/as well?

I tried wallclock settings ranging from 10 to 40 minutes, but none of them worked. When the wallclock was set to 20 minutes or more, the program consistently stalled at the same point.
I haven't tried increasing the number of cores. According to the log, the program seems to terminate normally, but the Slurm job keeps running afterward for an unknown reason.
I extended the wallclock by editing the XML file directly rather than through the config files.

@WalterKolczynski-NOAA
Contributor

Okay, I'm going to check your full log and see if I can find anything; otherwise we might need to get a specialist to look at it.

@spanNOAA
Author

I really appreciate it.

@WalterKolczynski-NOAA
Contributor

Looking at sfcanl, the problem seems to be in global_cycle. Ranks 0-2 finish but 3-5 never do:

> egrep '^3:' gdassfcanl.log.1 
3:  
3:  STARTING CYCLE PROGRAM ON RANK            3
3:  RUNNING WITH            6 TASKS
3:  AND WITH            1  THREADS.
3:  
3:  READ NAMCYC NAMELIST.
3:  
3:  
3:  IN ROUTINE SFCDRV,IDIM=         768 JDIM=         768 FH=
3:   0.000000000000000E+000
3:  - RUNNING WITH FRACTIONAL GRID.
3:  
3:  READ FV3 GRID INFO FROM: ./fngrid.004
3:  
3:  READ FV3 OROG INFO FROM: ./fnorog.004
3:  
3:  WILL PROCESS NSST RECORDS.
3:  
3:  READ INPUT SFC DATA FROM: ./fnbgsi.004
3:  - WILL PROCESS FOR NOAH-MP LSM.
3:  
3:  WILL READ NSST RECORDS.
3:  
3:  USE UNFILTERED OROGRAPHY.
3:  
3:  SAVE FIRST GUESS MASK
3:  
3:  CALL SFCCYCLE TO UPDATE SURFACE FIELDS.

Since the ranks are tiles, they should all have similar run times. I think this points back to a memory issue. Try changing the resource request to:

<nodes>6:ppn=40:tpp=1</nodes>

That should be overkill, but if it works we can try dialing it back.
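
The rank-by-rank check above can be generalized so that all ranks are compared in one pass instead of grepping for each rank separately. This is a minimal sketch that assumes the `srun -l` style "rank:" prefix and the start and completion messages seen in the log excerpts in this issue; the function name `stalled_ranks` is hypothetical.

```shell
# List MPI ranks that logged the start message but never logged normal completion.
# Assumes each log line is prefixed with "<rank>:" and that the start and
# completion messages match the excerpts quoted in this issue.
stalled_ranks() {
  awk -F: '
    /STARTING CYCLE PROGRAM ON RANK/           { started[$1] = 1 }
    /CYCLE PROGRAM COMPLETED NORMALLY ON RANK/ { finished[$1] = 1 }
    END { for (r in started) if (!(r in finished)) print r }
  ' "$1" | sort -n
}

# Example use (log file name as in the egrep command above):
#   stalled_ranks gdassfcanl.log.1
```

If the same ranks appear on every attempt, the hang is tied to specific tiles rather than to node or scheduling noise.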

@spanNOAA
Author

The problem remains despite increasing the nodes to 6.

@WalterKolczynski-NOAA
Contributor

@GeorgeGayno-NOAA I'm out of simple ideas, can you take a look at this issue?

@GeorgeGayno-NOAA
Contributor

Sure. The global_cycle program should run in under 5 minutes at C768, and it is not memory intensive since it works on 2D surface fields. Here is how I run a C768 regression test on Hera.

export OMP_NUM_THREADS_CY=2
TEST1=$(sbatch --parsable --ntasks-per-node=6 --nodes=1 -t 0:05:00 -A $PROJECT_CODE -q $QUEUE -J c768.fv3gfs \
      -o $LOG_FILE -e $LOG_FILE ./C768.fv3gfs.sh)

@spanNOAA - Is it always the same tiles/mpi tasks that hang or is it random? Are you using a recent version of 'develop', which works on Rocky 8?

@spanNOAA
Author

It's not random. Every time, the tasks for tiles 3-5 stall. While I'm not using the latest version of the 'develop' branch, it does support Rocky 8. The hash of the global workflow I'm using is d6be3b5.

@GeorgeGayno-NOAA
Contributor

Let me try to run the cycle step myself. Don't delete your working directories.

@GeorgeGayno-NOAA
Contributor

I was able to run your test case using my own stand-alone script - /scratch1/NCEPDEV/da/George.Gayno/cycle.broke

If I just run tile 1, there is a bottleneck in the interpolation of the GLDAS soil moisture to the tile:

  in fixrdc for mon=           1  fngrib=
 /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/fix/am/global_soilmgldas.statsgo.t15
 34.3072.1536.grb

The interpolation for month=1 takes 6:30 minutes. And there are many uninterpolated points:

unable to interpolate. filled with nearest point value at 359656 points

The UFS_UTILS C768 regression test, which uses a non-fractional grid, runs very quickly. And there are very few uninterpolated points:

unable to interpolate. filled with nearest point value at 309 points

The C48 regression test uses a fractional grid. It runs quickly, but there is a very high percentage of uninterpolated points:

  in fixrdc for mon=           3  fngrib=
 /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS/reg_tests/global_cycl
 e/../../fix/am/global_soilmgldas.statsgo.t94.192.96.grb

  unable to interpolate.  filled with nearest point value at         1308  points

Maybe there is a problem with how the interpolation mask is being set up for fractional grids?
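
To put the uninterpolated-point counts on a common scale, they can be expressed as a percentage of the points on one tile. The sketch below rests on an assumption that is not stated in the thread: that each count refers to a single tile of res x res points (589824 points at C768, 2304 at C48). Both function names are hypothetical.

```shell
# Extract the point count from an "unable to interpolate" log line.
uninterp_count() {
  awk '/unable to interpolate/ { print $(NF-1) }' "$1"
}

# Percentage of uninterpolated points on one res x res tile.
# ASSUMPTION: the count is for a single tile; the thread does not confirm this.
uninterp_pct() {
  awk -v n="$1" -v res="$2" 'BEGIN { printf "%.2f\n", 100 * n / (res * res) }'
}

# Counts reported in this issue:
#   uninterp_pct 359656 768   # C768, fractional grid
#   uninterp_pct 309 768      # C768 regression test, non-fractional grid
#   uninterp_pct 1308 48      # C48 regression test, fractional grid
```

Under that assumption, the two fractional-grid cases come out at roughly 61% and 57% of a tile, against about 0.05% for the non-fractional case, which fits the suspicion that the mask is the problem rather than the resolution.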

@spanNOAA
Author

Could you provide guidance on setting up the interpolation mask correctly for fractional grids? Also, as we're going to run 3-week analysis-forecast cycles, I'm curious about the potential impact of using non-fractional grids instead of fractional grids.

@GeorgeGayno-NOAA
Contributor

I think the mask problem is a bug in the global_cycle code. I will need to run some tests.

@GeorgeGayno-NOAA
Contributor

@spanNOAA - I found the problem and have a fix. What hash of the ccpp-physics are you using?

@spanNOAA
Author

I checked the CMakeList file located at sorc/ufs_utils.fd/ccpp-physics, and it shows that the ccpp version is 5.0.0.

@guoqing-noaa
Contributor

guoqing-noaa commented Apr 23, 2024

For sorc/ufs_utils.fd/ccpp-physics:
the hash is:

commit 3a306a493a9a0b6c3c39c7b50d356f0ddb7c5c94 (HEAD)
Merge: eda81a58 17c73687
Author: Grant Firl <[email protected]>
Date:   Tue May 9 13:14:47 2023 -0400
    Merge pull request #65 from Qingfu-Liu/update_HR2  
    PBL and Convection and Microphysics update for HR2

for sorc/ufs_model.fd/FV3/ccpp/physics
the hash is:

commit 9b0ac7b16a45afe5e7f1abf9571d3484158a5b43 (HEAD, origin/ufs/dev, origin/HEAD, ufs/dev)
Merge: 98396808 7fa55935
Author: Grant Firl <[email protected]>
Date:   Wed Mar 27 11:26:20 2024 -0400
    Merge pull request #184 from lisa-bengtsson/cloudPR
    Introduce namelist flag xr_cnvcld to control if suspended grid-mean convective cloud condensate should be included in cloud fraction and optical depth calculation in the GFS suite

@GeorgeGayno-NOAA
Contributor

> I checked the CMakeList file located at sorc/ufs_utils.fd/ccpp-physics, and it shows that the ccpp version is 5.0.0.

I have a fix. Replace the version of sfcsub.F in /scratch2/BMC/wrfruc/Sijie.Pan/ufs-ar/arfs/sorc/ufs_utils.fd/ccpp-physics/physics with the version here: /scratch1/NCEPDEV/da/George.Gayno/cycle.broke. Then, recompile ufs_utils.

It should run now with only six mpi tasks - one task per tile.

@spanNOAA
Author

The fix resolves the issue for both gdassfcanl and gfssfcanl; both tasks now complete without any problems.
The gdasanalcalc task is a separate issue: it also gets stuck at a particular point until it exceeds the wall clock. Could you please investigate this problem as well?

@RussTreadon-NOAA
Contributor

C768 gdasanalcalc failure on Hera examined with the following findings.

Job gdasanalcalc copies interp_inc.x to chgres_inc.x. g-w ush/calcanl_gfs.py executes chgres_inc.x. As reported in this issue, chgres_inc.x hangs on Hera when processing C768 files.

Able to reproduce this behavior in stand-alone shell script which executes interp_inc.x. Script test_gw.sh in /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi-utils uses the same job configuration as gdasanalcalc. interp_inc.x hangs and the job runs until the specified job wall time is reached.

Script test.sh (same directory) alters the job configuration and interp_inc.x runs to completion.

test_gw.sh specifies

#SBATCH --nodes=4
#SBATCH --tasks-per-node=40

whereas test.sh specifies

#SBATCH --nodes=1
#SBATCH --tasks-per-node=10

Both scripts execute interp_inc.x as srun -l -n 10 --verbose --export=ALL -c 1 $interpexe. The only difference is in the indicated SBATCH lines.

calcanl_gfs.py executes chgres_inc.x. gdasanalcalc runs chgres_inc.x with

srun -n 10 --verbose --export=ALL -c 1 --distribution=arbitrary --cpu-bind=cores 

The parallel xml specifies

        <nodes>4:ppn=40:tpp=1</nodes>

for the gfs and gdas analcalc job.

The analcalc job runs several executables. interp_inc.x runs 10 tasks. calc_anal.x runs 127 tasks. gaussian_sfcanl.x runs 1 task. This is why the xml for analcalc specifies 4 nodes with 40 tasks per node.

I do not have a solution for the Hera hang in gdasanalcalc at C768. I am simply sharing what tests reveal.
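
The node count in the xml follows from the largest executable in the job, and the arithmetic can be sketched as an integer ceiling. The task counts (10, 127, 1) and ppn=40 are taken from the comment above; the function name `nodes_needed` is hypothetical.

```shell
# Smallest number of nodes that can host ntasks MPI tasks at ppn tasks per node.
# Uses integer ceiling division; the function name is illustrative only.
nodes_needed() {
  ntasks=$1
  ppn=$2
  echo $(( (ntasks + ppn - 1) / ppn ))
}

# Task counts quoted above, with ppn=40 from the xml:
#   nodes_needed 127 40   # calc_anal.x
#   nodes_needed 10 40    # interp_inc.x
#   nodes_needed 1 40     # gaussian_sfcanl.x
```

calc_anal.x is the only step that needs all 4 nodes. interp_inc.x fits on a single node, which matches the test.sh configuration that ran to completion.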

@spanNOAA
Author

spanNOAA commented Jun 7, 2024

Hi @RussTreadon-NOAA, just following up on the Hera hang issue in gdasanalcalc at C768 that we discussed about a month ago. You mentioned that there wasn't a solution available at that time and shared some test results.

I wanted to check in to see if there have been any updates or progress on resolving this issue since then.

@RussTreadon-NOAA
Contributor

@spanNOAA , no updates from me. I am not actively working on this issue.

@guoqing-noaa
Contributor

@SamuelTrahanNOAA Could you take a look at this issue? Thanks!

DavidHuber-NOAA self-assigned this on Jul 16, 2024

@DavidHuber-NOAA
Contributor

I am looking into this. Presently, I am not able to cycle past the first half-cycle due to OOM errors, so that will need to be resolved first.

@DavidHuber-NOAA
Contributor

I do not have a solution for this yet, either, but I do have some additional details. The hang occurs at line 390 of driver.F90. The mpi_send is successful in that the corresponding mpi_recv at line 413 is able to pick up the data and continue processing, but it then stops at line 422 waiting for the next mpi_send at line 392, which never comes. It is not clear to me why the mpi_send at line 390 does not return after sending the data.

@DavidHuber-NOAA
Contributor

The issue appears to be the size of the buffer that is sent via mpi_send and may reflect a bug in MPI, though I am not certain of that, based on this discussion. I have found a workaround and have a draft PR in place (NOAA-EMC/GSI-utils#49). This needs to be tested for reproducibility. Is that something you could help with @guoqing-noaa?

@guoqing-noaa
Contributor

@DavidHuber-NOAA
Thanks a lot for the help. We will test your PR#49 and update you on how it goes.

@DavidHuber-NOAA
Contributor

@guoqing-noaa I have opened PR #2819. The branch has other C768 fixes in it that will be helpful for testing. I had another problem with the analysis UPP job, so this is still a work in progress.

@guoqing-noaa
Contributor

Thanks, @DavidHuber-NOAA

@spanNOAA
Author

spanNOAA commented Aug 9, 2024

@DavidHuber-NOAA I have no issues with the C768 gdasanalcalc task after applying this fix.
