FATES Sat phenology test failing restart starting in ctsm5.1.dev174 #2478

Closed
ekluzek opened this issue Apr 17, 2024 · 5 comments · Fixed by #2436
Labels
bug (something is working incorrectly)

Comments


ekluzek commented Apr 17, 2024

Brief summary of bug

The test on Derecho

ERS_Ld30.f45_f45_mg37.I2000Clm51FatesSpCruRsGs.derecho_intel.clm-FatesColdSatPhen

fails the restart comparison.
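
To reproduce, a minimal sketch of running the failing test with CIME's create_test from the top of a CTSM checkout on Derecho (the --test-id value is just an illustrative label):

 # run from the top-level CTSM directory; --test-id is an arbitrary example label
 ./cime/scripts/create_test \
     ERS_Ld30.f45_f45_mg37.I2000Clm51FatesSpCruRsGs.derecho_intel.clm-FatesColdSatPhen \
     --test-id fates-restart-bug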

General bug information

CTSM version you are using: Seen in ctsm5.1.dev175, ctsm5.1.dev176 and in what will be ctsm5.2.0

Does this bug cause significantly incorrect results in the model's science? Possibly; restart failures like this can indicate a problem in the science.

Configurations affected: FATES SP mode.

Details of bug

@glemieux found it works in ctsm5.1.dev174 but fails in later versions.

Important details of your setup / configuration so we can reproduce the bug

Interestingly this test

ERP_P128x2_Ld30.f45_f45_mg37.I2000Clm60FatesSpCruRsGs.derecho_intel.clm-FatesColdSatPhen

has been passing. I've verified that there are only two differences:

ERS vs ERP
PE layout of 128x2 vs the default

The test-type difference just means the restart portion runs on half the number of processors, i.e., 64 MPI tasks with one thread.

The PE layout difference is a sequential layout on 2 nodes with threading, versus the default, which is a concurrent layout with one node for DATM and 4 nodes for CTSM, with no threads.
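
For reference, a rough sketch of forcing a particular task/thread layout on an existing test case to probe this sensitivity (the NTASKS/NTHRDS/ROOTPE values below are illustrative, not the actual Derecho defaults):

 # run from the case directory; values are illustrative only
 ./xmlchange NTASKS=512,NTHRDS=1,ROOTPE=0
 ./case.setup --reset
 ./case.build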

Important output or errors that show the problem

The cprnc.out files in TestStatus.log show no differences in the cpl history files, but four fields differ in the CTSM history files as follows:

 RMS FATES_MEANLIQVOL_DROUGHTPHEN_PF  3.3974E-02            NORMALIZED  6.8104E-02
 RMS FATES_L2FR_CANOPY_REC_PF         3.0387E-02            NORMALIZED  3.0401E-02
 RMS FATES_L2FR_USTORY_REC_PF         3.0387E-02            NORMALIZED  3.0401E-02
 RMS FATES_MEANLIQVOL_DROUGHTPHEN_PF  3.3974E-02            NORMALIZED  6.8104E-02
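
Those RMS lines come from cprnc. A hedged sketch of rerunning the comparison by hand on the two CTSM history files written by the base and restart runs (the file names below are placeholders):

 # placeholder file names for the base-run and restart-run CLM history files
 cprnc base_run.clm2.h0.nc rest_run.clm2.h0.nc > cprnc.out
 grep RMS cprnc.out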
@ekluzek added the "bug (something is working incorrectly)" and "next (this should get some attention in the next week or two. Normally each Thursday SE meeting.)" labels on Apr 17, 2024
@glemieux

Note this was passing with ctsm5.1.dev174 which can be seen via the fates-sci.1.72.5_api.34.0.0-ctsm5.1.dev174 baseline.

I think the culprit was the change to the 4x5 pe layout in dev175: 1bab27e#diff-0adb226a99bab4341e445fe78998c621f9fb5c44ab0c0d0186e4658a5e280f84


ekluzek commented Apr 17, 2024

Well -- the change from a single node to four nodes does seem to be a problem. However, we expect that NO change in PE layout should break restarts. So the fact that it was working with a single node isn't the real issue. What we want to fix is to get it to work for any PE layout.


glemieux commented Apr 17, 2024

I've been able to narrow down the fates tag this came in with by checking out an API33 tag, updating the 4x5 pe layout to match the current ctsm master, and working through the fates tags. The issue crops up with https://github.com/NGEET/fates/releases/tag/sci.1.71.3_api.33.0.0. I'm only seeing two history variables with issues now:

RMS FATES_TVEG                              NaN            NORMALIZED  0.0000E+00
RMS FATES_MEANLIQVOL_DROUGHTPHEN_PF  2.5649E-02            NORMALIZED  5.1367E-02
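
For the tag-by-tag search described above, a sketch of the kind of checkout involved, assuming the FATES external is checked out at src/fates inside the CTSM clone (the tag name is the one linked above):

 # from the top-level CTSM directory; src/fates is where the FATES external lives
 cd src/fates
 git fetch --tags
 git checkout sci.1.71.3_api.33.0.0
 cd ../..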

@wwieder removed the "next (this should get some attention in the next week or two. Normally each Thursday SE meeting.)" label on Apr 25, 2024

rgknox commented Apr 25, 2024

This might have been fixed via: NGEET/fates#1189 ?


ekluzek commented Apr 26, 2024

@rgknox yep you are right! @glemieux figured that out and I confirmed that the update in ctsm5.2.002 fixed it. So that PR closed this.
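
As a sanity check of the fix, the same ERS test can be rerun from the fixed tag; a hedged sketch (the test name is the one from the issue body and may need adjusting on newer tags, and the externals/submodules must be updated for the tag first with whatever tool that tag uses):

 git checkout ctsm5.2.002
 # update externals/submodules for this tag, then:
 ./cime/scripts/create_test \
     ERS_Ld30.f45_f45_mg37.I2000Clm51FatesSpCruRsGs.derecho_intel.clm-FatesColdSatPhen \
     --test-id verify-fix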
