UFS_WEATHER_MODEL HR.v4 cannot be run with fully packed nodes on Gaea C5 at C1152 resolution #2540

Open
GeorgeVandenberghe-NOAA opened this issue Dec 18, 2024 · 32 comments
Labels
bug Something isn't working

Comments

@GeorgeVandenberghe-NOAA
Collaborator

When ufs-weather-model (tested with HR.v4) is run at C1152 resolution on Gaea C5 with ESMF-managed threading, it hangs or fails at 128 MPI ranks per node. ESMF-managed threading requires 128 ranks per node for full use of the node because it disables traditional threading, so we cannot run C1152 with ESMF-managed threading. It is possible to get full use of the node by running with traditional threading and multiple threads per task (two threads at 64 ranks per node, or four threads at 32 ranks per node), but other components that do not thread well then use their nodes inefficiently. The hypothesis is that the 2 GB/core memory limit is insufficient to run this configuration fully packed at 128 ranks per node, but that raises the question: what is using so much memory even at very high rank counts? It has failed with 256 ranks per I/O task and two ESMF threads, and with 512 ranks per I/O task and two ESMF threads.
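The node arithmetic behind this description can be sketched as follows. This is a minimal illustration, assuming 128 cores and roughly 256 GB per node (the 256 GB figure comes from a later comment in this thread) and assuming memory is shared evenly between ranks. The function name `layout` is hypothetical and not part of any UFS or ESMF tool.

```python
# Illustrative arithmetic only. Node size and memory are figures quoted in this
# thread (128 cores per node, about 256 GB per node, i.e. 2 GB per core).
CORES_PER_NODE = 128
NODE_MEMORY_GB = 256


def layout(threads_per_rank: int) -> tuple[int, float]:
    """Return (ranks per node, GB per rank) for a fully packed node.

    Assumes every core is used and memory is split evenly between ranks.
    """
    if threads_per_rank <= 0 or CORES_PER_NODE % threads_per_rank != 0:
        raise ValueError("threads per rank must evenly divide the core count")
    ranks_per_node = CORES_PER_NODE // threads_per_rank
    return ranks_per_node, NODE_MEMORY_GB / ranks_per_node


for threads in (1, 2, 4):
    ranks, mem = layout(threads)
    print(f"{threads} thread(s) per rank: {ranks} ranks/node, {mem:.0f} GB per rank")
```

Under these assumptions, each doubling of the threading level doubles the memory available to a rank, which is why the 64 and 32 ranks-per-node layouts relieve the pressure.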

@GeorgeVandenberghe-NOAA GeorgeVandenberghe-NOAA added the bug Something isn't working label Dec 18, 2024
@GeorgeVandenberghe-NOAA
Collaborator Author

The issue can be mitigated by running with traditional threads and 64 or fewer ranks per node.

@theurich
Collaborator

@GeorgeVandenberghe-NOAA Have you been able to attempt ESMF-managed threading runs at full core capacity with the custom Verbosity setting as per:

# EARTH #
EARTH_component_list: MED ATM OCN ICE WAV
EARTH_attributes::
  Verbosity = 32563
::

This should dump a lot of memory tracing information into the ESMF PET* log files. It might give us a clue as to where/why memory pressure is growing to the point of failure. If you have PET* log files with that extra info, I would like to look at them. Thanks!
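As far as I understand the NUOPC documentation, the Verbosity attribute is converted to an integer and read as a bit field, so a custom value such as 32563 selects a particular set of logging options. The sketch below only lists which bit positions are set; it makes no claim about what each bit enables, and the helper name `set_bits` is hypothetical.

```python
def set_bits(value: int) -> list[int]:
    """Return the positions of the bits set in a non-negative integer, lowest first."""
    if value < 0:
        raise ValueError("expected a non-negative integer")
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# Bit positions selected by the custom setting suggested above.
print(set_bits(32563))
```

Comparing the result against the bit table in the NUOPC reference for the component in question would show exactly which memory and tracing options the value turns on.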

@GeorgeVandenberghe-NOAA
Collaborator Author

@theurich
Collaborator

theurich commented Dec 19, 2024

> What do I toggle to get those PET logs turned on?

In ufs.configure:

# ESMF #
logKindFlag:            ESMF_LOGKIND_MULTI
globalResourceControl:  true

@GeorgeVandenberghe-NOAA
Collaborator Author

@theurich
Collaborator

@GeorgeVandenberghe-NOAA I looked at the memory tracing, and it looks to me like the run dies because of memory pressure on the nodes that run the WAV component. WAV in this run is set up to execute on 998 PETs. Does the WAV configuration work on that number of PETs under traditional threading?

@GeorgeVandenberghe-NOAA
Collaborator Author

@theurich
Collaborator

theurich commented Dec 20, 2024

You could try running WAV with different threading levels under ESMF-managed threading. E.g.

# WAV #
WAV_model:                      ww3
WAV_petlist_bounds:             8296 9293
WAV_omp_num_threads:            2
WAV_attributes::
  Verbosity = 0
  OverwriteSlice = false
  mesh_wav = mesh.uglo_m1g16.nc
  user_sets_restname = false
::

This runs WAV 2x threaded, therefore using 64 tasks per node. Alternatively, set

WAV_omp_num_threads:            4

for 4x threading, using 32 tasks per node. All of these cases still use 998 cores in total; only the threading level changes. I would be curious to see how that changes things.
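The relationship between the petlist bounds and the threading level can be sketched as below. This is a rough illustration, assuming that under ESMF-managed threading the petlist bounds count cores and that each task takes `omp_num_threads` of them, as described for the ATM component later in this thread. The function name `component_layout` is hypothetical. The thread does not say how ESMF treats a core count that is not a multiple of the thread count, so the sketch simply reports the remainder.

```python
def component_layout(petlist_bounds: str, omp_num_threads: int) -> tuple[int, int, int]:
    """Return (cores, tasks, leftover cores) for one component.

    petlist_bounds is the "first last" string from ufs.configure.
    """
    first, last = (int(field) for field in petlist_bounds.split())
    cores = last - first + 1
    tasks, leftover = divmod(cores, omp_num_threads)
    return cores, tasks, leftover


for threads in (1, 2, 4):
    cores, tasks, leftover = component_layout("8296 9293", threads)
    print(f"{threads}x threaded: {cores} cores, {tasks} tasks, {leftover} cores left over")
```

With the bounds shown above, the core count is 998, matching the 998 PETs mentioned earlier.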

@GeorgeVandenberghe-NOAA
Collaborator Author

@JessicaMeixner-NOAA
Collaborator

> @GeorgeVandenberghe-NOAA I looked at the memory tracing, and it looks to me like the run dies because of memory pressure on the nodes that run the WAV component. WAV in this run is set up to execute on 998 PETs. Does the WAV configuration work on that number of PETs under traditional threading?

Thanks @GeorgeVandenberghe-NOAA and @theurich - I just wanted to acknowledge here that the wave people have seen this. @DeniseWorthen has also observed the wave memory issues and has done some work to address some of the issues, which can be seen in a draft PR here: NOAA-EMC/WW3#1317

@GeorgeVandenberghe-NOAA
Collaborator Author

@DeniseWorthen
Collaborator

It doesn't look like the case @GeorgeVandenberghe-NOAA is pointing to (/gpfs/f5/scratch/gwv/hr4j/da) has the PIO in WW3 enabled. Is that intentional?

@JessicaMeixner-NOAA
Collaborator

I can try to get @GeorgeVandenberghe-NOAA a test case with PIO enabled by the end of the day. With @sbanihash's help we almost have a PR ready for g-w to generate a new test case.

@GeorgeVandenberghe-NOAA
Collaborator Author

@DeniseWorthen
Collaborator

You can toggle inline post off in model_configure (write_dopost: .false.).

To toggle PIO for WW3, the model needs to have been compiled w/ PIO in the switch for WW3. I don't know if your case has that or not.

@theurich
Collaborator

@DeniseWorthen and @JessicaMeixner-NOAA It's great to know wave people are aware of the memory pressure coming from WW3, and even greater there is already a PR to address it! Do you think George should be testing here with WW3 changes from that PR?

@GeorgeVandenberghe-NOAA is the next step to attempt a run on the same layout (as far as tasks and threading are concerned for each component), but with inline post off and PIO active for WW3? We would expect a successful run. After that, turn inline post back on and observe what happens?

@DeniseWorthen
Collaborator

@theurich There are two things that have been/can be done w/rt WW3 memory pressure. The first was implementing PIO for WW3 restarts. That has been committed; it requires compiling WW3 w/ the PIO ifdef and some additional settings in ufs.configure. It may not yet be in G-W though.

The second is a draft PR to eliminate duplicate fields. That has sat in draft because I ran into a test case (Hera+GNU+Release) that did not reproduce baselines. All other cases did. I also ran cases on Hercules and Gaea and everything passed.

Hera uses a more recent GNU version though. Since the GNU+Debug passed, my supposition is that there is an optimization which is changing answers, but I have not had time to debug.

@GeorgeVandenberghe-NOAA
Collaborator Author

@theurich
Collaborator

It sounds right to focus on the inline post memory pressure issue. Do you have memory logging from a run where WAV isn't running out of memory, but where inline post is causing the issue, that I can look at? Thanks.

@GeorgeVandenberghe-NOAA
Collaborator Author

@theurich
Collaborator

@GeorgeVandenberghe-NOAA It looks like the PET* log files under /gpfs/f5/scratch/gwv/hr4j/da contain output from several different runs. That makes them very hard to post-process and analyze. Could you post PET* log files somewhere from just one single run that fails due to inline post?

@GeorgeVandenberghe-NOAA
Collaborator Author

@theurich
Collaborator

What happens if you increase the threading level for ATM, e.g. to 4x?

# ATM #
ATM_model:                      fv3
ATM_petlist_bounds:             0 7935
ATM_omp_num_threads:            4
ATM_attributes::
  Verbosity = 0
  DumpFields = false
  ProfileMemory = false
  OverwriteSlice = true
::

This change also requires changing model_configure so that the WRT components keep getting the same number of packed cores:

quilting:                .true.
quilting_restart:        .true.
write_groups:            2
write_tasks_per_group:   128

With this, the FCST component still gets the first 6912 cores (now using them with 1728 tasks, 4x threaded). The two WRT components each get 128x4 = 512 cores as before, now using those cores with 128 tasks, 4x threaded.

Does that reduce the memory pressure?
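The core accounting in this suggestion can be checked with a short sketch. It uses only the numbers from the ufs.configure and model_configure fragments above, and assumes that each write task takes `ATM_omp_num_threads` cores and that the forecast component gets whatever is left. The function name `atm_core_split` is hypothetical.

```python
def atm_core_split(bounds: tuple[int, int], threads: int,
                   write_groups: int, write_tasks_per_group: int) -> dict[str, int]:
    """Split the ATM core range between the FCST component and the WRT groups."""
    total_cores = bounds[1] - bounds[0] + 1
    cores_per_write_group = write_tasks_per_group * threads
    fcst_cores = total_cores - write_groups * cores_per_write_group
    if fcst_cores <= 0 or fcst_cores % threads != 0:
        raise ValueError("layout does not leave a whole number of FCST tasks")
    return {
        "total_cores": total_cores,
        "cores_per_write_group": cores_per_write_group,
        "fcst_cores": fcst_cores,
        "fcst_tasks": fcst_cores // threads,
    }


# ATM_petlist_bounds 0 7935, 4 threads, 2 write groups of 128 tasks each.
print(atm_core_split((0, 7935), 4, 2, 128))
```

The same function can be used to check any other threading level before editing both files, since the two settings have to change together.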

@theurich
Collaborator

In future runs, could you set Verbosity = high for all of the components that currently set Verbosity = 0? That gives more context when looking at the PET* logs. Thanks!

@GeorgeVandenberghe-NOAA
Collaborator Author

Using four ESMF threads rather than two while maintaining 256 ranks per I/O group (and post group) allows it to run through. There are two sources of memory pressure: WAVE (not closely examined, but specifying two wave threads rather than one fixes that side of it) and UPP. UPP simply runs out of memory with only 256 GB available on 128-core nodes. I put in some summary() calls to print out getrusage stats. Post memory usage per rank scales inversely with the number of ranks (good), but the total is huge, roughly 10x the size of the 81 GB history state it is trying to post-process. Post-processing a C1152 forecast needs about 1.3 TB of total memory spread across the ranks, and the sum over all ranks on one node cannot exceed 250 GB or so. This will have to be reexamined with even modest increases in resolution.
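A back-of-the-envelope check of these numbers is sketched below. It is only an estimate under loud assumptions: that the roughly 1.3 TB (taken here as 1300 GB) belongs to a single post group of 256 ranks, that it is spread evenly across those ranks, and that nodes are fully packed with 128 cores. The function name `post_fits` is hypothetical.

```python
import math

POST_TOTAL_GB = 1300    # "about 1.3 TB" quoted above, taken as decimal GB (assumption)
NODE_BUDGET_GB = 250    # "250 GB or so" usable per node, quoted above
CORES_PER_NODE = 128
RANKS_PER_GROUP = 256   # ranks per I/O (and post) group, quoted above


def post_fits(threads_per_rank: int) -> tuple[float, float, bool]:
    """Return (GB per rank, GB per packed node, whether the node budget holds)."""
    gb_per_rank = POST_TOTAL_GB / RANKS_PER_GROUP
    ranks_per_node = CORES_PER_NODE // threads_per_rank
    gb_per_node = gb_per_rank * ranks_per_node
    return gb_per_rank, gb_per_node, gb_per_node <= NODE_BUDGET_GB


# Fewest nodes that could hold the post memory at all, regardless of layout.
min_nodes = math.ceil(POST_TOTAL_GB / NODE_BUDGET_GB)

for threads in (2, 4):
    per_rank, per_node, ok = post_fits(threads)
    print(f"{threads} threads: {per_rank:.2f} GB/rank, {per_node:.1f} GB/node, fits={ok}")
print("minimum nodes:", min_nodes)
```

Under these assumptions, two threads per rank put about 325 GB on a node, over the budget, while four threads put about 162 GB on a node, which is consistent with the four-thread run completing.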


4 participants