
[QUESTION] How do I optimize the number of cores and nodes for a C48 simulation? #69

Closed
LiamBindle opened this issue Dec 15, 2020 · 9 comments
Labels: category: Question Further information is requested

@LiamBindle
Contributor

Hi Maria, I opened this as a separate issue because I think it might be of interest to other users, and this will make it easier to find.

I'm opening this question on behalf of @mtsivlidou who writes

For my project, I need to do a 30-year simulation with monthly output. I used 6 nodes with 36 cores per node, i.e. 216 total cores, for a single segment. A one-month simulation under these settings (C48 / 2x2.5 resolution) needs ~9 hours (wall time), which I find rather expensive.
How many nodes and cores should I use to optimize the time the simulation needs? I was also thinking of changing the chemistry (and transport) timestep from 600 s (1200 s) to 900 s (1800 s).
Thank you in advance,
Maria

@LiamBindle LiamBindle added the category: Question Further information is requested label Dec 15, 2020
@LiamBindle
Contributor Author

Here's a figure of GCHP scalability that @lizziel and I worked on. Note that it is for GCHP v13 (broadly speaking, I suspect it translates okay to 12.9).

[Figure: GCHP throughput versus number of cores on two clusters]

The dashed line is GCHP on Odyssey (a cluster at Harvard; Intel compilers & MPI), and the solid line is on Compute1 (a cluster at WashU; GNU compilers & OpenMPI). Note that since this figure was generated, the Compute1 sysadmins discovered a bug that was degrading performance at high core counts; I haven't had time to remake the figure.


Ideas

Here are some ideas you could consider:

  1. Try setting `export OMP_NUM_THREADS=1` before you launch GCHP. Does it improve your throughput? See: [BUG/ISSUE] Significant performance degradation on multiple platforms affecting GEOS-Chem Classic and GCHP HEMCO#57
  2. Try some 2-week timing tests (the out-of-the-box standard run directory is okay). I would try 48, 96, 144, 192, and 384 cores. This is helpful for finding the sweet spot without getting into technical details. If you do this, please let us know---these results are really valuable to the community.
  3. Ask your sysadmins to monitor a running simulation. Specifically, I would ask if they could check that the node-node interconnects are operating as expected for InfiniBand.
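As a sketch of idea 1, here is what the environment change looks like before launching the model (the launcher invocation and core count are placeholders for whatever your scheduler provides):

```shell
# Idea 1: pin OpenMP to a single thread before launching GCHP.
# GCHP parallelizes with MPI; leftover OpenMP threading in linked
# libraries can oversubscribe cores and hurt throughput.
export OMP_NUM_THREADS=1

# Then launch as usual, e.g. (placeholder launcher and core count --
# match -n to the total cores requested from your scheduler):
#   mpirun -n 144 ./gchp
```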

Another thing I would try is setting `WRITE_RESTART_BY_OSERVER: YES` in `GCHP.rc`. GMAO has recommended this for OpenMPI, but it only affects writing checkpoint files, so it will probably make only a small difference in your case.

I hope this is helpful. Please let me know if you have any questions.
-Liam

@LiamBindle LiamBindle changed the title [QUESTION] How do I optimize the number of cores and nodes for a C48 simulation. [QUESTION] How do I optimize the number of cores and nodes for a C48 simulation? Dec 15, 2020
@LiamBindle
Contributor Author

@sdeastham and @WilliamDowns is there anything else that you would suggest?

@sdeastham
Contributor

I think @LiamBindle gave a great summary! The only thing I would add is to be very careful about changing the transport timestep. I am not certain how robust the FV3 system is to CFL exceedances in the current version; this is a standing question (FV3 in theory has a substepping option, but I do not know whether it is currently used, since part of the point of the cubed-sphere layout is to avoid CFL exceedances). Changing the chemistry timestep may help, but I would try other options first, as increasing the chemistry timestep is known to degrade the accuracy of the solution.
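For context on why the transport timestep is sensitive, here is a back-of-envelope Courant-number check. The cell width (~200 km for C48) and wind speed (100 m/s, a strong jet) are illustrative assumptions, not model values:

```python
# Rough Courant-number check illustrating why stretching the transport
# timestep is risky. Numbers are illustrative assumptions only:
# ~200 km C48 cell width and a 100 m/s jet-stream wind.

def cfl_number(wind_m_s, dt_s, cell_width_m):
    """Courant number C = u * dt / dx; C approaching or exceeding 1
    risks violating the CFL stability condition for explicit advection."""
    return wind_m_s * dt_s / cell_width_m

# Default 1200 s vs. proposed 1800 s transport timestep:
print(cfl_number(100.0, 1200.0, 200e3))  # 0.6
print(cfl_number(100.0, 1800.0, 200e3))  # 0.9
```

Under these assumptions, going from 1200 s to 1800 s moves the Courant number from 0.6 to 0.9, uncomfortably close to the stability limit.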

@LiamBindle
Contributor Author

I'm going to close this. Feel free to reopen if you have any questions.

@mtsivlidou

Hello Liam,
Thank you all for your suggestions. I am still working on it, I will let you know about the outcome when the runs are finished.
I am not sure I understand your 3rd suggestion
(3. Ask your sysadmins to monitor a running simulation. Specifically, I would ask if they could check that the node-node interconnects are operating as expected for InfiniBand.)
I would appreciate it if you could explain what you mean.
Thank you in advance,
Maria

@LiamBindle LiamBindle reopened this Dec 21, 2020
@mtsivlidou

Also, I would like to ask whether you have done similar estimates for the C90 and C180 resolutions. If yes, could you please provide the corresponding plots of model throughput versus number of cores?

Thank you,
Maria

@LiamBindle
Contributor Author

LiamBindle commented Dec 21, 2020

Hi Maria,

Sorry, I didn't realize—I've reopened this. I suspect 1&2 will be the most effective.

For 3, what I meant is that GCHP's performance becomes increasingly sensitive to your cluster's network at high core counts, so anything you can do to make sure the network is performing properly could be useful. This probably isn't necessary on a large or mature cluster, but it might help on a smaller or newer cluster where large MPI jobs aren't well tested. It's a bit of a catch-all answer, but your sysadmins' expertise, combined with their elevated viewing privileges, means they might notice something we couldn't (e.g., a part of the network that isn't performing well, or an environment setting they would recommend for your cluster). This one is definitely optional, but it might be worth pursuing if you notice anything weird or unreliable.

> Also, I would like to ask whether you have done similar estimates for the C90 and C180 resolutions. If yes, could you please provide the corresponding plots of model throughput versus number of cores?

We only ran C96 and C192 for these timing tests (so grid-boxes per CPU was constant), but you can compare these to your C90 and C180 runs (they're close in terms of compute complexity). You can scale the C96 line by 0.88 (90/96 squared), so it's more directly comparable to your C90 run, and the scaling factor is the same for C192→C180.
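The scaling factor mentioned above is just the squared ratio of cubed-sphere face resolutions, since the number of grid columns scales with N². A quick check of the arithmetic:

```python
# Relative compute cost between cubed-sphere grids CN_a and CN_b:
# each grid has 6 * N^2 columns, so cost scales with (N_a / N_b)^2.
def resolution_factor(n_a, n_b):
    return (n_a / n_b) ** 2

print(round(resolution_factor(90, 96), 2))    # 0.88
print(round(resolution_factor(180, 192), 2))  # 0.88
```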

Let me know if you have any questions,
Liam

@mtsivlidou

Hello,
I have done some tests as you suggested (with slightly adjusted core counts, because the maximum number of cores per node on the supercomputer I use is 36), and I get the results below:

  • 48 cores: 29 days/day
  • 108 cores: 53 days/day
  • 144 cores: 68 days/day
  • 180 cores: 73 days/day
  • 396 cores: 105 days/day

So, for ~100 cores I get 53 days/day, while you get almost double (100 days/day). Do you think this might be related to the supercomputer I use, or to the way I set up the model?
I contacted the support team of the supercomputer I use. I will let you know if I have any update.
Thank you for your time,
Maria

@LiamBindle
Contributor Author

LiamBindle commented Jan 4, 2021

Hi Maria,

It looks like you're getting okay scaling up to ~144 cores, so that seems like a good number to start running your simulations with. The throughput does look low, but it's hard to say whether it's outside what we should expect. To me it seems a bit slow, but not absurd; frequent HISTORY output (our timing tests wrote output once per week), enabling MAPL debug mode, differences in cluster hardware/configuration, etc., significantly affect model throughput, so direct comparisons of these numbers are challenging and potentially misleading.
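One way to quantify that tail-off from the numbers above, taking the 48-core run as the baseline (a rough sketch; perfect scaling would keep throughput per core constant):

```python
# Parallel efficiency relative to the 48-core baseline run.
# Efficiency 1.0 means throughput per core matches the baseline;
# lower values show where adding cores stops paying off.
def efficiency(cores, days_per_day, base_cores=48, base_tput=29):
    return (days_per_day / cores) / (base_tput / base_cores)

runs = {48: 29, 108: 53, 144: 68, 180: 73, 396: 105}
for cores, tput in runs.items():
    print(f"{cores} cores: efficiency {efficiency(cores, tput):.2f}")
```

By this measure, efficiency drops steadily past ~144 cores, which is consistent with picking 144 as the practical sweet spot.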

I would recommend starting your simulations with 144 cores. Make sure that `MAPL_DEBUG_LEVEL` and `MEMORY_DEBUG_LEVEL` are both 0 in `runConfig.sh` (i.e., make sure they are disabled, because they will slow your simulation down considerably).
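A sketch of the relevant `runConfig.sh` lines (the variable names come from the comment above; the surrounding script content is assumed):

```shell
# In runConfig.sh: keep MAPL debug output disabled for production runs.
# Either option set above 0 slows the simulation down considerably.
MAPL_DEBUG_LEVEL=0
MEMORY_DEBUG_LEVEL=0
```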

(Optional) While your simulation is running, you could reach out to the sysadmins and ask them what is bottlenecking the performance. Since the simulation is running, they can likely login and check if the nodes are CPU-bound or IO-bound (filesystem or network), and they might have some valuable insight.

Hope this is helpful! Let me know if you have any questions.
Liam
