
[QUESTION] How do I optimize the number of cores and nodes for a C48 simulation? #69

Closed
LiamBindle opened this issue Dec 15, 2020 · 9 comments
Labels: category: Question Further information is requested

@LiamBindle
Contributor

Hi Maria, I opened this as a separate issue because I think it might be of interest to other users, and this will make it easier to find.

I'm opening this question on behalf of @mtsivlidou who writes

For my project, I need to do a 30-year simulation with monthly output. I used 6 nodes with 36 cores per node, i.e. 216 total cores, for a single segment. A one-month simulation under these settings (C48 / 2x2.5 resolution) needs ~9 hours (wall time), which I find rather expensive.
How many nodes and cores should I use to optimize the time the simulation needs? I was also thinking of changing the chemistry (and transport) timestep from 600 s (1200 s) to 900 s (1800 s).
Thank you in advance,
Maria

@LiamBindle LiamBindle added the category: Question Further information is requested label Dec 15, 2020
@LiamBindle
Contributor Author

Here's a figure of GCHP scalability that @lizziel and I worked on. Note that it is for GCHP v13 (broadly speaking, I suspect it translates okay to 12.9).

[Figure: GCHP throughput versus number of cores on two clusters]

The dashed line is GCHP on Odyssey (a cluster at Harvard; Intel compilers & MPI), and the solid line is on Compute1 (a cluster at WashU; GNU compilers & OpenMPI). Note that since this figure was generated, the Compute1 sysadmins discovered a bug that was degrading performance at high core counts; I haven't had time to remake the figure.


Ideas

Here are some ideas you could consider:

  1. Try setting `export OMP_NUM_THREADS=1` before you launch GCHP. Does it improve your throughput? See: [BUG/ISSUE] Significant performance degradation on multiple platforms affecting GEOS-Chem Classic and GCHP HEMCO#57
  2. Try some 2-week timing tests (the out-of-the-box standard run directory is okay). I would try 48, 96, 144, 192, and 384 cores. This is helpful for finding the sweet spot without getting into technical details. If you do this, please let us know---these results are really valuable to the community.
  3. Ask your sysadmins to monitor a running simulation. Specifically, I would ask if they could check that the node-node interconnects are operating as expected for InfiniBand.
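As a sketch of idea 1, here is what the environment change looks like before launching the model (the launcher invocation and core count are placeholders for whatever your scheduler provides):

```shell
# Idea 1: pin OpenMP to a single thread before launching GCHP.
# GCHP parallelizes with MPI; leftover OpenMP threading in linked
# libraries can oversubscribe cores and hurt throughput.
export OMP_NUM_THREADS=1

# Then launch as usual, e.g. (placeholder launcher and core count --
# match -n to the total cores requested from your scheduler):
#   mpirun -n 144 ./gchp
```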

Another thing I would try is setting `WRITE_RESTART_BY_OSERVER: YES` in `GCHP.rc`. GMAO has recommended this for OpenMPI, but it only affects writing checkpoint files, so it will probably make only a small difference in your case.

I hope this is helpful. Please let me know if you have any questions.
-Liam

@LiamBindle LiamBindle changed the title [QUESTION] How do I optimize the number of cores and nodes for a C48 simulation. [QUESTION] How do I optimize the number of cores and nodes for a C48 simulation? Dec 15, 2020
@LiamBindle
Contributor Author

@sdeastham and @WilliamDowns is there anything else that you would suggest?

@sdeastham
Contributor

I think @LiamBindle gave a great summary! The only thing I would add is to be very careful about changing the transport timestep. I am not certain how robust the FV3 system is to CFL exceedances in the current version; this is a standing question (FV3 in theory has a substepping option, but I do not know whether it is currently used, since part of the point of the cubed-sphere layout is to avoid CFL exceedances). Changing the chemistry timestep may help, but I would try other options first, as increasing the chemistry timestep is known to degrade the accuracy of the solution.
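For context on why the transport timestep is sensitive, here is a back-of-envelope Courant-number check. The cell width (~200 km for C48) and wind speed (100 m/s, a strong jet) are illustrative assumptions, not model values:

```python
# Rough Courant-number check illustrating why stretching the transport
# timestep is risky. Numbers are illustrative assumptions only:
# ~200 km C48 cell width and a 100 m/s jet-stream wind.

def cfl_number(wind_m_s, dt_s, cell_width_m):
    """Courant number C = u * dt / dx; C approaching or exceeding 1
    risks violating the CFL stability condition for explicit advection."""
    return wind_m_s * dt_s / cell_width_m

# Default 1200 s vs. proposed 1800 s transport timestep:
print(cfl_number(100.0, 1200.0, 200e3))  # 0.6
print(cfl_number(100.0, 1800.0, 200e3))  # 0.9
```

Under these assumptions, going from 1200 s to 1800 s moves the Courant number from 0.6 to 0.9, uncomfortably close to the stability limit.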

@LiamBindle
Contributor Author

I'm going to close this. Feel free to reopen if you have any questions.

@mtsivlidou

Hello Liam,
Thank you all for your suggestions. I am still working on it, I will let you know about the outcome when the runs are finished.
I am not sure I understand your 3rd suggestion
(3. Ask your sysadmins to monitor a running simulation. Specifically, I would ask if they could check that the node-node interconnects are operating as expected for InfiniBand.)
I would appreciate it if you could explain what you mean.
Thank you in advance,
Maria

@LiamBindle LiamBindle reopened this Dec 21, 2020
@mtsivlidou

Also, I would like to ask whether you have done similar estimates for the C90 and C180 resolutions. If yes, could you please provide the corresponding plots of model throughput versus number of cores?

Thank you,
Maria

@LiamBindle
Contributor Author

LiamBindle commented Dec 21, 2020

Hi Maria,

Sorry, I didn't realize—I've reopened this. I suspect 1&2 will be the most effective.

For 3, what I meant is that GCHP's performance becomes increasingly sensitive to your cluster's network at high core counts, so anything you can do to make sure the network is performing properly could be useful. This probably isn't necessary on a large or mature cluster, but it might help on a smaller or newer cluster where large MPI jobs aren't well tested. It's a bit of a catch-all answer, but your sysadmins' expertise, combined with their elevated viewing privileges, means they might notice something we couldn't (e.g., a part of the network that isn't performing well, or an environment setting they would recommend for your cluster). This one is definitely optional, but it might be worth pursuing if you notice anything weird or unreliable.

> Also, I would like to ask whether you have done similar estimates for the C90 and C180 resolutions. If yes, could you please provide the corresponding plots of model throughput versus number of cores?

We only ran C96 and C192 for these timing tests (so grid-boxes per CPU was constant), but you can compare these to your C90 and C180 runs (they're close in terms of compute complexity). You can scale the C96 line by 0.88 (90/96 squared), so it's more directly comparable to your C90 run, and the scaling factor is the same for C192→C180.
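The scaling factor mentioned above is just the squared ratio of cubed-sphere face resolutions, since the number of grid columns scales with N². A quick check of the arithmetic:

```python
# Relative compute cost between cubed-sphere grids CN_a and CN_b:
# each grid has 6 * N^2 columns, so cost scales with (N_a / N_b)^2.
def resolution_factor(n_a, n_b):
    return (n_a / n_b) ** 2

print(round(resolution_factor(90, 96), 2))    # 0.88
print(round(resolution_factor(180, 192), 2))  # 0.88
```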

Let me know if you have any questions,
Liam

@mtsivlidou

Hello,
I have done some tests as you suggested (with slightly adjusted core counts, because the maximum number of cores per node on the supercomputer I use is 36), and I get the results below:

  • 48 cores: 29 days/day
  • 108 cores: 53 days/day
  • 144 cores: 68 days/day
  • 180 cores: 73 days/day
  • 396 cores: 105 days/day

So, for ~100 cores I get 53 days/day, while you get almost double (100 days/day). Do you think this might be related to the supercomputer I use, or to the way I set up the model?
I contacted the support team of the supercomputer I use. I will let you know if I have any update.
Thank you for your time,
Maria

@LiamBindle
Contributor Author

LiamBindle commented Jan 4, 2021

Hi Maria,

It looks like you're getting okay scaling up to ~144 cores, so that seems like a good number to start running your simulations with. The throughput does look low, but it's hard to say whether it's outside what we should expect. To me it seems a bit slow, but not absurd; frequent HISTORY output (our timing tests wrote output once per week), enabling MAPL debug mode, differences in cluster hardware/configuration, etc., significantly affect model throughput, so direct comparisons of these numbers are challenging and potentially misleading.
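One way to quantify that tail-off from the numbers above, taking the 48-core run as the baseline (a rough sketch; perfect scaling would keep throughput per core constant):

```python
# Parallel efficiency relative to the 48-core baseline run.
# Efficiency 1.0 means throughput per core matches the baseline;
# lower values show where adding cores stops paying off.
def efficiency(cores, days_per_day, base_cores=48, base_tput=29):
    return (days_per_day / cores) / (base_tput / base_cores)

runs = {48: 29, 108: 53, 144: 68, 180: 73, 396: 105}
for cores, tput in runs.items():
    print(f"{cores} cores: efficiency {efficiency(cores, tput):.2f}")
```

By this measure, efficiency drops steadily past ~144 cores, which is consistent with picking 144 as the practical sweet spot.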

I would recommend starting your simulations with 144 cores. Make sure that `MAPL_DEBUG_LEVEL` and `MEMORY_DEBUG_LEVEL` are both 0 in `runConfig.sh` (i.e., make sure they are disabled, because they will slow your simulation down considerably).
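A sketch of the relevant `runConfig.sh` lines (the variable names come from the comment above; the surrounding script content is assumed):

```shell
# In runConfig.sh: keep MAPL debug output disabled for production runs.
# Either option set above 0 slows the simulation down considerably.
MAPL_DEBUG_LEVEL=0
MEMORY_DEBUG_LEVEL=0
```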

(Optional) While your simulation is running, you could reach out to the sysadmins and ask them what is bottlenecking the performance. Since the simulation is running, they can likely login and check if the nodes are CPU-bound or IO-bound (filesystem or network), and they might have some valuable insight.

Hope this is helpful! Let me know if you have any questions.
Liam
