[QUESTION] How do I optimize the number of cores and nodes for a C48 simulation? #69
Here's a figure of GCHP scalability that @lizziel and I worked on. Note that this is for GCHP v13 (I suspect it translates reasonably well to 12.9). The dashed line is GCHP on Odyssey (a cluster at Harvard; Intel compilers & MPI), and the solid line is on Compute1 (a cluster at WashU; GNU compilers & OpenMPI). Note that since generating this figure, the Compute1 sysadmins discovered a bug that was degrading performance at high core counts, and I haven't had time to remake it.
Here are some notes and ideas you could consider:
…
Another thing I would try is setting … (a sketch of where the core-count settings live follows below).
I hope this is helpful. Please let me know if you have any questions.
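To make the core-count suggestion concrete, here is a minimal sketch of where these settings live in a GCHP run directory. This assumes the runConfig.sh layout of GCHP ~12.x; variable names and defaults vary between versions, so treat the names and numbers here as illustrative, not prescriptive:

```bash
# Hypothetical excerpt from runConfig.sh (names vary by GCHP version).
# Total cores = NX * NY, and NY must be a multiple of 6 so the domain
# decomposes evenly across the six cubed-sphere faces.
NUM_NODES=4
NUM_CORES_PER_NODE=24   # 4 x 24 = 96 cores total
NX=4                    # core layout in X
NY=24                   # core layout in Y (must be a multiple of 6)

# The scheduler request should match, e.g. for SLURM:
#   #SBATCH --nodes=4
#   #SBATCH --ntasks-per-node=24
```

When scanning core counts as in the figure above, it is usually cleanest to hold cores-per-node fixed and vary only the node count, so that changes in throughput reflect scaling rather than node packing.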
@sdeastham and @WilliamDowns, is there anything else that you would suggest?
I think @LiamBindle gave a great summary! The only thing I would add is to be very careful about changing the transport timestep. I am not certain how robust the FV3 system is to CFL exceedances in the current version; this is a standing question (FV3 in theory has a substepping option, but I do not know whether it is currently used, since part of the point of the cubed-sphere layout is to avoid CFL exceedances). Changing the chemistry timestep may help, but I would try other options first, as increasing the chemistry timestep is known to degrade the accuracy of the solution.
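For reference, here is where those timesteps are set in the run directory. A minimal sketch, assuming the variable names used in runConfig.sh around GCHP 12.x (check your own version before editing anything):

```bash
# Hypothetical excerpt from runConfig.sh. 600 s transport / 1200 s
# chemistry are the commonly used coarse-resolution values. Per the
# caution above, prefer leaving the transport timestep alone, and treat
# a longer chemistry timestep as a last resort since it degrades the
# accuracy of the solution.
TransConv_Timestep_sec=600    # transport/convection timestep [s]
ChemEmiss_Timestep_sec=1200   # chemistry/emissions timestep [s]
```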
I'm going to close this. Feel free to reopen if you have any questions.
Hello Liam,
Also, I would like to ask you whether you have done similar estimations for the resolutions C90 and C180. If yes, could you please provide the respective plot of model throughput versus the number of cores? Thank you,
Hi Maria,
Sorry, I didn't realize; I've reopened this. I suspect 1 and 2 will be the most effective. For 3, what I meant is that GCHP's performance is increasingly sensitive to your cluster's network at high core counts, so anything you can do to make sure it's performing "properly" could be useful. This probably isn't necessary if you're using a large or mature cluster, but it might be worthwhile on a smaller/newer cluster where large MPI jobs aren't well tested. This is a bit of a catch-all answer, but the sysadmins' expertise, combined with their elevated viewing privileges, means they might notice something we couldn't (e.g., part of the network that isn't performing well, or an environment setting they would recommend for your cluster). This one is definitely optional, but it might be worth pursuing if you notice anything weird or unreliable.
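As one concrete example of a network check (my suggestion, not something prescribed in this thread): a point-to-point MPI bandwidth test between two nodes, compared against the interconnect's rated speed, can quickly expose a misconfigured fabric. A sketch assuming a SLURM cluster with the OSU Micro-Benchmarks installed:

```bash
# Run the OSU point-to-point bandwidth benchmark with one MPI rank on
# each of two different nodes; the path to osu_bw depends on where the
# benchmarks are installed on your system.
srun --nodes=2 --ntasks=2 --ntasks-per-node=1 ./osu_bw
```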
We only ran C96 and C192 for these timing tests (so grid boxes per CPU were constant), but you can compare these to your C90 and C180 runs, since they're close in terms of compute complexity. You can scale the C96 line by 0.88 (90/96 squared) to make it more directly comparable to your C90 run; the scaling factor is the same for C192 → C180. Let me know if you have any questions,
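For clarity, here is the arithmetic behind that factor: a cubed-sphere grid C$N$ has $6N^2$ columns, so per-timestep work at fixed vertical resolution scales roughly as $N^2$, giving

$$\left(\frac{90}{96}\right)^2 \approx 0.88 \qquad\text{and}\qquad \left(\frac{180}{192}\right)^2 \approx 0.88,$$

which is why the same 0.88 factor applies to both the C96→C90 and C192→C180 comparisons.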
Hello,
So, for ~100 cores, I get 53 days/day, while you get almost double (100 days/day). Do you think this might be related to the supercomputer I use? Or to the way I set up the model?
Hi Maria,
It looks like you're getting OK scaling up to ~144 cores, so that seems like a good number to start running your simulations with. The throughput does look low, but it's hard to say whether it's outside what we should expect. To me it seems a bit slow, but not absurdly so; frequent HISTORY output (our timing tests wrote output once per week), enabling MAPL debug mode, differences in cluster hardware/configuration, etc. significantly affect model throughput, so direct comparisons of these numbers are challenging and potentially misleading.
I would recommend starting your simulations with 144 cores. Make sure that …
(Optional) While your simulation is running, you could reach out to the sysadmins and ask them what is bottlenecking the performance. Since the simulation is running, they can likely log in and check whether the nodes are CPU-bound or IO-bound (filesystem or network), and they might have some valuable insight (a couple of generic example commands are sketched below).
Hope this is helpful! Let me know if you have any questions.
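Two quick, generic checks along those lines (standard Linux tools; the file name is the usual GCHP run-directory one, adjust as needed):

```bash
# 1. How often is HISTORY writing output? Output much more frequent than
#    the weekly cadence used in the timing tests above can cut
#    throughput noticeably.
grep -E "\.frequency" HISTORY.rc

# 2. On a compute node while the job is running: is the model CPU-bound
#    or IO-bound? Sustained high %iowait points at the filesystem or
#    the network.
top -b -n 1 | head -n 20
iostat -x 5 3
```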
Hi Maria, I opened this as a separate issue because I think it might be of interest to other users, and this will make it easier to find.
I'm opening this question on behalf of @mtsivlidou, who writes: