mpi: add test that physical cores are not oversubscribed #17
Conversation
Force-pushed from a0327e0 to 259dd37
LGTM! Just a couple minor comments. Be sure to remove WIP before merging :-)
mpi/vcpu.c (outdated)

#include <sched.h>
#define MASTER 0

int sched_getcpu (void);
This prototype declaration should not be necessary if sched.h is included as documented in the sched_getcpu(3) man page:
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sched_getcpu():
Since glibc 2.14:
_GNU_SOURCE
Do you mind if I leave this? Both Cray's compilers and gcc throw a warning if I pull it out:
(s=33,d=0) tioga11 ~/flux-test-collective/mpi (vcpu-test)$ mpicc -o vcpu vcpu.c
vcpu.c: In function 'main':
vcpu.c:65:11: warning: implicit declaration of function 'sched_getcpu'; did you mean 'sched_getparam'? [-Wimplicit-function-declaration]
65 | cpu = sched_getcpu ();
| ^~~~~~~~~~~~
| sched_getparam
You get this warning because you are not defining _GNU_SOURCE before including sched.h, as documented in the "Feature Test Macro Requirements" I referenced above.
It is slightly bad practice to declare a prototype for an external function when a valid header exists for it. If the hand-written prototype is incorrect, you may get a runtime crash that would have been easily avoided by using the correct prototype from the header.
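For illustration, a minimal sketch of the fix the man page excerpt describes (not code from this PR; the surrounding MPI logic is omitted): defining _GNU_SOURCE before the first include lets sched.h declare sched_getcpu(), so no hand-written prototype is needed.

#define _GNU_SOURCE            /* must appear before any #include to expose sched_getcpu() */
#include <sched.h>
#include <stdio.h>

int main (void)
{
    int cpu = sched_getcpu (); /* prototype now comes from sched.h */

    printf ("running on cpu %d\n", cpu);
    return 0;
}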
Thanks. I just recalled that @garlick had asked for a small blurb at the top of the test describing its pass/fail conditions. Added that too.
I'm fuzzy on what the actual failure mode is here and what it might have to do with pmix? Can you elaborate or is there an issue to point to?
A failure mode might have output that looks like this:
On corona, where there are 48 cores per node, it shouldn't be packing 4 MPI ranks on one core by default. (Note that I haven't actually ever seen this happen, I made that output up.) If I try to force this behavior, it ends up looking like this:
Re: what it might have to do with pmix...apparently, if pmix is configured improperly, it can have behavior similar to what I showed above -- packing a ton of MPI ranks on one core instead of spreading them out like it should. Again, I haven't ever seen it do this, but this was something @nhanford suggested we test for in case that behavior ever pops up. One thing worth mentioning is that we considered using this code instead of the mpi hello-world test from flux-core (originally, that's how it was written), but @grondo suggested keeping it as a separate test to give us a broader testsuite. Does that help? Sorry if it doesn't, I'm not an expert on this myself...
Hello @garlick, this is a quick way of categorically ruling out Open MPI failing over to singleton mode when PMIx is misconfigured. Also, TCE-32 on the internal LC Jira shows an example along similar lines where affinity was being set incorrectly. I don't know if we ever ruled out PMIx on that one, but it is still worthwhile to detect discrepancies between what Flux set for affinity and where processes actually ended up, IMHO.
Mmmm I think because of the
Maybe a simpler test to detect multiple singletons would be to just have rank 1 print its rank: if it's 1, pass, and if it's 0, fail? Edit: Duh, that wouldn't work - but maybe just have it fail if the size is 1 and always run with more than one task? Why the communicator split?
p.s. if @nhanford is on board here, this can be merged. I was just trying to understand.
I think we need to keep this test, since it's still looking at a behavior that could be problematic (core oversubscription when that shouldn't happen). However, I think we need an additional test to detect if multiple singletons are being created, because this test won't pick that up as it's currently designed.

One smoke test would be to have the rank 0 MPI process create a new file in /tmp/fluxci and call MPI_Abort if that file has already been created. Then, if there were multiple rank 0 instances the test would fail.

Thanks @garlick and @nhanford for the comments, you made me realize I had misunderstood what behavior we were looking for!
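For what it's worth, a sketch of that file-based smoke check (the /tmp/fluxci directory comes from the comment above; the sentinel file name and everything else are made up for illustration):

#include <mpi.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main (int argc, char *argv[])
{
    int rank;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Assumes /tmp/fluxci already exists.  O_CREAT|O_EXCL fails with
         * EEXIST if the sentinel is already there, i.e. another "rank 0"
         * (a singleton) beat us to it.
         */
        int fd = open ("/tmp/fluxci/rank0-sentinel",
                       O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0 && errno == EEXIST) {
            fprintf (stderr, "FAIL: sentinel already exists, multiple rank 0s?\n");
            MPI_Abort (MPI_COMM_WORLD, 1);
        }
        if (fd >= 0)
            close (fd);
    }

    MPI_Finalize ();
    return 0;
}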
A file check would work, but it would be easier and equivalent to fail if the MPI size is 1. (Each singleton will think it is rank 0 of a parallel program of size 1.)
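A minimal sketch of that size check (illustrative only, not code from the PR):

#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
    int size;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    /* A job launched with more than one task should never see size == 1;
     * if it does, each task has probably come up as a singleton.
     */
    if (size == 1) {
        fprintf (stderr, "FAIL: MPI world size is 1 (singleton?)\n");
        MPI_Abort (MPI_COMM_WORLD, 1);
    }

    MPI_Finalize ();
    return 0;
}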
I'm going to explain what I thought we were trying to test for. I thought we were looking for output that looked like this:

where it had unnecessarily packed multiple MPI ranks onto one core instead of spreading them out. In order to detect this, I had each rank fetch its rank number and physical cpu number, then gather them into an array with MPI_Allgather. The rank 0 MPI task then iterated over that array to check that each one was unique. This worked well for detecting errors on one node, but failed when I tried to run it across two, because the same cpu numbers show up on every node. In order to solve that, I split the communicators by node, and had each node iterate over its own array to check that no two ranks on the node reported the same cpu.

When I said above that this was because of the communicator split, I was actually wrong; this test won't pick up singletons, but that has nothing to do with the communicator split. This test won't pick up singletons because it relies on the assumption that all the ranks initialized when the test starts will report their core number to each other, which isn't the case if they're singletons (silly Hobbs). (And even if they could all report their core number, they might still be unique.) So, in the case of n singletons being created, you'll have n arrays of size 1, each containing one cpu number, and the uniqueness check will trivially pass.
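For concreteness, a rough sketch of that approach (an illustration only, not the actual mpi/vcpu.c from this PR; the structure and names are made up):

#define _GNU_SOURCE            /* for sched_getcpu() */
#include <sched.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    int rank, cpu, nodesize, noderank, dup = 0, anydup = 0;
    int *cpus;
    MPI_Comm nodecomm;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    /* Split the world communicator per node, since cpu numbers are only
     * unique within a node.
     */
    MPI_Comm_split_type (MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                         MPI_INFO_NULL, &nodecomm);
    MPI_Comm_size (nodecomm, &nodesize);
    MPI_Comm_rank (nodecomm, &noderank);

    /* Every rank reports the physical cpu it is running on. */
    cpu = sched_getcpu ();
    cpus = malloc (nodesize * sizeof (int));
    MPI_Allgather (&cpu, 1, MPI_INT, cpus, 1, MPI_INT, nodecomm);

    /* The first rank on each node checks that no two ranks on that node
     * landed on the same cpu.
     */
    if (noderank == 0) {
        for (int i = 0; i < nodesize; i++)
            for (int j = i + 1; j < nodesize; j++)
                if (cpus[i] == cpus[j])
                    dup = 1;
    }

    /* Fail the whole job if any node saw a duplicate. */
    MPI_Allreduce (&dup, &anydup, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    if (rank == 0)
        printf (anydup ? "FAIL: cores are oversubscribed\n" : "PASS\n");

    free (cpus);
    MPI_Comm_free (&nodecomm);
    MPI_Finalize ();
    return anydup ? 1 : 0;
}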
Nice idea. I'll make this in a new PR. If we're good to keep this test, I'm removing WIP. |
Problem: A common failure in some MPI implementations is that all the MPI ranks end up packed on the same core. This causes lock contention, achieves no parallelization, and is bad, so we want to catch if that's happening. Add a test to our CI that collects the physical cpu number for each MPI rank, then checks that no two are the same. Note that this will fail if cores are deliberately oversubscribed (but they shouldn't be in CI). Co-authored-by: Nathan Hanford <[email protected]>
Add a clang-format file to standardize all C formatting for future tests.
LGTM!
(and fun too 🎉)
Problem: A common failure in some MPI implementations is that all the MPI ranks end up packed on the same core. This usually shows up if pmix is configured improperly. It causes lock contention, achieves no parallelization, and is bad, so we want to catch if that's happening.
Add a test to our CI that collects the physical cpu number for each MPI rank, then checks that no two are the same. Note that this will fail if cores are deliberately oversubscribed (but they shouldn't be in CI).
This is work-in-progress. A few things left to do:
Possibly use MPI_Allreduce (even though that's already tested in hello.c); currently MPI_Allgather is used (see the sketch below).
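For reference, one way the duplicate check could be expressed with MPI_Allreduce instead of MPI_Allgather (purely a sketch of the idea, not code from this PR): each rank contributes a vector with a 1 in its cpu's slot, the reduction sums the vectors, and any entry greater than 1 means a core is oversubscribed. The MAX_CPUS bound and all names are assumptions.

#define _GNU_SOURCE            /* for sched_getcpu() */
#include <sched.h>
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MAX_CPUS 1024          /* assumed upper bound on cpu ids */

int main (int argc, char *argv[])
{
    int rank, cpu, oversubscribed = 0;
    int counts[MAX_CPUS], totals[MAX_CPUS];

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    /* Each rank marks the slot of the cpu it is running on. */
    memset (counts, 0, sizeof (counts));
    cpu = sched_getcpu ();
    counts[cpu % MAX_CPUS] = 1;

    /* Sum the per-cpu counts across all ranks.  (On a single node this is
     * sufficient; across nodes the reduction would have to run on a
     * per-node communicator, as in the Allgather version.)
     */
    MPI_Allreduce (counts, totals, MAX_CPUS, MPI_INT, MPI_SUM,
                   MPI_COMM_WORLD);

    for (int i = 0; i < MAX_CPUS; i++)
        if (totals[i] > 1)
            oversubscribed = 1;

    if (rank == 0)
        printf (oversubscribed ? "FAIL: core oversubscribed\n" : "PASS\n");

    MPI_Finalize ();
    return oversubscribed ? 1 : 0;
}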