
improve scalability of flux overlay errors #6593

Merged
3 commits merged into flux-framework:master from the overlay-errors branch on Jan 31, 2025

Conversation

@grondo (Contributor) commented Jan 31, 2025

This PR replaces the synchronous RPCs used in flux overlay errors with asynchronous ones. This improves responsiveness on large clusters, e.g. on elcap:
before:

$ time flux overlay errors >/dev/null

real	0m19.254s
user	0m0.297s
sys	0m0.335s

after:

$ time src/cmd/flux overlay errors >/dev/null

real	0m1.599s
user	0m0.158s
sys	0m0.214s
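
For reference, here is a minimal sketch of the asynchronous pattern, assuming flux-core's public RPC API (flux_rpc(3), flux_future_then(3)). For simplicity it queries every rank directly, assumes the overlay.health request takes no payload, and treats the response as an opaque JSON string; the real command walks the TBON tree and throttles in-flight RPCs (see the commit messages below).

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <flux/core.h>

    /* Called by the reactor as each overlay.health response arrives,
     * in whatever order ranks answer.
     */
    static void health_cb (flux_future_t *f, void *arg)
    {
        const char *json_str;

        if (flux_rpc_get (f, &json_str) < 0)
            fprintf (stderr, "overlay.health: %s\n", strerror (errno));
        else
            printf ("%s\n", json_str);
        flux_future_destroy (f);
    }

    int main (void)
    {
        flux_t *h;
        uint32_t size;

        if (!(h = flux_open (NULL, 0)) || flux_get_size (h, &size) < 0)
            return 1;
        /* Send all RPCs up front instead of waiting for each response
         * in turn, then let the reactor dispatch responses as they come.
         */
        for (uint32_t rank = 0; rank < size; rank++) {
            flux_future_t *f = flux_rpc (h, "overlay.health", NULL, rank, 0);
            if (!f || flux_future_then (f, -1., health_cb, NULL) < 0)
                return 1;
        }
        if (flux_reactor_run (flux_get_reactor (h), 0) < 0)
            return 1;
        flux_close (h);
        return 0;
    }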

@chu11 (Member) left a comment

review checkpoint - kids waking up :-) in middle of the big commit :-)

Comment on lines 440 to 448
    json_t *topo;
    struct idset *subtree_ranks;

    json_t *topology = get_topology (h);

    if (!(topo = get_subtree_topology (topology, rank))) {
        log_err ("get_subtree_topology");
        return NULL;
    }
    return topology_subtree_ranks (topo);
    subtree_ranks = topology_subtree_ranks (topo);
    return subtree_ranks;
}
chu11 (Member)

is this a cleanup that accidentally ended up in the "fix memory leaks" commit?

grondo (Contributor, Author)

Oh, leftover debugging, good catch.

/* On large systems with a flat tbon, the default timeout of 0.5s
* may not be long enough because the program may spend a long time
* simply processing the initial JSON response payload for rank 0, and
* meanwhile the then timeout is ticking. Therefore, if the current
chu11 (Member)

"the then timeout" -> "the timeout"?

grondo (Contributor, Author)

Actually I mean the then() timeout, but you're right that's confusing, and it's obvious what's meant without it.

grondo (Contributor, Author)

Actually that comment was a bit of a run-on thought-blob, so I've simplified it and force-pushed.

@chu11 (Member) commented Jan 31, 2025

So there's one part of the big commit that I'm getting confused on, and it makes me think there is a race. Or I'm missing some subtlety in the code, because I presume this is working on elcap.

    /* If no more ranks pending, then disable prep and check watchers.
     * Otherwise, if more RPCs can be sent, start idle watcher so reactor
     * does not block.
     */
    if (idset_count (ctx->pending_ranks) == 0) {
        flux_watcher_stop (ctx->prep);
        flux_watcher_stop (ctx->check);
    }

This part makes sense to me. BUT when pending_ranks becomes > 0 again, nothing seems to re-enable the prep/check watchers?

For example:

  • initialize pending_ranks to contain only rank 0
  • first time through prep/check, call gather_errors() and get overlay health on rank 0
    • pending_ranks is cleared of rank 0 and becomes empty
    • we're now waiting for the response from rank 0
  • reactor starts its loop over again
    • prep callback sees pending_ranks is empty, so prep & check are stopped
    • gather_errors_cb() runs, finds lots of children, gathers some of them until the active_rpcs limit is hit, and then adds the remaining children to pending_ranks
      • shouldn't prep & check be restarted around here?

@grondo (Contributor, Author) commented Jan 31, 2025

You may be correct. I reworked the prep/check/idle handling at the last minute and it may only work on elcap because that's a flat tbon. Let me fix that if necessary.

@chu11 (Member) commented Jan 31, 2025

> You may be correct. I reworked the prep/check/idle handling at the last minute and it may only work on elcap because that's a flat tbon. Let me fix that if necessary.

I think my example above could occur on a flat tbon as well. The main thing is that on the first "iteration", pending_ranks should end up empty. So unless the rank 0 response is received before the next prep, I think prep/check will be stopped.

@grondo (Contributor, Author) commented Jan 31, 2025

> I think my example above could occur on a flat tbon as well. The main thing is that on the first "iteration", pending_ranks should end up empty. So unless the rank 0 response is received before the next prep, I think prep/check will be stopped.

The first iteration only sends up to max_rpcs. If the cluster is larger than that, pending_ranks won't be empty. If it is smaller, then we don't need prep/check. However, this could definitely happen on a later iteration when ranks have children, so point taken; this needs to be fixed, I think.

Hm, looking at the branch here I also seem to have pushed the wrong version. I'll let you know when I've sorted that out.

Problem: The `flux overlay errors` command doesn't run clean under
valgrind, making it difficult to use valgrind to find newly introduced
errors.

Free memory where needed to resolve leaks.
@chu11 (Member) commented Jan 31, 2025

> The first iteration only sends up to max_rpcs.

By "first iteration", I'm referring to just the rank 0. The recursive step (the "next iteration") is after the response from rank 0.

@grondo force-pushed the overlay-errors branch 2 times, most recently from 1302a9a to ee048ea on January 31, 2025 19:25
@grondo (Contributor, Author) commented Jan 31, 2025

By "first iteration", I'm referring to just the rank 0. The recursive step (the "next iteration") is after the response from rank 0.

Oh, I think that works (in the previous version of the PR) because rank 0 is in the pending_ranks idset, so the idset is not empty and the prep/check watchers are not stopped before the next "iteration".

I did find that I had accidentally addressed your initial comments on an older branch and pushed that (it didn't even work at all on non-flat TBON overlays).

I've pushed the updated version, which does start the prep/check watchers in gather_errors_cb, as you noted is necessary for non-flat overlays.
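
A hypothetical sketch of that restart (gather_ctx, enqueue_child, and the struct fields are illustrative names, not the PR's actual identifiers): whenever a response hands back another rank to query, it is added to pending_ranks and the prep/check watchers are restarted in case they were stopped while the set was momentarily empty.

    #include <flux/core.h>
    #include <flux/idset.h>

    struct gather_ctx {
        struct idset *pending_ranks;   /* ranks not yet sent an RPC */
        flux_watcher_t *prep;
        flux_watcher_t *check;
    };

    /* Called from the overlay.health response handler for each child
     * rank that still needs its own health query.
     */
    static int enqueue_child (struct gather_ctx *ctx, unsigned int rank)
    {
        if (idset_set (ctx->pending_ranks, rank) < 0)
            return -1;
        /* prep/check may have been stopped while pending_ranks was empty,
         * so restart them so the new rank is picked up next iteration.
         */
        flux_watcher_start (ctx->prep);
        flux_watcher_start (ctx->check);
        return 0;
    }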

@chu11 (Member) commented Jan 31, 2025

> I think my example above could occur on a flat tbon as well.

Oh I'm so dumb, I see why it doesn't matter for a flat tbon. You only have to do one "iteration". Rank 0 has all the children :-)

@chu11 (Member) left a comment

LGTM, just the one nit below.

Comment on lines 960 to 961
struct timespec t0;
monotime (&t0);
chu11 (Member)

leftover debugging?

grondo (Contributor, Author)

Yep, sheesh! 🤦

Problem: The `flux overlay errors` program exhibits scaling issues
on large systems because it contacts all ranks serially.

Update `flux overlay errors` to operate asynchronously using the Flux
reactor. To avoid sending a very large number of messages without
entering the reactor, use check/prep/idle watchers to limit the number
of outstanding RPCs to 512 (experimentally determined to be a good
value for clusters of 10K nodes).

Fixes flux-framework#6517
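
Here is a sketch of that prep/check/idle throttle, using the same kind of hypothetical context struct as the earlier sketch (field and function names are illustrative; max_rpcs stands in for the 512 in-flight limit, and response parsing/child discovery are omitted). Stopping the idle watcher at the top of check_cb keeps the reactor from spinning once nothing more can be sent.

    #include <flux/core.h>
    #include <flux/idset.h>

    struct gather_ctx {
        flux_t *h;
        struct idset *pending_ranks;   /* ranks not yet sent an RPC */
        flux_watcher_t *prep;
        flux_watcher_t *check;
        flux_watcher_t *idle;
        int active_rpcs;               /* RPCs currently in flight */
        int max_rpcs;                  /* in-flight limit, e.g. 512 */
    };

    static void response_cb (flux_future_t *f, void *arg)
    {
        struct gather_ctx *ctx = arg;

        /* Real code decodes the health payload here and may add child
         * ranks to ctx->pending_ranks (restarting prep/check as needed).
         */
        ctx->active_rpcs--;
        flux_future_destroy (f);
    }

    static void prep_cb (flux_reactor_t *r, flux_watcher_t *w,
                         int revents, void *arg)
    {
        struct gather_ctx *ctx = arg;

        if (idset_count (ctx->pending_ranks) == 0) {
            /* Nothing left to send: stop prep/check so the reactor can
             * block waiting for outstanding responses.
             */
            flux_watcher_stop (ctx->prep);
            flux_watcher_stop (ctx->check);
        }
        else if (ctx->active_rpcs < ctx->max_rpcs) {
            /* More RPCs can be sent: keep the reactor from blocking so
             * the check callback gets a chance to run.
             */
            flux_watcher_start (ctx->idle);
        }
    }

    static void check_cb (flux_reactor_t *r, flux_watcher_t *w,
                          int revents, void *arg)
    {
        struct gather_ctx *ctx = arg;
        unsigned int rank;

        flux_watcher_stop (ctx->idle);
        /* Send overlay.health RPCs until the in-flight limit is reached
         * or no pending ranks remain.
         */
        while (ctx->active_rpcs < ctx->max_rpcs
            && (rank = idset_first (ctx->pending_ranks)) != IDSET_INVALID_ID) {
            flux_future_t *f;
            if (!(f = flux_rpc (ctx->h, "overlay.health", NULL, rank, 0))
                || flux_future_then (f, -1., response_cb, ctx) < 0) {
                flux_future_destroy (f);
                break;
            }
            idset_clear (ctx->pending_ranks, rank);
            ctx->active_rpcs++;
        }
    }
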
Problem: On systems with ~10K nodes, `flux overlay errors` sometimes
reports "Connection timed out" for some ranks for which RPCs are
issued on the first iteration. The problem seems to be that the
timeout starts immediately when `flux_future_then(3)` is called, but
for large systems with a flat TBON the program may not re-enter the
reactor for >0.5s due to the size of the initial payload.

While one solution would be to delay sending _any_ RPCs until the
first time the check watcher is called, this unnecessarily extends
the runtime of the program by at least the initial payload processing
time. Instead, scale the timeout for large systems (>2K nodes) by
the size of the system, such that 10K-node systems get a roughly
2.5s timeout, which seems to be a safe value.

Note that a long timeout is not as much of a problem as in previous
versions of the program where overlay.health RPCs were sent serially,
since the longer timeout can now happen in parallel.
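
The exact scaling formula isn't quoted above, but one reading consistent with the numbers given (0.5s default, >2K nodes, ~2.5s at 10K nodes) is a simple linear scale, sketched here as a hypothetical helper:

    #include <stdint.h>

    /* Hypothetical helper: keep the 0.5s default timeout up to 2K nodes,
     * then grow it linearly with system size (10000 nodes -> 2.5s).
     */
    static double overlay_rpc_timeout (uint32_t size)
    {
        const double default_timeout = 0.5;

        if (size <= 2000)
            return default_timeout;
        return default_timeout * ((double)size / 2000.);
    }
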
@grondo (Contributor, Author) commented Jan 31, 2025

Thanks for your attention to detail on this one @chu11. I'll set MWP here now.

@mergify (bot) merged commit ea5af0e into flux-framework:master on Jan 31, 2025
35 checks passed
codecov bot commented Jan 31, 2025

Codecov Report

Attention: Patch coverage is 85.26316% with 14 lines in your changes missing coverage. Please review.

Project coverage is 79.45%. Comparing base (ace8d8a) to head (2e38408).
Report is 4 commits behind head on master.

Files with missing lines     Patch %    Lines
src/cmd/builtin/overlay.c    85.26%     14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6593      +/-   ##
==========================================
- Coverage   79.46%   79.45%   -0.02%     
==========================================
  Files         531      531              
  Lines       88202    88275      +73     
==========================================
+ Hits        70091    70136      +45     
- Misses      18111    18139      +28     
Files with missing lines     Coverage Δ
src/cmd/builtin/overlay.c    77.74% <85.26%> (+0.66%) ⬆️

... and 6 files with indirect coverage changes

@grondo deleted the overlay-errors branch on January 31, 2025 21:58