
improve scalability of flux overlay errors #6593

Merged
3 commits merged into flux-framework:master from the overlay-errors branch on Jan 31, 2025

Conversation

@grondo (Contributor) commented Jan 31, 2025

This PR replaces the synchronous RPCs used in flux overlay errors with asynchronous ones. This improves responsiveness on large clusters, e.g. on elcap:
before:

$ time flux overlay errors >/dev/null

real	0m19.254s
user	0m0.297s
sys	0m0.335s

after:

$ time src/cmd/flux overlay errors >/dev/null

real	0m1.599s
user	0m0.158s
sys	0m0.214s
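
For reference, here is a minimal sketch of the asynchronous pattern, assuming flux-core's public RPC API (flux_rpc(3), flux_future_then(3)). For simplicity it queries every rank directly, assumes the overlay.health request takes no payload, and treats the response as an opaque JSON string; the real command walks the TBON tree and throttles in-flight RPCs (see the commit messages below).

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <flux/core.h>

    /* Called by the reactor as each overlay.health response arrives,
     * in whatever order ranks answer.
     */
    static void health_cb (flux_future_t *f, void *arg)
    {
        const char *json_str;

        if (flux_rpc_get (f, &json_str) < 0)
            fprintf (stderr, "overlay.health: %s\n", strerror (errno));
        else
            printf ("%s\n", json_str);
        flux_future_destroy (f);
    }

    int main (void)
    {
        flux_t *h;
        uint32_t size;

        if (!(h = flux_open (NULL, 0)) || flux_get_size (h, &size) < 0)
            return 1;
        /* Send all RPCs up front instead of waiting for each response
         * in turn, then let the reactor dispatch responses as they come.
         */
        for (uint32_t rank = 0; rank < size; rank++) {
            flux_future_t *f = flux_rpc (h, "overlay.health", NULL, rank, 0);
            if (!f || flux_future_then (f, -1., health_cb, NULL) < 0)
                return 1;
        }
        if (flux_reactor_run (flux_get_reactor (h), 0) < 0)
            return 1;
        flux_close (h);
        return 0;
    }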

@chu11 (Member) left a comment

review checkpoint - kids waking up :-) in middle of the big commit :-)

Comment on lines 440 to 448
    json_t *topo;
    struct idset *subtree_ranks;

    json_t *topology = get_topology (h);

    if (!(topo = get_subtree_topology (topology, rank))) {
        log_err ("get_subtree_topology");
        return NULL;
    }
    return topology_subtree_ranks (topo);
    subtree_ranks = topology_subtree_ranks (topo);
    return subtree_ranks;
}
chu11 (Member)

is this a cleanup that accidentally ended up in the "fix memory leaks" commit?

grondo (Contributor, Author)

Oh, leftover debugging, good catch.

/* On large systems with a flat tbon, the default timeout of 0.5s
* may not be long enough because the program may spend a long time
* simply processing the initial JSON response payload for rank 0, and
* meanwhile the then timeout is ticking. Therefore, if the current
chu11 (Member)

"the then timeout" -> "the timeout"?

grondo (Contributor, Author)

Actually I mean the then() timeout, but you're right that's confusing, and it's obvious what's meant without it.

grondo (Contributor, Author)

Actually that comment was a bit of a run-on thought-blob, so I've simplified it and force-pushed.

@chu11 (Member) commented Jan 31, 2025

So there's one part of the big commit that I'm getting confused on, and it makes me think there is a race. Or I'm missing some subtlety in the code, because I presume this is working on elcap.

    /* If no more ranks pending, then disable prep and check watchers.
     * Otherwise, if more RPCs can be sent, start idle watcher so reactor
     * does not block.
     */
    if (idset_count (ctx->pending_ranks) == 0) {
        flux_watcher_stop (ctx->prep);
        flux_watcher_stop (ctx->check);
    }

This part makes sense to me. BUT when pending_ranks becomes > 0 again, nothing seems to re-enable the prep/check watchers?

For example:

  • initialize pending_ranks to contain only rank 0
  • first time through prep/check, call gather_errors() and get overlay health on rank 0
    • pending_ranks is cleared of rank 0 and becomes empty
    • we're now waiting for the response from rank 0
  • reactor starts its loop over again
    • prep callback sees pending_ranks is empty, so prep & check are stopped
    • gather_errors_cb() runs, finds lots of children, gathers some of them until the active_rpcs limit is hit, and then adds the remaining children to pending_ranks
      • shouldn't prep & check be restarted around here?

@grondo (Contributor, Author) commented Jan 31, 2025

You may be correct. I reworked the prep/check/idle handling at the last minute and it may only work on elcap because that's a flat tbon. Let me fix that if necessary.

@chu11 (Member) commented Jan 31, 2025

> You may be correct. I reworked the prep/check/idle handling at the last minute and it may only work on elcap because that's a flat tbon. Let me fix that if necessary.

I think my example above could occur on a flat tbon as well. The main thing is that on the first "iteration", pending_ranks should end up empty. So unless the rank 0 response is received before the next prep, I think prep/check will be stopped.

@grondo (Contributor, Author) commented Jan 31, 2025

> I think my example above could occur on a flat tbon as well. The main thing is that on the first "iteration", pending_ranks should end up empty. So unless the rank 0 response is received before the next prep, I think prep/check will be stopped.

The first iteration only sends up to max_rpcs. If the cluster is larger than that, pending_ranks won't be empty. If it is smaller, then we don't need prep/check. However, this could definitely happen on a later iteration when ranks have children, so point taken; this needs to be fixed, I think.

Hm, looking at the branch here I also seem to have pushed the wrong version. I'll let you know when I've sorted that out.

Problem: The `flux overlay errors` command doesn't run clean under
valgrind, making it difficult to use valgrind to find newly introduced
errors.

Free memory where needed to resolve leaks.
@chu11 (Member) commented Jan 31, 2025

> The first iteration only sends up to max_rpcs.

By "first iteration", I'm referring to just the rank 0. The recursive step (the "next iteration") is after the response from rank 0.

@grondo force-pushed the overlay-errors branch 2 times, most recently from 1302a9a to ee048ea on January 31, 2025 19:25
@grondo (Contributor, Author) commented Jan 31, 2025

By "first iteration", I'm referring to just the rank 0. The recursive step (the "next iteration") is after the response from rank 0.

Oh, I think that works (in the previous version of the PR) because rank 0 is in the pending_ranks idset, so the idset is not empty and the prep/check watchers are not stopped before the next "iteration".

I did find that I had accidentally addressed your initial comments on an older branch and pushed that (it didn't even work at all on non-flat TBON overlays).

I've pushed the updated version, which does start the prep/check watchers in gather_errors_cb, as you noted is necessary for non-flat overlays.
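
A hypothetical sketch of that restart (gather_ctx, enqueue_child, and the struct fields are illustrative names, not the PR's actual identifiers): whenever a response hands back another rank to query, it is added to pending_ranks and the prep/check watchers are restarted in case they were stopped while the set was momentarily empty.

    #include <flux/core.h>
    #include <flux/idset.h>

    struct gather_ctx {
        struct idset *pending_ranks;   /* ranks not yet sent an RPC */
        flux_watcher_t *prep;
        flux_watcher_t *check;
    };

    /* Called from the overlay.health response handler for each child
     * rank that still needs its own health query.
     */
    static int enqueue_child (struct gather_ctx *ctx, unsigned int rank)
    {
        if (idset_set (ctx->pending_ranks, rank) < 0)
            return -1;
        /* prep/check may have been stopped while pending_ranks was empty,
         * so restart them so the new rank is picked up next iteration.
         */
        flux_watcher_start (ctx->prep);
        flux_watcher_start (ctx->check);
        return 0;
    }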

@chu11 (Member) commented Jan 31, 2025

> I think my example above could occur on a flat tbon as well.

Oh I'm so dumb, I see why it doesn't matter for a flat tbon. You only have to do one "iteration". Rank 0 has all the children :-)

@chu11 (Member) left a comment

LGTM, just the one nit below.

Comment on lines 960 to 961
struct timespec t0;
monotime (&t0);
chu11 (Member)

leftover debugging?

grondo (Contributor, Author)

Yep, sheesh! 🤦

Problem: The `flux overlay errors` program exhibits scaling issues
on large systems because it contacts all ranks serially.

Update `flux overlay errors` to operate asynchronously using the Flux
reactor. To avoid sending a very large number of messages without
entering the reactor, use check/prep/idle watchers to limit the number
of outstanding RPCs to 512 (experimentally determined to be a good
value for clusters of 10K nodes).

Fixes flux-framework#6517
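
Here is a sketch of that prep/check/idle throttle, using the same kind of hypothetical context struct as the earlier sketch (field and function names are illustrative; max_rpcs stands in for the 512 in-flight limit, and response parsing/child discovery are omitted). Stopping the idle watcher at the top of check_cb keeps the reactor from spinning once nothing more can be sent.

    #include <flux/core.h>
    #include <flux/idset.h>

    struct gather_ctx {
        flux_t *h;
        struct idset *pending_ranks;   /* ranks not yet sent an RPC */
        flux_watcher_t *prep;
        flux_watcher_t *check;
        flux_watcher_t *idle;
        int active_rpcs;               /* RPCs currently in flight */
        int max_rpcs;                  /* in-flight limit, e.g. 512 */
    };

    static void response_cb (flux_future_t *f, void *arg)
    {
        struct gather_ctx *ctx = arg;

        /* Real code decodes the health payload here and may add child
         * ranks to ctx->pending_ranks (restarting prep/check as needed).
         */
        ctx->active_rpcs--;
        flux_future_destroy (f);
    }

    static void prep_cb (flux_reactor_t *r, flux_watcher_t *w,
                         int revents, void *arg)
    {
        struct gather_ctx *ctx = arg;

        if (idset_count (ctx->pending_ranks) == 0) {
            /* Nothing left to send: stop prep/check so the reactor can
             * block waiting for outstanding responses.
             */
            flux_watcher_stop (ctx->prep);
            flux_watcher_stop (ctx->check);
        }
        else if (ctx->active_rpcs < ctx->max_rpcs) {
            /* More RPCs can be sent: keep the reactor from blocking so
             * the check callback gets a chance to run.
             */
            flux_watcher_start (ctx->idle);
        }
    }

    static void check_cb (flux_reactor_t *r, flux_watcher_t *w,
                          int revents, void *arg)
    {
        struct gather_ctx *ctx = arg;
        unsigned int rank;

        flux_watcher_stop (ctx->idle);
        /* Send overlay.health RPCs until the in-flight limit is reached
         * or no pending ranks remain.
         */
        while (ctx->active_rpcs < ctx->max_rpcs
            && (rank = idset_first (ctx->pending_ranks)) != IDSET_INVALID_ID) {
            flux_future_t *f;
            if (!(f = flux_rpc (ctx->h, "overlay.health", NULL, rank, 0))
                || flux_future_then (f, -1., response_cb, ctx) < 0) {
                flux_future_destroy (f);
                break;
            }
            idset_clear (ctx->pending_ranks, rank);
            ctx->active_rpcs++;
        }
    }
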
Problem: On systems with ~10K nodes, `flux overlay errors` sometimes
reports "Connection timed out" for some ranks for which RPCs are
issued on the first iteration. The problem seems to be that the
timeout starts immediately when `flux_future_then(3)` is called, but
for large systems with a flat TBON the program may not re-enter the
reactor for >0.5s due to the size of the initial payload.

While one solution would be to delay sending _any_ RPCs until the
first time the check watcher is called, this unnecessarily extends
the runtime of the program by at least the initial payload processing
time. Instead, scale the timeout for large systems (>2K nodes) by
the size of the system, such that 10K-node systems get a roughly
2.5s timeout, which seems to be a safe value.

Note that a long timeout is not as much of a problem as in previous
versions of the program where overlay.health RPCs were sent serially,
since the longer timeout can now happen in parallel.
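
The exact scaling formula isn't quoted above, but one reading consistent with the numbers given (0.5s default, >2K nodes, ~2.5s at 10K nodes) is a simple linear scale, sketched here as a hypothetical helper:

    #include <stdint.h>

    /* Hypothetical helper: keep the 0.5s default timeout up to 2K nodes,
     * then grow it linearly with system size (10000 nodes -> 2.5s).
     */
    static double overlay_rpc_timeout (uint32_t size)
    {
        const double default_timeout = 0.5;

        if (size <= 2000)
            return default_timeout;
        return default_timeout * ((double)size / 2000.);
    }
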
@grondo (Contributor, Author) commented Jan 31, 2025

Thanks for your attention to detail on this one @chu11. I'll set MWP here now.

@mergify (bot) merged commit ea5af0e into flux-framework:master on Jan 31, 2025
35 checks passed
codecov bot commented Jan 31, 2025

Codecov Report

Attention: Patch coverage is 85.26316% with 14 lines in your changes missing coverage. Please review.

Project coverage is 79.45%. Comparing base (ace8d8a) to head (2e38408).
Report is 4 commits behind head on master.

Files with missing lines     Patch %    Lines
src/cmd/builtin/overlay.c    85.26%     14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6593      +/-   ##
==========================================
- Coverage   79.46%   79.45%   -0.02%     
==========================================
  Files         531      531              
  Lines       88202    88275      +73     
==========================================
+ Hits        70091    70136      +45     
- Misses      18111    18139      +28     
Files with missing lines     Coverage Δ
src/cmd/builtin/overlay.c    77.74% <85.26%> (+0.66%) ⬆️

... and 6 files with indirect coverage changes

@grondo deleted the overlay-errors branch on January 31, 2025 21:58