improve scalability of flux overlay errors
#6593
Conversation
Force-pushed 69f0cbd to d0bf6e6
review checkpoint - kids waking up :-) in middle of the big commit :-)
src/cmd/builtin/overlay.c
Outdated
```diff
 json_t *topo;
 struct idset *subtree_ranks;

 json_t *topology = get_topology (h);

 if (!(topo = get_subtree_topology (topology, rank))) {
     log_err ("get_subtree_topology");
     return NULL;
 }
-return topology_subtree_ranks (topo);
+subtree_ranks = topology_subtree_ranks (topo);
+return subtree_ranks;
 }
```
is this a cleanup that is accidentally in the "fix memory leaks" commit?
Oh, leftover debugging, good catch.
src/cmd/builtin/overlay.c
Outdated
```c
/* On large systems with a flat tbon, the default timeout of 0.5s
 * may not be long enough because the program may spend a long time
 * simply processing the initial JSON response payload for rank 0, and
 * meanwhile the then timeout is ticking. Therefore, if the current
```
"the then timeout" -> "the timeout"?
Actually I mean the then() timeout, but you're right, that's confusing, and it's obvious what's meant without it.
Actually that comment was a bit of a run-on thought-blob, so I've simplified it and force-pushed.
Force-pushed d0bf6e6 to ddccf90
So there's one part of the big commit that I'm getting confused on, and it makes me think there is a race. Or I'm missing some subtlety in the code, b/c I presume this is working on elcap.
This part makes sense to me. BUT when … For example …
You may be correct. I reworked the prep/check/idle handling at the last minute and it may only work on elcap because that's a flat tbon. Let me fix that if necessary.
I think my example above could occur on a flat tbon as well. The main thing is that on the first "iteration", pending ranks should be cleared. So unless the rank 0 response is received before the next prep, I think the prep/check will be stopped.
The first iteration only sends up to max_rpcs. If the cluster is larger than that, pending ranks won't be clear. If it is smaller, then we don't need prep/check. However, this could definitely happen on a future iteration when ranks have children, so point taken, and this needs to be fixed I think. Hm, looking at the branch here I also seem to have pushed the wrong version. I'll let you know when I've sorted that out.
Problem: The `flux overlay errors` command doesn't run clean under valgrind, making it difficult to use valgrind to find newly introduced errors. Free memory where needed to resolve leaks.
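As a generic sketch of the kind of change this commit makes (not the actual fix; ranks_from_json() is a hypothetical helper), valgrind typically flags a reference that is not released on every exit path:

```c
#include <jansson.h>
#include <flux/idset.h>

/* Build an idset from a JSON array of rank numbers.  Hypothetical
 * example: the point is that 'ids' must be destroyed on the error
 * path, otherwise valgrind reports it as definitely lost. */
static struct idset *ranks_from_json (json_t *array)
{
    struct idset *ids;
    size_t index;
    json_t *entry;

    if (!(ids = idset_create (0, IDSET_FLAG_AUTOGROW)))
        return NULL;
    json_array_foreach (array, index, entry) {
        if (idset_set (ids, (unsigned int)json_integer_value (entry)) < 0) {
            idset_destroy (ids);   /* the leak fix: free on failure too */
            return NULL;
        }
    }
    return ids;
}
```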
By "first iteration", I'm referring to just the rank 0. The recursive step (the "next iteration") is after the response from rank 0. |
Force-pushed 1302a9a to ee048ea
Oh, I think that works (in the previous version of the PR) because rank 0 is in the pending_ranks idset, so the idset is not empty and the prep/check watchers are not stopped before the next "iteration". I did find that I had accidentally addressed your initial comments on an older branch and pushed that (it didn't even work at all on non-flat TBON overlays). I've pushed the updated version (which does start the prep/check watchers in gather_errors_cb, as you noted is necessary for non-flat overlays).
Oh, I'm so dumb, I see why it doesn't matter for a flat tbon. You only have to do one "iteration". Rank 0 has all the children :-)
LGTM, just the one nit below.
src/cmd/builtin/overlay.c
Outdated
```c
struct timespec t0;
monotime (&t0);
```
leftover debugging?
Yep, sheesh! 🤦
Problem: The `flux overlay errors` program exhibits scaling issues on large systems because it contacts all ranks serially. Update `flux overlay errors` to operate asynchronously using the Flux reactor. To avoid sending a very large number of messages without entering the reactor, use check/prep/idle watchers to limit the number of outstanding RPCs to 512 (experimentally determined to be a good value for clusters of 10K nodes). Fixes flux-framework#6517
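For context on the check/prep/idle mechanism described in that commit message, here is a minimal sketch of how a cap on outstanding RPCs can be enforced with Flux reactor watchers (created elsewhere with flux_prepare_watcher_create() and friends). This is an illustration only, not the PR's code: struct errors_ctx, MAX_RPCS, send_one_rpc(), and health_continuation() are hypothetical names, and response decoding and error handling are abbreviated.

```c
#include <flux/core.h>
#include <flux/idset.h>

#define MAX_RPCS 512

struct errors_ctx {
    flux_t *h;
    struct idset *pending_ranks;   /* ranks not yet contacted */
    int inflight;                  /* RPCs sent but not yet answered */
    double timeout;                /* per-RPC timeout in seconds */
    flux_watcher_t *prep;
    flux_watcher_t *check;
    flux_watcher_t *idle;
};

static void health_continuation (flux_future_t *f, void *arg)
{
    struct errors_ctx *ctx = arg;
    ctx->inflight--;
    /* ... decode payload, add child ranks to ctx->pending_ranks ... */
    flux_future_destroy (f);
}

/* Send one overlay.health RPC to 'rank' and register a continuation. */
static int send_one_rpc (struct errors_ctx *ctx, uint32_t rank)
{
    flux_future_t *f;
    if (!(f = flux_rpc (ctx->h, "overlay.health", NULL, rank, 0)))
        return -1;
    if (flux_future_then (f, ctx->timeout, health_continuation, ctx) < 0) {
        flux_future_destroy (f);
        return -1;
    }
    ctx->inflight++;
    return 0;
}

/* prep: if there are ranks left and room under the cap, start the idle
 * watcher so the reactor does not block in poll(2). */
static void prep_cb (flux_reactor_t *r, flux_watcher_t *w,
                     int revents, void *arg)
{
    struct errors_ctx *ctx = arg;
    if (idset_count (ctx->pending_ranks) > 0 && ctx->inflight < MAX_RPCS)
        flux_watcher_start (ctx->idle);
}

/* check: send RPCs up to the cap, then let the reactor run so that
 * responses can be processed before the next batch is sent. */
static void check_cb (flux_reactor_t *r, flux_watcher_t *w,
                      int revents, void *arg)
{
    struct errors_ctx *ctx = arg;
    unsigned int rank;

    flux_watcher_stop (ctx->idle);
    while (ctx->inflight < MAX_RPCS
           && (rank = idset_first (ctx->pending_ranks)) != IDSET_INVALID_ID) {
        idset_clear (ctx->pending_ranks, rank);
        if (send_one_rpc (ctx, rank) < 0)
            break;
    }
    if (idset_count (ctx->pending_ranks) == 0) {
        flux_watcher_stop (ctx->prep);
        flux_watcher_stop (ctx->check);
    }
}
```

The key point of the pattern is that check_cb() stops issuing RPCs once MAX_RPCS are in flight and resumes as continuations retire them, so the reactor regains control between batches.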
Problem: On systems with ~10K nodes, `flux overlay errors` sometimes reports "Connection timed out" for some ranks for which RPCs are issued on the first iteration. The problem seems to be that the timeout starts immediately when `flux_future_then(3)` is called, but for large systems with a flat TBON the program may not re-enter the reactor for >0.5s due to the size of the initial payload. While one solution would be to delay sending _any_ RPCs until the first time the check watcher is called, this unnecessarily extends the runtime of the program by at least the initial payload processing time. Instead, scale the timeout for large systems (>2K nodes) with the system size, such that 10K node systems get a roughly 2.5s timeout, which seems to be a safe value. Note that a long timeout is not as much of a problem as in previous versions of the program, where overlay.health RPCs were sent serially, since the longer timeout can now happen in parallel.
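As a rough, hypothetical illustration of that scaling (the PR's exact formula may differ), a linear scale that keeps the 0.5s default up to 2K brokers and reaches about 2.5s at 10K brokers could look like this:

```c
/* Hypothetical timeout scaling; the PR's exact formula may differ.
 * Keep the 0.5s default up to 2000 brokers, then grow linearly so
 * that ~10000 brokers yields a ~2.5s timeout. */
static double errors_rpc_timeout (uint32_t size)
{
    if (size <= 2000)
        return 0.5;
    return 0.5 * ((double)size / 2000.);   /* e.g. size=10000 -> 2.5 */
}
```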
Force-pushed ee048ea to 2e38408
Thanks for your attention to detail on this one @chu11. I'll set MWP here now.
Codecov Report

Attention: Patch coverage is …

```
@@            Coverage Diff             @@
##           master    #6593      +/-   ##
==========================================
- Coverage   79.46%   79.45%   -0.02%
==========================================
  Files         531      531
  Lines       88202    88275      +73
==========================================
+ Hits        70091    70136      +45
- Misses      18111    18139      +28
```
This PR replaces the synchronous RPCs used in `flux overlay errors` with asynchronous ones. This improves responsiveness on large clusters, e.g. on elcap:

before:

after: