scheduler: preserve allocations enriched during placement as 'informational' #24960
base: main
Conversation
This approach seems promising!
{
    name:             "Count 3, 2 allocs failed, 1 stopped, no reschedule",
    count:            3,
    stoppedCount:     1,
    failedCount:      2,
    reschedulePolicy: disabledReschedulePolicy,
    expectPlace:      2,
    expectStop:       1,
    expectIgnore:     1,
},
This case appears to cover your comment @tgross, does it not? The desired behavior, if I'm correct, should be 2 placed allocs, 1 stopped and 1 ignored in this case, which is exactly what we're getting.
If rescheduling is disabled, can't we only replace the stopped allocation? Where does the other placement come from?
Oh, I see... this test is actually quite complicated, as the first failed alloc is on the down node and the second failed alloc is on the disconnected node. So the 2nd failed alloc results in a replacement for the disconnect? We should probably leave a comment on the expectPlace explaining where those placements come from, to help future readers.
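Something along these lines might be enough. The node assignments in the comments follow the reading above and would need to be confirmed against the test fixtures before committing them:

    // one placement replaces the failed alloc on the down node; the other
    // replaces the failed alloc on the disconnected node (per the discussion,
    // not a reschedule, since rescheduling is disabled here)
    expectPlace:  2,
    expectStop:   1,
    expectIgnore: 1,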
expectPlace: 2,
expectStop: 1,
expectIgnore: 0,
I'm not 100% confident about these. Intuitively, I would expect 2 allocs to place, 0 to ignore, and 0 to stop. This might have to do with which nodes are available in this case; I'll look into it and do some additional manual testing to be sure that we're setting alloc desired status and client status correctly.
Yeah, agreed, this one looks funny.
Like the one above, I'd annotate the expectations here because it's not intuitive. You've got no failed allocs, so one stopped alloc is sitting on a down node, and the other stopped alloc is disconnected. So I'd expect 1 placement for the down node, and 1 temporary replacement for the disconnected alloc. Where's the stop come from? Are we calling stop for an allocation that's already been stopped?
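For the manual check on desired status and client status mentioned above, fixture helpers along these lines might make the distinction explicit. This is only a sketch: the package placement, helper names, and field choices are assumptions about how the test builds its allocations; the status constants themselves are Nomad's.

package scheduler // assumed location, alongside the reconciler tests

import "github.com/hashicorp/nomad/nomad/structs"

// markStopped models an allocation the server has already told to stop
// and that the client has completed.
func markStopped(alloc *structs.Allocation) {
    alloc.DesiredStatus = structs.AllocDesiredStatusStop
    alloc.ClientStatus = structs.AllocClientStatusComplete
}

// markFailed models an allocation that should still be running but has
// failed on the client.
func markFailed(alloc *structs.Allocation) {
    alloc.DesiredStatus = structs.AllocDesiredStatusRun
    alloc.ClientStatus = structs.AllocClientStatusFailed
}

If the stopped allocations already carry DesiredStatus "stop" in the fixtures, it should be easier to see whether the reconciler is emitting a redundant stop for an allocation that has already been stopped.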
// set host volume IDs on running allocations to make sure their presence doesn't
// interfere with reconciler behavior
alloc.HostVolumeIDs = []string{"host-volume1", "host-volume2"}
These are a great quick-and-dirty addition for avoiding regressions 👍
Converting this PR to draft for the time being, as we're exploring other avenues.
During the work on stateful deployments we discovered that if a job uses stateful deployments and gets deployed into a cluster with multiple volumes that have the same label, and we drain the node that job is running on, the scheduler would replace it on another node, ignoring the logic that requires a feasible node to have a particular volume ID. The reason for this behavior is the design of the reconciler: allocations that are being migrated (not rescheduled) are marked as ignored by the reconciler, and new allocations do not know about their previousAllocations; they have no "history." This has not been a problem until now, but stateful deployments "enrich" allocations with host volume IDs during placement, and the task group itself carries no information other than the volume label.

This PR introduces a new category of allocations in the reconciler, which we call informational. These allocations are still ignored, as before, but before landing in the ignore bucket they are retained for future reference. For now, only stateful deployments use these allocations, but future Nomad features (or indeed a refactoring of the reconciler...) could well use this new category.

This PR fixes the issue discovered in #24869.