scheduler: preserve allocations enriched during placement as 'informational' #24960

Draft: wants to merge 3 commits into base: main

Conversation

pkazmierczak (Contributor)

During the work on stateful deployments we discovered that if a job uses stateful deployments, gets deployed into a cluster with multiple volumes that share the same label, and we drain the node that job is running on, the scheduler replaces the allocation on another node, ignoring the logic that requires a feasible node to have the particular volume ID. The reason for this behavior is the design of the reconciler: allocations that are being migrated (not rescheduled) are marked as ignored by the reconciler, and new allocations do not know about their previousAllocations; they have no "history." This has not been a problem until now, but stateful deployments "enrich" allocations with host volume IDs during placement, since the task group itself carries no information other than the volume label.
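
For illustration, here is a minimal, self-contained Go sketch of why the resolved volume ID gets lost on migration. The types and names are hypothetical stand-ins, not Nomad's actual structs: the point is only that the replacement is built from the task group, which knows just the volume label, while the resolved HostVolumeIDs live on the old allocation that the reconciler ignores.

```go
package main

import "fmt"

// TaskGroup and Allocation are hypothetical, stripped-down stand-ins; only the
// fields needed to show the problem are kept.
type TaskGroup struct {
	Name        string
	VolumeLabel string // the job spec only names a volume label, not a concrete ID
}

type Allocation struct {
	ID            string
	NodeID        string
	TaskGroup     *TaskGroup
	HostVolumeIDs []string // resolved ("enriched") during placement, not part of the job spec
}

// replaceForDrain mimics building a migration replacement from the task group
// alone: the previous allocation's placement-time enrichment never carries over.
func replaceForDrain(old *Allocation) *Allocation {
	return &Allocation{
		ID:        old.ID + "-replacement",
		TaskGroup: old.TaskGroup,
		// HostVolumeIDs stays empty, so feasibility checking can no longer pin
		// the replacement to a node that exposes the same volume ID.
	}
}

func main() {
	tg := &TaskGroup{Name: "db", VolumeLabel: "data"}
	old := &Allocation{
		ID:            "alloc1",
		NodeID:        "node1",
		TaskGroup:     tg,
		HostVolumeIDs: []string{"host-volume1"},
	}

	repl := replaceForDrain(old)
	fmt.Println(repl.HostVolumeIDs) // [] -- the resolved volume ID is gone
}
```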

This PR introduces a new category of allocations in the reconciler, which we call informational. These allocations are still ignored, as before, but prior to landing in the ignore bucket they are retained for future reference. For now, only stateful deployments use these allocations, but future Nomad features (or indeed a refactoring of the reconciler) could well use this new category.
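
A rough sketch of the idea (hypothetical names, not the code in this PR): migrating allocations still end up ignored, but a reference to them is kept in an informational set so that later placement logic can look up their placement-time enrichment, such as HostVolumeIDs.

```go
package main

import "fmt"

// Alloc and reconcileResults are hypothetical, simplified types; the point is
// only to show the bookkeeping, not the reconciler's real structure.
type Alloc struct {
	ID            string
	HostVolumeIDs []string
}

type reconcileResults struct {
	ignore        []*Alloc
	informational []*Alloc // retained for future reference before being ignored
}

// bucketMigrations treats migrating allocations the way the PR describes:
// they still land in the ignore bucket, but a reference is kept so later
// placement can consult their placement-time enrichment.
func bucketMigrations(migrating []*Alloc) reconcileResults {
	var r reconcileResults
	for _, a := range migrating {
		r.informational = append(r.informational, a)
		r.ignore = append(r.ignore, a)
	}
	return r
}

func main() {
	migrating := []*Alloc{
		{ID: "alloc1", HostVolumeIDs: []string{"host-volume1"}},
	}
	r := bucketMigrations(migrating)
	fmt.Printf("ignored=%d informational=%d\n", len(r.ignore), len(r.informational))
	// A later placement for alloc1's replacement could look up
	// r.informational[0].HostVolumeIDs to keep the volume feasibility constraint.
}
```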

This PR fixes the issue discovered in #24869

@pkazmierczak pkazmierczak requested review from a team as code owners January 27, 2025 18:21
@pkazmierczak pkazmierczak added this to the 1.10.0 milestone Jan 27, 2025
@pkazmierczak pkazmierczak self-assigned this Jan 27, 2025
@tgross (Member) left a comment:

This approach seems promising!

scheduler/reconcile_util.go (resolved)
scheduler/reconcile_util_test.go (resolved)
Comment on lines +1125 to +1134
{
name: "Count 3, 2 allocs failed, 1 stopped, no reschedule",
count: 3,
stoppedCount: 1,
failedCount: 2,
reschedulePolicy: disabledReschedulePolicy,
expectPlace: 2,
expectStop: 1,
expectIgnore: 1,
},
@pkazmierczak (Contributor, Author):

This case appears to cover your comment @tgross, does it not? The desired behavior, if I'm correct, should be 2 placed allocs, 1 stopped and 1 ignored in this case, which is exactly what we're getting.

@tgross (Member):

If rescheduling is disabled, can't we only replace the stopped allocation? Where does the other placement come from?

Oh, I see... this test is actually quite complicated, as the first failed alloc is on the down node and the second failed alloc is on the disconnected node. So the 2nd failed alloc results in a replacement for the disconnect? We should probably leave a comment on expectPlace explaining where those placements come from, to help future readers.
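
For instance, the suggested annotation could look roughly like this (a hypothetical sketch based on the explanation above, not code from this PR):

```go
// expectPlace: one replacement for the failed alloc on the down node,
// plus one replacement for the failed alloc on the disconnected node
expectPlace: 2,
```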

Comment on lines +1153 to +1155
expectPlace: 2,
expectStop: 1,
expectIgnore: 0,
@pkazmierczak (Contributor, Author):

I'm not 100% confident about these. Intuitively, I would expect 2 allocs to place, 0 to ignore, and 0 to stop. This might have to do with which nodes are available in this case; I will look into it and do some additional manual testing to be sure that we're setting alloc desired status and client status correctly.

@tgross (Member):

Yeah, agreed, this one looks funny.

Like the one above, I'd annotate the expectations here because it's not intuitive. You've got no failed allocs, so one stopped alloc is sitting on a down node, and the other stopped alloc is disconnected. So I'd expect 1 placement for the down node, and 1 temporary replacement for the disconnected alloc. Where does the stop come from? Are we calling stop for an allocation that's already been stopped?

Comment on lines +391 to +393
// set host volume IDs on running allocations to make sure their presence doesn't
// interfere with reconciler behavior
alloc.HostVolumeIDs = []string{"host-volume1", "host-volume2"}
@tgross (Member):

These are a great quick-and-dirty addition for avoiding regressions 👍

@pkazmierczak pkazmierczak marked this pull request as draft January 29, 2025 18:48
@pkazmierczak (Contributor, Author):

Converting this PR to draft for the time being, as we're exploring other avenues.
