scheduler: preserve allocations enriched during placement as 'informational' #24960
base: main
Conversation
This approach seems promising!
{
    name:             "Count 3, 2 allocs failed, 1 stopped, no reschedule",
    count:            3,
    stoppedCount:     1,
    failedCount:      2,
    reschedulePolicy: disabledReschedulePolicy,
    expectPlace:      2,
    expectStop:       1,
    expectIgnore:     1,
},
This case appears to cover your comment @tgross, does it not? The desired behavior, if I'm correct, should be 2 placed allocs, 1 stopped and 1 ignored in this case, which is exactly what we're getting.
If rescheduling is disabled, can't we only replace the stopped allocation? Where does the other placement come from?
Oh, I see... this test is actually quite complicated, as the first failed alloc is on the down node and the second failed alloc is on the disconnected node. So the 2nd failed alloc results in a replacement for the disconnect? We should probably leave a comment on the expectPlace explaining where those placements come from, to help future readers.
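Something along these lines might be enough. The node assignments in the comments follow the reading above and would need to be confirmed against the test fixtures before committing them:

    // one placement replaces the failed alloc on the down node; the other
    // replaces the failed alloc on the disconnected node (per the discussion,
    // not a reschedule, since rescheduling is disabled here)
    expectPlace:  2,
    expectStop:   1,
    expectIgnore: 1,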
expectPlace: 2,
expectStop: 1,
expectIgnore: 0,
I'm not 100% confident about these. Intuitively, I would expect 2 allocs to place, 0 to ignore, and 0 to stop. This might have to do with which nodes are available in this case; I'll look into it and do some additional manual testing to be sure that we're setting alloc desired status and client status correctly.
Yeah, agreed, this one looks funny.
Like the one above, I'd annotate the expectations here because it's not intuitive. You've got no failed allocs, so one stopped alloc is sitting on a down node, and the other stopped alloc is disconnected. So I'd expect 1 placement for the down node, and 1 temporary replacement for the disconnected alloc. Where's the stop come from? Are we calling stop for an allocation that's already been stopped?
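For the manual check on desired status and client status mentioned above, fixture helpers along these lines might make the distinction explicit. This is only a sketch: the package placement, helper names, and field choices are assumptions about how the test builds its allocations; the status constants themselves are Nomad's.

package scheduler // assumed location, alongside the reconciler tests

import "github.com/hashicorp/nomad/nomad/structs"

// markStopped models an allocation the server has already told to stop
// and that the client has completed.
func markStopped(alloc *structs.Allocation) {
    alloc.DesiredStatus = structs.AllocDesiredStatusStop
    alloc.ClientStatus = structs.AllocClientStatusComplete
}

// markFailed models an allocation that should still be running but has
// failed on the client.
func markFailed(alloc *structs.Allocation) {
    alloc.DesiredStatus = structs.AllocDesiredStatusRun
    alloc.ClientStatus = structs.AllocClientStatusFailed
}

If the stopped allocations already carry DesiredStatus "stop" in the fixtures, it should be easier to see whether the reconciler is emitting a redundant stop for an allocation that has already been stopped.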
// set host volume IDs on running allocations to make sure their presence doesn't
// interfere with reconciler behavior
alloc.HostVolumeIDs = []string{"host-volume1", "host-volume2"}
These are a great quick-and-dirty addition for avoiding regressions 👍
Converting this PR to draft for the time being, as we're exploring other avenues.
During the work on stateful deployments we discovered that if a job uses stateful deployments and gets deployed into a cluster with multiple volumes that have the same label, and we drain the node that job is running on, the scheduler would replace it on another node, ignoring the logic that requires a feasible node to have a particular volume ID. The reason for this behavior is the design of the reconciler: allocations that are being migrated (not rescheduled) are marked as ignored by the reconciler, and new allocations do not know about their previousAllocations; they have no "history." This has not been a problem until now, but stateful deployments "enrich" allocations with host volume IDs during placement, and the task group itself carries no information other than the volume label.

This PR introduces a new category of allocations in the reconciler, which we call informational. These allocations are still ignored, as before, but before landing in the ignore bucket they are retained for future reference. For now, only stateful deployments use these allocations, but future Nomad features (or indeed a refactoring of the reconciler...) could well use this new category.

This PR fixes the issue discovered in #24869.