
dynamic host volumes: serialize ops per volume #24852

Merged: 5 commits into main on Jan 17, 2025
Conversation

@gulducat (Member) commented on Jan 13, 2025:

Let only one create/register/delete operation run at a time per volume ID, so that:

  • DHV plugins can assume that Nomad will not run concurrent operations for the same volume
  • We avoid interleaving client RPCs with raft writes

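For illustration only, here is a minimal Go sketch of one way to serialize operations per volume ID without a coarse-grained lock: a briefly-held mutex guards a map of per-volume single-slot channels, and each operation holds its volume's slot for its full duration. This is an assumption-laden sketch, not the PR's actual implementation:

	// Hypothetical sketch, not the PR's actual mechanism.
	type perVolumeSerializer struct {
		mu    sync.Mutex
		slots map[string]chan struct{}
	}

	// acquire blocks until no other create/register/delete is running
	// for the given volume ID.
	func (s *perVolumeSerializer) acquire(volID string) {
		s.mu.Lock()
		slot, ok := s.slots[volID]
		if !ok {
			slot = make(chan struct{}, 1) // capacity 1: one op at a time
			s.slots[volID] = slot
		}
		s.mu.Unlock()
		slot <- struct{}{} // blocks while another op holds the slot
	}

	// release frees the volume's slot so the next queued op can proceed.
	// (This sketch never deletes map entries; real code would need to.)
	func (s *perVolumeSerializer) release(volID string) {
		s.mu.Lock()
		slot := s.slots[volID]
		s.mu.Unlock()
		<-slot
	}

An endpoint would call acquire(vol.ID) before issuing client RPCs and raft writes, and defer release(vol.ID), so concurrent requests for the same volume queue up while operations on different volumes proceed in parallel.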
@gulducat requested a review from tgross on January 13, 2025 23:01
@tgross (Member) left a comment:

I was a little worried this was going to be tough to do without either a coarse-grained mutex or a hot loop, but your implementation elegantly avoids both problems. Nice work on this so far.

@gulducat marked this pull request as ready for review on January 15, 2025 17:46
@gulducat requested review from a team as code owners on January 15, 2025 17:46
@tgross previously approved these changes on Jan 15, 2025:

LGTM!

nomad/host_volume_endpoint_test.go: two review threads resolved (outdated)

Commit: feelin pretty good about this one
Comment on lines 1013 to 1016:

	// block until something runs unblockCurrent()
	if bc := v.getBlockChan(); bc != nil {
		bc <- "create"
	}
@gulducat (Member, Author) commented on Jan 16, 2025:

I refactored it (again), and while it's still convoluted, it does actually test the serialization.

Instead of the mock client methods blocking on a receive, they now block on a send. unblockCurrent() then unblocks them by receiving, and it returns what it received, i.e. whichever client RPC was in flight at the time.

The test code then checks that this matches the server RPC that completed, so we can verify the state as appropriate for the operation type.

Beyond just believing it works, I can tell it does: if I don't use the same volume ID as the initial create, operations run in random orders and the tests fail chaotically.
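For illustration, a minimal Go sketch of the pattern described above. The names (mockVolumeClient, getBlockChan, unblockCurrent) follow this thread, but the real code in nomad/host_volume_endpoint_test.go may differ:

	// Hypothetical sketch of the blocking-on-send mock pattern.
	type mockVolumeClient struct {
		mu        sync.Mutex
		blockChan chan string
	}

	func (v *mockVolumeClient) getBlockChan() chan string {
		v.mu.Lock()
		defer v.mu.Unlock()
		return v.blockChan
	}

	// Create blocks on a send until the test calls unblockCurrent, so
	// only the RPC currently allowed to run can proceed.
	func (v *mockVolumeClient) Create() error {
		if bc := v.getBlockChan(); bc != nil {
			bc <- "create"
		}
		return nil
	}

	// unblockCurrent receives from the channel, releasing whichever
	// client RPC is in flight, and reports which one it was.
	func (v *mockVolumeClient) unblockCurrent() string {
		return <-v.getBlockChan()
	}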

@tgross previously approved these changes on Jan 16, 2025:

LGTM!

We can probably reuse this logic for tightening up the serialization of CSI operations too, so it's great that we're spending the effort to make this solid.

	}
	case <-time.After(time.Second):
		t.Error("timeout waiting for an RPC to finish")
		break LOOP
@tgross (Member) commented:

This will deadlock the test if we break while there are any pending funcs (i.e. if the first or second RPC times out): we break the loop and then call funcs.Wait() without ever unblocking the channel for the remaining functions. A deadlocked test manifests as the whole test run blowing up in a panic after the 20-minute timeout, which is pretty awful to debug on GHA because it's not obvious which test is panicking without a lot of rummaging in the logs.

We could fix this by having a shutdown context on the mockVolumeClient. Just before we break the loop we can cancel that context, and then the send looks something like this:

	// block until something runs unblockCurrent()
	if bc := v.getBlockChan(); bc != nil {
		select {
		case bc <- "delete":
		case <-v.shutdownCtx.Done():
			v.setBlockChan(nil)
		}
	}

That would allow the funcs.Wait() to run to completion and close out the test. We'd still end up failing the test because of the t.Error here.

All that being said, I'll 👍 this PR as-is, since it'd be a bit of effort to fix and only matters when the test breaks out of the loop early. Up to you.
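For illustration, a sketch of how that shutdown context might be wired in, building on the mock sketch earlier in this thread; shutdownCtx and setBlockChan come from the suggestion above, while the loop bounds and the funcs usage are assumptions rather than the real test code:

	// Hypothetical wiring; cancel() before breaking so any mock method
	// blocked on its send can take the shutdownCtx.Done() branch.
	shutdownCtx, cancel := context.WithCancel(context.Background())
	v := &mockVolumeClient{shutdownCtx: shutdownCtx}
	var funcs sync.WaitGroup // runs the create/register/delete server RPCs

LOOP:
	for i := 0; i < 3; i++ { // one iteration per expected RPC
		select {
		case op := <-v.getBlockChan():
			t.Logf("unblocked %q", op) // compare against the completed server RPC
		case <-time.After(time.Second):
			t.Error("timeout waiting for an RPC to finish")
			cancel() // release any mock method still blocked on its send
			break LOOP
		}
	}
	funcs.Wait() // returns even after an early break, since cancel() released the senders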

@gulducat (Member, Author) commented:

I thought I had caught all the deadlocks! I caused enough of them while writing this 😋 but you're totally right. I pushed another commit to cover it.

> so it's great that we're spending the effort to make this solid.

thanks for saying so; I was worried I was deadlocked myself, spending too much time on it 😅

@tgross (Member) left a comment:


LGTM!

@gulducat merged commit 4807e74 into main on Jan 17, 2025 (30 checks passed)
@gulducat deleted the dhv-serialize-ops-per-vol branch on January 17, 2025 16:37