
dynamic host volumes: serialize ops per volume #24852

Merged: 5 commits into main on Jan 17, 2025
Conversation

@gulducat (Member) commented on Jan 13, 2025:

Let only one create/register/delete operation run at a time per volume ID, so that:

  • DHV plugins can assume that Nomad will not run concurrent operations for the same volume
  • We avoid interleaving client RPCs with raft writes

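For illustration only, here is a minimal Go sketch of one way to serialize operations per volume ID without a coarse-grained lock: a briefly-held mutex guards a map of per-volume single-slot channels, and each operation holds its volume's slot for its full duration. This is an assumption-laden sketch, not the PR's actual implementation:

	// Hypothetical sketch, not the PR's actual mechanism.
	type perVolumeSerializer struct {
		mu    sync.Mutex
		slots map[string]chan struct{}
	}

	// acquire blocks until no other create/register/delete is running
	// for the given volume ID.
	func (s *perVolumeSerializer) acquire(volID string) {
		s.mu.Lock()
		slot, ok := s.slots[volID]
		if !ok {
			slot = make(chan struct{}, 1) // capacity 1: one op at a time
			s.slots[volID] = slot
		}
		s.mu.Unlock()
		slot <- struct{}{} // blocks while another op holds the slot
	}

	// release frees the volume's slot so the next queued op can proceed.
	// (This sketch never deletes map entries; real code would need to.)
	func (s *perVolumeSerializer) release(volID string) {
		s.mu.Lock()
		slot := s.slots[volID]
		s.mu.Unlock()
		<-slot
	}

An endpoint would call acquire(vol.ID) before issuing client RPCs and raft writes, and defer release(vol.ID), so concurrent requests for the same volume queue up while operations on different volumes proceed in parallel.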
@gulducat requested a review from tgross on January 13, 2025 23:01
@tgross (Member) left a comment:

I was a little worried this was going to be tough to do without either a coarse-grained mutex or a hot loop, but your implementation elegantly avoids both problems. Nice work on this so far.

@gulducat marked this pull request as ready for review on January 15, 2025 17:46
@gulducat requested review from a team as code owners on January 15, 2025 17:46
@tgross previously approved these changes on Jan 15, 2025:

LGTM!

nomad/host_volume_endpoint_test.go: two review threads resolved (outdated)

Commit: feelin pretty good about this one
Comment on lines 1013 to 1016:

	// block until something runs unblockCurrent()
	if bc := v.getBlockChan(); bc != nil {
		bc <- "create"
	}
@gulducat (Member, Author) commented on Jan 16, 2025:

I refactored it (again), and while it's still convoluted, it does actually test the serialization.

Instead of the mock client methods blocking on a receive, they now block on a send. unblockCurrent() then unblocks them by receiving, and it returns what it received, i.e. whichever client RPC was in flight at the time.

The test code then checks that this matches the server RPC that completed, so we can verify the state as appropriate for the operation type.

Beyond just believing it works, I can tell it does: if I don't use the same volume ID as the initial create, operations run in random orders and the tests fail chaotically.
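For illustration, a minimal Go sketch of the pattern described above. The names (mockVolumeClient, getBlockChan, unblockCurrent) follow this thread, but the real code in nomad/host_volume_endpoint_test.go may differ:

	// Hypothetical sketch of the blocking-on-send mock pattern.
	type mockVolumeClient struct {
		mu        sync.Mutex
		blockChan chan string
	}

	func (v *mockVolumeClient) getBlockChan() chan string {
		v.mu.Lock()
		defer v.mu.Unlock()
		return v.blockChan
	}

	// Create blocks on a send until the test calls unblockCurrent, so
	// only the RPC currently allowed to run can proceed.
	func (v *mockVolumeClient) Create() error {
		if bc := v.getBlockChan(); bc != nil {
			bc <- "create"
		}
		return nil
	}

	// unblockCurrent receives from the channel, releasing whichever
	// client RPC is in flight, and reports which one it was.
	func (v *mockVolumeClient) unblockCurrent() string {
		return <-v.getBlockChan()
	}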

@tgross previously approved these changes on Jan 16, 2025:

LGTM!

We can probably reuse this logic for tightening up the serialization of CSI operations too, so it's great that we're spending the effort to make this solid.

	}
	case <-time.After(time.Second):
		t.Error("timeout waiting for an RPC to finish")
		break LOOP
@tgross (Member) commented:

This will deadlock the test if we break while there are any pending funcs (i.e. if the first or second RPC times out): we break the loop and then call funcs.Wait() without ever unblocking the channel for the remaining functions. A deadlocked test manifests as the whole test run blowing up in a panic after the 20-minute timeout, which is pretty awful to debug on GHA because it's not obvious which test is panicking without a lot of rummaging in the logs.

We could fix this by having a shutdown context on the mockVolumeClient. Just before we break the loop we can cancel that context, and then the send looks something like this:

	// block until something runs unblockCurrent()
	if bc := v.getBlockChan(); bc != nil {
		select {
		case bc <- "delete":
		case <-v.shutdownCtx.Done():
			v.setBlockChan(nil)
		}
	}

That would allow the funcs.Wait() to run to completion and close out the test. We'd still end up failing the test because of the t.Error here.

All that being said, I'll 👍 this PR as-is, since it'd be a bit of effort to fix and only matters when the test breaks out of the loop early. Up to you.
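For illustration, a sketch of how that shutdown context might be wired in, building on the mock sketch earlier in this thread; shutdownCtx and setBlockChan come from the suggestion above, while the loop bounds and the funcs usage are assumptions rather than the real test code:

	// Hypothetical wiring; cancel() before breaking so any mock method
	// blocked on its send can take the shutdownCtx.Done() branch.
	shutdownCtx, cancel := context.WithCancel(context.Background())
	v := &mockVolumeClient{shutdownCtx: shutdownCtx}
	var funcs sync.WaitGroup // runs the create/register/delete server RPCs

LOOP:
	for i := 0; i < 3; i++ { // one iteration per expected RPC
		select {
		case op := <-v.getBlockChan():
			t.Logf("unblocked %q", op) // compare against the completed server RPC
		case <-time.After(time.Second):
			t.Error("timeout waiting for an RPC to finish")
			cancel() // release any mock method still blocked on its send
			break LOOP
		}
	}
	funcs.Wait() // returns even after an early break, since cancel() released the senders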

@gulducat (Member, Author) commented:

I thought I had caught all the deadlocks! I caused enough of them while writing this 😋 but you're totally right. I pushed another commit to cover it.

> so it's great that we're spending the effort to make this solid.

thanks for saying so; I was worried I was deadlocked myself, spending too much time on it 😅

@tgross (Member) left a comment:


LGTM!

@gulducat merged commit 4807e74 into main on Jan 17, 2025 (30 checks passed)
@gulducat deleted the dhv-serialize-ops-per-vol branch on January 17, 2025 16:37