Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch task queue user data persistence updates #7039

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

dnr
Copy link
Member

@dnr dnr commented Dec 30, 2024

What changed?

Multiple user data updates coming in for task queues in the same namespace within a short period of time get batched into a smaller number of persistence operations (transactions).

Since multiple updates are batched into a transaction, conflicts in one can cause unrelated ones to fail. This is detected and a non-retryable error is returned for the conflicting one, while a retryable error is returned for the other ones.

Why?

With deployments, we sometimes have to update user data on multiple task queues at once (all in the same namespace), and on cassandra, these updates go through an LWT. This could cause a backup since the throughput of LWTs is fairly low.

This change allows batching of multiple updates in one persistence operation (LWT on cassandra or transaction on sql). The batching is transparent: updates that come in within a short period of time automatically get batched (in matching engine).

How did you test it?

  • unit test for batcher component
  • added some persistence tests for user data, including conflict behavior (there were none before)
  • existing tests for user data updates

Potential risks

  • small extra latency on all user data updates

@dnr dnr requested a review from a team as a code owner December 30, 2024 20:52
Copy link
Collaborator

@ShahabT ShahabT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally lgtm, but someone else should review line-by-line.

if err != nil {
return err
if m.Db.IsDupEntryError(err) {
return &persistence.ConditionFailedError{Msg: err.Error()}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this type of error handling you will end up with "partially updated" state, but you don't really know which one are passed.
(unless I miss something).
Is there a reason to stop on error? or may it make sense to move forward, and return an array of failed updates?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a transaction so you can't end up with things partially updated, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was not clear that all of this happening inside transaction. Probably worth a comment

common/persistence/task_manager.go Show resolved Hide resolved
common/stream_batcher/batcher.go Show resolved Hide resolved
common/stream_batcher/batcher.go Outdated Show resolved Hide resolved
common/stream_batcher/batcher.go Outdated Show resolved Hide resolved
service/matching/matching_engine_test.go Show resolved Hide resolved
// try to add more items. stop after a gap of MaxGap, total time of MaxTotalWait, or
// MaxItems items.
maxWaitC, maxWaitT := s.clock.NewTimer(s.opts.MaxDelay)
loop:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand all the possible implications, so I guess I will trust this work and covered by tests.

Copy link
Contributor

@stephanos stephanos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First half of my review. I'll review the stream_batcher next.

NamespaceID string
TaskQueue string
NamespaceID string
Updates map[string]*SingleTaskQueueUserDataUpdate // key is task queue name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: I usually name maps with ambiguous keys byXY, ie UpdatesByTaskQueue. Another option that seems reasonable is to have a new type taskQueueName.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these struct names are so long already... 😥

service/matching/matching_engine.go Outdated Show resolved Hide resolved
service/matching/matching_engine.go Outdated Show resolved Hide resolved
service/matching/matching_engine.go Outdated Show resolved Hide resolved
}

previous := make(map[string]interface{})
previous := make(map[string]any)
applied, iter, err := d.Session.MapExecuteBatchCAS(batch, previous)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retries/conflicts are not handled well yet: on a version conflict, all updates in the batch will fail and none will be retried, but the non-conflicting ones should be retried

From a user perspective, is it acceptable to be released like that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends what the user data change is.. if it's something initiated by a user rpc (versioning-1 or -2), they can just get an error and re-run it, so probably fine. For versioning-3 it'll be initiated by a deployment workflow, for registering a new task queue with a deployment, and for changing the current deployment. The registration will get retried so that's okay. For changing the current... I'm not sure if the error will get propagated back correctly or retried.

Basically the answer is: probably not. So I'll plan to fix it in this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed now

common/stream_batcher/batcher.go Outdated Show resolved Hide resolved
common/stream_batcher/batcher.go Outdated Show resolved Hide resolved
common/stream_batcher/batcher.go Outdated Show resolved Hide resolved
common/stream_batcher/batcher.go Outdated Show resolved Hide resolved
common/stream_batcher/batcher.go Show resolved Hide resolved
common/stream_batcher/batcher.go Outdated Show resolved Hide resolved
common/stream_batcher/batcher.go Outdated Show resolved Hide resolved
common/stream_batcher/batcher.go Outdated Show resolved Hide resolved
// process batch
r := s.fn(items)

// send responses
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call out here (or somewhere else) that all all batch items receive the same response. (without the PR context, that would have been surprising)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the processor function is only called once per batch and returns one value.. how can they receive anything other than the same response?

clk.AdvanceNext() // first Add
time.Sleep(time.Millisecond)
clk.AdvanceNext() // second Add
time.Sleep(time.Millisecond)
Copy link
Contributor

@stephanos stephanos Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tangential question: the time.Sleeps are necessary because of the use of context.Context, right? I wonder if there's also a "fake context" impl that we could connect with the timesource to make things like this fully deterministic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing to do with Context, it's just that it needs other goroutines to advance to get themselves blocked on the fake clock so that the AdvanceNext does the right thing. (the fake clock already has a fake context, btw.)

the new synctest thing in go 1.24 fixes this by integrating with the runtime.. I think it will make these tests a lot nicer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants