Fix ErrStateMachineNotFound handling in HSM state replication #7032

Open: wants to merge 5 commits into base: main

@justinp-tt (Contributor) commented Dec 23, 2024

What changed?

  • Modified HSMStateReplicatorImpl.syncHSMNode() to handle ErrStateMachineNotFound gracefully
  • Added debug logging with correct field reference to OriginalExecutionRunId
  • Added unit test TestSyncHSM_StateMachineNotFound to verify behavior

Why?

After we added support for deleting state in terminal states in Nexus, nightly tests started failing when sync HSM tasks tried to replicate state machines that had been legitimately deleted. Since the deletion is intentional for terminal states, we should handle these cases gracefully by logging the miss and continuing to replicate the other state machines.

How did you test it?

  • Added unit test verifying graceful handling of ErrStateMachineNotFound
  • Existing nightly test failures should be resolved by this change

Potential risks

  • If there are cases where a state machine is temporarily unavailable (rather than legitimately deleted), we might incorrectly continue processing
  • However, based on the HSM implementation, state machines are either present in persistence or not - there is no transient state
  • Suppressing ErrStateMachineNotFound could potentially mask other issues if the error occurs for unexpected reasons

Documentation

No documentation changes required as this is an internal implementation detail handling error cases in the replication path.

Is hotfix candidate?

No. While this fixes test failures, the underlying issue is not causing production problems that would warrant a hotfix.

@justinp-tt changed the title from "Ignore state machine not found during sync" to "Drop sync HSM tasks when state machine not found" on Dec 23, 2024
@justinp-tt self-assigned this on Dec 23, 2024
@justinp-tt requested a review from bergundy on December 23, 2024 18:55
@bergundy (Member) left a comment

I'm not very familiar with the replication code path, so there may be a better place to put this error handling logic. Were the errors you observed on the source or the target cluster?

@justinp-tt changed the title from "Drop sync HSM tasks when state machine not found" to "Fix ErrStateMachineNotFound handling in HSM state replication" on Dec 23, 2024
// Based on 1 and 2, node should always be found here.
// The node may not be found if:
// 1. The state machine was deleted (e.g. terminal state cleanup)
// 2. We're missing events that created this node
Member

That's not true based on the comment that you deleted.

Already done history resend if needed before,
// and node creation today always associated with an event

I would also clarify that creation and deletion are always associated with an event.

@bergundy (Member) left a comment

Please clarify the code comments, otherwise LGTM.

tag.WorkflowID(mutableState.GetExecutionInfo().WorkflowId),
tag.WorkflowRunID(mutableState.GetExecutionInfo().OriginalExecutionRunId),
)
return nil
Member

nit: I'd add a sanity check that the version history in the mutable state is > the one in the request (same as the check on L265, or just return that info from compareVersionHistory), and return an error otherwise.

@justinp-tt (Contributor, Author) commented Dec 25, 2024

Don't we already check this in compareVersionHistory? In other words, compareVersionHistory will return an error if the condition you mention exists, so we won't even get to the point of an ErrStateMachineNotFound error.

Member

Hmm, not sure I follow. compareVersionHistory returns an error if the last version history item of the (local) mutable state is < the one in the request. The check I mentioned is for > (which is also not the same as the >= checked in compareVersionHistory).
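
One possible reading of the sanity check discussed in this thread, sketched under stated assumptions: `historyItem`, `compareItems`, and `onNodeNotFound` are hypothetical names, and ordering items by version first and event ID second is an assumption made for illustration, not something confirmed here. The idea is that a missing node is only dropped silently when the local history is strictly ahead of the request; otherwise an error is returned.

```go
package main

import "fmt"

// historyItem is a hypothetical, simplified version history item.
type historyItem struct {
	EventID int64
	Version int64
}

// compareItems returns -1, 0, or 1. Ordering by Version first and EventID
// second is an assumption made for this sketch.
func compareItems(a, b historyItem) int {
	switch {
	case a.Version < b.Version:
		return -1
	case a.Version > b.Version:
		return 1
	case a.EventID < b.EventID:
		return -1
	case a.EventID > b.EventID:
		return 1
	}
	return 0
}

// onNodeNotFound decides what to do after a node lookup has already failed.
// It returns nil (drop the task) only when the local history is strictly
// ahead of the request, which is consistent with the node having been
// deleted locally. An equal or older local history is reported as an error.
func onNodeNotFound(localLast, requestLast historyItem) error {
	if compareItems(localLast, requestLast) > 0 {
		return nil
	}
	return fmt.Errorf(
		"state machine not found, but local history (event %d, version %d) is not ahead of request (event %d, version %d)",
		localLast.EventID, localLast.Version, requestLast.EventID, requestLast.Version,
	)
}
```

With this shape, the equal case (which a >= check would let through) is treated as unexpected when the node is missing, which is the distinction between > and >= raised above.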

@justinp-tt marked this pull request as ready for review on January 21, 2025 17:41
@justinp-tt requested a review from a team as a code owner on January 21, 2025 17:41