Heads stuck in a state without being able to progress snapshots #1773

twwu123 · 2025-01-03T07:53:32Z

Context & versions

Hydra version: 0.19.0

At some point, some of the hydra-nodes failed to sign new snapshots, which left the local ledger state of each hydra-node diverging from the others: essentially forked states. It is still possible to submit transactions to the hydra-nodes, but doing so is useless, because a snapshot is not confirmed unless all of the hydra-nodes sign it. Each node then starts accepting transactions based on a completely different state, and the nodes stop agreeing on which UTxOs have been spent.
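To make the failure mode concrete, here is a minimal toy model of it. It is only a sketch under stated assumptions: `ToyNode`, `apply_tx` and `try_confirm` are hypothetical names invented for illustration and are not part of hydra-node, and UTxOs are reduced to plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Illustrative stand-in for a head participant; NOT the hydra-node data model."""
    name: str
    confirmed_utxo: frozenset                      # UTxO set of the last snapshot everyone signed
    local_utxo: set = field(default_factory=set)   # local view, may run ahead of the snapshot
    signing: bool = True                           # False models a node that stopped signing

    def __post_init__(self):
        # Every node starts with a local view equal to the confirmed snapshot.
        self.local_utxo = set(self.confirmed_utxo)

    def apply_tx(self, spent: str, produced: str) -> bool:
        """Apply a transaction to the local view only."""
        if spent not in self.local_utxo:
            return False
        self.local_utxo.remove(spent)
        self.local_utxo.add(produced)
        return True


def try_confirm(nodes, proposed_utxo) -> bool:
    """Confirm a snapshot only if every node signs it (assumed unanimity rule)."""
    if not all(node.signing for node in nodes):
        return False
    for node in nodes:
        node.confirmed_utxo = frozenset(proposed_utxo)
    return True


genesis = frozenset({"a", "b"})
alice = ToyNode("alice", genesis)
bob = ToyNode("bob", genesis, signing=False)  # bob has stopped signing

# Both nodes accept a transaction spending "a", but into different outputs.
alice.apply_tx(spent="a", produced="a1")
bob.apply_tx(spent="a", produced="a2")

# The snapshot cannot be confirmed, so the local views stay forked.
confirmed = try_confirm([alice, bob], alice.local_utxo)
```

In this model the confirmed snapshot stays at the original set while the two local views disagree about what `a` was spent into, which matches the situation described above.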

Actual behavior

Firstly, I'm unsure why the hydra-nodes stop signing new snapshots, but once they do, it seems to be unrecoverable. There is no way to forcibly sync the snapshots of the hydra-nodes, and the nodes do not attempt to sign any new states. It doesn't even seem like they attempt to update the snapshots in any way.

Unfortunately, in my use case the head is in a state that is impossible to close most of the time; only right before a planned close do we reconcile it into a closable state. This means that whenever the nodes get stuck like this, the head is completely doomed and unrecoverable.

Expected behavior

Hopefully there can be some way to recover from such a situation. There does seem to be a snapshot that all the hydra-nodes agree on, but somehow the local ledger states start to deviate from each other. I suspect the easiest solution would be to allow some way to reset the nodes' ledger state to the most recently agreed upon snapshot.
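The reset suggested here can be sketched as follows. This is an illustration under assumptions, not an existing hydra-node feature: `reset_to_confirmed` is a hypothetical helper, and local views are modelled as plain sets of UTxO identifiers keyed by node name.

```python
def reset_to_confirmed(confirmed_utxo, local_views):
    """Overwrite every node's local view with the last confirmed snapshot.

    Returns, per node, the entries in which its old local view differed from
    the snapshot, so an operator can see what the reset discards.
    """
    discarded = {}
    for node, view in local_views.items():
        # Symmetric difference: everything the local view added or removed
        # relative to the confirmed snapshot.
        discarded[node] = set(view) ^ set(confirmed_utxo)
        local_views[node] = set(confirmed_utxo)
    return discarded


confirmed = {"a", "b"}
views = {"alice": {"a1", "b"}, "bob": {"a2", "b"}}
discarded = reset_to_confirmed(confirmed, views)
```

In this sketch, anything applied locally after the confirmed snapshot is thrown away, so such transactions would have to be resubmitted once the nodes agree again.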

noonio · 2025-01-08T13:47:12Z

One of the main aims of #1468 is to solve this problem.

Let us know if you think that would help you! If so, it's on our roadmap very soon, so hopefully that will help :)

twwu123 · 2025-01-13T05:47:19Z

> One of the main aims of #1468 is to solve this problem.
>
> Let us know if you think that would help you! If so, it's on our roadmap very soon, so hopefully that will help :)

This will definitely help in terms of being able to fan out a subset of usable UTxOs. There may be something I'm missing, but does it also help in terms of reconciling forked local states?

ch1bo · 2025-01-15T07:50:06Z

@twwu123 Thanks for reporting this and evaluating the Hydra project in-depth!

You raise very valid points and your understanding is just right. The hydra-nodes seem to have reached different local views on what to snapshot and, at the same time, snapshot signing seems to have stalled in your situation. We have also described such cases as the Hydra head being "stuck"; see, for example, all issues mentioning "stuck".

As you can see from past issues, this can happen if networking, persistence or version incompatibility is preventing smooth off-chain protocol progress. Besides making the individual components more resilient to faults (e.g. by following up a successful experiment with #1720), we also discussed fallback mechanisms for situations where it's purely a technical hiccup and not a loss of consensus that is preventing progress, for example: #1284

Your expected behavior indicates that such a clearing of the diverging local views or reset to the last confirmed snapshot would be a solution for you?

twwu123 · 2025-01-15T07:56:25Z

> Your expected behavior indicates that such a clearing of the diverging local views or reset to the last confirmed snapshot would be a solution for you?

The ideal solution for me would be a way for one hydra-node to send its local state for everyone else to sign, so that everyone is sync'd again to one node's version of the ledger state.

If this isn't possible, then at the very least, resetting everyone's local state to the latest agreed on snapshot would be my second option.

My understanding is that both of these should be possible. However, the first option may have small security concerns, and might require some sort of manual "approval" of a specific state for every hydra-node.

ch1bo · 2025-01-15T08:17:41Z

> My understanding is that both of these should be possible. However, the first option may have small security concerns, and might require some sort of manual "approval" of a specific state for every hydra-node.

Yes, exactly. A node can't just trust another node by adopting their state.
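The "manual approval" idea discussed above could look roughly like the sketch below. It is purely illustrative: `adopt_proposed_state` and the `approvals` mapping are hypothetical, and an explicit operator decision is modelled as a boolean where a real protocol would presumably require a signature.

```python
def adopt_proposed_state(proposed_utxo, approvals, local_views):
    """Adopt one node's proposed state only if every node explicitly approves.

    approvals maps node name -> operator decision; a missing entry counts as
    "not approved", so no node adopts a foreign state by default.
    """
    missing = [node for node in local_views if not approvals.get(node, False)]
    if missing:
        # At least one participant has not approved: nobody changes state.
        return False, missing
    for node in local_views:
        local_views[node] = set(proposed_utxo)
    return True, []


views = {"alice": {"a1", "b"}, "bob": {"a2", "b"}}
proposal = views["alice"]

# bob has not approved yet, so nothing changes.
ok_first, waiting_on = adopt_proposed_state(proposal, {"alice": True}, views)
bob_before = set(views["bob"])

# Once every operator approves, all nodes converge on the proposal.
ok_second, _ = adopt_proposed_state(proposal, {"alice": True, "bob": True}, views)
```

The all-or-nothing check is the point of the sketch: a single withheld approval leaves every local view untouched.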

That being said, we should make the snapshotting logic more defensive and retry more, as this could be an issue here too.
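As a rough illustration of what "retry more" could mean, the sketch below retries a signature request with exponential backoff. It is a generic pattern, not the hydra-node snapshotting logic: `request_signatures_with_retry`, the `request` callback and the delay values are all assumptions.

```python
import time


def request_signatures_with_retry(request, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call request() until it returns True or the attempts are used up.

    request is a hypothetical callback that asks peers to sign the pending
    snapshot and returns True once all signatures have been collected.
    """
    for attempt in range(attempts):
        if request():
            return True
        if attempt < attempts - 1:
            # Back off exponentially: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return False


# Usage with a fake request that fails twice before succeeding; the sleep
# function is replaced by a recorder so the example runs instantly.
outcomes = iter([False, False, True])
delays = []
succeeded = request_signatures_with_retry(lambda: next(outcomes), sleep=delays.append)
```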

twwu123 added the bug 🐛 Something isn't working label Jan 3, 2025

noonio mentioned this issue Jan 8, 2025

Partial fanout #1468

Open

5 tasks
