Heads stuck in a state without being able to progress snapshots #1773

Open
twwu123 opened this issue Jan 3, 2025 · 5 comments
Labels
bug 🐛 Something isn't working

Comments


twwu123 commented Jan 3, 2025

Context & versions

Hydra version: 0.19.0

At some point, some of the hydra-nodes failed to sign new snapshots, and the local ledger states of the hydra-nodes diverged from each other, essentially forking. It is still possible to submit transactions to the hydra-nodes, but doing so is useless: snapshots do not advance unless all of the hydra-nodes sign them, and each node starts accepting transactions based on a completely different state, so the nodes no longer agree on which UTxOs have been spent.

Actual behavior

First, I'm unsure why the hydra-nodes stop signing new snapshots, but once they do, the situation seems to be unrecoverable. There is no way to forcibly sync the snapshots of the hydra-nodes, and the nodes do not attempt to sign any new states; they do not seem to attempt to update the snapshots in any way.

Unfortunately, in my use case the head is in a state that is impossible to close most of the time; right before a planned close, we reconcile it to a closable state. This means that whenever this happens, the head is completely doomed and unrecoverable.

Expected behavior

Ideally there would be some way to recover from such a situation. There does seem to be a snapshot that all the hydra-nodes agree on, but somehow the local ledger states start to deviate from each other. I suspect the easiest solution would be to allow some way to reset the nodes' ledger state to the most recently agreed-upon snapshot.
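
For illustration only, the reset idea can be sketched with a toy model. This is not hydra-node code: `Snapshot`, `ToyNode`, `apply_tx` and `reset_to_confirmed` are hypothetical names, and a UTxO is reduced to a plain string. The sketch assumes each node keeps the last snapshot that every party signed, separately from its local, unconfirmed view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for the last snapshot signed by all parties."""
    number: int
    utxo: frozenset


class ToyNode:
    """Toy node: a confirmed snapshot plus a local, unconfirmed ledger view."""

    def __init__(self, confirmed):
        self.confirmed = confirmed
        self.local_utxo = set(confirmed.utxo)
        self.pending = []  # transactions applied locally but never snapshotted

    def apply_tx(self, spent, produced):
        # Accept a transaction against the local view only.
        if spent not in self.local_utxo:
            raise ValueError(f"{spent!r} is not unspent in the local view")
        self.local_utxo.remove(spent)
        self.local_utxo.add(produced)
        self.pending.append((spent, produced))

    def reset_to_confirmed(self):
        # Discard the local view and return the dropped transactions so the
        # operator can decide whether to resubmit them.
        dropped = list(self.pending)
        self.local_utxo = set(self.confirmed.utxo)
        self.pending.clear()
        return dropped


# Two nodes accept conflicting spends of "a" and diverge; after the reset
# both are back on the confirmed snapshot.
snapshot = Snapshot(number=3, utxo=frozenset({"a", "b"}))
node1, node2 = ToyNode(snapshot), ToyNode(snapshot)
node1.apply_tx("a", "c")
node2.apply_tx("a", "d")
node1.reset_to_confirmed()
node2.reset_to_confirmed()
```

Returning the dropped transactions rather than silently discarding them is a design choice of this sketch: anything accepted after the last confirmed snapshot is lost by the reset and would have to be submitted again.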

@twwu123 twwu123 added the bug 🐛 Something isn't working label Jan 3, 2025
Contributor

noonio commented Jan 8, 2025

One of the main aims of #1468 is to solve this problem.

Let us know if you think that would help you! If so, it's on our roadmap very soon, so hopefully that will help :)

@noonio noonio mentioned this issue Jan 8, 2025
5 tasks
Author

twwu123 commented Jan 13, 2025

> One of the main aims of #1468 is to solve this problem.
>
> Let us know if you think that would help you! If so, it's on our roadmap very soon, so hopefully that will help :)

This will definitely help in terms of being able to fanout a subset of usable UTxOs. There may be something I'm missing, but does it also help with reconciling forked local states?

Collaborator

ch1bo commented Jan 15, 2025

@twwu123 Thanks for reporting this and evaluating the Hydra project in-depth!

You raise very valid points and your understanding is just right. The hydra-nodes seem to have reached different local views on what to snapshot and, at the same time, snapshot signing seems to have stalled in your situation. We have also referred to this case as the Hydra head being "stuck"; see, for example, all issues mentioning "stuck".

As you can see from past issues, this can happen if networking, persistence or version incompatibility prevents smooth off-chain protocol progress. Besides making the individual components more resilient to faults (e.g. by following up a successful experiment with #1720), we also discussed fallback mechanisms for situations where it's purely a technical hiccup, and not a loss of consensus, that is preventing progress, for example: #1284

Your expected behavior indicates that such a clearing of the diverging local views or reset to the last confirmed snapshot would be a solution for you?

Author

twwu123 commented Jan 15, 2025

> Your expected behavior indicates that such a clearing of the diverging local views or reset to the last confirmed snapshot would be a solution for you?

The ideal solution for me would be a way for one hydra-node to send its local state for everyone else to sign, so that everyone is synced again to one node's version of the ledger state.

If this isn't possible, then at the very least, resetting everyone's local state to the latest agreed-upon snapshot would be my second option.

My understanding is that both of these should be possible. However, the first option may have small security concerns, and might require some sort of manual "approval" of a specific state for every hydra-node.
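
A minimal sketch of what such a manual "approval" could look like, under loud assumptions: `state_digest` and `unanimously_approved` are hypothetical helpers, a UTxO is a plain string, and an approval is just the digest a participant says it agrees with. A real protocol would need signatures over the proposed state rather than bare digests; none of this is hydra-node's API.

```python
import hashlib


def state_digest(utxo):
    """Order-independent digest of a proposed ledger state (set of strings)."""
    h = hashlib.sha256()
    for item in sorted(utxo):
        h.update(item.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


def unanimously_approved(proposed_utxo, participants, approvals):
    """True only if every participant approved exactly the proposed state.

    `approvals` maps a participant name to the digest that participant
    approved; a missing or different digest counts as a refusal.
    """
    if not participants:
        return False
    digest = state_digest(proposed_utxo)
    return all(approvals.get(p) == digest for p in participants)
```

The point of comparing digests per participant is that no node adopts another node's state on trust: the proposed state only becomes the shared one once every party has explicitly agreed to that exact state.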

Collaborator

ch1bo commented Jan 15, 2025

> My understanding is that both of these should be possible. However, the first option may have small security concerns, and might require some sort of manual "approval" of a specific state for every hydra-node.

Yes, exactly. A node can't just trust another node by adopting their state.

That being said, we should make the snapshotting logic more defensive and retry more, as this could be an issue here too.
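
As a rough illustration of "retry more", here is a generic retry loop with exponential backoff. It is a sketch, not the hydra-node snapshotting logic: `retry_with_backoff` and the `request` callback (imagined here as "ask peers to sign the snapshot again and report whether it got confirmed") are hypothetical, and the delays are arbitrary.

```python
import time


def retry_with_backoff(request, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `request` until it returns True or `attempts` is exhausted.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    for attempt in range(attempts):
        if request():
            return True
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

Bounding the number of attempts matters here: retries can paper over a transient networking hiccup, but if the parties have genuinely stopped agreeing, the loop should give up and surface the problem instead of retrying forever.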
