dragonboat silently ignores statemachines with missed data #369
Comments
@antmat thanks for raising the issue. My understanding is that you have a snapshot applied into the SM, let's say at index 100, and after a restart of the replica your SM returns a last applied index lower than 100. You expect the snapshot to be applied again as part of the recovery procedure, but that didn't happen. Am I correct?
I think this issue is related to #156. You may want to have a look. Please note that what you described as "erase some unsynced data from state machine" really amounts to corrupting the SM. I agree that, should such corruption exist in the SM, dragonboat should just panic to report it. Regarding "shrunk" snapshots, they are snapshots with their SM content removed. Once a snapshot is applied into the SM, you no longer need two copies of the same data (one in the snapshot, one in the SM) sitting on the disk, so there is a procedure in dragonboat that removes the actual content of the snapshot and keeps only some tiny metadata. Such a compacted snapshot is considered "shrunk". Shrunk snapshots still need to be applied, as that metadata is really a part of the SM state: it is the SM state managed by dragonboat itself.
Yes, that's right. It is related to #156, thanks for the hint.
In my case I'm talking about my implementation of IOnDiskStateMachine. For simplicity, you can imagine a state machine that just writes indexes into a file and periodically syncs it. The file was synced at index, say, 33, then several entries were applied and a power failure occurred before the next sync. The SM reports OnDiskIndex 33 and the snapshot has, for example, 35. In this case my SM state is not corrupted, just incomplete. Would this implementation be correct? Anyway, it looks like a severe issue; maybe it's possible to backport the fix to v3? And by the way, how stable are master and the v4 version? Thanks in advance.
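To make the scenario concrete, here is a minimal sketch of such a state machine. It is a toy model only: `toySM`, `simulate` and all method names are hypothetical and do not implement dragonboat's IOnDiskStateMachine interface; the sync and the power failure are simulated in memory.

```go
package main

import "fmt"

// toySM is a hypothetical stand-in for an on-disk state machine.
// It is NOT dragonboat's IOnDiskStateMachine; it only tracks indexes.
type toySM struct {
	applied uint64 // last index applied in memory (lost on power failure)
	synced  uint64 // last index known to be durable on disk
}

// Update applies a single entry index in memory only.
func (s *toySM) Update(index uint64) { s.applied = index }

// Sync makes everything applied so far durable.
func (s *toySM) Sync() { s.synced = s.applied }

// Crash simulates a power failure: unsynced progress is lost.
func (s *toySM) Crash() { s.applied = s.synced }

// OnDiskIndex is what the state machine would report after a restart.
func (s *toySM) OnDiskIndex() uint64 { return s.synced }

// simulate applies entries 1..last, syncs once at syncAt, then crashes
// and returns the index reported after the restart.
func simulate(last, syncAt uint64) uint64 {
	sm := &toySM{}
	for i := uint64(1); i <= last; i++ {
		sm.Update(i)
		if i == syncAt {
			sm.Sync()
		}
	}
	sm.Crash()
	return sm.OnDiskIndex()
}

func main() {
	const snapshotIndex = 35
	onDisk := simulate(snapshotIndex, 33)
	fmt.Printf("on-disk index %d, snapshot index %d, missing %d entries\n",
		onDisk, snapshotIndex, snapshotIndex-onDisk)
}
```

With the numbers from the example, the state machine comes back at index 33 while the snapshot is at 35, so two entries are missing from the SM without the SM itself being corrupted.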
@antmat After checking #156 and your reply, I have the feeling that the problem you described is identical to the one addressed in #156; the issue you experienced was fixed in the master branch a long time ago. v4 is pretty stable: it is the version I am using in two of my projects, and it is also the version used in some other projects that I am familiar with. For example, I know a project using v4 with dozens of devs and a dedicated testing team; they have been using v4 for quite a long time and no stability-related issues have ever been reported. Please try v4 if you like. The only reason v4 has never been released is that I have the feeling I might be able to further simplify the APIs a little bit.
Thanks, we will try to migrate to v4, but currently we have an installation on v3.
The error you saw is related to the NodeHostID change. Are you using the gossip feature? Back in release-3.3, the NodeHostID values required by the gossip feature were just random 8-byte integers. This was changed to UUIDs in v4. I don't have a dedicated tool to convert and migrate the data at this stage, as the integer NodeHostID was pretty short-lived and it is the only breaking data change. The good news is that almost all the code required to do the conversion is available in the repo. The goal is to convert all existing integer NodeHostID values to UUID values as defined in internal/id/nhid.go in the master branch. You first need to convert the NODEHOST.ID file in your data folder. It is a so-called "flag file" (see internal/fileutil/utils.go for the CreateFlagFile and GetFlagFileContent functions) with the NodeHostID value stored in it (see internal/server/environment.go for how NODEHOST.ID is stored and loaded). You should be able to easily manipulate such NODEHOST.ID files using the functions available in those two .go files. Then you need to convert all the NodeHostID values stored in your snapshots. They are in the metadata part of your snapshots, which is also stored as "flag files". You basically need to load the existing snapshot from disk, specify how you want to change the involved raft membership info, and store the converted snapshot back to overwrite the existing one. tools/import.go does something very similar; please have a look at the godoc of the ImportSnapshot function. Please feel free to post your questions here, I will be more than happy to help. Good luck.
No, we are not using it, so all membership info is stored as hostnames. However, it seems that the node ID is generated regardless of whether it is used. It looks like in this case it is enough to just assign a new node ID. Thanks for your help, I'll try.
I have not checked the code yet, but I guess you can just delete that NODEHOST.ID file and see whether that helps.
Dragonboat version
v3.3.8
Expected behavior
If you start an OnDiskStateMachine that reports a lower index than that of a previously applied snapshot, either snapshot recovery should happen or a panic about the inconsistency should be raised.
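The expectation can be sketched as a small decision function. This is only an illustration of the behavior requested here, not dragonboat's actual recovery logic: `decide`, `recoveryAction` and the three constants are hypothetical names, and the `shrunk` flag models a snapshot whose SM content has been removed and therefore cannot be used to restore data.

```go
package main

import "fmt"

// recoveryAction is a hypothetical type describing what should happen
// on restart; none of these names exist in dragonboat.
type recoveryAction int

const (
	skipRecovery        recoveryAction = iota // SM is already at or past the snapshot
	recoverFromSnapshot                       // SM is behind and the snapshot has content
	reportInconsistency                       // SM is behind but the snapshot has no content
)

// decide compares the index reported by the on-disk SM with the index of
// the most recently applied snapshot and picks an action.
func decide(onDiskIndex, snapshotIndex uint64, shrunk bool) recoveryAction {
	if onDiskIndex >= snapshotIndex {
		return skipRecovery
	}
	if shrunk {
		// Nothing left in the snapshot to recover from: fail loudly.
		return reportInconsistency
	}
	return recoverFromSnapshot
}

func main() {
	fmt.Println(decide(100, 100, true) == skipRecovery)
	fmt.Println(decide(0, 100, false) == recoverFromSnapshot)
	fmt.Println(decide(0, 100, true) == reportInconsistency)
}
```

The point of the sketch is that the "SM is behind the snapshot" branch never falls through silently: it either recovers or reports the problem.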
Actual behavior
It does nothing.
Steps to reproduce the behavior
Apply some operations to an on-disk state machine, add a new node, stop that node, erase some unsynced data from the state machine (I tried erasing all data, leaving the state machine index at 0), then start it again. Snapshot recovery is silently skipped.
It looks like the problem is on that line https://github.com/lni/dragonboat/blob/release-3.3/internal/rsm/statemachine.go#L298
which is missing in trunk. After that we just skip snapshot recovery and set the new index, leaving the state machine without some operations applied.
I also tried just commenting out those lines, but it looks like it is impossible to recover from a Shrunk (what does that mean?) snapshot, even though recovery is still required. In any case, it looks like those situations should be made observable somehow.
Could you please take a look?
Thanks in advance