Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

after restart, bbolt db failed to get all reachable pages #778

Open
timannguyen opened this issue Jun 26, 2024 · 20 comments
Open

after restart, bbolt db failed to get all reachable pages #778

timannguyen opened this issue Jun 26, 2024 · 20 comments

Comments

@timannguyen
Copy link

After restarting the process running with open telemetry file extension, we get

panic: freepages: failed to get all reachable pages (page 138577: multiple references (stack: [110806 65199 138577]))

goroutine 98 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2()
	go.etcd.io/[email protected]/db.go:1202 +0x8d
created by go.etcd.io/bbolt.(*DB).freepages in goroutine 69

I'll see if i can get the db pulled from the running machine. Sorry for no db blob.

is there any means of recovery from this issue ?

@ahrtr
Copy link
Member

ahrtr commented Jun 26, 2024

After restarting the process

Did the machine/VM restarted?

Do you have a detailed steps to reproduce this issue by any chance?

I'll see if i can get the db pulled from the running machine

Yes, it would be helpful if you could provide the db file.

is there any means of recovery from this issue ?

We have surgery commands, but again we need to the db file firstly.

@ahrtr
Copy link
Member

ahrtr commented Jun 27, 2024

This seems like also a potential issue of the freelist management, e.g a page was allocated twice, accordingly it's referenced by multiple places.

@timannguyen
Copy link
Author

I could not get an approval to provide the db file. sorry for the trouble.

@fuweid
Copy link
Member

fuweid commented Jul 2, 2024

Hi @timshannon just want to confirm that the issue show up after the host or vm where the process run has been restarted? Thanks

@timshannon
Copy link
Contributor

@fuweid I assume you meant to call out @timannguyen not @timshannon right?

@ahrtr
Copy link
Member

ahrtr commented Jul 2, 2024

@fuweid I assume you meant to call out @timannguyen not @timshannon right?

Right, sorry for the confusion.

@fuweid
Copy link
Member

fuweid commented Jul 2, 2024

sorry @timshannon for the confusion. I should ping @timannguyen

@timannguyen
Copy link
Author

@fuweid sorry for the delay, the process was restarted and the issue occurs.

@ahrtr
Copy link
Member

ahrtr commented Jul 8, 2024

@fuweid sorry for the delay, the process was restarted and the issue occurs.

Thanks for the feedback but the questions isn't answered. The question is "Did the machine/VM restarted right before the issue being occurred?".

Also please also provide the following info,

  • GOOS and version;
  • GOARCH;
  • filesystem;

I could not get an approval to provide the db file. sorry for the trouble.

Could you dump the db with all the value redacted?

./bbolt page  --all --format-value=redacted path-2-db

@lnnt
Copy link

lnnt commented Aug 28, 2024

@fuweid sorry for the delay, the process was restarted and the issue occurs.

Thanks for the feedback but the questions isn't answered. The question is "Did the machine/VM restarted right before the issue being occurred?".

Also please also provide the following info,

  • GOOS and version;
  • GOARCH;
  • filesystem;

I could not get an approval to provide the db file. sorry for the trouble.

Could you dump the db with all the value redacted?

./bbolt page  --all --format-value=redacted path-2-db

@ahrtr I met the same issue.
Here is my db dump:
db.txt

@lnnt
Copy link

lnnt commented Aug 29, 2024

@fuweid sorry for the delay, the process was restarted and the issue occurs.

Thanks for the feedback but the questions isn't answered. The question is "Did the machine/VM restarted right before the issue being occurred?".
Also please also provide the following info,

  • GOOS and version;
  • GOARCH;
  • filesystem;

I could not get an approval to provide the db file. sorry for the trouble.

Could you dump the db with all the value redacted?

./bbolt page  --all --format-value=redacted path-2-db

@ahrtr I met the same issue. Here is my db dump: db.txt

My machine met unexpected power down, then bbolt db say:

panic: freepages: failed to get all reachable pages (page 4271: multiple references (stack: [1079 3745 4271]))
goroutine 160 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2()
         go.etcd.io/[email protected]/db.go:1202 +0x8d
created by go.etcd.io/bbolt.(*DB).freepages in goroutine 159
         go.etcd.io/[email protected]/db.go:1200 +0x1e5

Here is my db file:
db.zip

@ahrtr
Copy link
Member

ahrtr commented Aug 29, 2024

Here is my db file:
db.zip

Thanks. I will take a look later.

@ahrtr
Copy link
Member

ahrtr commented Aug 30, 2024

My machine met unexpected power down, then bbolt db say:

panic: freepages: failed to get all reachable pages (page 4271: multiple references (stack: [1079 3745 4271]))
goroutine 160 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2()
         go.etcd.io/[email protected]/db.go:1202 +0x8d
created by go.etcd.io/bbolt.(*DB).freepages in goroutine 159
         go.etcd.io/[email protected]/db.go:1200 +0x1e5

Here is my db file: db.zip

The last transaction before the unexpected poweroff somehow corrupted the db. One possible reason is some page (i.e. 1079) wasn't successfully synced to disk even the fdatasync systemcall returned successfully.

The workaround to fix the corrupted db is to rollback the last transaction,

$ ./bbolt surgery revert-meta-page ./db.db --output ./new.db
The meta page is reverted.

$ ./bbolt check ./new.db 
OK

Reference

F5WPP69Q7H:bbolt wachao$ pwd
/Users/wachao/go/src/github.com/ahrtr/bbolt
F5WPP69Q7H:bbolt wachao$ ./bbolt page db.db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=5150>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=11865>
Txn ID:     10789638
Checksum:   4e1e8a40b76f0de1

F5WPP69Q7H:bbolt wachao$ ./bbolt page db.db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=11036>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=11865>
Txn ID:     10789637
Checksum:   bb9a5cce3b6710a5

F5WPP69Q7H:bbolt wachao$ ./bbolt page ./db.db 5150
Page ID:    5150
Page Type:  leaf
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 10

"alarm": <pgid=0,seq=0>
"auth": <pgid=0,seq=0>
"authRoles": <pgid=0,seq=0>
"authUsers": <pgid=0,seq=0>
"cluster": <pgid=0,seq=0>
"key": <pgid=1079,seq=0>
"lease": <pgid=6891,seq=0>
"members": <pgid=0,seq=0>
"members_removed": <pgid=0,seq=0>
"meta": <pgid=0,seq=0>

F5WPP69Q7H:bbolt wachao$ ./bbolt page ./db.db 1079
Page ID:    1079
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 52

0000000000fd22985f0000000000000000: <pgid=4271>
0000000000fd229d5f0000000000000000: <pgid=3874>
0000000000fd22a15f0000000000000000: <pgid=3875>
0000000000fd22a55f0000000000000000: <pgid=4288>
0000000000fd22aa5f0000000000000000: <pgid=7785>
0000000000fd22ae5f0000000000000000: <pgid=1132>
0000000000fd22b35f0000000000000000: <pgid=8839>
0000000000fd22b75f0000000000000000: <pgid=10820>
0000000000fd22bc5f0000000000000000: <pgid=11084>
0000000000fd22c15f0000000000000000: <pgid=6097>
0000000000fd22c65f0000000000000000: <pgid=4104>
0000000000fd22cb5f0000000000000000: <pgid=7786>
0000000000fd22d15f0000000000000000: <pgid=10821>
0000000000fd22d55f0000000000000000: <pgid=9570>
0000000000fd22d95f0000000000000000: <pgid=2861>
0000000000fd22de5f0000000000000000: <pgid=5190>
0000000000fd22e35f0000000000000000: <pgid=9571>
0000000000fd22e75f0000000000000000: <pgid=7264>
0000000000fd22eb5f0000000000000000: <pgid=2860>
0000000000fd22f15f0000000000000000: <pgid=7787>
0000000000fd22f55f0000000000000000: <pgid=11035>
0000000000fd22fb5f0000000000000000: <pgid=7084>
0000000000fd22ff5f0000000000000000: <pgid=7265>
0000000000fd23055f0000000000000000: <pgid=10822>
0000000000fd230a5f0000000000000000: <pgid=2907>
0000000000fd230e5f0000000000000000: <pgid=2906>
0000000000fd23135f0000000000000000: <pgid=2151>
0000000000fd23175f0000000000000000: <pgid=9010>
0000000000fd231c5f0000000000000000: <pgid=2150>
0000000000fd23205f0000000000000000: <pgid=7788>
0000000000fd23255f0000000000000000: <pgid=11185>
0000000000fd232a5f0000000000000000: <pgid=2213>
0000000000fd232e5f0000000000000000: <pgid=2212>
0000000000fd23345f0000000000000000: <pgid=2629>
0000000000fd23385f0000000000000000: <pgid=5821>
0000000000fd233d5f0000000000000000: <pgid=2628>
0000000000fd23425f0000000000000000: <pgid=7639>
0000000000fd23465f0000000000000000: <pgid=5820>
0000000000fd234c5f0000000000000000: <pgid=419>
0000000000fd23505f0000000000000000: <pgid=11187>
0000000000fd23565f0000000000000000: <pgid=11186>
0000000000fd235a5f0000000000000000: <pgid=8300>
0000000000fd235f5f0000000000000000: <pgid=7630>
0000000000fd23635f0000000000000000: <pgid=10824>
0000000000fd23675f0000000000000000: <pgid=10823>
0000000000fd236b5f0000000000000000: <pgid=626>
0000000000fd23705f0000000000000000: <pgid=9930>
0000000000fd23755f0000000000000000: <pgid=7093>
0000000000fd237b5f0000000000000000: <pgid=7789>
0000000000fd237f5f0000000000000000: <pgid=4512>
0000000000fd23855f0000000000000000: <pgid=9931>
0000000000fd238a5f0000000000000000: <pgid=3745>

F5WPP69Q7H:bbolt wachao$ 
F5WPP69Q7H:bbolt wachao$ 
F5WPP69Q7H:bbolt wachao$ ./bbolt page ./db.db 3745
Page ID:    3745
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 68

0000000000fd22985f0000000000000000: <pgid=4271>
0000000000fd229d5f0000000000000000: <pgid=3874>
0000000000fd22a15f0000000000000000: <pgid=3875>
0000000000fd22a55f0000000000000000: <pgid=4288>
0000000000fd22aa5f0000000000000000: <pgid=7785>
0000000000fd22ae5f0000000000000000: <pgid=1132>
0000000000fd22b35f0000000000000000: <pgid=8839>
0000000000fd22b75f0000000000000000: <pgid=10820>
0000000000fd22bc5f0000000000000000: <pgid=11084>
0000000000fd22c15f0000000000000000: <pgid=6097>
0000000000fd22c65f0000000000000000: <pgid=4104>
0000000000fd22cb5f0000000000000000: <pgid=7786>
0000000000fd22d15f0000000000000000: <pgid=10821>
0000000000fd22d55f0000000000000000: <pgid=9570>
0000000000fd22d95f0000000000000000: <pgid=2861>
0000000000fd22de5f0000000000000000: <pgid=5190>
0000000000fd22e35f0000000000000000: <pgid=9571>
0000000000fd22e75f0000000000000000: <pgid=7264>
0000000000fd22eb5f0000000000000000: <pgid=2860>
0000000000fd22f15f0000000000000000: <pgid=7787>
0000000000fd22f55f0000000000000000: <pgid=11035>
0000000000fd22fb5f0000000000000000: <pgid=7084>
0000000000fd22ff5f0000000000000000: <pgid=7265>
0000000000fd23055f0000000000000000: <pgid=10822>
0000000000fd230a5f0000000000000000: <pgid=2907>
0000000000fd230e5f0000000000000000: <pgid=2906>
0000000000fd23135f0000000000000000: <pgid=2151>
0000000000fd23175f0000000000000000: <pgid=9010>
0000000000fd231c5f0000000000000000: <pgid=2150>
0000000000fd23205f0000000000000000: <pgid=7788>
0000000000fd23255f0000000000000000: <pgid=11185>
0000000000fd232a5f0000000000000000: <pgid=2213>
0000000000fd232e5f0000000000000000: <pgid=2212>
0000000000fd23345f0000000000000000: <pgid=2629>
0000000000fd23385f0000000000000000: <pgid=5821>
0000000000fd233d5f0000000000000000: <pgid=2628>
0000000000fd23425f0000000000000000: <pgid=7639>
0000000000fd23465f0000000000000000: <pgid=5820>
0000000000fd234c5f0000000000000000: <pgid=419>
0000000000fd23505f0000000000000000: <pgid=11187>
0000000000fd23565f0000000000000000: <pgid=11186>
0000000000fd235a5f0000000000000000: <pgid=8300>
0000000000fd235f5f0000000000000000: <pgid=7630>
0000000000fd23635f0000000000000000: <pgid=10824>
0000000000fd23675f0000000000000000: <pgid=10823>
0000000000fd236b5f0000000000000000: <pgid=626>
0000000000fd23705f0000000000000000: <pgid=9930>
0000000000fd23755f0000000000000000: <pgid=7093>
0000000000fd237b5f0000000000000000: <pgid=7789>
0000000000fd237f5f0000000000000000: <pgid=4512>
0000000000fd23855f0000000000000000: <pgid=9931>
0000000000fd238a5f0000000000000000: <pgid=2496>
0000000000fd238f5f0000000000000000: <pgid=4526>
0000000000fd23935f0000000000000000: <pgid=10422>
0000000000fd23985f0000000000000000: <pgid=6237>
0000000000fd239c5f0000000000000000: <pgid=11188>
0000000000fd23a05f0000000000000000: <pgid=10423>
0000000000fd23a65f0000000000000000: <pgid=1306>
0000000000fd23ac5f0000000000000000: <pgid=8301>
0000000000fd23b15f0000000000000000: <pgid=1305>
0000000000fd23b55f0000000000000000: <pgid=6890>
0000000000fd23b95f0000000000000000: <pgid=9932>
0000000000fd23be5f0000000000000000: <pgid=2475>
0000000000fd23c25f0000000000000000: <pgid=8302>
0000000000fd23c85f0000000000000000: <pgid=781>
0000000000fd23cc5f0000000000000000: <pgid=456>
0000000000fd23d05f0000000000000000: <pgid=9933>
0000000000fd23d65f0000000000000000: <pgid=2252>

@ahrtr
Copy link
Member

ahrtr commented Aug 30, 2024

@timshannon @lnnt Could you also share your systeminfo using https://github.com/ahrtr/systeminfo?

@timshannon
Copy link
Contributor

@timannguyen

@lnnt
Copy link

lnnt commented Aug 30, 2024

@timshannon @lnnt Could you also share your systeminfo using https://github.com/ahrtr/systeminfo?

cpu.IsBigEndian: false
os.Getpagesize(): 4096
runtime.GOOS: linux
runtime.GOARCH: amd64

@ahrtr
Copy link
Member

ahrtr commented Sep 4, 2024

As mentioned in previous comment, the last transaction (10789638) before the unexpected poweroff somehow corrupted the db.

After rolling back the last transaction, the db is back to normal again. The max revision under the last good state is 16589739

old (good state before TXN 10789638):
Root: 11036 --> 6750 (leaf: key, index: 22/22) --> 3745 (index: 57/67) --> 1306 (index: 5/5, rev: 16589739)

The etcd-dump-db scan-keys command was added in etcd-io/etcd#18539

$ ./etcd-dump-db scan-keys ~/tmp/etcd/778/db.db 16589739 2>/dev/null | grep "/registry/configmaps/istio-system/istio-namespace-controller-election"
pageID=1306, index=5, rev={Revision:{Main:16589739 Sub:0} tombstone:false}, value=[key "/registry/configmaps/istio-system/istio-namespace-controller-election" | val "k8s\x00\n\x0f\n\x02v1\x12\tConfigMap\x12\xeb\x03\n\xe8\x03\n#istio-namespace-controller-election\x12\x00\x1a\fistio-system\"\x00*$bb696087-260d-4167-bf06-17d3361f9b5f2\x008\x00B\b\b\x9e\xbe\xed\xb5\x06\x10\x00b\xe6\x01\n(control-plane.alpha.kubernetes.io/leader\x12\xb9\x01{\"holderIdentity\":\"istiod-d56968787-txq2d\",\"holderKey\":\"default\",\"leaseDurationSeconds\":30,\"acquireTime\":\"2024-08-13T13:26:54Z\",\"renewTime\":\"2024-08-27T06:16:13Z\",\"leaderTransitions\":0}\x8a\x01\x90\x01\n\x0fpilot-discovery\x12\x06Update\x1a\x02v1\"\b\b\xad\u07b5\xb6\x06\x10\x002\bFieldsV1:[\nY{\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:control-plane.alpha.kubernetes.io/leader\":{}}}}B\x00\x1a\x00\"\x00" | created 9612546 | mod 16589739 | ver 157604]
pageID=4737, index=4, rev={Revision:{Main:16589786 Sub:0} tombstone:false}, value=[key "/registry/configmaps/istio-system/istio-namespace-controller-election" | val "k8s\x00\n\x0f\n\x02v1\x12\tConfigMap\x12\xeb\x03\n\xe8\x03\n#istio-namespace-controller-election\x12\x00\x1a\fistio-system\"\x00*$bb696087-260d-4167-bf06-17d3361f9b5f2\x008\x00B\b\b\x9e\xbe\xed\xb5\x06\x10\x00b\xe6\x01\n(control-plane.alpha.kubernetes.io/leader\x12\xb9\x01{\"holderIdentity\":\"istiod-d56968787-txq2d\",\"holderKey\":\"default\",\"leaseDurationSeconds\":30,\"acquireTime\":\"2024-08-13T13:26:54Z\",\"renewTime\":\"2024-08-27T06:16:21Z\",\"leaderTransitions\":0}\x8a\x01\x90\x01\n\x0fpilot-discovery\x12\x06Update\x1a\x02v1\"\b\b\xb5\u07b5\xb6\x06\x10\x002\bFieldsV1:[\nY{\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:control-plane.alpha.kubernetes.io/leader\":{}}}}B\x00\x1a\x00\"\x00" | created 9612546 | mod 16589786 | ver 157605]
new (bad state after TXN 10789638):
Root: 5150  --> 1079 (leaf: key, index: ?) --> ??? --> 4737 (index: 4/4, rev: 16589786)

The problem is the leaf page 4737 and root page 5150 were persisted successfully, but some branch pages were not. As far as I can tell, most likely it's the file system's issue.

  • Note that even after TXN 10789638, the page 11036, 6750, 3745 and 1306 all kept unchanged. It's expected behavior.
  • After rolling back TXN 10789638, the db is back to normal. It means all the existing pages (including the ones mentioned above) kept unchanged. It's expected.
  • The page 1079 is obviously in a bad state. The problem is that its parent (5150) was successfully persisted to disk, but It wasn't. So its parent points to a bad page, which is exactly the reason for the corruption.
  • Note that all 5150, 1079, and 4737 were free pages before executing TXN 10789638.

As a followup, it's helpful if we can figure out the missing intermediate branch page between 1079 and 4737. We need to add a command something like "find-parent", e.g. find the parent of 4737 in this example.

@serathius
Copy link
Member

@ahrtr 2 questions:

  • Would it be possible to reproduce this case using LazyFS to have a solid understanding of the issue and maybe consider making bbolt resiliant to it?
  • Could we consider bbolt automatically reverting to a previous TXN if data on the tree on current TXN is broken?

@ahrtr
Copy link
Member

ahrtr commented Oct 4, 2024

  • Would it be possible to reproduce this case using LazyFS to have a solid understanding of the issue and maybe consider making bbolt resiliant to it?

We already introduced dmflakey, which is similar to LazyFS, and without the possible downside of LazyFS, which (LazyFS) may not respect the fdatasync semantics. Please refer to fuweid/go-dmflakey#1

Could we consider bbolt automatically reverting to a previous TXN if data on the tree on current TXN is broken?

Yes, It's kind of good point. I already thought about it, but not automatically done by bbolt. Based on the data corruption cases in the past around 3 years, IIRC, only minor parts could be fixed by bbolt surgery revert-meta-page. Also we are trying to simplify the code base instead of adding more logic into it, especially for such upper layer or users' logic. Instead, I would suggest to provide a guide for users to manually try it for now. We will revisit this after we have more confidence on the data corruption issue.

It also has downside... there may be data loss after reverting to previous TXN.

@serathius
Copy link
Member

serathius commented Oct 4, 2024

We already introduced dmflakey, which is similar to LazyFS

Looks like issue is closed. Do you mean #568? Repeating the ask, could we cover this exact scenario using dmflakey?

It also has downside... there may be data loss after reverting to previous TXN.

Just asking for this to be considered, because I would want to better understand the risk of revert. I think high level question is, is there a chance of previous transaction to be silently broken if the latest transaction was corrupted?

Don't think this needs to be a default option, might just be a feature in etcd as I think it would benefit it's ability to recover. For systems like etcd, bbolt is not the source of truth, just the kv database. We can still recover the last transaction from WAL. For cases where only the latest transaction was corrupted, instead of etcd giving up the bootstrap, we would just revert it and replay the transaction again from the WAL.

If the last transaction can be trusted I would propose that etcd should always try to revert meta page if the state of btree on the latest meta page is broken. But before implementing this we should verify and test health of previous transaction under different corruption scenarios.

@github-actions github-actions bot added the stale label Jan 3, 2025
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 25, 2025
@ahrtr ahrtr reopened this Jan 25, 2025
@ahrtr ahrtr removed the stale label Jan 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

6 participants