Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken Bookie #2528

Open
thinker0 opened this issue Jan 7, 2021 · 8 comments
Open

Broken Bookie #2528

thinker0 opened this issue Jan 7, 2021 · 8 comments
Labels

Comments

@thinker0
Copy link
Member

thinker0 commented Jan 7, 2021

BUG REPORT

Describe the bug

Broken bookie

To Reproduce

Expected behavior

23:38:25.443 [BookieWriteThreadPool-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.bookie.SortedLedgerStorage - Reached size CheckpointList{checkpoints=[LogMark: logFileId - 1609816262610 , logFileOffset - 720319488]}
23:38:25.443 [SortedLedgerStorage-0] INFO  org.apache.bookkeeper.bookie.SortedLedgerStorage - Started flushing mem table.
23:39:06.854 [GarbageCollectorThread-8-1] INFO  org.apache.bookkeeper.bookie.GarbageCollectorThread - Extracting entry log meta from entryLogId: 482
23:39:06.854 [GarbageCollectorThread-8-1] INFO  org.apache.bookkeeper.bookie.EntryLogger - Failed to get ledgers map index from: 482.log : Cannot deserialize ledgers map from ledger 158913789951
23:39:06.854 [GarbageCollectorThread-8-1] ERROR org.apache.bookkeeper.common.util.SafeRunnable - Unexpected throwable caught 
java.lang.IllegalArgumentException: Negative position
	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:785) ~[?:?]
	at org.apache.bookkeeper.bookie.BufferedReadChannel.read(BufferedReadChannel.java:93) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.BufferedReadChannel.read(BufferedReadChannel.java:65) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.EntryLogger.readFromLogChannel(EntryLogger.java:418) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.EntryLogger.scanEntryLog(EntryLogger.java:996) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.EntryLogger.extractEntryLogMetadataByScanning(EntryLogger.java:1137) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.EntryLogger.getEntryLogMetadata(EntryLogger.java:1046) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.extractMetaFromEntryLogs(GarbageCollectorThread.java:557) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.runWithFlags(GarbageCollectorThread.java:328) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.safeRun(GarbageCollectorThread.java:309) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
	at java.lang.Thread.run(Thread.java:830) [?:?]
23:39:11.802 [BookieWriteThreadPool-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.bookie.SortedLedgerStorage - Reached size CheckpointList{checkpoints=[LogMark: logFileId - 1609816262610 , logFileOffset - 790255616]}

A clear and concise description of what you expected to happen.

Screenshots
image

21:28:50.832 [BookieWriteThreadPool-OrderedExecutor-0-0] WARN  org.apache.bookkeeper.proto.WriteEntryProcessor - BookieServer is running in readonly mode, so rejecting the request from the client!
21:28:50.832 [BookieWriteThreadPool-OrderedExecutor-0-0] WARN  org.apache.bookkeeper.proto.WriteEntryProcessor - BookieServer is running in readonly mode, so rejecting the request from the client!
21:28:50.832 [BookieWriteThreadPool-OrderedExecutor-0-0] WARN  org.apache.bookkeeper.proto.WriteEntryProcessor - BookieServer is running in readonly mode, so rejecting the request from the client!
21:28:51.128 [SyncThread-7-1] INFO  org.apache.bookkeeper.bookie.SyncThread - Flush ledger storage at checkpoint CheckpointList{checkpoints=[LogMark: logFileId - 1609816262598 , logFileOffset - 550679552]}.
21:28:51.128 [component-shutdown-thread] INFO  org.apache.bookkeeper.bookie.Journal - Shutting down Journal
21:28:51.129 [ForceWriteThread] INFO  org.apache.bookkeeper.bookie.Journal - ForceWrite thread interrupted
21:28:51.130 [BookieJournal-31117] INFO  org.apache.bookkeeper.bookie.Journal - Journal exits when shutting down
21:28:51.130 [BookieJournal-31117] INFO  org.apache.bookkeeper.bookie.Journal - Journal exited loop!
21:28:51.131 [component-shutdown-thread] INFO  org.apache.bookkeeper.bookie.Journal - Finished Shutting down Journal thread
21:28:51.131 [Bookie-31117] INFO  org.apache.bookkeeper.bookie.Bookie - Journal thread(s) quit.
21:28:51.132 [component-shutdown-thread] INFO  org.apache.bookkeeper.bookie.InterleavedLedgerStorage - Shutting down InterleavedLedgerStorage
21:28:51.132 [component-shutdown-thread] INFO  org.apache.bookkeeper.bookie.InterleavedLedgerStorage - Shutting down GC thread
21:28:51.132 [component-shutdown-thread] INFO  org.apache.bookkeeper.bookie.GarbageCollectorThread - Shutting down GarbageCollectorThread
21:28:51.133 [component-shutdown-thread] INFO  org.apache.bookkeeper.bookie.InterleavedLedgerStorage - Shutting down entry logger
21:28:51.133 [component-shutdown-thread] INFO  org.apache.bookkeeper.bookie.EntryLogger - Stopping EntryLogger
21:28:51.148 [component-shutdown-thread] INFO  org.apache.bookkeeper.bookie.EntryLoggerAllocator - Stopped entry logger preallocator.
21:28:51.169 [component-shutdown-thread] INFO  org.apache.bookkeeper.bookie.InterleavedLedgerStorage - Complete shutting down Ledger Storage
21:28:51.169 [component-shutdown-thread] INFO  org.apache.bookkeeper.bookie.LedgerDirsMonitor - Shutting down LedgerDirsMonitor
21:28:51.273 [component-shutdown-thread] INFO  org.apache.zookeeper.ZooKeeper - Session: 0x576ab041718049d closed
21:28:51.273 [main-EventThread] INFO  org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x576ab041718049d
21:28:51.322 [component-shutdown-thread] INFO  org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@5c73f672{HTTP/1.1, (http/1.1)}{0.0.0.0:31114}
21:28:51.349 [component-shutdown-thread] INFO  org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@3a01773b{/,null,STOPPED}
21:28:51.351 [component-shutdown-thread] INFO  org.apache.bookkeeper.common.component.ComponentStarter - Closed component bookie-server in shutdown hook successfully. Exiting.
[2021-01-06T21:28:51.352+0900][119872.452s][info][gc,heap,exit ] Heap
[2021-01-06T21:28:51.352+0900][119872.452s][info][gc,heap,exit ]  garbage-first heap   total 8388608K, used 2609464K [0x0000000600000000, 0x0000000800000000)
[2021-01-06T21:28:51.352+0900][119872.452s][info][gc,heap,exit ]   region size 4096K, 487 young (1994752K), 2 survivors (8192K)
[2021-01-06T21:28:51.352+0900][119872.452s][info][gc,heap,exit ]  Metaspace       used 40794K, capacity 41694K, committed 41856K, reserved 1087488K
[2021-01-06T21:28:51.352+0900][119872.452s][info][gc,heap,exit ]   class space    used 4335K, capacity 4728K, committed 4736K, reserved 1048576K

Additional context
Restarted bookie instance

@thinker0
Copy link
Member Author

One pulsar-2.8.0 Bookies node was a problem.
pulsar-2.7.2 downgrade

18:00:37.715 [SortedLedgerStorage-0] INFO  org.apache.bookkeeper.bookie.EntryLogManagerForSingleEntryLog - Rolling entry logger since it reached size limitation
18:00:37.715 [SortedLedgerStorage-0] INFO  org.apache.bookkeeper.bookie.EntryLogManagerBase - Creating a new entry log file due to reaching log limit after flushing memtable : logIdBeforeFlush = 12784, logIdAfterFlush = 12784
18:00:37.716 [SortedLedgerStorage-0] INFO  org.apache.bookkeeper.bookie.EntryLogManagerBase - Flushing entry logger 12784 back to filesystem, pending for syncing entry loggers : [BufferedChannel{logId=12784, logFile=/data/bookkeeper/ledgers/current/31f0.log, ledgerIdAssigned=-1}].
18:00:37.716 [SyncThread-7-1] INFO  org.apache.bookkeeper.bookie.EntryLogManagerBase - Creating a new entry log file due to preparing checkpoint : numBytesFlushed = 119500
18:00:37.716 [pool-5-thread-1] INFO  org.apache.bookkeeper.bookie.EntryLoggerAllocator - Created new entry log file /data/bookkeeper/ledgers/current/31f2.log for logId 12786.
18:00:37.717 [SyncThread-7-1] INFO  org.apache.bookkeeper.bookie.EntryLogManagerBase - Flushing entry logger 12785 back to filesystem, pending for syncing entry loggers : [BufferedChannel{logId=12784, logFile=/data/bookkeeper/ledgers/current/31f0.log, ledgerIdAssigned=-1}, BufferedChannel{logId=12785, logFile=/data/bookkeeper/ledgers/current/31f1.log, ledgerIdAssigned=-1}].
18:00:37.717 [pool-5-thread-1] INFO  org.apache.bookkeeper.bookie.EntryLoggerAllocator - Created new entry log file /data/bookkeeper/ledgers/current/31f3.log for logId 12787.
18:00:37.757 [SyncThread-7-1] INFO  org.apache.bookkeeper.bookie.EntryLogManagerForSingleEntryLog - Synced entry logger 12784 to disk.
18:00:37.766 [SyncThread-7-1] INFO  org.apache.bookkeeper.bookie.EntryLogManagerForSingleEntryLog - Synced entry logger 12785 to disk.
18:01:02.256 [GarbageCollectorThread-8-1] INFO  org.apache.bookkeeper.bookie.GarbageCollectorThread - Extracting entry log meta from entryLogId: 11410
18:01:02.256 [GarbageCollectorThread-8-1] INFO  org.apache.bookkeeper.bookie.EntryLogger - Failed to get ledgers map index from: 11410.log : Cannot deserialize ledgers map from ledger 433791696895
18:01:02.271 [GarbageCollectorThread-8-1] ERROR org.apache.bookkeeper.common.util.SafeRunnable - Unexpected throwable caught 
java.lang.IllegalArgumentException: Negative position
	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:785) ~[?:?]
	at org.apache.bookkeeper.bookie.BufferedReadChannel.read(BufferedReadChannel.java:93) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
	at org.apache.bookkeeper.bookie.BufferedReadChannel.read(BufferedReadChannel.java:65) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
	at org.apache.bookkeeper.bookie.EntryLogger.readFromLogChannel(EntryLogger.java:418) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
	at org.apache.bookkeeper.bookie.EntryLogger.scanEntryLog(EntryLogger.java:996) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
	at org.apache.bookkeeper.bookie.EntryLogger.extractEntryLogMetadataByScanning(EntryLogger.java:1136) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
	at org.apache.bookkeeper.bookie.EntryLogger.getEntryLogMetadata(EntryLogger.java:1045) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.extractMetaFromEntryLogs(GarbageCollectorThread.java:607) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.runWithFlags(GarbageCollectorThread.java:348) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
	at org.apache.bookkeeper.bookie.GarbageCollectorThread.safeRun(GarbageCollectorThread.java:329) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.14.1.jar:4.14.1]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
	at java.lang.Thread.run(Thread.java:830) [?:?]

@Vanlightly
Copy link
Contributor

Vanlightly commented Jul 23, 2021 via email

@hamadodene
Copy link

We encountered the same problem in production. We have seen several Exceptions such as:

21-09-29-02-11-13       Failed to compact entry log 8277 due to unexpected error
21-09-29-02-11-13       java.lang.IllegalArgumentException: Negative position
java.lang.IllegalArgumentException: Negative position
        at java.base/sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:785)
        at org.apache.bookkeeper.bookie.BufferedReadChannel.read(BufferedReadChannel.java:93)
        at org.apache.bookkeeper.bookie.BufferedReadChannel.read(BufferedReadChannel.java:65)
        at org.apache.bookkeeper.bookie.EntryLogger.readFromLogChannel(EntryLogger.java:418)
        at org.apache.bookkeeper.bookie.EntryLogger.scanEntryLog(EntryLogger.java:996)
        at org.apache.bookkeeper.bookie.EntryLogCompactor.compact(EntryLogCompactor.java:61)
        at org.apache.bookkeeper.bookie.GarbageCollectorThread.compactEntryLog(GarbageCollectorThread.java:518)
        at org.apache.bookkeeper.bookie.GarbageCollectorThread.doCompactEntryLogs(GarbageCollectorThread.java:455)
        at org.apache.bookkeeper.bookie.GarbageCollectorThread.runWithFlags(GarbageCollectorThread.java:360)
        at org.apache.bookkeeper.bookie.GarbageCollectorThread.safeRun(GarbageCollectorThread.java:309)
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:834)

After several days we have The Herddb service which bases its replication on bookkeeper had to recover its tables.
This recovery failed for Exception:

21-10-07-16-24-17 herddb.core.DBManager Oct 07, 2021 4:24:17 PM herddb.core.DBManager manageTableSpaces
SEVERE: cannot handle tablespace q103
herddb.log.LogNotAvailableException: org.apache.bookkeeper.client.BKException$BKDigestMatchException: Entry digest does not match
at herddb.cluster.BookkeeperCommitLog.recovery(BookkeeperCommitLog.java:683)
at herddb.core.TableSpaceManager.recover(TableSpaceManager.java:325)
at herddb.core.TableSpaceManager.start(TableSpaceManager.java:250)
at herddb.core.DBManager.handleTableSpace(DBManager.java:571)
at herddb.core.DBManager.manageTableSpaces(DBManager.java:1226)
at herddb.core.DBManager.executeActivator(DBManager.java:1172)
at herddb.core.DBManager.access$500(DBManager.java:120)
at herddb.core.DBManager$Activator.run(DBManager.java:1115)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.bookkeeper.client.BKException$BKDigestMatchException: Entry digest does not match
at org.apache.bookkeeper.client.BKException.create(BKException.java:70)
at org.apache.bookkeeper.client.PendingReadOp.submitCallback(PendingReadOp.java:640)
at org.apache.bookkeeper.client.PendingReadOp$LedgerEntryRequest.fail(PendingReadOp.java:171)
at org.apache.bookkeeper.client.PendingReadOp$SequenceReadRequest.sendNextRead(PendingReadOp.java:393)
at org.apache.bookkeeper.client.PendingReadOp$SequenceReadRequest.logErrorAndReattemptRead(PendingReadOp.java:436)
at org.apache.bookkeeper.client.PendingReadOp$LedgerEntryRequest.complete(PendingReadOp.java:142)
at org.apache.bookkeeper.client.PendingReadOp$SequenceReadRequest.complete(PendingReadOp.java:442)
at org.apache.bookkeeper.client.PendingReadOp.readEntryComplete(PendingReadOp.java:590)
at org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion$1.readEntryComplete(PerChannelBookieClient.java:1836)
at org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion.handleReadResponse(PerChannelBookieClient.java:1917)
at org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion.handleV3Response(PerChannelBookieClient.java:1892)
at org.apache.bookkeeper.proto.PerChannelBookieClient$3.safeRun(PerChannelBookieClient.java:1447)
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more

We think this is due to corruption of the entry log.

Do you have any idea how we can solve this problem?

cc @eolivelli

@eolivelli
Copy link
Contributor

@hamadodene which version of BK ?

it looks like there is some corrupted data on the BK.

the first error is the GC thread, and the bookie is probably not able to release disk space.

the second error is a corrupted entry during a read.
it looks like there is some problem on the disk or the bookie was killed in a very bad way during a flush to the ledgers directory.

you can probably delete the bad file manually

@hamadodene
Copy link

@eolivelli the BK version is 4.11.1

@eolivelli
Copy link
Contributor

Can you share your replication parameters?

It looks like we have two problems.
If you have a good number of other replicas recovery should be able to complete.

The other problem is about GC, we should deal with this case as well

@hamadodene
Copy link

hamadodene commented Oct 12, 2021

In our case we have Bookkeeper in standalone mode. So replication 1.
We know that it is good to have at least 3 bookies and we are aware of these.
However, it would be interesting to understand why the data is corrupted. I don't think it's normal.

@munish1789
Copy link

Similar kind of error was observed in another issue #3949

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants