
Deadlock occurs during shutdown #1178

Open
funky-eyes opened this issue Jan 23, 2025 · 3 comments

Comments

@funky-eyes
Contributor

Describe the bug

    public void destroy() {
        Optional.ofNullable(raftGroupService).ifPresent(r -> {
            r.shutdown();
            try {
                r.join();
            } catch (InterruptedException e) {
                logger.warn("Interrupted when RaftServer destroying", e);
            }
        });
    }

As you can see, jraft's group shutdown starts a new thread to perform the shutdown, and after NodeImpl's shutdown that thread also goes on to execute the join method:

"JRaft-Group-Default-Executor-3" #182 [221814] daemon prio=5 os_prio=0 cpu=62441.04ms elapsed=3535428.38s tid=0x00007f4c0411f200 nid=221814 waiting on condition  [0x00007f4bde594000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000007017fb0f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:371)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block([email protected]/AbstractQueuedSynchronizer.java:519)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock([email protected]/ForkJoinPool.java:3780)
	at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3725)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/AbstractQueuedSynchronizer.java:1707)
	at com.alipay.sofa.jraft.util.CountDownEvent.await(CountDownEvent.java:69)
	at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.join(SnapshotExecutorImpl.java:748)
	at com.alipay.sofa.jraft.core.NodeImpl.join(NodeImpl.java:2891)
	- locked <0x00000007017e6d30> (a com.alipay.sofa.jraft.core.NodeImpl)
	at com.alipay.sofa.jraft.core.NodeImpl.lambda$shutdown$7(NodeImpl.java:2837)
	at com.alipay.sofa.jraft.core.NodeImpl$$Lambda/0x00007f4c57804f80.run(Unknown Source)
	at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:572)
	at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:317)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:642)
	at java.lang.Thread.runWith([email protected]/Thread.java:1596)
	at java.lang.Thread.run([email protected]/Thread.java:1583)

Meanwhile, the business thread waits for this to finish. Because join is declared synchronized, the shutdown-hook thread can never acquire the NodeImpl monitor and hangs forever, so the application cannot shut down:

"SpringApplicationShutdownHook" #34 [442820] prio=5 os_prio=0 cpu=320.57ms elapsed=421.49s tid=0x00007f4bfc1ad980 nid=442820 waiting for monitor entry  [0x00007f4bc0946000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.alipay.sofa.jraft.core.NodeImpl.join(NodeImpl.java:2883)
	- waiting to lock <0x00000007017e6d30> (a com.alipay.sofa.jraft.core.NodeImpl)
	at com.alipay.sofa.jraft.RaftGroupService.join(RaftGroupService.java:148)
	- locked <0x0000000701d33790> (a com.alipay.sofa.jraft.RaftGroupService)
	at org.apache.seata.server.cluster.raft.RaftServer.lambda$destroy$0(RaftServer.java:130)
	at org.apache.seata.server.cluster.raft.RaftServer$$Lambda/0x00007f4c577bf0f8.accept(Unknown Source)
	at java.util.Optional.ifPresent([email protected]/Optional.java:178)
	at org.apache.seata.server.cluster.raft.RaftServer.destroy(RaftServer.java:127)
	at org.apache.seata.server.cluster.raft.RaftServer.close(RaftServer.java:122)
	at org.apache.seata.server.cluster.raft.RaftServerManager.lambda$destroy$1(RaftServerManager.java:166)
	at org.apache.seata.server.cluster.raft.RaftServerManager$$Lambda/0x00007f4c577beed0.accept(Unknown Source)
	at java.util.HashMap.forEach([email protected]/HashMap.java:1429)
	at org.apache.seata.server.cluster.raft.RaftServerManager.destroy(RaftServerManager.java:165)
	at org.apache.seata.server.session.SessionHolder.destroy(SessionHolder.java:415)
	at org.apache.seata.server.coordinator.DefaultCoordinator.destroy(DefaultCoordinator.java:684)
	at org.apache.seata.server.ServerRunner.destroy(ServerRunner.java:90)
	at org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:213)
	at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:587)
	at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:559)
	at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingleton(DefaultListableBeanFactory.java:1163)
	at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:520)
	at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingletons(DefaultListableBeanFactory.java:1156)
	at org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1123)
	at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1089)
	at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.doClose(ServletWebServerApplicationContext.java:174)
	at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:1035)
	- locked <0x0000000700cb6728> (a java.lang.Object)
	at org.springframework.boot.SpringApplicationShutdownHook.closeAndWait(SpringApplicationShutdownHook.java:145)
	at org.springframework.boot.SpringApplicationShutdownHook$$Lambda/0x00007f4c577b5538.accept(Unknown Source)
	at java.lang.Iterable.forEach([email protected]/Iterable.java:75)
	at org.springframework.boot.SpringApplicationShutdownHook.run(SpringApplicationShutdownHook.java:114)
	at java.lang.Thread.runWith([email protected]/Thread.java:1596)
	at java.lang.Thread.run([email protected]/Thread.java:1583)
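The lock interaction in the two traces above can be boiled down to a small self-contained sketch (hypothetical class and thread names, not jraft code): one thread parks inside a synchronized join() while still owning the object's monitor, and the shutdown-hook stand-in then blocks forever on monitor entry.

```java
import java.util.concurrent.CountDownLatch;

// Minimal sketch of the hang (hypothetical class, not jraft code):
// one thread enters a synchronized join() and parks on a latch that is
// never counted down, still owning the object's monitor; a second thread
// then blocks forever on monitor entry, exactly like the shutdown hook.
public class MonitorHangSketch {
    private final CountDownLatch pending = new CountDownLatch(1); // never counted down

    // Like NodeImpl.join(): synchronized, then parks on internal state.
    // CountDownLatch.await(), unlike Object.wait(), does NOT release the monitor.
    public synchronized void join() throws InterruptedException {
        pending.await();
    }

    public static void main(String[] args) throws Exception {
        MonitorHangSketch node = new MonitorHangSketch();

        Thread asyncShutdown = new Thread(() -> {
            try { node.join(); } catch (InterruptedException ignored) { }
        }, "async-shutdown"); // plays the JRaft-Group-Default-Executor role
        asyncShutdown.setDaemon(true);
        asyncShutdown.start();
        Thread.sleep(200); // let it take the monitor and park

        Thread hook = new Thread(() -> {
            try { node.join(); } catch (InterruptedException ignored) { }
        }, "shutdown-hook");
        hook.setDaemon(true);
        hook.start();
        hook.join(500); // times out: the hook never gets the monitor
        System.out.println("hook state: " + hook.getState());
    }
}
```

Printing the hook thread's state shows BLOCKED (waiting for monitor entry), matching the "SpringApplicationShutdownHook" trace.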

Expected behavior

Actual behavior

Steps to reproduce

Minimal yet complete reproducer code (or GitHub URL to code)

Environment

  • SOFAJRaft version:
  • JVM version (e.g. java -version):
  • OS version (e.g. uname -a):
  • Maven version:
  • IDE version:
@funky-eyes
Contributor Author

    @Override
    public synchronized void join() throws InterruptedException {
        if (this.shutdownLatch != null) {
            if (this.readOnlyService != null) {
                this.readOnlyService.join();
            }
            if (this.logManager != null) {
                this.logManager.join();
            }
            if (this.snapshotExecutor != null) {
                this.snapshotExecutor.join();
            }
            if (this.wakingCandidate != null) {
                Replicator.join(this.wakingCandidate);
            }
            this.shutdownLatch.await();
            this.applyDisruptor.shutdown();
            this.applyQueue = null;
            this.applyDisruptor = null;
            this.shutdownLatch = null;
        }
        if (this.fsmCaller != null) {
            this.fsmCaller.join();
        }
    }

SnapshotExecutorImpl#join has to wait for the pending jobs to fire, and those jobs only fire after a snapshot completes. But my application is already shutting down, so no snapshot will ever be taken; the jobs never fire, and join hangs outright.
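As a rough illustration of what the trace shows (names and implementation here are assumptions, not jraft's actual CountDownEvent), join() blocks on a count that only a completed snapshot job would ever bring back to zero:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of CountDownEvent-style semantics (simplified, not
// jraft's code): await() only returns once every registered job has
// counted down, and during shutdown no snapshot job will ever do so.
public class CountDownEventSketch {
    private final AtomicInteger count = new AtomicInteger();
    private final Object monitor = new Object();

    void register() { count.incrementAndGet(); }   // a snapshot job starts

    void countDown() {                             // a snapshot job finishes
        if (count.decrementAndGet() == 0) {
            synchronized (monitor) { monitor.notifyAll(); }
        }
    }

    void await() throws InterruptedException {
        synchronized (monitor) {
            while (count.get() > 0) {
                monitor.wait(); // hangs forever if nothing ever counts down
            }
        }
    }

    public static void main(String[] args) throws Exception {
        CountDownEventSketch event = new CountDownEventSketch();
        event.register(); // one pending snapshot job

        Thread joiner = new Thread(() -> {
            try {
                event.await();
                System.out.println("join returned");
            } catch (InterruptedException ignored) { }
        });
        joiner.setDaemon(true);
        joiner.start();

        Thread.sleep(200);
        System.out.println("before countDown: " + joiner.getState());
        event.countDown(); // only a completed snapshot would do this
        joiner.join(2000);
        System.out.println("after countDown: " + joiner.getState());
    }
}
```

With the pending job registered, the joiner sits in WAITING until something counts the event down; during a real shutdown, nothing does.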

@funky-eyes
Contributor Author

NodeImpl's join ends up waiting on the snapshot, but during shutdown no snapshot can run anymore since the business threads are all closed. My snapshot interval is 10 minutes, so the node would have to wait at least 10 minutes before it can go offline, which is very strange. Because RaftGroupService's shutdown is asynchronous, I call join to wait for shutdown to complete rather than killing the process outright. As a result, the main thread's join is waiting for the async thread's join to release the lock, and the async thread's join can only release the lock after a snapshot completes.

@funky-eyes
Contributor Author

If this is confirmed to be a bug, I see two possible fixes:
1. Trigger a snapshot immediately when shutdown begins.
2. Similar to logManager, add a shutdown-specific CountDownLatch to SnapshotExecutorImpl, counted down when NodeImpl triggers shutdown, so that join can pass through instead of waiting.
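A sketch of option 2, with illustrative names (this is the idea, not a patch against SnapshotExecutorImpl): join() also watches a dedicated shutdown latch, so it returns promptly once shutdown is signalled even though the pending jobs never finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of proposal 2 (illustrative names only): give the snapshot
// executor a dedicated shutdown latch so join() can return as soon as
// shutdown is signalled, instead of waiting on a snapshot that will
// never run during shutdown.
public class SnapshotExecutorSketch {
    private final CountDownLatch shutdownLatch = new CountDownLatch(1);
    private volatile boolean jobsPending = true; // stands in for the job count

    // Would be called from NodeImpl.shutdown(): unblocks any pending join().
    void signalShutdown() {
        shutdownLatch.countDown();
    }

    // Re-checks the pending-jobs condition, but also returns promptly
    // once shutdown has been signalled.
    void join() throws InterruptedException {
        while (jobsPending) {
            if (shutdownLatch.await(50, TimeUnit.MILLISECONDS)) {
                return; // shutdown signalled: stop waiting for snapshots
            }
        }
    }

    public static void main(String[] args) throws Exception {
        SnapshotExecutorSketch exec = new SnapshotExecutorSketch();
        Thread joiner = new Thread(() -> {
            try {
                exec.join();
                System.out.println("join returned");
            } catch (InterruptedException ignored) { }
        });
        joiner.setDaemon(true);
        joiner.start();

        Thread.sleep(200);     // joiner is stuck: jobs never finish
        exec.signalShutdown(); // the extra signal proposal 2 would add
        joiner.join(2000);
        System.out.println("joiner: " + joiner.getState());
    }
}
```

Without the signalShutdown() call the joiner would spin on the pending-jobs check forever; with it, join() falls through and the process can exit.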
