Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Netty finalizers load classes from closed class loaders #41607

Closed
gsmet opened this issue Jul 2, 2024 · 18 comments · Fixed by #41686
Closed

Netty finalizers load classes from closed class loaders #41607

gsmet opened this issue Jul 2, 2024 · 18 comments · Fixed by #41686
Assignees
Milestone

Comments

@gsmet
Copy link
Member

gsmet commented Jul 2, 2024

@franz1981 @cescoffier I decided to create a specific issue for this one. I think it would have value for us to be able to fully control the Netty lifecycle instead of relying on finalizers.
Obviously, I have no idea how easy it would be.

This is the stacktrace we obtain:

Class loader QuarkusClassLoader:QuarkusUnitTest ClassLoader for NonAppEndpointsDisabledTest@75b45343 has been closed and may not be accessed anymore
java.lang.Exception: Stack trace
	at java.base/java.lang.Thread.dumpStack(Thread.java:1389)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.ensureOpen(QuarkusClassLoader.java:730)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:509)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:504)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:556)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:504)
	at io.netty.util.internal.shaded.org.jctools.queues.ConcurrentCircularArrayQueue.isEmpty(ConcurrentCircularArrayQueue.java:71)
	at io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue.isEmpty(MpscArrayQueue.java:207)
	at io.netty.buffer.PoolThreadCache.checkCacheMayLeak(PoolThreadCache.java:236)
	at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:227)
	at io.netty.buffer.PoolThreadCache$FreeOnFinalize.finalize(PoolThreadCache.java:500)
	at java.base/java.lang.System$2.invokeFinalize(System.java:2297)
	at java.base/java.lang.ref.Finalizer.runFinalizer(Finalizer.java:88)
	at java.base/java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:173)

This is logged when trying to load IndexedQueueSizeUtil to check if the queue is empty.

While it might sound like something that is not that problematic because it works, we actually have a lot of problems coming from the fact that we don't enforce the closing of CLs. When a QuarkusClassLoader is closed, we shouldn't try to load anything from it anymore.
This is not exactly easy and this is just one of the occurrences but I thought it would be better isolated in a specific issue.

Copy link

quarkus-bot bot commented Jul 2, 2024

/cc @cescoffier (netty), @franz1981 (netty), @jponge (netty)

@gsmet
Copy link
Member Author

gsmet commented Jul 2, 2024

Note that this isn't the only issue I have with Netty.

I also get the following for some tests:

Class loader QuarkusClassLoader:QuarkusUnitTest ClassLoader for InvalidAttributesConfigTest@27494e46 has been closed and may not be accessed anymore
java.lang.Exception: Stack trace
	at java.base/java.lang.Thread.dumpStack(Thread.java:1389)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.ensureOpen(QuarkusClassLoader.java:727)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:501)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:514)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:503)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:95)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:159)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:262)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:840)

@franz1981
Copy link
Contributor

franz1981 commented Jul 2, 2024

For the finalizer part I have sent a small fix - if you can give it a shot: https://github.com/franz1981/netty/tree/4.1_classloader_leak

Ideally we would like to "just" rely on FastThreadLocalThreads removal, and this could be implemented in a different PR, but the point is that it would assume that the cache is used only within such threads, which is not a safe assumption for many users, hence IDK how the Netty devs will react to such proposal - to open the proverbial "can of worms"

@franz1981
Copy link
Contributor

For the other one re scheduled tasks I think they are not cancelled, because netty/netty#9580 shows that we are currently removing them on cancel.

@franz1981
Copy link
Contributor

@gsmet FYI netty/netty#14155 (comment)

If is accepted - it means an additional property to us :"(

@franz1981
Copy link
Contributor

franz1981 commented Jul 4, 2024

ok @gsmet netty/netty#14155 is ready to go, so before is getting merged can you provide a feedback if it is working for your case?
you need to set an additional sys property i.e. -Dio.netty.allocator.disableCacheFinalizersForFastThreadLocalThreads=true

which disable finalizers for both jboss blocking thread pool and event loops (which should be both FastThreadLocalThreads afaik, @cescoffier am I right?)

@gsmet
Copy link
Member Author

gsmet commented Jul 4, 2024

@franz1981 so the issue mentioned in the description looks solved by your PR.

Still wondering if we could improve the unrelated situation I was mentioning in this comment:
#41607 (comment)

@franz1981
Copy link
Contributor

For that issue I have commented #41607 (comment) which means that we have some scheduled task which is stopped likely (just guessing, but I have no idea actually...)

@gsmet
Copy link
Member Author

gsmet commented Jul 4, 2024

@cescoffier do you have any idea of what could schedule these tasks and what they could be?

@gsmet
Copy link
Member Author

gsmet commented Jul 4, 2024

@franz1981 @cescoffier I added some breakpoints and AFAICS the tasks are from the GlobalEventExecutor. And I see a lot of cancelScheduledTasks() called for the NioEventLoop but I haven't seen a call to the method for GlobalEventExecutor.

Now it's hard to debug as things pop up everywhere but I wonder if it could be an interesting lead?

@franz1981
Copy link
Contributor

franz1981 commented Jul 4, 2024

It seems that https://github.com/netty/netty/blob/ef2ed4e2cda6f764c661e24caccc833f9fd0b6e5/common/src/main/java/io/netty/util/concurrent/GlobalEventExecutor.java#L55 is a singleton event loop, single threaded, which, if has its single thread started, schedule a periodic task every each second to check if it should stop it, because idle.
It is used as fallback executor, but i don't know why it cannot be unloaded...in theory after 1 second of inactivity, it should stop it's Thread.
If you can, try debugging at https://github.com/netty/netty/blob/ef2ed4e2cda6f764c661e24caccc833f9fd0b6e5/common/src/main/java/io/netty/util/concurrent/GlobalEventExecutor.java#L262 to print if there's some task which keep on getting run and prevent it to stop.

The executor itself should stop its single thread if there is nothing to do, see https://github.com/netty/netty/blob/ef2ed4e2cda6f764c661e24caccc833f9fd0b6e5/common/src/main/java/io/netty/util/concurrent/GlobalEventExecutor.java#L278 (the break parts) - so we just need to understand why is not happening.
When it will happen, it should be just an empty shell to unload.

@gsmet
Copy link
Member Author

gsmet commented Jul 4, 2024

This very naive patch seems to improve things:

diff --git a/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/core/runtime/VertxCoreRecorder.java b/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/core/runtime/VertxCoreRecorder.java
index e0adae3bef1..30e8102f811 100644
--- a/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/core/runtime/VertxCoreRecorder.java
+++ b/extensions/vertx/runtime/src/main/java/io/quarkus/vertx/core/runtime/VertxCoreRecorder.java
@@ -34,6 +34,7 @@
 
 import io.netty.channel.EventLoopGroup;
 import io.netty.util.concurrent.FastThreadLocal;
+import io.netty.util.concurrent.GlobalEventExecutor;
 import io.quarkus.arc.Arc;
 import io.quarkus.arc.InstanceHandle;
 import io.quarkus.runtime.ExecutorRecorder;
@@ -453,6 +454,11 @@ public void handle(AsyncResult<Void> ar) {
             }
             LateBoundMDCProvider.setMDCProviderDelegate(null);
             vertx = null;
+            try {
+                GlobalEventExecutor.INSTANCE.awaitInactivity(0, TimeUnit.SECONDS);
+            } catch (InterruptedException e) {
+                // ignore
+            }
         }
     }
 

Now, I think we would need a proper way to actually shutdown this executor. All the shutdown methods have been disabled for this class.
I'm not entirely sure what I did ensures we won't have any more tasks executed.

@franz1981
Copy link
Contributor

franz1981 commented Jul 4, 2024

The executor is not really "running": it lazily creates a single Thread to handle all tasks, which will stop after 1 second of inactivity (which is using a timer to detect it). While stopped it should be an empty shell, really.

Re #41607 (comment)

beware that if there is no Thread running it will throw NPE, so we need to submit a task and right after await its inactivity. But...if there's something which keep on awake it or keeping it alive, we need to debug which task is.

At best this approach will force at least 1 second of pause between "restart": based on the configured -Dio.netty.globalEventExecutor.quietPeriodSeconds parameter, which is 1 by default.

@cescoffier
Copy link
Member

I agree with Franz, we need to see why the executor is not shutdown nicely.

I don't know if possible (didn't check the code yet) but in dev mode, we could imagine using reflection to shutdown the executor properly.

Do you have a way to debug / trigger this @gsmet, I can try to have a look tomorrow or Monday.

@gsmet
Copy link
Member Author

gsmet commented Jul 4, 2024

@franz1981

From what I can see, you have the GlobalEventExecutor#quietPeriodTask and then a bunch of tasks coming from: https://github.com/netty/netty/blob/ef2ed4e2cda6f764c661e24caccc833f9fd0b6e5/common/src/main/java/io/netty/util/concurrent/DefaultPromise.java#L500-L505

As for reproducing it, I'm trying to get my PR in but the reviewing guys are fighting... :)

@gsmet
Copy link
Member Author

gsmet commented Jul 6, 2024

@cescoffier @franz1981 so to reproduce it, make sure you have built a tree that contains:
#41692

Then you can run:

./mvnw clean verify -f extensions/opentelemetry/deployment/ -Dstart-containers -Dtest-containers -Dquarkus-log-access-to-closed-class-loaders -Dtest=InvalidAttributesConfigTest,NonAppEndpointsDisabledTest

You should see several stack traces similar to the following when the second test starts (you need two tests to see the issue):

Class loader QuarkusClassLoader:QuarkusUnitTest ClassLoader for InvalidAttributesConfigTest@4d157787 has been closed and may not be accessed anymore
java.lang.Exception: Stack trace
	at java.base/java.lang.Thread.dumpStack(Thread.java:1389)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.ensureOpen(QuarkusClassLoader.java:728)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:514)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:509)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:520)
	at io.quarkus.bootstrap.classloading.QuarkusClassLoader.loadClass(QuarkusClassLoader.java:509)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:95)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:159)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:262)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:840)

@cescoffier
Copy link
Member

Coming back to this one. Will have a look this week.

@cescoffier cescoffier self-assigned this Oct 14, 2024
@cescoffier
Copy link
Member

It seems that it's the following task that is running:

final ScheduledFutureTask<Void> quietPeriodTask = new ScheduledFutureTask<Void>(
            this, Executors.<Void>callable(new Runnable() {
        @Override
        public void run() {
            // NOOP
        }
    }, null),
            // note: the getCurrentTimeNanos() call here only works because this is a final class, otherwise the method
            // could be overridden leading to unsafe initialization here!
            deadlineNanos(getCurrentTimeNanos(), SCHEDULE_QUIET_PERIOD_INTERVAL),
            -SCHEDULE_QUIET_PERIOD_INTERVAL
    );

This is the task that runs before closing the executor (I think @franz1981 knows better).

@cescoffier cescoffier assigned franz1981 and unassigned cescoffier Oct 18, 2024
@quarkus-bot quarkus-bot bot added this to the 3.19 - main milestone Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants