Fix #25292 Ephemeral ports are exhausted if stopping the DAS takes a long time #25293
Conversation
Signed-off-by: Takahiro Nagao <[email protected]>
try {
    Thread.sleep(100L);
} catch (InterruptedException e) {
    LOG.log(TRACE, "Interrupted while waiting", e);
Do not swallow InterruptedException.
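For reference, the usual pattern is to restore the interrupt flag and stop waiting instead of only logging. A minimal sketch (the helper name is hypothetical, not part of this PR):

// Hypothetical helper, not part of this PR: restore the interrupt flag instead of swallowing it.
private static boolean sleepRespectingInterrupt(long millis) {
    try {
        Thread.sleep(millis);
        return true;
    } catch (InterruptedException e) {
        // Re-set the flag so callers up the stack can see the interruption and stop waiting.
        Thread.currentThread().interrupt();
        return false;
    }
}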
Also, 100 ms of sleeping is quite a lot; it will slow down integration tests - I believe Thread.yield is enough. Those ports should be closed immediately, if they are not, there is a bug somewhere. Is it visible on Linux too?
One more idea: something in your system (firewall, antivirus) might be attaching to every open port, so the ports are not closed immediately when Java closes them. I have no idea how to resolve that reliably except with your workaround (sleeping between iterations). Maybe we should make the sleep value configurable (system property).
But maybe we should start with a unit test: create a fake server, run 1 000 000 iterations in a parallel thread and then close the server. We have a Windows machine in GitHub Actions and Linux on Jenkins, so it should be reproducible if it is a common problem. I can try that in a few days; tomorrow I have to work on other things.
But I really don't like solutions like "take a pause for this magical number of millis and all issues vanish".
If it happens to you, it can happen to other people too, so I would like to know all about it ;-)
I can confirm that on Ubuntu 24.04.1 LTS at least:
~17k TIME_WAIT ports with a 15-second shutdown timeout...
Do not swallow InterruptedException.

I've modified the code not to swallow InterruptedException.
Those ports should be closed immediately, if they are not, there is a bug somewhere.
The behavior that TCP ports remain in the TIME_WAIT state for a while (i.e. are not immediately closed) comes from the TCP specification. See RFC 9293, Section 3.3.2 [1].
On Linux systems, TCP ports in the TIME_WAIT state are kept for 60 seconds by default [2].
On Windows systems, the default is 120 seconds [3].
Although it is possible to close TCP ports immediately (without transitioning to the TIME_WAIT state) by using the SO_LINGER option, this is not recommended because it forcibly closes the connection with an RST and may violate the TCP quiet time restriction.
[1] https://www.rfc-editor.org/rfc/rfc9293.html#section-3.3.2
[2] https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
[3] https://learn.microsoft.com/en-us/biztalk/technical-guides/settings-that-can-be-modified-to-improve-network-performance
Aha, SO_LINGER rings a bell for me. However, in our case we don't use the TCP connection for communication; we just check that the server is listening. Another option could be to wait until the server closes the successfully established connection.
The TCP state transition diagrams above both show orderly connection termination. There’s another way to terminate a TCP connection and that’s by aborting the connection and sending an RST rather than a FIN. This is usually achieved by setting the SO_LINGER socket option to 0. This causes pending data to be discarded and the connection to be aborted with an RST rather than for the pending data to be transmitted and the connection closed cleanly with a FIN. It’s important to realise that when a connection is aborted any data that might be in flow between the peers is discarded and the RST is delivered straight away; usually as an error which represents the fact that the “connection has been reset by the peer”. The remote peer knows that the connection was aborted and neither peer enters TIME_WAIT.
...
it’s inefficient to rapidly open and close TCP connections to the same server so it makes sense beyond the issue of TIME_WAIT to try and maintain connections for longer periods of time rather than shorter periods of time.
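For concreteness, a minimal sketch of that abortive close in Java (the host, port, and method name are placeholders; as noted above, this is not recommended here):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Illustration only, not a recommendation: SO_LINGER with linger = 0 makes close()
// abort the connection with an RST, so the client port never enters TIME_WAIT.
static void abortiveClose(String host, int port) throws IOException {
    try (Socket socket = new Socket()) {
        socket.connect(new InetSocketAddress(host, port), 1_000);
        socket.setSoLinger(true, 0); // 0 seconds of linger => RST on close()
    } // close() here discards pending data and resets the connection instead of sending FIN
}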
Signed-off-by: Takahiro Nagao <[email protected]>
@@ -230,7 +231,12 @@ public static boolean waitFor(Supplier<Boolean> sign, Duration timeout, boolean
                System.out.flush();
            }
        }
        Thread.yield();
        try {
            Thread.sleep(100L);
Maybe sleep(10) would be enough?
It is always just a magic number - the problem is much deeper, because there is some time set by the OS kernel until the port is finally released. When we check again, we need a new local port; that is the cause of the issue. The more frequently we repeat that, the closer we get to exhaustion.
The correct solution really seems to be to open it just once and wait (!!!) until it closes from the server side, or to return immediately if it is not listening at all.
SO_LINGER is not recommended; it would really resolve the issue, however it is an "evil" way to disconnect, like pulling the wire out of the socket, and the server then must treat it as an error state. On the other hand, we had no intention of sending any data through this connection, so it can stay on the table, but I would try the first recommended option first. However, that will probably mean more refactoring, so maybe I will finally accept this and play for a while on my new laptop, because it detected another issue, see #25295. I tried to count the local ports leading to 4848 - it was 78 at the maximum.
... even worse - around 200 local ports were allocated when mvn clean install -pl :admin-tests finished (and failed). They are cleaned up automatically after around 60 seconds; the maximum seen in the build was 315, even after I cherry-picked these two commits to my branch.
watch -d -n 1 "netstat | grep localhost:4848 | wc -l"
There are 2 things:
- the sleep is not there for checking the port, but to give some pause between checks of whether the PID is running. With this PR, the local check mainly checks the PID, and only if it's not available, connects over the network. With just yield, the algorithm would repeat "check PID" -> "yield" -> "check PID" -> "yield" ... effectively eating one CPU core only to check whether the process is still running. Would it be better if this wait applied only to the PID check, e.g.:
deathSign = () -> {
    boolean isAlive = ProcessUtils.isAlive(watchedPid);
    if (isAlive) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
    return !isAlive && !ProcessUtils.isListening(addr);
};
- the check over network (ProcessUtils.isListening(addr)) only connects to the port and then returns true, without any waiting. I agree with what you suggest, @dmatej, that this check should wait until the server disconnects, instead of repeatedly trying to connect. For both local and remote server. For example, check in a loop the value of server.isClosed(), or use NIO to wait for close event.
@OndroMih, yield doesn't usually behave that way. On most systems it just gives up its time slice and yields control to the scheduler instead of hogging the CPU.
With ports it is much more complicated - if the client closes, it doesn't really close anything, it just declares that the application wants to close. That is where my assumption was wrong: I expected that after close the local port is free, which is not true.
See
- https://en.wikipedia.org/wiki/Yield_(multithreading) (common feature)
- https://www.baeldung.com/java-thread-yield (Java)
- https://learn.microsoft.com/en-us/dotnet/api/system.threading.thread.yield?view=net-9.0 (.Net)
And for both of us this is interesting, Thread.onSpinWait seems better than yield, however I am still not sure if they ultimately don't do the same thing:
Thread.sleep also means unloading and loading the thread context, which means some load - Thread.sleep(10) might be less effective than yield or onSpinWait.
What we want here is not to iterate with sleeps, but to continue immediately after the condition is true.
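For illustration (hypothetical names, not code from this PR), a spin loop built around Thread.onSpinWait reacts immediately once the condition flips, but keeps a core busy for the whole wait:

import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical sketch: return as soon as the condition becomes true, at the cost of busy-waiting.
static boolean spinUntil(Supplier<Boolean> condition, Duration timeout) {
    final long deadline = System.nanoTime() + timeout.toNanos();
    while (System.nanoTime() < deadline) {
        if (condition.get()) {
            return true;
        }
        Thread.onSpinWait(); // hint to the JVM/CPU that this is a spin loop
    }
    return false;
}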
- I agree with what you suggest, @dmatej, that this check should wait until the server disconnects, instead of repeatedly trying to connect. For both local and remote server. For example, check in a loop the value of server.isClosed(), or use NIO to wait for close event.
What about this:
- Open socket on client side
- Connect to server
- Set client socket timeout
- Block on socket.getInputStream().read()
- Await one of three conditions:
  - the socket timeout expired
  - read() returns -1 (EOF) if the server socket closed gracefully
  - SocketException (with cause Connection reset) if the server socket closed abnormally (e.g. the server was killed)
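A rough sketch of those steps (a hypothetical helper, not code from this PR; the address and timeout are placeholders):

import java.io.IOException;
import java.io.InputStream;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketException;
import java.net.SocketTimeoutException;

// Returns true if the server side closed the connection (gracefully or not) within the timeout.
static boolean waitUntilServerCloses(InetSocketAddress addr, int timeoutMillis) throws IOException {
    try (Socket socket = new Socket()) {
        socket.connect(addr, timeoutMillis);
        socket.setSoTimeout(timeoutMillis);          // bound the blocking read below
        InputStream in = socket.getInputStream();
        try {
            return in.read() == -1;                  // -1 (EOF): server closed gracefully
        } catch (SocketTimeoutException e) {
            return false;                            // server still holds the connection open
        } catch (SocketException e) {
            return true;                             // "Connection reset": server closed abnormally
        }
    } catch (ConnectException e) {
        return true;                                 // nothing is listening any more
    }
}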
And for both of us this is interesting, Thread.onSpinWait seems better than yield,

May we use ProcessHandle.onExit().get() instead of any busy-waiting?
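For illustration, a sketch of that idea (the method name and parameters are illustrative, not from the PR):

import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Waits for the process to exit without polling; returns true if it exited within the timeout.
static boolean waitForExit(long watchedPid, long timeoutSeconds) throws InterruptedException {
    Optional<ProcessHandle> handle = ProcessHandle.of(watchedPid);
    if (handle.isEmpty()) {
        return true; // the process is already gone (or not visible to us)
    }
    try {
        handle.get().onExit().get(timeoutSeconds, TimeUnit.SECONDS);
        return true;
    } catch (TimeoutException e) {
        return false; // still running after the timeout
    } catch (ExecutionException e) {
        return false; // onExit() should not complete exceptionally; be defensive anyway
    }
}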
Well, after experiments with SO_LINGER, closing streams before closing the socket (recommended by AI), sleeps, etc., I have also found more places with a similar issue and came to a conclusion, but I want to ask you both, @tnagao7 and @OndroMih - do we need to check ports in the "local" case?
I think we don't. Maybe just when things are broken somehow (like if somebody deleted the domain directory) it could be used as a backup, but even then it might be better to find the process using the same instance directory, check whether it is running, and ignore ports.
Another issue that came my way on the new hardware is #25295 - the highest number of open local ports leading to localhost:4848 in admin-tests was 427; however, those are not caused by restarts/stops/starts. I am not sure if that is okay or if there is something we should fix.
For now we can merge this and I can follow up in my own PR. The count of open local ports is really an issue which must be monitored and resolved regardless of the source of the exhaustion.
@avpinchuk The problem with ports is that when a port refuses the connection, it doesn't provide any info about the state of the process. And the opposite too: when it is listening, it doesn't mean it is "our" GlassFish instance. If we want to monitor the process, we should not monitor its ports at all.
In our case the instance is defined by its directory and by the process id.
With just one exception - asadmin stop-domain is able to stop ANY instance of GF listening on the admin port, even if it has a different PID than the one in the pid file (which may not exist, e.g. when the instance directory was deleted by mistake).
The onExit handle might be useful, however I am not sure if it is reliable, e.g. when we use the --kill parameter? Still, this is a good idea.
The onExit handle might be useful, however I am not sure if it is reliable, e.g. when we use the --kill parameter?

The JavaDoc for ProcessHandle's destroy() and destroyForcibly() says:
The CompletableFuture from onExit() is completed when the process has terminated.
Well, I thought I had a fix for #25295, but when I isolated it from a branch based on yours, tests failed again, so it seems I have quite a good reproducer of both problems together. I will try to fix it properly based on the discussion here.
Fixes #25292.
- To reduce the number of ports consumed, use ProcessUtils.isAlive() over ProcessUtils.isListening() if possible.
- To limit the number of ports consumed per a certain period, test the criterion once in a certain amount of time (as in StopDomainCommand#waitForDeath() of GlassFish 7.0.0-M10 or earlier).