Fix #25292 Ephemeral ports are exhausted if stopping the DAS takes a long time #25293

@@ -204,7 +204,8 @@ private void waitForDeath(File watchedPid) throws CommandException {
         final Duration timeout = Duration.ofMillis(DEATH_TIMEOUT_MS);
         final Supplier<Boolean> deathSign;
         if (isLocal()) {
-            deathSign = () -> !ProcessUtils.isListening(addr) && !ProcessUtils.isAlive(watchedPid);
+            // watch process id first to reduce the number of opened ephemeral ports
+            deathSign = () -> !ProcessUtils.isAlive(watchedPid) && !ProcessUtils.isListening(addr);
         } else {
             deathSign = () -> !ProcessUtils.isListening(addr);
         }
@@ -39,6 +39,7 @@
 import static java.lang.System.Logger.Level.DEBUG;
 import static java.lang.System.Logger.Level.INFO;
 import static java.lang.System.Logger.Level.TRACE;
+import static java.lang.System.Logger.Level.WARNING;
 import static java.nio.charset.StandardCharsets.ISO_8859_1;

 /**
@@ -230,7 +231,12 @@ public static boolean waitFor(Supplier<Boolean> sign, Duration timeout, boolean
                     System.out.flush();
                 }
             }
-            Thread.yield();
+            try {
+                Thread.sleep(100L);
Contributor:

Maybe sleep(10) would be enough?

Contributor:

It is always just a magic number - the problem is much deeper. The OS kernel keeps the local port reserved for some time before it is finally released, and every new check needs a new local port; that is the cause of the issue. The more frequently we repeat the check, the closer we get to exhaustion.

The correct solution really seems to be to open the connection just once and wait until it is closed from the server side - or to return immediately if the server is not listening at all.

SO_LINGER is not recommended. It would indeed resolve the issue, but it is an "evil" way to disconnect - like pulling the wire out of the socket - and the server then has to handle it as an error state. On the other hand, we never intended to send any data through this connection, so the option can stay on the table; still, I would try the first recommended option first. That will probably mean more refactoring, so maybe I will finally accept this and play for a while on my new laptop, because it detected another issue, see #25295. I tried to count local ports leading to 4848 - it was 78 at the maximum.
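For reference, a minimal sketch of what that discouraged SO_LINGER variant would look like with plain java.net sockets (illustration only, not code from this PR; the helper name and the addr parameter are assumptions):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Sketch, not a recommendation: probe the admin port once and abort the
    // connection so the ephemeral port is freed immediately.
    private static boolean probeWithLingerZero(InetSocketAddress addr) {
        try (Socket socket = new Socket()) {
            socket.connect(addr, 1000);
            // linger=0: close() sends an RST instead of a normal FIN/ACK shutdown,
            // so the local port does not linger - but the server sees an error.
            socket.setSoLinger(true, 0);
            return true; // something is listening
        } catch (IOException e) {
            return false; // connection refused or unreachable
        }
    }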

dmatej (Contributor) commented on Dec 22, 2024:

... even worse - around 200 local ports were still allocated when mvn clean install -pl :admin-tests finished (and failed). They are cleaned up automatically after around 60 seconds; the maximum seen in the build was 315, even after I cherry-picked these two commits to my branch.

watch -d -n 1 "netstat | grep localhost:4848 | wc -l"

Contributor:

There are 2 things:

  • The sleep is not there for checking the port, but to give some pause between checks whether the PID is still running. With this PR, the local check mainly checks the PID and only connects over the network once the process is no longer alive. With just yield, the algorithm would repeat "check PID" -> "yield" -> "check PID" -> "yield" ... effectively eating one CPU core only to check whether the process is still running. Would it be better if this wait applied only to the PID check? E.g. (the try/catch is needed because Supplier.get() cannot throw the checked InterruptedException):

    deathSign = () -> {
        boolean isAlive = ProcessUtils.isAlive(watchedPid);
        if (isAlive) {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return !isAlive && !ProcessUtils.isListening(addr);
    };

  • The check over the network (ProcessUtils.isListening(addr)) only connects to the port and then returns true, without any waiting. I agree with what you suggest, @dmatej, that this check should wait until the server disconnects instead of repeatedly trying to connect - for both the local and the remote server. For example, check the value of server.isClosed() in a loop, or use NIO to wait for the close event (see the sketch after this list).
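A rough sketch of that NIO idea (illustration only, not existing GlassFish code; the helper name, addr and timeoutMillis are assumptions): connect once, then select until the read reports end-of-stream, i.e. the server has closed the connection.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;

    // Sketch: wait until the server closes the already-established connection.
    static boolean waitForServerCloseNio(InetSocketAddress addr, long timeoutMillis) throws IOException {
        try (SocketChannel channel = SocketChannel.open(addr);
                Selector selector = Selector.open()) {
            channel.configureBlocking(false);
            channel.register(selector, SelectionKey.OP_READ);
            ByteBuffer buffer = ByteBuffer.allocate(16);
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                if (selector.select(500L) == 0) {
                    continue; // nothing happened yet, keep waiting
                }
                selector.selectedKeys().clear();
                if (channel.read(buffer) == -1) {
                    return true; // end of stream - the server closed the connection
                }
                buffer.clear(); // ignore any data, we only care about the close
            }
            return false; // timed out - the server still holds the connection open
        }
    }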

Contributor:

@OndroMih, yield doesn't usually behave that way. On most systems it just gives up its time slice and yields control to the scheduler instead of hogging the CPU.

With ports it is much more complicated - when the client calls close(), it doesn't really close anything, it just declares that the application wants to close the connection. That is where my assumption was wrong: I expected the local port to be free after close, which is not true.

See

And this is interesting for both of us: Thread.onSpinWait seems better than yield, although I am still not sure whether they end up doing the same thing in the end:

Thread.sleep also means switching the thread context out and back in, which adds some overhead - Thread.sleep(10) might be less efficient than yield or onSpinWait.

What we want here is not to iterate with sleeps, but to continue immediately once the condition becomes true.
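For illustration, a spin-wait hint loop would look roughly like this (a sketch against the shape of the existing waitFor(); sign and timeout are taken from its signature, everything else is assumed) - it is still busy-waiting, just with a cheaper hint per iteration than a sleep:

    // Sketch: busy-wait with a CPU hint instead of yield or sleep.
    final long deadline = System.nanoTime() + timeout.toNanos();
    while (System.nanoTime() < deadline) {
        if (sign.get()) {
            return true; // condition met, continue immediately
        }
        Thread.onSpinWait(); // hint to the CPU that this is a spin loop (Java 9+)
    }
    return false; // timed out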

Contributor:

  • I agree with what you suggest, @dmatej, that this check should wait until the server disconnects, instead of repeatedly trying to connect. For both local and remote server. For example, check in a loop the value of server.isClosed(), or use NIO to wait for close event.

What about this (a sketch follows below):

  • Open a socket on the client side
  • Connect to the server
  • Set a client socket timeout
  • Block on socket.getInputStream().read()
  • Await one of three outcomes:
    • the socket timeout expires
    • read() returns -1 (EOF) if the server socket was closed gracefully
    • a SocketException (with cause "Connection reset") if the server socket was closed abnormally (e.g. the server was killed)
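A minimal sketch of that, assuming plain java.net sockets (the helper name, addr and timeout are illustrative, not the actual GlassFish API):

    import java.io.IOException;
    import java.net.ConnectException;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.SocketTimeoutException;
    import java.time.Duration;

    // Sketch: connect once and block until the server side closes the connection.
    static boolean waitForServerClose(InetSocketAddress addr, Duration timeout) {
        try (Socket socket = new Socket()) {
            socket.connect(addr, (int) timeout.toMillis());
            socket.setSoTimeout((int) timeout.toMillis());
            // Blocks until EOF, data, a reset or the timeout.
            int read = socket.getInputStream().read();
            return read == -1; // EOF: the server closed the socket gracefully
        } catch (SocketTimeoutException e) {
            return false; // still connected after the timeout - the server is up
        } catch (ConnectException e) {
            return true; // nothing is listening - the server is already down
        } catch (IOException e) {
            return true; // e.g. connection reset - the server socket closed abnormally
        }
    }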

Contributor:

  • And this is interesting for both of us: Thread.onSpinWait seems better than yield,

May we use ProcessHandle.onExit().get() instead of any busy-waiting?
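Something like this, perhaps (a sketch only; reading the pid from the watched pid file and the reuse of the existing DEATH_TIMEOUT_MS constant are assumptions):

    import java.util.Optional;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Sketch: wait for process termination without polling.
    static boolean waitForExit(long pid) {
        Optional<ProcessHandle> handle = ProcessHandle.of(pid);
        if (handle.isEmpty()) {
            return true; // no such process - it already exited
        }
        try {
            // Completes when the process has terminated; bounded by the overall timeout.
            handle.get().onExit().get(DEATH_TIMEOUT_MS, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // still running after the timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false; // unexpected failure while waiting
        }
    }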

Contributor:

Well, after experiments with SO_LINGER, closing the streams before closing the socket (recommended by AI), sleeps, etc., I have also found more places with a similar issue and came to a conclusion, but I want to ask you both, @tnagao7 and @OndroMih - do we need to check ports in the "local" case at all?

I think we don't. Maybe it could serve as a backup when things are broken somehow (for example if somebody deleted the domain directory), but even then it would probably be better to find the process using the same instance directory, check whether it is running, and ignore the ports entirely (see the sketch below).
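For illustration, finding "our" process by its instance directory could look roughly like this (a sketch only; instanceDir is an assumed File pointing at the instance directory, and matching on the command line is just one possible heuristic - it may not be visible for processes of other users):

    import java.io.File;
    import java.util.Optional;

    // Sketch: identify the server by its instance directory instead of probing ports.
    static boolean isInstanceRunning(File instanceDir) {
        Optional<ProcessHandle> server = ProcessHandle.allProcesses()
            .filter(handle -> handle.info().commandLine()
                .map(cmd -> cmd.contains(instanceDir.getAbsolutePath()))
                .orElse(false))
            .findFirst();
        return server.map(ProcessHandle::isAlive).orElse(false);
    }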

Another issue that came my way on the new hardware is #25295 - the highest number of open local ports leading to localhost:4848 in admin-tests was 427, but those are not caused by restarts/stops/starts. I am not sure whether that is OK or whether there is something we should fix.

For now we can merge this and I can follow up in my own PR. The count of open local ports really is an issue which must be monitored and resolved regardless of the source of the exhaustion.

Contributor:

@avpinchuk The problem with ports is that when the port refuses a connection, it doesn't provide any information about the state of the process. And the opposite holds too: when something is listening, it doesn't mean it is "our" GlassFish instance. If we want to monitor the process, we should not monitor its ports at all. In our case the instance is defined by its directory and by its process id.

With just one exception - asadmin stop-domain is able to stop ANY instance of GlassFish listening on the admin port, even if it has a different PID than the one in the pid file (which may not exist at all if the instance directory was deleted by mistake).

The onExit handle might be useful, however I am not sure whether it is reliable, e.g. when we use the --kill parameter. Still, this is a good idea.

Contributor:

  • The onExit handle might be useful, however I am not sure whether it is reliable, e.g. when we use the --kill parameter.

The JavaDoc for ProcessHandle's destroy() and destroyForcibly() says:

  The CompletableFuture from onExit() is completed when the process has terminated.

+            } catch (InterruptedException e) {
+                LOG.log(WARNING, "Interrupted while waiting", e);
+                break;
+            }
         }
         return false;
     } finally {