Support for graceful socket closes when checking server endpoint listening via the admin CLI by carryel · Pull Request #25973 · eclipse-ee4j/glassfish

carryel · 2026-04-08T10:04:08Z

On macOS, immediately closing a client socket leaves the socket in an incomplete state on the server, which causes unnecessary server errors. Please refer to the issue below:

Grizzly warnings in the log file glassfish-grizzly#2285

avpinchuk · 2026-04-08T14:41:06Z

+                // Force RST on close
+                socket.setSoLinger(true, 0);
+            }
            socket.connect(endpoint.toInetSocketAddress(), SOCKET_CONNECT_TIMEOUT);


What will happen if we set some timeout after the connection and keep the SO_LINGER option?

The SO_LINGER option was introduced to avoid leaving a bunch of ephemeral ports in the TIME_WAIT state.

See

Added disconnecting HttpURLConnection and replaced ProcessUtils.isListening body #25811
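
As a minimal sketch of the technique under discussion (not the actual ProcessUtils code), a probe with a zero-timeout SO_LINGER could look like the following. The class name LingerProbe and the method signature are assumptions made for illustration; only socket.setSoLinger(true, 0) and the connect call with a timeout come from the diff above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, named for illustration only.
final class LingerProbe {

    /** Returns true if something accepts TCP connections at the given address. */
    static boolean isListening(InetSocketAddress address, int connectTimeoutMillis) {
        try (Socket socket = new Socket()) {
            // SO_LINGER enabled with a zero timeout: close() aborts the connection
            // (RST instead of FIN on common TCP stacks), so the client port is not
            // expected to stay in TIME_WAIT.
            socket.setSoLinger(true, 0);
            socket.connect(address, connectTimeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, unreachable, or timed out: all treated as "not listening".
            return false;
        }
    }
}
```

The trade-off discussed in this thread is visible here: the abortive close frees the port at once, but the server side sees a reset connection instead of an orderly shutdown.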

dmatej · 2026-04-09T02:10:02Z

The problem is that without the RST those connections are not closed immediately, which causes issues with restarts and with the number of allocated ports.
There are at least 5 calls of the original method, but you replaced just 3 of them. Why? If we don't want to use the RST signal, we should remove it completely.

However, looking at https://ipwithease.com/tcp-rst-flag/ I would say that the original was correct and the server should handle that, probably logging it only at the trace/finest level, as it may happen at any time with any other client.

If there is a router doing NAT, especially a low end router with few resources, it will age the oldest TCP sessions first. To do this it sets the RST flag in the packet that effectively tells the receiving station to (very ungracefully) close the connection. this is done to save resources.

Some firewalls do that if a connection is idle for x number of minutes. Some ISPs set their routers to do that for various reasons as well.

Without that I was able to exhaust some limits.

So I believe the server should just ignore this error.

I have one more thought: in the past Java cached connections internally; now it should be possible to close them explicitly. Then we might not need socket.setSoLinger(true, 0); at all if we always disconnected. I will check that; I have to remember how I was counting those connections.

dmatej · 2026-04-09T02:20:31Z

I asked Claude; it confirmed both mentioned problems (the Mac issue and TIME_WAIT without RST) and offered this solution:

public boolean isReachable(Endpoint endpoint) {
    try (SocketChannel channel = SocketChannel.open();
         Selector selector = Selector.open()) {
        
        channel.configureBlocking(false);
        channel.register(selector, SelectionKey.OP_CONNECT);
        channel.connect(endpoint.toInetSocketAddress());

        if (selector.select(SOCKET_CONNECT_TIMEOUT) == 0) {
            return false; // timed out, no established connection → no TIME_WAIT
        }

        SelectionKey key = selector.selectedKeys().iterator().next();
        if (key.isConnectable()) {
            return channel.finishConnect(); // true = success
        }
        return false;
    } catch (IOException e) {
        LOG.log(TRACE, "Socket check to " + endpoint + " failed", e);
        return false;
    }
}

That is very close to the original code I replaced in 55797c6, which probably did not work properly either, as it was missing the key.isConnectable part. Now the question is whether Claude's code is completely correct.
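
One documented gap in the snippet above: in non-blocking mode SocketChannel.connect may return true when the connection is established immediately (as can happen with a local connection), and the selector may then never report OP_CONNECT. A hedged variant that checks the return value first might look like this. NioProbe and its parameters are hypothetical names, and like the snippet above it still closes with a normal FIN, so it is not expected to change the TIME_WAIT behavior.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Hypothetical helper, named for illustration only.
final class NioProbe {

    /** timeoutMillis must be positive; Selector.select(0) would block indefinitely. */
    static boolean isReachable(InetSocketAddress address, long timeoutMillis) {
        try (SocketChannel channel = SocketChannel.open();
             Selector selector = Selector.open()) {
            channel.configureBlocking(false);
            if (channel.connect(address)) {
                return true; // connected immediately, nothing to wait for
            }
            channel.register(selector, SelectionKey.OP_CONNECT);
            if (selector.select(timeoutMillis) == 0) {
                return false; // no event within the timeout
            }
            // Only one key is registered, so the event belongs to our channel.
            // finishConnect() returns true on success or throws on failure.
            return channel.finishConnect();
        } catch (IOException e) {
            return false;
        }
    }
}
```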

dmatej

The problem is that a "graceful" disconnect leads to thousands of connections in the TIME_WAIT state.
I will try Claude's recommendation; perhaps it could resolve both problems.

dmatej · 2026-04-09T03:09:15Z

Testing:

watch -n 1 "ss -tan state time-wait | grep ':4848' | wc -l"

# Wait until watch goes to 0, then restart the domain and watch the watch. 
asadmin restart-domain

Claude's solution: 900-3000 connections.
Your solution (no RST): 1022-2500
RST with 0 timeout: 2
RST with 1s timeout: 107

Finally Claude recommended just to reduce the logging level in Grizzly.

This "vector" causes the problem:
ServerLifeSignChecker.waitFor -> isListeningOnAnyEndpoint(adminEndpointsSupplier.get()) -> isListening(HostAndPort endpoint)

carryel · 2026-04-09T03:35:11Z

I just checked the past history. Thanks!

There are at least 5 calls to the original method; you replaced just 3 of them. Why? If we don't want to use the RST signal, we should remove it completely.

Considering potential side-effects I might not be aware of, I only changed the minimum number of calls (the ones I was able to verify).

Claude's solution: 900-3000 connections.
Your solution (no RST): 1022-2500
RST with 0 timeout: 2
RST with 1s timeout: 107

Now that I know about that part as well, I’ll look into it a bit more with you.

carryel · 2026-04-09T03:43:00Z

What will happen if we set a timeout after the connection and save the SO_LINGER option?

I will check this as well. I presume it will resolve the original issue to some extent: the problem occurred when connections were closed quickly while still in an incomplete state, because of a large number of connection requests over a very short period, so even a slight delay should prevent it.

PS)
I just tested it.

socket.setSoLinger(true, 1);

The java.net.SocketException: Invalid argument error does not disappear completely, but the change helps as expected: the error now occurs only intermittently and in small numbers. However, the large number of TIME_WAIT connections has not improved.

dmatej · 2026-04-09T03:45:48Z

I nearly forgot about it, but @avpinchuk's note and link refreshed my memory. After all, I would really just reduce the log level of the warning, because any client/proxy/firewall can do the same (close without FIN) and Grizzly has no option other than throwing the connection away.

Me:

Is there some way to reduce the TIME_WAIT time? It seems to be around a minute on my system.

Claude:

Yes, but with an important caveat — TIME_WAIT duration is hardcoded in the Linux kernel as 2 × MSL (Maximum Segment Lifetime) = 60 seconds and is not directly configurable via sysctl.

carryel · 2026-04-09T08:29:48Z

I also confirmed that an unnecessarily large number of TIME_WAIT states were generated when running commands like restart-domain locally. (Approximately 100 to 300 connections, matching the number of connections reproduced here.)

Grizzly warnings in the log file glassfish-grizzly#2285 (comment)

Shifting my perspective slightly, I would like to discuss the issue regarding the excessive TIME_WAIT states mentioned in the following issue.

Repeated port listening checks for short periods.

If the following code is improved as well, there will be no unnecessary (meaningless) connection requests, and the number of resulting TIME_WAIT connections will also decrease.

glassfish/nucleus/common/common-util/src/main/java/com/sun/enterprise/universal/process/ProcessUtils.java

Lines 281 to 286 in b9e2987

    while (deadline == null || Instant.now().isBefore(deadline)) {
        if (sign.get()) {
            return true;
        }
        Thread.onSpinWait();
    }

I think the while loop with Thread.onSpinWait() could be improved for I/O operations such as a socket connect. Thread.onSpinWait() is only a spin-loop hint and returns almost immediately, so the loop issues many unnecessary connect requests within a very short period.

Even with only a 100 ms sleep applied as a brief test, the excessive connection requests and the resulting TIME_WAIT states did not occur.
Of course, there is likely a better way; I added the sleep only for verification purposes.
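
The verification change described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: PollingWait, its parameters, and the use of Thread.sleep are hypothetical; only the shape of the loop comes from the ProcessUtils excerpt.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical helper, named for illustration only.
final class PollingWait {

    /**
     * Polls the sign until it reports true or the timeout expires.
     * A null timeout means "wait forever", mirroring the null deadline in the excerpt.
     */
    static boolean waitFor(Supplier<Boolean> sign, Duration timeout, Duration interval) {
        Instant deadline = timeout == null ? null : Instant.now().plus(timeout);
        while (deadline == null || Instant.now().isBefore(deadline)) {
            if (sign.get()) {
                return true;
            }
            try {
                // Pause between checks instead of spinning, so that each
                // iteration does not immediately open another connection.
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

With a 100 ms interval a one-second wait performs roughly ten checks, whereas the spinning loop performs as many as the connect calls allow.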

dmatej · 2026-04-09T18:25:02Z

That is exactly what I would like to avoid: sleeping for no reason. We don't need the connection; we really want to close it ASAP and not rely on the implicit cleanup done by the OS after 60 seconds.

EDIT: one more idea: what if we used UDP for that? I will try.

EDIT2: That looks quite good, 17. I will try to create a PR so you can test it on Mac; I am not sure how Grizzly reacts to UDP ...
Also, for remote connections UDP might be forbidden, so we will probably need some switch. However, these restarts are usually local.

EDIT3: UDP has a new problem: the socket timeout is always there. So I discussed that with Claude again:

Hmm, the problem here is that the SOCKET_CONNECT_TIMEOUT is set to 1 second, and it really times out after one second. On the other hand, a shorter timeout might be a problem on really slow networks ....

Claude:

Exactly — you've hit the fundamental asymmetry of UDP for this use case:

Scenario	TCP	UDP
Port closed	Instant ConnectionRefused	Instant PortUnreachableException
Port open	Instant success	Must wait for timeout

With TCP you get an immediate answer in both cases. With UDP you only get an immediate answer when the port is closed — open requires waiting for silence.

So UDP Doesn't Actually Help You Here

Your polling loop is:
server down → server starting → server up
The slow case you care about is detecting server up as fast as possible. UDP makes that worse — you have to wait the full timeout every time the server is up, which is the hot path you're polling for.

Back to TCP — The Real Options

You've essentially proven empirically that:

RST linger=0 — perfect behavior, Grizzly warns on Mac

Everything else — either slow or accumulates connections

The options remaining are:

Suppress Grizzly's warning — simplest, no code change

Contribute a fix to Grizzly — handle the race condition where socket is closed before options are set

Accept the warnings — they're cosmetic, not functional

Is the Grizzly warning actually causing any real problem beyond noise in the logs? Because RST linger=0 with 2 connections is genuinely the correct solution for what you're doing.

The point is that those warnings really can just be suppressed; they do not signal any disastrous failure. The channel was simply closed unexpectedly from the other side while Grizzly tried to use it, and Grizzly should just react to that, probably by releasing its own resources (if there are any).

carryel · 2026-04-10T00:20:10Z

I am not sure if my intention was conveyed clearly, so I will explain a little further.

My question was whether there is a reason for checking whether a (local or remote) server is listening to be done urgently within a very short time (about 130 connection attempts on Mac; about 1000 attempts on Windows?).

I was not saying that the socket should be closed immediately with RST or gracefully, regardless of whether the connection succeeds or fails.

I discovered this while investigating why a large number of connect requests were coming in on port 4848 during the process of starting/stopping/restarting the domain.

carryel · 2026-04-10T00:23:47Z

PS) Additionally, I plan to submit a pull request to adjust Grizzly's log level. Since it outputs actual errors as warning logs but ignores them to proceed, I don't see any reason to keep it at the warning level.

carryel · 2026-04-10T11:16:53Z

I have additionally modified the excessive port listening check that occurred for a short duration during the start-domain process mentioned above.

Furthermore, considering the issue of unnecessary persistence of TIME_WAIT, I have applied socket.setSoLinger(true, 1);.

With these modifications, the TIME_WAIT level increased to an understandable level, and no errors occurred in Grizzly on my macOS.

Would it be possible to verify this patch version in a Windows environment, where TIME_WAIT occurs frequently?

dmatej · 2026-04-10T13:49:01Z

My question was whether there is a reason for checking whether a (local or remote) server is listening to be done urgently within a very short time (about 130 connection attempts on Mac; about 1000 attempts on Windows?).

Because we want to make restarts as fast as possible.
Reason 1: Every wait counts; in tests we do a lot of restarts.
Reason 2: Having no additional latency here also led to a lot of related discoveries of resource leaks, timing issues, etc.

Reason 3: Perhaps Grizzly should react to these events perfectly too. I have yet another half-baked PR on the Corba-ORB project with similar issues: race conditions where listeners don't handle start/stop/restart too well. I believe Grizzly is in better shape.

In production GF/Grizzly has to handle thousands of clients in parallel with very different use cases and behaviors, so this is not something that should cause any problems. With any latency and a graceful FIN we would leave hundreds of open ports, which doesn't make any sense to me. It seems like a step back to me.

avpinchuk · 2026-04-10T17:46:52Z

Contribute a fix to Grizzly — handle the race condition where socket is closed before options are set

Unfortunately, Java does not have a reliable way to determine whether a socket has been closed or has received an RST.

We can't get the OS EINVAL error code without using native tools. And using native tools in the fast NIO framework is very expensive. In Windows it's even worse.

dmatej · 2026-04-10T20:37:54Z

Sorry for the conflicting changes made in #25972; the Constants class was removed.

Yet another idea: create some PORT_CHECK_SLEEP (maybe a better name) environment option. It would be unset by default, so the check would be as aggressive as now, but it would be possible to slow it down.
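
A rough sketch of how such an option might be read, purely as an illustration: the class PortCheckInterval, the interpretation of the value as milliseconds, and the fallback to zero for invalid input are all assumptions; only the tentative name PORT_CHECK_SLEEP comes from the comment above.

```java
import java.time.Duration;

// Hypothetical helper, named for illustration only.
final class PortCheckInterval {

    // Tentative name from the discussion; the real option may be named differently.
    static final String ENV_NAME = "PORT_CHECK_SLEEP";

    /** Parses a value in milliseconds (assumed unit); unset or invalid means no sleep. */
    static Duration parse(String rawValue) {
        if (rawValue == null || rawValue.isBlank()) {
            return Duration.ZERO; // unset: stay as aggressive as today
        }
        try {
            long millis = Long.parseLong(rawValue.trim());
            return millis > 0 ? Duration.ofMillis(millis) : Duration.ZERO;
        } catch (NumberFormatException e) {
            return Duration.ZERO;
        }
    }

    static Duration fromEnvironment() {
        return parse(System.getenv(ENV_NAME));
    }
}
```

Keeping the parsing separate from the System.getenv call makes the default behavior easy to check without touching the process environment.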

Another option is to open our own server socket and wait for a callback. We have no problems with shutdown; the problem here is detecting that the server is listening for the subsequent asadmin+REST commands. That is maybe the correct solution. So we would not check whether 4848 is listening; instead the starting server would report itself at the point when it crosses some boundary of the startup process. The socket would be open just on the loopback host, as the server must always be started by the local asadmin command. Then we would check the port just once and we would not even need the RST.
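
The callback idea could be sketched roughly like this. Everything here is an assumption for illustration (the class StartupCallback, its methods, and how the port number would reach the starting server are not part of this PR): the launcher listens on an ephemeral loopback port and the server connects to it once when it is ready.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper, named for illustration only.
final class StartupCallback implements AutoCloseable {

    private final ServerSocket listener;

    StartupCallback() throws IOException {
        // Port 0 picks a free ephemeral port; binding to loopback keeps it local.
        listener = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
    }

    /** Port that would have to be passed to the starting server somehow. */
    int port() {
        return listener.getLocalPort();
    }

    /** Launcher side: waits for the server to report in; timeoutMillis of 0 waits forever. */
    boolean awaitReady(int timeoutMillis) throws IOException {
        listener.setSoTimeout(timeoutMillis);
        try (Socket ignored = listener.accept()) {
            return true;
        } catch (SocketTimeoutException e) {
            return false;
        }
    }

    /** Server side: called once when startup crosses the "ready" boundary. */
    static void reportReady(int callbackPort) throws IOException {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), callbackPort)) {
            // Connecting is the whole signal; no payload is sent in this sketch.
        }
    }

    @Override
    public void close() throws IOException {
        listener.close();
    }
}
```

In this shape there is a single connection per startup instead of a polling loop, which is the property the comment above is after.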

…ening via the admin CLI. - For macOS, immediately closing a client socket results in an incomplete socket state on the server, causing unnecessary server errors. Please refer to the issue below: - eclipse-ee4j/glassfish-grizzly#2285 - Changed the start-domain process to efficiently avoid excessive short-duration port listening check requests: Modified to poll by explicitly specifying an interval.

carryel · 2026-04-13T11:20:35Z

I have resubmitted the PR after resolving the conflict and modifying the variable names as suggested.

However, unfortunately, I was unable to implement the environment option method you suggested. This is because the existing logic references various files such as GlassfishVariable.java, SystemPropertyConstants.java, and CLIConstants.java, and some of them are deprecated, so I was unsure of the proper approach.

OndroMih · 2026-05-01T15:35:45Z

We need to rethink how we address this - I moved this PR to draft to indicate that it's not to be merged for now.

dmatej · 2026-06-04T15:29:13Z

I plan to work on this in 8.0.4 - the callback solution instead of "spamming" the server.

OndroMih approved these changes Apr 8, 2026

avpinchuk reviewed Apr 8, 2026

dmatej requested changes Apr 9, 2026

carryel mentioned this pull request Apr 10, 2026

Adjusting the log level during the initial socket configuration eclipse-ee4j/glassfish-grizzly#2289

Merged

carryel force-pushed the closeGracefully branch 2 times, most recently from 01971a1 to b1ae546 Compare April 10, 2026 11:08

pzygielo reviewed Apr 10, 2026

Comment thread nucleus/admin/cli/src/main/java/com/sun/enterprise/admin/cli/CLIConstants.java Outdated

carryel force-pushed the closeGracefully branch from b1ae546 to d7433b2 Compare April 10, 2026 12:17

carryel force-pushed the closeGracefully branch from d7433b2 to 871bdcd Compare April 13, 2026 10:59

OndroMih marked this pull request as draft May 1, 2026 15:35
