
Support for graceful socket closes when checking server endpoint listening via the admin CLI#25973

Draft
carryel wants to merge 1 commit into eclipse-ee4j:main from carryel:closeGracefully

carryel commented Apr 8, 2026

For macOS, immediately closing a client socket results in an incomplete socket state on the server, causing unnecessary server errors. Please refer to the issue below:

// Force RST on close
socket.setSoLinger(true, 0);
}
socket.connect(endpoint.toInetSocketAddress(), SOCKET_CONNECT_TIMEOUT);

avpinchuk commented Apr 8, 2026

Contributor

What will happen if we set some timeout after the connection and keep the SO_LINGER option?

The SO_LINGER option was introduced to get rid of a bunch of ephemeral ports left in the TIME_WAIT state.
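
For readers following the SO_LINGER discussion, here is a minimal sketch of a port check that closes with RST (linger 0) or with a short linger. It is not the actual GlassFish code: the class name `PortCheckSketch`, the `lingerSeconds` parameter, and the 1 second timeout constant are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class PortCheckSketch {

    // Assumed value for illustration only.
    private static final int SOCKET_CONNECT_TIMEOUT_MILLIS = 1000;

    /**
     * Returns true if a TCP connection to the endpoint could be established.
     * With lingerSeconds == 0 the close sends RST instead of FIN, so the
     * client side does not leave a connection in TIME_WAIT.
     */
    public static boolean isListening(InetSocketAddress endpoint, int lingerSeconds) {
        try (Socket socket = new Socket()) {
            socket.setSoLinger(true, lingerSeconds);
            socket.connect(endpoint, SOCKET_CONNECT_TIMEOUT_MILLIS);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or otherwise unreachable.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        InetSocketAddress endpoint;
        // A throwaway listener on an ephemeral loopback port stands in for the server.
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            endpoint = new InetSocketAddress(server.getInetAddress(), server.getLocalPort());
            System.out.println("while listening: " + isListening(endpoint, 0));
        }
        System.out.println("after close: " + isListening(endpoint, 0));
    }
}
```

Passing `1` instead of `0` as `lingerSeconds` corresponds to the "timeout plus SO_LINGER" variant asked about above.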

dmatej commented Apr 9, 2026
Contributor

The problem is that without the RST those connections will not be closed immediately, which causes issues with restarts and the number of allocated ports.
There are at least 5 calls of the original method, you replaced just 3 of them. Why? If we don't want to use the RST signal, we should remove it completely.

However looking at https://ipwithease.com/tcp-rst-flag/ I would say that the original was correct, and server should handle that. Probably log it just as a trace/finest level as it may happen any time with any other client.

If there is a router doing NAT, especially a low end router with few resources, it will age the oldest TCP sessions first. To do this it sets the RST flag in the packet that effectively tells the receiving station to (very ungracefully) close the connection. this is done to save resources.

Some firewalls do that if a connection is idle for x number of minutes. Some ISPs set their routers to do that for various reasons as well.

Without that I was able to exhaust some limits.

So I believe the server should just ignore this error.

I have one more thought ... in the past Java cached connections internally; now it should be possible to close them explicitly. Then we might not need socket.setSoLinger(true, 0); if we always disconnected. I will check that ... I have to remember how I was counting those connections ...

dmatej commented Apr 9, 2026
Contributor

I asked Claude; it confirmed both mentioned problems (the Mac issue and TIME_WAIT without RST) and offered this solution:

public boolean isReachable(Endpoint endpoint) {
    try (SocketChannel channel = SocketChannel.open();
         Selector selector = Selector.open()) {
        
        channel.configureBlocking(false);
        channel.register(selector, SelectionKey.OP_CONNECT);
        channel.connect(endpoint.toInetSocketAddress());

        if (selector.select(SOCKET_CONNECT_TIMEOUT) == 0) {
            return false; // timed out, no established connection → no TIME_WAIT
        }

        SelectionKey key = selector.selectedKeys().iterator().next();
        if (key.isConnectable()) {
            return channel.finishConnect(); // true = success
        }
        return false;
    } catch (IOException e) {
        LOG.log(TRACE, "Socket check to " + endpoint + " failed", e);
        return false;
    }
}

That is very close to the original code I replaced in 55797c6, which probably did not work properly either, as it was missing the key.isConnectable part. Now the question is whether Claude's code is completely correct.

dmatej left a comment

Contributor

The problem is that a "graceful" disconnect leads to thousands of connections in the TIME_WAIT state.
I will try Claude's recommendation; perhaps it could resolve both problems.

dmatej commented Apr 9, 2026
Contributor

Testing:

watch -n 1 "ss -tan state time-wait | grep ':4848' | wc -l"
# Wait until watch goes to 0, then restart the domain and watch the watch. 
asadmin restart-domain

Claude's solution: 900-3000 connections.
Your solution (no RST): 1022-2500
RST with 0 timeout: 2
RST with 1s timeout: 107

Finally Claude recommended just to reduce the logging level in Grizzly.

This "vector" causes the problem:
ServerLifeSignChecker.waitFor -> isListeningOnAnyEndpoint(adminEndpointsSupplier.get()) -> isListening(HostAndPort endpoint)

carryel commented Apr 9, 2026
Author

I just checked the past history. Thanks!

There are at least 5 calls to the original method; you replaced just 3 of them. Why? If we don't want to use the RST signal, we should remove it completely.

Considering potential side-effects I might not be aware of, I only changed the minimum number of calls (the ones I was able to verify).

Claude's solution: 900-3000 connections.
Your solution (no RST): 1022-2500
RST with 0 timeout: 2
RST with 1s timeout: 107

Now that I know about that part as well, I’ll look into it a bit more with you.

carryel commented Apr 9, 2026
Author

What will happen if we set a timeout after the connection and save the SO_LINGER option?

I will check this as well. I presume it will resolve the original issue to some extent: the problem occurred when connections were closed quickly in an incomplete state because of a large number of connection requests over a very short period, so even a slight delay should prevent it.

PS)
I just tested it.

socket.setSoLinger(true, 1);

The java.net.SocketException: Invalid argument error does not disappear completely, but the change helps as expected: the error now occurs only intermittently and in small numbers. However, the large number of TIME_WAIT connections has not improved.

dmatej commented Apr 9, 2026
Contributor

I nearly forgot about it, but @avpinchuk's note and link refreshed my memory. After all, I would really just reduce the log level of the warning, because any client/proxy/firewall can do that too (close without FIN) and Grizzly has no option other than throwing the connection away.

Me:

Is there some way to reduce the TIME_WAIT time? It seems it is around a minute on my system.

Claude:

Yes, but with an important caveat — TIME_WAIT duration is hardcoded in the Linux kernel as 2 × MSL (Maximum Segment Lifetime) = 60 seconds and is not directly configurable via sysctl.

carryel commented Apr 9, 2026
Author

I also confirmed that an unnecessarily large number of TIME_WAIT states were generated when running commands like restart-domain locally. (Approximately 100 to 300 connections, matching the number of connections reproduced here.)

Shifting my perspective slightly, I would like to discuss the issue regarding the excessive TIME_WAIT states mentioned in the following issue.

  1. Repeated port listening checks for short periods.

If improvements are made in the following area as well, there will be no unnecessary (meaningless) connection requests, and the resulting TIME_WAIT requests will also decrease.

while (deadline == null || Instant.now().isBefore(deadline)) {
    if (sign.get()) {
        return true;
    }
    Thread.onSpinWait();
}

I thought it would be good to improve the while loop and Thread.onSpinWait() for I/O operations such as a socket connect. In practice Thread.onSpinWait() pauses only very briefly, so the loop issues many unnecessary connect requests within a short period.

Even with only a 100 ms sleep applied for a brief test, the excessive connection requests and the resulting TIME_WAIT connections did not occur.
Of course, there is likely a better way. I added the sleep only for verification purposes.
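
The interval-based polling described above could look roughly like the following sketch. It is only an illustration of the idea, not the patch in this PR: the class name `IntervalPollingSketch`, the `interval` parameter, and the use of `BooleanSupplier` are assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

final class IntervalPollingSketch {

    /**
     * Polls the sign until it reports true or the timeout elapses.
     * A null timeout means "wait forever", mirroring the loop quoted above.
     * Sleeping between probes bounds the number of connect attempts, at the
     * cost of up to one interval of extra latency.
     */
    public static boolean waitFor(BooleanSupplier sign, Duration timeout, Duration interval)
            throws InterruptedException {
        Instant deadline = timeout == null ? null : Instant.now().plus(timeout);
        while (deadline == null || Instant.now().isBefore(deadline)) {
            if (sign.getAsBoolean()) {
                return true;
            }
            Thread.sleep(interval.toMillis());
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger probes = new AtomicInteger();
        // Stand-in for the real port check: succeeds on the third probe.
        boolean up = waitFor(() -> probes.incrementAndGet() >= 3,
                Duration.ofSeconds(5), Duration.ofMillis(100));
        System.out.println("up=" + up + ", probes=" + probes.get());
    }
}
```

The trade-off is the one debated in this thread: fewer probes and fewer TIME_WAIT connections, but detection can lag by up to one interval.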

dmatej commented Apr 9, 2026
Contributor

That is exactly what I would like to avoid - to sleep for no reason. We don't need the connection, we really want to close it asap, not to rely on implicit cleanup made by OS after 60 seconds.

EDIT: one more idea, what if we used UDP for that? I will try.

EDIT2: That looks quite good, 17. I will try to create a PR so you can test it on Mac - I am not sure how Grizzly reacts to UDP ...
Also, UDP might be forbidden for remote connections, so we will probably need some switch. However, these restarts are usually local.

EDIT3: UDP has a new problem - the socket timeout is always there. So I discussed that with Claude again:

Hmm, the problem here is that the SOCKET_CONNECT_TIMEOUT is set to 1 second, and it really times out after one second. On the other hand shorter timeout might be problem on really slow networks ....

Claude:

Exactly — you've hit the fundamental asymmetry of UDP for this use case:

| Scenario | TCP | UDP |
| --- | --- | --- |
| Port closed | Instant ConnectionRefused | Instant PortUnreachableException |
| Port open | Instant success | Must wait for timeout |

With TCP you get an immediate answer in both cases. With UDP you only get an immediate answer when the port is closed — open requires waiting for silence.

So UDP Doesn't Actually Help You Here

Your polling loop is:

server down → server starting → server up

The slow case you care about is detecting server up as fast as possible. UDP makes that worse — you have to wait the full timeout every time the server is up, which is the hot path you're polling for.
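
The asymmetry can be illustrated with a small UDP probe. This is a sketch under assumptions, not code from the PR: `UdpProbeSketch` and its `Result` enum are hypothetical names, and whether a closed port is reported as `PortUnreachableException` depends on the operating system delivering the ICMP error to the connected socket.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.PortUnreachableException;
import java.net.SocketTimeoutException;

final class UdpProbeSketch {

    enum Result { ANSWERED, CLOSED, NO_ANSWER }

    /** Sends one byte and waits up to timeoutMillis for any reply. */
    public static Result probe(InetSocketAddress target, int timeoutMillis) throws IOException {
        try (DatagramSocket socket = new DatagramSocket()) {
            // Connecting lets the OS report ICMP "port unreachable" to this socket.
            socket.connect(target);
            socket.setSoTimeout(timeoutMillis);
            byte[] payload = {0};
            socket.send(new DatagramPacket(payload, payload.length, target));
            socket.receive(new DatagramPacket(new byte[1], 1));
            return Result.ANSWERED;
        } catch (PortUnreachableException e) {
            return Result.CLOSED; // fast path: nothing is bound to the port
        } catch (SocketTimeoutException e) {
            return Result.NO_ANSWER; // slow path: open but silent, full timeout spent
        }
    }

    public static void main(String[] args) throws IOException {
        // A bound socket that never replies stands in for an "open" UDP port.
        try (DatagramSocket silent = new DatagramSocket(0, InetAddress.getLoopbackAddress())) {
            InetSocketAddress target =
                    new InetSocketAddress(InetAddress.getLoopbackAddress(), silent.getLocalPort());
            long start = System.nanoTime();
            Result result = probe(target, 1000);
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(result + " after " + elapsedMillis + " ms");
        }
    }
}
```

An open port that does not answer costs the whole timeout on every probe, which is the case the polling loop hits once the server is up.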

Back to TCP — The Real Options

You've essentially proven empirically that:

  1. RST linger=0 — perfect behavior, Grizzly warns on Mac
  2. Everything else — either slow or accumulates connections

The options remaining are:

  • Suppress Grizzly's warning — simplest, no code change
  • Contribute a fix to Grizzly — handle the race condition where socket is closed before options are set
  • Accept the warnings — they're cosmetic, not functional

Is the Grizzly warning actually causing any real problem beyond noise in the logs? Because RST linger=0 with 2 connections is genuinely the correct solution for what you're doing.

The point is that those warnings really can just be suppressed; they do not signal anything disastrous. The channel was simply closed unexpectedly from the other side while Grizzly tried to use it, and Grizzly should just react to that, probably by releasing its own resources (if there are any).

carryel commented Apr 10, 2026
Author

I am not sure if my intention was conveyed clearly, so I will explain a little further.

My question was whether there is a reason for checking whether a (local or remote) server is listening to be done urgently within a very short time (about 130 connection attempts on Mac; about 1000 attempts on Windows?).

I was not saying that the socket should be closed immediately with RST or gracefully, regardless of whether the connection succeeds or fails.

I discovered this while investigating why a large number of connect requests were coming in on port 4848 during the process of starting/stopping/restarting the domain.

carryel commented Apr 10, 2026
Author

PS) Additionally, I plan to submit a pull request to adjust Grizzly's log level. Since it outputs actual errors as warning logs but ignores them to proceed, I don't see any reason to keep it at the warning level.

carryel commented Apr 10, 2026
Author

I have additionally modified the excessive port listening check that occurred for a short duration during the start-domain process mentioned above.

Furthermore, considering the issue of unnecessary persistence of TIME_WAIT, I have applied socket.setSoLinger(true, 1);.

With these modifications, the TIME_WAIT level increased to an understandable level, and no errors occurred in Grizzly on my macOS.

Would it be possible to verify this patch version in a Windows environment, where TIME_WAIT occurs frequently?

Comment thread nucleus/admin/cli/src/main/java/com/sun/enterprise/admin/cli/CLIConstants.java Outdated
dmatej commented Apr 10, 2026
Contributor

My question was whether there is a reason for checking whether a (local or remote) server is listening to be done urgently within a very short time (about 130 connection attempts on Mac; about 1000 attempts on Windows?).

Because we want to make restarts as fast as possible.
Reason 1: Every wait counts; in tests we do a lot of restarts.
Reason 2: Having no additional latency here also led to a lot of related discoveries of resource leaks, timing issues etc.

Reason 3: Perhaps Grizzly should react to these events perfectly too. I have yet another half-baked PR on the Corba-ORB project with similar issues, race conditions, where listeners don't handle start/stop/restart too well. I believe Grizzly is in better shape.

In production GF/Grizzly has to handle thousands of clients in parallel with very different use cases and behaviors, so this is not something that should cause any problems. With any latency and a graceful FIN we would leave hundreds of open ports, which doesn't make any sense to me. It seems like a step back to me.

avpinchuk commented Apr 10, 2026

Contributor

  • Contribute a fix to Grizzly — handle the race condition where socket is closed before options are set

Unfortunately, Java does not have a reliable way to determine whether a socket has been closed or has received an RST.

We can't get the OS EINVAL error code without using native tools. And using native tools in the fast NIO framework is very expensive. In Windows it's even worse.

dmatej commented Apr 10, 2026
Contributor

Sorry for the conflicting changes made in #25972, the Constants class was removed.

Yet another idea - create some PORT_CHECK_SLEEP (maybe a better name) environment option. Unset by default, so it would be as aggressive as now, but it would be possible to slow it down.

Another option is to open our own server socket and wait for a callback. We have no problems with shutdown; the problem here is detecting that the server is listening for the following asadmin and REST commands. That is maybe the correct solution. So we would not check whether 4848 is listening; instead, the starting server would report itself once it crosses some boundary of the startup process. The socket would be open just for the loopback host, as the server must always be started by the local asadmin command... then we would check the port just once and we don't even need the RST.
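
A minimal sketch of the callback idea, assuming the launcher hands its callback port to the starting server in some way not shown here. The names `ReadyCallbackSketch`, `awaitReady`, and `reportReady` are hypothetical, and no payload or authentication is modelled.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ReadyCallbackSketch {

    /**
     * Launcher side: blocks until something connects to the callback socket.
     * timeoutMillis must be positive, because 0 means "wait forever" for setSoTimeout.
     */
    public static boolean awaitReady(ServerSocket callback, int timeoutMillis) throws IOException {
        callback.setSoTimeout(timeoutMillis);
        try (Socket ignored = callback.accept()) {
            return true;
        } catch (SocketTimeoutException e) {
            return false;
        }
    }

    /** Server side: reports readiness by connecting once to the callback port. */
    public static void reportReady(int callbackPort) throws IOException {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), callbackPort)) {
            // The connection itself is the signal; nothing is written.
        }
    }

    public static void main(String[] args) throws Exception {
        // Bound to loopback only, since the server is started by a local command.
        try (ServerSocket callback = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            int port = callback.getLocalPort();
            Thread server = new Thread(() -> {
                try {
                    Thread.sleep(200); // simulated startup work
                    reportReady(port);
                } catch (IOException | InterruptedException e) {
                    throw new IllegalStateException(e);
                }
            });
            server.start();
            System.out.println("ready: " + awaitReady(callback, 5000));
            server.join();
        }
    }
}
```

With this shape the launcher makes no repeated connect attempts: a single accepted connection replaces the polling loop.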

…ening via the admin CLI.

- For macOS, immediately closing a client socket results in an incomplete socket state on the server, causing unnecessary server errors. Please refer to the issue below:
- eclipse-ee4j/glassfish-grizzly#2285
- Changed the start-domain process to efficiently avoid excessive short-duration port listening check requests: Modified to poll by explicitly specifying an interval.

carryel commented Apr 13, 2026
Author

I have resubmitted the PR after resolving the conflict and modifying the variable names as suggested.

However, unfortunately, I was unable to implement the environment option method you suggested. This is because the existing logic references various files such as GlassfishVariable.java, SystemPropertyConstants.java, and CLIConstants.java, and some of them are deprecated, so I was unsure of the proper approach.

OndroMih marked this pull request as draft May 1, 2026 15:35
OndroMih commented May 1, 2026
Contributor

We need to rethink how we address this - I moved this PR to draft to indicate that it's not to be merged for now.

dmatej commented Jun 4, 2026
Contributor

I plan to work on this in 8.0.4 - the callback solution instead of "spamming" the server.


5 participants