Support for graceful socket closes when checking server endpoint listening via the admin CLI#25973
Support for graceful socket closes when checking server endpoint listening via the admin CLI#25973carryel wants to merge 1 commit into
Conversation
| // Force RST on close | ||
| socket.setSoLinger(true, 0); | ||
| } | ||
| socket.connect(endpoint.toInetSocketAddress(), SOCKET_CONNECT_TIMEOUT); |
There was a problem hiding this comment.
What will happen if we set some timeout after connection and save SO_LINGER option?
The SO_LINGER option was introduced to close a bunch of ephemeral ports in the TIME_WAIT state.
There was a problem hiding this comment.
|
The problem is that without the RST those connections will not be closed immediately and it brings issues with restarts and number of allocated ports. However looking at https://ipwithease.com/tcp-rst-flag/ I would say that the original was correct, and server should handle that. Probably log it just as a trace/finest level as it may happen any time with any other client.
Without that I was able to exhaust some limits. So I believe the server should just ignore this error. I have yet one thought ... in the past Java cached connections internally, now it should be possible to close them explicitly. Then it might be possible that we would not need to use the |
|
I asked Claude, it confirmed both mentioned problems - Mac issue and TIMED_WAITING without RST and offered this solution: public boolean isReachable(Endpoint endpoint) {
try (SocketChannel channel = SocketChannel.open();
Selector selector = Selector.open()) {
channel.configureBlocking(false);
channel.register(selector, SelectionKey.OP_CONNECT);
channel.connect(endpoint.toInetSocketAddress());
if (selector.select(SOCKET_CONNECT_TIMEOUT) == 0) {
return false; // timed out, no established connection → no TIME_WAIT
}
SelectionKey key = selector.selectedKeys().iterator().next();
if (key.isConnectable()) {
return channel.finishConnect(); // true = success
}
return false;
} catch (IOException e) {
LOG.log(TRACE, "Socket check to " + endpoint + " failed", e);
return false;
}
}That is very close to the original code I replaced in 55797c6 which probably did not work properly too as it was missing the |
dmatej
left a comment
There was a problem hiding this comment.
The problem is that "graceful" disconnect leads to thousands of connections in a state TIMED_WAITING.
I will try the Claude's recommendation, perhaps it could resolve both problems.
|
Testing: watch -n 1 "ss -tan state time-wait | grep ':4848' | wc -l"# Wait until watch goes to 0, then restart the domain and watch the watch.
asadmin restart-domainClaude's solution: 900-3000 connections. Finally Claude recommended just to reduce the logging level in Grizzly. This "vector" causes the problem: |
|
I just checked the past history. Thanks!
Considering potential side-effects I might not be aware of, I only changed the minimum number of calls (the ones I was able to verify).
Now that I know about that part as well, I’ll look into it a bit more with you. |
I will check this as well. It is presumed that doing this will resolve the original issue to some extent. This is because the problem occurred when a connection was closed in an incomplete state quickly due to a large number of connection requests over a very short period, so it is estimated that even a slight delay will prevent the issue. PS)
The |
|
I nearly forgot about it, but @avpinchuk's note and link refreshed my memory. After all I would really just reduced the log level of the warning, because any client/proxy/firewall can do that too (close without FIN) and Grizzly doesn't have any options to do something except throwing the connection away. Me:
Claude:
|
|
I also confirmed that an unnecessarily large number of TIME_WAIT states were generated when running commands like Shifting my perspective slightly, I would like to discuss the issue regarding the excessive TIME_WAIT states mentioned in the following issue.
If improvements are made in the following area as well, there will be no unnecessary (meaningless) connection requests, and the resulting TIME_WAIT requests will also decrease. I thought it would be good if the while loop and Even when only a 100ms sleep was applied for a brief test, excessive connection requests and the resulting TIME_WAIT did not occur. |
|
That is exactly what I would like to avoid - to sleep for no reason. We don't need the connection, we really want to close it asap, not to rely on implicit cleanup made by OS after 60 seconds. EDIT: yet one idea, what if we would use UDP for that? I will try. EDIT2: That looks quite well, 17. I will try to create a PR so you could test it on Mac - I am not sure how Grizzly can react on UDP ... EDIT3: UDP has new problem - the socket timeout is always there. So I discussed that with Claude again:
Claude:
The point is that those warnings really can be just suppressed, they don't mean any warning about disaster closing. Just the channel was unexpectedly closed from the other side while Grizzly tried to use it, and it should just reflect that probably by releasing own resources (if there are some). |
|
I am not sure if my intention was conveyed clearly, so I will explain a little further. My question was whether there is a reason for checking whether a (local or remote) server is listening to be done urgently within a very short time (about 130 connection attempts on Mac; about 1000 attempts on Windows?). I was not saying that the socket should be closed immediately with I discovered this while investigating why a large number of connect requests were coming in on port 4848 during the process of starting/stopping/restarting the domain. |
|
PS) Additionally, I plan to submit a pull request to adjust Grizzly's log level. Since it outputs actual errors as warning logs but ignores them to proceed, I don't see any reason to keep it at the warning level. |
01971a1 to
b1ae546
Compare
|
I have additionally modified the excessive port listening check that occurred for a short duration during the Furthermore, considering the issue of unnecessary persistence of TIME_WAIT, I have applied With these modifications, the TIME_WAIT level increased to an understandable level, and no errors occurred in Grizzly on my macOS. Would it be possible to verify this patch version in a Windows environment, where TIME_WAIT occurs frequently? |
b1ae546 to
d7433b2
Compare
Because we want to make restarts fast as much as possible. Reason 3: Perhaps Grizzly should react to these events perfectly too. I have yet another half baked PR on Corba-ORB project with similar issues, race conditions, where listeners don't handle start/stop/restart too well. I believe Grizzly is in better shape. On production GF/Grizzly has to handle thousands of clients in parallel with very different use cases, behaviors, so this is not something what should cause any problems. With any latency and graceful FIN we would leave hundreds open ports, which doesn't make any sense to me. It seems as a step back to me. |
Unfortunately, Java does not have a reliable way to determine whether a socket has been closed or has received an RST. We can't get the OS EINVAL error code without using native tools. And using native tools in the fast NIO framework is very expensive. In Windows it's even worse. |
|
Sorry for the conflicting changes made in #25972 , the Constants class was removed. Yet another idea - create some PORT_CHECK_SLEEP (maybe a better name) environment option. Unset by default, so it would be so aggressive as now, but it would be possible to slow it down. Another option is to open own server socket and wait for callback; because we have no problems with shutdown, the problem here is the detection that the server is listening to following asadmin+REST commands. That is maybe the correct solution. So we would not check if the 4848 is listening, but starting server would report itself at the point when it would cross some boundary of the startup process. The socket would be open just for loopback host as the server must be always started by the local asadmin command... then we would check the port just once and we don't even need the RST. |
…ening via the admin CLI. - For macOS, immediately closing a client socket results in an incomplete socket state on the server, causing unnecessary server errors. Please refer to the issue below: - eclipse-ee4j/glassfish-grizzly#2285 - Changed the start-domain process to efficiently avoid excessive short-duration port listening check requests: Modified to poll by explicitly specifying an interval.
d7433b2 to
871bdcd
Compare
|
I have resubmitted the PR after resolving the conflict and modifying the variable names as suggested. However, unfortunately, I was unable to implement the environment option method you suggested. This is because the existing logic references various files such as |
|
We need to rethink how we address this - I moved this PR to draft to indicate that it's not to be merged for now. |
|
I plan to work on this in 8.0.4 - the callback solution instead of "spamming" the server. |
For macOS, immediately closing a client socket results in an incomplete socket state on the server, causing unnecessary server errors. Please refer to the issue below: