tests(iroh): Regression test for transient Windows recv errors#4348
tests(iroh): Regression test for transient Windows recv errors#4348Frando wants to merge 6 commits into
Conversation
d7a3b64 to
1e7c835
Compare
|
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/4348/docs/iroh/ Last updated: 2026-06-30T13:05:08Z |
|
Windows failed as expected: https://github.com/n0-computer/iroh/actions/runs/27609768946/job/81631003142?pr=4348 Now pushing the fix by patching iroh to n0-computer/net-tools#166. |
d7a3b64 to
ec24e3a
Compare
flub
left a comment
There was a problem hiding this comment.
very nice! let's wait for the other PR before merging though.
| // An unreachable relay Nothing listens on 127.0.0.1:1, and its QADv4 probe target | ||
| // at 127.0.0.1:7842 is closed too, so probing it draws the ICMP port-unreachable | ||
| // that is emitted from the Windows socket on recv. | ||
| let dead_relay: RelayUrl = "https://127.0.0.1:1".parse().expect("valid relay url"); |
There was a problem hiding this comment.
I'm always a bit nervous when using a port that you hope is unused. But not really sure what else to suggest so I guess this will have to do.
There was a problem hiding this comment.
didn't you figure out something better in netwatch itself?
Calling it "definitely closed" is a bit of a stretch though: there's nothing stopping the kernel from re-using the port right away, but on most machines that's unlikely.
| unused-async = "warn" | ||
|
|
||
| [patch.crates-io] | ||
| netwatch = { git = "https://github.com/n0-computer/net-tools", branch = "Frando/ignore-transient-error" } |
There was a problem hiding this comment.
Let's wait to switch this to net-tools main before merging this PR.
## Description On Windows, `UdpSocket::recv` can return an error if the previous *send* operation on that socket failed with an ICMP error. The error either surfaces as `WSAECONNRESET`, which the Rust standard library converts into a `io::ErrorKind::ConnectionReset`, or as a `WSAENETRESET`, which becomes `ErrorKind::Uncategorized`. `WSAECONNRESET` is emitted if the previous send failed due to a *ICMP port unreachable*, and `WSAENETRESET` if the previous send got a TTL expired (I think, not sure exactly). Both of these errors are transient, because they are issued max. once per sent datagram. The next recv on the socket then returns either a datagram, a different error, or WouldBlock. This PR changes the behavior such that we ignore these errors. It comes with a regression test (first commit). I confirmed that it fails on windows without the fix in the second commit. Fixes n0-computer/iroh#4297 Fixes n0-computer/iroh-gossip#149 Also tested in n0-computer/iroh#4348 ## Breaking Changes None ## Notes & open questions We can and likely should set the `SIO_UDP_CONNRESET` socket option on the underlying socket, which supresses at least the `WSAECONNRESET` error altogether. This would need to happen in `noq_udp` though. And even if we did it, it's still better to also ignore these errors here IMO. ## Change checklist - [x] Self-review. - [x] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [x] Tests if relevant. - [x] All breaking changes documented.
2823a0b to
35fc343
Compare
35fc343 to
aa1f788
Compare
Instead of relying on the hard-coded default QUIC port (7842) being unused, claim an ephemeral UDP port and close it. This is a more robust way to get a closed port for the QADv4 probe to draw the ICMP port-unreachable error the Windows recv bug needs.
Derive the relay url port from a bind-then-close ephemeral port as well, so neither the relay url nor the QADv4 probe relies on a hard-coded port happening to be unused.
The relay url is dialed over TCP (HTTPS), so a freed UDP port says nothing about it. Claim the url port with a TCP listener and the QADv4 probe port with a UDP socket.
|
ugh, ci is pretty sad. |
Description
This adds a regression test and fix for #4297 and other transient Windows recv errors. See n0-computer/net-tools#166 for details.
The test fails on Windows, and passes with n0-computer/net-tools#166.
Fixes #4297
Notes & open questions
Will need a release of
netwatch(can be a patch release).Change checklist