tests(iroh): Regression test for transient Windows recv errors#4348

Open
Frando wants to merge 6 commits into main from Frando/windows-conn-reset

Conversation

@Frando Frando commented Jun 16, 2026

Member

Description

This adds a regression test and fix for #4297 and other transient Windows recv errors. See n0-computer/net-tools#166 for details.

The test fails on Windows, and passes with n0-computer/net-tools#166.

Fixes #4297

Notes & open questions

Will need a release of netwatch (can be a patch release).

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.

github-actions Bot commented Jun 16, 2026

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/4348/docs/iroh/

Last updated: 2026-06-30T13:05:08Z

Frando commented Jun 16, 2026

Member Author

Windows failed as expected: https://github.com/n0-computer/iroh/actions/runs/27609768946/job/81631003142?pr=4348

Now pushing the fix by patching iroh to n0-computer/net-tools#166.

github-actions Bot commented Jun 16, 2026

Netsim report & logs for this PR have been generated and are available at: LOGS
This report will remain available for 3 days.

Last updated for commit: d36155b

@Frando Frando force-pushed the Frando/windows-conn-reset branch from d7a3b64 to ec24e3a Compare June 16, 2026 10:10

@flub flub left a comment

Contributor

very nice! let's wait for the other PR before merging though.

Comment thread iroh/src/endpoint.rs Outdated
// An unreachable relay. Nothing listens on 127.0.0.1:1, and its QADv4 probe target
// at 127.0.0.1:7842 is closed too, so probing it draws the ICMP port-unreachable
// that is emitted from the Windows socket on recv.
let dead_relay: RelayUrl = "https://127.0.0.1:1".parse().expect("valid relay url");

Contributor

I'm always a bit nervous when using a port that you hope is unused. But not really sure what else to suggest so I guess this will have to do.

Contributor

didn't you figure out something better in netwatch itself?

https://github.com/n0-computer/net-tools/blob/051ab8761006d7f2155e34a49f6bb881b582d5ab/netwatch/src/udp.rs#L1189

Calling it "definitely closed" is a bit of a stretch though: there's nothing stopping the kernel from re-using the port right away, but on most machines that's unlikely.

Contributor

@Frando ping

Comment thread Cargo.toml Outdated
unused-async = "warn"

[patch.crates-io]
netwatch = { git = "https://github.com/n0-computer/net-tools", branch = "Frando/ignore-transient-error" }

Contributor

Let's wait to switch this to net-tools main before merging this PR.

Frando added a commit to n0-computer/net-tools that referenced this pull request Jun 16, 2026
## Description

On Windows, `UdpSocket::recv` can return an error if the previous *send*
operation on that socket failed with an ICMP error.

The error either surfaces as `WSAECONNRESET`, which the Rust standard
library converts into a `io::ErrorKind::ConnectionReset`, or as a
`WSAENETRESET`, which becomes `ErrorKind::Uncategorized`.

`WSAECONNRESET` is emitted if the previous send failed due to an *ICMP
port unreachable*, and `WSAENETRESET` if the previous send got a TTL
expired (I think, not sure exactly).

Both of these errors are transient, because they are issued at most once
per sent datagram. The next recv on the socket then returns either a
datagram, a different error, or `WouldBlock`.

This PR changes the behavior such that we ignore these errors.
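
As a rough illustration of the idea (not the actual netwatch change), the classification described above could look like the sketch below. The names `is_transient_recv_error` and `recv_skipping_transient` are hypothetical, and the numeric value 10052 is the Winsock code for `WSAENETRESET`. For simplicity the check is not gated on `cfg(windows)`; a real implementation would likely apply it on Windows only.

```rust
use std::io;

/// Winsock error code for WSAENETRESET. The standard library has no
/// dedicated `ErrorKind` for it, so this sketch matches the raw OS code.
const WSAENETRESET: i32 = 10052;

/// Hypothetical helper: returns true for the two recv errors that the
/// description above treats as transient.
pub fn is_transient_recv_error(err: &io::Error) -> bool {
    // WSAECONNRESET is mapped to `ErrorKind::ConnectionReset` by std.
    err.kind() == io::ErrorKind::ConnectionReset
        // WSAENETRESET is only recognizable through its raw code.
        || err.raw_os_error() == Some(WSAENETRESET)
}

/// Hypothetical helper: calls `recv` until it yields something other than
/// a transient error. Since each transient error is reported at most once
/// per sent datagram, the loop ends with a datagram, a different error,
/// or `WouldBlock`.
pub fn recv_skipping_transient<T>(
    mut recv: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    loop {
        match recv() {
            Err(err) if is_transient_recv_error(&err) => continue,
            other => return other,
        }
    }
}
```

A caller would wrap its socket read, for example `recv_skipping_transient(|| socket.recv(&mut buf))`, so that the transient errors never reach the layers above.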

It comes with a regression test (first commit). I confirmed that it
fails on Windows without the fix in the second commit.

Fixes n0-computer/iroh#4297
Fixes n0-computer/iroh-gossip#149
Also tested in n0-computer/iroh#4348

## Breaking Changes

None

## Notes & open questions

We can and likely should set the `SIO_UDP_CONNRESET` socket option on
the underlying socket, which suppresses at least the `WSAECONNRESET`
error altogether. This would need to happen in `noq_udp` though. And
even if we did it, it's still better to also ignore these errors here
IMO.

## Change checklist
- [x] Self-review.
- [x] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
@n0bot n0bot Bot added this to iroh Jun 16, 2026
@github-project-automation github-project-automation Bot moved this to 🚑 Needs Triage in iroh Jun 16, 2026
@Frando Frando force-pushed the Frando/windows-conn-reset branch from 2823a0b to 35fc343 Compare June 29, 2026 09:44
@Frando Frando changed the title tests(iroh): Fix and regression test for transient Windows recv errors tests(iroh): Regression test for transient Windows recv errors Jun 29, 2026
@Frando Frando force-pushed the Frando/windows-conn-reset branch from 35fc343 to aa1f788 Compare June 29, 2026 10:07
Frando added 5 commits June 29, 2026 15:09
Instead of relying on the hard-coded default QUIC port (7842) being
unused, claim an ephemeral UDP port and close it. This is a more robust
way to get a closed port for the QADv4 probe to draw the ICMP
port-unreachable error the Windows recv bug needs.

Derive the relay url port from a bind-then-close ephemeral port as well,
so neither the relay url nor the QADv4 probe relies on a hard-coded port
happening to be unused.

The relay url is dialed over TCP (HTTPS), so a freed UDP port says
nothing about it. Claim the url port with a TCP listener and the QADv4
probe port with a UDP socket.
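
The bind-then-close approach from these commit messages can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the PR; the helper names `closed_udp_port` and `closed_tcp_port` are hypothetical.

```rust
use std::io;
use std::net::{Ipv4Addr, TcpListener, UdpSocket};

/// Hypothetical helper: binds an ephemeral UDP port on localhost, records
/// its number, and closes the socket again.
pub fn closed_udp_port() -> io::Result<u16> {
    // Port 0 asks the OS to pick a currently free ephemeral port.
    let socket = UdpSocket::bind((Ipv4Addr::LOCALHOST, 0))?;
    let port = socket.local_addr()?.port();
    // Dropping the socket closes it, so nothing listens on `port` anymore.
    drop(socket);
    Ok(port)
}

/// Hypothetical helper: the same idea for TCP, which is what an HTTPS
/// relay URL is dialed over.
pub fn closed_tcp_port() -> io::Result<u16> {
    let listener = TcpListener::bind((Ipv4Addr::LOCALHOST, 0))?;
    let port = listener.local_addr()?.port();
    drop(listener);
    Ok(port)
}
```

A test could then build its unreachable relay URL with something like `format!("https://127.0.0.1:{}", closed_tcp_port()?)`. As noted in the review thread, the OS is free to hand the port to another process right away, so this only makes a collision unlikely rather than impossible.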

flub commented Jun 30, 2026

Contributor

ugh, ci is pretty sad.

Labels

None yet

Projects

Status: 🚑 Needs Triage

Development

Successfully merging this pull request may close these issues.

IpTransport should ignore WSAENETRESET errors on recv

2 participants