From df50641a6168e8fe2771cca4b65d8d86f587f183 Mon Sep 17 00:00:00 2001
From: Vladimir Urushev
Date: Sat, 13 Jun 2026 15:21:53 +0200
Subject: [PATCH] fix(test): assert guest egress over TCP, not ICMP, for VM-boot CI

---
 .github/workflows/vm-linux.yml          | 15 +++++++++++++++
 README.md                               | 16 ++++++++++++++--
 docs/adr/0025-linux-netlink-nftables.md | 12 ++++++++----
 fleetboxtest/conformance_vm_test.go     | 20 ++++++++++----------
 4 files changed, 47 insertions(+), 16 deletions(-)

diff --git a/.github/workflows/vm-linux.yml b/.github/workflows/vm-linux.yml
index bfb7181..cf73990 100644
--- a/.github/workflows/vm-linux.yml
+++ b/.github/workflows/vm-linux.yml
@@ -45,6 +45,21 @@ jobs:
           key: fleetbox-linux-${{ hashFiles('**/go.sum') }}
           restore-keys: fleetbox-linux-
 
+      # Hosted runners run Docker, which sets the iptables FORWARD policy to DROP when
+      # it enables IP forwarding. fleetbox VMs forward through their own non-docker
+      # bridge, so without this their egress is dropped — the same conflict libvirt KVM
+      # and LXD hit on Docker hosts. fleetbox does not override the host firewall, so
+      # allow the VM subnet range (192.168.0.0/16, where the backend hands out /24s)
+      # through DOCKER-USER, which Docker evaluates before its DROP. Both directions:
+      # guest egress (src) and the de-masqueraded return (dst). No-op without Docker.
+      # (Egress is asserted over TCP, not ICMP — the runner network drops outbound ICMP.)
+      - name: Allow VM egress past Docker's FORWARD DROP policy
+        run: |
+          if sudo iptables -L DOCKER-USER -n >/dev/null 2>&1; then
+            sudo iptables -I DOCKER-USER -s 192.168.0.0/16 -j ACCEPT
+            sudo iptables -I DOCKER-USER -d 192.168.0.0/16 -j ACCEPT
+          fi
+
       - name: Boot real VMs (full capability-driven suite)
         run: |
           go test -c -o /tmp/fleetboxtest ./fleetboxtest
diff --git a/README.md b/README.md
index 693d262..08eaff2 100644
--- a/README.md
+++ b/README.md
@@ -83,6 +83,7 @@ Prefer the CLI to the library? On macOS (Apple Silicon) it ships as a Homebrew c
 
 ```bash
 brew tap pilat/fleetbox
+brew trust --cask pilat/fleetbox/fleetbox
 brew install --cask fleetbox
 ```
 
@@ -305,6 +306,13 @@ way, the decision log lives in [docs/adr/](docs/adr/).
   brought back up needs its `/24` to still be free — on a contended host the auto-picked
   subnet can shift and the rebooted VM won't be reachable; bring clusters up fresh. arm64
   Linux boot via rust-hypervisor-firmware is not yet validated on hardware.
+- **Docker on the Linux host blocks VM egress.** When Docker is running it sets the
+  iptables `FORWARD` policy to `DROP`, which drops a fleetbox VM's traffic to the internet
+  (VMs forward through their own bridge, not Docker's) — the same conflict libvirt and LXD
+  hit on Docker hosts. VM↔VM and host↔VM still work; only internet egress is affected.
+  fleetbox deliberately does not rewrite your host firewall, so allow its subnet range
+  yourself: `sudo iptables -I DOCKER-USER -s 192.168.0.0/16 -j ACCEPT` (add the matching
+  `-d 192.168.0.0/16` rule for the return path).
 - **v0 API.** Expect breaking changes until it stabilizes.
 
 ### CI
@@ -319,6 +327,9 @@ aren't re-downloaded every run):
 - run: |
     echo 'KERNEL=="kvm", GROUP="kvm", MODE="0666"' | sudo tee /etc/udev/rules.d/99-kvm.rules
     sudo udevadm control --reload-rules && sudo udevadm trigger
+    # Hosted runners run Docker (FORWARD policy DROP); allow the VM subnet to forward.
+    sudo iptables -I DOCKER-USER -s 192.168.0.0/16 -j ACCEPT
+    sudo iptables -I DOCKER-USER -d 192.168.0.0/16 -j ACCEPT
 - uses: actions/cache@v4
   with:
     path: |
@@ -328,8 +339,9 @@ aren't re-downloaded every run):
 ```
 
 arm64 hosted Linux runners do **not** have KVM ("not supported for this sku"); use an
-x86-64 runner for VM-boot CI. This is the "develop on a Mac, test in cheap x86-64 hosted
-Linux CI" story.
+x86-64 runner for VM-boot CI. Hosted runners also drop outbound **ICMP**, so check a
+guest's internet egress over TCP (a connect to `1.1.1.1:443`), not `ping`. This is the
+"develop on a Mac, test in cheap x86-64 hosted Linux CI" story.
 
 ## Roadmap
 
diff --git a/docs/adr/0025-linux-netlink-nftables.md b/docs/adr/0025-linux-netlink-nftables.md
index 01ed391..54b6638 100644
--- a/docs/adr/0025-linux-netlink-nftables.md
+++ b/docs/adr/0025-linux-netlink-nftables.md
@@ -127,7 +127,8 @@ ceiling is a documented limitation, not a code path.
   fail loudly unless the table, both chains, the masquerade verdict, and the forward
   drop's match all survived.
 - **Accepted egress ceiling.** On a host where Docker or ufw has clamped FORWARD to DROP,
-  the guests cannot reach the internet. Documented, not worked around.
+  the guests cannot reach the internet. Documented, not worked around (the README shows the
+  `DOCKER-USER` allow rules an operator can add if they want egress on such a host).
 - **Irreducible uplink-transit residual.** Keeping the uplink's forwarding flag on permits
   uplink-ingress transit to the host's other routed networks — the inherent cost of routed
   egress without a global clamp. Documented, not chased.
@@ -136,6 +137,9 @@ ceiling is a documented limitation, not a code path.
   (its `MASQUERADE` rule and global `ip_forward` flip are invisible to the new sweep).
   fleetbox is pre-release: delete `~/.fleetbox` by hand when upgrading across this change.
 - **Dogfood-proven, not just compiled.** For network code, compile and lint prove nothing.
-  The VM-boot CI (`vm-linux.yml`) now asserts internet egress over SSH from the booted
-  guest (`ping 1.1.1.1`), so a missing or silently-dropped masquerade rule fails CI rather
-  than passing a green build.
+  The VM-boot CI (`vm-linux.yml`) asserts real internet egress over SSH from the booted
+  guest — a TCP connect to `1.1.1.1:443`, **not** `ping`, because GitHub-hosted runners drop
+  outbound ICMP (the host itself cannot ping out). The runner is itself a Docker host
+  (FORWARD policy DROP), so the workflow first opens the VM subnet through `DOCKER-USER` —
+  the operator's job, per the egress ceiling above; fleetbox does not. A missing or
+  silently-dropped masquerade rule then fails CI rather than passing a green build.
diff --git a/fleetboxtest/conformance_vm_test.go b/fleetboxtest/conformance_vm_test.go
index 94dbe66..da4f2af 100644
--- a/fleetboxtest/conformance_vm_test.go
+++ b/fleetboxtest/conformance_vm_test.go
@@ -55,16 +55,16 @@ func TestVMConformance(t *testing.T) {
 		t.Errorf("SSH output = %q, want it to contain conformance-ok", out)
 	}
 
-	// Egress: the guest must reach the public internet. On Linux this drives the nft
-	// masquerade + per-interface forwarding end to end (ADR-0025) — a missing or
-	// silently-dropped masq rule fails here, which a plain echo-over-SSH would never
-	// catch. Ping by IP so the check does not depend on guest DNS.
-	out, err = vm.SSH(ctx, "ping -c1 -W5 1.1.1.1")
-	if err != nil {
-		t.Fatalf("egress ping failed (guest cannot reach the internet): %v\n%s", err, out)
-	}
-	if !strings.Contains(out, "0% packet loss") {
-		t.Fatalf("egress ping got no reply (no internet egress):\n%s", out)
+	// Egress: the guest must reach the public internet. This drives the nft masquerade
+	// + per-interface forwarding end to end (ADR-0025) — a missing or silently-dropped
+	// masq rule fails here, which a plain echo-over-SSH would never catch. Use a TCP
+	// connect, NOT ICMP ping: some CI networks (notably GitHub-hosted runners) drop
+	// outbound ICMP while allowing TCP, so a ping would be a false negative even when
+	// egress works. 1.1.1.1:443 is a stable, DNS-independent target; bash's /dev/tcp
+	// needs no extra package in the stock guest.
+	out, err = vm.SSH(ctx, "timeout 8 bash -c 'exec 3<>/dev/tcp/1.1.1.1/443 && echo egress-ok'")
+	if err != nil || !strings.Contains(out, "egress-ok") {
+		t.Fatalf("egress failed (guest cannot open TCP to the internet): %v\n%s", err, out)
 	}
 
 	// Stop gracefully (disk preserved), then Destroy removes everything — the full