refactor(linux): netlink/nftables host networking and a unified capability-driven VM test suite#12
Conversation
|
Warning Review limit reached
More reviews will be available in 35 minutes and 21 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (25)
📝 WalkthroughWalkthroughThis PR migrates Linux cloud-hypervisor networking from subprocess-based ChangesLinux Networking Migration
Capability-Driven Test Infrastructure
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@fleetboxtest/fleetboxtest_test.go`:
- Line 42: Add a doc comment immediately above the TestBootTimeout function
declaration that begins with "TestBootTimeout" and briefly describes the test's
purpose and behavior (e.g., what it verifies and any important conditions or
expectations); ensure the comment follows Go doc comment style for exported
symbols and is placed directly above the TestBootTimeout func declaration so
linters and readers can pick it up.
In `@internal/backend/cloudhypervisor/forwarding.go`:
- Around line 116-119: The loop in maybeRestoreForwarding currently clears the
fwd-*.orig marker unconditionally; change it so writeForwarding(uplink, orig)
returns/propagates an error and only call store.clearForwardingOrig(uplink) when
writeForwarding succeeded (err == nil). Specifically, in maybeRestoreForwarding,
capture the error from writeForwarding(uplink, orig), log or return the error as
appropriate, and only call store.clearForwardingOrig(uplink) on success; leave
the marker in place on failure so retries can occur.
- Around line 26-31: discoverUplink currently swallows netlink.RouteGet errors
and treats any failure as “no uplink”; update the code that calls
netlink.RouteGet in discoverUplink to check and propagate/log the error instead
of ignoring it (i.e., inspect the error returned from netlink.RouteGet, return
or surface it to the caller so hard failures aren’t misclassified as offline,
and only proceed to build indices when routes is non-nil and err == nil). In
maybeRestoreForwarding, stop clearing the first-writer-wins marker
unconditionally: call writeForwarding and check its returned error, and only
call store.clearForwardingOrig(...) after writeForwarding succeeds; if
writeForwarding fails, preserve the marker and return the error (or log and
retry as appropriate). Reference functions: discoverUplink,
maybeRestoreForwarding, writeForwarding, and store.clearForwardingOrig; ensure
netlink.RouteGet error handling is explicit and restore-marker clearing is
conditional on successful restore.
In `@internal/backend/cloudhypervisor/network.go`:
- Around line 99-105: When nftProbe() fails after the network object is created,
ensure the firewall cleanup is attempted and the WAL network record is removed:
call removeFirewall() before n.Close(), and after attempting removeFirewall()
set the n.fwRemoved flag (or otherwise ensure Close() will remove the WAL
record) so that Close() does not leave a stale record even if removeFirewall()
returns an error; keep error logging from removeFirewall() but proceed to call
n.Close() to delete the WAL entry.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e2eb660e-fbd1-4446-939e-9a9e881a78e6
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (25)
.github/workflows/vm-linux.ymlARCHITECTURE.mdCLAUDE.mdMakefileREADME.mddocs/adr/0011-linux-cloud-hypervisor-backend.mddocs/adr/0013-crash-safe-linux-network-lifecycle.mddocs/adr/0025-linux-netlink-nftables.mdfleetbox_linux.gofleetboxtest/cluster_test.gofleetboxtest/conformance_vm_test.gofleetboxtest/fixtures_vm_test.gofleetboxtest/fleetboxtest.gofleetboxtest/fleetboxtest_test.gofleetboxtest/nested_test.gogo.modinternal/backend/cloudhypervisor/cloudhypervisor.gointernal/backend/cloudhypervisor/forwarding.gointernal/backend/cloudhypervisor/netstate.gointernal/backend/cloudhypervisor/network.gointernal/backend/cloudhypervisor/nftables.gointernal/backend/cloudhypervisor/purehelpers.gointernal/backend/cloudhypervisor/purehelpers_test.gointernal/holder/holder.gointernal/orchestrator/orchestrator.go
…ility-driven VM test suite
8e3778a to
5b3ed89
Compare
This branch pulls together two related pieces of the Linux story.
The first is a rewrite of how the Linux backend programs host networking. It used to build the shared bridge, the per-VM taps, and the egress masquerade by shelling out to
ipandiptables— fragile on exactly the hosts fleetbox is meant to run on (a laptop, a CI runner, a random KVM box), where those binaries may be missing or sitting in a/sbinthe user'sPATHdoesn't include. It now talks to the kernel directly over netlink and nftables from pure Go, so the Linux path depends on no host networking tools at all. While we were in there we also fixed the security posture: instead of flipping the globalip_forwardswitch (which quietly turns the whole machine into a router across every interface) it enables forwarding only on the bridge and the discovered uplink, and it owns a single nftables table per bridge that both masquerades egress and keeps unsolicited inbound out of the guest subnet. The reasoning is written up in ADR-0025, which amends ADR-0011 and ADR-0013.The second piece is a cleanup of the test suite, and it's why the two changes ride together: the cleanest way to prove the new networking code actually works on a live kernel is to run the real tests against it. Until now the VM tests were reachable only through a tangle of make targets,
-runfilters, and build tags — a human had to pick which tests ran where, and the nested "fleetbox-tests-fleetbox" dogfood ran a bespoke copy of the networking checks rather than the real ones. This collapses all of that into one capability-driven surface: the same test code runs everywhere, and each test decides for itself whether to boot a VM based on the speed tier (-short) and a runtime probe of the host — is/dev/kvmthere, does this Mac do nested virt, does the backend support clustering. No-runselectors, no build tags used to pick tests. There are now two obvious entry points,make testfor the quick VM-free tier andmake test-vmfor the full run, instead of four confusing ones.The payoff is concrete. The amd64 KVM CI job now boots a real two-node cluster, so it exercises VM↔VM connectivity over the new nftables/bridge plane for the first time. And the nested dogfood cross-builds the real suite and runs it inside an outer guest on the cloud-hypervisor backend, so the netlink/nftables path gets hit by the actual conformance (egress through the nft masquerade) and cluster (VM↔VM plus subnet isolation) tests instead of a hand-written stand-in — exercising the arm64 direct-kernel path locally on the M4 that hosted CI can't reach.
Checklist
ARCHITECTURE.mdupdated in this PRdocs/adr/(next sequential number)!in the title) → the description spells out what callers must changeSummary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation
Chores