fix: security audit remediation (6 findings) + fork-correctness reseed fail-closed by stubbi · Pull Request #159 · paperclipinc/mitos

stubbi · 2026-06-17T18:19:16Z

Summary

Whole-codebase security audit plus a fork-correctness fix. Six findings, each fixed with tests and a docs/threat-model.md delta in the same commit. gofmt, golangci-lint (darwin and linux), helm lint, go build ./..., and the full controller envtest suite pass locally.

What changed

dnsproxy: the DNS-rebind pin filter now decodes IPv6 addresses that embed a private or link-local IPv4 inside the NAT64 well-known prefix (64:ff9b::/96) or 6to4 (2002::/16) and re-applies the policy to the embedded IPv4, and blocks deprecated site-local fec0::/10. Closes a DNS64/NAT64 path to cloud IMDS (169.254.169.254). A NAT64 wrapper of a public IPv4 stays allowed.
deploy: optional kernelProvisioner.kernelSha256 verifies the staged guest kernel (fresh download and an already-cached file) and fails closed on mismatch; automountServiceAccountToken: false on the privileged forkd, kvm-device-plugin, and kernel-stage pods (none call the Kubernetes API).
controller: selectDormantHuskPod requires the controller owner reference (Controller=true, BlockOwnerDeletion=true) before a husk pod is an activation target, so a tenant-planted decoy cannot receive another claim's secrets and bearer token. The forgery barrier is BlockOwnerDeletion, protected by OwnerReferencesPermissionEnforcement.
husk: a new nftables input-hook chain (husk path) lets the guest reach only the in-pod resolver on 53 and drops every other guest-sourced packet to a pod-local address, closing guest reachability to the in-pod sandbox API and mTLS control listeners that forward-only filtering missed.
controller: a validating admission webhook binds SandboxClaim.spec.serviceAccount to authorization (a SubjectAccessReview requires the creator to be able to impersonate the named ServiceAccount), so the memory-snapshot principal field can no longer be self-asserted. Default off; opt-in via admissionWebhook.enabled and --enable-principal-webhook.
guest (§1 fork correctness): reseedCRNG reported success on an uncredited write fallback, which silently defeated the host fail-closed reseed gate (a fork that could not be credibly reseeded would serve duplicate CRNG output). It now reports success only on a credited RNDADDENTROPY, so the reseed contract is credited-or-refused end to end. The host gate was already fail-closed on all engines; the stale threat-model text claiming otherwise was corrected.

Verification status and pre-merge gates

Fully verified locally: 1, 2, 3 (controller envtest green), 6 (linux go vet + test-binary compile; guest is linux-only so RED/GREEN runs in the go-test CI job).
Gated on CI before relying on them, called out in the commits and threat model:
- 4 (egress input hook): unit-tested; the husk-network KVM e2e (test/cluster-e2e/husk-network-e2e.sh) must be green, since it modifies the KVM-verified datapath.
- 5 (principal webhook): handler and SAR decision unit-tested; the cert wiring and a webhook admission e2e are not yet exercised in CI.

Reviewer note

Findings 3, 4, 5, and 6 touch security-sensitive paths (internal/husk netfilter, internal/daemon/guest fork correctness, the controller authz boundary, token handling). Per CLAUDE.md these need a named human reviewer before merge. The threat-model deltas are in each commit; please review those alongside the code.

Residuals tracked in the threat model: a per-namespace husk server identity (so the forkd key need not be replicated into tenant namespaces) and narrowing the cluster-wide Secrets ClusterRole, both coupled to the multi-tenant isolation track.

🤖 Generated with Claude Code

…NAT64/6to4/site-local) isBlockedPinAddr blocked RFC1918, IPv6 ULA, loopback, link-local, multicast, and CGNAT, but did not catch IPv6 addresses that embed a private or link-local IPv4 target inside a well-known embedding prefix, nor deprecated site-local. On a DNS64/NAT64 cluster a tenant who controls the authoritative DNS for an allowlisted name could answer with 64:ff9b::a9fe:a9fe (NAT64 well-known prefix wrapping 169.254.169.254). The old filter treated it as public, pinned it into the guest egress allow set, and the NAT64 gateway would translate the guest's connection to the IMDS address, defeating the rebind defense and the IMDS block. The filter now: - decodes the embedded IPv4 from the NAT64 well-known prefix (64:ff9b::/96) and the 6to4 prefix (2002::/16) and applies the same policy to it, so a wrapper of a private/metadata IPv4 is refused while a wrapper of a public IPv4 stays allowed (NAT64 clusters legitimately reach public IPv4 this way); - blocks the deprecated site-local range fec0::/10. Residual (recorded in the threat model): a non-default operator-configured NAT64 prefix other than the well-known 64:ff9b::/96 is not yet decoded. TestIsBlockedPinAddr_IPv6EmbeddedPrivate asserts the embedded-private/site-local cases are blocked and the public-via-prefix cases stay allowed. Full dnsproxy suite passes; gofmt/vet/golangci-lint clean on darwin and linux. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ileged pods Two chart hardening fixes from the security audit. Kernel integrity (Finding 5): the kernel-stage init container curl'd the guest vmlinux into the node hostPath with no integrity check and forkd booted that exact file as every microVM's kernel; the skip-if-present check was content blind, so a once-poisoned cache persisted. Add kernelProvisioner.kernelSha256: when set, the init container verifies the staged file (fresh download AND an already-staged file) and fails closed on mismatch. Empty by default with a loud warning; operators should pin it (no verified default digest is shipped because it must be reproducible per the no-unverified-claims rule: curl -fsSL <kernelUrl> | sha256sum). SA token automount (Finding 6): forkd (privileged, /dev/kvm, host mounts), kvm-device-plugin (host /dev, every node), and kernel-stage make no Kubernetes API calls yet received the namespace default SA token. Set automountServiceAccountToken: false on all three so a compromise of the highest-value pods does not hand out a usable cluster credential for free. threat-model.md gains a "Guest kernel integrity at stage time" supply-chain row; chart README documents kernelSha256. helm lint clean; helm template renders all three automount settings and the verify wiring. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…k pod Husk activation delivers a claim's resolved secrets and per-sandbox bearer token to the selected pod's self-reported PodIP over mTLS pinned to the shared forkd.mitos SAN. selectDormantHuskPod chose the pod purely by the husk LABELS, which any principal with pod-create in the pool namespace can set. Because the replicated mitos-forkd-tls Secret carries the forkd server leaf private key, a tenant who could read it in their pool namespace could stand up a decoy pod pointing at their own listener, present the leaf (which satisfies the forkd.mitos + CA pin), and have the controller hand them another claim's secrets and sandbox token. selectDormantHuskPod now also requires the controller owner reference reconcileHuskPods stamps on every husk pod it creates: a controller reference of Kind SandboxPool naming the pool with BlockOwnerDeletion=true. Only pods the controller actually created are activation targets, so a tenant-planted decoy is never selected. The forgery barrier is BlockOwnerDeletion: the OwnerReferencesPermissionEnforcement admission plugin refuses to let a tenant set it referencing a pool whose finalizers subresource they cannot update. The owner UID is intentionally not compared (it adds no forgery resistance and would couple selection to pod/pool creation order). TestSelectDormantHuskPodRequiresControllerOwnerRef proves a labels-only decoy is not selected even when it would win the name sort, and a genuine controller owned pod is. makeDormantHuskPod stamps the same owner ref so the existing husk activation/fork/eviction envtests still exercise selection; full controller envtest suite passes. threat-model.md records the impersonation vector, the mitigation, and the residual (a per-namespace husk server identity so the forkd private key need not be replicated into tenant namespaces). This is the contained fix; the per-namespace identity redesign is tracked with the multi-tenant work and needs a named human reviewer per CLAUDE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The per-tap egress chain is hooked on forward only, so it governs transit traffic (guest to the internet) but never sees a packet the guest sends to an address LOCAL to the pod network namespace: the tap gateway, the in-pod resolver, and the husk-stub sandbox API (:9091) and mTLS control (:9443) listeners. Those are delivered on the kernel input hook, which had no chain (policy accept), so a guest could reach the in-pod listeners regardless of egress policy. Their own bearer-token/mTLS auth limited impact, but the egress allowlist offered no protection for the host-local surface and any future in-pod listener was exposed. applyEgressFilter now also installs, on the husk path: - RenderSharedInputTable: an input-hooked base chain (policy accept, so non-sandbox input such as kubelet probes and the controller's mTLS dial on the pod uplink is untouched) plus an interface-keyed dispatch map. - RenderSandboxInputChain: a per-tap chain that accepts the guest only to the resolver on udp/tcp 53 (the same resolver address the forward chain and the proxy already use) and drops every other guest-sourced packet to a pod-local address. Reached only via this tap's input dispatch jump. Scoped to the husk path on purpose: the filter lives in the isolated pod netns. The raw-forkd path runs in the node netns where a node-wide input hook is not added; its host-local exposure is tracked separately. Teardown deletes the input dispatch element and chain. Renderer and wiring are unit-tested (RED then GREEN): the input chain allows resolver:53 before the drop, drops other guest-to-local, names per tap, and the husk apply installs both. End-to-end enforcement in a real restored VM is gated by the husk-network KVM e2e (test/cluster-e2e/husk-network-e2e.sh), which MUST be green before merge. gofmt/golangci-lint clean on darwin and linux. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…admission webhook The workspace memory-snapshot resume gate serves a head's in-RAM image (which can hold secrets-in-RAM) only to a claim whose spec.serviceAccount equals the snapshot's MemorySnapshotPrincipal. spec.serviceAccount is a free-form field the claim author sets, and nothing validated that the creator may act as it, so the principal "gate" reduced to typing the right string: a tenant could set another principal's value and resume its secrets. Add a validating admission webhook (internal/admission.ClaimServiceAccountValidator): on SandboxClaim CREATE/UPDATE, when spec.serviceAccount is set, it runs a SubjectAccessReview with the REQUEST creator's identity and admits only if they may impersonate that ServiceAccount (RBAC verb impersonate on serviceaccounts in the claim namespace). It fails closed (a SAR error is a denial) and is a no-op when no principal is asserted. Now the controller's principal equality check is a real authorization boundary. Wiring: - controller --enable-principal-webhook registers the webhook (default off, so single-tenant/webhook-less installs are unaffected); the controller logs a warning if --workspace-memory-snapshots is on without it. - Helm admissionWebhook.enabled (default false) renders the Service, a self-signed serving cert (caBundle injected; cert-manager recommended for production rotation), the ValidatingWebhookConfiguration (failurePolicy Fail), the controller webhook port + cert mount, and the subjectaccessreviews:create ClusterRole rule (granted only when enabled). The handler + SAR decision are unit-tested (allow when impersonate-authorized, deny when not, allow when no principal). helm template renders correctly with the webhook off and on; helm lint, gofmt, golangci-lint (darwin+linux), and go build ./... all pass. threat-model.md row "Memory-snapshot pairing" records the self-asserted gap and this mitigation. REMAINING GATE: the cert path and a webhook admission e2e are not exercised in CI yet, and this touches the authz boundary, so it needs a named human reviewer and an e2e before being relied on in production. The verified core is the handler logic; the chart wiring is opt-in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The host fork-correctness gate reaps a fork whose guest reports ReseededRNG:false. That gate is sound on all engines (raw-forkd, sandbox-server, husk), but it keys entirely on the boolean the guest returns, and the guest was over-reporting: when the credited RNDADDENTROPY ioctl failed, reseedCRNG fell back to a plain write to /dev/urandom (which mixes bytes WITHOUT crediting entropy and does not guarantee the CRNG output diverges from a sibling fork) and still returned true. So a fork that could not be credibly reseeded would report success and be served sharing its siblings' CRNG output: duplicate keys, tokens, and nonces. The fail-closed gate was defeatable by a false positive. reseedCRNG (now reseedCRNGAt, with an injectable device path for testing) reports success ONLY when RNDADDENTROPY succeeds, and returns false on the uncredited fallback so the host reaps the fork. The guest agent runs as PID 1 with full capabilities on the shipped kernel, where RNDADDENTROPY succeeds, so the credited path is the normal path; the fallback was the unsafe one. The reseed contract is now "credited or refused" end to end. TestReseedCRNGFailsClosedWhenNotCredited points reseedCRNGAt at a regular file (RNDADDENTROPY returns ENOTTY, the same shape as a kernel that cannot credit) and asserts it reports false; this fails against the old `return true` and passes after the fix. guest/agent is linux-only, so RED/GREEN is observed in the linux go-test CI job; verified here via GOOS=linux vet + test-binary compile + cross-build. docs/fork-correctness.md, docs/threat-model.md (the stale "open (critical) on raw-forkd/sandbox-server" claim was corrected: the host gate is fail-closed on all engines), and ROADMAP.md updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

stubbi and others added 6 commits June 17, 2026 15:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: security audit remediation (6 findings) + fork-correctness reseed fail-closed#159

fix: security audit remediation (6 findings) + fork-correctness reseed fail-closed#159
stubbi wants to merge 6 commits into
mainfrom
fix/security-audit-findings

stubbi commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

stubbi commented Jun 17, 2026

Summary

What changed

Verification status and pre-merge gates

Reviewer note

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant