fix: security audit remediation (6 findings) + fork-correctness reseed fail-closed #159

Open
stubbi wants to merge 6 commits into main from fix/security-audit-findings

@stubbi stubbi commented Jun 17, 2026

Summary

Whole-codebase security audit plus a fork-correctness fix. Six findings, each fixed with tests and a docs/threat-model.md delta in the same commit. gofmt, golangci-lint (darwin and linux), helm lint, go build ./..., and the full controller envtest suite pass locally.

What changed

  1. dnsproxy: the DNS-rebind pin filter now decodes IPv6 addresses that embed a private or link-local IPv4 inside the NAT64 well-known prefix (64:ff9b::/96) or 6to4 (2002::/16) and re-applies the policy to the embedded IPv4, and blocks deprecated site-local fec0::/10. Closes a DNS64/NAT64 path to cloud IMDS (169.254.169.254). A NAT64 wrapper of a public IPv4 stays allowed.
  2. deploy: optional kernelProvisioner.kernelSha256 verifies the staged guest kernel (fresh download and an already-cached file) and fails closed on mismatch; automountServiceAccountToken: false on the privileged forkd, kvm-device-plugin, and kernel-stage pods (none call the Kubernetes API).
  3. controller: selectDormantHuskPod requires the controller owner reference (Controller=true, BlockOwnerDeletion=true) before a husk pod is an activation target, so a tenant-planted decoy cannot receive another claim's secrets and bearer token. The forgery barrier is BlockOwnerDeletion, protected by OwnerReferencesPermissionEnforcement.
  4. husk: a new nftables input-hook chain (husk path) lets the guest reach only the in-pod resolver on udp/tcp port 53 and drops every other guest-sourced packet to a pod-local address. This closes guest reachability to the in-pod sandbox API and mTLS control listeners, which forward-only filtering missed.
  5. controller: a validating admission webhook binds SandboxClaim.spec.serviceAccount to authorization (a SubjectAccessReview requires the creator to be able to impersonate the named ServiceAccount), so the memory-snapshot principal field can no longer be self-asserted. Default off; opt-in via admissionWebhook.enabled and --enable-principal-webhook.
  6. guest (§1 fork correctness): reseedCRNG reported success on an uncredited write fallback, which silently defeated the host fail-closed reseed gate (a fork that could not be credibly reseeded would serve duplicate CRNG output). It now reports success only on a credited RNDADDENTROPY, so the reseed contract is credited-or-refused end to end. The host gate was already fail-closed on all engines; the stale threat-model text claiming otherwise was corrected.

Verification status and pre-merge gates

  • Fully verified locally: 1, 2, 3 (controller envtest green), 6 (linux go vet + test-binary compile; guest is linux-only so RED/GREEN runs in the go-test CI job).
  • Gated on CI before relying on them, called out in the commits and threat model:
    • 4 (egress input hook): unit-tested; the husk-network KVM e2e (test/cluster-e2e/husk-network-e2e.sh) must be green, since it modifies the KVM-verified datapath.
    • 5 (principal webhook): handler and SAR decision unit-tested; the cert wiring and a webhook admission e2e are not yet exercised in CI.

Reviewer note

Findings 3, 4, 5, and 6 touch security-sensitive paths (internal/husk netfilter, internal/daemon/guest fork correctness, the controller authz boundary, token handling). Per CLAUDE.md these need a named human reviewer before merge. The threat-model deltas are in each commit; please review those alongside the code.

Residuals tracked in the threat model: a per-namespace husk server identity (so the forkd key need not be replicated into tenant namespaces) and narrowing the cluster-wide Secrets ClusterRole, both coupled to the multi-tenant isolation track.

🤖 Generated with Claude Code

stubbi and others added 6 commits June 17, 2026 15:17
…NAT64/6to4/site-local)

isBlockedPinAddr blocked RFC1918, IPv6 ULA, loopback, link-local, multicast,
and CGNAT, but did not catch IPv6 addresses that embed a private or link-local
IPv4 target inside a well-known embedding prefix, nor deprecated site-local.

On a DNS64/NAT64 cluster a tenant who controls the authoritative DNS for an
allowlisted name could answer with 64:ff9b::a9fe:a9fe (NAT64 well-known prefix
wrapping 169.254.169.254). The old filter treated it as public, pinned it into
the guest egress allow set, and the NAT64 gateway would translate the guest's
connection to the IMDS address, defeating the rebind defense and the IMDS block.

The filter now:
- decodes the embedded IPv4 from the NAT64 well-known prefix (64:ff9b::/96) and
  the 6to4 prefix (2002::/16) and applies the same policy to it, so a wrapper of
  a private/metadata IPv4 is refused while a wrapper of a public IPv4 stays
  allowed (NAT64 clusters legitimately reach public IPv4 this way);
- blocks the deprecated site-local range fec0::/10.

Residual (recorded in the threat model): a non-default operator-configured NAT64
prefix other than the well-known 64:ff9b::/96 is not yet decoded.

TestIsBlockedPinAddr_IPv6EmbeddedPrivate asserts the embedded-private/site-local
cases are blocked and the public-via-prefix cases stay allowed. Full dnsproxy
suite passes; gofmt/vet/golangci-lint clean on darwin and linux.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ileged pods

Two chart hardening fixes from the security audit.

Kernel integrity (Finding 5): the kernel-stage init container downloaded the
guest vmlinux into the node hostPath with curl and no integrity check, and forkd
booted that exact file as every microVM's kernel; the skip-if-present check was
content-blind, so a once-poisoned cache persisted. Add
kernelProvisioner.kernelSha256: when set, the init container verifies the staged
file (fresh download AND an already-staged file) and fails closed on mismatch.
Empty by default with a loud warning; operators should pin it. No verified
default digest is shipped, because per the no-unverified-claims rule it must be
reproducible:
curl -fsSL <kernelUrl> | sha256sum

SA token automount (Finding 6): forkd (privileged, /dev/kvm, host mounts),
kvm-device-plugin (host /dev, every node), and kernel-stage make no Kubernetes
API calls yet received the namespace default SA token. Set
automountServiceAccountToken: false on all three so a compromise of the
highest-value pods does not hand out a usable cluster credential for free.

threat-model.md gains a "Guest kernel integrity at stage time" supply-chain row;
chart README documents kernelSha256. helm lint clean; helm template renders all
three automount settings and the verify wiring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…k pod

Husk activation delivers a claim's resolved secrets and per-sandbox bearer token
to the selected pod's self-reported PodIP over mTLS pinned to the shared
forkd.mitos SAN. selectDormantHuskPod chose the pod purely by the husk LABELS,
which any principal with pod-create in the pool namespace can set. Because the
replicated mitos-forkd-tls Secret carries the forkd server leaf private key, a
tenant who could read it in their pool namespace could stand up a decoy pod
pointing at their own listener, present the leaf (which satisfies the
forkd.mitos + CA pin), and have the controller hand them another claim's secrets
and sandbox token.

selectDormantHuskPod now also requires the controller owner reference
reconcileHuskPods stamps on every husk pod it creates: a controller reference of
Kind SandboxPool naming the pool with BlockOwnerDeletion=true. Only pods the
controller actually created are activation targets, so a tenant-planted decoy is
never selected. The forgery barrier is BlockOwnerDeletion: the
OwnerReferencesPermissionEnforcement admission plugin refuses to let a tenant set
it referencing a pool whose finalizers subresource they cannot update. The owner
UID is intentionally not compared (it adds no forgery resistance and would
couple selection to pod/pool creation order).

TestSelectDormantHuskPodRequiresControllerOwnerRef proves a labels-only decoy is
not selected even when it would win the name sort, and a genuine
controller-owned pod is. makeDormantHuskPod stamps the same owner ref so the existing husk
activation/fork/eviction envtests still exercise selection; full controller
envtest suite passes. threat-model.md records the impersonation vector, the
mitigation, and the residual (a per-namespace husk server identity so the forkd
private key need not be replicated into tenant namespaces).

This is the contained fix; the per-namespace identity redesign is tracked with
the multi-tenant work and needs a named human reviewer per CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The per-tap egress chain is hooked on forward only, so it governs transit
traffic (guest to the internet) but never sees a packet the guest sends to an
address LOCAL to the pod network namespace: the tap gateway, the in-pod
resolver, and the husk-stub sandbox API (:9091) and mTLS control (:9443)
listeners. Those are delivered on the kernel input hook, which had no chain
(policy accept), so a guest could reach the in-pod listeners regardless of
egress policy. Their own bearer-token/mTLS auth limited impact, but the egress
allowlist offered no protection for the host-local surface and any future in-pod
listener was exposed.

applyEgressFilter now also installs, on the husk path:
- RenderSharedInputTable: an input-hooked base chain (policy accept, so
  non-sandbox input such as kubelet probes and the controller's mTLS dial on the
  pod uplink is untouched) plus an interface-keyed dispatch map.
- RenderSandboxInputChain: a per-tap chain that accepts the guest only to the
  resolver on udp/tcp 53 (the same resolver address the forward chain and the
  proxy already use) and drops every other guest-sourced packet to a pod-local
  address. Reached only via this tap's input dispatch jump.

Scoped to the husk path on purpose: the filter lives in the isolated pod netns.
The raw-forkd path runs in the node netns where a node-wide input hook is not
added; its host-local exposure is tracked separately. Teardown deletes the input
dispatch element and chain.

Renderer and wiring are unit-tested (RED then GREEN): the input chain allows
resolver:53 before the drop, drops other guest-to-local, names per tap, and the
husk apply installs both. End-to-end enforcement in a real restored VM is gated
by the husk-network KVM e2e (test/cluster-e2e/husk-network-e2e.sh), which MUST
be green before merge. gofmt/golangci-lint clean on darwin and linux.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…admission webhook

The workspace memory-snapshot resume gate serves a head's in-RAM image (which
can hold secrets-in-RAM) only to a claim whose spec.serviceAccount equals the
snapshot's MemorySnapshotPrincipal. spec.serviceAccount is a free-form field the
claim author sets, and nothing validated that the creator may act as it, so the
principal "gate" reduced to typing the right string: a tenant could set another
principal's value and resume its secrets.

Add a validating admission webhook (internal/admission.ClaimServiceAccountValidator):
on SandboxClaim CREATE/UPDATE, when spec.serviceAccount is set, it runs a
SubjectAccessReview with the REQUEST creator's identity and admits only if they
may impersonate that ServiceAccount (RBAC verb impersonate on serviceaccounts in
the claim namespace). It fails closed (a SAR error is a denial) and is a no-op
when no principal is asserted. With the webhook enabled, the controller's
principal equality check becomes a real authorization boundary.

Wiring:
- controller --enable-principal-webhook registers the webhook (default off, so
  single-tenant/webhook-less installs are unaffected); the controller logs a
  warning if --workspace-memory-snapshots is on without it.
- Helm admissionWebhook.enabled (default false) renders the Service, a
  self-signed serving cert (caBundle injected; cert-manager recommended for
  production rotation), the ValidatingWebhookConfiguration (failurePolicy Fail),
  the controller webhook port + cert mount, and the subjectaccessreviews:create
  ClusterRole rule (granted only when enabled).

The handler + SAR decision are unit-tested (allow when impersonate-authorized,
deny when not, allow when no principal). helm template renders correctly with
the webhook off and on; helm lint, gofmt, golangci-lint (darwin+linux), and
go build ./... all pass. threat-model.md row "Memory-snapshot pairing" records
the self-asserted gap and this mitigation.

REMAINING GATE: the cert path and a webhook admission e2e are not exercised in
CI yet, and this touches the authz boundary, so it needs a named human reviewer
and an e2e before being relied on in production. The verified core is the
handler logic; the chart wiring is opt-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The host fork-correctness gate reaps a fork whose guest reports
ReseededRNG:false. That gate is sound on all engines (raw-forkd, sandbox-server,
husk), but it keys entirely on the boolean the guest returns, and the guest was
over-reporting: when the credited RNDADDENTROPY ioctl failed, reseedCRNG fell
back to a plain write to /dev/urandom (which mixes bytes WITHOUT crediting
entropy and does not guarantee the CRNG output diverges from a sibling fork) and
still returned true. So a fork that could not be credibly reseeded would report
success and be served sharing its siblings' CRNG output: duplicate keys, tokens,
and nonces. The fail-closed gate was defeatable by a false positive.

reseedCRNG (now reseedCRNGAt, with an injectable device path for testing)
reports success ONLY when RNDADDENTROPY succeeds, and returns false on the
uncredited fallback so the host reaps the fork. The guest agent runs as PID 1
with full capabilities on the shipped kernel, where RNDADDENTROPY succeeds, so
the credited path is the normal path; the fallback was the unsafe one. The
reseed contract is now "credited or refused" end to end.

TestReseedCRNGFailsClosedWhenNotCredited points reseedCRNGAt at a regular file
(RNDADDENTROPY returns ENOTTY, the same shape as a kernel that cannot credit)
and asserts it reports false; this fails against the old `return true` and
passes after the fix. guest/agent is linux-only, so RED/GREEN is observed in the
linux go-test CI job; verified here via GOOS=linux vet + test-binary compile +
cross-build. docs/fork-correctness.md, docs/threat-model.md (the stale
"open (critical) on raw-forkd/sandbox-server" claim was corrected: the host gate
is fail-closed on all engines), and ROADMAP.md updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>