fix: security audit remediation (6 findings) + fork-correctness reseed fail-closed#159
Open
stubbi wants to merge 6 commits into
Open
fix: security audit remediation (6 findings) + fork-correctness reseed fail-closed#159stubbi wants to merge 6 commits into
stubbi wants to merge 6 commits into
Conversation
…NAT64/6to4/site-local) isBlockedPinAddr blocked RFC1918, IPv6 ULA, loopback, link-local, multicast, and CGNAT, but did not catch IPv6 addresses that embed a private or link-local IPv4 target inside a well-known embedding prefix, nor deprecated site-local. On a DNS64/NAT64 cluster a tenant who controls the authoritative DNS for an allowlisted name could answer with 64:ff9b::a9fe:a9fe (NAT64 well-known prefix wrapping 169.254.169.254). The old filter treated it as public, pinned it into the guest egress allow set, and the NAT64 gateway would translate the guest's connection to the IMDS address, defeating the rebind defense and the IMDS block. The filter now: - decodes the embedded IPv4 from the NAT64 well-known prefix (64:ff9b::/96) and the 6to4 prefix (2002::/16) and applies the same policy to it, so a wrapper of a private/metadata IPv4 is refused while a wrapper of a public IPv4 stays allowed (NAT64 clusters legitimately reach public IPv4 this way); - blocks the deprecated site-local range fec0::/10. Residual (recorded in the threat model): a non-default operator-configured NAT64 prefix other than the well-known 64:ff9b::/96 is not yet decoded. TestIsBlockedPinAddr_IPv6EmbeddedPrivate asserts the embedded-private/site-local cases are blocked and the public-via-prefix cases stay allowed. Full dnsproxy suite passes; gofmt/vet/golangci-lint clean on darwin and linux. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ileged pods Two chart hardening fixes from the security audit. Kernel integrity (Finding 5): the kernel-stage init container curl'd the guest vmlinux into the node hostPath with no integrity check and forkd booted that exact file as every microVM's kernel; the skip-if-present check was content blind, so a once-poisoned cache persisted. Add kernelProvisioner.kernelSha256: when set, the init container verifies the staged file (fresh download AND an already-staged file) and fails closed on mismatch. Empty by default with a loud warning; operators should pin it (no verified default digest is shipped because it must be reproducible per the no-unverified-claims rule: curl -fsSL <kernelUrl> | sha256sum). SA token automount (Finding 6): forkd (privileged, /dev/kvm, host mounts), kvm-device-plugin (host /dev, every node), and kernel-stage make no Kubernetes API calls yet received the namespace default SA token. Set automountServiceAccountToken: false on all three so a compromise of the highest-value pods does not hand out a usable cluster credential for free. threat-model.md gains a "Guest kernel integrity at stage time" supply-chain row; chart README documents kernelSha256. helm lint clean; helm template renders all three automount settings and the verify wiring. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…k pod Husk activation delivers a claim's resolved secrets and per-sandbox bearer token to the selected pod's self-reported PodIP over mTLS pinned to the shared forkd.mitos SAN. selectDormantHuskPod chose the pod purely by the husk LABELS, which any principal with pod-create in the pool namespace can set. Because the replicated mitos-forkd-tls Secret carries the forkd server leaf private key, a tenant who could read it in their pool namespace could stand up a decoy pod pointing at their own listener, present the leaf (which satisfies the forkd.mitos + CA pin), and have the controller hand them another claim's secrets and sandbox token. selectDormantHuskPod now also requires the controller owner reference reconcileHuskPods stamps on every husk pod it creates: a controller reference of Kind SandboxPool naming the pool with BlockOwnerDeletion=true. Only pods the controller actually created are activation targets, so a tenant-planted decoy is never selected. The forgery barrier is BlockOwnerDeletion: the OwnerReferencesPermissionEnforcement admission plugin refuses to let a tenant set it referencing a pool whose finalizers subresource they cannot update. The owner UID is intentionally not compared (it adds no forgery resistance and would couple selection to pod/pool creation order). TestSelectDormantHuskPodRequiresControllerOwnerRef proves a labels-only decoy is not selected even when it would win the name sort, and a genuine controller owned pod is. makeDormantHuskPod stamps the same owner ref so the existing husk activation/fork/eviction envtests still exercise selection; full controller envtest suite passes. threat-model.md records the impersonation vector, the mitigation, and the residual (a per-namespace husk server identity so the forkd private key need not be replicated into tenant namespaces). This is the contained fix; the per-namespace identity redesign is tracked with the multi-tenant work and needs a named human reviewer per CLAUDE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The per-tap egress chain is hooked on forward only, so it governs transit traffic (guest to the internet) but never sees a packet the guest sends to an address LOCAL to the pod network namespace: the tap gateway, the in-pod resolver, and the husk-stub sandbox API (:9091) and mTLS control (:9443) listeners. Those are delivered on the kernel input hook, which had no chain (policy accept), so a guest could reach the in-pod listeners regardless of egress policy. Their own bearer-token/mTLS auth limited impact, but the egress allowlist offered no protection for the host-local surface and any future in-pod listener was exposed. applyEgressFilter now also installs, on the husk path: - RenderSharedInputTable: an input-hooked base chain (policy accept, so non-sandbox input such as kubelet probes and the controller's mTLS dial on the pod uplink is untouched) plus an interface-keyed dispatch map. - RenderSandboxInputChain: a per-tap chain that accepts the guest only to the resolver on udp/tcp 53 (the same resolver address the forward chain and the proxy already use) and drops every other guest-sourced packet to a pod-local address. Reached only via this tap's input dispatch jump. Scoped to the husk path on purpose: the filter lives in the isolated pod netns. The raw-forkd path runs in the node netns where a node-wide input hook is not added; its host-local exposure is tracked separately. Teardown deletes the input dispatch element and chain. Renderer and wiring are unit-tested (RED then GREEN): the input chain allows resolver:53 before the drop, drops other guest-to-local, names per tap, and the husk apply installs both. End-to-end enforcement in a real restored VM is gated by the husk-network KVM e2e (test/cluster-e2e/husk-network-e2e.sh), which MUST be green before merge. gofmt/golangci-lint clean on darwin and linux. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…admission webhook The workspace memory-snapshot resume gate serves a head's in-RAM image (which can hold secrets-in-RAM) only to a claim whose spec.serviceAccount equals the snapshot's MemorySnapshotPrincipal. spec.serviceAccount is a free-form field the claim author sets, and nothing validated that the creator may act as it, so the principal "gate" reduced to typing the right string: a tenant could set another principal's value and resume its secrets. Add a validating admission webhook (internal/admission.ClaimServiceAccountValidator): on SandboxClaim CREATE/UPDATE, when spec.serviceAccount is set, it runs a SubjectAccessReview with the REQUEST creator's identity and admits only if they may impersonate that ServiceAccount (RBAC verb impersonate on serviceaccounts in the claim namespace). It fails closed (a SAR error is a denial) and is a no-op when no principal is asserted. Now the controller's principal equality check is a real authorization boundary. Wiring: - controller --enable-principal-webhook registers the webhook (default off, so single-tenant/webhook-less installs are unaffected); the controller logs a warning if --workspace-memory-snapshots is on without it. - Helm admissionWebhook.enabled (default false) renders the Service, a self-signed serving cert (caBundle injected; cert-manager recommended for production rotation), the ValidatingWebhookConfiguration (failurePolicy Fail), the controller webhook port + cert mount, and the subjectaccessreviews:create ClusterRole rule (granted only when enabled). The handler + SAR decision are unit-tested (allow when impersonate-authorized, deny when not, allow when no principal). helm template renders correctly with the webhook off and on; helm lint, gofmt, golangci-lint (darwin+linux), and go build ./... all pass. threat-model.md row "Memory-snapshot pairing" records the self-asserted gap and this mitigation. REMAINING GATE: the cert path and a webhook admission e2e are not exercised in CI yet, and this touches the authz boundary, so it needs a named human reviewer and an e2e before being relied on in production. The verified core is the handler logic; the chart wiring is opt-in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The host fork-correctness gate reaps a fork whose guest reports ReseededRNG:false. That gate is sound on all engines (raw-forkd, sandbox-server, husk), but it keys entirely on the boolean the guest returns, and the guest was over-reporting: when the credited RNDADDENTROPY ioctl failed, reseedCRNG fell back to a plain write to /dev/urandom (which mixes bytes WITHOUT crediting entropy and does not guarantee the CRNG output diverges from a sibling fork) and still returned true. So a fork that could not be credibly reseeded would report success and be served sharing its siblings' CRNG output: duplicate keys, tokens, and nonces. The fail-closed gate was defeatable by a false positive. reseedCRNG (now reseedCRNGAt, with an injectable device path for testing) reports success ONLY when RNDADDENTROPY succeeds, and returns false on the uncredited fallback so the host reaps the fork. The guest agent runs as PID 1 with full capabilities on the shipped kernel, where RNDADDENTROPY succeeds, so the credited path is the normal path; the fallback was the unsafe one. The reseed contract is now "credited or refused" end to end. TestReseedCRNGFailsClosedWhenNotCredited points reseedCRNGAt at a regular file (RNDADDENTROPY returns ENOTTY, the same shape as a kernel that cannot credit) and asserts it reports false; this fails against the old `return true` and passes after the fix. guest/agent is linux-only, so RED/GREEN is observed in the linux go-test CI job; verified here via GOOS=linux vet + test-binary compile + cross-build. docs/fork-correctness.md, docs/threat-model.md (the stale "open (critical) on raw-forkd/sandbox-server" claim was corrected: the host gate is fail-closed on all engines), and ROADMAP.md updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Whole-codebase security audit plus a fork-correctness fix. Six findings, each fixed with tests and a
docs/threat-model.mddelta in the same commit. gofmt,golangci-lint(darwin and linux),helm lint,go build ./..., and the full controller envtest suite pass locally.What changed
dnsproxy: the DNS-rebind pin filter now decodes IPv6 addresses that embed a private or link-local IPv4 inside the NAT64 well-known prefix (64:ff9b::/96) or 6to4 (2002::/16) and re-applies the policy to the embedded IPv4, and blocks deprecated site-localfec0::/10. Closes a DNS64/NAT64 path to cloud IMDS (169.254.169.254). A NAT64 wrapper of a public IPv4 stays allowed.deploy: optionalkernelProvisioner.kernelSha256verifies the staged guest kernel (fresh download and an already-cached file) and fails closed on mismatch;automountServiceAccountToken: falseon the privileged forkd, kvm-device-plugin, and kernel-stage pods (none call the Kubernetes API).controller:selectDormantHuskPodrequires the controller owner reference (Controller=true,BlockOwnerDeletion=true) before a husk pod is an activation target, so a tenant-planted decoy cannot receive another claim's secrets and bearer token. The forgery barrier isBlockOwnerDeletion, protected byOwnerReferencesPermissionEnforcement.husk: a new nftablesinput-hook chain (husk path) lets the guest reach only the in-pod resolver on53and drops every other guest-sourced packet to a pod-local address, closing guest reachability to the in-pod sandbox API and mTLS control listeners that forward-only filtering missed.controller: a validating admission webhook bindsSandboxClaim.spec.serviceAccountto authorization (aSubjectAccessReviewrequires the creator to be able toimpersonatethe named ServiceAccount), so the memory-snapshot principal field can no longer be self-asserted. Default off; opt-in viaadmissionWebhook.enabledand--enable-principal-webhook.guest(§1 fork correctness):reseedCRNGreported success on an uncredited write fallback, which silently defeated the host fail-closed reseed gate (a fork that could not be credibly reseeded would serve duplicate CRNG output). It now reports success only on a creditedRNDADDENTROPY, so the reseed contract is credited-or-refused end to end. The host gate was already fail-closed on all engines; the stale threat-model text claiming otherwise was corrected.Verification status and pre-merge gates
go vet+ test-binary compile; guest is linux-only so RED/GREEN runs in thego-testCI job).test/cluster-e2e/husk-network-e2e.sh) must be green, since it modifies the KVM-verified datapath.Reviewer note
Findings 3, 4, 5, and 6 touch security-sensitive paths (
internal/husknetfilter,internal/daemon/guestfork correctness, the controller authz boundary, token handling). PerCLAUDE.mdthese need a named human reviewer before merge. The threat-model deltas are in each commit; please review those alongside the code.Residuals tracked in the threat model: a per-namespace husk server identity (so the forkd key need not be replicated into tenant namespaces) and narrowing the cluster-wide Secrets ClusterRole, both coupled to the multi-tenant isolation track.
🤖 Generated with Claude Code