
roadmap: Kubescape security stack → 100% and hold (posture · CVE · runtime) #2447

@devantler

🤖 Generated by the Daily AI Assistant

Problem

A live investigation on 2026-07-04 found that all three Kubescape surfaces are broken or invisible, not clean. Every "0 findings" reading reflects an absence of data, not health. None of it reaches a human or the issue backlog, so the failure has aged silently for about a week.

| Surface | State | Root cause (verified against live prod) |
| --- | --- | --- |
| Posture (config scan) | 🔴 Broken | kubescape-volume (empty emptyDir) mounts at …/.kubescape/config.json with subPath: config.json → kubelet materialises config.json as a directory → scanner aborts every run. Live log: open …/config.json: is a directory. 1542/1542 scan objects controls: null; compliance 0.00 across all 9 frameworks; data ~7 days stale. |
| CVEs (kubevuln) | 🔴 Broken/invisible | The relevancy scan path (ScanCP) is the only one wired and aborts on partial ApplicationProfiles ("workload restart required", ~815×/day). Verified: 197/197 image manifests have 0 grype matches / no scanner name; 109/109 summaries .all=0 and .relevant=0. The cluster is not 0-CVE; nothing is being scanned. |
| Runtime detection (node-agent) | 🟠 Invisible | alertManagerExporterUrls: [], prometheusExporterEnabled: false, stdoutExporter: true → every runtime threat alert goes to stdout and vanishes. Malware detection off, 0 RuntimeRuleAlertBindings, 0 alerts routed anywhere in 24h. Node-agent is healthy and learning (288 profiles, 123 still partial). |

Secondary posture degraders (independent of the mount): the scanner SA cannot list validatingadmissionpolicies or istio VirtualService at cluster scope (forbidden), and cloud-provider controls are N/A on Talos.
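
The posture root cause above (a subPath mount on an empty emptyDir, which the kubelet creates as a directory) is a pattern that can be caught statically. The following is a minimal sketch, assuming the pod spec has already been parsed into a plain dict; the function name, the container name, and the `/example` path prefix are illustrative placeholders, not taken from the real manifest. It is a heuristic: an init container that writes the file first would make such a mount legitimate.

```python
from __future__ import annotations

from typing import Any


def find_empty_dir_subpath_mounts(pod_spec: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (container, mountPath, subPath) for each subPath mount on an emptyDir."""
    # Volumes backed by emptyDir start out with no files in them.
    empty_dirs = {
        v["name"] for v in pod_spec.get("volumes", []) if "emptyDir" in v
    }
    hits = []
    containers = pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
    for c in containers:
        for m in c.get("volumeMounts", []):
            # A subPath that does not exist yet is created as a directory.
            if m.get("name") in empty_dirs and m.get("subPath"):
                hits.append((c["name"], m["mountPath"], m["subPath"]))
    return hits


# Hypothetical spec mirroring the failure; names and path prefix are placeholders.
spec = {
    "volumes": [{"name": "kubescape-volume", "emptyDir": {}}],
    "containers": [{
        "name": "scanner",
        "volumeMounts": [{
            "name": "kubescape-volume",
            "mountPath": "/example/.kubescape/config.json",
            "subPath": "config.json",
        }],
    }],
}
print(find_empty_dir_subpath_mounts(spec))
```

A check like this fits the "static validation" guardrail, since it needs only the rendered manifests and no running cluster.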

Meta-problem: the daily engineer is issue-driven, but its survey is GitHub-only (portfolio-surveyor is kubectl-free by design), so no live cluster finding enters the backlog unless a human files it. The CI gate is a floor (85) rather than a 100 target, and there is no fix-vs-except ladder or security definition-of-done. Together these let the whole stack rot to 0.00 unnoticed, which is what happened.
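
The distinction drawn in the Problem section between "zero findings" and "no data" can be made explicit in whatever reads the scan summaries. This is a sketch only: the inputs (`scanner`, `findings`, `scanned_at`) are hypothetical fields that a reader would have to extract from the real objects, and the one-day freshness window is an assumption, not a project setting.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from enum import Enum


class Reading(Enum):
    NO_DATA = "no data"
    STALE = "stale"
    CLEAN = "clean"
    FINDINGS = "findings"


def classify(scanner: str | None, findings: int, scanned_at: datetime | None,
             now: datetime, max_age: timedelta = timedelta(days=1)) -> Reading:
    """Classify one scan result so that a zero is only 'clean' when a scan ran recently."""
    # No scanner name or timestamp means nothing ran; a zero count is meaningless.
    if not scanner or scanned_at is None:
        return Reading.NO_DATA
    # A result older than the freshness window says nothing about today.
    if now - scanned_at > max_age:
        return Reading.STALE
    return Reading.FINDINGS if findings > 0 else Reading.CLEAN


now = datetime(2026, 7, 4, tzinfo=timezone.utc)
print(classify(None, 0, None, now))                         # Reading.NO_DATA
print(classify("grype", 0, now - timedelta(days=7), now))   # Reading.STALE
print(classify("grype", 0, now - timedelta(hours=2), now))  # Reading.CLEAN
```

Under this rule, a manifest with 0 matches and no scanner name would be reported as NO_DATA rather than counted as healthy.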

Goal

Drive posture, CVE, and runtime to a defensible 100% and hold it, without reliability regressions. Two mechanisms hold it there: prevention (stop the finding count creeping back up) and continuous ingestion (findings become tracked issues that the daily engineer drains oldest-first).

Approach — two loops

  • Prevention loop: fix the scanners → wire the in-repo ClusterSecurityException set into the CI scan via ksail native --exceptions and ratchet --compliance-threshold toward 100 → graduate each fixed control into a Kyverno Enforce policy → SARIF → GitHub Code Scanning gate on PRs. The ratchet only tightens; the threshold is never lowered.
  • Backlog loop: a scheduled, fingerprint-deduped bridge turns live Kubescape findings (posture / CVE / runtime) into security issues under this epic; the daily engineer drains them oldest-first with a fix > except discipline (an exception is last-resort, scoped, justified, and periodically pruned).
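
The ratchet in the prevention loop reduces to a small rule: the threshold may rise to the score just measured, is capped at the target, and never falls. A minimal sketch follows; the function names are illustrative, and how the resulting value is written back into the CI configuration is left out because it depends on the repo.

```python
from __future__ import annotations


def next_threshold(current: float, measured: float, target: float = 100.0) -> float:
    """Ratchet rule: raise the threshold to the measured score, never lower it."""
    return min(target, max(current, measured))


def gate(measured: float, threshold: float) -> bool:
    """CI gate: pass only when the measured score meets the threshold."""
    return measured >= threshold


print(next_threshold(85.0, 91.3))   # 91.3  (score improved, threshold follows)
print(next_threshold(91.3, 88.0))   # 91.3  (score regressed, threshold holds)
print(gate(88.0, 91.3))             # False (the regression fails the gate)
```

Raising the threshold all the way to the measured score leaves no slack, so a project might prefer to round down to a whole percent to avoid failing on noise; that choice is not made here.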

Decisions (settled with the maintainer)

  • Runtime alert routing = the in-flight #2445, "feat(kubescape): route runtime-detection alerts to Headlamp, Slack, and Coroot" (reconciled 2026-07-04): a single tiny kubescape-scoped Alertmanager (prod-only) fed by the node-agent, routing to Headlamp (its Runtime-Detection Alerts tab reads only from an Alertmanager), Slack (the existing webhook), and Coroot (node-agent stdout → Coroot logs). A Coroot-only PromQL path was considered and rejected: it can't feed the Headlamp plugin, and Coroot CE has no inbound alert receiver.
  • CVE = both: decouple a full non-relevancy scan so grype data flows now, and complete the partial ApplicationProfiles for the reachable-CVE view over time.
  • Automation = full loop: scanners fixed + CI ratchet + SARIF→Code Scanning + CRD→issue bridge + Kyverno graduation + the daily-engineer definition changes.
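
The CRD→issue bridge depends on fingerprint deduplication, so that a finding seen on every scheduled run produces one issue rather than one per run. The sketch below shows one way to do that, assuming each finding has been reduced to three hypothetical string fields (`surface`, `rule_id`, `resource`); the field names, the example values, and the 16-character truncation are illustrative choices, not the bridge's actual design.

```python
from __future__ import annotations

import hashlib


def fingerprint(surface: str, rule_id: str, resource: str) -> str:
    """Stable ID for a finding; timestamps and run IDs are deliberately excluded."""
    # Join with a separator that cannot appear in the fields, then hash.
    key = "\x1f".join((surface, rule_id, resource))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def plan_issues(findings: list[dict[str, str]],
                open_fingerprints: set[str]) -> list[dict[str, str]]:
    """Return one new-issue record per finding that is not already tracked."""
    seen = set(open_fingerprints)
    to_file = []
    for f in findings:
        fp = fingerprint(f["surface"], f["rule_id"], f["resource"])
        if fp in seen:
            continue  # already has an open issue, or repeated within this batch
        seen.add(fp)
        to_file.append({**f, "fingerprint": fp})
    return to_file


# Placeholder findings; the rule IDs and resource names are made up.
batch = [
    {"surface": "posture", "rule_id": "CONTROL-A", "resource": "ns1/Deployment/web"},
    {"surface": "posture", "rule_id": "CONTROL-A", "resource": "ns1/Deployment/web"},
    {"surface": "cve", "rule_id": "CVE-EXAMPLE", "resource": "ns1/Deployment/web"},
]
already_open = {fingerprint("cve", "CVE-EXAMPLE", "ns1/Deployment/web")}
print(len(plan_issues(batch, already_open)))  # 1
```

Storing the fingerprint in the issue body or a label would let the next run rebuild `open_fingerprints` from GitHub alone, which keeps the bridge stateless.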

Children (oldest/most-blocking first)

Already in-flight (parallel runs, 2026-07-04) — this epic aggregates them

Much of the posture + runtime + exceptions work was already underway when this epic was filed:

The genuinely-unaddressed gaps this investigation surfaced are #2450 (CVE scanning is silently broken — nobody was on it) and #2451 (continuous ingestion so this can't recur), plus the daily-engineer workflow change (monorepo#2052). This epic ties the scattered work into one "100% and hold" line so the number is tracked to completion rather than fixed piecemeal.

Non-goals / guardrails

Root-cause fixes only (no t.Skip/threshold-lowering/exception-as-first-resort); never spin up clusters to test; all changes ship as draft PRs under static validation; exceptions stay minimal, scoped, and justified (exceptions-as-code).
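
The "minimal, scoped, and justified" rule for exceptions could be enforced by a lint step over the exceptions-as-code set. The sketch below uses a deliberately simplified, hypothetical record shape (`justification`, `scope.controls`, `scope.resources`, `review_by`); it is not the ClusterSecurityException schema, and the real fields would need to be mapped onto it.

```python
from __future__ import annotations

from datetime import date
from typing import Any


def lint_exception(exc: dict[str, Any], today: date) -> list[str]:
    """Return the reasons an exception record fails the guardrails (empty if none)."""
    problems = []
    if not (exc.get("justification") or "").strip():
        problems.append("missing justification")
    scope = exc.get("scope") or {}
    if not scope.get("controls"):
        problems.append("no control listed")
    resources = scope.get("resources") or []
    if not resources or "*" in resources:
        problems.append("resource scope is empty or wildcard")
    review_by = exc.get("review_by")
    if review_by is None:
        problems.append("no review date")
    elif review_by < today:
        problems.append("review date has passed")
    return problems


# Placeholder record; every value here is made up for illustration.
ok = {
    "justification": "Upstream chart requires hostNetwork; tracked upstream.",
    "scope": {"controls": ["CONTROL-A"], "resources": ["ns1/DaemonSet/agent"]},
    "review_by": date(2026, 10, 1),
}
print(lint_exception(ok, date(2026, 7, 4)))  # []
print(lint_exception({"scope": {"resources": ["*"]}}, date(2026, 7, 4)))
```

The review date is what makes "periodically pruned" checkable: an exception whose date has passed fails the lint until someone renews or removes it.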
