roadmap: Kubescape security stack → 100% and hold (posture · CVE · runtime) #2447

> 🤖 Generated by the Daily AI Assistant

## Problem

A live investigation on 2026-07-04 found that **all three Kubescape surfaces are broken or invisible — not clean.** Every "0 findings" reading is *absence of data*, not health, and none of it reaches a human or the issue backlog, so it has aged silently for ~a week.

| Surface | State | Root cause (verified against live prod) |
|---|---|---|
| **Posture** (config scan) | 🔴 Broken | `kubescape-volume` (empty `emptyDir`) mounts at `…/.kubescape/config.json` with `subPath: config.json` → kubelet materialises `config.json` as a **directory** → scanner aborts every run. Live log: `open …/config.json: is a directory`. **1542/1542** scan objects `controls: null`; compliance **0.00 across all 9 frameworks**; data ~7 days stale. |
| **CVEs** (kubevuln) | 🔴 Broken/invisible | The relevancy scan path (`ScanCP`) is the only one wired and **aborts on partial ApplicationProfiles** ("workload restart required", ~815×/day). Verified: **197/197** image manifests have 0 grype matches / no scanner name; **109/109** summaries `.all`=0 *and* `.relevant`=0. The cluster is **not** 0-CVE — nothing is being scanned. |
| **Runtime detection** (node-agent) | 🟠 Invisible | `alertManagerExporterUrls: []`, `prometheusExporterEnabled: false`, `stdoutExporter: true` → every runtime threat alert goes to **stdout and vanishes**. Malware detection off, **0** RuntimeRuleAlertBindings, **0** alerts routed anywhere in 24h. Node-agent is healthy and learning (288 profiles, 123 still partial). |

Secondary posture degraders (independent of the mount): the scanner SA cannot `list` `validatingadmissionpolicies` or istio `VirtualService` at cluster scope (forbidden), and cloud-provider controls are N/A on Talos.
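The distinction the table draws between "0 findings" and "no data" can be made explicit in whatever reads these objects. The sketch below is illustrative only: the dictionary shapes (`controls` as a mapping of control ID to a `status` field, and manifests with `scanner` and `matches` keys) are assumptions made for this example and are not the actual Kubescape CRD schema.

```python
from typing import Any


def posture_state(scan_objects: list[dict[str, Any]]) -> str:
    """Classify a batch of posture scan objects.

    Returns "no-data" when no object carries evaluated controls
    (the `controls: null` case), otherwise "findings" or "clean".
    The object layout is hypothetical.
    """
    evaluated = [o for o in scan_objects if o.get("controls") is not None]
    if not evaluated:
        return "no-data"
    has_failure = any(
        control.get("status") == "failed"
        for obj in evaluated
        for control in obj["controls"].values()
    )
    return "findings" if has_failure else "clean"


def cve_state(manifests: list[dict[str, Any]]) -> str:
    """Classify image manifests the same way.

    A manifest with no scanner name was never scanned, so zero matches
    on it says nothing about the image. Keys are hypothetical.
    """
    scanned = [m for m in manifests if m.get("scanner")]
    if not scanned:
        return "no-data"
    return "findings" if any(m.get("matches") for m in scanned) else "clean"
```

With this shape, a batch in which every object has `controls: null` reports `no-data` rather than `clean`, which is the reading the live investigation arrived at by hand.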

**Meta-problem:** the daily engineer is issue-driven, but its survey is GitHub-only (`portfolio-surveyor` is `kubectl`-free by design), so **no live cluster finding ever enters the backlog** unless a human files it. Combined with a CI gate that is a *floor (85), not a 100 target*, and no fix-vs-except ladder / security definition-of-done, the whole stack could rot to 0.00 unnoticed — which is what happened.

## Goal

Drive **posture + CVE + runtime** to a defensible **100% and hold it**, with **prevention** (stop the finding count creeping back up) and **continuous ingestion** (findings become tracked issues the daily engineer drains oldest-first), all **without reliability regressions**.

## Approach — two loops

- **Prevention loop:** fix the scanners → wire the in-repo `ClusterSecurityException` set into the CI scan via ksail native `--exceptions` and **ratchet `--compliance-threshold` toward 100** → graduate each fixed control into a Kyverno `Enforce` policy → SARIF → GitHub Code Scanning gate on PRs. Ratchet-only-tightens.
- **Backlog loop:** a scheduled, fingerprint-deduped bridge turns live Kubescape findings (posture / CVE / runtime) into `security` issues under this epic; the daily engineer drains them oldest-first with a **fix > except** discipline (an exception is last-resort, scoped, justified, and periodically pruned).
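The "ratchet only tightens" rule from the prevention loop can be sketched as a small pure function. This is a minimal illustration, not the CI implementation: `ratchet_threshold` is a hypothetical helper, and how its result would be written back to the `--compliance-threshold` value in CI is left open.

```python
def ratchet_threshold(current: float, measured: float, ceiling: float = 100.0) -> float:
    """Return the next compliance threshold.

    The threshold moves up to the measured score when the score improves,
    never moves down when the score regresses, and never exceeds `ceiling`.
    """
    if not 0.0 <= current <= ceiling:
        raise ValueError(f"current threshold out of range: {current}")
    if not 0.0 <= measured <= ceiling:
        raise ValueError(f"measured score out of range: {measured}")
    # max() keeps the existing floor on a regression; the gate then fails
    # the run instead of the floor being relaxed to match it.
    return max(current, measured)
```

For example, starting from the current floor of 85, a measured score of 92.4 would raise the threshold to 92.4, and a later score of 90 would leave it at 92.4 so the regression fails the gate.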

## Decisions (settled with the maintainer)

- **Runtime alert routing = in-flight #2445** (reconciled 2026-07-04): a single tiny **kubescape-scoped Alertmanager** (prod-only) fed by the node-agent, routing to **Headlamp** (its Runtime-Detection Alerts tab reads *only* from an Alertmanager), **Slack** (the existing webhook), and **Coroot** (node-agent stdout → Coroot logs). A Coroot-only PromQL path was considered but rejected — it can't feed the Headlamp plugin and Coroot CE has no inbound alert receiver.
- **CVE = both:** decouple a full non-relevancy scan so grype data flows now, **and** complete the partial ApplicationProfiles for the reachable-CVE view over time.
- **Automation = full loop:** scanners fixed + CI ratchet + SARIF→Code Scanning + CRD→issue bridge + Kyverno graduation + the daily-engineer definition changes.

## Children (oldest/most-blocking first)

- [ ] **P0.1** #2443 + #2452 — Unfreeze offline posture scans (`keepLocal`) + scanner v4.0.10 so exception CRs apply _(in-flight; mount-only PRs #2454/#2439/#2444/#2448 all closed as dups of #2443 — the mount fix alone is insufficient)_
- [ ] **P0.2** #2445 — Route runtime threat alerts off stdout → minimal kubescape-scoped Alertmanager → Headlamp + Slack + Coroot _(in-flight draft; #2449 closed as dup)_
- [ ] **P1.1** #2450 — Unblock kubevuln — grype produces no CVE data on 197 images (relevancy `ScanCP` partial-profile block)
- [ ] **#2264** — Wire `ClusterSecurityException` into the CI scan and ratchet `--compliance-threshold` toward 100 _(already filed — adopted under this epic)_
- [ ] **#2149** — Reduce reachable base-image CVEs _(re-scope: gated on #2450 — nothing is being flagged today)_
- [ ] **P2** #2451 — Continuous ingestion: SARIF→Code Scanning for posture + a CRD→issue bridge for live-only findings
- [ ] **P3** (monorepo) — Daily-engineer definition: live-security survey pass + fix-vs-except ladder + "drive Kubescape to 100% and hold it" standing objective
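The fingerprint-deduped bridge planned in #2451 could work roughly as follows. This is a sketch under stated assumptions: the `Finding` fields, the choice of fingerprint inputs, and the `new_issues` helper are all hypothetical, and the rule ID in the usage note is a placeholder. The one design point it illustrates is that the timestamp stays out of the fingerprint, so a finding observed again on a later run maps to the same issue.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    surface: str     # "posture", "cve", or "runtime"
    rule_id: str     # e.g. a control ID such as "C-0026"
    namespace: str
    workload: str
    first_seen: str  # ISO 8601 UTC timestamp; used for ordering only


def fingerprint(finding: Finding) -> str:
    """Stable identity for a finding, independent of when it was observed."""
    key = "\x1f".join(
        (finding.surface, finding.rule_id, finding.namespace, finding.workload)
    )
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def new_issues(
    findings: list[Finding], existing: set[str]
) -> list[tuple[str, Finding]]:
    """Return (fingerprint, finding) pairs that still need an issue, oldest first.

    `existing` holds fingerprints already recorded on open issues.
    Sorting assumes all timestamps share one format and timezone.
    """
    seen = set(existing)
    pending: list[tuple[str, Finding]] = []
    for finding in sorted(findings, key=lambda f: f.first_seen):
        fp = fingerprint(finding)
        if fp in seen:
            continue
        seen.add(fp)
        pending.append((fp, finding))
    return pending
```

Two observations of the same control on the same workload on different days yield one pending issue, and a finding whose fingerprint is already on an open issue yields none. Returning the list oldest-first matches the order in which the daily engineer is meant to drain it.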

## Already in-flight (parallel runs, 2026-07-04) — this epic aggregates them

Much of the posture + runtime + exceptions work was already underway when this epic was filed:
- **Posture recovery:** #2443 (unfreeze scans via `keepLocal`) + #2452 (scanner v4.0.10 so exception CRs apply offline). The `config.json`-is-a-directory abort is real but the mount fix alone is insufficient (Submit stays true → cloud 402), so the mount-only PRs #2454/#2439/#2444/#2448 were all closed as dups of #2443. _(Four duplicate config.json PRs is itself the workflow gap this epic + monorepo#2052 address.)_
- **Runtime routing:** #2445 (kubescape-scoped Alertmanager → Headlamp + Slack + Coroot).
- **Posture exceptions / hardening:** #2434 (C-0026), #2440 (C-0002 infra), #2442 (C-0015), #2446 (C-0007), #2453 (C-0002 tenant), #2436 / #2437 (non-root).

**The genuinely-unaddressed gaps this investigation surfaced are #2450 (CVE scanning is silently broken — nobody was on it) and #2451 (continuous ingestion so this can't recur), plus the daily-engineer workflow change (monorepo#2052).** This epic ties the scattered work into one "100% and hold" line so the number is tracked to completion rather than fixed piecemeal.

## Non-goals / guardrails

- Root-cause fixes only: no `t.Skip`, no threshold-lowering, no exception-as-first-resort.
- Never spin up clusters to test.
- All changes ship as draft PRs under static validation.
- Exceptions stay minimal, scoped, and justified (exceptions-as-code).

