Problem
A live investigation on 2026-07-04 found that all three Kubescape surfaces are broken or invisible, not clean. Every "0 findings" reading reflects an absence of data, not health, and none of it reaches a human or the issue backlog, so the failure has aged silently for about a week.
| Surface | State | Root cause (verified against live prod) |
| --- | --- | --- |
| Posture (config scan) | 🔴 Broken | `kubescape-volume` (an empty `emptyDir`) mounts at `…/.kubescape/config.json` with `subPath: config.json` → kubelet materialises `config.json` as a directory → scanner aborts every run. Live log: `open …/config.json: is a directory`. 1542/1542 scan objects have `controls: null`; compliance 0.00 across all 9 frameworks; data ~7 days stale. |
| CVEs (kubevuln) | 🔴 Broken/invisible | The relevancy scan path (`ScanCP`) is the only one wired, and it aborts on partial ApplicationProfiles ("workload restart required", ~815×/day). Verified: 197/197 image manifests have 0 grype matches / no scanner name; 109/109 summaries have `.all=0` and `.relevant=0`. The cluster is not 0-CVE; nothing is being scanned. |
| Runtime detection (node-agent) | 🟠 Invisible | `alertManagerExporterUrls: []`, `prometheusExporterEnabled: false`, `stdoutExporter: true` → every runtime threat alert goes to stdout and vanishes. Malware detection is off, with 0 RuntimeRuleAlertBindings and 0 alerts routed anywhere in 24h. Node-agent is healthy and learning (288 profiles, 123 still partial). |
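The posture failure in the table comes from a `subPath` mount backed by an empty `emptyDir`. Since changes under this epic ship under static validation, a check along the following lines could flag that pattern before it reaches the cluster. This is only a sketch: the function name, the container name `scanner`, and the dict shape (a pod spec already parsed from YAML) are assumptions for illustration, not part of the Kubescape chart.

```python
from __future__ import annotations

from typing import Any


def find_subpath_on_emptydir(pod_spec: dict[str, Any]) -> list[str]:
    """Return 'container:mountPath' for every subPath mount backed by an emptyDir."""
    # Names of all volumes declared as emptyDir in this pod spec.
    empty_dirs = {v["name"] for v in pod_spec.get("volumes", []) if "emptyDir" in v}
    hits = []
    for container in pod_spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            # A subPath into a fresh emptyDir has no file to bind to, which is
            # the condition the table reports as "config.json is a directory".
            if mount.get("subPath") and mount.get("name") in empty_dirs:
                hits.append(f'{container["name"]}:{mount["mountPath"]}')
    return hits


# Hypothetical spec mirroring the reported mount; the path stays elided as in the report.
spec = {
    "volumes": [{"name": "kubescape-volume", "emptyDir": {}}],
    "containers": [
        {
            "name": "scanner",  # assumed container name
            "volumeMounts": [
                {
                    "name": "kubescape-volume",
                    "mountPath": "…/.kubescape/config.json",
                    "subPath": "config.json",
                }
            ],
        }
    ],
}

hits = find_subpath_on_emptydir(spec)
print(hits)
```

A check like this only catches the mount pattern; as noted elsewhere in this epic, repairing the mount alone does not make posture results persist.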
Secondary posture degraders (independent of the mount): the scanner SA cannot list `validatingadmissionpolicies` or Istio `VirtualService` at cluster scope (forbidden), and cloud-provider controls are N/A on Talos.
Meta-problem: the daily engineer is issue-driven, but its survey is GitHub-only (`portfolio-surveyor` is `kubectl`-free by design), so no live cluster finding enters the backlog unless a human files it. Combined with a CI gate that is a floor (85) rather than a target of 100, and with no fix-vs-except ladder or security definition-of-done, the whole stack could rot to 0.00 unnoticed, which is what happened.
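The core distinction in this section is that "0 findings" with no data is not the same as "0 findings" from a healthy scan. A minimal sketch of that distinction follows. The function name, the `controls` mapping shape, and the one-day staleness window are assumptions chosen for illustration; they are not taken from Kubescape's result schema.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_scan(
    controls: Optional[dict[str, str]],
    scanned_at: datetime,
    now: datetime,
    max_age: timedelta = timedelta(days=1),  # assumed freshness window
) -> str:
    """Classify one scan object as 'no-data', 'stale', 'failing', or 'healthy'."""
    # controls: null means the scanner never produced a verdict.
    if controls is None:
        return "no-data"
    # A verdict older than the window says nothing about the current state.
    if now - scanned_at > max_age:
        return "stale"
    failed = sum(1 for status in controls.values() if status == "failed")
    return "failing" if failed else "healthy"


now = datetime(2026, 7, 4, tzinfo=timezone.utc)
week_old = now - timedelta(days=7)

print(classify_scan(None, week_old, now))
print(classify_scan({"example-control": "passed"}, week_old, now))
print(classify_scan({"example-control": "passed"}, now, now))
```

Checking for missing data before counting failures is the point of the ordering: a scan object with `controls: null` never reaches the branch that could report it as healthy.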
Goal
Drive posture + CVE + runtime to a defensible 100% and hold it, with prevention (stop the number creeping back up) and continuous ingestion (findings become tracked issues the daily engineer drains oldest-first) — without reliability regressions.
Approach — two loops
Prevention loop: fix the scanners → wire the in-repo `ClusterSecurityException` set into the CI scan via ksail native `--exceptions` and ratchet `--compliance-threshold` toward 100 → graduate each fixed control into a Kyverno `Enforce` policy → SARIF → GitHub Code Scanning gate on PRs. Ratchet-only-tightens.
Backlog loop: a scheduled, fingerprint-deduped bridge turns live Kubescape findings (posture / CVE / runtime) into `security` issues under this epic; the daily engineer drains them oldest-first with a fix > except discipline (an exception is last-resort, scoped, justified, and periodically pruned).
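The fingerprint-deduped bridge in the backlog loop could work roughly as sketched here: derive a stable fingerprint from the identity of a finding, then draft an issue only for fingerprints that are not already tracked. The field names (`surface`, `control_id`, `namespace`, `kind`, `name`), the issue title format, and the 16-character fingerprint length are all assumptions for illustration; the actual bridge design is tracked in #2451.

```python
from __future__ import annotations

import hashlib


def fingerprint(surface: str, control_id: str, namespace: str, kind: str, name: str) -> str:
    """Stable identity for a finding, independent of scan time or ordering."""
    # The unit separator avoids collisions such as ("ab", "c") vs ("a", "bc").
    key = "\x1f".join([surface, control_id, namespace, kind, name])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def plan_issues(findings: list[dict[str, str]], open_fingerprints: set[str]) -> list[dict]:
    """Return one issue draft per finding whose fingerprint is not yet tracked."""
    seen = set(open_fingerprints)
    drafts = []
    for f in findings:
        fp = fingerprint(f["surface"], f["control_id"], f["namespace"], f["kind"], f["name"])
        if fp in seen:
            continue  # already an open issue, or a duplicate within this run
        seen.add(fp)
        drafts.append(
            {
                "fingerprint": fp,
                "title": f'[{f["surface"]}] {f["control_id"]} on {f["kind"]}/{f["name"]}',
                "labels": ["security"],
            }
        )
    return drafts


# Hypothetical findings; the control IDs are placeholders, not real Kubescape IDs.
finding_a = {"surface": "posture", "control_id": "EXAMPLE-1", "namespace": "default",
             "kind": "Deployment", "name": "web"}
finding_b = {"surface": "cve", "control_id": "EXAMPLE-2", "namespace": "default",
             "kind": "Deployment", "name": "web"}

drafts = plan_issues([finding_a, finding_a, finding_b], open_fingerprints=set())
print([d["title"] for d in drafts])
```

Because the fingerprint excludes timestamps and scan IDs, re-running the bridge on an unchanged cluster produces no new issues, which is what keeps the backlog drainable oldest-first.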
Decisions (settled with the maintainer)
Runtime alert routing = in-flight #2445 "feat(kubescape): route runtime-detection alerts to Headlamp, Slack, and Coroot" (reconciled 2026-07-04): a single tiny kubescape-scoped Alertmanager (prod-only) fed by the node-agent, routing to Headlamp (its Runtime-Detection Alerts tab reads only from an Alertmanager), Slack (the existing webhook), and Coroot (node-agent stdout → Coroot logs). A Coroot-only PromQL path was considered and rejected: it can't feed the Headlamp plugin, and Coroot CE has no inbound alert receiver.
CVE = both: decouple a full non-relevancy scan so grype data flows now, and complete the partial ApplicationProfiles for the reachable-CVE view over time.
Automation = full loop: scanners fixed + CI ratchet + SARIF→Code Scanning + CRD→issue bridge + Kyverno graduation + the daily-engineer definition changes.
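The "ratchet-only-tightens" rule behind the CI ratchet can be stated in a few lines. This sketch assumes the threshold is an integer percentage and that the ratchet rounds the measured score down; neither detail is confirmed by this issue, and the function names are hypothetical.

```python
import math


def ratchet(current_threshold: int, measured_score: float, target: int = 100) -> int:
    """Return the next threshold: never lower than the current one, never above target."""
    # Round down so the new gate is one the current scan already passes.
    candidate = min(target, math.floor(measured_score))
    return max(current_threshold, candidate)


def gate_passes(measured_score: float, threshold: int) -> bool:
    """CI gate: the scan passes only at or above the threshold."""
    return measured_score >= threshold


threshold = 85  # the current floor cited in this issue
for score in (91.7, 88.0, 100.0):  # hypothetical scan scores
    threshold = ratchet(threshold, score)
    print(score, threshold)
```

The `max` is what makes the ratchet one-way: a later, worse score (88.0 after 91.7 in the example) leaves the threshold where it was and fails the gate instead of loosening it.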
Children (oldest/most-blocking first)
… `keepLocal`) + scanner v4.0.10 so exception CRs apply (in-flight; the mount-only PRs #2454 "fix(kubescape): repair offline config.json mount so posture scans persist", #2439 "fix(kubescape): remount scanner config.json so offline posture scans persist", #2444 "fix(kubescape): mount scanner config.json dir so posture scans persist", and #2448 "fix(kubescape): config-scan aborts every run — config.json mounted as a directory (compliance 0.00)" were all closed as dups of #2443 "fix(kubescape): keep scheduled posture scans local so results persist"; the mount fix alone is insufficient)
… `ScanCP` partial-profile block)
… `ClusterSecurityException` into the CI scan and ratchet `--compliance-threshold` toward 100 (already filed; adopted under this epic)
Already in-flight (parallel runs, 2026-07-04): this epic aggregates them
Much of the posture + runtime + exceptions work was already underway when this epic was filed:
… `keepLocal`) + #2452 "fix(kubescape): bump posture scanner to v4.0.10 so exception CRs apply offline" (scanner v4.0.10 so exception CRs apply offline). The `config.json`-is-a-directory abort is real, but the mount fix alone is insufficient (Submit stays true → cloud 402), so the mount-only PRs #2454 "fix(kubescape): repair offline config.json mount so posture scans persist", #2439 "fix(kubescape): remount scanner config.json so offline posture scans persist", #2444 "fix(kubescape): mount scanner config.json dir so posture scans persist", and #2448 "fix(kubescape): config-scan aborts every run — config.json mounted as a directory (compliance 0.00)" were all closed as dups of #2443 "fix(kubescape): keep scheduled posture scans local so results persist". (Four duplicate config.json PRs is itself the workflow gap this epic + monorepo#2052 address.)
The genuinely-unaddressed gaps this investigation surfaced are #2450 (CVE scanning is silently broken; nobody was on it) and #2451 (continuous ingestion so this can't recur), plus the daily-engineer workflow change (monorepo#2052). This epic ties the scattered work into one "100% and hold" line so the number is tracked to completion rather than fixed piecemeal.
Non-goals / guardrails
Root-cause fixes only (no `t.Skip`/threshold-lowering/exception-as-first-resort); never spin up clusters to test; all changes ship as draft PRs under static validation; exceptions stay minimal, scoped, and justified (exceptions-as-code).
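The "minimal, scoped, and justified" guardrail for exceptions lends itself to a static lint. The sketch below is illustrative only: the record fields (`justification`, `resources`, `review_by`) are hypothetical and are not the `ClusterSecurityException` CRD schema, and a real check would read the in-repo manifests instead of plain dicts.

```python
from __future__ import annotations

from datetime import date


def lint_exception(exc: dict, today: date) -> list[str]:
    """Return the reasons an exception record fails the guardrails (empty if it passes)."""
    problems = []
    # Justified: a non-blank explanation must accompany every exception.
    if not str(exc.get("justification", "")).strip():
        problems.append("missing justification")
    # Scoped: at least one named resource, and no wildcard matches.
    resources = exc.get("resources", [])
    if not resources or any("*" in r for r in resources):
        problems.append("not scoped to named resources")
    # Periodically pruned: a review date that has not yet passed.
    review_by = exc.get("review_by")
    if review_by is None or review_by < today:
        problems.append("review date missing or passed")
    return problems


today = date(2026, 7, 4)
good = {
    "justification": "Control not applicable on this platform",  # hypothetical text
    "resources": ["kube-system/Deployment/example"],
    "review_by": date(2026, 10, 1),
}
bad = {"justification": " ", "resources": ["*"], "review_by": date(2026, 1, 1)}

print(lint_exception(good, today))
print(lint_exception(bad, today))
```

Returning every violated guardrail in one pass, instead of stopping at the first, gives a PR author the full list to fix at once.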