39 changes: 39 additions & 0 deletions docs/dr/alerting.md
@@ -90,6 +90,45 @@ stays quiet by design, exactly as the old Alertmanager did.
container logs/traces, not host audit-log files, so the previous
alloy-audit → Loki pipeline was removed.

## Kubescape runtime-detection alerts

The Kubescape node-agent's **runtime-detection** alerts (rule violations,
malware) are the one signal that does **not** fit the Coroot model. Their
only first-class dashboard, the **Headlamp Kubescape plugin's "Runtime
Detection → Alerts" tab**, reads exclusively from a Prometheus **Alertmanager**
(`GET /api/v2/alerts`, filtered on `alertname="KubescapeRuleViolated"`). It
cannot read Kubescape's storage CRs and cannot query Coroot's Prometheus (a
metrics store, not an Alertmanager). So a **single, minimal Alertmanager** is
reintroduced *scoped to Kubescape*; this is not a re-adoption of the Prometheus
stack. It lives in the `kubescape` namespace, prod-only
(`providers/hetzner/infrastructure/controllers/alertmanager/`), requests
10m CPU / 32Mi RAM, and is ephemeral (no PVC).
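
As a rough illustration of that read path, the sketch below builds the request and applies the same filter to a decoded response. It assumes the filter is expressed through Alertmanager's v2 `filter` query parameter; whether the plugin filters server-side or client-side is not stated here, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Alert name the plugin is described as filtering on.
ALERTNAME = "KubescapeRuleViolated"


def alerts_query(alertname: str = ALERTNAME) -> str:
    """Return the path and query string for an Alertmanager v2 alerts lookup."""
    # The v2 API accepts label matchers in the `filter` query parameter.
    matcher = f'alertname="{alertname}"'
    return "/api/v2/alerts?" + urlencode({"filter": matcher})


def only_matching(alerts: list[dict], alertname: str = ALERTNAME) -> list[dict]:
    """Client-side equivalent of the matcher, applied to decoded JSON alerts."""
    return [a for a in alerts if a.get("labels", {}).get("alertname") == alertname]


if __name__ == "__main__":
    print(alerts_query())
```

Keeping the matcher in one helper means the server-side query and any client-side check cannot drift apart.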

The node-agent fans each alert out to **all three destinations** (wired in
`providers/hetzner/infrastructure/controllers/kubescape/patches/`):

| Destination | Path |
| --- | --- |
| **Headlamp plugin** | `nodeAgent.config.alertManagerExporterUrls` → the Alertmanager, which the plugin queries. |
| **Slack** | the Alertmanager's `slack_configs` receiver → the shared `${alertmanager_webhook_url}` incoming-webhook (same channel as Coroot/Flux). |
| **Coroot** | `nodeAgent.config.stdoutExporter` (on by default) → Coroot's eBPF log capture surfaces the alert in the **Logs** view (Coroot CE has no inbound alert receiver). |
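
For orientation, here is a minimal sketch of the kind of body a client pushes to Alertmanager's `POST /api/v2/alerts`, which takes a JSON array of alerts. It is not the node-agent's actual code: the `alertname` and `host` labels are the ones the routing config in this change groups on, while the node name and annotation are made up for illustration.

```python
import json
from datetime import datetime, timezone


def postable_alert(alertname: str, host: str, annotations: dict[str, str],
                   starts_at: datetime) -> dict:
    """Build one alert in the shape Alertmanager's v2 API accepts."""
    return {
        "labels": {"alertname": alertname, "host": host},
        "annotations": dict(annotations),
        "startsAt": starts_at.isoformat(),
    }


def request_body(alerts: list[dict]) -> str:
    """Serialize the alerts as the JSON array the endpoint expects."""
    return json.dumps(alerts)


if __name__ == "__main__":
    alert = postable_alert(
        "KubescapeRuleViolated",
        "node-1",                     # hypothetical node name
        {"summary": "example only"},  # hypothetical annotation
        datetime(2024, 1, 1, tzinfo=timezone.utc),
    )
    print(request_body([alert]))
```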

**One manual step (Headlamp).** The plugin's Alertmanager address is a
**per-user, per-browser** setting (stored in `localStorage`; there is no
declarative/Helm way to seed it, see [headlamp#3979](https://github.com/kubernetes-sigs/headlamp/issues/3979)).
Each operator sets it **once** in Headlamp → *Settings → Plugins → Kubescape*,
in the `namespace/service:port` form the plugin validates:

```
kubescape/alertmanager:9093
```

The plugin reaches it through the Kubernetes API server's Service proxy, so the
logged-in user needs `get`/`create` on `services/proxy` in the `kubescape`
namespace (satisfied by the admin binding). Until it is set, the tab shows
"Alertmanager URL is not configured": the data source now exists, and only the
per-user pointer remains manual.
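
To make the Service-proxy hop concrete, the sketch below turns the `namespace/service:port` setting into the API-server path the request travels through. The parsing rules are an assumption for illustration (the plugin's real validation may differ) and the function names are hypothetical.

```python
def parse_setting(value: str) -> tuple[str, str, int]:
    """Split a `namespace/service:port` string into its three parts."""
    namespace, slash, rest = value.partition("/")
    service, colon, port = rest.partition(":")
    if not (slash and colon and namespace and service and port.isdigit()):
        raise ValueError(f"expected namespace/service:port, got {value!r}")
    return namespace, service, int(port)


def proxy_path(value: str, api_path: str = "/api/v2/alerts") -> str:
    """Return the API-server Service-proxy path for the given setting."""
    namespace, service, port = parse_setting(value)
    return (
        f"/api/v1/namespaces/{namespace}"
        f"/services/{service}:{port}/proxy{api_path}"
    )


if __name__ == "__main__":
    print(proxy_path("kubescape/alertmanager:9093"))
```

Passing the resulting path to `kubectl get --raw` is one way to check that the logged-in identity can reach the Alertmanager through the proxy before touching the plugin setting.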

## Dead-man's-switch (off-cluster heartbeat)

In-cluster alerting cannot tell you the cluster is down, because it is down too. A tiny
@@ -0,0 +1,57 @@
# Adds the Alertmanager-specific flows on top of the namespace-wide
# `allow-kubescape` policy (bases/infrastructure/controllers/kubescape/) and the
# Kyverno-generated default-deny. Cilium composes policies ADDITIVELY (the
# effective allow-set is the UNION across every policy selecting the endpoint), so
# this only needs to add what the base does not already grant the Alertmanager pod:
#   * the node-agent → Alertmanager:9093 push is intra-namespace, already allowed
#     by allow-kubescape's intra-namespace rules, so it is not repeated here;
#   * DNS egress is already granted to every kubescape pod by allow-kubescape.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-alertmanager
  namespace: kubescape
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: alertmanager
  ingress:
    # The Headlamp Kubescape plugin reads GET /api/v2/alerts through the
    # Kubernetes API server's Service proxy (/api/v1/namespaces/kubescape/
    # services/alertmanager:9093/proxy/...), so the connection to :9093
    # originates from the API server. Mirror the entity set the base
    # allow-kubescape policy uses for the webhook/host-scanner ports.
    - fromEntities:
        - kube-apiserver
        - remote-node
        - host
      toPorts:
        - ports:
            - port: "9093"
              protocol: TCP
  egress:
    # Post alert notifications to the Slack incoming-webhook. Pinned by FQDN
    # (not world:443) so a compromised pod cannot exfiltrate elsewhere; this is
    # the same lockdown posture as allow-kubescape's toFQDNs list.
    - toFQDNs:
        - matchName: "hooks.slack.com"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
    # DNS resolution, so Cilium's L7 DNS proxy can learn hooks.slack.com's IPs and
    # enforce the toFQDNs allow above (the base policy engages the proxy too, but
    # keep this self-contained).
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: UDP
            - port: "53"
              protocol: TCP
          rules:
            dns:
              - matchPattern: "*"
@@ -0,0 +1,120 @@
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: alertmanager
  namespace: kubescape
  labels:
    helm.toolkit.fluxcd.io/remediation: enabled
spec:
  # WHY THIS EXISTS (prod-only). The platform's general alerting was migrated off
  # the kube-prometheus-stack onto Coroot (docs/dr/alerting.md), so there is no
  # Alertmanager in the cluster. But the Kubescape node-agent's runtime-detection
  # alerts have exactly one first-class dashboard, the Headlamp Kubescape
  # plugin's "Runtime Detection > Alerts" tab, and that tab reads ONLY from a
  # Prometheus Alertmanager `GET /api/v2/alerts` (filtered on
  # alertname="KubescapeRuleViolated"); it cannot read Kubescape storage CRs and
  # cannot query Coroot's Prometheus (a metrics store, not an Alertmanager). So a
  # single, minimal Alertmanager is reintroduced here, scoped to the kubescape
  # namespace and to runtime alerts only, NOT a general re-adoption of the
  # Prometheus stack. One instance both RECEIVES the node-agent's pushed alerts
  # (POST /api/v2/alerts, wired via the nodeAgent.config.alertManagerExporterUrls
  # patch in ../kubescape/patches/) and SERVES them to the Headlamp plugin (GET),
  # and routes them to Slack (slack_configs → the shared incoming-webhook). The
  # third destination, Coroot, is fed independently by the node-agent's stdout
  # exporter (Coroot's eBPF log capture surfaces the alerts in its Logs view;
  # Coroot CE has no inbound alert receiver). Prod-only because runtimeDetection
  # is disabled on the docker/local overlay and Slack is a prod-only concern.
  interval: 10m
  timeout: 10m
  chart:
    spec:
      chart: alertmanager
      version: 1.40.1
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
  # https://github.com/prometheus-community/helm-charts/blob/main/charts/alertmanager/values.yaml
  values:
    # Deterministic in-cluster name so the node-agent exporter target and the
    # Headlamp plugin setting can hard-code it: Service = alertmanager.kubescape.svc:9093.
    fullnameOverride: alertmanager
    replicaCount: 1
    # This Alertmanager never calls the Kubernetes API, so don't mount the SA
    # token (chart default is true).
    automountServiceAccountToken: false
    # Ephemeral by design: this Alertmanager only relays transient runtime alerts
    # (no silences or gossip cluster worth persisting), so drop the chart's PVC
    # and back /alertmanager with an emptyDir. Keeps it a zero-storage add-on.
    persistence:
      enabled: false
    # The chart default is `resources: {}` (no requests or limits), and the
    # kubescape namespace is excluded from auto-vpa, so pin an explicit, tiny block.
    resources:
      requests:
        cpu: 10m
        memory: 32Mi
      limits:
        cpu: 100m
        memory: 128Mi
    # The kubescape namespace is excluded from the platform's add-security-context
    # mutation and the enforced PSS-restricted policy, so harden the container
    # here instead of relying on Kyverno. readOnlyRootFilesystem is safe: the only
    # path Alertmanager writes is --storage.path=/alertmanager, mounted as the
    # emptyDir above; config + the webhook secret are read-only mounts.
    podSecurityContext:
      fsGroup: 65534
      runAsNonRoot: true
      seccompProfile:
        type: RuntimeDefault
    securityContext:
      runAsNonRoot: true
      runAsUser: 65534
      runAsGroup: 65534
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
      seccompProfile:
        type: RuntimeDefault
    # Mount the Slack incoming-webhook from a Secret so it never lands in the
    # rendered alertmanager.yml ConfigMap; referenced below via slack_configs.api_url_file.
    # Sibling path (NOT under /etc/alertmanager, which the chart already mounts
    # the config ConfigMap onto) to avoid a nested volume mount.
    extraSecretMounts:
      - name: slack-webhook
        mountPath: /etc/alertmanager-secrets
        secretName: alertmanager-slack-webhook
        readOnly: true
    config:
      enabled: true
      global:
        resolve_timeout: 5m
      route:
        receiver: slack
        # The node-agent labels every alert with alertname
        # (KubescapeRuleViolated / KubescapeMalwareDetected) + host, so group
        # per rule per node. A single catch-all route sends everything to Slack.
        group_by:
          - alertname
          - host
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 4h
      receivers:
        - name: slack
          slack_configs:
            # api_url_file reads the webhook from the mounted Secret (above),
            # keeping it out of the ConfigMap. The file name in this path MUST
            # match the Secret's key name (slack-webhook-url in secret.yaml);
            # the two are coupled.
            - api_url_file: /etc/alertmanager-secrets/slack-webhook-url
              # channel is optional for an incoming webhook (the channel is baked
              # into the webhook URL); set here only for clarity in the UI.
              channel: "#platform-alerts"
              send_resolved: true
              title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }} on {{ .CommonLabels.host }}'
              # Render every annotation the node-agent attaches (keys vary by
              # rule), one per line, for each grouped alert.
              text: |-
                {{ range .Alerts }}{{ range .Annotations.SortedPairs }}*{{ .Name }}:* {{ .Value }}
                {{ end }}{{ end }}
@@ -0,0 +1,9 @@
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: prometheus-community
  namespace: kubescape
spec:
  interval: 1h
  url: https://prometheus-community.github.io/helm-charts
@@ -0,0 +1,8 @@
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - helm-repository.yaml
  - helm-release.yaml
  - secret.yaml
  - cilium-network-policy.yaml
@@ -0,0 +1,18 @@
# Slack incoming-webhook the Alertmanager Slack receiver POSTs Kubescape
# runtime-detection alerts to. Reuses ${alertmanager_webhook_url}, the SAME
# per-cluster webhook Coroot's incidents and the Flux notification-controller
# already post to (injected by Flux substitution from the variables-cluster
# Secret), so there is nothing new to provision, and runtime alerts land in the
# same Slack channel as the rest of the platform's alerting. The inline default
# keeps `ksail workload validate` (which has no SOPS access) building and leaves
# local/CI inert; this overlay is prod-only anyway (runtimeDetection is disabled
# on docker/local). Mounted read-only into the Alertmanager pod and referenced by
# slack_configs.api_url_file so the URL never lands in the rendered ConfigMap.
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-slack-webhook
  namespace: kubescape
type: Opaque
stringData:
  slack-webhook-url: ${alertmanager_webhook_url:=https://example.invalid/no-slack-configured}
@@ -0,0 +1,33 @@
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kubescape
  namespace: kubescape
spec:
  # Matches the base interval so this patch is a no-op on that field once merged.
  interval: 10m
  values:
    nodeAgent:
      config:
        # Fan the runtime-detection alerts out to their three destinations
        # (prod-only: runtimeDetection is disabled on the docker/local overlay).
        #
        # 1. Headlamp + 2. Slack: via the in-cluster Alertmanager deployed in
        #    ../alertmanager/. The node-agent POSTs alerts to it; the Headlamp
        #    Kubescape plugin's "Runtime Detection > Alerts" tab reads them back
        #    (GET /api/v2/alerts), and the Alertmanager Slack receiver forwards
        #    them to the shared incoming-webhook. The value is HOST:PORT ONLY
        #    (no scheme, no path); the node-agent's Prometheus Alertmanager
        #    client appends /api/v2/alerts itself.
        # 3. Coroot: via stdout. Coroot CE has no inbound alert receiver, but
        #    its eBPF node-agent captures this pod's stdout, so the alerts show
        #    up in Coroot's Logs view (and its log-pattern inspection). This is
        #    the chart default (stdoutExporter: true); pinned here so the Coroot
        #    path is explicit and survives a chart-default change.
        #
        # The default httpExporterConfig (→ synchronizer:8089) is deliberately
        # left untouched: it is what feeds the plugin's other Runtime Detection
        # tabs (Application Profiles, Rules, …) from the Kubescape storage CRs.
        alertManagerExporterUrls:
          - alertmanager.kubescape.svc:9093
        stdoutExporter: true
@@ -8,6 +8,13 @@ resources:
  # local/CI only): no VirtualMachines/DataVolumes exist on Hetzner and
  # virt-handler crash-looped (1000+ restarts) with zero VMs.
  - ../../../../bases/infrastructure/controllers/
  # Prod-only: a minimal Alertmanager scoped to Kubescape runtime-detection
  # alerts. The Headlamp Kubescape plugin's "Runtime Detection > Alerts" tab
  # reads ONLY from a Prometheus Alertmanager (GET /api/v2/alerts), which the
  # Coroot migration removed from the cluster; this reintroduces one, fed by the
  # node-agent (kubescape/patches/ below) and routing to Slack. See its
  # helm-release.yaml header. Prod-only (runtimeDetection is off on docker/local).
  - alertmanager/
  # Prod-only: layers a cluster-wide mutual-auth (SPIRE mTLS) policy on top of
  # the base Cilium controller listed above. SPIRE is enabled only on Hetzner,
  # so the policy lives here, not in the base; see cilium/ for the rationale.
@@ -70,6 +77,10 @@ patches:
  # eager startup OIDC discovery that needs a cluster-resolvable + TLS-trusted
  # issuer, which the local *.platform.lan host-file domain isn't; see file.
  - path: ksail-operator/patches/helm-release-patch.yaml
  # Prod-only: point the Kubescape node-agent's runtime-detection alerts at the
  # in-cluster Alertmanager (alertmanager/ above) + keep the stdout exporter on,
  # fanning them out to the Headlamp plugin, Slack, and Coroot; see file.
  - path: kubescape/patches/helm-release-patch.yaml
  # NB: the Coroot CR moved to the `infrastructure` layer (so the operator's
  # CRD is installed first), so its hetzner patch now lives in
  # ../coroot/patches/ and is applied by ../kustomization.yaml, not here.