9 changes: 5 additions & 4 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -206,18 +206,19 @@ You **cannot** decrypt existing secrets without the proper Age keys. For local d
- **Never commit plaintext secrets** — all secrets must be SOPS-encrypted with the `.enc.yaml` suffix.
- **Base files are immutable** — use Kustomize `patches:` in overlays; never edit `k8s/bases/` directly from a provider or cluster overlay.
- **Flux dependency order** — `bootstrap` → `infrastructure-controllers` → `infrastructure` → `apps`. One prod-only side layer hangs off `infrastructure` without gating `apps`: `infrastructure-overprovisioning` (apply-only autoscaler buffer). Declarative GitHub org management runs as a normal **app** (`github-config`) consuming the `devantler-tech/.github` artifact, with its Crossplane provider in the `infrastructure` layer — see [`docs/github-management.md`](docs/github-management.md).
-- **File & directory naming** — kebab-case folders, one resource per file, and filenames led by the resource Kind (CR folders excepted); enforced by the `naming` CI job. See [File and Directory Naming Conventions](#file-and-directory-naming-conventions) below.
+- **File & directory naming** — kebab-case folders, one resource per file, and filenames led by the resource Kind (CR folders and `patches/` excepted — both name files by intent). Talos machine-config patches (`talos/`, `talos-local/`) also hold one document per file with intent names; only the k8s-manifest-specific rules don't apply to them. Enforced by the `naming` CI job. See [File and Directory Naming Conventions](#file-and-directory-naming-conventions) below.

### File and Directory Naming Conventions

Enforced in CI by [`scripts/validate-naming.py`](scripts/validate-naming.py) (the `naming` job in `ci.yaml`); run it locally before any manifest PR.

- **Directories are kebab-case**, named after the **application/component** *or* a **CR Kind in plural**. Co-locate a component's own CRs in its folder by default; break a CR out into a `‹kind-plural›/` folder only when it cannot live with its component (see the two reasons in the next section). `‹kind-plural›` is the **kebab-cased plural of the Kind** (`VerticalPodAutoscaler → vertical-pod-autoscalers/`, `LimitRange → limit-ranges/`) — a folder that groups ≥2 instances of one non-workload Kind under any other name is flagged.
-- **One Kubernetes resource per file.** The only exception is a vendored upstream operator bundle, explicitly whitelisted in the validator (today `controllers/cdi/cdi-operator.yaml` and `controllers/kubevirt/kubevirt-operator.yaml`).
+- **One Kubernetes resource per file** — patch fragments included. The only exception is a vendored upstream operator bundle, explicitly whitelisted in the validator (today `controllers/cdi/cdi-operator.yaml` and `controllers/kubevirt/kubevirt-operator.yaml`).
- **Component-folder files are named after their resource Kind, kebab-cased**: `‹kind›.yaml` (e.g. `helm-release.yaml`, `http-route.yaml`, `cilium-network-policy.yaml`, `service-account.yaml`). When a folder holds more than one of a Kind, qualify each with a purpose: `‹kind›-‹purpose›.yaml` (e.g. `external-secret-db-backup.yaml`). The Kind→kebab map is acronym-aware: `HTTPRoute → http-route`, `OCIRepository → oci-repository`, `CiliumNetworkPolicy → cilium-network-policy`, `PodDisruptionBudget → pod-disruption-budget`.
- **CR-folder files** omit the folder-implied Kind and are named `‹verb›-‹purpose›.yaml` (e.g. `restrict-tenant-secret-stores.yaml`).
- A **Flux `Kustomization` CR** (`kustomize.toolkit.fluxcd.io`) is named `flux-kustomization*.yaml`; the `flux-` prefix disambiguates it from the kustomize **build** file, which must stay exactly `kustomization.yaml` (`kustomize.config.k8s.io`).
-- **Patch fragments** (under a `patches/` directory or named `*-patch.yaml`) are overlay inputs, not deployed resources — they keep an intent-describing name and are exempt from the Kind-leads rule (but stay kebab-case, one-resource, and keep the `flux-kustomization` prefix where applicable).
+- **Patch fragments** are overlay inputs, not deployed resources. They live under a `patches/` directory (a `*-patch.yaml` loose next to a kustomization is flagged as misplaced) and follow the **CR-folder naming convention**: an intent-describing `‹verb›-‹purpose›.yaml` (e.g. `enable-oidc.yaml`, `store-spire-data-on-hcloud.yaml`) that neither leads with the patched Kind nor carries a `-patch` suffix — the folder already says it's a patch. One-resource-per-file applies to them too; a patch on a Flux `Kustomization` CR keeps the `flux-kustomization` prefix (e.g. `flux-kustomization-protect-wedding-db.yaml`).
+- **Talos machine-config patches** (`talos/`, `talos-local/`) follow the same spirit: **one YAML document per file** and intent-describing `‹verb›-‹purpose›.yaml` names (e.g. `enable-apparmor.yaml`, `block-ingress-by-default.yaml`, `allow-kubelet-ingress.yaml`). They are Talos config fragments, not Kubernetes manifests, so the k8s-specific rules — Kind-led filenames, `patches/` placement, the `flux-kustomization` prefix — are the only parts that don't apply. Ingress-firewall rule files stay **one `NetworkRuleConfig` per file**, but keep the rule *count* low by consolidating ports into an existing rule when protocol + subnets match (see the ENOBUFS note in `talos/control-planes/allow-public-ingress.yaml`).
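
Reviewer note: the naming rules in this hunk can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the contents of `scripts/validate-naming.py`, whose actual implementation is not shown in this diff. The function names `kind_to_kebab` and `is_well_named_patch` are hypothetical, and the patch check covers only three of the rules described above (placement under `patches/`, kebab-case, no `-patch` suffix).

```python
import re
from pathlib import PurePosixPath

# Word boundaries inside a Kind: before a capital that follows a lower-case
# letter or digit, and before the last capital of an acronym run
# (HTTP|Route, OCI|Repository).
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")
_KEBAB_YAML = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*\.yaml$")


def kind_to_kebab(kind: str) -> str:
    """Acronym-aware Kind -> kebab-case stem (hypothetical helper)."""
    return _BOUNDARY.sub("-", kind).lower()


def is_well_named_patch(path: str) -> bool:
    """Hypothetical check: under patches/, kebab-case, no -patch suffix."""
    p = PurePosixPath(path)
    return (
        "patches" in p.parts[:-1]
        and _KEBAB_YAML.match(p.name) is not None
        and not p.name.endswith("-patch.yaml")
    )


if __name__ == "__main__":
    for kind in ("HTTPRoute", "OCIRepository", "CiliumNetworkPolicy"):
        print(kind, "->", kind_to_kebab(kind) + ".yaml")
    print(is_well_named_patch("cilium/patches/store-spire-data-on-hcloud.yaml"))
```

The two-branch regular expression is what keeps acronyms together: a plain "split before every capital" rule would turn `HTTPRoute` into `h-t-t-p-route`. The Kind-led and `flux-kustomization` prefix rules are deliberately left out of the sketch.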

### Infrastructure File Structure Convention

@@ -255,7 +256,7 @@ The platform uses a hierarchical kustomization structure: **base** configuration
- **Workaround:** fork the repository and use your own Age keys; re-encrypt every `*.enc.yaml` with your key.

### CNI Configuration
-- The Talos cluster starts with its default CNI disabled (via `talos-local/cluster/cni.yaml`).
+- The Talos cluster starts with its default CNI disabled (via `talos-local/cluster/disable-default-cni-and-kube-proxy.yaml`).
- Nodes stay `NotReady` until Cilium is installed by KSail.
- This is expected — KSail handles CNI installation automatically.

2 changes: 1 addition & 1 deletion docs/TENANTS.md
@@ -191,7 +191,7 @@ Open the change as a PR; once merged, Flux reconciles the new tenant.

A tenant's manifests ship from its own OCI artifact, but the **prod (hetzner) overlay** can
layer a `Kustomization` `spec.patches` onto the tenant's platform-side Flux `Kustomization`
-at `k8s/providers/hetzner/apps/<tenant>/patches/kustomization-patch.yaml`, which Flux then
+at `k8s/providers/hetzner/apps/<tenant>/patches/flux-kustomization-<purpose>.yaml`, which Flux then
applies to the tenant's resources *after* pulling the artifact. This is a **narrow escape
hatch**, not a place for tenant config.

2 changes: 1 addition & 1 deletion docs/dr/alerting.md
@@ -33,7 +33,7 @@ pressure (the operator only exposes `priorityClassName` on the node-agent).
Local/CI (docker provider) runs the same CR on the cluster's default storage
class (ephemeral — losing telemetry on a restart is fine there). The `hcloud`
PVC overrides and longer retention live in the hetzner overlay
-(`k8s/providers/hetzner/infrastructure/controllers/coroot/patches/`), the same
+(`k8s/providers/hetzner/infrastructure/coroot/patches/`), the same
way OpenBao gets block storage.

## SSO
4 changes: 2 additions & 2 deletions docs/dr/spire-server-ha.md
@@ -39,7 +39,7 @@ Three independent blockers, each fatal to a blind change:
i.e. **behind the very gate SPIRE bootstraps**. SPIRE down → its Postgres
unreachable/uncertifiable → SPIRE can't start → loop. This is precisely the
SPIRE↔Longhorn deadlock the prod overlay already engineered around by moving
-the datastore to hcloud-csi (`cilium/patches/spire-datastorage-patch.yaml`,
+the datastore to hcloud-csi (`cilium/patches/store-spire-data-on-hcloud.yaml`,
2026-06-06 prod outage), but Postgres is a *busier, multi-pod, multi-node*
dependency than a single attached block device, so it is strictly harder to
make safe.
@@ -125,7 +125,7 @@ SVIDs, or it deadlocks. Options, hardest constraint first:
single most important safety prerequisite and **must land and be verified before
any replica/datastore change.** (It is purely additive — safe to ship ahead.)
- **Talos node firewall** already allows the SPIRE mesh-auth port 4250
-node-to-node (`talos/{workers,control-planes}/ingress-firewall.yaml`). Postgres
+node-to-node (`talos/workers/allow-cilium-mutual-auth-ingress.yaml`, `talos/control-planes/allow-internal-node-ingress.yaml`). Postgres
:5432 between nodes is intra-cluster pod traffic over the CNI, not a host port,
so no Talos firewall change is expected — **verify** spire-db instances and
spire-server can co-locate or cross nodes without a host-firewall drop.
6 changes: 3 additions & 3 deletions docs/runtime-security.md
@@ -24,13 +24,13 @@ they sit inside a wider set of controls:

| Layer | Control | What it does at runtime |
| --- | --- | --- |
-| Kernel LSM | **AppArmor** ([`talos/cluster/apparmor.yaml`](../talos/cluster/apparmor.yaml)) | Confines container processes to a profile; default-deny for unexpected file/cap access |
+| Kernel LSM | **AppArmor** ([`talos/cluster/enable-apparmor.yaml`](../talos/cluster/enable-apparmor.yaml)) | Confines container processes to a profile; default-deny for unexpected file/cap access |
| Syscall filter | **seccomp `RuntimeDefault`** (mutated + enforced by [Kyverno](../k8s/bases/infrastructure/cluster-policies/best-practices/validate-pod-security.yaml)) | Blocks the dangerous-syscall tail every container gets by default |
-| Kernel hardening | **sysctls** ([`talos/cluster/sysctls.yaml`](../talos/cluster/sysctls.yaml)) | `kptr_restrict`, `ptrace_scope`, unprivileged-eBPF off, etc. — shrinks the local-privesc surface |
+| Kernel hardening | **sysctls** ([`talos/cluster/harden-kernel-sysctls.yaml`](../talos/cluster/harden-kernel-sysctls.yaml)) | `kptr_restrict`, `ptrace_scope`, unprivileged-eBPF off, etc. — shrinks the local-privesc surface |
| Network | **Cilium + Hubble** | L3–L7 flow visibility and default-deny [CiliumNetworkPolicy](../k8s/bases/infrastructure/cluster-policies/best-practices/add-default-deny.yaml) per namespace |
| Runtime detection | **Kubescape node-agent** | Learned-behaviour anomaly detection, correlated with config/CVE/compliance posture |
| Runtime enforcement | **Tetragon** | Declarative kernel-hook policies that **terminate the offending process** (SIGKILL) on a policy match |
-| Forensics | **API audit log** ([`talos/cluster/audit-logging.yaml`](../talos/cluster/audit-logging.yaml)) | Who-did-what record of control-plane mutations |
+| Forensics | **API audit log** ([`talos/cluster/enable-audit-logging.yaml`](../talos/cluster/enable-audit-logging.yaml)) | Who-did-what record of control-plane mutations |

This document focuses on the two middle-to-bottom rows — the eBPF sensors.

4 changes: 2 additions & 2 deletions docs/rwx-storage.md
@@ -70,11 +70,11 @@ hcloud volume create \
--server <worker-server-name>
```

-The Talos machine config patch (`talos/workers/longhorn.yaml`) handles mounting `/dev/sdb` at `/var/lib/longhorn`.
+The Talos machine config patch (`talos/workers/mount-longhorn-data.yaml`) handles mounting `/dev/sdb` at `/var/lib/longhorn`.

> **Verify the device path** after attaching: on Hetzner Cloud, the first attached volume
> consistently appears as `/dev/sdb`. Confirm with `talosctl disks --nodes <worker-ip>`.
-> If the volume shows a different path, update `talos/workers/longhorn.yaml` accordingly.
+> If the volume shows a different path, update `talos/workers/mount-longhorn-data.yaml` accordingly.

## StorageClasses

2 changes: 1 addition & 1 deletion docs/wireguard-vpn-access.md
@@ -28,7 +28,7 @@ Talos control planes run a WireGuard **server** (`wg0` = `10.200.0.1/24`,
| --- | --- | --- |
| Gateway | single Cilium `Gateway platform` (kube-system), HTTPS:443, wildcard `gateway-tls` | `k8s/bases/infrastructure/gateway/` |
| Admin routes | all 7 HTTPRoutes bind that Gateway, `allowedRoutes.from: All` | per-controller `http-route.yaml` |
-| LB | `cilium-gateway-platform` Service `type=LoadBalancer` → **Hetzner Cloud LB** | `gateway-patch.yaml` (hcloud annotations) |
+| LB | `cilium-gateway-platform` Service `type=LoadBalancer` → **Hetzner Cloud LB** | `patches/attach-hcloud-load-balancer.yaml` (hcloud annotations) |
| LB IPs | public `49.12.20.241` (+ IPv6) **and private `10.0.1.7`**; ports 80/443 (nodePorts 32269/30755); `externalTrafficPolicy: Cluster` | live `svc` status |
| Public DNS | admin hostnames → **Cloudflare** (`188.114.96/97.1`) → proxied to the LB origin | `dig` |
| Cilium | `kube-proxy-replacement=true`, `routing-mode=tunnel/vxlan`, `enable-ipv4-masquerade=true`, LB-IPAM enabled **but no `CiliumLoadBalancerIPPool`**, **`devices = enp7s0 eth1`** | live `cilium-config` |
@@ -3,7 +3,7 @@
# (ghcr.io/devantler-tech/*) — the third verification layer, complementing:
# 1. Flux OCI *manifest* verification (apps' oci-repository.yaml verify.provider:
# cosign) — gates the GitOps artifacts, not container images.
-# 2. Talos ImageVerificationConfig (talos/cluster/image-verification.yaml)
+# 2. Talos ImageVerificationConfig (talos/cluster/verify-first-party-images.yaml)
# — gates the image bytes containerd pulls, with the SAME two signing
# identities as below. Keep both files in sync when identities change.
# This layer rejects a Pod spec referencing an unsigned/tampered first-party
@@ -3,7 +3,7 @@
# First-party images (ghcr.io/devantler-tech/*) ARE cosign-signed and verified
# — by Flux on the OCI manifest artifacts (apps' oci-repository.yaml verify.provider:
# cosign) and, at the node pull layer, by Talos ImageVerificationConfig
-# (talos/cluster/image-verification.yaml). But third-party chart images
+# (talos/cluster/verify-first-party-images.yaml). But third-party chart images
# (Cilium, Longhorn, Coroot, registry.k8s.io, …) are not all signed, and there
# is no admission-layer signature enforcement, so this cluster-wide control —
# which scans every workload image ref — remains exempted.
@@ -264,7 +264,7 @@ spec:
#
# The datastore StorageClass is intentionally NOT set here. On
# hetzner it is pinned to hcloud-csi by the prod overlay
-# (providers/hetzner/.../cilium/patches/spire-datastorage-patch.yaml)
+# (providers/hetzner/.../cilium/patches/store-spire-data-on-hcloud.yaml)
# — NOT the cluster-default longhorn. The reason is a circular
# dependency: Longhorn's own control plane is pod-to-pod traffic
# behind this very mTLS gate, so binding spire-server's PVC to
@@ -1,7 +1,7 @@
# Dedicated volume for OpenBao raft snapshots (written by the
# vault-snapshot CronJob, newest 14 retained). Uses the cluster's default
# StorageClass (the hetzner overlay pins it to hcloud — see
-# k8s/providers/hetzner/infrastructure/vault-snapshots-hcloud-patch.yaml).
+# k8s/providers/hetzner/infrastructure/patches/store-vault-snapshots-on-hcloud.yaml).
# Snapshots on this PVC are the first-line restore source after OpenBao data
# loss: the vault-config Job restores from the newest one automatically when
# it finds an uninitialized cluster alongside a surviving openbao-unseal
4 changes: 2 additions & 2 deletions k8s/providers/docker/apps/kustomization.yaml
@@ -22,12 +22,12 @@ kind: Kustomization
# kind: HelmRelease
# name: actual-budget
# namespace: actual-budget
-# path: actual-budget/patches/helm-release-patch.yaml
+# path: actual-budget/patches/shrink-persistence.yaml
# - target:
# kind: HelmRelease
# name: headlamp
# namespace: headlamp
-# path: headlamp/patches/helm-release-patch.yaml
+# path: headlamp/patches/trust-platform-ca.yaml
#
# Tenant RGD pilot (#1932 step 3) — opt-in like the apps above. Enabling it has
# the KRO controller expand ascoachingogvaner's full control-plane skeleton from
@@ -75,10 +75,10 @@ components:
# enough — no redundant `target:` selector. Matches the path-only style used by
# the hetzner overlays.
patches:
-- path: cilium/patches/helm-release-patch.yaml
-- path: metrics-server/patches/helm-release-patch.yaml
-- path: flux-operator/patches/helm-release-patch.yaml
+- path: cilium/patches/disable-encryption-and-mutual-auth.yaml
+- path: metrics-server/patches/allow-insecure-kubelet-tls.yaml
+- path: flux-operator/patches/trust-platform-ca.yaml
# Opt-in patches — uncomment alongside the matching controller above:
-# - path: kubescape/patches/helm-release-patch.yaml
-# - path: tetragon/patches/helm-release-patch.yaml
-# - path: kubevirt/patches/kubevirt-cr-patch.yaml
+# - path: kubescape/patches/disable-heavy-capabilities.yaml
+# - path: tetragon/patches/disable-helm-wait.yaml
+# - path: kubevirt/patches/enable-software-emulation.yaml
@@ -2,7 +2,7 @@
# CloudNativePG instances on 5432 for the declarative Postgres integration — the
# cluster-agent discovers the DB pods via the coroot.com/postgres-scrape
# annotations the Cluster's inheritedMetadata stamps onto them, then connects to
-# scrape pg_stat_* (see patches/postgres-cluster-patch.yaml).
+# scrape pg_stat_* (see patches/enable-coroot-monitoring.yaml).
#
# ADDITIVE to allow-backstage (Cilium unions ingress allows across policies):
# this only adds the cross-namespace observability source on 5432 while

This file was deleted.

13 changes: 6 additions & 7 deletions k8s/providers/hetzner/apps/kustomization.yaml
@@ -40,13 +40,12 @@ patches:
# removed with the app: an unreferenced patch fragment is schema-validated
# standalone by `ksail workload validate`, and a partial HelmRelease (no
# spec.interval) fails. Restore the patch from git history if re-enabling.
-- path: headlamp/patches/helm-release-patch.yaml
-- path: wedding-app/patches/flux-kustomization-patch.yaml
+- path: wedding-app/patches/flux-kustomization-protect-wedding-db.yaml
# Prod-only Coroot Postgres integration for the in-repo CNPG DBs (strategic
# merge: inheritedMetadata scrape annotations + pg_stat_statements params, plus
# for backstage a pg_monitor managed role). These self-identify, so no target.
-- path: umami/patches/postgres-cluster-patch.yaml
-- path: backstage/patches/postgres-cluster-patch.yaml
+- path: umami/patches/enable-coroot-monitoring.yaml
+- path: backstage/patches/enable-coroot-monitoring.yaml
# umami's pg_monitor grant is a JSON6902 append onto its EXISTING managed role
# (so it doesn't replace the list / desync the OpenBao-rotation enforcer); a
# JSON6902 patch has no embedded identity, so it keeps an explicit target.
@@ -56,7 +55,7 @@ patches:
kind: Cluster
name: umami-db
namespace: umami
-path: umami/patches/grant-pg-monitor-patch.yaml
+path: umami/patches/grant-pg-monitor.yaml
# User namespaces (hostUsers: false) — prod-only; see each patch file for the
# full rationale. JSON6902 patches (a list of ops) have no embedded identity,
# so they must keep an explicit target. whoami is the proven stateless canary
@@ -66,9 +65,9 @@ patches:
kind: HelmRelease
name: whoami
namespace: whoami
-path: whoami/patches/helm-release-patch.yaml
+path: whoami/patches/enable-user-namespaces.yaml
- target:
kind: HelmRelease
name: homepage
namespace: homepage
-path: homepage/patches/helm-release-patch.yaml
+path: homepage/patches/enable-user-namespaces.yaml
@@ -2,7 +2,7 @@
# CloudNativePG instances on 5432 for the declarative Postgres integration —
# the cluster-agent discovers the DB pods via the coroot.com/postgres-scrape
# annotations the Cluster's inheritedMetadata stamps onto them, then connects to
-# scrape pg_stat_* (see patches/postgres-cluster-patch.yaml).
+# scrape pg_stat_* (see patches/enable-coroot-monitoring.yaml).
#
# ADDITIVE, not a replacement for allow-umami: Cilium unions the ingress allows
# across every policy selecting an endpoint, so this standalone policy only adds
@@ -1,7 +1,7 @@
# Prod-only Coroot Postgres integration for umami-db (strategic merge onto the
# base Cluster). Coroot/SPIRE run only on Hetzner, so this lives in the hetzner
# overlay alongside umami's other prod-only wiring. The pg_monitor grant on the
-# `umami` role is a separate JSON6902 patch (grant-pg-monitor-patch.yaml) so it
+# `umami` role is a separate JSON6902 patch (grant-pg-monitor.yaml) so it
# appends to the existing managed role instead of replacing the list.
#
# inheritedMetadata.annotations — CNPG propagates these to the DB instance