Production-grade Kubernetes for a household. GitOps with Flux · Automated dependency updates with Renovate · Self-hosted by design
This is the live configuration for a multi-node Kubernetes cluster that runs a household — home automation, security cameras, media, document management, AI workloads, and the operational tooling required to keep it all up. Every change lands in Git first; Flux reconciles the cluster from there, and Renovate keeps dependencies current via PRs.
The repo is GitOps-strict: applications are declared as HelmRelease resources, secrets are pulled from 1Password through External Secrets Operator, and clusters are mostly identical except for app selection and sizing. Operational quirks, durability tiers, and security defaults live alongside the manifests in .agents/instructions/ so the conventions are enforceable, not folklore.
flowchart LR
Dev[👤 Operator] -->|git push| Repo[(📦 GitHub<br/>home-ops)]
Renovate[🤖 Renovate] -.->|automated PRs| Repo
Repo -->|reconciles| Flux[⚙️ Flux]
Flux -->|deploys| Cluster[☸️ Kubernetes<br/>11 nodes · 177 apps]
Cluster --> Ceph[(🪨 Ceph<br/>block · default durable)]
Cluster --> LH[(🐂 Longhorn<br/>+ recurring backups)]
Cluster --> Garage[(🧺 Garage<br/>S3-compatible)]
Cluster --> NFS[(🗄️ NFS<br/>beast / brain · bulk media)]
LH -->|weekly + monthly| NFS
Garage -->|rclone CronJobs| AWS[☁️ AWS S3<br/>Glacier Deep Archive<br/>offsite DR]
classDef store fill:#1e293b,stroke:#475569,color:#e2e8f0
class Ceph,LH,Garage,NFS,AWS store
Storage tiers are picked deliberately per workload — see storage-class.instructions.md for the decision tree.
| Layer | Tool | Role |
|---|---|---|
| OS | CentOS Stream 9 / 10 (+ Ubuntu 24.04 on Spark) | Node operating system |
| Runtime | cri-o + crun (containerd on Spark) | CRI + OCI runtime; Spark is the lone containerd node |
| Kubernetes | v1.35.4 | Control-plane and node version |
| GPU | NVIDIA GPU Operator + Container Toolkit | P40 on worker8 (Pascal, 24 GB); GB10 on Spark (Grace-Blackwell, 128 GB unified) |
| GitOps | Flux2 | Declarative cluster reconciliation |
| Automation | Renovate + GitHub Actions | Dependency PRs, link checks, self-hosted runners |
| CNI | Cilium (eBPF) | Networking, BGP peering, LoadBalancer pool |
| Ingress | Envoy Gateway | L7 gateway / HTTPRoute |
| Service mesh | Istio | mTLS + traffic mgmt for mcp-system |
| Admission | Kyverno | Namespace-delete blast-radius + audit-mode policies |
| DNS | external-dns | Cloudflare + bind9 split-horizon |
| TLS | cert-manager | Let's Encrypt + internal CA |
| Tunnel | cloudflared | Public ingress without exposing home WAN |
| AuthN/Z | Authelia + oauth2-proxy | OIDC SSO; 24 oauth2-proxy instances gate apps |
| Secrets | External Secrets Operator + 1Password | 111 ExternalSecrets, zero plain-text in Git |
| VPN | wg-easy | Operator OOB WireGuard access |
| Storage | Rook-Ceph, Longhorn, Garage, direct NFS | Tiered by durability requirement |
| Databases | CloudNative-PG, Dragonfly, Qdrant | 25 Postgres clusters, KV, vector |
| Observability | kube-prometheus-stack, Loki, Tempo, Grafana, HolmesGPT | Metrics, logs, traces, dashboards, AI alert triage |
| Telemetry | OpenTelemetry Collector + Vector | Trace pipeline (→ Tempo) + log shipping (→ Loki) |
| Images | ZOT | Pull-through registry / local cache |
| Role | Hostname | Device | CPU | RAM | OS | Storage / Accelerators | Notes |
|---|---|---|---|---|---|---|---|
| 🧠 | master1 | bare-metal | 4 | 32 GB | CentOS 10 | NVMe (Longhorn) | Intel iGPU · RTL-SDR · control plane |
| 🧠 | master2 | VM on beast | 3 | 12 GB | CentOS 9 | virtualized control plane | |
| 🧠 | master3 | VM on beast | 3 | 10 GB | CentOS 9 | virtualized control plane | |
| 💪 | worker2 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | ZWA-2 Z-Wave dongle |
| 💪 | worker3 | ThinkCentre M910x | 8 | 64 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | Sonoff Zigbee dongle |
| 💪 | worker4 | ThinkCentre M910x | 8 | 32 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | Coral USB TPU |
| 💪 | worker5 | VM on beast | 10 | 24 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | |
| 💪 | worker6 | VM on beast | 10 | 30 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | |
| 💪 | worker7 | VM on beast | 10 | 30 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | |
| 🎮 | worker8 | VM on beast | 10 | 55 GB | CentOS 9 | NVMe (Longhorn + Ceph OSD) | NVIDIA P40 (24 GB VRAM) |
| 🚀 | spark | NVIDIA DGX Spark | 20 | 128 GB | Ubuntu 24.04 | NVMe + 8 GPU slots | NVIDIA GB10 (Grace-Blackwell, 128 GB unified); arm64 · containerd outlier |
| Host | Role |
|---|---|
| beast | Dell R730xd · iDRAC 8 · RAID6 bulk storage · primary NFS · Longhorn backup target · Garage substrate · VM host |
| brain | Router/gateway · RAID6 mass_storage · NFS for downloads & TV · OOB SSH on :3231 |
| Network | CIDR | VLAN |
|---|---|---|
| Default | 192.168.0.0/16 | 0 |
| IoT | 10.10.20.0/24 | 20 |
| Guest | 10.10.30.0/24 | 30 |
| Security (cameras) | 10.10.40.0/24 | 40 |
| Kubernetes pod subnet (Cilium) | 10.42.0.0/16 | n/a |
| Kubernetes services subnet (Cilium) | 10.43.0.0/16 | n/a |
| Kubernetes LB pool (CiliumLoadBalancerIPPool) | 10.45.0.0/24 | n/a |
Worker nodes attach to iot and sec VLANs via Multus for direct camera and IoT-device reachability. Cilium peers BGP with the upstream router to advertise the LB pool; external ingress flows through Envoy Gateway behind cloudflared.
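As a quick sanity check on the address plan, the sketch below loads the CIDRs from the table into Python's standard `ipaddress` module, classifies an address, and confirms that no two ranges overlap. The dictionary labels and the function names are illustrative only; they do not come from the repo.

```python
import ipaddress
from typing import List, Optional, Tuple

# CIDRs copied from the table above; the keys are labels for this sketch only.
NETWORKS = {
    "default": ipaddress.ip_network("192.168.0.0/16"),
    "iot": ipaddress.ip_network("10.10.20.0/24"),
    "guest": ipaddress.ip_network("10.10.30.0/24"),
    "security": ipaddress.ip_network("10.10.40.0/24"),
    "k8s-pods": ipaddress.ip_network("10.42.0.0/16"),
    "k8s-services": ipaddress.ip_network("10.43.0.0/16"),
    "k8s-lb-pool": ipaddress.ip_network("10.45.0.0/24"),
}


def classify(address: str) -> Optional[str]:
    """Return the label of the network that contains `address`, or None."""
    ip = ipaddress.ip_address(address)
    for label, network in NETWORKS.items():
        if ip in network:
            return label
    return None


def overlapping_pairs() -> List[Tuple[str, str]]:
    """List every pair of networks whose address ranges overlap."""
    labels = list(NETWORKS)
    return [
        (a, b)
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
        if NETWORKS[a].overlaps(NETWORKS[b])
    ]
```

With the ranges as listed, an address such as 10.45.0.17 falls in the LB pool and the overlap check comes back empty, which is what BGP advertisement of the pool relies on.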
🏠 Home Automation — Home Assistant ecosystem, 400+ devices
| App | Purpose |
|---|---|
| Home Assistant | Primary orchestrator; 400+ Z-Wave / Zigbee / Matter / ESPHome devices |
| ESPHome | Build & deploy firmware for DIY sensors |
| EMQX | MQTT broker |
| Node-RED | Visual automation flows |
| Zigbee2MQTT | Zigbee bridge (Sonoff stick on worker3) |
| Z-Wave JS UI | Z-Wave bridge (ZWA-2 stick on worker2) |
| Matter Server | Matter protocol bridge |
| Frigate | NVR + ML camera analysis (7+ cameras, Frigate+ trained model) |
| NetBox | IPAM / DCIM |
| wyoming-services | Piper TTS + Whisper STT for voice |
| smtp-relay | Maddy → Mailgun outbound mail |
🎬 Media & Entertainment — Jellyfin, Immich, Music Assistant, RomM
| App | Purpose |
|---|---|
| Jellyfin | Primary media server (read-only metadata) |
| Immich + immich-pet-tagger + immichkiosk + immich-power-tools | Photo library with ML face/pet recognition, offsite-backed |
| Music Assistant + Gonic | Multi-room music control + Subsonic API |
| RomM | Retro game library (~10k ROMs) |
| Beets | Music library tagging |
| cutVideo / av1corrector / videodupfinder / medialyze | Custom video tooling |
| Theme Park | Consistent UI theming across apps |
| Batocera Webdashboard Pro | Retro-gaming console dashboard |
| kodi-playback-watcher | Bridge for Kodi playback state |
🤖 AI & ML — Local inference, agents, image generation (namespace ai/)
| App | Purpose |
|---|---|
| Ollama (P40) | Local LLM serving on the Pascal P40 (≤8b-class models, embeddings) |
| Ollama Spark | LLM serving on Spark/GB10 (qwen2.5:32b for the agent fleet + HolmesGPT + Open WebUI, bge-m3 embeddings) |
| ComfyUI | Image generation workflows |
| Khoj + khoj-oauth2-proxy | Personal AI assistant over notes + docs (Authelia-gated) |
| LangGraph Agents | Custom multi-agent runtime (rwlove/langgraph-agents, version pinned in helmrelease.yaml); Postgres-checkpointed with live task_queue + task_dlq substrate; MCP-gateway client. See AI architecture section below. |
| Langfuse | LLM observability — OTLP trace sink for the langgraph-agents fleet (CNPG-backed; ClickHouse/Valkey/MinIO bundled) |
| Paperless-AI | Auto-tagging for paperless-ngx |
| sync-receiver | Cross-host AI state sync endpoint |
| tei-spark | Text-embedding-inference reranker (unsuspended 2026-05-21) |
📊 Observability — Prom/Loki/Grafana with AI triage on top
| App | Purpose |
|---|---|
| kube-prometheus-stack | Prometheus + AlertManager + node-exporter |
| Loki | Log aggregation |
| Tempo | Distributed tracing backend (SingleBinary mode) |
| OpenTelemetry Collector | Trace ingestion pipeline (apps → OTel → Tempo) |
| Vector | Log shipping (sources → Loki) |
| Grafana | Dashboards + alerting UI |
| HolmesGPT | LLM-backed alert investigation |
| kube-state-metrics / kube-ops-view | Cluster state & visualization |
| Goldilocks | VPA-driven resource right-sizing recommendations |
| Kromgo | Prometheus → Glance dashboard bridge |
| Netdata | Per-node real-time metrics |
| network-ups-tools (NUT) | UPS monitoring & graceful shutdown |
| exporters | Custom Prometheus exporters |
🗄️ Data & Storage — Databases, object storage, vector search
| App | Purpose |
|---|---|
| CloudNative-PG | 25 Postgres clusters with WAL archiving to Garage |
| Dragonfly | Redis-compatible in-memory store |
| Qdrant | Vector DB for embeddings / RAG |
| pgAdmin | Postgres admin UI |
| Rook-Ceph | Distributed block storage (default durable tier) |
| Longhorn | Block storage with NFS-backed recurring backups |
| Garage | S3-compatible object storage (DB backups, app S3 workloads) |
🌐 Network, Auth & Platform — Ingress, SSO, GitOps machinery
| App | Purpose |
|---|---|
| Cilium | CNI, BGP, LoadBalancer pool |
| Envoy Gateway | Ingress / HTTPRoute (30 routes) |
| cert-manager | TLS certificate lifecycle |
| external-dns | Cloudflare + bind9 record sync |
| cloudflared | Public tunnel without exposed WAN |
| Authelia | OIDC identity provider |
| LLDAP | Lightweight LDAP directory backing Authelia |
| oauth2-proxy | 24 instances gating per-app SSO |
| wg-easy | Primary OOB WireGuard access |
| External Secrets Operator | 1Password-backed secret materialization |
| Flux2 | GitOps reconciler |
| Renovate | Image & Helm chart update PRs |
| Kuadrant | MCP server gateway (Authelia-gated JWT) |
| Kyverno | Admission controller — namespace-delete blast-radius + audit-mode policies |
| actions-runner-controller | Self-hosted GitHub Actions runners |
| ZOT | Pull-through registry cache |
🗂️ Documents & Collaboration — Personal knowledge stack + self-hosted tools
| App | Purpose |
|---|---|
| Paperless-ngx | Document scanning, OCR, tagging (CNPG-backed, offsite-backed) |
| Obsidian + obsidian-couchdb | Notes sync (CouchDB w/ Cloudflare rate-limiting) |
| Zulip | Self-hosted team chat (also wired into agent pipeline approvals) |
| Windmill | Workflow automation; 22 checked-in TypeScript flows under kubernetes/apps/home/windmill/workflows/ cover AlertManager → HolmesGPT, langgraph inbox/approval/digest/DLQ/cost-cap/awaiting-user/reviewer-weekly, weekly operator drift sweeps (storage / network / ml / observability), paperless RAG ingest+tombstone, Zulip triager webhook, the workaround upstream-watcher, and the errand-runner approval-flow smoke driver |
| ntfy | Self-hosted push notifications (operator approvals via Android tap actions) |
| BentoPDF | Self-hosted PDF toolkit |
| Kitchenowl | Shopping lists + recipe / meal management |
| Open WebUI | Self-hosted LLM frontend; routes chat to Ollama-Spark (default) / Ollama-P40, surfaces langgraph agents as selectable models, and pulls in HolmesGPT + the MCP gateway as tool servers. RAG via bge-m3 + bge-reranker-v2-m3 over Qdrant |
| SearXNG | Privacy-respecting metasearch engine |
| Glance | Personal dashboard / start page |
| Atuin | Encrypted shell-history sync across machines |
| IT-Tools | Self-hosted developer toolbox |
| MediKeep | Personal medical records |
| Nametag | Name tag / badge generator |
| Pump + Pump-cv | Custom personal apps (rwlove-built) |
🔌 MCP Servers — 17 Model Context Protocol servers behind an Authelia-gated gateway
| Server | Exposes |
|---|---|
| mcp-gateway | Aggregating gateway; Envoy SecurityPolicy validates Authelia-issued JWTs (daily-rotated key) |
| ha-mcp | Home Assistant entities + service calls |
| immich-mcp | Immich library search + asset metadata |
| kubectl-mcp | Cluster introspection + safe kubectl ops |
| grafana-mcp | Grafana dashboards + Loki/Prom queries |
| prometheus-mcp | Direct PromQL access |
| paperless-mcp | Paperless-ngx document search |
| netbox-mcp | NetBox IPAM / DCIM |
| github-mcp | GitHub repo + PR ops |
| omada-mcp | TP-Link Omada controller |
| searxng-mcp | Privacy search through SearXNG |
| arr-mcp | Library-search interface to media-pull apps |
| time-mcp | Time / timezone utilities (rwlove/time-mcp native-SHTTP build) |
| chrome-mcp | Playwright-driven Chromium browser automation for agents |
| memory-mcp | Cross-agent knowledge graph backed by Postgres + pgvector (bge-m3 1024-dim) |
| cilium-mcp | Read-only Cilium / Hubble introspection (kubectl-mcp-style, Cilium-scoped) |
| windmill-mcp | Aggregated Windmill workspace tools (script run, flow trigger) |
Local-first by default — chat, agents, retrieval, alert triage, doc ops — with explicit, separately-gated escape hatches to Claude API and Claude Code when a task genuinely needs cloud capacity.
📖 Full chapter: see docs/src/ai_architecture.md for per-app integration paths, RAG pipelines, escalation matrix, and the file:line references behind every claim here.
flowchart TB
subgraph Surfaces[Surfaces]
OWUI[Open WebUI<br/>chat + RAG]
Khoj[Khoj<br/>personal AI]
Voice[HA voice 'inbox …']
ZulipDM[Zulip DM<br/>Triager bot]
AM[AlertManager]
end
subgraph Bridges[Windmill bridges<br/>22 TS workflows]
WInbox[langgraph-inbox.ts]
WAlert[alertmanager-holmesgpt-notify.ts]
WApprove[langgraph-approval-post/receive.ts]
WPaperless[paperless-rag-ingest.ts]
WOther[…digest/DLQ/cost-cap/<br/>awaiting-user/workaround]
end
subgraph Agents[Agents]
Holmes[HolmesGPT<br/>✅ live · qwen2.5:32b]
LG[langgraph-agents<br/>🟡 plumbed, cold]
end
subgraph Inference[Inference]
OllamaP40[(ollama / P40<br/>qwen2.5:7b · embeddings)]
OllamaSpark[(ollama-spark / GB10<br/>qwen2.5:32b · bge-m3)]
Claude[(Claude API)]
end
subgraph Tools[Tools + retrieval]
Gw[MCP Gateway<br/>17 servers]
Q[(Qdrant<br/>vector DB)]
Mem[(memory-mcp<br/>pgvector KG)]
end
subgraph Outputs[Outputs + observability]
Zulip[Zulip threads]
Ntfy[ntfy push]
Vault[(langgraph-vault)]
LF[Langfuse traces]
end
Voice --> WInbox --> LG
ZulipDM --> WInbox
OWUI -->|chat| OllamaSpark
OWUI -->|agent-as-model| LG
OWUI --> Gw
OWUI --> Holmes
OWUI --> Q
Khoj --> OllamaP40
AM --> WAlert --> Holmes
Holmes --> OllamaSpark
LG --> OllamaSpark
LG --> OllamaP40
LG -.->|gated| Claude
LG --> Gw --> Mem
LG -.->|OTLP| LF
WPaperless --> Q
LG --> WApprove --> Zulip & Ntfy
LG --> Vault
Holmes --> Zulip
CR --> Zulip
Dashed lines mark cold paths: ENABLE_CLAUDE_API: false today on
langgraph-agents; OTLP exporter fires only once Langfuse keys land
in 1Password.
- Open WebUI (collab/): primary chat UI. Defaults to qwen2.5:32b on Ollama-Spark; users can switch to any langgraph agent via the OpenAI-compatible API. RAG runs over Qdrant with bge-m3 embeds + BGE reranker-v2-m3 in-process. Tool servers wired in: HolmesGPT + the MCP gateway.
- Khoj (ai/): parallel personal-AI surface for notes + docs. Self-contained: own embedding pipeline (default gte-small, optionally Ollama nomic-embed-text), chat via Ollama-P40. Does not consume the MCP gateway or langgraph-agents.
- HolmesGPT (observability/): live in production for alert triage. AlertManager firings reach it via Windmill's alertmanager-holmesgpt-notify.ts; it reasons over Prometheus + Loki + cluster state and posts a root-cause hypothesis to Zulip / ntfy. Open WebUI also surfaces it as a tool server. Prompt + context budget tuned for qwen2.5:32b on Spark (32K context, 6 tool-call budget per investigation).
- langgraph-agents (ai/): the FastAPI multi-agent runtime (rwlove/langgraph-agents, version pinned in helmrelease.yaml). Plumbed end-to-end (Postgres checkpoints + memory, live task-queue substrate in postgres-langgraph-checkpoints, vault PVCs, Windmill approval loop, cost caps in env). Trigger surface live: alertmanager → 6 namespace-mapped operators, daily 22:00 ET historian digest, weekly Saturday operator drift crons (ml / observability / network / reviewer / storage), errand-runner approval-flow smoke. ENABLE_CLAUDE_API: false, so Claude API escalation is still gated. Public ingress splits CLI traffic (hai.${SECRET_DOMAIN}, Bearer-only) from browser traffic (hai-web.${SECRET_DOMAIN}, Authelia).
- Windmill (home/): 22 checked-in TypeScript flows under kubernetes/apps/home/windmill/workflows/ are the bridges that knit the surfaces above together. Every alert webhook, Zulip-triggered DM, approval round-trip, daily digest, weekly vault-hygiene sweep, weekly operator drift sweep (storage / network / ml / observability), DLQ retry, cost-cap pause, Paperless RAG ingest, and the errand-runner approval-flow smoke driver is a .ts file there.
- Langfuse (ai/): OTLP trace sink for langgraph-agents. Chart deploys ClickHouse + Valkey + MinIO bundled; Postgres comes from CNPG postgres-langfuse.
- memory-mcp (mcp-system/): cross-agent knowledge graph on postgres-langgraph-memory with pgvector(1024). bge-m3 embeds via Ollama-Spark.
| Agent | Role | Status |
|---|---|---|
| HolmesGPT | AlertManager-driven root-cause investigation | ✅ live |
| triager | Classifies inbound items, assigns owner agent | ✅ live · default route for every untargeted /inbox |
| supervisor | Routes work to specialist agents; opens approvals | ✅ live · in-graph fallback |
| historian | Activity log curator + daily/weekly/monthly accomplishment digests | ✅ live · daily 22:00 ET cron |
| reporter | Universal final hop: composes user-facing DM from upstream agent output | ✅ live · in-graph terminus |
| reviewer | Vault hygiene: aging TODOs, drift findings, dead [[wiki-links]] | ✅ live · weekly Sat 06:00 ET cron |
| storage-operator | Ceph + Longhorn + Garage + CNPG + Barman + NFS planning | ✅ live · alertmanager + weekly Sun 07:00 ET cron |
| network-operator | Lovenet L1–L7 ops (Omada SDN, Cilium BGP, VLANs, DNS, certs) | ✅ live · alertmanager + weekly Sat 04:00 ET cron |
| observability-operator | Prometheus rules, AlertManager routing, Loki, Grafana, HolmesGPT prompt tuning | ✅ live · alertmanager + weekly Sat 03:00 ET cron |
| ml-operator | Frigate, Immich CLIP, model tuning, GPU placement | ✅ live · alertmanager + weekly Sat 02:00 ET cron |
| smart-home-operator | Home Assistant entities, automations, ESPHome configs | ✅ live · alertmanager + intent-drift cron |
| homelab-engineer | Cluster ops, HelmRelease drafting, PR-shaped output | ✅ live · alertmanager default-route |
| researcher | Web + repo + vault research | ✅ live · hourly renovate-triage cron |
| errand-runner | Class C+ MCP-write executor (the only agent that calls MCP write) | ✅ live · in-graph after approval · local-only |
| note-maker | Captures decisions + facts back into the vault | 🟡 reachable via /inbox (HA voice "inbox …"); no recurring trigger |
| coder | Code reading, drafting, PR descriptions | 🟡 reachable via /inbox; no recurring trigger |
| security | Surveillance + physical-security analyst (Frigate triage) | 🟡 cold · needs Frigate HTTP client wiring |
| auditor | CVE + vulnerability researcher (kubectl + OSV + GH Advisory) | 🟡 cold · needs OSV/GHSA client wiring |
| artist | Image generation via ComfyUI MCP | 🟡 cold · needs ComfyUI MCP allowlist populated |
| property-coordinator | 3532 Foxhall workstreams (contractors, deck, pool) | 🟡 cold · ad-hoc /inbox only |
| health-tracker | Personal health tracking | 🟡 cold · local-only |
| doc-writer (Scribner) | Sweeps repos for stale docs; drafts README + docs/ patches as diffs when commits land | 🟥 aspirational |
✅ live · 🟡 wired but not on a recurring trigger or blocked on tool wiring · 🟥 not built
Tool-binding gap (load-bearing caveat): All ✅-live agents above use with_structured_output() against the prompt content they receive. Only errand-runner actually calls MCP at runtime. Operator weekly drift crons produce LLM reasoning over the prompt — they do NOT dynamically query Prometheus / kubectl / Omada / etc. (the MCP allowlists exist, but the LLM call doesn't bind them as tools). Adding ReAct-style tool-binding to an agent is a deferred architectural step.
health-tracker and errand-runner are pinned local-only at the
routing layer — they never escalate to Claude API regardless of agent
uncertainty, because the data class isn't suitable for off-site
inference.
| Tier | Backend | When used |
|---|---|---|
| 1 | qwen2.5:7b on Ollama (P40) | Fast / simple agents (triager, note-maker drafts) |
| 2 | qwen2.5:32b on Ollama-Spark (GB10) | Default chat + agent inference + HolmesGPT |
| 3 | Claude API (langgraph escalation) | Explicit uncertainty markers, repeated local-retry failure, novel/long-context, or requires_cloud tag. Cost caps $5/task · $10/agent/day · $30/global/day enforced inside the cluster |
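The tier table, the cost caps, and the local-only pinning can be read together as one routing decision. The sketch below is a minimal illustration under stated assumptions, not the runtime's actual code: the `Task` and `Spend` types, the flag names, and the choice to fall back to the local tier when escalation is blocked are all hypothetical. Only the agent names, the escalation reasons, and the dollar caps come from the text above.

```python
from dataclasses import dataclass

# Taken from the text: agents pinned local-only and agents served by tier 1.
LOCAL_ONLY_AGENTS = {"health-tracker", "errand-runner"}
FAST_AGENTS = {"triager", "note-maker"}

# Cost caps from the tier table (USD).
TASK_CAP_USD = 5.0
AGENT_DAY_CAP_USD = 10.0
GLOBAL_DAY_CAP_USD = 30.0


@dataclass
class Task:
    agent: str
    uncertain: bool = False              # explicit uncertainty marker
    local_retries_failed: bool = False   # repeated local-retry failure
    novel_or_long_context: bool = False
    requires_cloud: bool = False         # requires_cloud tag


@dataclass
class Spend:
    task_usd: float = 0.0
    agent_today_usd: float = 0.0
    global_today_usd: float = 0.0


def within_caps(spend: Spend, estimate_usd: float) -> bool:
    """True if adding the estimated cost stays under all three caps."""
    return (
        spend.task_usd + estimate_usd <= TASK_CAP_USD
        and spend.agent_today_usd + estimate_usd <= AGENT_DAY_CAP_USD
        and spend.global_today_usd + estimate_usd <= GLOBAL_DAY_CAP_USD
    )


def pick_tier(task: Task, spend: Spend, estimate_usd: float,
              enable_claude_api: bool = False) -> int:
    """Return 1, 2, or 3 following the tier table above."""
    local_tier = 1 if task.agent in FAST_AGENTS else 2
    wants_cloud = (task.uncertain or task.local_retries_failed
                   or task.novel_or_long_context or task.requires_cloud)
    if not wants_cloud:
        return local_tier
    # Assumption: a blocked escalation falls back to the local tier.
    if task.agent in LOCAL_ONLY_AGENTS or not enable_claude_api:
        return local_tier
    if not within_caps(spend, estimate_usd):
        return local_tier
    return 3
```

With `enable_claude_api` left at its default of `False`, which mirrors ENABLE_CLAUDE_API: false today, every task resolves to tier 1 or tier 2 regardless of its flags.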
The most common way work enters the fleet — hold the phone's power button, say "inbox <whatever I'm thinking>", and the cluster takes it from there.
flowchart LR
Btn[📱 Hold power button<br/>Pixel: 'Hold for Assistant'] --> Assist[HA Companion app<br/>set as default assistant]
Assist -->|audio stream| HA[Home Assistant<br/>Assist pipeline]
HA --> Whisper[Whisper STT<br/>wyoming-services on P40]
Whisper --> Sentence[Sentence trigger:<br/>'inbox {content}']
Sentence --> Ollama[conversation.ollama_voice<br/>qwen3:8b]
Ollama --> Rest[HA rest_command<br/>POST + Authelia JWT]
Rest --> Hook[Windmill:<br/>langgraph-inbox.ts]
Hook --> LG[langgraph-agents /inbox]
LG --> Triage[triager classifies]
Triage -->|capture only| Note[note-maker]
Triage -->|plan + act| Spec[specialist agent<br/>drafts plan]
Spec -->|needs input| Zulip[💬 Zulip approval<br/>+ ntfy push]
Zulip -->|reply / tap| Receive[approval-receive]
Receive --> Spec
Spec --> Done[outcome to vault]
Note --> Inbox[/vault/inbox/YYYY-MM-DD-…md/]
Done --> Outputs[/vault/outputs/{drafts,finals}//]
Inbox --> Couch[(obsidian-couchdb)]
Outputs --> Couch
Couch -->|LiveSync| Phone[📱 Obsidian on phone<br/>same vault]
1. Hold power button. Pixel's "Hold for Assistant" gesture is bound to the HA Companion app as the default digital assistant. The Assist UI opens with the mic hot.
2. Speak. Audio streams to the cluster; there is no on-phone STT. The trigger phrase is inbox <body>; everything after inbox is the note.
3. STT in cluster. The Assist pipeline routes the audio to Whisper (wyoming-services, GPU-accelerated on the P40).
4. Intent + LLM. A sentence trigger matches inbox {content} and hands {content} to conversation.ollama_voice (qwen3:8b on Ollama, tool-calling enabled). The conversation agent's only job here is to confirm the intent and call the rest_command; it does not interpret the content.
5. Auth'd POST. An HA rest_command POSTs to https://langgraph-inbox.${SECRET_DOMAIN}/webhook with { source:"voice", user:"rob", content:"<transcript>" }. The request carries an Authelia client_credentials JWT issued to a dedicated ha-voice-inbox OIDC client, using the same daily-rotated signing-key machinery the MCP gateway already uses. Envoy's SecurityPolicy validates the JWT against Authelia's JWKS at the gateway.
6. Windmill langgraph-inbox.ts. Normalizes the payload and POSTs to /inbox on langgraph-agents.
7. Triager classifies. It decides whether the item is a research question, household errand, homelab change, property task, or note-to-self, and picks the specialist agent.
8. Capture path → note-maker writes the file to /vault/inbox/YYYY-MM-DD-HHMM-<slug>.md on the langgraph-vault-rw PVC. Single writer, no race with the phone.
9. Plan-and-act path → specialist drafts a plan into Postgres + a draft under /vault/outputs/drafts/. HITL approval via the existing Zulip + Pushover loop when needed (see triggers above).
10. Round-trip to the phone. obsidian-couchdb watches the vault PVC and replicates new files through Self-hosted LiveSync. The note from step 8, plus any drafts/finals from step 9, appear in the Obsidian app on the phone within a sync cycle. Same surface the dictation started on.
The loop closes locally and on one surface: power-button → speak → outcome appears in the vault. Whisper, Ollama, Windmill, and the agents all run in the cluster; the only off-site dependency is claude.com if the local fleet escalates a task.
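To make the data shapes in the voice path concrete, here is a small sketch of three pieces of it: stripping the "inbox" trigger word, building the webhook payload, and deriving the capture filename. The payload keys and the /vault/inbox/YYYY-MM-DD-HHMM-<slug>.md pattern come from the steps above. The function names, the regular expression, and the slug rules (lowercase, first six words, hyphen-joined) are assumptions for illustration, not the behavior of the HA sentence trigger or of note-maker.

```python
import re
from datetime import datetime
from typing import Optional


def extract_inbox_content(transcript: str) -> Optional[str]:
    """Return the text after a leading 'inbox' trigger word, or None."""
    match = re.match(r"\s*inbox\b[\s,:]*(.+)", transcript,
                     flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return None
    content = match.group(1).strip()
    return content or None


def build_payload(content: str, user: str = "rob") -> dict:
    """Payload shape from the rest_command step: source, user, content."""
    return {"source": "voice", "user": user, "content": content}


def slugify(text: str, max_words: int = 6) -> str:
    """Hypothetical slug rule: lowercase alphanumeric words joined by '-'."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words[:max_words]) or "note"


def inbox_path(content: str, now: datetime) -> str:
    """Build a path matching /vault/inbox/YYYY-MM-DD-HHMM-<slug>.md."""
    return f"/vault/inbox/{now:%Y-%m-%d-%H%M}-{slugify(content)}.md"
```

For example, a transcript of "Inbox, buy deck screws" captured at 09:05 on 2026-05-21 would yield the content "buy deck screws" and, under the slug rule assumed here, the path /vault/inbox/2026-05-21-0905-buy-deck-screws.md.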
HolmesGPT is the one agent already running in production:
- AlertManager → Windmill alertmanager-holmesgpt-notify.ts → HolmesGPT on every firing alert
- HolmesGPT queries Prometheus, Loki, and the cluster directly to build a root-cause hypothesis
- Result posted back as a Pushover message + Zulip thread; the Windmill workflow sanitizes raw tool-call descriptors out of the agent text before delivery
- HolmesGPT: live, handling cluster alerts daily on Ollama-Spark / qwen2.5:32b.
- LangGraph fleet: 21 specialist agents plumbed end-to-end but cold (ENABLE_CLAUDE_API: false, no production triggers). Public ingress split into CLI (hai.${SECRET_DOMAIN}, Bearer) and browser (hai-web.${SECRET_DOMAIN}, Authelia). Gated on the Claude API key + a cluster-confidence sign-off; the Spark migration that was the prior gate completed 2026-05-20.
- claude-runner: retired 2026-05-23. Superseded by the langgraph fleet; the two CronJobs (PR triage + cost-cap commentary) graduated into agent workflows inside langgraph-agents.
- KubeClaw: retired (memo project_open_issues_cleanup_2026_05_20).
| Service | Use | Cost |
|---|---|---|
| 1Password | Secret backend for External Secrets | ~$65 / yr |
| Cloudflare | Domain, DNS, tunnel, WAF rate-limiting | Free |
| GitHub | Repo hosting + CI | Free |
| Mailgun | Outbound mail relay (via Maddy) | Free (Flex) |
| Pushover | Push notifications for AlertManager + apps | $10 one-time |
| Frigate+ | Trained ML model for Frigate NVR | $50 / yr |
| AWS S3 Glacier Deep Archive | Offsite DR for Immich + Paperless (objects + DB backups) | ~$1–5 / mo (varies) |
| Total | | ~$10–15 / mo |
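The ~$10–15 / mo figure is consistent with the recurring line items in the table. The short calculation below amortizes the two annual subscriptions to a monthly rate and adds the stated S3 range. It assumes the free tiers contribute nothing and leaves out the one-time Pushover purchase; the variable and function names are illustrative.

```python
# Annual subscriptions from the table (USD per year).
ANNUAL_USD = {"1Password": 65.0, "Frigate+": 50.0}

# AWS S3 Glacier Deep Archive range from the table (USD per month).
S3_MONTHLY_RANGE_USD = (1.0, 5.0)


def monthly_range() -> tuple:
    """Return (low, high) recurring monthly cost in USD, rounded to cents."""
    fixed = sum(ANNUAL_USD.values()) / 12  # amortize annual plans monthly
    low, high = S3_MONTHLY_RANGE_USD
    return round(fixed + low, 2), round(fixed + high, 2)
```

The result is roughly $10.58 to $14.58 per month, which rounds to the range shown.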
Four tiers, picked by what the data has to survive — node loss, Ceph loss, cluster loss, or full site loss. Databases get ceph-block + Barman→Garage; irreplaceable state goes to Longhorn with NFS-shipped weekly + monthly backups; S3-shaped workloads use Garage; bulk media rides direct NFS. Full decision tree: .agents/instructions/storage-class.instructions.md.
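The four-way split described above can be written down as a lookup. This is only a sketch of the summary paragraph: the authoritative decision tree lives in storage-class.instructions.md and may have more branches. The `Workload` categories, the `StorageChoice` type, and the function name are hypothetical, and the backup column is filled in only where the paragraph states one.

```python
from dataclasses import dataclass
from enum import Enum


class Workload(Enum):
    DATABASE = "database"
    IRREPLACEABLE_STATE = "irreplaceable-state"
    S3_OBJECTS = "s3-objects"
    BULK_MEDIA = "bulk-media"


@dataclass(frozen=True)
class StorageChoice:
    backend: str
    backup: str


# One entry per tier named in the paragraph above.
CHOICES = {
    Workload.DATABASE: StorageChoice("ceph-block", "Barman to Garage"),
    Workload.IRREPLACEABLE_STATE: StorageChoice(
        "Longhorn", "weekly + monthly backups shipped to NFS"),
    Workload.S3_OBJECTS: StorageChoice("Garage", "not stated in this paragraph"),
    Workload.BULK_MEDIA: StorageChoice("direct NFS", "not stated in this paragraph"),
}


def pick_storage(workload: Workload) -> StorageChoice:
    """Map a workload category to the tier the paragraph assigns it."""
    return CHOICES[workload]
```

Keeping the mapping as data rather than nested conditionals makes it easy to compare against the instructions file when the conventions change.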
All 111 ExternalSecrets resolve through External Secrets Operator from 1Password. Application credentials are templated into ExternalSecret resources and never live in YAML. Cross-namespace mirrors use the reflector pattern when consumer charts hard-code secret names.
Authelia (with LLDAP) is the OIDC identity provider; per-app oauth2-proxy instances enforce auth at Envoy Gateway. 24 apps sit behind SSO today. The mcp-gateway validates Authelia-issued JWTs with a daily-rotated signing key for MCP tooling.
kube-prometheus-stack scrapes everything; Loki ingests pod logs (via Vector); Tempo ingests traces (via OpenTelemetry Collector); Grafana stitches the dashboards. AlertManager fans alerts to ntfy and to HolmesGPT, which runs LLM-driven root-cause investigation against the cluster and posts findings back via Windmill.
Two GPUs split the workload:
- NVIDIA P40 on worker8 (Pascal, 24 GB VRAM) — Ollama for ≤8b-class models + embeddings, ComfyUI, Whisper STT, Immich CLIP face/pet recognition, and the immich-pet-tagger fork pinned to a P40-compatible PyTorch build.
- NVIDIA GB10 on Spark (Grace-Blackwell, 128 GB unified) — the larger Ollama deployment serving qwen2.5:32b for the LangGraph agent fleet, HolmesGPT, and Open WebUI, plus bge-m3 embeddings for the cross-agent knowledge graph and Paperless RAG.
Driver lifecycle is handled by the NVIDIA GPU Operator. Spark is the lone containerd node in an otherwise CRI-O cluster; a NodeFeatureRule auto-skips the GPU container-toolkit DaemonSet on CRI-O nodes.
Per-app rclone CronJobs ship Immich originals and Paperless documents — plus their Garage-stored Postgres backups — to encrypted AWS S3 with a 1-day Glacier Deep Archive transition. Recovery procedure is documented at Offsite recovery and was last validated 2026-05-05.
Every change reaches the cluster through Git. Flux suspends are a deliberate manual signal — paused Kustomizations are not "broken," they're intentional pauses for in-flight maintenance and are documented in conventions, not reverted on sight.
The full operator handbook lives at https://rwlove.github.io/home-ops/.
Frequently referenced pages:
- AI architecture
- Cluster rebuild
- Initialization & teardown
- Cluster upgrade
- Power outage recovery
- Limits & requests philosophy
- Debugging playbook
- Offsite recovery
- Immich restore to new CNPG database
- NVIDIA P40 GPU setup
- master1 etcd disk swap
- GitHub webhook
Repo-local conventions (auto-loaded by AI agents from .agents/instructions/):
- Storage class selection · HelmRelease security defaults · ConfigMap layout · Sorting rules · Schema correction · Persona
Inspired by the k8s-at-home community. @whazor maintains the excellent k8s-at-home search — a great way to discover how others configure the same Helm releases.
