An in-cluster operator that continuously monitors Kubernetes workloads and infrastructure, detects failing/unhealthy resources with deterministic rules, and uses a self-hosted LLM to explain the root cause and suggest remediation — without sending any cluster data outside your environment. Findings and diagnoses are surfaced through a web dashboard and a CLI.
Phase 1 is strictly read-only: it never mutates cluster resources.
kube-apiserver ──▶ COLLECT ──▶ DETECT ──▶ DIAGNOSE ──▶ STORE ──▶ API ─┬─▶ dashboard
   (watch)       controllers   rules     LLM (local)   sqlite         └─▶ CLI
- Collect — controller-runtime reconcilers watch Pods, Deployments, StatefulSets/DaemonSets, Jobs, Nodes, PVCs, and ResourceQuotas.
- Detect — pure-function rules flag known failure modes (CrashLoopBackOff, OOMKilled, ImagePullBackOff, config errors, failing readiness probes, stuck Pending/Unschedulable, stalled rollouts, failed Jobs, NotReady/pressured nodes, unbound PVCs, quotas at limit). The LLM is never used for detection.
- Diagnose — for each new finding, an async worker gathers evidence (rule evidence + recent events + a tail of container logs), redacts secrets, and asks the self-hosted model for a structured JSON diagnosis. If the model is unavailable, it falls back to the rule's static guidance.
- Store / surface — findings, diagnoses, and (never-executed) suggested
action plans are persisted to SQLite (Postgres optional) and served over a
JSON API + embedded dashboard, and via the smartk8s CLI.
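The Detect stage above can be illustrated with a minimal sketch of a pure-function rule. The types `PodSnapshot`, `ContainerState`, and `Finding`, and the function `DetectCrashLoop`, are hypothetical stand-ins invented for this example; the project's real rules presumably operate on Kubernetes API objects and may be shaped quite differently.

```go
package main

import "fmt"

// ContainerState is a simplified, hypothetical stand-in for the container
// status fields a rule would read from a Pod.
type ContainerState struct {
	Name          string
	WaitingReason string // e.g. "CrashLoopBackOff", "ImagePullBackOff"
	RestartCount  int
}

// PodSnapshot is a hypothetical read-only view of a Pod.
type PodSnapshot struct {
	Namespace  string
	Name       string
	Containers []ContainerState
}

// Finding is a hypothetical result type emitted by a rule.
type Finding struct {
	Rule     string
	Severity string
	Resource string
	Evidence string
}

// DetectCrashLoop is a pure function: it reads only its argument, performs
// no I/O, and returns the same findings for the same snapshot.
func DetectCrashLoop(p PodSnapshot) []Finding {
	var out []Finding
	for _, c := range p.Containers {
		if c.WaitingReason != "CrashLoopBackOff" {
			continue
		}
		out = append(out, Finding{
			Rule:     "CrashLoopBackOff",
			Severity: "critical",
			Resource: p.Namespace + "/" + p.Name,
			Evidence: fmt.Sprintf("container %q restarted %d times", c.Name, c.RestartCount),
		})
	}
	return out
}

func main() {
	pod := PodSnapshot{Namespace: "default", Name: "api-0", Containers: []ContainerState{
		{Name: "app", WaitingReason: "CrashLoopBackOff", RestartCount: 7},
		{Name: "sidecar"}, // healthy container: produces no finding
	}}
	for _, f := range DetectCrashLoop(pod) {
		fmt.Printf("%s %s %s: %s\n", f.Severity, f.Rule, f.Resource, f.Evidence)
	}
}
```

Because a rule written this way has no side effects, it can be unit-tested with plain structs and needs neither a cluster nor a model.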
make build # builds bin/operator and bin/smartk8s
make test # runs the unit/integration test suite
make vet

The operator also runs out-of-cluster, which is handy for development. Point it at any OpenAI-compatible local model (Ollama, vLLM, LocalAI, llama.cpp):
# detect-only (no LLM):
make run-operator
# with diagnosis against a local Ollama:
SMARTK8S_LLM_BASE_URL=http://localhost:11434/v1 \
SMARTK8S_LLM_MODEL=llama3.1:8b \
DIAGNOSE=true make run-operator

Then browse the dashboard at http://localhost:8080 or use the CLI:
bin/smartk8s status
bin/smartk8s findings list
bin/smartk8s findings list -n default --severity critical
bin/smartk8s findings get <id>

Customers / production — Helm chart (registry or air-gapped bundle):
helm install smartk8s oci://REGISTRY/charts/smartk8s --version X.Y.Z \
-n smartk8s --create-namespace

See docs/install.md for the full guide (configuration reference, enabling LLM diagnosis, air-gapped installs, upgrades) and docs/release.md for how releases are cut.
Development — kustomize fast loop:
make docker-build # build the image
make kind-load # (optional) load it into a kind cluster
make deploy # kubectl apply -k deploy/

Edit deploy/operator.yaml's ConfigMap to point SMARTK8S_LLM_BASE_URL /
SMARTK8S_LLM_MODEL at your model. RBAC is read-only (get/list/watch only).
- Start a cluster: kind create cluster.
- Start a local model and set SMARTK8S_LLM_BASE_URL / SMARTK8S_LLM_MODEL.
- Run the operator (make run-operator with DIAGNOSE=true, or deploy it).
- Inject failures: kubectl apply -f test/e2e/broken-workloads.yaml
- Confirm:
  - bin/smartk8s findings list shows CrashLoopBackOff, OOMKilled, ImagePullBackOff, the config error, the unschedulable pod, the pending PVC, and the quota-at-limit.
  - bin/smartk8s findings get <id> (and the dashboard) show an LLM root cause + remediation; stopping the model still yields rule-based fallback guidance.
- Tear down: kubectl delete ns smartk8s-e2e.
To run the whole loop automatically instead (requires docker, kind, kubectl, and jq):
make e2e # throwaway kind cluster, deploy via kustomize, assert findings
make e2e-helm # same, but installs the Helm chart

The following is a concrete in-cluster walkthrough on minikube. The only differences from kind are how the image gets into the cluster and how the operator reaches your local model.
# 1. Start a cluster (vfkit driver — no Docker Desktop needed on macOS)
brew install minikube kubectl
minikube start --driver=vfkit --cpus=4 --memory=8g
kubectl get nodes # context "minikube" is set automatically
# 2. Start a local model, bound to all interfaces so the VM can reach it
ollama pull llama3.1:8b
OLLAMA_HOST=0.0.0.0 ollama serve &
# 3. Build the operator image directly into the cluster
minikube image build -t smartk8s:latest .
# (or, with a local Docker daemon:
# make docker-build && minikube image load smartk8s:latest)
# 4. Point the operator at Ollama, then deploy.
# In deploy/operator.yaml's ConfigMap, set:
# SMARTK8S_LLM_BASE_URL: "http://host.minikube.internal:11434/v1"
kubectl apply -k deploy/
kubectl -n smartk8s rollout status deploy/smartk8s
# 5. Inject failures and inspect (see "End-to-end verification" for assertions)
kubectl apply -f test/e2e/broken-workloads.yaml
kubectl -n smartk8s port-forward svc/smartk8s 8080:8080 &
bin/smartk8s findings list
open http://localhost:8080

Notes:
- The SQLite PVC binds out of the box — minikube ships a default StorageClass.
- The pending-PVC fixture stays Pending by design (it references a nonexistent StorageClass).
- make e2e is kind-specific (it shells out to kind); on minikube, run the steps above manually.
Cleanup: kubectl delete ns smartk8s-e2e && minikube delete.
Key flags: --llm-base-url, --llm-model, --db-dsn, --namespaces,
--api-addr, --diagnose-enabled, --log-tail-lines. The core settings also
have SMARTK8S_* env equivalents (see internal/config); the remaining
tunables are flag-only.
The internal/action package defines the remediation interfaces and approval
model today; the only shipped Executor is NoopExecutor, which refuses to
act. Phase 2 plugs in executors behind approval gates to perform corrective
actions — without reshaping the core.