Smart K8s Cluster Manager

An in-cluster operator that continuously monitors Kubernetes workloads and infrastructure, detects failing or unhealthy resources with deterministic rules, and uses a self-hosted LLM to explain the root cause and suggest remediation. No cluster data leaves your environment. Findings and diagnoses are surfaced through a web dashboard and a CLI.

Phase 1 is strictly read-only: it never mutates cluster resources.

 kube-apiserver ──▶ COLLECT ──▶ DETECT ──▶ DIAGNOSE ──▶ STORE ──▶ API ─┬─▶ dashboard
   (watch)        controllers   rules      LLM (local)   sqlite        └─▶ CLI

How it works

Collect — controller-runtime reconcilers watch Pods, Deployments, StatefulSets/DaemonSets, Jobs, Nodes, PVCs, and ResourceQuotas.
Detect — pure-function rules flag known failure modes (CrashLoopBackOff, OOMKilled, ImagePullBackOff, config errors, failing readiness probes, stuck Pending/Unschedulable, stalled rollouts, failed Jobs, NotReady/pressured nodes, unbound PVCs, quotas at limit). The LLM is never used for detection.
Diagnose — for each new finding, an async worker gathers evidence (rule evidence + recent events + a tail of container logs), redacts secrets, and asks the self-hosted model for a structured JSON diagnosis. If the model is unavailable, it falls back to the rule's static guidance.
Store / surface — findings, diagnoses, and (never-executed) suggested action plans are persisted to SQLite (Postgres optional) and served over a JSON API + embedded dashboard, and via the smartk8s CLI.
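
To make the "pure-function rules" idea from the Detect step concrete, here is a minimal sketch of what one rule could look like. It is not the operator's actual code: the types PodSnapshot, ContainerStatus, and Finding, their fields, and the function DetectCrashLoop are all hypothetical stand-ins (the real rules presumably work on Kubernetes API types), chosen so the example runs with the standard library only.

```go
package main

import "fmt"

// ContainerStatus is a hypothetical, trimmed-down stand-in for the container
// status the operator would read from the Kubernetes API.
type ContainerStatus struct {
	Name          string
	WaitingReason string // e.g. "CrashLoopBackOff" or "ImagePullBackOff"
	RestartCount  int
}

// PodSnapshot is a hypothetical read-only view of a Pod handed to the rules.
type PodSnapshot struct {
	Namespace  string
	Name       string
	Containers []ContainerStatus
}

// Finding is a hypothetical result type; the field names are illustrative.
type Finding struct {
	Rule     string
	Severity string
	Resource string
	Evidence string
	Guidance string // static text usable when no LLM diagnosis is available
}

// DetectCrashLoop is a pure function: it performs no I/O and returns the same
// findings for the same snapshot, which keeps detection deterministic.
func DetectCrashLoop(p PodSnapshot) []Finding {
	var findings []Finding
	for _, c := range p.Containers {
		if c.WaitingReason != "CrashLoopBackOff" {
			continue
		}
		findings = append(findings, Finding{
			Rule:     "CrashLoopBackOff",
			Severity: "critical",
			Resource: p.Namespace + "/" + p.Name,
			Evidence: fmt.Sprintf("container %q is waiting in CrashLoopBackOff after %d restarts",
				c.Name, c.RestartCount),
			Guidance: "Inspect the previous container logs and the container's command, config, and probes.",
		})
	}
	return findings
}

func main() {
	pod := PodSnapshot{
		Namespace: "default",
		Name:      "api",
		Containers: []ContainerStatus{
			{Name: "app", WaitingReason: "CrashLoopBackOff", RestartCount: 7},
			{Name: "sidecar"},
		},
	}
	for _, f := range DetectCrashLoop(pod) {
		fmt.Printf("[%s] %s %s: %s\n", f.Severity, f.Rule, f.Resource, f.Evidence)
	}
}
```

Because a rule of this shape takes a snapshot and returns values, it can be unit-tested with table-driven tests and no cluster, which fits the statement that the LLM is never used for detection.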

Build & test

make build      # builds bin/operator and bin/smartk8s
make test       # runs the unit/integration test suite
make vet

Run locally (against your kubeconfig context)

The operator runs out-of-cluster too — handy for development. Point it at any OpenAI-compatible local model (Ollama, vLLM, LocalAI, llama.cpp):

# detect-only (no LLM):
make run-operator

# with diagnosis against a local Ollama:
SMARTK8S_LLM_BASE_URL=http://localhost:11434/v1 \
SMARTK8S_LLM_MODEL=llama3.1:8b \
DIAGNOSE=true make run-operator

Then browse the dashboard at http://localhost:8080 or use the CLI:

bin/smartk8s status
bin/smartk8s findings list
bin/smartk8s findings list -n default --severity critical
bin/smartk8s findings get <id>
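
For readers who want to check the model endpoint independently of the operator, the following is a rough sketch of a diagnosis call against an OpenAI-compatible chat completions endpoint, with a fallback to static guidance when the model is unreachable. The request and response shapes follow the OpenAI chat completions format; everything else is an assumption for illustration: the Diagnosis schema (root_cause, remediation), the prompt text, the 60-second timeout, and the function names are hypothetical and not taken from the operator's source.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// Diagnosis is a hypothetical schema for the structured answer.
type Diagnosis struct {
	RootCause   string `json:"root_cause"`
	Remediation string `json:"remediation"`
	Source      string `json:"-"` // "llm" or "fallback"
}

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

// buildChatRequest encodes an OpenAI-style chat completion request body.
func buildChatRequest(model, evidence string) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model: model,
		Messages: []chatMessage{
			{Role: "system", Content: `Reply with JSON only: {"root_cause": "...", "remediation": "..."}`},
			{Role: "user", Content: evidence},
		},
	})
}

// parseDiagnosis decodes the first choice's content as a Diagnosis.
func parseDiagnosis(body []byte) (Diagnosis, error) {
	var resp chatResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return Diagnosis{}, err
	}
	if len(resp.Choices) == 0 {
		return Diagnosis{}, fmt.Errorf("response contained no choices")
	}
	var d Diagnosis
	if err := json.Unmarshal([]byte(resp.Choices[0].Message.Content), &d); err != nil {
		return Diagnosis{}, fmt.Errorf("model did not return valid JSON: %w", err)
	}
	d.Source = "llm"
	return d, nil
}

// diagnose calls the model and returns the static fallback on any failure.
func diagnose(baseURL, model, evidence, fallback string) Diagnosis {
	fb := Diagnosis{Remediation: fallback, Source: "fallback"}
	payload, err := buildChatRequest(model, evidence)
	if err != nil {
		return fb
	}
	client := &http.Client{Timeout: 60 * time.Second}
	url := strings.TrimRight(baseURL, "/") + "/chat/completions"
	resp, err := client.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		return fb
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fb
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return fb
	}
	d, err := parseDiagnosis(body)
	if err != nil {
		return fb
	}
	return d
}

func main() {
	d := diagnose(os.Getenv("SMARTK8S_LLM_BASE_URL"), os.Getenv("SMARTK8S_LLM_MODEL"),
		"pod default/api: container app in CrashLoopBackOff, 7 restarts",
		"Inspect the previous container logs.")
	fmt.Printf("source=%s root_cause=%q remediation=%q\n", d.Source, d.RootCause, d.Remediation)
}
```

Run it with the same SMARTK8S_LLM_BASE_URL and SMARTK8S_LLM_MODEL values used above. With the variables unset or the model stopped, the sketch prints the fallback guidance instead of failing.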

Deploy in-cluster

Customers / production — Helm chart (registry or air-gapped bundle):

helm install smartk8s oci://REGISTRY/charts/smartk8s --version X.Y.Z \
  -n smartk8s --create-namespace

See docs/install.md for the full guide (configuration reference, enabling LLM diagnosis, air-gapped installs, upgrades) and docs/release.md for how releases are cut.

Development — kustomize fast loop:

make docker-build              # build the image
make kind-load                 # (optional) load it into a kind cluster
make deploy                    # kubectl apply -k deploy/

Edit deploy/operator.yaml's ConfigMap to point SMARTK8S_LLM_BASE_URL / SMARTK8S_LLM_MODEL at your model. RBAC is read-only (get/list/watch only).

End-to-end verification

Start a cluster: kind create cluster.
Start a local model and set SMARTK8S_LLM_BASE_URL / SMARTK8S_LLM_MODEL.
Run the operator (make run-operator with DIAGNOSE=true, or deploy it).

Inject failures:

kubectl apply -f test/e2e/broken-workloads.yaml

Confirm: bin/smartk8s findings list shows findings for CrashLoopBackOff, OOMKilled, ImagePullBackOff, the config error, the unschedulable pod, the pending PVC, and the quota at its limit. bin/smartk8s findings get <id> (and the dashboard) shows an LLM root cause and remediation; stopping the model still yields rule-based fallback guidance.
Tear down: kubectl delete ns smartk8s-e2e.

Or run the whole loop automatically (requires docker, kind, kubectl, jq):

make e2e         # throwaway kind cluster, deploy via kustomize, assert findings
make e2e-helm    # same, but installs the Helm chart

Local development on minikube (macOS)

A concrete in-cluster walkthrough on minikube. The only differences from kind are how the image gets into the cluster and how the operator reaches your local model.

# 1. Start a cluster (vfkit driver — no Docker Desktop needed on macOS)
brew install minikube kubectl
minikube start --driver=vfkit --cpus=4 --memory=8g
kubectl get nodes                       # context "minikube" is set automatically

# 2. Start a local model, bound to all interfaces so the VM can reach it
ollama pull llama3.1:8b
OLLAMA_HOST=0.0.0.0 ollama serve &

# 3. Build the operator image directly into the cluster
minikube image build -t smartk8s:latest .
# (or, with a local Docker daemon:
#   make docker-build && minikube image load smartk8s:latest)

# 4. Point the operator at Ollama, then deploy.
#    In deploy/operator.yaml's ConfigMap, set:
#      SMARTK8S_LLM_BASE_URL: "http://host.minikube.internal:11434/v1"
kubectl apply -k deploy/
kubectl -n smartk8s rollout status deploy/smartk8s

# 5. Inject failures and inspect (see "End-to-end verification" for assertions)
kubectl apply -f test/e2e/broken-workloads.yaml
kubectl -n smartk8s port-forward svc/smartk8s 8080:8080 &
bin/smartk8s findings list
open http://localhost:8080

Notes:

The SQLite PVC binds out of the box — minikube ships a default StorageClass.
The pending-PVC fixture stays Pending by design (it references a nonexistent StorageClass).
make e2e is kind-specific (it shells out to kind); on minikube, run the steps above manually.

Cleanup: kubectl delete ns smartk8s-e2e && minikube delete.

Configuration

Key flags: --llm-base-url, --llm-model, --db-dsn, --namespaces, --api-addr, --diagnose-enabled, --log-tail-lines. The core settings also have SMARTK8S_* env equivalents (see internal/config); the remaining tunables are flag-only.
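
As an illustration of how a flag can take its default from a SMARTK8S_* environment variable, here is a small sketch using the standard library flag package. The flag names and the two environment variable names come from this README; the Config struct, the parseConfig function, the precedence order (flag over env over empty default), and the validation rule are assumptions, not a description of internal/config.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Config is a hypothetical subset of the operator's settings.
type Config struct {
	LLMBaseURL      string
	LLMModel        string
	DiagnoseEnabled bool
}

// parseConfig resolves each setting with the assumed precedence
// flag > environment variable > empty default. getenv is injected so the
// function can be tested without touching the process environment.
func parseConfig(args []string, getenv func(string) string) (Config, error) {
	var cfg Config
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.StringVar(&cfg.LLMBaseURL, "llm-base-url", getenv("SMARTK8S_LLM_BASE_URL"),
		"base URL of the OpenAI-compatible endpoint")
	fs.StringVar(&cfg.LLMModel, "llm-model", getenv("SMARTK8S_LLM_MODEL"),
		"model name to request")
	fs.BoolVar(&cfg.DiagnoseEnabled, "diagnose-enabled", false,
		"enable LLM diagnosis")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	// Assumed validation: diagnosis needs an endpoint to talk to.
	if cfg.DiagnoseEnabled && cfg.LLMBaseURL == "" {
		return Config{}, fmt.Errorf("--diagnose-enabled requires --llm-base-url or SMARTK8S_LLM_BASE_URL")
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Using the environment value as the flag default means an explicit flag always wins, while a ConfigMap-provided variable still applies when no flag is passed.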

Roadmap (Phase 2)

The internal/action package defines the remediation interfaces and approval model today; the only shipped Executor is NoopExecutor, which refuses to act. Phase 2 plugs in executors behind approval gates to perform corrective actions — without reshaping the core.
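
The shape described above (an executor interface whose only shipped implementation refuses to act) might look roughly like the sketch below. NoopExecutor and Executor are names from this README; the method signature, the Action fields, the error values, and the Run approval gate are hypothetical and only illustrate the pattern, not the actual internal/action API.

```go
package main

import (
	"errors"
	"fmt"
)

// Action is a hypothetical description of a suggested remediation step.
type Action struct {
	Kind   string // e.g. "RestartDeployment"
	Target string // e.g. "default/api"
}

// Executor is a hypothetical interface for anything that can carry out an Action.
type Executor interface {
	Execute(a Action) error
}

var (
	// ErrExecutionDisabled is returned by NoopExecutor for every action.
	ErrExecutionDisabled = errors.New("action execution is disabled in Phase 1")
	// ErrNotApproved is returned when the approval gate has not been passed.
	ErrNotApproved = errors.New("action has not been approved")
)

// NoopExecutor refuses to act, so suggested plans are stored but never run.
type NoopExecutor struct{}

// Execute always returns ErrExecutionDisabled and touches nothing.
func (NoopExecutor) Execute(Action) error {
	return ErrExecutionDisabled
}

// Run applies a minimal approval gate before delegating to the executor.
func Run(e Executor, a Action, approved bool) error {
	if !approved {
		return ErrNotApproved
	}
	return e.Execute(a)
}

func main() {
	a := Action{Kind: "RestartDeployment", Target: "default/api"}
	fmt.Println(Run(NoopExecutor{}, a, false)) // gate blocks the action
	fmt.Println(Run(NoopExecutor{}, a, true))  // executor still refuses
}
```

With this layout, a later phase can supply a different Executor implementation without changing the callers, which is the sense in which the core would not need reshaping.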
