ankathur/smart_k8s

Smart K8s Cluster Manager

An in-cluster operator that continuously monitors Kubernetes workloads and infrastructure, detects failing/unhealthy resources with deterministic rules, and uses a self-hosted LLM to explain the root cause and suggest remediation — without sending any cluster data outside your environment. Findings and diagnoses are surfaced through a web dashboard and a CLI.

Phase 1 is strictly read-only: it never mutates cluster resources.

 kube-apiserver ──▶ COLLECT ──▶ DETECT ──▶ DIAGNOSE ──▶ STORE ──▶ API ─┬─▶ dashboard
   (watch)        controllers   rules      LLM (local)   sqlite        └─▶ CLI

How it works

  • Collect — controller-runtime reconcilers watch Pods, Deployments, StatefulSets/DaemonSets, Jobs, Nodes, PVCs, and ResourceQuotas.
  • Detect — pure-function rules flag known failure modes (CrashLoopBackOff, OOMKilled, ImagePullBackOff, config errors, failing readiness probes, stuck Pending/Unschedulable, stalled rollouts, failed Jobs, NotReady/pressured nodes, unbound PVCs, quotas at limit). The LLM is never used for detection.
  • Diagnose — for each new finding, an async worker gathers evidence (rule evidence + recent events + a tail of container logs), redacts secrets, and asks the self-hosted model for a structured JSON diagnosis. If the model is unavailable, it falls back to the rule's static guidance.
  • Store / surface — findings, diagnoses, and (never-executed) suggested action plans are persisted to SQLite (Postgres optional) and served over a JSON API + embedded dashboard, and via the smartk8s CLI.
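The Detect step above can be pictured as a plain function from observed state to findings. The following is a minimal sketch under stated assumptions, not the project's actual rule code: `ContainerState`, `Finding`, `DetectContainerFailures`, and the severity strings are hypothetical names invented for illustration, and a real rule would presumably read Kubernetes API types rather than this trimmed-down struct.

```go
package main

import "fmt"

// ContainerState is a hypothetical, trimmed-down stand-in for the container
// status fields a rule would read from the Kubernetes API.
type ContainerState struct {
	Name             string
	WaitingReason    string // e.g. "CrashLoopBackOff", "ImagePullBackOff"
	TerminatedReason string // last termination reason, e.g. "OOMKilled"
	RestartCount     int32
}

// Finding is a hypothetical result type; the field names are assumptions.
type Finding struct {
	Rule      string
	Severity  string // severity values here are assumed, not taken from the project
	Container string
	Evidence  string
}

// DetectContainerFailures is a pure function: no I/O, no clock, no cluster
// access, so the same input always yields the same findings.
func DetectContainerFailures(states []ContainerState) []Finding {
	var findings []Finding
	for _, s := range states {
		switch s.WaitingReason {
		case "CrashLoopBackOff":
			findings = append(findings, Finding{
				Rule: "CrashLoopBackOff", Severity: "critical", Container: s.Name,
				Evidence: fmt.Sprintf("waiting in CrashLoopBackOff after %d restarts", s.RestartCount),
			})
		case "ImagePullBackOff", "ErrImagePull":
			findings = append(findings, Finding{
				Rule: "ImagePullBackOff", Severity: "critical", Container: s.Name,
				Evidence: "image pull is failing: " + s.WaitingReason,
			})
		}
		// A container can be both crash-looping and OOM-killed, so this
		// check is independent of the waiting reason.
		if s.TerminatedReason == "OOMKilled" {
			findings = append(findings, Finding{
				Rule: "OOMKilled", Severity: "critical", Container: s.Name,
				Evidence: "last termination reason was OOMKilled",
			})
		}
	}
	return findings
}

func main() {
	states := []ContainerState{
		{Name: "app", WaitingReason: "CrashLoopBackOff", RestartCount: 7},
		{Name: "sidecar"}, // healthy: produces no finding
	}
	for _, f := range DetectContainerFailures(states) {
		fmt.Printf("%s\t%s\t%s\t%s\n", f.Rule, f.Severity, f.Container, f.Evidence)
	}
}
```

Keeping rules free of side effects means they can be unit-tested with table-driven inputs and no cluster, which is consistent with the README's point that the LLM is never used for detection.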

Build & test

make build      # builds bin/operator and bin/smartk8s
make test       # runs the unit/integration test suite
make vet

Run locally (against your kubeconfig context)

The operator runs out-of-cluster too — handy for development. Point it at any OpenAI-compatible local model (Ollama, vLLM, LocalAI, llama.cpp):

# detect-only (no LLM):
make run-operator

# with diagnosis against a local Ollama:
SMARTK8S_LLM_BASE_URL=http://localhost:11434/v1 \
SMARTK8S_LLM_MODEL=llama3.1:8b \
DIAGNOSE=true make run-operator

Then browse the dashboard at http://localhost:8080 or use the CLI:

bin/smartk8s status
bin/smartk8s findings list
bin/smartk8s findings list -n default --severity critical
bin/smartk8s findings get <id>

Deploy in-cluster

Customers / production — Helm chart (registry or air-gapped bundle):

helm install smartk8s oci://REGISTRY/charts/smartk8s --version X.Y.Z \
  -n smartk8s --create-namespace

See docs/install.md for the full guide (configuration reference, enabling LLM diagnosis, air-gapped installs, upgrades) and docs/release.md for how releases are cut.

Development — kustomize fast loop:

make docker-build              # build the image
make kind-load                 # (optional) load it into a kind cluster
make deploy                    # kubectl apply -k deploy/

Edit the ConfigMap in deploy/operator.yaml to point SMARTK8S_LLM_BASE_URL / SMARTK8S_LLM_MODEL at your model. RBAC is read-only (get/list/watch only).

End-to-end verification

  1. Start a cluster: kind create cluster.
  2. Start a local model and set SMARTK8S_LLM_BASE_URL / SMARTK8S_LLM_MODEL.
  3. Run the operator (make run-operator with DIAGNOSE=true, or deploy it).
  4. Inject failures:
    kubectl apply -f test/e2e/broken-workloads.yaml
  5. Confirm: bin/smartk8s findings list should show CrashLoopBackOff, OOMKilled, ImagePullBackOff, the config error, the unschedulable pod, the pending PVC, and the quota-at-limit finding. bin/smartk8s findings get <id> (and the dashboard) should show an LLM root cause and remediation; with the model stopped, findings still get rule-based fallback guidance.
  6. Tear down: kubectl delete ns smartk8s-e2e.

Or run the whole loop automatically (requires docker, kind, kubectl, jq):

make e2e         # throwaway kind cluster, deploy via kustomize, assert findings
make e2e-helm    # same, but installs the Helm chart
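The fallback behavior checked during verification (evidence is redacted, the model is asked, and the rule's static guidance is used when the model is unavailable) could look roughly like the sketch below. Everything here is an assumption made for illustration: `Redact`, `Model`, `ModelFunc`, `Diagnosis`, `DiagnoseWithFallback`, and the single redaction pattern are hypothetical, and a real redactor would need to cover far more secret formats.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// secretPattern is an illustrative pattern only: it matches "key=value" or
// "key: value" pairs whose key contains a common secret-like word.
var secretPattern = regexp.MustCompile(`(?i)(password|token|api[_-]?key|secret)\s*[=:]\s*\S+`)

// Redact replaces the value of anything that looks like a secret assignment.
func Redact(s string) string {
	return secretPattern.ReplaceAllString(s, "$1=[REDACTED]")
}

// Diagnosis is a hypothetical result shape.
type Diagnosis struct {
	RootCause   string
	Remediation string
	Source      string // "llm" or "rule"
}

// Model is a hypothetical abstraction over the self-hosted LLM client.
type Model interface {
	Diagnose(evidence string) (Diagnosis, error)
}

// ModelFunc adapts a plain function to the Model interface.
type ModelFunc func(evidence string) (Diagnosis, error)

func (f ModelFunc) Diagnose(evidence string) (Diagnosis, error) { return f(evidence) }

// ErrModelUnavailable stands in for a connection or timeout error.
var ErrModelUnavailable = errors.New("model unavailable")

// DiagnoseWithFallback redacts the evidence before it leaves the process,
// asks the model, and falls back to the rule's static guidance on any error.
func DiagnoseWithFallback(m Model, evidence, staticGuidance string) Diagnosis {
	clean := Redact(evidence)
	if m != nil {
		if d, err := m.Diagnose(clean); err == nil {
			d.Source = "llm"
			return d
		}
	}
	return Diagnosis{Remediation: staticGuidance, Source: "rule"}
}

func main() {
	down := ModelFunc(func(string) (Diagnosis, error) {
		return Diagnosis{}, ErrModelUnavailable
	})
	d := DiagnoseWithFallback(down, "log: DB_PASSWORD=hunter2", "Check the container's exit code and recent events.")
	fmt.Printf("source=%s remediation=%q\n", d.Source, d.Remediation)
}
```

Redacting before the model call, rather than inside the client, keeps the rule-only path and the LLM path working from the same sanitized evidence.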

Local development on minikube (macOS)

A concrete in-cluster walkthrough on minikube. The only differences from kind are how the image gets into the cluster and how the operator reaches your local model.

# 1. Start a cluster (vfkit driver — no Docker Desktop needed on macOS)
brew install minikube kubectl
minikube start --driver=vfkit --cpus=4 --memory=8g
kubectl get nodes                       # context "minikube" is set automatically

# 2. Start a local model, bound to all interfaces so the VM can reach it
ollama pull llama3.1:8b
OLLAMA_HOST=0.0.0.0 ollama serve &

# 3. Build the operator image directly into the cluster
minikube image build -t smartk8s:latest .
# (or, with a local Docker daemon:
#   make docker-build && minikube image load smartk8s:latest)

# 4. Point the operator at Ollama, then deploy.
#    In deploy/operator.yaml's ConfigMap, set:
#      SMARTK8S_LLM_BASE_URL: "http://host.minikube.internal:11434/v1"
kubectl apply -k deploy/
kubectl -n smartk8s rollout status deploy/smartk8s

# 5. Inject failures and inspect (see "End-to-end verification" for assertions)
kubectl apply -f test/e2e/broken-workloads.yaml
kubectl -n smartk8s port-forward svc/smartk8s 8080:8080 &
bin/smartk8s findings list
open http://localhost:8080

Notes:

  • The SQLite PVC binds out of the box — minikube ships a default StorageClass.
  • The pending-PVC fixture stays Pending by design (it references a nonexistent StorageClass).
  • make e2e is kind-specific (it shells out to kind); on minikube, run the steps above manually.

Cleanup: kubectl delete ns smartk8s-e2e && minikube delete.

Configuration

Key flags: --llm-base-url, --llm-model, --db-dsn, --namespaces, --api-addr, --diagnose-enabled, --log-tail-lines. The core settings also have SMARTK8S_* env equivalents (see internal/config); the remaining tunables are flag-only.
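One common way to give a flag an environment-variable equivalent is to use the variable as the flag's default, so an explicit flag wins over the environment. The sketch below uses Go's standard `flag` package with the flag and variable names listed above. It is not taken from internal/config: the precedence order, the `Config` struct, `ParseConfig`, `envOr`, the default of 100 for `--log-tail-lines`, and the choice of that flag as a flag-only tunable are all assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Config is a hypothetical subset of the operator's settings.
type Config struct {
	LLMBaseURL   string
	LLMModel     string
	LogTailLines int
}

// envOr returns the variable's value if it is set and non-empty, else def.
// The lookup function is injected so the logic can be tested without
// touching the real process environment.
func envOr(lookup func(string) (string, bool), key, def string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return def
}

// ParseConfig resolves each setting as: flag, then env var, then default.
func ParseConfig(args []string, lookup func(string) (string, bool)) (Config, error) {
	var c Config
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.StringVar(&c.LLMBaseURL, "llm-base-url",
		envOr(lookup, "SMARTK8S_LLM_BASE_URL", ""), "base URL of the OpenAI-compatible model server")
	fs.StringVar(&c.LLMModel, "llm-model",
		envOr(lookup, "SMARTK8S_LLM_MODEL", ""), "model name to request")
	// Assumed to be flag-only here, with an assumed default.
	fs.IntVar(&c.LogTailLines, "log-tail-lines", 100, "container log lines to include as evidence")
	err := fs.Parse(args)
	return c, err
}

func main() {
	cfg, err := ParseConfig(os.Args[1:], os.LookupEnv)
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```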

Roadmap (Phase 2)

The internal/action package defines the remediation interfaces and approval model today; the only shipped Executor is NoopExecutor, which refuses to act. Phase 2 plugs in executors behind approval gates to perform corrective actions — without reshaping the core.
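To make the idea concrete, here is a minimal sketch of an executor interface with a refusing no-op implementation and an approval gate in front of it. Only the names `Executor` and `NoopExecutor` come from the text above; the method signature, the `Action` struct, `ApprovalGate`, `TryExecute`, and both error values are hypothetical and may not match internal/action.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Action is a hypothetical description of a suggested remediation step.
type Action struct {
	Kind     string // e.g. "rollout-restart" (illustrative value)
	Target   string // e.g. "default/web"
	Approved bool
}

// Executor performs an action against the cluster. The signature is assumed.
type Executor interface {
	Execute(ctx context.Context, a Action) error
}

var (
	ErrExecutionDisabled = errors.New("execution disabled: Phase 1 is read-only")
	ErrNotApproved       = errors.New("action has not been approved")
)

// NoopExecutor refuses every action, so nothing is ever mutated.
type NoopExecutor struct{}

func (NoopExecutor) Execute(context.Context, Action) error { return ErrExecutionDisabled }

// ApprovalGate is a hypothetical wrapper: it only forwards approved actions
// to the wrapped executor.
type ApprovalGate struct{ Next Executor }

func (g ApprovalGate) Execute(ctx context.Context, a Action) error {
	if !a.Approved {
		return ErrNotApproved
	}
	return g.Next.Execute(ctx, a)
}

// TryExecute runs an action with a background context.
func TryExecute(e Executor, a Action) error {
	return e.Execute(context.Background(), a)
}

func main() {
	gate := ApprovalGate{Next: NoopExecutor{}}
	a := Action{Kind: "rollout-restart", Target: "default/web"}

	fmt.Println(TryExecute(gate, a)) // refused by the gate: not approved

	a.Approved = true
	err := TryExecute(gate, a) // passes the gate, refused by NoopExecutor
	fmt.Println(errors.Is(err, ErrExecutionDisabled))
}
```

With this shape, swapping `NoopExecutor` for a real executor would change what happens after approval without touching the gate or the callers, which is one way to read "without reshaping the core".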
