5 targeted Kubernetes failures in one lightweight Helm chart.
Lean. Focused. Perfect for a quick AI agent demo.
A stripped-down Helm chart with exactly 5 hand-picked Kubernetes failure scenarios — the most common ones you'll encounter in production. Deploy it in seconds, break things on purpose, and let your AI agent (or SRE candidate) figure out what went wrong.
🔒 100% namespace-scoped. No Ingress, no LoadBalancer, no cluster-wide resources. Safe for any cluster.
# Clone
git clone https://github.com/JustInCache/helm-failure-chart-lite.git
cd helm-failure-chart-lite
# Deploy (pick any namespace you want)
helm install failure-lite . --namespace failure-lite --create-namespace
# 🍿 Watch the failures roll in
kubectl get pods -n failure-lite -w

| Component | Image | Replicas | Purpose |
|---|---|---|---|
| 🌐 Frontend | nginx | 1 | Web UI (static files) |
| ⚙️ Backend | node:18-alpine | 1 | REST API |
| 🔧 Worker | python:3.11-slim | 1 | Background job processor |
Also creates: ConfigMap, ServiceAccount, Role, RoleBinding, 2x Services
💡 Tiny footprint: ~200m CPU, ~256Mi memory in requests. Most pods fail before using anything.
Scenario 1: ImagePullBackOff

| | |
|---|---|
| 📍 Component | Frontend Deployment |
| 📄 File | frontend-deployment.yaml · values.yaml |
| 🐛 Root Cause | Image tag nginx:1.99.0-nonexistent does not exist |
| 👀 What You See | Pod stuck in ImagePullBackOff / ErrImagePull |
| ✅ How to Fix | Change frontend.image.tag to a valid tag like 1.25-alpine |
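
One way to try the fix from the table without editing values.yaml is to override the tag at upgrade time. The sketch below wraps that in a small shell function. The function name `fix_frontend_tag` is made up for illustration, and it assumes the release name and namespace from the install command above, that you run it from the chart directory, and that the chart reads `frontend.image.tag` as the table says.

```shell
# Hypothetical helper: override the frontend image tag on the existing release.
# Assumes release "failure-lite" in namespace "failure-lite", run from the chart directory.
fix_frontend_tag() {
  tag="${1:-1.25-alpine}"   # default is the valid tag suggested in the table
  helm upgrade failure-lite . \
    --namespace failure-lite \
    --set "frontend.image.tag=${tag}"
}
```

Calling `fix_frontend_tag` with no argument uses `1.25-alpine`. Afterwards, `kubectl get pods -l app=frontend -n failure-lite` should show whether the new pod pulls its image.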
kubectl describe pod -l app=frontend -n failure-lite | grep -A5 "Events"

Scenario 2: CrashLoopBackOff

| | |
|---|---|
| 📍 Component | Backend Deployment |
| 📄 File | backend-deployment.yaml · values.yaml |
| 🐛 Root Cause | Liveness probe targets port 9090, but nothing listens there. Container runs a plain setTimeout — not an HTTP server. |
| 👀 What You See | Pod enters CrashLoopBackOff after repeated probe failures |
| ✅ How to Fix | Change probe port to 3000, or remove the liveness probe entirely |
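
To confirm which port the probe targets before changing anything, you can print the liveness probe straight from the running pod spec. This is a minimal sketch: the function name `show_backend_probe` is hypothetical, and it relies only on the `app=backend` pod label mentioned in this README.

```shell
# Hypothetical helper: print each backend pod's name and its liveness probe config.
show_backend_probe() {
  kubectl get pods -l app=backend -n failure-lite \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].livenessProbe}{"\n"}{end}'
}
```

Run `show_backend_probe` and look for the probe's port in the output. How the probe object is printed depends on your kubectl version.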
kubectl logs -l app=backend -n failure-lite --previous

Scenario 3: 0 endpoints

| | |
|---|---|
| 📍 Component | Backend Service |
| 📄 File | backend-service.yaml |
| 🐛 Root Cause | Service selector is app: backend-api but pods are labeled app: backend |
| 👀 What You See | kubectl get endpoints shows 0 endpoints — traffic never reaches pods |
| ✅ How to Fix | Change the service selector from app: backend-api to app: backend |
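
A selector mismatch is easiest to spot with the Service selectors and the pod labels side by side. The sketch below prints both. The function name `compare_selectors_and_labels` is an assumption for illustration; the kubectl flags are standard.

```shell
# Hypothetical helper: show Service selectors next to pod labels in the namespace.
compare_selectors_and_labels() {
  echo "Service selectors:"
  # -o wide adds a SELECTOR column to the service listing
  kubectl get services -n failure-lite -o wide
  echo "Pod labels:"
  kubectl get pods -n failure-lite --show-labels
}
```

If the backend Service's SELECTOR column shows `app=backend-api` while the backend pod carries `app=backend`, that is the mismatch described in the table.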
kubectl get endpoints -n failure-lite

Scenario 4: CreateContainerConfigError

| | |
|---|---|
| 📍 Component | Worker Deployment |
| 📄 File | worker-deployment.yaml · configmap.yaml |
| 🐛 Root Cause | Env vars reference ConfigMap keys DATABASE_HOST / DATABASE_PORT, but the ConfigMap defines DB_HOST / DB_PORT |
| 👀 What You See | Pod stuck in CreateContainerConfigError |
| ✅ How to Fix | Align the key names — update the ConfigMap or the deployment env refs |
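
To see both sides of the key mismatch, you can list the ConfigMap keys the worker pod asks for and then the keys the ConfigMaps in the namespace actually define. This is a sketch under assumptions: the function name `show_worker_configmap_refs` is hypothetical, and it assumes the env vars are wired through `valueFrom.configMapKeyRef`, which matches the error described above but is not confirmed here.

```shell
# Hypothetical helper: compare referenced ConfigMap keys with defined ones.
# Assumes the worker env vars use valueFrom.configMapKeyRef.
show_worker_configmap_refs() {
  echo "Keys the worker pods reference:"
  kubectl get pods -l app=worker -n failure-lite \
    -o jsonpath='{range .items[*].spec.containers[*].env[*]}{.name}{" <- "}{.valueFrom.configMapKeyRef.name}{"/"}{.valueFrom.configMapKeyRef.key}{"\n"}{end}'
  echo "Keys the ConfigMaps define:"
  kubectl describe configmaps -n failure-lite
}
```

Any key that shows up in the first list but not in the second is a candidate for the rename.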
kubectl describe pod -l app=worker -n failure-lite | grep -A3 "Warning"

Scenario 5: Wrong apiGroup

| | |
|---|---|
| 📍 Component | Role |
| 📄 File | rbac.yaml |
| 🐛 Root Cause | Role grants access to deployments / replicasets under apiGroup "" (core) — they belong to "apps" |
| 👀 What You See | 403 Forbidden when ServiceAccount tries to access deployments |
| ✅ How to Fix | Change apiGroups: [""] to apiGroups: ["apps"] for the deployments/replicasets rule |
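
To check the Role and the resulting permissions together, you can describe the Roles in the namespace and then ask the API server about each affected resource. This is a minimal sketch; the function name `check_sa_access` is hypothetical, and the ServiceAccount name is the one used elsewhere in this README.

```shell
# Hypothetical helper: show the Role rules, then test access for both resources.
check_sa_access() {
  sa="system:serviceaccount:failure-lite:failure-lite-helm-failure-chart-lite-sa"
  kubectl describe roles -n failure-lite
  for resource in deployments replicasets; do
    printf '%s: ' "$resource"
    # Prints "yes" or "no" for the impersonated ServiceAccount
    kubectl auth can-i list "$resource" --as="$sa" -n failure-lite
  done
}
```

Running it before and after the apiGroups change gives a quick before/after comparison for both resources.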
kubectl auth can-i list deployments \
--as=system:serviceaccount:failure-lite:failure-lite-helm-failure-chart-lite-sa \
-n failure-lite

# 📋 All pods + status
kubectl get pods -n failure-lite
# 🔎 Describe a failing pod
kubectl describe pod <pod-name> -n failure-lite
# 📅 Events sorted by time
kubectl get events -n failure-lite --sort-by='.lastTimestamp'
# 🔗 Check service endpoints (Scenario 3)
kubectl get endpoints -n failure-lite
# 🔐 Test RBAC permissions (Scenario 5)
kubectl auth can-i list deployments \
--as=system:serviceaccount:failure-lite:failure-lite-helm-failure-chart-lite-sa \
-n failure-lite

| Use Case | How |
|---|---|
|---|---|
| 🤖 AI Agent Demo | Deploy → ask in Slack "What's broken in failure-lite?" → AI diagnoses via EKS MCP |
| 🎓 SRE Training | Give candidates 15 minutes to find and fix all 5 issues |
| 📊 Alert Validation | Verify Prometheus/Grafana detects CrashLoopBackOff, 0 endpoints, etc. |
| 🧪 Quick Smoke Test | Validate your troubleshooting tooling on known failures |
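
For the smoke-test and alert-validation use cases, a small script can confirm that the expected pod failure states are actually present before you point your tooling at them. The sketch below is one way to do that. Both function names are hypothetical, and the list of reasons is taken from the scenario tables in this README.

```shell
# Hypothetical helper: print each pod's name and its container waiting reason(s).
list_waiting_reasons() {
  kubectl get pods -n failure-lite \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
}

# Hypothetical helper: report which expected failure reasons are currently visible.
check_expected_failures() {
  statuses="$(list_waiting_reasons)"
  for reason in ImagePullBackOff CrashLoopBackOff CreateContainerConfigError; do
    if printf '%s\n' "$statuses" | grep -q "$reason"; then
      echo "found: $reason"
    else
      echo "missing: $reason"
    fi
  done
}
```

A pod can alternate between states (for example ErrImagePull and ImagePullBackOff, or Running and CrashLoopBackOff), so a "missing" line right after install may only mean the pod has not reached that state yet.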

| Check | Status |
|---|---|
| 🔐 Cluster-scoped RBAC | ✅ None — Role/RoleBinding only |
| 🌐 LoadBalancer / NodePort | ✅ None — all ClusterIP |
| 🚪 Ingress | ✅ None |
| 📦 CRDs / Webhooks | ✅ None |
| 💾 PVCs / Storage | ✅ None |
Zero risk to existing workloads. Everything stays inside your chosen namespace.
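
If you would rather verify the safety claims yourself before installing, you can render the chart locally and count the resource kinds it produces. This is a sketch, assuming you run it from the chart directory; the function name `list_rendered_kinds` is made up for illustration.

```shell
# Hypothetical helper: render the chart without installing and count top-level kinds.
list_rendered_kinds() {
  helm template failure-lite . --namespace failure-lite \
    | awk '/^kind:/ { count[$2]++ } END { for (k in count) print k, count[k] }' \
    | sort
}
```

Only top-level `kind:` lines are counted, so nested references such as a RoleBinding's `roleRef` are ignored. Kinds like Ingress, ClusterRole, or PersistentVolumeClaim should not appear in the output if the table above holds.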
helm uninstall failure-lite -n failure-lite
kubectl delete namespace failure-lite

helm-failure-chart-lite/
├── Chart.yaml
├── values.yaml
├── README.md
└── templates/
    ├── _helpers.tpl
    ├── configmap.yaml             # Key mismatch source (Scenario 4)
    ├── serviceaccount.yaml
    ├── rbac.yaml                  # Scenario 5: wrong apiGroup
    ├── frontend-deployment.yaml   # Scenario 1: ImagePullBackOff
    ├── frontend-service.yaml
    ├── backend-deployment.yaml    # Scenario 2: CrashLoopBackOff
    ├── backend-service.yaml       # Scenario 3: 0 endpoints
    └── worker-deployment.yaml     # Scenario 4: CreateContainerConfigError
This lite chart covers the 5 most common Kubernetes failures. If you want the full experience with 10+ scenarios including OOMKilled, Ingress misconfiguration, PVC Pending, HPA mismatch, NetworkPolicy blocking, and more:
👉 helm-failure-chart — the full version
| | Lite (this repo) | Full |
|---|---|---|
| Scenarios | 5 | 10+ |
| Components | 3 (Frontend, Backend, Worker) | 4 (+ Redis) |
| Ingress / HPA / PVC / NetworkPolicy | ❌ | ✅ |
| Resource footprint | ~200m CPU, ~256Mi | ~850m CPU, ~900Mi |
| Best for | Quick demos, interviews | Comprehensive training, deep dives |
Contributions are welcome! Here's how you can help:
- 🐛 Add a new failure scenario — open a PR with a new template
- 📝 Improve docs — typos, better explanations, diagrams
- 🧪 Test on different clusters — EKS, GKE, AKS, minikube, kind
- 💡 Suggest ideas — open an issue with your use case
1. Fork the repo
2. Create a feature branch (git checkout -b feat/new-scenario)
3. Commit your changes (git commit -m "Add new failure scenario")
4. Push to the branch (git push origin feat/new-scenario)
5. Open a Pull Request
If this chart helped you demo, learn, or break things in a fun way — give it a star! ⭐
It helps others discover this project and motivates continued development.
If this project saved you time or sparked an idea, consider buying me a coffee!
Your support keeps this project maintained and growing 🙏
☕ buymeacoffee.com/connectankush
MIT — do whatever you want with it. Break things responsibly. ⚡