A deliberately broken Helm chart with 10+ real-world Kubernetes failure scenarios.
Deploy. Break. Diagnose. Fix. Repeat.
Ever wanted a safe playground of Kubernetes failures you can deploy on demand? This Helm chart ships with 10+ intentionally broken resources — each producing a different, real-world K8s failure.
Perfect for:
- 🤖 AI Agent Demos — Let your AI bot (n8n + Slack + EKS MCP) diagnose and fix live cluster issues
- 🎓 SRE Training & Interviews — Test troubleshooting skills on realistic failures
- 📊 Monitoring Validation — Verify your Prometheus/Grafana/PagerDuty alerts actually fire
- 🧪 Chaos Engineering — Controlled failure injection without the unpredictability
🔒 Everything is namespace-scoped and isolated — safe to deploy alongside production workloads.
# Clone the repo
git clone https://github.com/JustInCache/helm-failure-chart.git
cd helm-failure-chart
# Deploy to any namespace you like
helm install failure-demo . --namespace failure-demo --create-namespace
# 🍿 Sit back and watch the chaos unfold
kubectl get pods -n failure-demo -w

| Component | Image | Replicas | Purpose |
|---|---|---|---|
| 🌐 Frontend | nginx | 2 | Web UI (static files) |
| ⚙️ Backend | node:18-alpine | 2 | REST API |
| 🔧 Worker | python:3.11-slim | 1 | Background job processor |
| 🗄️ Redis | redis:7-alpine | 1 | Cache / message queue |
Also creates: ConfigMap, Secret, ServiceAccount, Role, RoleBinding, Ingress, HPA, PVC, NetworkPolicy
💡 Resource footprint: ~850m CPU, ~900Mi memory in requests (most pods fail before consuming anything).
Want something smaller? The lite chart covers the 5 most common Kubernetes failures:
👉 helm-failure-chart-lite — the lite version
| | Lite | Full (this repo) |
|---|---|---|
| Scenarios | 5 | 10+ |
| Components | 3 (Frontend, Backend, Worker) | 4 (+ Redis) |
| Ingress / HPA / PVC / NetworkPolicy | ❌ | ✅ |
| Resource footprint | ~200m CPU, ~256Mi | ~850m CPU, ~900Mi |
| Best for | Quick demos, interviews | Comprehensive training, deep dives |
| 📍 Component | Frontend Deployment |
| 🐛 Root Cause | Image tag nginx:v99.99.99 does not exist |
| 👀 What You See | Pod stuck in ImagePullBackOff / ErrImagePull |
| ✅ How to Fix | Change frontend.image.tag to a valid tag like 1.25-alpine |
kubectl describe pod -l app=frontend -n failure-demo | grep -A5 "Events"

| 📍 Component | Backend Deployment |
| 🐛 Root Cause | Liveness probe targets port 8080 but container only has port 3000. Container runs a simple setTimeout, not an HTTP server. |
| 👀 What You See | Pod enters CrashLoopBackOff after repeated probe failures |
| ✅ How to Fix | Change probe ports to 3000 in values.yaml, or remove the HTTP probes |
kubectl logs -l app=backend -n failure-demo --previous

| 📍 Component | Backend Service |
| 🐛 Root Cause | Service selector is app: backend-api but pods are labeled app: backend |
| 👀 What You See | kubectl get endpoints shows 0 endpoints — traffic never reaches pods |
| ✅ How to Fix | Change the service selector to app: backend |
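
A selector mismatch like this one can be confirmed by comparing the Service's selector with the labels on the pods. The sketch below assumes a hypothetical helper named `selector_matches` (not part of the chart) that takes both as comma-separated `key=value` strings, which is the form `kubectl get svc -o wide` (SELECTOR column) and `kubectl get pods --show-labels` (LABELS column) print.

```shell
# selector_matches: hypothetical helper, not shipped with the chart.
# $1 = service selector, $2 = pod labels (both comma-separated key=value)
# Returns 0 only if every selector pair is present in the pod labels.
selector_matches() {
  old_ifs=$IFS
  IFS=,
  for pair in $1; do
    case ",$2," in
      *",$pair,"*) ;;                      # this pair is present, keep going
      *) IFS=$old_ifs; return 1 ;;         # a required label is missing
    esac
  done
  IFS=$old_ifs
  return 0
}

# Offline example using the values from this scenario:
if selector_matches 'app=backend-api' 'app=backend'; then
  echo match
else
  echo mismatch
fi
# prints: mismatch

# Live values to feed in:
#   kubectl get svc  -n failure-demo -o wide          # SELECTOR column
#   kubectl get pods -n failure-demo --show-labels    # LABELS column
```

Pods usually carry extra labels such as `pod-template-hash`, so the helper checks that the selector is a subset of the labels rather than an exact match.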
kubectl get endpoints -n failure-demo

| 📍 Component | Worker Deployment |
| 🐛 Root Cause | Env vars reference ConfigMap keys DATABASE_HOST / DATABASE_PORT, but the ConfigMap defines DB_HOST / DB_PORT |
| 👀 What You See | Pod stuck in CreateContainerConfigError |
| ✅ How to Fix | Align the key names — update the ConfigMap or the deployment env refs |
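
Key mismatches between a workload and its ConfigMap (or Secret) are easy to spot once the two key lists are side by side. A minimal sketch, assuming a hypothetical helper named `missing_keys` that is not part of the chart: it reads the keys that actually exist on stdin and prints every expected key that is absent.

```shell
# missing_keys: hypothetical helper, not shipped with the chart.
# stdin:  keys that exist in the ConfigMap/Secret, one per line
# args:   keys the Deployment expects
# stdout: expected keys that are missing
missing_keys() {
  available=$(cat)
  for key in "$@"; do
    printf '%s\n' "$available" | grep -qxF -- "$key" || printf '%s\n' "$key"
  done
}

# Offline example using the key names from this scenario:
printf 'DB_HOST\nDB_PORT\n' | missing_keys DATABASE_HOST DATABASE_PORT

# Against a live cluster (CONFIGMAP_NAME is a placeholder; look it up with
# `kubectl get configmap -n failure-demo`):
#   kubectl get configmap CONFIGMAP_NAME -n failure-demo \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#     | missing_keys DATABASE_HOST DATABASE_PORT
```

Both expected keys are reported as missing in the offline example, which is exactly the condition that leaves the pod in CreateContainerConfigError.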
kubectl describe pod -l app=worker -n failure-demo | grep -A3 "Warning"

| 📍 Component | Redis Deployment |
| 🐛 Root Cause | Memory limit is 5Mi — Redis needs ~30-50Mi minimum to start |
| 👀 What You See | Pod gets OOMKilled immediately, restarts in a loop |
| ✅ How to Fix | Increase redis.resources.limits.memory to at least 128Mi |
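
One way to apply this fix without editing values.yaml is a small override file. The sketch below only writes the file; the file name `redis-fix-values.yaml` is an arbitrary choice, and the key path is the one named in the fix above.

```shell
# Write an override file that raises the Redis memory limit.
# The file name is arbitrary; the key path comes from the fix above.
cat > redis-fix-values.yaml <<'EOF'
redis:
  resources:
    limits:
      memory: 128Mi
EOF

# Then, from the chart directory and against a live cluster:
#   helm upgrade failure-demo . -n failure-demo --reuse-values -f redis-fix-values.yaml
```

`--reuse-values` keeps whatever was set at install time and layers the override on top, so the other scenarios stay broken while Redis recovers.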
kubectl get pod -l app=redis -n failure-demo -o jsonpath='{.items[0].status.containerStatuses[0].lastState}'

| 📍 Component | Ingress |
| 🐛 Root Cause | Frontend path routes to service xxx-web (should be xxx-frontend). API path uses port 8080 (should be 3000). |
| 👀 What You See | 503 errors, no healthy targets |
| ✅ How to Fix | Correct the service name and port in ingress.yaml |
kubectl describe ingress -n failure-demo

| 📍 Component | HorizontalPodAutoscaler |
| 🐛 Root Cause | HPA targets deployment xxx-backend-api but actual name is xxx-backend |
| 👀 What You See | HPA shows <unknown> for current metrics |
| ✅ How to Fix | Fix scaleTargetRef.name in hpa.yaml to match the actual deployment |
kubectl get hpa -n failure-demo

| 📍 Component | PersistentVolumeClaim |
| 🐛 Root Cause | References StorageClass gp3-encrypted-premium which doesn't exist |
| 👀 What You See | PVC stays Pending, worker pod can't mount volume |
| ✅ How to Fix | Change persistence.storageClassName to an existing class (gp2, gp3, standard) |
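
Which class counts as "existing" depends on the cluster. A minimal sketch, assuming a hypothetical helper named `default_storage_class` (not part of the chart) that picks the class kubectl marks as `(default)`; the sample rows are illustrative, not captured from a real cluster.

```shell
# default_storage_class: hypothetical helper, not shipped with the chart.
# stdin:  output of `kubectl get storageclass`
# stdout: name of the first class marked "(default)", if any
default_storage_class() {
  awk '$2 == "(default)" { print $1; exit }'
}

# Offline example with illustrative sample output:
printf '%s\n' \
  'NAME            PROVISIONER             RECLAIMPOLICY' \
  'gp2 (default)   kubernetes.io/aws-ebs   Delete' \
  'gp3             ebs.csi.aws.com         Delete' \
  | default_storage_class
# prints: gp2

# Live:
#   kubectl get storageclass | default_storage_class
# then pass the result to the chart:
#   helm upgrade failure-demo . -n failure-demo --reuse-values \
#     --set persistence.storageClassName=NAME_FROM_ABOVE
```

If the cluster has no default class, the helper prints nothing and a class has to be chosen by hand.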
kubectl get pvc -n failure-demo
kubectl get storageclass

| 📍 Component | Role |
| 🐛 Root Cause | Role lists deployments under apiGroup "" (core) — they belong to "apps" |
| 👀 What You See | 403 Forbidden when the ServiceAccount tries to access deployments |
| ✅ How to Fix | Change apiGroups: [""] to apiGroups: ["apps"] for the deployments rule |
kubectl auth can-i list deployments \
--as=system:serviceaccount:failure-demo:failure-demo-helm-failure-chart-sa \
-n failure-demo

| 📍 Component | NetworkPolicy |
| 🐛 Root Cause | Backend egress only allows Redis. No rule for PostgreSQL on port 5432. |
| 👀 What You See | Backend → database connections time out |
| ✅ How to Fix | Add an egress rule for the database service on port 5432 |
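
NetworkPolicies are additive, so one way to try this fix without touching the chart template is a supplemental policy. The sketch below only writes a manifest. It assumes the backend pods carry the label `app: backend` (as described in the Backend Service scenario) and that the database listens on TCP 5432; the policy name and file name are hypothetical. There is no `to` clause, so any destination on that port is allowed; tighten it with a pod or namespace selector once the database labels are known.

```shell
# Write a supplemental NetworkPolicy that allows backend egress on TCP 5432.
# The policy name and file name are hypothetical, chosen for illustration.
cat > allow-backend-to-postgres.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-to-postgres
  namespace: failure-demo
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: TCP
          port: 5432
EOF

# Against a live cluster:
#   kubectl apply -f allow-backend-to-postgres.yaml
#   kubectl get networkpolicy -n failure-demo
```

For a permanent fix, the same egress rule belongs in the chart's network-policy.yaml so it survives a reinstall.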
| 📍 Component | Backend Deployment + Secret |
| 🐛 Root Cause | Deployment references DATABASE_USERNAME from the secret, but the secret key is DB_USERNAME |
| 👀 What You See | CreateContainerConfigError on backend pods |
| ✅ How to Fix | Align the key name in either the secret or the deployment |
| 📍 Component | ServiceAccount |
| 🐛 Root Cause | IRSA annotation points to arn:aws:iam::123456789012:role/app-nonexistent-role |
| 👀 What You See | AWS API calls from pods fail with auth errors |
| ✅ How to Fix | Update the ARN to a valid IAM role, or remove the annotation |
- Deploy the chart to your EKS cluster
- Ask in Slack: "What's wrong with pods in the failure-demo namespace?"
- AI agent uses EKS MCP to inspect pods, events, services, etc.
- Agent diagnoses failures and suggests fixes
- Apply fixes iteratively and re-ask to validate
Deploy the chart and ask candidates to:
- 🔍 Identify all failing resources and their root causes
- 🛠️ Propose fixes without looking at the source
- 📋 Prioritize which failures to fix first
Use the chart to verify that your Prometheus/Grafana/PagerDuty pipeline correctly detects:
- CrashLoopBackOff and OOMKilled pod states
- Deployments with 0 available replicas
- PVCs stuck in Pending
- Services with 0 endpoints
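
To cross-check what the alerts report against the cluster itself, the conditions above can be counted straight from kubectl output. A minimal sketch, assuming a hypothetical helper named `count_where` that is not part of the chart; the sample rows are illustrative, not captured from a real cluster.

```shell
# count_where: hypothetical helper, not shipped with the chart.
# args:   column number, expected value
# stdin:  `kubectl get ... --no-headers` output
# stdout: number of rows whose column equals the value
count_where() {
  awk -v col="$1" -v want="$2" '$col == want { n++ } END { print n + 0 }'
}

# Offline example with illustrative rows (STATUS is column 3 for pods):
printf '%s\n' \
  'backend-abc    0/1   CrashLoopBackOff   5   3m' \
  'frontend-xyz   0/1   ImagePullBackOff   0   3m' \
  | count_where 3 CrashLoopBackOff
# prints: 1

# Live (STATUS is column 2 for PVCs):
#   kubectl get pods -n failure-demo --no-headers | count_where 3 CrashLoopBackOff
#   kubectl get pvc  -n failure-demo --no-headers | count_where 2 Pending
```

This is a quick sanity check rather than a substitute for alert rules: if the count is non-zero and no alert fired, the pipeline has a gap.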
Override any value at install time:
# Disable ingress (e.g., if you use Kong and don't want ALB conflicts)
helm install failure-demo . -n failure-demo --create-namespace \
--set ingress.enabled=false
# Disable persistence (skip PVC scenario)
helm install failure-demo . -n failure-demo --create-namespace \
--set persistence.enabled=false
# Fix the frontend image to isolate other scenarios
helm install failure-demo . -n failure-demo --create-namespace \
--set frontend.image.tag=1.25-alpine

Everything is namespace-scoped. Nothing touches other namespaces or creates cluster-wide resources.
| Check | Status |
|---|---|
| 🔐 Cluster-scoped RBAC (ClusterRole) | ✅ None |
| 🌐 LoadBalancer / NodePort services | ✅ None — all ClusterIP |
| 🔗 Cross-namespace NetworkPolicy | ✅ None |
| 📦 CRDs / Webhooks | ✅ None |
| 🚪 Ingress controller side effects | ⚠️ Possible; install with ingress.enabled=false to avoid them |
helm uninstall failure-demo -n failure-demo
kubectl delete namespace failure-demo

helm-failure-chart/
├── Chart.yaml
├── values.yaml
├── README.md
└── templates/
    ├── _helpers.tpl
    ├── backend-deployment.yaml # Scenarios 2, Bonus (secret mismatch)
    ├── backend-service.yaml # Scenario 3
    ├── configmap.yaml # Scenario 4 (key mismatch source)
    ├── frontend-deployment.yaml # Scenario 1
    ├── frontend-service.yaml
    ├── hpa.yaml # Scenario 7
    ├── ingress.yaml # Scenario 6
    ├── network-policy.yaml # Scenario 10
    ├── pvc.yaml # Scenario 8
    ├── rbac.yaml # Scenario 9
    ├── redis-deployment.yaml # Scenario 5
    ├── redis-service.yaml
    ├── secret.yaml # Bonus (key mismatch source)
    ├── serviceaccount.yaml # Bonus (invalid IAM ARN)
    └── worker-deployment.yaml # Scenario 4
Contributions are welcome! Here's how you can help:
- 🐛 Add new failure scenarios — open a PR with a new template and update the README
- 📝 Improve documentation — typos, better explanations, diagrams
- 🧪 Test on different clusters — EKS, GKE, AKS, minikube, kind — report what works
- 💡 Suggest ideas — open an issue with your use case
- Fork the repo
- Create a feature branch (git checkout -b feat/new-scenario)
- Commit your changes (git commit -m "Add new failure scenario")
- Push to the branch (git push origin feat/new-scenario)
- Open a Pull Request
If this chart helped you demo, learn, or break things in a fun way — give it a star! ⭐
It helps others discover this project and motivates continued development.
If this project saved you time or sparked an idea, consider buying me a coffee!
Your support keeps this project maintained and growing 🙏
☕ buymeacoffee.com/connectankush
MIT — do whatever you want with it. Break things responsibly. 💥