AI-powered Kubernetes Incident Response, orchestrated by UiPath Maestro
Demo Video (4m 30s) → Watch on YouTube | demo_assets/neurascale_ops_demo_FINAL_v3.mp4
Platform engineering teams are drowning in alert noise. A single OOMKill cascade in production means:
- 3 AM pages to 2–3 engineers
- Manual `kubectl` into 6 pods to trace the root cause
- Approval over Slack from someone who is asleep
- Post-mortem written from memory at 6 AM
Average MTTR: 45–90 minutes. Not because the fix is hard — because the coordination is broken.
NeuroScale Ops is a 7-stage UiPath Maestro Case that takes a Prometheus alert from detection to resolved post-mortem — with human approval exactly where it matters — fully automatically.
MTTR drops from 45 minutes to under 15. SRE intervention: one approval tap.
Every stage maps 1:1 to a UiPath Maestro Case stage with defined SLAs, input/output contracts, and escalation paths. The full tech stack (Groq LLM, OpenCost, ArgoCD, Kyverno, UiPath Apps) is documented alongside all 5 incident runbooks.
Prometheus Alert
      │
      ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         UiPath Maestro Case                         │
│                                                                     │
│  S1: Detector ──► S2: Groq Triage ──► S3: Cost Impact               │
│  (Python Agent)   (llama-3.3-70b)     (OpenCost API)                │
│                                              │                      │
│                          S4: Human Approval (UiPath Apps, 15-min)   │
│                                              │                      │
│                                 APPROVED ────┤──── REJECTED         │
│                                     │                  │            │
│                               S5: Remediate     S7: Post-Mortem     │
│                             (kubectl/ArgoCD)  (Doc Understanding)   │
│                                     │                               │
│                   S6: Resolution Sign-off (UiPath Apps)             │
│                                     │                               │
│                    S7: Post-Mortem (Doc Understanding)              │
└─────────────────────────────────────────────────────────────────────┘
      │
      ▼
Slack / PagerDuty + PDF Post-Mortem
Real UiPath Automation Cloud screenshot. The 7-stage case plan for `NeuroScale Ops — K8s Incident Response`, published as `v1.0.0` to DefaultTenant on June 17, 2026. Each stage is a named Maestro node with its own task type: Agentic process, API workflow, or Human action.
The change history panel confirms `Published v1.0.0`, authored by Sodiq Jimoh on the DefaultTenant Orchestrator. Timestamp: June 17, 2026. The solution package `Solution 1 ver. 1.0.0` was created and checked into the tenant.
UiPath Studio confirmation that the solution package was packaged and published. The `Solution 1 ver. 1.0.0` package contains the full NeuroScale Ops case definition, ready for deployment.
Real UiPath Action Center screenshot — Stage 4 Human Approval gate in action. The SRE sees:
- Full incident details: pod `payments-api-7d9f4c-xxp2r`, memory 512Mi hitting 100% hard limit, 7 restarts in 12 min
- Groq AI reasoning with 94% confidence: "Safe to approve"
- Proposed fix: scale memory limit to 768Mi (+50%)
- Estimated cost impact: $2,847 downtime saved
- Maestro Case Progress sidebar — stages 1, 2, 3 completed; Stage 4 awaiting decision with 2:14 remaining on the 15-min SLA
- One-click approve with mandatory blast-radius confirmation checkbox
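The proposed fix in the screenshot is plain percentage arithmetic on the memory limit. A minimal sketch, assuming limits are expressed in whole `Mi` units; the `scale_memory_limit` helper is hypothetical and not taken from the project's code:

```python
def scale_memory_limit(limit: str, percent_increase: int) -> str:
    """Scale a Kubernetes memory limit such as '512Mi' by a percentage."""
    if not limit.endswith("Mi"):
        raise ValueError(f"only Mi units are handled in this sketch: {limit!r}")
    value = int(limit[:-2])
    # Integer math: rounds down to a whole Mi value.
    scaled = value * (100 + percent_increase) // 100
    return f"{scaled}Mi"


if __name__ == "__main__":
    print(scale_memory_limit("512Mi", 50))   # the +50% fix shown in the approval form
    print(scale_memory_limit("512Mi", 100))  # a doubled limit
```

With a 50% increase, `512Mi` becomes `768Mi`, matching the value shown to the SRE.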
`python main.py`: live execution. Groq `llama-3.3-70b` identifies root cause `OOMKILL` with `HIGH` confidence in under 2 seconds and recommends `patch_resources` against runbook `RB-001`. OpenCost calculates a `$+15.00/mo` cost delta, the case routes to UiPath Maestro for SRE approval, `kubectl patch` executes with doubled memory limits, and the post-mortem is auto-generated. All 7 stages in one pipeline run.
`python main.py --scenario all`: all 5 incident types processed in a single run. Groq AI adapts its reasoning for each. Net cost savings: -$120/mo from the scale-down remediation alone. The CrashLoop scenario escalates (MEDIUM confidence) by design: the circuit breaker refuses to auto-remediate below 85% confidence, protecting against runaway rollbacks.
Full test coverage across every pipeline stage. 17 tests, 0 failures:
- Detector Agent — scenario existence, alert model, validation, emission
- Triage Agent — OOMKill, CrashLoop, CostSpike rule matching, serialization
- Cost Impact Agent — report generation, serialization
- Remediation Agent — execution, all action types
- Notification Agent — payload, cost-less notification
- End-to-End Pipeline — OOMKill, CrashLoop, CostSpike full runs
This is not a collection of scripts. NeuroScale Ops is a proper Maestro Case — stateful, audited, SLA-bound, with branching logic and human gates.
| UiPath Component | Role in NeuroScale Ops |
|---|---|
| Maestro Case | Core orchestration — 7 stages, SLAs, escalation on timeout, full audit trail |
| Coded Agents | Detector, Triage, Remediation agents — full Python business logic |
| API Workflow | OpenCost namespace query for cost impact |
| UiPath Apps | Stage 4 approval form + Stage 6 resolution sign-off |
| Action Center | SRE receives task with AI reasoning, approves in one click |
| Document Understanding | Auto-generates structured post-mortem PDF |
{
"id": "stage_4_human_approval",
"type": "human_in_loop",
"app": "triage_approval_form",
"sla_minutes": 15,
"escalation_on_timeout": "on_call_engineer",
"on_approve": "stage_5_remediation",
"on_reject": "stage_6_postmortem"
}

If the on-call SRE doesn't respond within 15 minutes, Maestro automatically escalates to the next engineer. Every decision (who approved, when, what the AI said, what action was taken) is permanently stored in the Maestro audit trail.
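As a rough illustration of how the routing fields in this stage definition can be read, the sketch below resolves the next step from an approval outcome. The `next_step` function and the `"approved"` / `"rejected"` / `"timeout"` outcome strings are hypothetical and not part of any UiPath API; in the real solution Maestro performs this routing itself.

```python
import json

# Stage definition copied from the snippet above.
STAGE_4 = json.loads("""
{
  "id": "stage_4_human_approval",
  "type": "human_in_loop",
  "app": "triage_approval_form",
  "sla_minutes": 15,
  "escalation_on_timeout": "on_call_engineer",
  "on_approve": "stage_5_remediation",
  "on_reject": "stage_6_postmortem"
}
""")


def next_step(stage: dict, outcome: str) -> str:
    """Return the stage id (or escalation target) for a hypothetical outcome string."""
    routes = {
        "approved": stage["on_approve"],
        "rejected": stage["on_reject"],
        "timeout": stage["escalation_on_timeout"],
    }
    if outcome not in routes:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return routes[outcome]


if __name__ == "__main__":
    for outcome in ("approved", "rejected", "timeout"):
        print(outcome, "->", next_step(STAGE_4, outcome))
```

Keeping the routes in the JSON definition rather than in agent code means the approval flow can be changed without touching the Python agents.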
| Incident Type | Root Cause | Remediation | Runbook |
|---|---|---|---|
| OOMKill | Memory limit exceeded | `kubectl patch` memory limits | RB-001 |
| CrashLoop | Repeated container crash | ArgoCD rollback | RB-002 |
| Policy Violation | Privileged container | Kyverno `PolicyException` | RB-003 |
| Cost Spike | Budget overrun | `kubectl scale --replicas=1` | RB-004 |
| Deployment Failure | Image pull error | ArgoCD rollback | RB-005 |
The Remediation Agent has a built-in confidence gate: if Groq's confidence score is below 85%, the agent refuses to auto-remediate and escalates to a human. This is why the CrashLoop scenario shows ESCALATED — not a bug. It's the safety system working correctly.
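The gate described above amounts to a threshold check before any action is taken. The following is a minimal sketch, not the project's actual code: the `TriageResult` shape, the `decide` function, the return strings, and the use of scenario names as dictionary keys are all assumptions made for illustration. Only the 85% threshold and the runbook IDs come from this README.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # below 85% the agent refuses to auto-remediate

# Runbook IDs from the incident table; the keys are illustrative.
RUNBOOKS = {
    "oomkill": "RB-001",
    "crashloop": "RB-002",
    "policy": "RB-003",
    "cost": "RB-004",
    "deploy": "RB-005",
}


@dataclass
class TriageResult:
    """Hypothetical shape of a triage verdict (not the project's real model)."""
    incident_type: str
    confidence: float  # 0.0 to 1.0


def decide(result: TriageResult) -> str:
    """Return 'REMEDIATE:<runbook>' when confident enough, else 'ESCALATED'."""
    if result.confidence < CONFIDENCE_THRESHOLD:
        return "ESCALATED"
    return f"REMEDIATE:{RUNBOOKS[result.incident_type]}"


if __name__ == "__main__":
    print(decide(TriageResult("oomkill", 0.94)))
    print(decide(TriageResult("crashloop", 0.60)))
```

Checking the threshold before looking up the runbook means a low-confidence verdict never reaches the remediation path at all.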
| Layer | Technology |
|---|---|
| Orchestration | UiPath Maestro Case (7 stages, v1.0.0, published) |
| AI / LLM | Groq llama-3.3-70b-versatile |
| Human Loop | UiPath Apps + Action Center |
| Cost Analysis | OpenCost REST API |
| GitOps / Remediation | ArgoCD, kubectl, Kyverno |
| Agent Runtime | Python 3.11+ |
| Observability | structlog, JSON events |
| Tests | pytest — 17/17 passing |
This solution uses Coded Agents (Python 3.11+) — not Low-code / drag-and-drop.
Each of the five agents (DetectorAgent, TriageAgent, CostImpactAgent, RemediationAgent, NotificationAgent) is a standalone Python class wired into a UiPath Maestro Case stage. The agents are invoked by Maestro at runtime; all business logic is in Python, not in a visual workflow designer.
| Agent | File | Stage |
|---|---|---|
| DetectorAgent | `agents/detector_agent.py` | Stage 1 — Incident Detection |
| TriageAgent | `agents/triage_agent.py` | Stage 2 — AI Triage (Groq) |
| CostImpactAgent | `agents/cost_impact_agent.py` | Stage 3 — Cost Impact Analysis |
| RemediationAgent | `agents/remediation_agent.py` | Stage 5 — Execute Remediation |
| NotificationAgent | `agents/notification_agent.py` | Stage 7 — Post-Mortem |
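To illustrate what "a standalone Python class wired into a stage" can look like, here is a minimal sketch. The `Agent` protocol, the `run(context)` method, the stub classes, and the `run_pipeline` helper are all hypothetical; the real classes under `agents/` may expose a different interface, and in the actual solution Maestro, not a Python loop, invokes each agent.

```python
from typing import List, Optional, Protocol


class Agent(Protocol):
    """Hypothetical interface: each agent reads and extends a shared context dict."""

    def run(self, context: dict) -> dict: ...


class StubDetector:
    def run(self, context: dict) -> dict:
        # Stand-in for Stage 1: emit a hard-coded alert.
        return {**context, "alert": "OOMKill"}


class StubTriage:
    def run(self, context: dict) -> dict:
        # Stand-in for Stage 2: a fixed answer instead of a Groq call.
        return {**context, "runbook": "RB-001"}


def run_pipeline(agents: List[Agent], context: Optional[dict] = None) -> dict:
    """Call each agent in order, passing the accumulated context along."""
    context = dict(context or {})
    for agent in agents:
        context = agent.run(context)
    return context


if __name__ == "__main__":
    print(run_pipeline([StubDetector(), StubTriage()]))
```

Because each agent only depends on the context it receives, any single stage can be exercised in isolation, which is the property the per-agent tests rely on.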
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.11+ | python --version to verify |
| pip | 23+ | bundled with Python 3.11 |
| Git | any | for cloning |
| Groq API Key | — | console.groq.com — free tier is enough |
| UiPath Automation Cloud | — | cloud.uipath.com — free Community plan |
No Kubernetes cluster required.
`DEMO_MODE=true` simulates all K8s API calls locally. The full pipeline runs offline except for the Groq LLM call.
git clone https://github.com/sodiq-code/neurascale-ops
cd neurascale-ops

pip install -r requirements.txt

Key packages installed: groq, structlog, pydantic, httpx, pytest.
Create a .env file (or export directly):
cp .env.example .env # if the example file exists, else create manually

| Variable | Required | Default | Description |
|---|---|---|---|
| `GROQ_API_KEY` | ✅ Yes | — | Your Groq API key from console.groq.com |
| `DEMO_MODE` | No | `false` | Set `true` to simulate K8s/ArgoCD calls without a real cluster |
| `OPENCOST_URL` | No | `http://localhost:9003` | OpenCost API base URL (only needed in live mode) |
| `SLACK_WEBHOOK_URL` | No | — | Slack incoming webhook for notifications |
| `UIPATH_TENANT` | No | `DefaultTenant` | UiPath Automation Cloud tenant name |
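For readers who want to see how these variables and defaults fit together, the sketch below loads them into a typed settings object. The `Settings` dataclass and `load_settings` function are hypothetical and are not claimed to match the project's own configuration code; only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    groq_api_key: str
    demo_mode: bool
    opencost_url: str
    slack_webhook_url: Optional[str]
    uipath_tenant: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("GROQ_API_KEY")
    if not api_key:
        # GROQ_API_KEY is the only required variable.
        raise RuntimeError("GROQ_API_KEY is required")
    return Settings(
        groq_api_key=api_key,
        demo_mode=env.get("DEMO_MODE", "false").strip().lower() == "true",
        opencost_url=env.get("OPENCOST_URL", "http://localhost:9003"),
        slack_webhook_url=env.get("SLACK_WEBHOOK_URL") or None,
        uipath_tenant=env.get("UIPATH_TENANT", "DefaultTenant"),
    )


if __name__ == "__main__":
    print(load_settings({"GROQ_API_KEY": "gsk_xxxxxxxxxxxx", "DEMO_MODE": "true"}))
```

Failing fast on a missing key surfaces a configuration mistake before the pipeline reaches the triage stage.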
Minimum setup (demo mode):
export GROQ_API_KEY=gsk_xxxxxxxxxxxx
export DEMO_MODE=true

python main.py --scenario oomkill

Available scenarios: oomkill, crashloop, policy, cost, deploy, all
# Run all 5 scenarios in one pass
python main.py --scenario all

python -m pytest tests/test_pipeline.py -v

Expected output: 17/17 tests passing in under 2 seconds.
- Log in to cloud.uipath.com → open your tenant
- Navigate to Maestro → Cases
- Click New Case → Import from JSON
- Upload `uipath/maestro_case/case_definition.json`
- The 7-stage case plan will be imported with all stage definitions, SLAs, and escalation rules
- Click Publish → the case is ready to trigger
The published v1.0.0 case on `DefaultTenant` (Sodiq Jimoh's account) is already live; the import step is only needed if you want to run it on your own tenant.
The full case definition is at uipath/maestro_case/case_definition.json. Import into UiPath Maestro Studio via Cases → New Case → Import from JSON.
This project was built using Claude Code (Anthropic) as an AI coding assistant. Full session logs documenting how the agents, pipeline, and Maestro case were built are in docs/coding-agents/claude-sessions/.
UiPath AgentHack 2026 · Track 1: Maestro Case · Built by Sodiq Jimoh







