NeuroScale Ops Agent

Autonomous AI-powered operations for Kubernetes ML platforms, built on Splunk.

Live Demo

▶ Try it live → neuroscale-ops-agent.streamlit.app

What This Is

NeuroScale Ops Agent is a fully autonomous AI-powered operations platform for Kubernetes-based machine learning infrastructure — with Splunk as its observability backbone and reasoning engine.

It is built from the ground up as a complete system: a production-grade Kubernetes ML platform (ArgoCD + KServe + Kyverno + OpenCost) wired end-to-end into Splunk, with an LLM agent that can query, reason, and act — autonomously.

No human in the loop. No toy demo. A real self-healing ops system.

Real-time telemetry ingestion → KServe, Kyverno, OpenCost, and ArgoCD events stream into Splunk via HEC across 4 concurrent threads
SPL-powered anomaly detection → Splunk threshold alerts fire automatically on model failures, cost spikes, and policy violations
MCP-connected reasoning → The agent queries live Splunk data mid-reasoning via Model Context Protocol — every answer cites its data source
Runbook RAG → Every remediation action is grounded in documented runbooks, not hallucination
Autonomous self-healing → 3 end-to-end workflows: model recovery, policy remediation, and cost optimization. The first two run without human intervention; the cost workflow prompts for approval before scaling down
Operator dashboard → Streamlit UI exposes the full reasoning chain, SPL queries, and action log in real time

Hackathon Track: Platform & Developer Experience
Bonus Target: Best Use of Splunk MCP Server

Architecture

K8s Cluster (k3d)
  ├── KServe   (model inference)
  ├── Kyverno  (policy engine)      ──► splunk-integration/k8s_to_splunk.py
  ├── OpenCost (cost monitoring)         (4 threads, 30s interval, HEC)
  └── ArgoCD   (GitOps)                          │
                                          Splunk Index: neuroscale
                                          ├── Alerts (SPL thresholds)
                                          └── MCP Server ──► agent/core.py (Llama 3.3 70B)
                                                                  │
                                              ┌────────────────────┴────────────────────┐
                                              ▼                   ▼                     ▼
                                        runbook_rag          splunk_client           k8s_ops
                                              └──────── workflows/ ────────────────────┘
                                                               │
                                                       ui/app.py (Streamlit)

See architecture_diagram.md for the full Mermaid diagram.
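
The forwarder layout in the diagram (one thread per source, 30s interval) could be structured roughly as below. This is a sketch under assumptions, not the contents of `splunk-integration/k8s_to_splunk.py`: `run_collector`, `start_forwarder`, the collector callables, and `send` are hypothetical names.

```python
import threading
from typing import Callable, Dict, List

# Hypothetical collector signature: returns a list of JSON-serialisable events.
Collector = Callable[[], List[dict]]
Sender = Callable[[str, dict], None]


def run_collector(sourcetype: str, collect: Collector, send: Sender,
                  stop: threading.Event, interval: float = 30.0) -> None:
    """Poll one source until `stop` is set, forwarding each event."""
    while not stop.is_set():
        try:
            for event in collect():
                send(sourcetype, event)
        except Exception as exc:  # one failed poll should not kill the thread
            print(f"[{sourcetype}] collection failed: {exc}")
        # Event.wait returns early once stop is set, so shutdown is prompt.
        stop.wait(interval)


def start_forwarder(collectors: Dict[str, Collector], send: Sender,
                    stop: threading.Event,
                    interval: float = 30.0) -> List[threading.Thread]:
    """Start one daemon thread per sourcetype and return the threads."""
    threads = []
    for sourcetype, collect in collectors.items():
        thread = threading.Thread(
            target=run_collector,
            args=(sourcetype, collect, send, stop, interval),
            name=f"fwd-{sourcetype}",
            daemon=True,
        )
        thread.start()
        threads.append(thread)
    return threads
```

Passing four collectors (models, costs, policies, ArgoCD) gives the four concurrent threads; setting `stop` ends all of them within one wait cycle.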

Screenshots

Operator Dashboard	Cost Attribution (OpenCost → Splunk)

Kyverno Policy Violations	Agent Reasoning (MCP + SPL)

Splunk Query Results	Self-Healing Workflow

Splunk Capabilities Used

| Capability | Implementation | Prize Target |
| --- | --- | --- |
| Splunk MCP Server | Agent queries live Splunk data mid-reasoning via Model Context Protocol (7 structured MCP tools) | MCP Bonus ($1K) |
| Python SDK | `tools/splunk_client.py`: SDK-based REST queries, index management, SPL execution | Core |
| HEC Ingestion | `splunk-integration/k8s_to_splunk.py`: 4 threads, structured JSON events per sourcetype | Core |
| SPL Queries | Model health, cost breakdown, policy violations, ArgoCD status, error timelines | Core |
| Alert Webhooks | `splunk-integration/alert-actions/trigger_agent.py`: Splunk alert fires → agent acts automatically | Core |
| Custom Index | `neuroscale` index with 4 sourcetypes: `neuroscale:models`, `neuroscale:costs`, `neuroscale:policies`, `neuroscale:argocd` | Core |
| `splunk_generate_spl` tool | LLM-generated SPL from natural language, grounded in the neuroscale schema | Core |
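
For reference, an event bound for the `neuroscale` index could be wrapped and posted to HEC along these lines. This is a minimal sketch, not the code in `splunk-integration/k8s_to_splunk.py`: the helper names, the event fields (`model`, `error_rate`), and the use of port 8088 (Splunk's default HEC port) are assumptions.

```python
import json
import time
import urllib.request
from typing import Optional


def build_hec_event(sourcetype: str, fields: dict, index: str = "neuroscale",
                    ts: Optional[float] = None) -> dict:
    """Wrap a structured event in the envelope the HEC event endpoint expects."""
    return {
        "time": ts if ts is not None else time.time(),  # epoch seconds
        "index": index,
        "sourcetype": sourcetype,
        "event": fields,
    }


def send_to_hec(payload: dict, host: str, token: str, port: int = 8088,
                timeout: float = 5.0) -> int:
    """POST one event to HEC and return the HTTP status code.

    Assumes the endpoint presents a certificate the local trust store
    accepts; a self-signed dev instance would need extra TLS handling.
    """
    request = urllib.request.Request(
        f"https://{host}:{port}/services/collector/event",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical model-health event for the neuroscale:models sourcetype.
example = build_hec_event(
    "neuroscale:models",
    {"model": "fraud-detector", "error_rate": 0.07},
    ts=1700000000.0,
)
```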

Agent Tools (14 Total)

| Tool | What It Does |
| --- | --- |
| `query_splunk` | Run arbitrary SPL against the `neuroscale` index |
| `get_model_health` | KServe inference service health, latency, error rates |
| `get_policy_violations` | Kyverno blocks and warnings from Splunk |
| `get_cost_attribution` | Per-namespace hourly cost from OpenCost → Splunk |
| `get_error_timeline` | Time-series error trend for any model or namespace |
| `lookup_runbook` | RAG retrieval from the platform runbook |
| `trigger_argocd_sync` | Force-sync an ArgoCD application |
| `restart_inference_service` | kubectl rollout restart on a KServe InferenceService |
| `patch_memory_limit` | Adjust memory limits on an InferenceService |
| `get_cluster_overview` | Recent K8s events across any namespace |
| `get_cost_direct` | Direct OpenCost API query (bypasses Splunk) |
| `splunk_security_analysis` | Splunk Foundation-Sec model for security triage |
| `splunk_forecast` | Cisco Deep Time Series forecasting via Splunk AI Toolkit |
| `splunk_generate_spl` | Natural language → SPL query generation |
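
To illustrate how a tool like these is typically exposed to a function-calling model, here is a sketch using the OpenAI-compatible tool schema format. It is not taken from `agent/core.py`: the `model_name` parameter, the placeholder handler body, and the `dispatch_tool_call` helper are assumptions.

```python
import json
from typing import Callable, Dict

# Schema the LLM sees. The parameter name is a guess for illustration.
TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "get_model_health",
            "description": "KServe inference service health, latency, error rates",
            "parameters": {
                "type": "object",
                "properties": {"model_name": {"type": "string"}},
                "required": ["model_name"],
            },
        },
    },
]


def get_model_health(model_name: str) -> dict:
    # Placeholder: the real tool would query Splunk for this model.
    return {"model": model_name, "status": "unknown"}


TOOL_REGISTRY: Dict[str, Callable[..., dict]] = {
    "get_model_health": get_model_health,
}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the tool the model asked for and return a JSON string result."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        arguments = json.loads(arguments_json or "{}")
        return json.dumps(handler(**arguments))
    except (TypeError, ValueError) as exc:  # bad JSON or wrong arguments
        return json.dumps({"error": str(exc)})
```

Returning errors as JSON rather than raising lets the model see the failure and retry with corrected arguments.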

Self-Healing Workflows

1. Model Down (`workflows/model_down.py`)

Trigger: KServe model error rate > 5% for 5 minutes

Pull model telemetry from Splunk
Check ArgoCD sync status
Retrieve runbook steps via RAG
Restart InferenceService
Poll until healthy (5 retries × 30s)
Write resolution event to Splunk
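
The polling step above could be sketched as follows. The function name and the injectable `sleep` are assumptions, and "5 retries × 30s" is read here as up to five health checks with a 30-second pause between them.

```python
import time
from typing import Callable


def wait_until_healthy(is_healthy: Callable[[], bool], retries: int = 5,
                       delay_s: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True as soon as a check passes, False once retries run out."""
    for attempt in range(1, retries + 1):
        if is_healthy():
            return True
        if attempt < retries:  # no point sleeping after the final check
            sleep(delay_s)
    return False
```

`is_healthy` would wrap whatever readiness check the workflow uses; injecting `sleep` keeps the loop testable without waiting.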

2. Policy Violation (`workflows/policy_violation.py`)

Trigger: Kyverno BLOCK action appears in Splunk

Identify the violating resource and policy
Retrieve compliance runbook
Annotate resource for review
Trigger ArgoCD sync to restore desired state
Verify no new violations within 60s
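
The final verification step could be expressed as a filter over events pulled back from Splunk. This is a sketch only: the event fields (`resource`, `action`, `time` as epoch seconds) and both function names are assumptions about the schema.

```python
from typing import Iterable, List


def violations_after(events: Iterable[dict], resource: str,
                     remediated_at: float, window_s: float = 60.0) -> List[dict]:
    """BLOCK events for `resource` within `window_s` seconds after remediation."""
    deadline = remediated_at + window_s
    return [
        event for event in events
        if event.get("resource") == resource
        and event.get("action") == "BLOCK"
        and remediated_at < event.get("time", 0.0) <= deadline
    ]


def remediation_verified(events: Iterable[dict], resource: str,
                         remediated_at: float, window_s: float = 60.0) -> bool:
    """True when no new violation appeared inside the verification window."""
    return not violations_after(events, resource, remediated_at, window_s)
```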

3. Cost Spike (`workflows/cost_spike.py`)

Trigger: Hourly namespace cost > $50 threshold

Pull OpenCost breakdown from Splunk
Identify highest-spend namespace
Check if workloads are over-provisioned
Scale down replica count (with user approval prompt)
Project new cost and log to Splunk
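
The selection and projection steps reduce to a little arithmetic. The sketch below assumes cost scales linearly with replica count and uses a halve-the-replicas policy purely for illustration; the function names and the namespace names in the usage note are hypothetical.

```python
from typing import Dict, Optional, Tuple


def highest_spend(costs: Dict[str, float]) -> Tuple[str, float]:
    """Return the (namespace, hourly cost) pair with the largest cost."""
    if not costs:
        raise ValueError("no cost data")
    namespace = max(costs, key=costs.get)
    return namespace, costs[namespace]


def project_cost(hourly_cost: float, current_replicas: int,
                 target_replicas: int) -> float:
    """Projected hourly cost, assuming cost is proportional to replicas."""
    if current_replicas <= 0 or target_replicas < 0:
        raise ValueError("replica counts must be positive")
    return hourly_cost * target_replicas / current_replicas


def plan_scale_down(costs: Dict[str, float], replicas: Dict[str, int],
                    threshold: float = 50.0) -> Optional[dict]:
    """Build a scale-down proposal, or None if nothing exceeds the threshold."""
    namespace, cost = highest_spend(costs)
    if cost <= threshold:
        return None
    current = replicas[namespace]
    target = max(1, current // 2)  # illustrative policy: halve, keep at least 1
    return {
        "namespace": namespace,
        "current_replicas": current,
        "target_replicas": target,
        "current_cost": cost,
        "projected_cost": project_cost(cost, current, target),
    }
```

For example, with `{"ml-prod": 62.0, "ml-dev": 18.5}` as hourly costs and 4 replicas in `ml-prod`, the proposal is to drop to 2 replicas at a projected $31.00 per hour. The proposal is only a plan; the approval prompt would sit between this step and the actual scale-down.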

Quick Start

Prerequisites

Python 3.11+
kubectl configured (or use DEMO_MODE=true)
Splunk instance with HEC enabled (or use DEMO_MODE=true)
Groq API key (free at console.groq.com)

1. Clone and install

git clone https://github.com/sodiq-code/neuroscale-ops-agent
cd neuroscale-ops-agent
bash scripts/setup.sh

2. Configure

cp .env.example .env
# Edit .env with your keys

Key variables:

# LLM — Groq (free tier, fast inference)
OPENAI_API_KEY=gsk_...          # Your Groq API key
OPENAI_BASE_URL=https://api.groq.com/openai/v1
OPENAI_MODEL=llama-3.3-70b-versatile

# Splunk
SPLUNK_HOST=localhost
SPLUNK_HEC_TOKEN=your-hec-token
SPLUNK_INDEX=neuroscale

# Demo mode (no infrastructure required)
DEMO_MODE=false
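
As an illustration of how these variables might be consumed, the sketch below reads them into a typed settings object and fails early on missing values. The `Settings` class and `load_settings` function are hypothetical, and the validation rules (API key always required, HEC token required outside demo mode) are assumptions inferred from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_key: str
    base_url: str
    model: str
    splunk_host: str
    splunk_hec_token: str
    splunk_index: str
    demo_mode: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, validating required keys."""
    demo = env.get("DEMO_MODE", "false").strip().lower() == "true"
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise ValueError("OPENAI_API_KEY is required (a Groq key in this setup)")
    hec_token = env.get("SPLUNK_HEC_TOKEN", "")
    if not demo and not hec_token:
        raise ValueError("SPLUNK_HEC_TOKEN is required unless DEMO_MODE=true")
    return Settings(
        api_key=api_key,
        base_url=env.get("OPENAI_BASE_URL", "https://api.groq.com/openai/v1"),
        model=env.get("OPENAI_MODEL", "llama-3.3-70b-versatile"),
        splunk_host=env.get("SPLUNK_HOST", "localhost"),
        splunk_hec_token=hec_token,
        splunk_index=env.get("SPLUNK_INDEX", "neuroscale"),
        demo_mode=demo,
    )
```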

3. Start the data forwarder

# Streams K8s events into Splunk every 30s
python3 splunk-integration/k8s_to_splunk.py

4. Run the agent UI

streamlit run ui/app.py
# → http://localhost:8501
# Or use the hosted version: https://neuroscale-ops-agent.streamlit.app

Demo mode (zero infrastructure)

DEMO_MODE=true streamlit run ui/app.py

All cluster and Splunk calls return realistic synthetic data. Full agent reasoning still fires. Anyone can run this in 2 minutes with only a Groq API key.
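
One way such a switch can work is to branch at the data layer so the agent code stays unchanged. The sketch below is an assumption about the pattern, not the project's implementation: the function names, the synthetic field names, and the value ranges are invented for illustration.

```python
import os
import random
from typing import Callable, Mapping


def demo_mode_enabled(env: Mapping[str, str] = os.environ) -> bool:
    return env.get("DEMO_MODE", "false").strip().lower() == "true"


def synthetic_model_health(model_name: str, seed: int = 0) -> dict:
    """Deterministic fake telemetry so demo runs are repeatable."""
    rng = random.Random(f"{seed}:{model_name}")
    return {
        "model": model_name,
        "error_rate": round(rng.uniform(0.0, 0.12), 4),
        "p95_latency_ms": rng.randint(40, 900),
        "synthetic": True,
    }


def get_model_health(model_name: str,
                     fetch_live: Callable[[str], dict],
                     env: Mapping[str, str] = os.environ) -> dict:
    """Return synthetic data in demo mode, otherwise call the live backend."""
    if demo_mode_enabled(env):
        return synthetic_model_health(model_name)
    return fetch_live(model_name)
```

Because the branch happens before `fetch_live` is called, demo mode never touches Splunk or the cluster.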

Seed demo data into Splunk

# Populates your Splunk index with realistic K8s events
python3 splunk-integration/seed_demo_data.py

Splunk Setup

See docs/SPLUNK_SETUP.md for:

Docker-based Splunk in 2 minutes
HEC token creation (UI + CLI)
MCP server configuration
Alert action webhook setup

Repository Structure

neuroscale-ops-agent/
├── agent/
│   └── core.py                          # Llama 3.3 70B function-calling loop (14 tools)
├── tools/
│   ├── splunk_client.py                 # Splunk HEC + SDK + SPL query engine
│   ├── runbook_rag.py                   # Keyword RAG over runbook.md
│   ├── kubernetes_ops.py                # kubectl / ArgoCD / KServe operations
│   └── splunk_hosted_models.py          # Splunk AI Toolkit hosted model integrations
├── workflows/
│   ├── model_down.py                    # Model failure → auto-recovery
│   ├── policy_violation.py              # Kyverno violation → remediation
│   └── cost_spike.py                    # Cost spike → scale-down
├── splunk-integration/
│   ├── k8s_to_splunk.py                 # 4-thread real-time K8s→Splunk forwarder
│   ├── seed_demo_data.py                # Demo data seeder (HEC population)
│   └── alert-actions/
│       └── trigger_agent.py             # Splunk alert webhook handler
├── ui/
│   └── app.py                           # Streamlit operator dashboard
├── assets/
│   ├── architecture.png                 # Architecture diagram
│   └── screenshot_*.png                 # Live demo screenshots
├── docs/
│   ├── runbook.md                       # Platform runbook (source for RAG)
│   └── SPLUNK_SETUP.md                  # Splunk + HEC + MCP setup guide
├── scripts/
│   ├── setup.sh                         # One-command setup
│   └── smoke-test-extended.sh           # Full connectivity smoke test
├── k8s-manifests/                       # Kubernetes manifests (ArgoCD, KServe, Kyverno, OpenCost)
├── .github/workflows/ci.yml             # Lint + import smoke test CI
├── .env.example                         # Environment variable template
├── requirements.txt                     # Python dependencies
├── architecture_diagram.md              # Mermaid architecture diagram
└── LICENSE                              # MIT

What's Inside

A complete, self-contained system — every component purpose-built for this project:

| Component | What It Does |
| --- | --- |
| `splunk-integration/` | Real-time K8s→Splunk HEC pipeline (4 threads, 30s interval) |
| `agent/core.py` | Llama 3.3 70B agentic reasoning loop with 14 function-calling tools |
| `tools/splunk_client.py` | Splunk SDK + HEC + MCP client, the agent's data layer |
| `tools/runbook_rag.py` | Keyword RAG over operational runbooks; grounds every action |
| `tools/kubernetes_ops.py` | Programmatic cluster operations via kubectl and ArgoCD API |
| `tools/splunk_hosted_models.py` | Foundation-Sec, Deep Time Series, GPT-OSS Splunk AI integrations |
| `workflows/` | 3 autonomous end-to-end remediation workflows |
| `ui/app.py` | Streamlit operator dashboard with live reasoning panel |
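
Keyword RAG of the kind attributed to `tools/runbook_rag.py` can be as simple as splitting the runbook on headings and ranking sections by word overlap with the query. The sketch below shows that general technique under assumptions; `split_sections`, `tokenize`, and `search_runbook` are hypothetical names, and the real module may score sections differently.

```python
import re
from typing import List, Set, Tuple


def split_sections(markdown: str) -> List[Tuple[str, str]]:
    """Split a markdown document into (heading, body) pairs."""
    sections: List[Tuple[str, str]] = []
    title, body = "preamble", []
    for line in markdown.splitlines():
        if line.startswith("#"):
            text = "\n".join(body).strip()
            if text:
                sections.append((title, text))
            title, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    text = "\n".join(body).strip()
    if text:
        sections.append((title, text))
    return sections


def tokenize(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_runbook(query: str, markdown: str,
                   top_k: int = 2) -> List[Tuple[str, str]]:
    """Return up to top_k sections sharing the most words with the query."""
    query_tokens = tokenize(query)
    scored = []
    for title, body in split_sections(markdown):
        score = len(query_tokens & tokenize(f"{title} {body}"))
        if score:
            scored.append((score, title, body))
    scored.sort(key=lambda item: item[0], reverse=True)  # stable for ties
    return [(title, body) for _, title, body in scored[:top_k]]
```

Sections with no overlap are dropped, so an unrelated query returns an empty list rather than an arbitrary section.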

Smoke Test

# Full test (requires Splunk + env vars)
bash scripts/smoke-test-extended.sh

# Demo mode (no infrastructure needed)
bash scripts/smoke-test-extended.sh --demo

Checks: Python imports, file integrity, syntax, Splunk HEC connectivity, agent module loads, runbook RAG, workflow imports, README completeness, MIT license.

Hackathon Context

Built for the Splunk Agentic Ops Hackathon 2026 (2,052 participants; deadline June 15, 2026).

Key differentiators:

Real production-grade platform (not a toy demo) extended with Splunk
MCP-connected agent with full function-calling reasoning loop
Demo mode needs no cluster or Splunk instance, only a Groq API key: minimal friction for evaluators
4 Splunk sourcetypes, 14 agent tools, 3 self-healing workflows
Runbook RAG grounds every action in documented procedures
Self-healing loop: detect anomaly → Splunk alert → agent reasons → runbook → kubectl → verify → report back to Splunk

License

MIT — see LICENSE.

Author

Sodiq Jimoh (Afsod) — Platform Engineer
GitHub · LinkedIn
