Overlxrd-uwu/SADE-NetworkAgent

SADE: Symptom-Aware Diagnostic Escalation for LLM-Based Network Troubleshooting

Overview · What SADE adds · Installation · Quick Start · Reproducing the paper · Repo layout · Acknowledgements

Built on top of NIKA. SADE uses NIKA's unmodified orchestrator, fault-injection platform, and four-step evaluation pipeline. We add a phase-gated diagnostic workflow, a 15-skill library, two new Claude-Code-based agents, and a reproducibility data pack on top — see What SADE adds for the full list.

Overview

SADE is a methodology-grounded LLM agent for network troubleshooting on the NIKA benchmark. It pairs a phase-gated diagnostic workflow that separates evidence acquisition from hypothesis commitment with a 15-skill library (12 fault-family books, a diagnosis manual with 12 read-only helper scripts, and 2 utility books). This repo includes SADE plus the two baselines used in the paper:

  • SADE — Claude Code + phase-gated workflow + skill index library
  • CC-Baseline — same Claude Code backbone, no SADE policy
  • ReAct + GPT-5 — the original NIKA baseline

All three agents plug into the unmodified NIKA orchestrator and four-step evaluation pipeline, so the comparison is on identical (problem, scenario, topology) triples.

What SADE adds on top of NIKA

NIKA contributes the network-incident benchmark, the Kathará-based fault-injection environment, and the four-step evaluation pipeline. Everything under src/nika/ and the step1–step4 scripts under src/scripts/ is unmodified upstream NIKA. SADE adds:

  1. A phase-gated diagnostic workflow (src/agent/prompts/sade_prompt.py) — five phases (blind start → branch → symptom-first diagnosis → broad-search escalation → submission) that separate evidence acquisition from hypothesis commitment.
  2. A 15-skill library wired into Claude Code's Skill tool (src/agent/.claude/skills/):
    • 12 fault-family books mapping symptoms to confirmation patterns
    • 1 diagnosis manual (diagnosis-methodology-skill) with 12 read-only helper scripts (infra_sweep, l2_snapshot, ospf_snapshot, tc_snapshot, service_snapshot, pressure_sweep, ...)
    • 2 utility books (baseline-behavior-skill for symptom gating, big-return-skill for oversized output handling)
  3. A helper-script launcher (h.py at the repo root) that the agent invokes as python ../../h.py <script>. See the h.py subsection below for usage.
  4. Two new agents plugged into the unmodified NIKA pipeline: SADE (claude-code-sade) and CC-Baseline (claude-code). Both use the claude-agent-sdk.
  5. A held-out train/test split of NIKA's 640-incident pool (benchmark/benchmark_train.csv, benchmark/benchmark_test.csv) so skill design and evaluation are kept separate.
  6. Three-way evaluation of SADE / CC-Baseline / ReAct + GPT-5 on identical (problem, scenario, topology) triples, regenerable end to end via Research_results/build_research_results.py.
  7. Pipeline robustness fixes that make 500-case batch runs feasible: automatic Docker/Kathará recovery in benchmark/run_benchmark.py, native Claude Code SDK token accounting in src/nika/evaluator/trace_parser.py, explicit UTF-8 encoding for Windows compatibility, and skip-row recording for setup failures.
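
The phase-gating idea in item 1 can be illustrated with a small state machine. This is only a sketch, not the logic in sade_prompt.py: the five phase names come from the list above, while the Session class, the forward-only rule, and the "no submission without evidence" gate are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Phase(IntEnum):
    # Phase names follow the README; the numbering is simply the listed order.
    BLIND_START = 1
    BRANCH = 2
    SYMPTOM_FIRST_DIAGNOSIS = 3
    BROAD_SEARCH_ESCALATION = 4
    SUBMISSION = 5


@dataclass
class Session:
    """Hypothetical session tracker (not part of the SADE code base)."""
    phase: Phase = Phase.BLIND_START
    evidence: list[str] = field(default_factory=list)

    def record(self, observation: str) -> None:
        """Collect evidence in any phase before submission."""
        if self.phase is Phase.SUBMISSION:
            raise RuntimeError("no evidence collection after submission")
        self.evidence.append(observation)

    def advance(self, target: Phase) -> None:
        """Move forward only; refuse to submit without recorded evidence."""
        if target <= self.phase:
            raise ValueError(
                f"cannot move from {self.phase.name} back to {target.name}"
            )
        # ASSUMPTION: the gate is modelled as "at least one observation".
        if target is Phase.SUBMISSION and not self.evidence:
            raise ValueError("hypothesis commitment requires recorded evidence")
        self.phase = target
```

In this sketch a session that calls advance(Phase.SUBMISSION) before any record(...) call is rejected, which is one simple way to keep evidence acquisition ahead of hypothesis commitment.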

Installation

Requirements

  • Kathará — follow the official installation guide.
  • Python ≥ 3.12
  • Docker (Kathará dependency)
  • API access:
    • Anthropic (for SADE and CC-Baseline runners — claude-sonnet-4-6)
    • OpenAI (for ReAct runs and the LLM-as-judge — gpt-5-mini)

Setup

# Clone
git clone https://github.com/Overlxrd-uwu/SADE-NetworkAgent.git
cd SADE-NetworkAgent

Install the Python dependencies. claude-agent-sdk is pinned in pyproject.toml, so either option below installs it.

Option A — uv (recommended). Install uv first if you don't have it:

# Linux / macOS
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows PowerShell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Then sync and activate the venv:

uv sync
source .venv/bin/activate          # macOS / Linux
# .\.venv\Scripts\Activate.ps1     # Windows PowerShell

Option B — plain pip.

python -m venv .venv
source .venv/bin/activate          # macOS / Linux
# .\.venv\Scripts\Activate.ps1     # Windows PowerShell
pip install -e .

Add the current user to the docker group so Kathará calls don't need sudo (Linux only — security implications: see Docker docs):

sudo usermod -aG docker $USER
newgrp docker

Build the customised Kathará Docker images that NIKA uses for fault injection:

bash src/nika/net_env/utils/DockerFiles/build_dockers.sh

Environment variables

Create a .env at the repo root:

BASE_DIR=/absolute/path/to/SADE-NetworkAgent

# SADE / CC-Baseline runners
ANTHROPIC_API_KEY=sk-ant-...

# ReAct runner + LLM-as-judge
OPENAI_API_KEY=sk-...

# Optional observability
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=
LANGFUSE_SECRET_KEY=
LANGFUSE_PUBLIC_KEY=
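
Before running anything that costs API credits, it can help to check that the .env file has the keys listed above. The following is a minimal sketch, assuming plain KEY=VALUE lines (no quoting or export prefixes) and assuming you intend to run all three agents plus the judge; the function names are hypothetical and not part of the repository.

```python
from pathlib import Path

# Keys taken from the .env template above; observability keys are optional.
REQUIRED = ("BASE_DIR", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text: str, required=REQUIRED) -> list[str]:
    """Return the required keys that are absent or empty."""
    env = parse_env(text)
    return [key for key in required if not env.get(key)]


def check_env_file(path=".env") -> list[str]:
    """Read an .env file from disk and report missing required keys."""
    return missing_keys(Path(path).read_text(encoding="utf-8"))
```

Calling check_env_file() from the repo root returns an empty list when all three keys are set.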

Sanity check

After setup, confirm that the SADE skill library and helper launcher are in place. Expected: 15 skill directories (12 fault-family + diagnosis-methodology-skill + baseline-behavior-skill + big-return-skill).

# Linux / macOS
ls src/agent/.claude/skills/ | wc -l
# Windows PowerShell
(Get-ChildItem src/agent/.claude/skills/).Count

Then verify the helper launcher resolves the skill paths:

python h.py     # should print a usage message + helper list, not "helper not found"

If the count is below 15, your clone or extraction skipped hidden directories — re-clone with git clone (not "Download ZIP" from the GitHub UI), since the archive endpoint occasionally drops dot-prefixed paths.
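
The same count can be done from Python on any platform. This is a small sketch rather than a script shipped with the repo; unlike ls | wc -l it counts directories only, so a stray file in the skills folder does not inflate the number. The function names are hypothetical.

```python
from pathlib import Path

EXPECTED_SKILLS = 15  # 12 fault-family + 1 diagnosis manual + 2 utility books


def list_skill_dirs(skills_root) -> list[str]:
    """Return the names of the sub-directories under the skills root."""
    return sorted(p.name for p in Path(skills_root).iterdir() if p.is_dir())


def check_skills(skills_root="src/agent/.claude/skills") -> bool:
    """Print the directory count and report whether it matches the expected 15."""
    names = list_skill_dirs(skills_root)
    print(f"{len(names)} skill directories found (expected {EXPECTED_SKILLS})")
    return len(names) == EXPECTED_SKILLS
```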

Quick Start

NIKA evaluation is a four-step pipeline. Run it once on a single incident to confirm the install works end to end.

Step 1 — Spin up the Kathará lab for one scenario

python src/scripts/step1_net_env_start.py --scenario simple_ospf --topo_size s

Step 2 — Inject a fault

python src/scripts/step2_failure_inject.py --root_cause_name ospf_neighbor_missing

Step 3 — Run the agent (pick one)

# SADE (Claude Code + phase-gated workflow + skill library)
python src/scripts/step3_agent_run.py --agent-type claude-code-sade --model claude-sonnet-4-6 --max-steps 20

# CC-Baseline (Claude Code, no SADE policy)
python src/scripts/step3_agent_run.py --agent-type claude-code      --model claude-sonnet-4-6 --max-steps 20

# ReAct + GPT-5 (original NIKA baseline)
python src/scripts/step3_agent_run.py --agent-type react            --llm-backend openai --model gpt-5 --max-steps 20

Step 4 — Score the run (LLM-as-judge + grading)

python src/scripts/step4_result_eval.py --judge-model gpt-5-mini

A successful run appends one row to results/0_summary/evaluation_summary.csv with populated in_tokens, out_tokens, tool_calls, judge scores, and detection / localisation / RCA metrics.
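
To confirm that the last run was recorded properly, you can inspect the final row of the summary file. The sketch below assumes that in_tokens, out_tokens, and tool_calls are the literal column headers in evaluation_summary.csv; adjust the names if your header differs. The helper functions are hypothetical.

```python
import csv


def last_summary_row(path="results/0_summary/evaluation_summary.csv") -> dict[str, str]:
    """Return the most recently appended row of the summary CSV."""
    with open(path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    if not rows:
        raise ValueError(f"{path} has no result rows")
    return rows[-1]


def unpopulated(row, fields=("in_tokens", "out_tokens", "tool_calls")) -> list[str]:
    """List the fields that are missing or blank in a row.

    ASSUMPTION: the default field names match the CSV header exactly.
    """
    return [name for name in fields if not (row.get(name) or "").strip()]
```

An empty list from unpopulated(last_summary_row()) suggests the token and tool-call accounting reached the summary.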

h.py — the SADE helper launcher

h.py (at the repo root) is the single entry point the SADE agent uses to run the diagnosis-methodology helper scripts. From the agent's working directory (src/agent/), it invokes them as:

python ../../h.py infra_sweep              # one-pass nft / addressing / routing / ARP / resolver / link sweep
python ../../h.py ospf_snapshot            # FRR + OSPF adjacency + per-interface state
python ../../h.py service_snapshot         # combined DNS + HTTP + localhost-HTTP + service-process
python ../../h.py                          # bare invocation lists every available helper

What the launcher does:

  • Resolves the helper's full path under src/agent/.claude/skills/diagnosis-methodology-skill/scripts/ (or bgp-fault-skill/scripts/, big-return-skill/scripts/).
  • Forwards the rest of the argv to the helper unchanged, so python ../../h.py infra_sweep --device router1 works exactly like invoking infra_sweep.py --device router1 directly.
  • Injects LAB_NAME from runtime/current_session.json into the helper's environment, so every helper targets the right Kathará lab without the agent having to remember it.

You can use h.py directly during debugging — python h.py <script> from the repo root works the same way (just without the ../../ prefix the agent uses from src/agent/).
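
The three launcher behaviours listed above (path resolution, argv forwarding, LAB_NAME injection) can be sketched as follows. This is not the source of h.py. It is a simplified illustration that assumes each helper is a file named <script>.py and that current_session.json stores the lab under a key called lab_name; both are guesses.

```python
import json
import os
import subprocess
import sys
from pathlib import Path

SKILLS = Path("src/agent/.claude/skills")
# Script directories named in the README, searched in this order.
SCRIPT_DIRS = [
    SKILLS / "diagnosis-methodology-skill" / "scripts",
    SKILLS / "bgp-fault-skill" / "scripts",
    SKILLS / "big-return-skill" / "scripts",
]


def resolve_helper(name, repo_root="."):
    """Return the first <name>.py found in the script directories, else None."""
    for directory in SCRIPT_DIRS:
        candidate = Path(repo_root) / directory / f"{name}.py"
        if candidate.is_file():
            return candidate
    return None


def build_env(repo_root=".", base_env=None):
    """Copy the environment and add LAB_NAME from the session file if present."""
    env = dict(os.environ if base_env is None else base_env)
    session_file = Path(repo_root) / "runtime" / "current_session.json"
    if session_file.is_file():
        session = json.loads(session_file.read_text(encoding="utf-8"))
        # ASSUMPTION: the JSON key is "lab_name"; the real key may differ.
        if "lab_name" in session:
            env["LAB_NAME"] = str(session["lab_name"])
    return env


def launch(argv, repo_root="."):
    """Run a helper with the remaining argv forwarded unchanged."""
    if not argv:
        dirs = [Path(repo_root) / d for d in SCRIPT_DIRS]
        names = sorted(p.stem for d in dirs if d.is_dir() for p in d.glob("*.py"))
        print("usage: h.py <helper> [args...]")
        print("helpers:", ", ".join(names))
        return 2
    helper = resolve_helper(argv[0], repo_root)
    if helper is None:
        print(f"helper not found: {argv[0]}")
        return 1
    cmd = [sys.executable, str(helper), *argv[1:]]
    return subprocess.run(cmd, env=build_env(repo_root)).returncode
```

Keeping the lab name in the environment rather than in argv means helpers see the same arguments whether they are launched through the wrapper or run directly.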

Run the full benchmark

# Iterate the four steps over benchmark/benchmark_selected.csv
python benchmark/run_benchmark.py --agent-type claude-code-sade --model claude-sonnet-4-6 --max-steps 20

Each row in benchmark_selected.csv defines one (root cause, scenario, topology) triple. The runner tears down the Kathará lab between cases, so a full pass over the file costs both wall-clock time and API credits.
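
For a single triple, the four steps map onto four commands. The sketch below only assembles those command lines, using the flags shown in the Quick Start; it is not how run_benchmark.py is implemented, and it omits the lab teardown and Docker/Kathará recovery. The function name is hypothetical, and the ReAct agent would additionally need --llm-backend.

```python
import sys


def pipeline_commands(root_cause, scenario, topo_size,
                      agent_type="claude-code-sade",
                      model="claude-sonnet-4-6",
                      max_steps=20,
                      judge_model="gpt-5-mini"):
    """Return the step1..step4 command lines for one benchmark triple."""
    py = sys.executable
    return [
        [py, "src/scripts/step1_net_env_start.py",
         "--scenario", scenario, "--topo_size", topo_size],
        [py, "src/scripts/step2_failure_inject.py",
         "--root_cause_name", root_cause],
        [py, "src/scripts/step3_agent_run.py",
         "--agent-type", agent_type, "--model", model,
         "--max-steps", str(max_steps)],
        [py, "src/scripts/step4_result_eval.py",
         "--judge-model", judge_model],
    ]
```

Each inner list can be passed to subprocess.run(cmd, check=True) from the repo root, in order.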

Reproducing the paper

Research_results/ ships pre-computed CSVs for the test-set 3-way comparison (SADE, CC-Baseline, ReAct + GPT-5) on the matched triples. To regenerate every figure used in the paper:

pip install matplotlib numpy pandas
python Research_results/build_research_results.py

That writes 11 PNGs into Research_results/figures/. Per-session conversation logs are not committed (they total ~580 MB); to regenerate them, re-run the full benchmark above for each agent.
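
The "matched triples" restriction behind the three-way comparison amounts to a set intersection over the cases each agent completed. The sketch below shows the idea on in-memory rows; the column names problem, scenario, and topology are assumptions and may not match the headers of the CSVs under Research_results/data/.

```python
def matched_triples(*runs):
    """Return the (problem, scenario, topology) triples present in every run.

    Each run is an iterable of dict-like rows.
    ASSUMPTION: the three key names below are illustrative, not confirmed.
    """
    key_sets = [
        {(row["problem"], row["scenario"], row["topology"]) for row in rows}
        for rows in runs
    ]
    return set.intersection(*key_sets) if key_sets else set()
```

Restricting every agent's results to this intersection keeps a case that one agent skipped (for example after a setup failure) from skewing the comparison.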

Repo layout

SADE-NetworkAgent/
├── benchmark/                # NIKA benchmark + selected slice
│   ├── benchmark_full.csv         # 640-incident pool
│   ├── benchmark_selected.csv     # held-out test slice (paper headline)
│   ├── benchmark_train.csv        # training split (skill design corpus)
│   ├── benchmark_test.csv         # held-out test split
│   ├── generate_benchmark.py
│   └── run_benchmark.py
├── h.py                      # SADE helper launcher (python h.py infra_sweep, etc.)
├── Research_results/         # paper data + figure regenerator
│   ├── data/                      # unified CSV + log-scanned tool-error CSV
│   ├── figures/                   # 11 PNGs (regenerated by build script)
│   ├── tables/                    # paper Table 1, per-family, time-efficiency, topology
│   └── build_research_results.py  # one-shot regenerator
├── run_nika_break.py         # manual fault-injection / verify-injection harness
└── src/
    ├── agent/
    │   ├── claude_code_agent.py        # CC-Baseline runner
    │   ├── claude_code_agent_sade.py   # SADE runner
    │   ├── react_agent.py              # ReAct + GPT-5 baseline
    │   ├── prompts/                    # baseline + sade system prompts
    │   └── .claude/skills/             # 15-skill library (SADE)
    ├── nika/                  # NIKA orchestrator (unmodified)
    └── scripts/               # step1–step4 pipeline (unmodified)

Acknowledgements

This repository is built on top of NIKA — the network-troubleshooting benchmark and orchestration platform from the SANDS Lab. NIKA contributes the 640-incident benchmark suite, the Kathará-based fault-injection environment, the four-step evaluation pipeline, and the LLM-as-judge scoring framework that this work depends on. SADE uses all of those unmodified; only the agent layer and the reproducibility data pack are ours.

About

A cleaned-up release of the repository, kept clear and reproducible, with no redundant legacy files.
