Overview · What SADE adds · Installation · Quick Start · Reproducing the paper · Repo layout · Acknowledgements
Built on top of NIKA. SADE uses NIKA's unmodified orchestrator, fault-injection platform, and four-step evaluation pipeline. We add a phase-gated diagnostic workflow, a 15-skill library, two new Claude-Code-based agents, and a reproducibility data pack on top — see What SADE adds for the full list.
SADE is a methodology-grounded LLM agent for network troubleshooting on the NIKA benchmark. It pairs a phase-gated diagnostic workflow that separates evidence acquisition from hypothesis commitment with a 15-skill library (12 fault-family books, a diagnosis manual with 12 read-only helper scripts, and 2 utility books). This repo includes SADE plus the two baselines used in the paper:
- SADE — Claude Code + phase-gated workflow + skill index library
- CC-Baseline — same Claude Code backbone, no SADE policy
- ReAct + GPT-5 — the original NIKA baseline
All three agents plug into the unmodified NIKA orchestrator and four-step evaluation pipeline, so the comparison is on identical (problem, scenario, topology) triples.
NIKA contributes the network-incident benchmark, the Kathará-based
fault-injection environment, and the four-step evaluation pipeline. Everything
under src/nika/ and src/scripts/step1–step4 is unmodified upstream NIKA.
SADE adds:
- A phase-gated diagnostic workflow (src/agent/prompts/sade_prompt.py): five phases (blind start → branch → symptom-first diagnosis → broad-search escalation → submission) that separate evidence acquisition from hypothesis commitment.
- A 15-skill library wired into Claude Code's Skill tool (src/agent/.claude/skills/):
  - 12 fault-family books mapping symptoms to confirmation patterns
  - 1 diagnosis manual (diagnosis-methodology-skill) with 12 read-only helper scripts (infra_sweep, l2_snapshot, ospf_snapshot, tc_snapshot, service_snapshot, pressure_sweep, ...)
  - 2 utility books (baseline-behavior-skill for symptom gating, big-return-skill for oversized output handling)
- A helper-script launcher (h.py at the repo root) that the agent invokes as python ../../h.py <script>. See the h.py subsection below for usage.
- Two new agents plugged into the unmodified NIKA pipeline: SADE (claude-code-sade) and CC-Baseline (claude-code). Both use the claude-agent-sdk.
- A held-out train/test split of NIKA's 640-incident pool (benchmark/benchmark_train.csv, benchmark/benchmark_test.csv) so skill design and evaluation are kept separate.
- Three-way matched evaluation across SADE / CC-Baseline / ReAct + GPT-5 on the matched (problem, scenario, topology) triples, regenerable end-to-end via Research_results/build_research_results.py.
- Pipeline robustness fixes that make 500-case batch runs feasible: auto Docker/Kathará recovery in benchmark/run_benchmark.py, native Claude Code SDK token accounting in src/nika/evaluator/trace_parser.py, UTF-8 explicit encoding for Windows compatibility, and skip-row recording for setup failures.
- Kathará — follow the official installation guide.
- Python ≥ 3.12
- Docker (Kathará dependency)
- API access:
  - Anthropic (for SADE and CC-Baseline runners, claude-sonnet-4-6)
  - OpenAI (for ReAct runs and the LLM-as-judge, gpt-5-mini)
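The prerequisites above can be checked before installing anything. The following is a minimal sketch, not part of the repository: it assumes the Docker and Kathará command-line tools are exposed on PATH as docker and kathara, and it does not check API access.

```python
import shutil
import sys

MIN_PYTHON = (3, 12)                    # taken from the prerequisites list
REQUIRED_TOOLS = ("docker", "kathara")  # assumed executable names


def missing_prereqs(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} found, need >= 3.12"
        )
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


if __name__ == "__main__":
    for line in missing_prereqs() or ["all prerequisite checks passed"]:
        print(line)
```

The interpreter version and the PATH lookup are passed in as parameters only so the check can be exercised without changing the machine it runs on.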
# Clone
git clone https://github.com/Overlxrd-uwu/SADE-NetworkAgent.git
cd SADE-NetworkAgent

Install Python deps. claude-agent-sdk is pinned in pyproject.toml and
ships with whichever path you pick.
Option A: uv (recommended). Install uv first if you don't have it:

# Linux / macOS
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows PowerShell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Then sync and activate the venv:

uv sync
source .venv/bin/activate # macOS / Linux
# .\.venv\Scripts\Activate.ps1 # Windows PowerShell

Option B: plain pip.
python -m venv .venv
source .venv/bin/activate # macOS / Linux
# .\.venv\Scripts\Activate.ps1 # Windows PowerShell
pip install -e .

Add the current user to the docker group so Kathará calls don't need
sudo (Linux only; for the security implications, see the Docker docs):

sudo usermod -aG docker $USER
newgrp docker

Build the customised Kathará Docker images that NIKA uses for fault injection:

bash src/nika/net_env/utils/DockerFiles/build_dockers.sh

Create a .env at the repo root:
BASE_DIR=/absolute/path/to/SADE-NetworkAgent
# SADE / CC-Baseline runners
ANTHROPIC_API_KEY=sk-ant-...
# ReAct runner + LLM-as-judge
OPENAI_API_KEY=sk-...
# Optional observability
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=
LANGFUSE_SECRET_KEY=
LANGFUSE_PUBLIC_KEY=

After setup, confirm the SADE skill library and helper launcher landed.
Expected: 15 skill directories (12 fault-family + diagnosis-methodology-skill
+ baseline-behavior-skill + big-return-skill).

# Linux / macOS
ls src/agent/.claude/skills/ | wc -l
# Windows PowerShell
(Get-ChildItem src/agent/.claude/skills/).Count

Then verify the helper launcher resolves the skill paths:

python h.py # should print a usage message + helper list, not "helper not found"

If the count is below 15, your clone or extraction skipped hidden
directories. Re-clone with git clone (not "Download ZIP" from the GitHub
UI), since the archive endpoint occasionally drops dot-prefixed paths.
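The same checks can be run in one step on any platform. The sketch below is illustrative and not shipped with the repository. It treats BASE_DIR, ANTHROPIC_API_KEY, and OPENAI_API_KEY as required (an assumption; drop the OpenAI key if you only run the Claude Code agents without the judge), and it counts directories only, whereas ls | wc -l counts every entry.

```python
from pathlib import Path

EXPECTED_SKILLS = 15
# Assumption: these three keys are the ones a full run needs.
REQUIRED_KEYS = ("BASE_DIR", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def read_env(path):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_install(repo_root):
    """Return a list of problems found under repo_root; empty means the checks passed."""
    root = Path(repo_root)
    problems = []

    env_file = root / ".env"
    env = read_env(env_file) if env_file.is_file() else {}
    for key in REQUIRED_KEYS:
        if not env.get(key):
            problems.append(f"{key} missing or empty in .env")

    skills_dir = root / "src" / "agent" / ".claude" / "skills"
    count = 0
    if skills_dir.is_dir():
        count = sum(1 for entry in skills_dir.iterdir() if entry.is_dir())
    if count != EXPECTED_SKILLS:
        problems.append(f"found {count} skill directories, expected {EXPECTED_SKILLS}")
    return problems
```

Calling check_install(".") from the repo root returns an empty list when the .env keys are populated and all 15 skill directories are present.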
NIKA evaluation is a four-step pipeline. Run it once on a single incident to confirm the install works end to end.
# Step 1: start the network environment
python src/scripts/step1_net_env_start.py --scenario simple_ospf --topo_size s
# Step 2: inject the failure
python src/scripts/step2_failure_inject.py --root_cause_name ospf_neighbor_missing
# Step 3: run an agent
# SADE (Claude Code + phase-gated workflow + skill library)
python src/scripts/step3_agent_run.py --agent-type claude-code-sade --model claude-sonnet-4-6 --max-steps 20
# CC-Baseline (Claude Code, no SADE policy)
python src/scripts/step3_agent_run.py --agent-type claude-code --model claude-sonnet-4-6 --max-steps 20
# ReAct + GPT-5 (original NIKA baseline)
python src/scripts/step3_agent_run.py --agent-type react --llm-backend openai --model gpt-5 --max-steps 20
# Step 4: evaluate the result
python src/scripts/step4_result_eval.py --judge-model gpt-5-mini

A successful run appends one row to results/0_summary/evaluation_summary.csv
with populated in_tokens, out_tokens, tool_calls, judge scores, and
detection / localisation / RCA metrics.
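To confirm that the appended row is actually populated, the last line of the summary can be inspected programmatically. This is a minimal sketch that checks only the three column names mentioned above (in_tokens, out_tokens, tool_calls); the names of the judge-score and metric columns are not given here, so they are left out rather than guessed.

```python
import csv
from pathlib import Path

SUMMARY = Path("results/0_summary/evaluation_summary.csv")
CHECKED = ("in_tokens", "out_tokens", "tool_calls")


def last_row_gaps(path=SUMMARY, columns=CHECKED):
    """Return the checked columns that are absent or empty in the last row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        # No data rows at all: every checked column counts as missing.
        return list(columns)
    last = rows[-1]
    return [name for name in columns if not (last.get(name) or "").strip()]
```

An empty return value means the most recent run recorded all three fields; any names returned point at the fields to investigate.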
h.py (at the repo root) is the single entry point the SADE agent uses to
run the diagnosis-methodology helper scripts. From the agent's working
directory (src/agent/), it invokes them as:
python ../../h.py infra_sweep # one-pass nft / addressing / routing / ARP / resolver / link sweep
python ../../h.py ospf_snapshot # FRR + OSPF adjacency + per-interface state
python ../../h.py service_snapshot # combined DNS + HTTP + localhost-HTTP + service-process
python ../../h.py # bare invocation lists every available helper

What the launcher does:

- Resolves the helper's full path under src/agent/.claude/skills/diagnosis-methodology-skill/scripts/ (or bgp-fault-skill/scripts/, big-return-skill/scripts/).
- Forwards the rest of the argv to the helper unchanged, so python ../../h.py infra_sweep --device router1 works exactly like invoking infra_sweep.py --device router1 directly.
- Injects LAB_NAME from runtime/current_session.json into the helper's environment, so every helper targets the right Kathará lab without the agent having to remember it.

You can use h.py directly during debugging — python h.py <script> from
the repo root works the same way (just without the ../../ prefix the
agent uses from src/agent/).
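The three launcher behaviours described above (path resolution, argv forwarding, LAB_NAME injection) can be sketched as follows. This is an illustration of the pattern, not the contents of h.py: the JSON key lab_name is a hypothetical name for wherever the session file stores the lab, the search order across the three script directories is assumed, and paths are taken relative to the repo root.

```python
import json
import os
import subprocess
import sys
from pathlib import Path

SKILLS = Path("src/agent/.claude/skills")
SCRIPT_DIRS = (
    SKILLS / "diagnosis-methodology-skill" / "scripts",
    SKILLS / "bgp-fault-skill" / "scripts",
    SKILLS / "big-return-skill" / "scripts",
)
SESSION_FILE = Path("runtime/current_session.json")


def resolve_helper(name, script_dirs=SCRIPT_DIRS):
    """Return the first <name>.py found in the script directories, else None."""
    for directory in script_dirs:
        candidate = Path(directory) / f"{name}.py"
        if candidate.is_file():
            return candidate
    return None


def helper_env(session_file=SESSION_FILE, base_env=None):
    """Copy the environment and add LAB_NAME when the session file provides one."""
    env = dict(os.environ if base_env is None else base_env)
    path = Path(session_file)
    if path.is_file():
        session = json.loads(path.read_text(encoding="utf-8"))
        lab = session.get("lab_name")  # hypothetical key name
        if lab:
            env["LAB_NAME"] = str(lab)
    return env


def launch(argv):
    """Run the named helper with the remaining arguments and return its exit code."""
    if not argv:
        print("usage: h.py <script> [args...]")
        return 2
    helper = resolve_helper(argv[0])
    if helper is None:
        print(f"helper not found: {argv[0]}")
        return 1
    # Everything after the helper name is forwarded unchanged.
    command = [sys.executable, str(helper), *argv[1:]]
    return subprocess.run(command, env=helper_env()).returncode
```

Passing the environment explicitly to subprocess.run, rather than mutating os.environ, keeps the injected LAB_NAME scoped to the helper process.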
# Iterate the four steps over benchmark/benchmark_selected.csv
python benchmark/run_benchmark.py --agent-type claude-code-sade --model claude-sonnet-4-6 --max-steps 20

Each row in benchmark_selected.csv defines one (root cause, scenario,
topology) triple. The runner tears down the Kathará lab between cases, so a
full pass on the held-out test split takes time and credits.
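Because a full pass is expensive, it can help to count the cases in a benchmark CSV first and to confirm that the train and test splits share no rows. The sketch below assumes each CSV starts with a header row and compares rows verbatim, so it would miss overlaps if the files carry a per-row identifier column; the column layout itself is not assumed.

```python
import csv
from pathlib import Path


def load_rows(path):
    """Return the data rows of a CSV as tuples, header excluded."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        return [tuple(row) for row in reader if row]


def shared_rows(train_path, test_path):
    """Rows that appear verbatim in both files; empty for a clean split."""
    return sorted(set(load_rows(train_path)) & set(load_rows(test_path)))
```

For example, len(load_rows("benchmark/benchmark_selected.csv")) gives the number of cases the runner will iterate over, and shared_rows("benchmark/benchmark_train.csv", "benchmark/benchmark_test.csv") should come back empty.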
Research_results/ ships pre-computed CSVs for the test-set 3-way comparison
(SADE, CC-Baseline, ReAct + GPT-5) on the matched triples. To regenerate
every figure used in the paper:
pip install matplotlib numpy pandas
python Research_results/build_research_results.py

That writes 11 PNGs into Research_results/figures/. Per-session conversation
logs are not committed (they total ~580 MB); to regenerate them, re-run the
full benchmark above for each agent.
SADE-NetworkAgent/
├── benchmark/ # NIKA benchmark + selected slice
│ ├── benchmark_full.csv # 640-incident pool
│ ├── benchmark_selected.csv # held-out test slice (paper headline)
│ ├── benchmark_train.csv # training split (skill design corpus)
│ ├── benchmark_test.csv # held-out test split
│ ├── generate_benchmark.py
│ └── run_benchmark.py
├── h.py # SADE helper launcher (python h.py infra_sweep, etc.)
├── Research_results/ # paper data + figure regenerator
│ ├── data/ # unified CSV + log-scanned tool-error CSV
│ ├── figures/ # 11 PNGs (regenerated by build script)
│ ├── tables/ # paper Table 1, per-family, time-efficiency, topology
│ └── build_research_results.py # one-shot regenerator
├── run_nika_break.py # manual fault-injection / verify-injection harness
└── src/
├── agent/
│ ├── claude_code_agent.py # CC-Baseline runner
│ ├── claude_code_agent_sade.py # SADE runner
│ ├── react_agent.py # ReAct + GPT-5 baseline
│ ├── prompts/ # baseline + sade system prompts
│ └── .claude/skills/ # 15-skill library (SADE)
├── nika/ # NIKA orchestrator (unmodified)
└── scripts/ # step1–step4 pipeline (unmodified)
This repository is built on top of NIKA — the network-troubleshooting benchmark and orchestration platform from the SANDS Lab. NIKA contributes the 640-incident benchmark suite, the Kathará-based fault-injection environment, the four-step evaluation pipeline, and the LLM-as-judge scoring framework that this work depends on. SADE uses all of those unmodified; only the agent layer and the reproducibility data pack are ours.