Offline evaluation lab for agent routing and tool-selection policies before production.
Most agent teams tweak prompts, routers, tool catalogs, and policy rules frequently. This lab demonstrates why production changes need offline evaluation first: a new policy can improve one metric while quietly increasing cost, latency, unsafe actions, or unresolved requests.
- Replaying logged decisions from a realistic support/ops scenario
- Comparing candidate routers on the same historical data
- Measuring quality/cost/latency/safety trade-offs
- Flagging weak support/coverage regions in logs
- Using context-aware bounded tool cards vs exposing every tool schema
- Generating a decision-ready report before rollout
flowchart LR
A[Historical logged decisions] --> B[Candidate routing policies]
B --> C[Offline evaluator]
C --> D[Metrics + support coverage checks]
D --> E[Markdown report + terminal summary]
E --> F[Rollout decision \n hold / revise / canary]
See /docs/architecture.md for details.
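The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's evaluator: the `LoggedDecision` fields, the `replay` helper, and the two toy policies are all hypothetical names introduced here, and the real column contract lives in the Input Schema doc.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LoggedDecision:
    # Hypothetical columns for illustration only.
    intent: str
    logged_tool: str   # tool the production router actually chose
    correct_tool: str  # label for the tool that would have resolved the request


Policy = Callable[[LoggedDecision], str]


def replay(policy: Policy, log: List[LoggedDecision]) -> Dict[str, float]:
    """Score one candidate policy against every logged row."""
    if not log:
        raise ValueError("cannot evaluate an empty log")
    picks = [policy(row) for row in log]
    correct = sum(pick == row.correct_tool for pick, row in zip(picks, log))
    agree = sum(pick == row.logged_tool for pick, row in zip(picks, log))
    return {
        "correct_tool_rate": correct / len(log),
        # Only rows where the candidate agrees with the logged action have a
        # logged outcome that directly applies to the candidate.
        "logged_agreement_rate": agree / len(log),
    }


LOG = [
    LoggedDecision("refund", "billing_api", "billing_api"),
    LoggedDecision("refund", "search_kb", "billing_api"),
    LoggedDecision("outage", "status_page", "status_page"),
    LoggedDecision("outage", "search_kb", "status_page"),
]

INTENT_TO_TOOL = {"refund": "billing_api", "outage": "status_page"}


def always_search(row: LoggedDecision) -> str:
    return "search_kb"


def by_intent(row: LoggedDecision) -> str:
    return INTENT_TO_TOOL.get(row.intent, "search_kb")


# Every candidate is scored on the same historical rows.
CANDIDATES = {"always_search": always_search, "by_intent": by_intent}
RESULTS = {name: replay(policy, LOG) for name, policy in CANDIDATES.items()}

for name, metrics in RESULTS.items():
    print(name, metrics)
```

Both toy policies agree with the logged action on half of the rows, yet their correct-tool rates differ completely, which is why the comparison runs over identical data rather than separate samples.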
- CLI Reference: Every subcommand, flag, and exit code
- Input Schema: Logged-decisions CSV contract (+ `validate`)
- JSON Results Schema: Machine-readable output for CI
- Architecture: System design and data flow
- Evaluation Methodology: How metrics are calculated
- Consultant Playbook: Guidance for enterprise adoption
- Glossary: Definitions of key terms
- Roadmap: Backlog themes, sequencing, and contributor entry points
make install
make test
make generate-data
make evaluate
make report
make demo

| Policy | Success | Correct Tool | Avg Cost | Avg Latency (ms) | Unsafe | Unresolved | Regret | Score |
|---|---|---|---|---|---|---|---|---|
| contextweaver_v1 | 79.67% | 83.67% | $0.153 | 183.3 | 2.67% | 8.67% | 0.319 | 83.41 |
| baseline | 62.00% | 66.00% | $0.309 | 304.2 | 3.00% | 37.67% | 0.666 | 66.67 |
| strict_policy | 46.00% | 46.00% | $0.065 | 121.8 | 0.00% | 44.67% | 0.711 | 60.91 |
| cost_aware | 23.67% | 23.67% | $0.051 | 105.4 | 0.00% | 23.00% | 1.014 | 49.38 |
Source: mirrors reports/example_report.md, generated by make demo with 300 synthetic rows and seed 7.
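One way to read the table above is to ask which policies are strictly outclassed. The sketch below runs a simple Pareto-dominance check over five of the reported columns. It is an illustration added here, not how the lab computes its Score or Regret columns, and the `dominates` helper is a hypothetical name.

```python
# Values copied from the table above:
# (success %, avg cost $, avg latency ms, unsafe %, unresolved %)
POLICIES = {
    "contextweaver_v1": (79.67, 0.153, 183.3, 2.67, 8.67),
    "baseline": (62.00, 0.309, 304.2, 3.00, 37.67),
    "strict_policy": (46.00, 0.065, 121.8, 0.00, 44.67),
    "cost_aware": (23.67, 0.051, 105.4, 0.00, 23.00),
}


def as_costs(row):
    """Negate success so that lower is better on every axis."""
    return (-row[0],) + tuple(row[1:])


def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    ca, cb = as_costs(a), as_costs(b)
    no_worse = all(x <= y for x, y in zip(ca, cb))
    better = any(x < y for x, y in zip(ca, cb))
    return no_worse and better


# Map each dominated policy to the policies that dominate it.
DOMINATED_BY = {}
for name, row in POLICIES.items():
    winners = [
        other
        for other, other_row in POLICIES.items()
        if other != name and dominates(other_row, row)
    ]
    if winners:
        DOMINATED_BY[name] = winners

print(DOMINATED_BY)
```

On these numbers only `baseline` is dominated (by `contextweaver_v1`). The other three policies each win on at least one axis, so choosing among them is a trade-off decision rather than a ranking.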
The gate subcommand turns the evaluation into an executable pre-deployment
check that exits non-zero when thresholds are violated:
- name: Routing gate
  run: agent-routing-eval-lab gate --input logs.csv --max-unsafe-rate 0.05 --min-success-rate 0.6

Exit codes: 0 pass, 1 gate/validation failure, 2 usage/data error. See the
CLI reference for `evaluate --format json`, `compare`, `validate`,
and `--dump-decisions`.
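The threshold logic behind such a gate can be sketched as follows. This is a minimal illustration under stated assumptions, not the lab's implementation: the `gate` function and the `unsafe_rate` / `success_rate` metric keys are hypothetical, and only the two thresholds and the exit-code meanings come from the text above.

```python
EXIT_PASS = 0
EXIT_GATE_FAILURE = 1
EXIT_USAGE_ERROR = 2


def gate(metrics, max_unsafe_rate, min_success_rate):
    """Return (exit_code, violations) for one policy's metrics."""
    try:
        unsafe = float(metrics["unsafe_rate"])    # hypothetical key
        success = float(metrics["success_rate"])  # hypothetical key
    except (KeyError, TypeError, ValueError):
        return EXIT_USAGE_ERROR, ["metrics missing or malformed"]

    violations = []
    if unsafe > max_unsafe_rate:
        violations.append(f"unsafe rate {unsafe:.4f} > {max_unsafe_rate}")
    if success < min_success_rate:
        violations.append(f"success rate {success:.4f} < {min_success_rate}")
    return (EXIT_GATE_FAILURE if violations else EXIT_PASS), violations


# Rates taken from the results table, expressed as fractions.
PASSING = gate({"unsafe_rate": 0.0267, "success_rate": 0.7967}, 0.05, 0.6)
FAILING = gate({"unsafe_rate": 0.0, "success_rate": 0.46}, 0.05, 0.6)
BROKEN = gate({}, 0.05, 0.6)

print(PASSING, FAILING, BROKEN)
```

With the thresholds from the workflow step, the `contextweaver_v1` figures pass, while the `strict_policy` figures fail on success rate despite a zero unsafe rate. A CI wrapper would pass the returned code to `sys.exit`.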
Use this lab as a pre-deployment gate before online A/B testing. It helps teams reject policy changes that improve happy-path demos but harm safety, support coverage, or operating cost.
- `skdr-eval`: wrapped by `src/agent_routing_eval_lab/adapters/skdr_eval_adapter.py` as the evaluation anchor. The adapter is explicit about fallback behavior and emits warnings when native API wiring is unavailable.
- `contextweaver`: demonstrated via bounded tool cards in `src/agent_routing_eval_lab/adapters/contextweaver_adapter.py` and the `ContextWeaverRouter`.
Optional extensions to deterministic flows (e.g., ChainWeaver) or governance layers (e.g., AgentFence / agent-kernel) are noted in docs, but routing evaluation stays the main focus.
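To make the idea of bounded tool cards concrete, here is a generic sketch of the technique: rank tools against the request, keep only a few, and cap the text shown for each. It does not use or mirror the `contextweaver` API. The `build_tool_cards` function, the keyword-overlap ranking, and the toy catalog are all assumptions made for illustration.

```python
def build_tool_cards(request, catalog, max_cards=2, max_chars=60):
    """Return at most max_cards short cards for tools relevant to the request."""
    words = set(request.lower().split())

    def overlap(description):
        # Naive relevance: number of request words found in the description.
        return len(words & set(description.lower().split()))

    scored = [(overlap(desc), name, desc) for name, desc in catalog.items()]
    relevant = [item for item in scored if item[0] > 0]
    # Stable sort keeps catalog order among tools with equal scores.
    relevant.sort(key=lambda item: item[0], reverse=True)
    return [
        {"name": name, "summary": desc[:max_chars]}
        for _, name, desc in relevant[:max_cards]
    ]


CATALOG = {
    "billing_api": "issue a refund or adjust an invoice for a customer",
    "status_page": "check current outage and incident status",
    "search_kb": "search help articles",
    "restart_service": "restart a production service after an outage",
}

CARDS = build_tool_cards("customer wants refund for duplicate invoice", CATALOG)
print(CARDS)
```

The router then sees one short card instead of four full schemas. A real system would likely use a stronger relevance signal than word overlap, but the bound on count and size is the part that matters for cost and latency.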
- Evaluating agent router changes safely
- Reducing tool-call costs without quality regressions
- Validating stricter safety/approval policies
- Preparing an agent for production rollout
- Comparing prompt/model/tool-catalog changes before traffic exposure
- Synthetic demo data, not production telemetry
- Offline evaluation cannot replace online experiments
- Still requires red-teaming and human review for high-risk actions
- Counterfactual estimates are sensitive to support/coverage in logs
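The last limitation above can be checked mechanically. The sketch below flags (context, tool) pairs that a candidate policy wants to use but that the log rarely or never contains. The `weak_support` helper, the pair representation, and the `min_count` threshold are hypothetical choices for illustration, not the lab's coverage check.

```python
from collections import Counter


def weak_support(logged_pairs, candidate_pairs, min_count=5):
    """Map each under-supported candidate pair to its count in the log."""
    seen = Counter(logged_pairs)
    wanted = set(candidate_pairs)
    # Counter returns 0 for pairs that never appear in the log.
    return {pair: seen[pair] for pair in wanted if seen[pair] < min_count}


LOGGED = (
    [("refund", "billing_api")] * 8
    + [("refund", "search_kb")] * 2
    + [("outage", "status_page")] * 6
)

CANDIDATE = [
    ("refund", "search_kb"),
    ("outage", "status_page"),
    ("outage", "restart_service"),
]

WEAK = weak_support(LOGGED, CANDIDATE)
print(WEAK)
```

Here the candidate's choice of `restart_service` for outages has no logged evidence at all, so any offline estimate for that region rests on extrapolation and deserves a canary or human review rather than a confident score.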
See docs/non-goals.md for scope boundaries that keep the
lab focused on offline routing evaluation rather than live runtime ownership.