agent-routing-eval-lab

Offline evaluation lab for agent routing and tool-selection policies before production.

Most agent teams tweak prompts, routers, tool catalogs, and policy rules frequently. This lab demonstrates why production changes need offline evaluation first: a new policy can improve one metric while quietly increasing cost, latency, unsafe actions, or unresolved requests.

What this repo demonstrates

Replaying logged decisions from a realistic support/ops scenario
Comparing candidate routers on the same historical data
Measuring quality/cost/latency/safety trade-offs
Flagging weak support/coverage regions in logs
Using context-aware bounded tool cards vs exposing every tool schema
Generating a decision-ready report before rollout
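
The replay-and-compare idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lab's actual evaluator: the `LoggedDecision` fields, the `replay` function, and the two toy routers are hypothetical names, and the sketch assumes every logged row carries a reviewer-supplied correct tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LoggedDecision:
    """One historical routing decision (hypothetical schema for illustration)."""
    intent: str        # what the user asked for
    logged_tool: str   # tool the production router picked at the time
    correct_tool: str  # tool a reviewer marked as correct


Router = Callable[[str], str]  # maps an intent to a tool name


def replay(logs: List[LoggedDecision], routers: Dict[str, Router]) -> Dict[str, float]:
    """Return each router's correct-tool rate on the same logged rows."""
    if not logs:
        raise ValueError("no logged decisions to replay")
    rates = {}
    for name, router in routers.items():
        hits = sum(router(row.intent) == row.correct_tool for row in logs)
        rates[name] = hits / len(logs)
    return rates


logs = [
    LoggedDecision("refund", "billing_api", "billing_api"),
    LoggedDecision("reset_password", "billing_api", "identity_api"),
    LoggedDecision("outage", "kb_search", "incident_api"),
    LoggedDecision("refund", "kb_search", "billing_api"),
]

routers = {
    "always_kb": lambda intent: "kb_search",
    "intent_map": lambda intent: {
        "refund": "billing_api",
        "reset_password": "identity_api",
    }.get(intent, "kb_search"),
}

rates = replay(logs, routers)
print(rates)
```

Because every router is scored on the identical rows, differences in the rates come from the policies rather than from differences in traffic.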

Architecture

flowchart LR
    A[Historical logged decisions] --> B[Candidate routing policies]
    B --> C[Offline evaluator]
    C --> D[Metrics + support coverage checks]
    D --> E[Markdown report + terminal summary]
    E --> F[Rollout decision \n hold / revise / canary]

See docs/architecture.md for details.

Documentation

CLI Reference — Every subcommand, flag, and exit code
Input Schema — Logged-decisions CSV contract (+ validate)
JSON Results Schema — Machine-readable output for CI
Architecture — System design and data flow
Evaluation Methodology — How metrics are calculated
Consultant Playbook — Guidance for enterprise adoption
Glossary — Definitions of key terms
Roadmap — Backlog themes, sequencing, and contributor entry points

Quickstart

make install
make test
make generate-data
make evaluate
make report
make demo

Example comparison output

| Policy | Success | Correct Tool | Avg Cost | Avg Latency (ms) | Unsafe | Unresolved | Regret | Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| contextweaver_v1 | 79.67% | 83.67% | $0.153 | 183.3 | 2.67% | 8.67% | 0.319 | 83.41 |
| baseline | 62.00% | 66.00% | $0.309 | 304.2 | 3.00% | 37.67% | 0.666 | 66.67 |
| strict_policy | 46.00% | 46.00% | $0.065 | 121.8 | 0.00% | 44.67% | 0.711 | 60.91 |
| cost_aware | 23.67% | 23.67% | $0.051 | 105.4 | 0.00% | 23.00% | 1.014 | 49.38 |

Source: mirrors reports/example_report.md, generated by make demo with 300 synthetic rows and seed 7.
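
One way to read the comparison is to ask, for each candidate, which metrics got worse relative to `baseline`. The sketch below does that for three of the columns, using figures copied from the example output. The `regressions` helper and the `HIGHER_IS_BETTER` mapping are hypothetical names for illustration, not part of the lab's code.

```python
# Figures copied from the example comparison (percentages written as fractions).
ROWS = {
    "contextweaver_v1": {"success": 0.7967, "avg_cost": 0.153, "unsafe": 0.0267},
    "baseline":         {"success": 0.6200, "avg_cost": 0.309, "unsafe": 0.0300},
    "strict_policy":    {"success": 0.4600, "avg_cost": 0.065, "unsafe": 0.0000},
    "cost_aware":       {"success": 0.2367, "avg_cost": 0.051, "unsafe": 0.0000},
}

# Direction of improvement for each metric.
HIGHER_IS_BETTER = {"success": True, "avg_cost": False, "unsafe": False}


def regressions(candidate, baseline="baseline"):
    """List the metrics on which `candidate` is worse than `baseline`."""
    worse = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        delta = ROWS[candidate][metric] - ROWS[baseline][metric]
        got_worse = delta < 0 if higher_is_better else delta > 0
        if got_worse:
            worse.append(metric)
    return worse


report = {name: regressions(name) for name in ROWS if name != "baseline"}
print(report)
```

In this data, `strict_policy` and `cost_aware` both cut cost and unsafe actions but regress on success, which is the kind of trade-off a single headline metric would hide.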

Use as a CI gate

The gate subcommand turns the evaluation into an executable pre-deployment check that exits non-zero when thresholds are violated:

- name: Routing gate
  run: agent-routing-eval-lab gate --input logs.csv --max-unsafe-rate 0.05 --min-success-rate 0.6

Exit codes: 0 pass, 1 gate/validation failure, 2 usage/data error. See the CLI reference for evaluate --format json, compare, validate, and --dump-decisions.
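
The threshold logic behind such a gate can be sketched as follows. This is an assumption-laden illustration rather than the lab's implementation: the `gate` function and its parameter names are hypothetical, mirroring only the `--max-unsafe-rate` and `--min-success-rate` flags and the three exit codes listed here.

```python
import sys
from typing import List

# Exit codes as documented: 0 pass, 1 gate/validation failure, 2 usage/data error.
EXIT_PASS, EXIT_GATE_FAILURE, EXIT_USAGE_ERROR = 0, 1, 2


def gate(success_rate: float, unsafe_rate: float,
         min_success_rate: float, max_unsafe_rate: float) -> int:
    """Return an exit code for the given metrics and thresholds."""
    values = (success_rate, unsafe_rate, min_success_rate, max_unsafe_rate)
    if not all(0.0 <= v <= 1.0 for v in values):
        return EXIT_USAGE_ERROR  # rates and thresholds must be fractions in [0, 1]

    violations: List[str] = []
    if unsafe_rate > max_unsafe_rate:
        violations.append(f"unsafe rate {unsafe_rate:.2%} exceeds {max_unsafe_rate:.2%}")
    if success_rate < min_success_rate:
        violations.append(f"success rate {success_rate:.2%} below {min_success_rate:.2%}")

    for message in violations:
        print(message, file=sys.stderr)
    return EXIT_GATE_FAILURE if violations else EXIT_PASS


# Metrics taken from the baseline and strict_policy rows of the example output.
baseline_code = gate(0.62, 0.03, min_success_rate=0.6, max_unsafe_rate=0.05)
strict_code = gate(0.46, 0.00, min_success_rate=0.6, max_unsafe_rate=0.05)
bad_input_code = gate(1.5, 0.00, min_success_rate=0.6, max_unsafe_rate=0.05)
```

A wrapper script would pass the returned code to `sys.exit` so the CI step fails when a threshold is violated.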

Enterprise governance mapping

Use this lab as a pre-deployment gate before online A/B testing. It helps teams reject policy changes that improve happy-path demos but harm safety, support coverage, or operating cost.

Public library showcase

skdr-eval: wrapped by src/agent_routing_eval_lab/adapters/skdr_eval_adapter.py as the evaluation anchor. The adapter is explicit about fallback behavior and emits warnings when native API wiring is unavailable.
contextweaver: demonstrated via bounded tool cards in src/agent_routing_eval_lab/adapters/contextweaver_adapter.py and the ContextWeaverRouter.
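
To make the idea of bounded tool cards concrete, here is a generic sketch that does not use contextweaver's API. It assumes a hypothetical `ToolCard` type and a naive keyword-overlap score, and it shows only the core behavior: the model sees a small, context-relevant subset of short cards instead of every tool schema.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ToolCard:
    """Short, bounded description of a tool (hypothetical type for illustration)."""
    name: str
    summary: str              # shown to the model instead of the full schema
    keywords: FrozenSet[str]  # naive relevance signal used by this sketch


def select_cards(request: str, catalog: List[ToolCard], max_cards: int = 2) -> List[ToolCard]:
    """Return at most `max_cards` cards whose keywords overlap the request."""
    words = set(request.lower().split())
    scored = [(len(words & card.keywords), card) for card in catalog]
    relevant = [(score, card) for score, card in scored if score > 0]
    # Highest overlap first; ties broken by name so the result is deterministic.
    relevant.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [card for _, card in relevant[:max_cards]]


catalog = [
    ToolCard("billing_api", "Issue refunds and look up invoices.",
             frozenset({"refund", "invoice", "charge"})),
    ToolCard("identity_api", "Reset passwords and unlock accounts.",
             frozenset({"password", "login", "unlock"})),
    ToolCard("incident_api", "Open and update incident tickets.",
             frozenset({"outage", "incident", "down"})),
    ToolCard("kb_search", "Search help-center articles.",
             frozenset({"how", "help", "invoice"})),
]

cards = select_cards("Please refund the duplicate charge on my invoice", catalog)
names = [card.name for card in cards]
print(names)
```

A real selector would likely use richer context than keyword overlap; the point of the sketch is the hard cap on how many cards reach the model.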

Optional extensions to deterministic flows (e.g., ChainWeaver) or governance layers (e.g., AgentFence / agent-kernel) are noted in docs, but routing evaluation stays the main focus.

When to use this pattern

Evaluating agent router changes safely
Reducing tool-call costs without quality regressions
Validating stricter safety/approval policies
Preparing an agent for production rollout
Comparing prompt/model/tool-catalog changes before traffic exposure

Limitations

Synthetic demo data, not production telemetry
Offline evaluation cannot replace online experiments
Still requires red-teaming and human review for high-risk actions
Counterfactual estimates are sensitive to support/coverage in logs
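
The sensitivity to support can be illustrated with a simple count-based check: if a candidate policy would choose an action that the logs rarely or never contain for a given context, any estimate for that choice rests on little evidence. The sketch below is a minimal version under stated assumptions; `weak_support`, the `(context, action)` tuple layout, and the `min_count` threshold are hypothetical and not taken from the lab's code.

```python
from collections import Counter
from typing import Callable, List, Tuple


def weak_support(logged: List[Tuple[str, str]],
                 policy: Callable[[str], str],
                 min_count: int = 2) -> List[Tuple[str, str, int]]:
    """Flag (context, action, count) triples where the policy's choice is rarely logged."""
    counts = Counter(logged)
    contexts = sorted({context for context, _ in logged})
    flagged = []
    for context in contexts:
        action = policy(context)
        seen = counts[(context, action)]  # Counter returns 0 for unseen pairs
        if seen < min_count:
            flagged.append((context, action, seen))
    return flagged


# Logged (context, action) pairs from a toy history.
logged = [
    ("refund", "billing_api"),
    ("refund", "billing_api"),
    ("refund", "kb_search"),
    ("outage", "kb_search"),
    ("outage", "kb_search"),
]


def candidate(context: str) -> str:
    """Toy candidate policy: refunds go to billing, everything else to incidents."""
    return "billing_api" if context == "refund" else "incident_api"


flagged = weak_support(logged, candidate)
print(flagged)
```

Here the candidate routes outages to `incident_api`, an action the logs never took for that context, so that region is flagged as unsupported.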

See docs/non-goals.md for scope boundaries that keep the lab focused on offline routing evaluation rather than live runtime ownership.
