This repository contains a tau2-bench-based evaluation setup for studying whether a dedicated guardrail layer can reduce policy-violating tool calls in customer-service agents.
The important idea is simple:
User simulator → Agent → Guardrail middleware → Tool execution
The agent still decides what tool to call. The guardrail middleware sees that tool call before it reaches the real tool. If the call violates a mapped policy rule, the guard blocks it and returns feedback to the agent instead of executing the side effect.
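The interception step can be illustrated with a minimal sketch. It is not the repository's middleware: `ToolCall`, `GuardMiddleware`, the `cancel_reservation` tool, and the `locked_reservations_rule` rule are hypothetical names chosen for illustration, and the actual guards in `src/tau2/guardrails/` may be structured differently (for example, LLM-based rather than rule functions).

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]


# A rule inspects a call and returns a violation message, or None if allowed.
Rule = Callable[[ToolCall], Optional[str]]


class GuardMiddleware:
    """Checks a tool call against the rules mapped to that tool before running it."""

    def __init__(self, tools: dict[str, Callable[..., Any]], rules: dict[str, list[Rule]]):
        self.tools = tools
        self.rules = rules

    def execute(self, call: ToolCall) -> dict[str, Any]:
        for rule in self.rules.get(call.name, []):
            violation = rule(call)
            if violation is not None:
                # Blocked: the real tool is never invoked, so no side effect occurs.
                return {"blocked": True, "feedback": violation}
        return {"blocked": False, "result": self.tools[call.name](**call.arguments)}


# Hypothetical tool with a side effect, for illustration only.
cancelled: list[str] = []


def cancel_reservation(reservation_id: str) -> str:
    cancelled.append(reservation_id)
    return f"cancelled {reservation_id}"


# Hypothetical policy rule, for illustration only.
def locked_reservations_rule(call: ToolCall) -> Optional[str]:
    if str(call.arguments.get("reservation_id", "")).startswith("LOCKED"):
        return "Policy violation: locked reservations cannot be cancelled."
    return None


guard = GuardMiddleware(
    tools={"cancel_reservation": cancel_reservation},
    rules={"cancel_reservation": [locked_reservations_rule]},
)
blocked = guard.execute(ToolCall("cancel_reservation", {"reservation_id": "LOCKED-1"}))
allowed = guard.execute(ToolCall("cancel_reservation", {"reservation_id": "ABC123"}))
```

In this sketch the blocked call returns feedback for the agent and leaves `cancelled` untouched, while the allowed call reaches the tool.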
The project currently supports experiments across:
- airline
- telecom
- retail
It also contains a policy-tool mapper pipeline that maps policy passages to the tools they govern.
The main way to work with the project is the custom viewer in:
./viewer/
It is not the old web/leaderboard app. The viewer is the project-specific UI for:
- browsing all completed simulation runs
- comparing guarded vs unguarded experiments
- analyzing reward, policy violation rate, latency, and guard blocks
- scheduling full experiments from the browser
- scheduling policy-tool-mapper runs from the browser
- running live chats with and without the guard architecture in production mode
Run it from the repository root:
uv run python viewer/server.py
Then open:
http://localhost:8765
Install dependencies:
uv sync
Create or update the environment file:
./.env
Typical keys used by experiments:
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
HF_TOKEN=...
The exact keys needed depend on the models you run.
The core command is:
uv run tau2 run \
--domain <domain> \
--agent llm_agent \
--guardrail-config <guardrail_config> \
--agent-llm <agent_model> \
--user-llm <user_model> \
--num-trials <n> \
--max-concurrency <n>
For the no-guard baseline, use guardrail_configs/null.json:
uv run tau2 run \
--domain airline \
--agent llm_agent \
--guardrail-config guardrail_configs/null.json \
--agent-llm gpt-4.1-mini \
--user-llm gpt-5.1 \
--num-trials 3 \
--max-concurrency 3
For a guarded run, use the domain guard config:
uv run tau2 run \
--domain airline \
--agent llm_agent \
--guardrail-config guardrail_configs/airline_llm_guard.json \
--agent-llm gpt-4.1-mini \
--user-llm gpt-5.1 \
--guard-llm gpt-4.1-mini \
--num-trials 3 \
--max-concurrency 3
--guard-llm overrides the LLM used by all LLM-based guards in the guardrail config. Everything else in the config stays unchanged.
Current main configs:
guardrail_configs/airline_llm_guard.json
guardrail_configs/telecom_llm_guard.json
guardrail_configs/retail_llm_guard.json
guardrail_configs/null.json
null.json is the no-guard baseline.
To run only selected tasks, pass --task-ids:
uv run tau2 run \
--domain airline \
--agent llm_agent \
--guardrail-config guardrail_configs/airline_llm_guard.json \
--agent-llm gpt-4.1-mini \
--user-llm gpt-5.1 \
--guard-llm gpt-4.1-mini \
--num-trials 1 \
--max-concurrency 1 \
--task-ids 0 1 4 5 9 11 12
The recommended path is the web viewer:
python3 viewer/server.py
Open:
http://localhost:8765
Useful pages in the viewer:
- Simulation Runs: all result folders with reward, policy violation rate, latency, guard model, agent model, user model, and sample count.
- Experiment Analyzer: select guarded and unguarded runs per model and compare policy violation rate, reward, latency, and guard block counts.
- Task Conversation View: inspect the actual dialogue and tool calls for one task/trial.
- Experiment Scheduler: create queued tau2 runs from the browser instead of hand-writing CLI commands.
- Mapper Scheduler / Analyzer: run and compare policy-tool-mapper experiments.
The analyzer’s guard-block graph is count-based:
- Total guard blocks: all tool calls blocked by guards.
- Correctly blocked: blocked calls that matched an explicit unauthorized_action compliance assertion in the task JSON.
- Unmatched block chats: links to conversations where the guard blocked a call that did not match an explicit task compliance assertion.
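The three counts can be sketched as follows. This is an illustration under stated assumptions, not the viewer's code: `BlockedCall` and `summarize_blocks` are hypothetical names, the task JSON layout is not reproduced, and matching is done on tool name only, whereas the viewer may also compare call arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockedCall:
    task_id: str
    tool_name: str


def summarize_blocks(
    blocks: list[BlockedCall],
    unauthorized_actions: dict[str, set[str]],
) -> dict[str, object]:
    """Split guard blocks into those matching an explicit assertion and the rest.

    unauthorized_actions maps a task id to the tool names that the task's
    unauthorized_action assertions declare as forbidden (assumed shape).
    """
    correct: list[BlockedCall] = []
    unmatched: list[BlockedCall] = []
    for block in blocks:
        if block.tool_name in unauthorized_actions.get(block.task_id, set()):
            correct.append(block)
        else:
            unmatched.append(block)
    return {
        "total_guard_blocks": len(blocks),
        "correctly_blocked": len(correct),
        "unmatched_blocks": unmatched,  # candidates for manual inspection
    }


# Hypothetical data, for illustration only.
blocks = [
    BlockedCall("task-1", "cancel_reservation"),
    BlockedCall("task-1", "update_baggage"),
    BlockedCall("task-2", "cancel_reservation"),
]
assertions = {"task-1": {"cancel_reservation"}}
summary = summarize_blocks(blocks, assertions)
```

An unmatched block is not necessarily a false positive; it only means no explicit assertion covers it.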
The policy-tool mapper creates mappings from tools to the policy passages that constrain them.
There are two modes:
LLM mode (--mode llm):
chunker → profiler → mapper → sweeper
The LLM directly maps policy statements to tools.
Retrieval mode (--mode retrieval):
chunker → profiler → BM25 + embedding search → cross-encoder top-k reranker → LLM judge → sweeper
This mode first retrieves candidate policy passages, reranks them, and then asks an LLM judge to verify the candidates.
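The retrieve, rerank, and verify flow can be sketched with pluggable scoring functions. This is a simplified illustration, not the pipeline's implementation: the function names and the `n_candidates` parameter are hypothetical, the toy word-overlap scorer stands in for BM25, the embedding search, and the cross-encoder alike, and taking the union of the two candidate lists is an assumption about how they are combined. Only `ce_top_k` mirrors a documented flag (`--ce-top-k`).

```python
from typing import Callable

Scorer = Callable[[str, str], float]  # (tool description, passage) -> score
Judge = Callable[[str, str], bool]    # (tool description, passage) -> keep?


def top_n(tool: str, passages: list[str], scorer: Scorer, n: int) -> list[str]:
    """Return the n highest-scoring passages (ties keep their input order)."""
    return sorted(passages, key=lambda p: scorer(tool, p), reverse=True)[:n]


def retrieve_and_verify(
    tool: str,
    passages: list[str],
    lexical: Scorer,
    semantic: Scorer,
    reranker: Scorer,
    judge: Judge,
    n_candidates: int = 50,
    ce_top_k: int = 20,
) -> list[str]:
    # Stage 1: candidates from both retrievers, merged without duplicates.
    merged = top_n(tool, passages, lexical, n_candidates) + top_n(
        tool, passages, semantic, n_candidates
    )
    candidates = list(dict.fromkeys(merged))
    # Stage 2: rerank the candidates and keep the top k.
    reranked = top_n(tool, candidates, reranker, ce_top_k)
    # Stage 3: a judge confirms or rejects each remaining candidate.
    return [p for p in reranked if judge(tool, p)]


def word_overlap(tool: str, passage: str) -> float:
    """Toy scorer: number of tool words that appear in the passage."""
    words = set(passage.lower().split())
    return float(sum(1 for w in tool.lower().split() if w in words))


# Hypothetical passages, for illustration only.
passages = [
    "Agents may cancel a reservation within 24 hours of booking.",
    "Baggage allowance depends on cabin class.",
    "A reservation can be cancelled if the airline cancelled the flight.",
]
kept = retrieve_and_verify(
    "cancel reservation",
    passages,
    lexical=word_overlap,
    semantic=word_overlap,
    reranker=word_overlap,
    judge=lambda tool, p: "cancel" in p.lower(),
    n_candidates=2,
    ce_top_k=2,
)
```

The point of the staging is cost: cheap retrievers narrow the passage set before the more expensive reranker and judge see it.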
The --compare command is the most important one when you want a full overview of existing mapper results.
It does not rerun mapping or evaluation. It only reads existing files from:
policy_tool_mapper/output/
Airline:
uv run python policy_tool_mapper/run_pipeline.py \
--domain airline \
--compare
Telecom:
uv run python policy_tool_mapper/run_pipeline.py \
--domain telecom \
--compare
Retail:
uv run python policy_tool_mapper/run_pipeline.py \
--domain retail \
--compare
Policy and tool input files live in:
policy_tool_mapper/input/
Expected files:
policy_tool_mapper/input/airlinePolicy.md
policy_tool_mapper/input/airlineTools.json
policy_tool_mapper/input/telecomPolicy.md
policy_tool_mapper/input/telecomTools.json
policy_tool_mapper/input/retailPolicy.md
policy_tool_mapper/input/retailTools.json
Ground truth lives in:
policy_tool_mapper/ground_truth/
Important files:
policy_tool_mapper/ground_truth/airline-ground-truth.json
policy_tool_mapper/ground_truth/telecom-ground-truth.json
policy_tool_mapper/ground_truth/retail-ground-truth.json
Outputs are written to:
policy_tool_mapper/output/
Run from the repository root.
uv run python policy_tool_mapper/run_pipeline.py \
--domain airline \
--models gpt-4.1-mini gpt-4.1 gpt-5.1 gpt-5.4 \
--mode llm
For telecom:
uv run python policy_tool_mapper/run_pipeline.py \
--domain telecom \
--models gpt-4.1-mini gpt-4.1 gpt-5.1 gpt-5.4 \
--mode llm
For retail:
uv run python policy_tool_mapper/run_pipeline.py \
--domain retail \
--models gpt-4.1-mini gpt-4.1 gpt-5.1 gpt-5.4 \
--mode llm
Retrieval mode:
uv run python policy_tool_mapper/run_pipeline.py \
--domain airline \
--models gpt-4.1-mini gpt-4.1 gpt-5.1 gpt-5.4 \
--mode retrieval \
--ce-top-k 20
The default retrieval setup uses:
Embedding model: text-embedding-3-large
Cross-encoder: cross-encoder/ms-marco-MiniLM-L-6-v2
CE top-k: 20
The comparison table includes:
- model
- mode: llm or retrieval
- confidence slice: high or all
- macro precision
- macro recall
- macro F1
- micro precision
- micro recall
- micro F1
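The difference between the macro and micro columns can be shown with a small sketch. It assumes a mapping is a dictionary from tool name to a set of policy passage IDs, that macro scores average per-tool scores, and that micro scores pool the counts across tools. The function names and the zero-denominator convention are assumptions; `policy-map-eval` may compute these differently (for example, macro F1 is here the mean of per-tool F1 values).

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, F1 from counts; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_micro(
    predicted: dict[str, set[str]],
    truth: dict[str, set[str]],
) -> dict[str, tuple[float, float, float]]:
    per_tool = []
    total_tp = total_fp = total_fn = 0
    for tool in sorted(set(predicted) | set(truth)):
        pred = predicted.get(tool, set())
        gold = truth.get(tool, set())
        tp, fp, fn = len(pred & gold), len(pred - gold), len(gold - pred)
        per_tool.append(prf(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    count = len(per_tool)
    # Macro: every tool weighs the same, regardless of how many passages it has.
    macro = tuple(sum(s[i] for s in per_tool) / count if count else 0.0 for i in range(3))
    # Micro: every (tool, passage) pair weighs the same.
    micro = prf(total_tp, total_fp, total_fn)
    return {"macro": macro, "micro": micro}


# Hypothetical mappings, for illustration only.
predicted = {"book": {"p1", "p2"}, "cancel": {"p3"}}
truth = {"book": {"p1"}, "cancel": {"p3", "p4", "p5"}}
scores = macro_micro(predicted, truth)
```

Reporting both is useful because a tool with many governing passages dominates the micro scores but counts only once in the macro scores.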
The files that feed the comparison look like:
policy_tool_mapper/output/airline-eval-gpt-4.1-mini-high.json
policy_tool_mapper/output/airline-eval-gpt-4.1-mini-all.json
policy_tool_mapper/output/airline-eval-gpt-4.1-mini-retrieval-high.json
policy_tool_mapper/output/airline-eval-gpt-4.1-mini-retrieval-all.json
You can also inspect and compare these results in the web viewer under the mapper analyzer.
Use policy-map directly when debugging one mapping run.
uv run policy-map \
--policy policy_tool_mapper/input/airlinePolicy.md \
--openapi policy_tool_mapper/input/airlineTools.json \
--output policy_tool_mapper/output/airline-mappings-debug.json \
--model gpt-4.1-mini \
--mode llm
Retrieval mode:
uv run policy-map \
--policy policy_tool_mapper/input/airlinePolicy.md \
--openapi policy_tool_mapper/input/airlineTools.json \
--output policy_tool_mapper/output/airline-mappings-debug-retrieval.json \
--model gpt-4.1-mini \
--mode retrieval \
--ce-top-k 20
Evaluate one mapping file:
uv run policy-map-eval \
--predicted policy_tool_mapper/output/airline-mappings-debug.json \
--ground-truth policy_tool_mapper/ground_truth/airline-ground-truth.json \
--output policy_tool_mapper/output/airline-eval-debug.json
Evaluate high-confidence mappings only:
uv run policy-map-eval \
--predicted policy_tool_mapper/output/airline-mappings-debug.json \
--ground-truth policy_tool_mapper/ground_truth/airline-ground-truth.json \
--output policy_tool_mapper/output/airline-eval-debug-high.json \
--confidence-high-only
Tasks:
data/tau2/domains/airline/tasks.json
data/tau2/domains/telecom/tasks.json
data/tau2/domains/retail/tasks.json
Task splits:
data/tau2/domains/airline/split_tasks.json
data/tau2/domains/telecom/split_tasks.json
data/tau2/domains/retail/split_tasks.json
Policies:
data/tau2/domains/airline/policy.md
data/tau2/domains/telecom/main_policy.md
data/tau2/domains/retail/policy.md
Tools:
src/tau2/domains/airline/tools.py
src/tau2/domains/telecom/tools.py
src/tau2/domains/retail/tools.py
Databases:
data/tau2/domains/airline/db.json
data/tau2/domains/retail/db.json
data/tau2/domains/telecom/db.toml
data/tau2/domains/telecom/user_db.toml
Guard configs:
guardrail_configs/
Simulation output:
data/simulations/
Policy-mapper output:
policy_tool_mapper/output/
For each domain and model, run one baseline and one guarded experiment:
# Without guard
uv run tau2 run \
--domain telecom \
--agent llm_agent \
--guardrail-config guardrail_configs/null.json \
--agent-llm gpt-4.1-mini \
--user-llm gpt-5.1 \
--num-trials 3 \
--max-concurrency 3
# With guard
uv run tau2 run \
--domain telecom \
--agent llm_agent \
--guardrail-config guardrail_configs/telecom_llm_guard.json \
--agent-llm gpt-4.1-mini \
--user-llm gpt-5.1 \
--guard-llm gpt-4.1-mini \
--num-trials 3 \
--max-concurrency 3
Then open the viewer and compare the pair in the Experiment Analyzer.
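To avoid retyping the pair for every domain, the two command lines can be generated. The sketch below only builds and prints the commands, using the flags and config paths shown in this README; `build_pair` is a hypothetical helper, not part of the repository, and the default model names are simply the ones used in the examples here.

```python
import shlex

DOMAINS = ["airline", "telecom", "retail"]


def build_pair(
    domain: str,
    agent_llm: str,
    user_llm: str = "gpt-5.1",
    guard_llm: str = "gpt-4.1-mini",
    trials: int = 3,
    concurrency: int = 3,
) -> tuple[list[str], list[str]]:
    """Return (baseline, guarded) argument lists for one domain and agent model."""
    head = ["uv", "run", "tau2", "run", "--domain", domain, "--agent", "llm_agent"]
    tail = [
        "--agent-llm", agent_llm,
        "--user-llm", user_llm,
        "--num-trials", str(trials),
        "--max-concurrency", str(concurrency),
    ]
    baseline = head + ["--guardrail-config", "guardrail_configs/null.json"] + tail
    guarded = (
        head
        + ["--guardrail-config", f"guardrail_configs/{domain}_llm_guard.json"]
        + tail
        + ["--guard-llm", guard_llm]
    )
    return baseline, guarded


for domain in DOMAINS:
    for command in build_pair(domain, "gpt-4.1-mini"):
        # Printed only; pass the list to subprocess.run(...) to execute it.
        print(shlex.join(command))
```

The Experiment Scheduler in the viewer covers the same need from the browser.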
The main plots to report are:
- policy violation rate
- reward
- latency change
- guard block counts
For the policy mapper, run both:
--mode llm
--mode retrieval
Then use:
uv run python policy_tool_mapper/run_pipeline.py --domain <domain> --compare
or the mapper analyzer in the viewer.
If you hit provider rate limits:
- lower --max-concurrency
- use fewer models in one mapper batch
- run fewer trials
- resume later with --auto-resume
For Anthropic especially, input-token-per-minute limits can be hit even when request count looks low.
The viewer’s unmatched block list is a conservative diagnostic, not a perfect semantic false-positive oracle.
It counts a guard block as “correctly blocked” only if the blocked tool call matches an explicit unauthorized_action compliance assertion in the task JSON. A block that does not match such an assertion is shown as unmatched so the chat can be inspected manually.
./
├── data/tau2/domains/ # Domain policies, tasks, DBs, splits
├── data/simulations/ # Completed tau2 experiment results
├── guardrail_configs/ # Guardrail JSON configs
├── policy_tool_mapper/ # Policy-to-tool mapping pipeline
│ ├── input/ # Mapper policies and tool JSON
│ ├── ground_truth/ # Human ground-truth mappings
│ └── output/ # Mapper outputs and eval files
├── src/tau2/guardrails/ # Guard middleware and guard implementations
├── src/tau2/domains/ # Domain tool implementations
└── viewer/ # Custom web viewer, scheduler, analyzers
For guardrail architecture experiments:
- primary safety metric: policy violation rate
- utility metric: reward
- cost metric: average task latency
- diagnostic metric: guard block counts and unmatched block chats
For policy-tool mapping:
- primary balanced metric: F1
- safety-oriented metric: recall
- precision-oriented metric: precision
- report both macro and micro scores
- compare llm vs retrieval
- compare high vs all confidence slices
My contribution covers the full guardrail middleware layer and all necessary tau2 adaptations around it, the complete custom viewer and web app, and the full policy-tool mapper pipeline.
