Add telecom tau2↔p2m Spearman correlation study #65
A few model-selection + perf knobs from the model-coverage / inference-speedup discussion:

Foundry-only model list for the tau2 study (range of quality)

Range chosen for measurable spread on the same harness; only Foundry endpoints so latency is bounded by Azure data-plane. Confirm exact deployment names against our Foundry inventory before wiring.
Recommended 3-model rotation for the regression-gate CI (keep PR latency under ~15 min):

I deliberately dropped Google Gemini because it is not on Foundry today. That's tracked separately under the 3P-endpoint testing pilot Jake owns.

Inference-speedup knobs (the dev complaint about tau2 being slow)

The dominant cost on tau2 telecom is not the framework; it's
On switching to UK AI Inspect to speed up tau2

Short answer: it will not help. Inspect has the same fundamental loop (spec → generate → run → judge); switching frameworks does not reduce model-call count, which is the actual bottleneck. We also lose OpenInference auto-trace (8/8 → 1/8 observability on judge side) and the multi-turn auditor pressure model that makes tau2 telecom discriminative in the first place. Inspect-compatible export is on our roadmap so we can publish to that ecosystem without switching frameworks. Happy to file a separate issue tracking the tau2-speedup playbook if useful.
concept.md: Telecom customer service agent behavior specification derived from tau2-bench's main_policy.md. Covers 7 operational areas (customer lookup, billing, line management, data refueling, plan changes, roaming, tech support) with quality and safety expectations.
telecom_tools.yaml: 14 agent tools (7 READ + 6 WRITE + 1 GENERIC) extracted from tau2-bench's telecom tools.py in p2m YAML format.
Pipeline: policy (15 behaviors) → design → seeds (50 prompts + 20 scenarios) → rollout (model + simulated tools, 12 turns) → judge (4 dimensions: workflow_violation, policy_adherence, communication_quality, escalation_judgment). Default target is azure/gpt-4o-mini; override via --set for multi-model correlation runs.
Documents motivation, file layout, running instructions, suggested model set, correlation workflow, and design decisions.
Supports --stages tau2,p2m,correlate for selective execution, --models for custom model sets, and --dry-run for preview. Generates per-model YAML configs since p2m has no --set flag.
Replace --set flag examples (not supported) with run_correlation.py usage, stage descriptions, and selective execution examples.
tau2 saves simulations to {DATA_DIR}/simulations/, where DATA_DIR is
resolved by the tau2 package (typically <tau3-bench>/data/). The previous
hardcoded path 'data/tau2/simulations/' was incorrect.
Uses gpt-5.4-nano with tiny sample sizes (3 behaviors, 3 prompts, 2 scenarios) to verify the p2m pipeline works end-to-end in ~90s.
target.trace was only checked inside the callable target path, so model-only targets silently ignored trace config. Add a call to phoenix.otel.register(auto_instrument=True) in the hosted session path so litellm calls emit spans to Phoenix automatically. Idempotent (module-level flag) and graceful when arize-phoenix-otel is not installed.
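A minimal sketch of the idempotent, fail-soft registration described above. The function name `ensure_tracing` and the injectable `register_fn` parameter are hypothetical (added here so the logic can be exercised without Phoenix installed); only `phoenix.otel.register(auto_instrument=True)` comes from the commit message.

```python
from typing import Callable, Optional

# Module-level flag: once set, later calls become no-ops (idempotent).
_tracing_registered = False


def ensure_tracing(register_fn: Optional[Callable[..., object]] = None) -> bool:
    """Register Phoenix auto-instrumentation at most once.

    Returns True only on the call that actually performed the registration.
    `register_fn` is a hypothetical hook for tests; by default the real
    phoenix.otel.register is imported lazily.
    """
    global _tracing_registered
    if _tracing_registered:
        return False
    if register_fn is None:
        try:
            from phoenix.otel import register as register_fn
        except ImportError:
            # arize-phoenix-otel is not installed: skip tracing quietly.
            return False
    register_fn(auto_instrument=True)
    _tracing_registered = True
    return True
```

Calling this from the hosted session path before the first litellm call would give model-only targets the same trace behavior as callable targets, assuming the session code has a single entry point.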
- Change default_model and all stage models from gpt-5.4-mini to gpt-5.4-nano (gpt-5.4-mini is not deployed)
- Fix toolset path: bare filename instead of full relative path (p2m resolves relative to config file directory)
- Add Phoenix tracing to rollout target in both configs
Bug fixes:
- Add missing shutil import for binary resolution
- Resolve tau2/p2m binaries via shutil.which() with venv fallback
- Write temp configs next to source (not in results/) so relative paths resolve correctly
- Fix score parsing: verdict.dimensions (not verdicts/scores)
- Change TAU2_USER_LLM to gpt-5.4-nano (gpt-5.4-mini not deployed)
- Move global statement to function top

Cost/progress improvements:
- Stream subprocess output to terminal instead of capture_output=True (tau2 progress bars and per-sim costs are now visible)
- Add elapsed time tracking per command and per stage
- Add model progress counters (e.g. 'tau2 model 1/2: azure/gpt-5.4')
- Add post-stage cost summaries (tau2: sim count + USD, p2m: tokens)
- Add pre-run confirmation prompts with cost/time estimates
- Add --yes/-y flag to skip confirmation (for CI/automated runs)
- Add --max-concurrency flag to control tau2 parallelism (was hardcoded)
- Add shutil.which guard with helpful install instructions on tau2 missing
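The binary lookup with venv fallback mentioned in these commits could look roughly like the sketch below. `resolve_binary` and its `venv_dir` parameter are hypothetical names; the only call taken from the commit text is `shutil.which()`.

```python
import shutil
import sys
from pathlib import Path
from typing import Optional


def resolve_binary(name: str, venv_dir: Optional[Path] = None) -> Optional[str]:
    """Find an executable on PATH, falling back to the virtualenv's script dir."""
    found = shutil.which(name)
    if found:
        return found
    # Assumption: the active interpreter's prefix is the venv root.
    root = venv_dir or Path(sys.prefix)
    for sub in ("bin", "Scripts"):  # POSIX and Windows venv layouts
        for candidate in (root / sub / name, root / sub / f"{name}.exe"):
            if candidate.is_file():
                return str(candidate)
    return None
```

A caller can then print install instructions and exit when the result is `None`, instead of failing later with an opaque `FileNotFoundError` from `subprocess`.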
Stages: tau2, p2m, correlate (selectable via workflow_dispatch). Caches tau2 simulation results and p2m artifacts across runs.
- Add models.yaml with endpoint definitions, quick/full presets, and 9-model inventory with per-model preset membership
- Rename run_correlation.py → run_comparison.py
- Add load_models_config(), get_preset_models(), get_preset_overrides() to read models.yaml and resolve preset parameters
- Add --preset CLI arg (quick/full) that sets models, trials, concurrency from models.yaml
- Rename CLI args: --tau2-trials → --trials, --tau2-user-llm → --user-model, --max-concurrency → --concurrency, --p2m-seed-count → --test-cases
- Remove hardcoded DEFAULT_MODELS list (replaced by models.yaml)
- Rename constants: TAU2_NUM_TRIALS → DEFAULT_TRIALS, TAU2_USER_LLM → DEFAULT_USER_MODEL, TAU2_MAX_CONCURRENCY → DEFAULT_CONCURRENCY
- Eliminate global mutation pattern (was: global TAU2_NUM_TRIALS)
- concept: → behavior: with inline description (absorb concept.md)
- factors: → pipeline.test_set.stratify.dimensions:
- pipeline.policy: → pipeline.systematize:
- behavior_count → behavior_category_count
- pipeline.seeds: → pipeline.test_set:
- pipeline.rollout: → pipeline.inference:
- auditor: → tester:
- max_turns: 12 → 5
- suite name: telecom-tau2-correlation (drop -v1 suffix)
- Delete concept.md (content now inline in behavior.description)
- Update run_comparison.py config path and terminology
models.yaml maps each model to a region endpoint key, which resolves to a region-specific env var (e.g. AZURE_API_BASE_WESTUS2). Each subprocess inherits a tailored env dict with AZURE_API_BASE set to the correct URL for that model's region.
- _model_entry(), resolve_endpoint_url(), resolve_endpoint_env_var() for the lookup chain: model → endpoint key → env var → URL
- validate_endpoints() exits early with actionable diagnostics when required env vars are missing
- make_model_env() builds per-subprocess env with AZURE_API_BASE set
- run_tau2() and run_p2m() accept models_config and pass env to run_cmd()
- main() calls validate_endpoints() before expensive API stages
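To illustrate the lookup chain (model → endpoint key → env var → URL), here is a small sketch. The function names follow the commit message, but their signatures, the `MODELS_CONFIG` layout, and the example URL are assumptions; the real models.yaml schema may differ.

```python
import os
from typing import Mapping, Optional

# Hypothetical shape of the parsed models.yaml.
MODELS_CONFIG = {
    "endpoints": {"default": "AZURE_API_BASE", "westus2": "AZURE_API_BASE_WESTUS2"},
    "models": {"azure/gpt-5.4": {"endpoint": "westus2"}},
}


def resolve_endpoint_env_var(config: dict, model: str) -> str:
    """model -> endpoint key -> name of the env var holding the base URL."""
    endpoint_key = config["models"][model]["endpoint"]
    return config["endpoints"][endpoint_key]


def resolve_endpoint_url(config: dict, model: str,
                         environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """env var name -> URL, or None when the variable is unset."""
    return environ.get(resolve_endpoint_env_var(config, model))


def make_model_env(config: dict, model: str,
                   environ: Mapping[str, str] = os.environ) -> dict:
    """Copy the environment and point AZURE_API_BASE at the model's region."""
    url = resolve_endpoint_url(config, model, environ)
    if url is None:
        raise KeyError(f"{resolve_endpoint_env_var(config, model)} is not set for {model}")
    env = dict(environ)  # never mutate the parent environment
    env["AZURE_API_BASE"] = url
    return env
```

Building a fresh dict per subprocess keeps one model's region from leaking into the next model's run.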
Add discover_tau2_results() and discover_p2m_results() to detect per-model output files that already exist on disk. Replace the simple confirm_stage() with plan_and_confirm() which shows 'Done' vs 'To run' model lists and returns only the pending models. main() now skips already-completed models within a stage, merges existing results with new results, and extracts suite_name earlier so p2m discovery can use it.
--force re-runs all models even if results already exist on disk, bypassing auto-discovery. After each stage completes, save intermediate tau2_rewards.json and p2m_scores.json to results/. The correlate stage loads these automatically when run separately (--stages=correlate), enabling multi-session workflows where tau2 and p2m run independently.
_progress_line() shows '[3/9] elapsed 24m, ETA ~48m' based on average time per completed model. First model shows just '[1/N]' since no timing data is available yet. Applied to both run_tau2() and run_p2m() loop headers.
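One way such a progress header could be computed is sketched below. The signature and the ETA formula (average seconds per completed model times the models still to run, including the current one) are assumptions, so the numbers need not match the illustrative example in the commit message.

```python
def progress_line(completed: int, total: int, elapsed_s: float) -> str:
    """Header for the model about to start; `completed` models are already done."""
    head = f"[{completed + 1}/{total}]"
    if completed == 0:
        # No timing data yet, so only the counter is shown.
        return head
    avg_s = elapsed_s / completed
    eta_s = avg_s * (total - completed)  # remaining models, current one included
    return f"{head} elapsed {round(elapsed_s / 60)}m, ETA ~{round(eta_s / 60)}m"
```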
run_p2m() now accepts test_cases, max_turns, and judge_model parameters. These are resolved from CLI args and preset overrides in models.yaml (test_cases, max_turns, judge_model fields). When test_cases is set, both prompt and scenario sample_size are patched (scenario = test_cases // 3, min 1). Delete eval_config_smoke.yaml — its role is replaced by the 'quick' preset in models.yaml.
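The sample-size patching rule (scenario = test_cases // 3, min 1) can be sketched as follows. The helper name and the `prompts` / `scenarios` keys under `pipeline.test_set` are assumptions about the config layout; only the arithmetic comes from the commit message.

```python
import copy


def patch_sample_sizes(config: dict, test_cases: int) -> dict:
    """Return a copy of the config with prompt and scenario sample sizes patched."""
    patched = copy.deepcopy(config)  # leave the source config untouched
    test_set = patched.setdefault("pipeline", {}).setdefault("test_set", {})
    test_set.setdefault("prompts", {})["sample_size"] = test_cases
    # Scenarios get a third of the prompt budget, but never fewer than one.
    test_set.setdefault("scenarios", {})["sample_size"] = max(1, test_cases // 3)
    return patched
```

For example, `test_cases=2` still yields one scenario, which keeps the quick preset from producing an empty scenario set.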
- Rewrite README with preset-based quick start, CLI reference, stage docs, multi-endpoint setup, and current model inventory
- Fix CI workflow: rename run_correlation.py to run_comparison.py, replace --max-concurrency/--tau2-trials with --concurrency/--trials, add --preset input option, add multi-endpoint env vars, add --force
5bcff83 to cce2def
- Import and call load_dotenv() to pick up .env from repo root
- Add TAU2_DATA_DIR constant with env var override support
- Fix DEFAULT_USER_MODEL from non-existent gpt-5.4-nano to gpt-5.4-mini
Multi-endpoint infrastructure for running tau2 across Azure regions:
- models.yaml: add api_keys and user_simulator sections per endpoint
- resolve_api_key_env_var(): look up API key env var by model endpoint
- resolve_user_model(): pick user-sim model co-located on agent endpoint
- validate_endpoints(): check both base URL and API key env vars
- validate_tau2_data(): verify tau2 domain data exists before running
- make_model_env(): set AZURE_API_KEY and TAU2_DATA_DIR in subprocess env
- discover_tau2_results(): validate JSON has actual simulations
- run_tau2(): use per-endpoint user simulator, use TAU2_DATA_DIR paths
- .gitignore: exclude local data/ symlink
Quick validation script that tests each configured Azure endpoint by sending a minimal chat completion request. Supports --list to show endpoint configuration without making API calls.
- .env.example: add per-region API key/base vars and TAU2_DATA_DIR
- README.md: add data directory setup instructions with symlink example
314b3bd to 15cd2f4
Filter stderr from tau2 subprocesses to suppress repetitive litellm 'model isn't mapped yet' warnings (show only first occurrence) and drop litellm promo/feedback lines entirely.
15cd2f4 to 428f99b
p2m only supports systematize, test_set, inference, and judge. The 'design' stage caused 'Unknown stage(s): design' on every run.
After suppressing noisy litellm/loguru lines, adjacent blank lines were still passing through, creating long stretches of empty output. Track whether the last line was suppressed and skip blank lines that immediately follow.
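The stderr filtering described in these two commits (show a repeated warning once, drop promo lines, swallow blank lines that trail a suppressed line) can be sketched as a generator. The marker strings below are assumptions chosen for illustration, not the exact patterns the script matches.

```python
from typing import Iterable, Iterator

# Assumed substrings; the real filter may match different text.
DROP_MARKERS = ("Give Feedback", "LiteLLM.Info")
WARN_ONCE_MARKER = "isn't mapped yet"


def filter_stderr(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines worth showing to the user."""
    warned = False
    last_suppressed = False
    for line in lines:
        text = line.rstrip("\n")
        if any(marker in text for marker in DROP_MARKERS):
            last_suppressed = True
            continue
        if WARN_ONCE_MARKER in text:
            if warned:
                last_suppressed = True
                continue
            warned = True  # first occurrence passes through
        elif not text.strip() and last_suppressed:
            # Blank line right after suppressed output: drop it, and keep the
            # flag set so a whole run of blank lines is swallowed.
            continue
        last_suppressed = False
        yield text
```

Blank lines that follow visible output are preserved, so intentional spacing in tau2's own output survives.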
- Default: 5 → 10
- Quick preset: 10 → 30
- Full preset: 24 → 40

Azure OpenAI deployments easily handle 30-40 concurrent requests. With 114 tau2 tasks (57 scenarios × 2 trials), higher concurrency cuts wall-clock time significantly.
p2m's pipeline uses multiple models across stages (systematize, test_set, tester, judge) that all share a single AZURE_API_BASE env var. When the inference target model is on a different endpoint (e.g. westus2) than the pipeline models (e.g. default), overriding AZURE_API_BASE globally caused all pipeline stages to fail with DeploymentNotFound.

Fix: keep AZURE_API_BASE pointing at the default endpoint for pipeline models. Route only the target model to its specific endpoint by passing a _P2M_MODEL_ROUTING JSON env var to _p2m_shim.py, which monkey-patches litellm.acompletion to inject per-model api_base and api_key.
- run_p2m() now calls _p2m_shim.py instead of p2m directly
- New make_p2m_env() builds routing table instead of overriding AZURE_API_BASE
- Quick preset concurrency: 30 → 20 (less likely to hit rate limits)
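The per-model routing idea can be sketched as a wrapper around any async completion function. The JSON layout of `_P2M_MODEL_ROUTING` (model name mapped to `api_base` and `api_key`) and the helper names are assumptions; the actual shim may store env var names rather than literal values.

```python
import asyncio
import functools
import json
import os
from typing import Any, Awaitable, Callable, Mapping


def load_routing(environ: Mapping[str, str] = os.environ) -> dict:
    """Parse the routing table from the environment (empty when unset)."""
    return json.loads(environ.get("_P2M_MODEL_ROUTING", "{}"))


def with_model_routing(acompletion: Callable[..., Awaitable[Any]],
                       routing: Mapping[str, Mapping[str, str]]) -> Callable[..., Awaitable[Any]]:
    """Wrap an async completion call so routed models get their own endpoint."""
    @functools.wraps(acompletion)
    async def routed(*args: Any, **kwargs: Any) -> Any:
        model = kwargs.get("model") or (args[0] if args else None)
        route = routing.get(model)
        if route:
            # setdefault keeps any explicit api_base/api_key the caller passed.
            kwargs.setdefault("api_base", route["api_base"])
            kwargs.setdefault("api_key", route["api_key"])
        return await acompletion(*args, **kwargs)
    return routed


# In a shim, the patch itself would then be (requires litellm):
#   import litellm
#   litellm.acompletion = with_model_routing(litellm.acompletion, load_routing())
```

Models absent from the table fall through untouched, so pipeline models keep using the default AZURE_API_BASE.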
When AZURE_API_BASE is not in the environment (only region-specific variants exist), p2m pipeline models (systematize, test_set, judge) fail with DeploymentNotFound because they have no default endpoint.
- models.yaml: add pipeline_endpoint pointing to australiaeast
- make_p2m_env(): fall back to pipeline_endpoint for AZURE_API_BASE
- validate_endpoints(): check pipeline endpoint env vars are set
tau2 appends simulations to existing JSON files across runs, causing accumulated sim counts that don't match the requested trial count. collect_tau2_rewards() then averages over all accumulated sims, producing misleading reward scores.
- discover_tau2_results(): accept expected_trials param, remove files where sim count doesn't match (stale from prior runs)
- main(): pass trials count to discover_tau2_results()
Concurrency presets: quick 20 → 5, full 40 → 10
Resume: partial result files are kept on disk instead of deleted. The PTY now connects stdin so tau2's interactive 'resume?' prompt is auto-answered with 'y'. discover_tau2_results() uses the JSON tasks list to compute expected sim count (n_tasks × trials); only files with MORE sims than expected are treated as stale.

Error reporting: after a tau2 failure, _summarize_tau2_run() reads the partial output file and captured ERROR lines to produce a grouped summary (e.g. '12× empty model response, 3× rate-limited') instead of the previous generic 'tau2 failed, skipping'.
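The resume-aware staleness rule can be sketched as a small classifier. The function name, the returned labels, and the `tasks` / `simulations` key names are assumptions about the tau2 output file; the rule itself (expected = n_tasks × trials, stale only when there are more sims than expected) follows the commit message.

```python
import json
from pathlib import Path


def classify_tau2_result(path: Path, trials: int) -> str:
    """Return 'missing', 'partial', 'complete', or 'stale' for one result file."""
    if not path.is_file():
        return "missing"
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return "missing"  # unreadable files are treated like absent ones
    n_sims = len(data.get("simulations", []))
    expected = len(data.get("tasks", [])) * trials
    if n_sims == 0:
        return "missing"
    if n_sims > expected:
        return "stale"  # accumulated sims from earlier runs
    return "complete" if n_sims == expected else "partial"
```

Under this rule a 'partial' file is left in place for tau2 to resume, while a 'stale' one would be removed before the rerun.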
- Add --tau2-retries flag (default 3) for resilience against tau2 crashes
- Print tau2 completion table after simulation stage
- Report Spearman p-values, sample sizes, and significance markers
- Show both full and filtered (>=50% completion) correlation results
- Suppress OTEL/gRPC exporter noise when Phoenix isn't running
tau2 upstream has two unhandled exceptions (JSONDecodeError, ValueError for empty AssistantMessage) that crash the entire ThreadPoolExecutor, killing all in-flight tasks. With concurrency=10, each crash wastes ~10 tasks. Lowering to 2 limits blast radius per crash; bumping retries from 3 to 10 gives the resume loop enough attempts to grind through error-prone models like gpt-oss-120b (2% completion at concurrency=10, 3 retries).
- Auto-log each run to logs/run_YYYYMMDD_HHMMSS.log (+ --log-file flag)
- .gitignore: exclude logs/ directory
- After tau2 exits 0, verify actual sim count matches expected before declaring success; retry on incomplete runs
- Reduce DEFAULT_TAU2_RETRIES from 10 to 3: investigation showed diminishing returns (~9 new sims per retry for failing models)
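A sketch of the retry loop with post-run verification described above. The function name and callback signatures are hypothetical, and treating `retries` as the total number of attempts is an assumption; the script may count the first run separately.

```python
from typing import Callable


def run_until_complete(run_once: Callable[[], int],
                       count_sims: Callable[[], int],
                       expected: int,
                       retries: int = 3) -> bool:
    """Run tau2 up to `retries` times; succeed only when the sim count is complete.

    run_once returns the process exit code; count_sims reads the result file.
    """
    for attempt in range(1, retries + 1):
        exit_code = run_once()
        done = count_sims()
        # Exit code 0 alone is not trusted: the output must also be complete.
        if exit_code == 0 and done >= expected:
            return True
        print(f"attempt {attempt}/{retries}: exit={exit_code}, sims={done}/{expected}")
    return False
```

Because partial results stay on disk, each retry resumes from the previous count instead of starting over.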
- Fix concept.md ghost reference: point to eval_config.yaml instead
- Fix mini preset judge_model (gpt-5.4, not gpt-5.4-mini)
- Fix tau2 results path (data/simulations/, not results/tau2/)
- Fix p2m artifacts path (no -v1 suffix in suite name)
- Add --log-file to CLI options table
- Soften motivation framing: study checks ranking agreement, not making a trust claim
- Add 'Inspecting results' section with data completeness checks, correlation output interpretation, and p2m artifact navigation
Loads results/correlation_results.json and raw simulation data to produce styled tables, grouped bar charts, a Spearman heatmap, scatter plots per dimension, reward distribution histograms, and per-task disagreement analysis. Charts saved to results/ as PNGs.
generate_report.py produces a self-contained HTML file with all charts embedded as base64 PNGs. Supports --out and --no-open flags. README updated with report generation section and file table entries.
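Embedding a chart as a base64 PNG, which is what makes the HTML self-contained, comes down to a data URI. The helper name below is hypothetical and this is only a sketch of the technique, not the script's actual code.

```python
import base64
import html


def png_img_tag(png_bytes: bytes, alt: str) -> str:
    """Return an <img> tag whose src is an inline base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    safe_alt = html.escape(alt, quote=True)
    return f'<img alt="{safe_alt}" src="data:image/png;base64,{encoded}">'
```

With matplotlib, the bytes can come from saving a figure into an `io.BytesIO` buffer with `format="png"`, so no image files need to ship alongside the report.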
- Add _recompute_correlations() helper for subset Spearman ρ
- build_report() now accepts min_sims and subtitle parameters
- Filter out models with fewer than min_sims tau2 simulations
- main() generates report.html (all models) + report_filtered.html
- Replace --no-open with --open (browser stays closed by default)
- Add --min-sims flag (default: 50)
- Add 'report' to valid_stages and default stages
- Report stage generates both full and filtered HTML reports
- Runs automatically after correlate stage
- Document automatic report stage in pipeline
- Document two-report output (all + filtered)
- Document --open and --min-sims flags
… section
- Single report.html with: data status, full analysis, filtered analysis, reward distributions
- Data status section shows all tau2/p2m models with coverage gaps
- Filtered section excludes models with <min_sims tau2 simulations
- If no models are filtered, shows 'no filtering needed' message
- Simplify main() and run_comparison report stage for single output
The correlate stage was a leftover from when correlation analysis was done inside the pipeline. It has been fully replaced by the standalone generate_report.py script that reads artifacts from disk.
- Remove _build_model_correlation_results() and supporting imports
- Remove backward-compat block that skipped missing correlate data
- Remove 'correlate' from valid_stages and default stage list
- Update docstring to reflect tau2 → p2m → report pipeline
Enrich the correlation report with two new sections:
1. Completion percentages in the data status table:
   - tau2 Complete: sims / (tasks × trials), color-coded green/orange
   - p2m Complete: scored / expected, color-coded green/orange
   - Track task counts and p2m progress from disk artifacts
2. Eval Configuration metadata box (shown at top of report):
   - Suite name, behavior, trials per task
   - Judge model, user simulator models
   - Full list of target models under evaluation

Also fix trials resolution to use max across presets (matches full/default runs) instead of picking the first preset found.
Summary
Adds an end-to-end correlation study comparing tau2-bench (task-completion simulator) scores against p2m (spec-driven eval) scores for telecom domain agents.
The goal: validate that p2m eval dimensions capture the same signal as tau2 pass^k metrics by computing Spearman rank correlation across multiple models.
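For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation of the two rank vectors. The dependency-free sketch below shows the computation with average ranks for ties; it is illustrative only and does not compute p-values. A library routine such as `scipy.stats.spearmanr` returns both ρ and a p-value.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank vectors of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant input has no defined rank correlation
    return cov / (var_x * var_y) ** 0.5
```

Here `x` would hold one tau2 score per model and `y` the matching p2m dimension score, so ρ near 1 means the two evals rank the models the same way.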
What's included
- run_correlation.py: single entry point that runs tau2, p2m, and correlation analysis for a configurable set of models with cost estimation, progress tracking, and dry-run support
- workflow_dispatch for running individual stages in CI

CLI usage
Status
Commits
Small focused commits — each does one thing. See commit log for the full story.