Add telecom tau2↔p2m Spearman correlation study#65

Draft
tangym wants to merge 52 commits into main from yemingtang/telecom-tau2-correlation

@tangym (Collaborator) commented May 19, 2026

Summary

Adds an end-to-end correlation study comparing tau2-bench (task-completion simulator) scores against p2m (spec-driven eval) scores for telecom domain agents.

The goal: validate that p2m eval dimensions capture the same signal as tau2 pass^k metrics by computing Spearman rank correlation across multiple models.
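
As a minimal sketch of the statistic this study relies on, the snippet below computes Spearman's rho as the Pearson correlation of tie-averaged ranks, using only the standard library. The model names and scores are placeholders invented for illustration, not results from this PR, and the helper names (`average_ranks`, `spearman_rho`) are hypothetical rather than taken from `run_correlation.py`.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5


# Placeholder numbers for illustration only; they are not results from this study.
tau2_pass_rate = {"model-a": 0.62, "model-b": 0.41, "model-c": 0.55, "model-d": 0.30}
p2m_policy_adherence = {"model-a": 0.81, "model-b": 0.66, "model-c": 0.70, "model-d": 0.52}

models = sorted(tau2_pass_rate)
rho = spearman_rho(
    [tau2_pass_rate[m] for m in models],
    [p2m_policy_adherence[m] for m in models],
)
print(f"Spearman rho = {rho:.3f}")
```

Because rho depends only on rank order, it is insensitive to the two benchmarks using different score scales. With only a handful of models the estimate is coarse, so p-values and sample sizes matter when reading the result.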

What's included

  • Telecom behavior spec & tool schemas — 4 judge dimensions (workflow_violation, policy_adherence, communication_quality, escalation_judgment) grounded in telecom customer-service scenarios
  • Orchestration script (run_correlation.py) — single entry point that runs tau2, p2m, and correlation analysis for a configurable set of models with cost estimation, progress tracking, and dry-run support
  • p2m eval config — 70 seeds (50 prompt + 20 scenario), max 12 turns, 5 concurrent rollouts
  • GitHub Actions workflow — manual workflow_dispatch for running individual stages in CI
  • Phoenix auto-instrumentation — for hosted (Prompt Agent) targets via OpenTelemetry

CLI usage

# Full pipeline
python examples/telecom_tau2_correlation/run_correlation.py \
  --stages tau2 p2m correlate \
  --models gpt-4o-mini gpt-4.1-nano gpt-5-mini gpt-5.4-nano

# Individual stages
python run_correlation.py --stages p2m --models gpt-5.4-nano
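
For readers who want to see how flags of this shape can be parsed, here is a small `argparse` sketch. It mirrors only the `--stages`, `--models`, and `--dry-run` flags named in this PR; the parser layout, defaults, and help text are assumptions, not the actual contents of `run_correlation.py`.

```python
import argparse

# Stage names as they appear in the CLI usage above.
VALID_STAGES = ("tau2", "p2m", "correlate")


def build_parser():
    """Hypothetical parser mirroring the flags shown in the CLI usage."""
    parser = argparse.ArgumentParser(prog="run_correlation.py")
    parser.add_argument(
        "--stages",
        nargs="+",                    # space-separated list, e.g. --stages tau2 p2m
        choices=VALID_STAGES,
        default=list(VALID_STAGES),   # assumption: run every stage when omitted
        help="pipeline stages to run",
    )
    parser.add_argument(
        "--models",
        nargs="+",
        required=True,
        help="model names to evaluate",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="print the planned commands without executing them",
    )
    return parser


args = build_parser().parse_args(
    ["--stages", "p2m", "--models", "gpt-5.4-nano", "--dry-run"]
)
print(args.stages, args.models, args.dry_run)
```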

Status

  • p2m runs complete for gpt-5.4-nano and gpt-5.4
  • tau2 run complete for gpt-5.4-nano (4 trials, 326 sims)
  • tau2 + p2m runs for remaining models (gpt-4o-mini, gpt-4.1-nano, gpt-5-mini)
  • Correlation analysis (need ≥3 models)

Commits

Small focused commits — each does one thing. See commit log for the full story.

@changliu2 (Collaborator) commented

A few model-selection + perf knobs from the model-coverage / inference-speedup discussion:

Foundry-only model list for the tau2 study (range of quality)

Range chosen for measurable spread on the same harness; only Foundry endpoints so latency is bounded by Azure data-plane. Confirm exact deployment names against our Foundry inventory before wiring.

| Tier | Model | LiteLLM string | Notes |
| --- | --- | --- | --- |
| Frontier | grok-4.3 | azure/grok-4-3 | top of the AA chart we are calibrating against |
| Frontier | grok-4-20-reasoning | azure/grok-4-20-reasoning | reasoning variant |
| Frontier | GPT-5.5 (xhigh) | azure/gpt-5.5 | OpenAI top-tier anchor |
| Frontier | Claude Opus 4.7 (max) | azure/claude-opus-4-7-max | Anthropic top-tier; cross-vendor anchor |
| Strong mid | grok-4 | azure/grok-4 | xAI mid-frontier |
| Strong mid | GPT-5.4 (xhigh) | azure/gpt-5.4 | default-strong baseline |
| Strong mid | DeepSeek-V3.1 | azure/deepseek-v3-1 | strong OSS reasoning class |
| Strong mid | Claude Sonnet 4.6 | azure/claude-sonnet-4-6 | Anthropic mid-tier |
| Strong mid | Mistral Medium 3.5 | azure/mistral-medium-3-5 | EU-hosted mid-tier |
| Fast mid | grok-4-1-fast-reasoning | azure/grok-4-1-fast-reasoning | speed-optimized reasoning |
| Fast mid | grok-4-fast-reasoning | azure/grok-4-fast-reasoning | speed-optimized reasoning |
| Fast mid | GPT-5.4 mini (xhigh) | azure/gpt-5.4-mini | our current default judge/auditor |
| Fast mid | grok-3 | azure/grok-3 | older frontier; reference point |
| Small/fast | DeepSeek-V3-0324 | azure/deepseek-v3-0324 | mid-tier OSS reference |
| Small/fast | Llama-3.3-70B-Instruct | azure/llama-3-3-70b-instruct | Meta open-weight baseline |
| Small/fast | gpt-oss-120B (high) | azure/gpt-oss-120b | OpenAI open-weight |
| Small/fast | gpt-oss-20B (high) | azure/gpt-oss-20b | smallest OSS; judge-failure floor probe |

Recommended 3-model rotation for the regression-gate CI (keep PR latency under ~15 min): azure/gpt-5.4-mini (default), azure/gpt-5.4, one cross-vendor (azure/claude-sonnet-4-6 or azure/grok-4-fast-reasoning). The full 17 above run in the monthly bulk pass.

I deliberately dropped Google Gemini — not on Foundry today. That's tracked separately under the 3P-endpoint testing pilot Jake owns.

Inference-speedup knobs (the dev complaint about tau2 being slow)

The dominant cost on tau2 telecom is not the framework — it's (auditor turns) × (agent turns) × (tool roundtrips) × (judge calls). Three knobs, ordered by ROI:

  1. Bump concurrency. The default inference.concurrency is a conservative 4. The PR #43 family (Joint AgentShield + p2m incident-triage demo, //build 2026 candidate) bumped it to 24 for the incident-triage demo, and the same change applies here. It is a single-line fix with a ~6× wall-clock improvement on multi-seed runs:

    pipeline:
      inference:
        concurrency: 24
        request_timeout: 120
  2. Cap max_turns to a reasonable value. Today the example configs run with max_turns: 10 or unbounded. For tau2 telecom that's overkill — telecom triage conversations resolve in ≤ 5 turns most of the time. Recommend:

    pipeline:
      inference:
        target:
          max_turns: 5    # 10 if you want headroom for recovery beats; 5 is fine for the regression rail

    This alone cuts ~30–50% of wall time on scenarios that would otherwise drift to the 10-turn ceiling.

  3. Faster auditor and judge models. Drop the auditor and judge to azure/gpt-5.4-mini for the routine runs; only escalate to azure/gpt-5.4 for tie-breaker / publication runs. Saves another ~30%.
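
The effect of the first two knobs can be sanity-checked with a back-of-envelope model. The sketch below is an illustrative simplification, not a measurement: it assumes each turn costs one tester (auditor) call, one agent call, and a fixed number of tool follow-up calls, plus one judge call per dimension, and it assumes calls parallelize perfectly across the concurrency limit. The seed count (70) and the four judge dimensions come from this PR; the tool roundtrips per turn and the per-call latency are placeholder values.

```python
import math


def conversation_calls(turns, tool_roundtrips_per_turn, judge_calls):
    """Model calls for one conversation under the simplified accounting."""
    per_turn = 1 + 1 + tool_roundtrips_per_turn  # tester + agent + tool follow-ups
    return turns * per_turn + judge_calls


def estimated_wall_seconds(
    seeds,
    turns,
    concurrency,
    *,
    tool_roundtrips_per_turn=2,  # placeholder assumption
    judge_calls=4,               # one call per judge dimension
    seconds_per_call=3.0,        # placeholder assumption
):
    """Idealized wall time: calls run in perfectly parallel waves."""
    total_calls = seeds * conversation_calls(
        turns, tool_roundtrips_per_turn, judge_calls
    )
    waves = math.ceil(total_calls / concurrency)
    return waves * seconds_per_call


baseline = estimated_wall_seconds(seeds=70, turns=10, concurrency=4)
capped = estimated_wall_seconds(seeds=70, turns=5, concurrency=4)
parallel = estimated_wall_seconds(seeds=70, turns=10, concurrency=24)

print(f"max_turns 10 -> 5: {1 - capped / baseline:.0%} less wall time")
print(f"concurrency 4 -> 24: {baseline / parallel:.1f}x faster")
```

Real runs will fall short of these ideal figures because turns within a conversation are sequential and rate limits cap effective concurrency, so treat the output as an upper bound on the achievable speedup.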

On switching to UK AI Inspect to speed up tau2

Short answer: it will not help. Inspect has the same fundamental loop (spec → generate → run → judge); switching frameworks does not reduce model-call count, which is the actual bottleneck. We also lose OpenInference auto-trace (8/8 → 1/8 observability on judge side) and the multi-turn auditor pressure model that makes tau2 telecom discriminative in the first place. Inspect-compatible export is on our roadmap so we can publish to that ecosystem without switching frameworks.

Happy to file a separate issue tracking the tau2-speedup playbook if useful.

tangym added 22 commits May 20, 2026 00:36
concept.md: Telecom customer service agent behavior specification
derived from tau2-bench's main_policy.md. Covers 7 operational areas
(customer lookup, billing, line management, data refueling, plan
changes, roaming, tech support) with quality and safety expectations.

telecom_tools.yaml: 14 agent tools (7 READ + 6 WRITE + 1 GENERIC)
extracted from tau2-bench's telecom tools.py in p2m YAML format.
Pipeline: policy (15 behaviors) → design → seeds (50 prompts +
20 scenarios) → rollout (model + simulated tools, 12 turns) →
judge (4 dimensions: workflow_violation, policy_adherence,
communication_quality, escalation_judgment).

Default target is azure/gpt-4o-mini; override via --set for
multi-model correlation runs.
Documents motivation, file layout, running instructions, suggested
model set, correlation workflow, and design decisions.
Supports --stages tau2,p2m,correlate for selective execution,
--models for custom model sets, and --dry-run for preview.
Generates per-model YAML configs since p2m has no --set flag.
Replace --set flag examples (not supported) with run_correlation.py
usage, stage descriptions, and selective execution examples.
tau2 saves simulations to {DATA_DIR}/simulations/, where DATA_DIR is
resolved by the tau2 package (typically <tau3-bench>/data/). The previous
hardcoded path 'data/tau2/simulations/' was incorrect.
Uses gpt-5.4-nano with tiny sample sizes (3 behaviors, 3 prompts,
2 scenarios) to verify the p2m pipeline works end-to-end in ~90s.
target.trace was only checked inside the callable target path, so
model-only targets silently ignored trace config.  Add a call to
phoenix.otel.register(auto_instrument=True) in the hosted session
path so litellm calls emit spans to Phoenix automatically.

Idempotent (module-level flag) and graceful when arize-phoenix-otel
is not installed.
- Change default_model and all stage models from gpt-5.4-mini to
  gpt-5.4-nano (gpt-5.4-mini is not deployed)
- Fix toolset path: bare filename instead of full relative path
  (p2m resolves relative to config file directory)
- Add Phoenix tracing to rollout target in both configs
Bug fixes:
- Add missing shutil import for binary resolution
- Resolve tau2/p2m binaries via shutil.which() with venv fallback
- Write temp configs next to source (not in results/) so relative
  paths resolve correctly
- Fix score parsing: verdict.dimensions (not verdicts/scores)
- Change TAU2_USER_LLM to gpt-5.4-nano (gpt-5.4-mini not deployed)
- Move global statement to function top

Cost/progress improvements:
- Stream subprocess output to terminal instead of capture_output=True
  (tau2 progress bars and per-sim costs are now visible)
- Add elapsed time tracking per command and per stage
- Add model progress counters (e.g. 'tau2 model 1/2: azure/gpt-5.4')
- Add post-stage cost summaries (tau2: sim count + USD, p2m: tokens)
- Add pre-run confirmation prompts with cost/time estimates
- Add --yes/-y flag to skip confirmation (for CI/automated runs)
- Add --max-concurrency flag to control tau2 parallelism (was hardcoded)
- Add shutil.which guard with helpful install instructions on tau2 missing
Stages: tau2, p2m, correlate (selectable via workflow_dispatch).
Caches tau2 simulation results and p2m artifacts across runs.
- Add models.yaml with endpoint definitions, quick/full presets, and
  9-model inventory with per-model preset membership
- Rename run_correlation.py → run_comparison.py
- Add load_models_config(), get_preset_models(), get_preset_overrides()
  to read models.yaml and resolve preset parameters
- Add --preset CLI arg (quick/full) that sets models, trials,
  concurrency from models.yaml
- Rename CLI args: --tau2-trials → --trials, --tau2-user-llm →
  --user-model, --max-concurrency → --concurrency,
  --p2m-seed-count → --test-cases
- Remove hardcoded DEFAULT_MODELS list (replaced by models.yaml)
- Rename constants: TAU2_NUM_TRIALS → DEFAULT_TRIALS,
  TAU2_USER_LLM → DEFAULT_USER_MODEL,
  TAU2_MAX_CONCURRENCY → DEFAULT_CONCURRENCY
- Eliminate global mutation pattern (was: global TAU2_NUM_TRIALS)
- concept: → behavior: with inline description (absorb concept.md)
- factors: → pipeline.test_set.stratify.dimensions:
- pipeline.policy: → pipeline.systematize:
- behavior_count → behavior_category_count
- pipeline.seeds: → pipeline.test_set:
- pipeline.rollout: → pipeline.inference:
- auditor: → tester:
- max_turns: 12 → 5
- suite name: telecom-tau2-correlation (drop -v1 suffix)
- Delete concept.md (content now inline in behavior.description)
- Update run_comparison.py config path and terminology
models.yaml maps each model to a region endpoint key, which resolves
to a region-specific env var (e.g. AZURE_API_BASE_WESTUS2).  Each
subprocess inherits a tailored env dict with AZURE_API_BASE set to
the correct URL for that model's region.

- _model_entry(), resolve_endpoint_url(), resolve_endpoint_env_var()
  for the lookup chain: model → endpoint key → env var → URL
- validate_endpoints() exits early with actionable diagnostics when
  required env vars are missing
- make_model_env() builds per-subprocess env with AZURE_API_BASE set
- run_tau2() and run_p2m() accept models_config and pass env to
  run_cmd()
- main() calls validate_endpoints() before expensive API stages
Add discover_tau2_results() and discover_p2m_results() to detect
per-model output files that already exist on disk. Replace the simple
confirm_stage() with plan_and_confirm() which shows 'Done' vs 'To run'
model lists and returns only the pending models.

main() now skips already-completed models within a stage, merges
existing results with new results, and extracts suite_name earlier
so p2m discovery can use it.
--force re-runs all models even if results already exist on disk,
bypassing auto-discovery.

After each stage completes, save intermediate tau2_rewards.json and
p2m_scores.json to results/. The correlate stage loads these
automatically when run separately (--stages=correlate), enabling
multi-session workflows where tau2 and p2m run independently.
_progress_line() shows '[3/9] elapsed 24m, ETA ~48m' based on
average time per completed model. First model shows just '[1/N]'
since no timing data is available yet.

Applied to both run_tau2() and run_p2m() loop headers.
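
One way to produce a header of this shape is sketched below. The signature is hypothetical, and the counting convention is an assumption: the line is printed before model `index` starts, so only `index - 1` models contribute timing data. The ETA in the commit message's example string may follow a different convention, so the numbers here are not expected to match it.

```python
def progress_line(index, total, elapsed_seconds):
    """Header for the model about to start; index is 1-based."""
    completed = index - 1
    if completed <= 0:
        # No model has finished yet, so there is nothing to average.
        return f"[{index}/{total}]"
    avg_seconds = elapsed_seconds / completed
    remaining = total - completed  # includes the model about to start
    elapsed_min = round(elapsed_seconds / 60)
    eta_min = round(avg_seconds * remaining / 60)
    return f"[{index}/{total}] elapsed {elapsed_min}m, ETA ~{eta_min}m"


print(progress_line(1, 9, 0))
print(progress_line(3, 9, 24 * 60))
```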
run_p2m() now accepts test_cases, max_turns, and judge_model
parameters. These are resolved from CLI args and preset overrides
in models.yaml (test_cases, max_turns, judge_model fields).

When test_cases is set, both prompt and scenario sample_size are
patched (scenario = test_cases // 3, min 1).

Delete eval_config_smoke.yaml — its role is replaced by the
'quick' preset in models.yaml.
- Rewrite README with preset-based quick start, CLI reference, stage
  docs, multi-endpoint setup, and current model inventory
- Fix CI workflow: rename run_correlation.py to run_comparison.py,
  replace --max-concurrency/--tau2-trials with --concurrency/--trials,
  add --preset input option, add multi-endpoint env vars, add --force
@tangym tangym force-pushed the yemingtang/telecom-tau2-correlation branch from 5bcff83 to cce2def Compare May 20, 2026 03:36
tangym added 4 commits May 20, 2026 08:12
- Import and call load_dotenv() to pick up .env from repo root
- Add TAU2_DATA_DIR constant with env var override support
- Fix DEFAULT_USER_MODEL from non-existent gpt-5.4-nano to gpt-5.4-mini
Multi-endpoint infrastructure for running tau2 across Azure regions:

- models.yaml: add api_keys and user_simulator sections per endpoint
- resolve_api_key_env_var(): look up API key env var by model endpoint
- resolve_user_model(): pick user-sim model co-located on agent endpoint
- validate_endpoints(): check both base URL and API key env vars
- validate_tau2_data(): verify tau2 domain data exists before running
- make_model_env(): set AZURE_API_KEY and TAU2_DATA_DIR in subprocess env
- discover_tau2_results(): validate JSON has actual simulations
- run_tau2(): use per-endpoint user simulator, use TAU2_DATA_DIR paths
- .gitignore: exclude local data/ symlink
Quick validation script that tests each configured Azure endpoint
by sending a minimal chat completion request. Supports --list to
show endpoint configuration without making API calls.
- .env.example: add per-region API key/base vars and TAU2_DATA_DIR
- README.md: add data directory setup instructions with symlink example
@tangym tangym force-pushed the yemingtang/telecom-tau2-correlation branch 2 times, most recently from 314b3bd to 15cd2f4 Compare May 20, 2026 08:25
Filter stderr from tau2 subprocesses to suppress repetitive litellm
'model isn't mapped yet' warnings (show only first occurrence) and
drop litellm promo/feedback lines entirely.
@tangym tangym force-pushed the yemingtang/telecom-tau2-correlation branch from 15cd2f4 to 428f99b Compare May 20, 2026 08:29
tangym and others added 25 commits May 20, 2026 08:32
p2m only supports systematize, test_set, inference, and judge.
The 'design' stage caused 'Unknown stage(s): design' on every run.
After suppressing noisy litellm/loguru lines, adjacent blank lines
were still passing through, creating long stretches of empty output.
Track whether the last line was suppressed and skip blank lines that
immediately follow.
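
The filtering rules from the two stderr commits above (show a repeated warning once, drop promo lines, and skip blank lines that trail a suppressed line) can be sketched as a small generator. The function name and marker strings are illustrative assumptions; the actual patterns matched in this PR are not shown in the commit messages.

```python
def filter_lines(lines, once_markers, drop_markers):
    """Yield lines, hiding repeats of once_markers and all drop_markers lines."""
    seen = set()
    last_suppressed = False
    for line in lines:
        text = line.rstrip("\n")
        if last_suppressed and not text.strip():
            # Blank line right after a suppressed line: skip it and keep the
            # flag set so a whole run of blank lines is dropped.
            continue
        suppressed = False
        if any(marker in text for marker in drop_markers):
            suppressed = True
        else:
            for marker in once_markers:
                if marker in text:
                    suppressed = marker in seen  # only the first one passes
                    seen.add(marker)
                    break
        last_suppressed = suppressed
        if not suppressed:
            yield text


raw = [
    "warn: model isn't mapped yet: gpt-x",
    "",
    "sim 1 done",
    "warn: model isn't mapped yet: gpt-x",
    "",
    "Give Feedback / Get Help: (link)",
    "",
    "sim 2 done",
]
cleaned = list(
    filter_lines(
        raw,
        once_markers=("model isn't mapped yet",),  # assumed marker text
        drop_markers=("Give Feedback",),           # assumed marker text
    )
)
print("\n".join(cleaned))
```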
- Default: 5 → 10
- Quick preset: 10 → 30
- Full preset: 24 → 40

Azure OpenAI deployments easily handle 30-40 concurrent requests.
With 114 tau2 tasks (57 scenarios × 2 trials), higher concurrency
cuts wall-clock time significantly.
p2m's pipeline uses multiple models across stages (systematize, test_set,
tester, judge) that all share a single AZURE_API_BASE env var. When the
inference target model is on a different endpoint (e.g. westus2) than the
pipeline models (e.g. default), overriding AZURE_API_BASE globally caused
all pipeline stages to fail with DeploymentNotFound.

Fix: keep AZURE_API_BASE pointing at the default endpoint for pipeline
models. Route only the target model to its specific endpoint by passing a
_P2M_MODEL_ROUTING JSON env var to _p2m_shim.py, which monkey-patches
litellm.acompletion to inject per-model api_base and api_key.
- run_p2m() now calls _p2m_shim.py instead of p2m directly
- New make_p2m_env() builds routing table instead of overriding AZURE_API_BASE
- Quick preset concurrency: 30 → 20 (less likely to hit rate limits)
When AZURE_API_BASE is not in the environment (only region-specific
variants exist), p2m pipeline models (systematize, test_set, judge)
fail with DeploymentNotFound because they have no default endpoint.

- models.yaml: add pipeline_endpoint pointing to australiaeast
- make_p2m_env(): fall back to pipeline_endpoint for AZURE_API_BASE
- validate_endpoints(): check pipeline endpoint env vars are set
tau2 appends simulations to existing JSON files across runs, causing
accumulated sim counts that don't match the requested trial count.
collect_tau2_rewards() then averages over all accumulated sims,
producing misleading reward scores.

- discover_tau2_results(): accept expected_trials param, remove files
  where sim count doesn't match (stale from prior runs)
- main(): pass trials count to discover_tau2_results()
Resume: partial result files are kept on disk instead of deleted.
The PTY now connects stdin so tau2's interactive 'resume?' prompt is
auto-answered with 'y'.  discover_tau2_results() uses the JSON tasks
list to compute expected sim count (n_tasks × trials) — only files
with MORE sims than expected are treated as stale.

Error reporting: after a tau2 failure, _summarize_tau2_run() reads
the partial output file and captured ERROR lines to produce a
grouped summary (e.g. '12× empty model response, 3× rate-limited')
instead of the previous generic 'tau2 failed, skipping'.
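
The resume rule above reduces to a three-way classification of each result file. The sketch below assumes the tau2 output JSON exposes `tasks` and `simulations` lists under those keys; the key names and the `classify_tau2_result` helper are hypothetical, chosen only to illustrate the n_tasks × trials comparison.

```python
def classify_tau2_result(payload, trials):
    """Return 'stale', 'complete', or 'partial' for one parsed result file."""
    expected = len(payload.get("tasks", [])) * trials
    actual = len(payload.get("simulations", []))
    if actual > expected:
        return "stale"      # more sims than requested: accumulated across runs
    if expected > 0 and actual == expected:
        return "complete"
    return "partial"        # kept on disk so tau2 can resume from it


# 57 tasks x 2 trials = 114 expected simulations, as in the concurrency commit.
tasks = [{"id": i} for i in range(57)]
for sim_count in (40, 114, 120):
    payload = {"tasks": tasks, "simulations": [{}] * sim_count}
    print(sim_count, classify_tau2_result(payload, trials=2))
```

Treating only the over-full case as stale is what allows partial files to survive between attempts instead of forcing a rerun from scratch.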
- Add --tau2-retries flag (default 3) for resilience against tau2 crashes
- Print tau2 completion table after simulation stage
- Report Spearman p-values, sample sizes, and significance markers
- Show both full and filtered (>=50% completion) correlation results
- Suppress OTEL/gRPC exporter noise when Phoenix isn't running
tau2 upstream has two unhandled exceptions (JSONDecodeError, ValueError for
empty AssistantMessage) that crash the entire ThreadPoolExecutor, killing all
in-flight tasks. With concurrency=10, each crash wastes ~10 tasks. Lowering
to 2 limits blast radius per crash; bumping retries from 3 to 10 gives the
resume loop enough attempts to grind through error-prone models like
gpt-oss-120b (2% completion at concurrency=10, 3 retries).
- Auto-log each run to logs/run_YYYYMMDD_HHMMSS.log (+ --log-file flag)
- .gitignore: exclude logs/ directory
- After tau2 exits 0, verify actual sim count matches expected before
  declaring success; retry on incomplete runs
- Reduce DEFAULT_TAU2_RETRIES from 10 to 3 — investigation showed
  diminishing returns (~9 new sims per retry for failing models)
- Fix concept.md ghost reference — point to eval_config.yaml instead
- Fix mini preset judge_model (gpt-5.4, not gpt-5.4-mini)
- Fix tau2 results path (data/simulations/, not results/tau2/)
- Fix p2m artifacts path (no -v1 suffix in suite name)
- Add --log-file to CLI options table
- Soften motivation framing — study checks ranking agreement,
  not making a trust claim
- Add 'Inspecting results' section with data completeness checks,
  correlation output interpretation, and p2m artifact navigation
Loads results/correlation_results.json and raw simulation data to
produce styled tables, grouped bar charts, a Spearman heatmap,
scatter plots per dimension, reward distribution histograms, and
per-task disagreement analysis. Charts saved to results/ as PNGs.
generate_report.py produces a self-contained HTML file with all charts
embedded as base64 PNGs. Supports --out and --no-open flags. README
updated with report generation section and file table entries.
- Add _recompute_correlations() helper for subset Spearman ρ
- build_report() now accepts min_sims and subtitle parameters
- Filter out models with fewer than min_sims tau2 simulations
- main() generates report.html (all models) + report_filtered.html
- Replace --no-open with --open (browser stays closed by default)
- Add --min-sims flag (default: 50)
- Add 'report' to valid_stages and default stages
- Report stage generates both full and filtered HTML reports
- Runs automatically after correlate stage
- Document automatic report stage in pipeline
- Document two-report output (all + filtered)
- Document --open and --min-sims flags
… section

- Single report.html with: data status, full analysis, filtered analysis, reward distributions
- Data status section shows all tau2/p2m models with coverage gaps
- Filtered section excludes models with <min_sims tau2 simulations
- If no models are filtered, shows 'no filtering needed' message
- Simplify main() and run_comparison report stage for single output
The correlate stage was a leftover from when correlation analysis
was done inside the pipeline. It has been fully replaced by the
standalone generate_report.py script that reads artifacts from disk.

- Remove _build_model_correlation_results() and supporting imports
- Remove backward-compat block that skipped missing correlate data
- Remove 'correlate' from valid_stages and default stage list
- Update docstring to reflect tau2 → p2m → report pipeline
Enrich the correlation report with two new sections:

1. Completion percentages in the data status table:
   - tau2 Complete: sims / (tasks × trials), color-coded green/orange
   - p2m Complete: scored / expected, color-coded green/orange
   - Track task counts and p2m progress from disk artifacts

2. Eval Configuration metadata box (shown at top of report):
   - Suite name, behavior, trials per task
   - Judge model, user simulator models
   - Full list of target models under evaluation

Also fix trials resolution to use max across presets (matches
full/default runs) instead of picking the first preset found.

2 participants