Add telecom tau2↔p2m Spearman correlation study by tangym · Pull Request #65 · responsibleai/ASSERT

tangym · 2026-05-19T07:20:21Z

Summary

Adds an end-to-end correlation study comparing tau2-bench (task-completion simulator) scores against p2m (spec-driven eval) scores for telecom domain agents.

The goal: validate that p2m eval dimensions capture the same signal as tau2 pass^k metrics by computing Spearman rank correlation across multiple models.
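As a rough illustration of the statistic this study relies on, the sketch below computes Spearman's rho in plain Python as the Pearson correlation of average ranks. The function names and the four score pairs are hypothetical placeholders, not code or results from this PR. The three-point minimum is an assumption that mirrors the "need ≥3 models" note in the status list.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5


# Placeholder scores for four hypothetical models (not results from this study).
tau2_pass_rate = [0.20, 0.35, 0.50, 0.65]
p2m_policy_adherence = [0.55, 0.60, 0.58, 0.80]
rho = spearman_rho(tau2_pass_rate, p2m_policy_adherence)
print(f"rho = {rho:.2f}")  # prints "rho = 0.80"
```

With only a handful of models, rho moves in large steps (one swapped pair among four models already drops it from 1.0 to 0.8), so a p-value and the sample size should be read alongside it.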

What's included

- Telecom behavior spec and tool schemas: 4 judge dimensions (workflow_violation, policy_adherence, communication_quality, escalation_judgment) grounded in telecom customer-service scenarios
- Orchestration script (run_correlation.py): single entry point that runs tau2, p2m, and correlation analysis for a configurable set of models, with cost estimation, progress tracking, and dry-run support
- p2m eval config: 70 seeds (50 prompt + 20 scenario), max 12 turns, 5 concurrent rollouts
- GitHub Actions workflow: manual workflow_dispatch for running individual stages in CI
- Phoenix auto-instrumentation: for hosted (Prompt Agent) targets via OpenTelemetry

CLI usage

```
# Full pipeline
python examples/telecom_tau2_correlation/run_correlation.py \
  --stages tau2 p2m correlate \
  --models gpt-4o-mini gpt-4.1-nano gpt-5-mini gpt-5.4-nano

# Individual stages
python run_correlation.py --stages p2m --models gpt-5.4-nano
```

Status

- [x] p2m runs complete for gpt-5.4-nano and gpt-5.4
- [x] tau2 run complete for gpt-5.4-nano (4 trials, 326 sims)
- [ ] tau2 + p2m runs for remaining models (gpt-4o-mini, gpt-4.1-nano, gpt-5-mini)
- [ ] Correlation analysis (need ≥3 models)

Commits

Small focused commits — each does one thing. See commit log for the full story.

changliu2 · 2026-05-19T22:38:34Z

A few model-selection + perf knobs from the model-coverage / inference-speedup discussion:

Foundry-only model list for the tau2 study (range of quality)

Range chosen for measurable spread on the same harness; only Foundry endpoints so latency is bounded by Azure data-plane. Confirm exact deployment names against our Foundry inventory before wiring.

| Tier | Model | LiteLLM string | Notes |
| --- | --- | --- | --- |
| Frontier | grok-4.3 | `azure/grok-4-3` | top of the AA chart we are calibrating against |
| Frontier | grok-4-20-reasoning | `azure/grok-4-20-reasoning` | reasoning variant |
| Frontier | GPT-5.5 (xhigh) | `azure/gpt-5.5` | OpenAI top-tier anchor |
| Frontier | Claude Opus 4.7 (max) | `azure/claude-opus-4-7-max` | Anthropic top-tier; cross-vendor anchor |
| Strong mid | grok-4 | `azure/grok-4` | xAI mid-frontier |
| Strong mid | GPT-5.4 (xhigh) | `azure/gpt-5.4` | default-strong baseline |
| Strong mid | DeepSeek-V3.1 | `azure/deepseek-v3-1` | strong OSS reasoning class |
| Strong mid | Claude Sonnet 4.6 | `azure/claude-sonnet-4-6` | Anthropic mid-tier |
| Strong mid | Mistral Medium 3.5 | `azure/mistral-medium-3-5` | EU-hosted mid-tier |
| Fast mid | grok-4-1-fast-reasoning | `azure/grok-4-1-fast-reasoning` | speed-optimized reasoning |
| Fast mid | grok-4-fast-reasoning | `azure/grok-4-fast-reasoning` | speed-optimized reasoning |
| Fast mid | GPT-5.4 mini (xhigh) | `azure/gpt-5.4-mini` | our current default judge/auditor |
| Fast mid | grok-3 | `azure/grok-3` | older frontier; reference point |
| Small/fast | DeepSeek-V3-0324 | `azure/deepseek-v3-0324` | mid-tier OSS reference |
| Small/fast | Llama-3.3-70B-Instruct | `azure/llama-3-3-70b-instruct` | Meta open-weight baseline |
| Small/fast | gpt-oss-120B (high) | `azure/gpt-oss-120b` | OpenAI open-weight |
| Small/fast | gpt-oss-20B (high) | `azure/gpt-oss-20b` | smallest OSS; judge-failure floor probe |

Recommended 3-model rotation for the regression-gate CI (keep PR latency under ~15 min): azure/gpt-5.4-mini (default), azure/gpt-5.4, one cross-vendor (azure/claude-sonnet-4-6 or azure/grok-4-fast-reasoning). The full 17 above run in the monthly bulk pass.

I deliberately dropped Google Gemini — not on Foundry today. That's tracked separately under the 3P-endpoint testing pilot Jake owns.

Inference-speedup knobs (the dev complaint about tau2 being slow)

The dominant cost on tau2 telecom is not the framework — it's (auditor turns) × (agent turns) × (tool roundtrips) × (judge calls). Three knobs, ordered by ROI:

1. Bump concurrency. The default inference.concurrency is conservative (4). The PR #43 family ("Joint AgentShield + p2m incident-triage demo (//build 2026 candidate)") bumped it to 24 for the incident-triage demo, and the same change applies here. Single-line fix, ~6× wall-clock improvement on multi-seed runs:
   ```
   pipeline:
     inference:
       concurrency: 24
       request_timeout: 120
   ```
2. Cap max_turns to a reasonable value. Today the example configs run with max_turns: 10 or unbounded. For tau2 telecom that's overkill: telecom triage conversations resolve in ≤ 5 turns most of the time. Recommend:
   ```
   pipeline:
     inference:
       target:
         max_turns: 5    # 10 if you want headroom for recovery beats; 5 is fine for the regression rail
   ```
   This alone cuts ~30–50% of wall time on scenarios that would otherwise drift to the 10-turn ceiling.
3. Faster auditor and judge models. Drop the auditor and judge to azure/gpt-5.4-mini for the routine runs; only escalate to azure/gpt-5.4 for tie-breaker / publication runs. Saves another ~30%.
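
The concurrency figure can be sanity-checked with a back-of-the-envelope model in which rollouts run in waves of `concurrency`. This is a sketch under strong assumptions: every rollout takes the same time, and the endpoint never rate-limits. The function name and the 90-second latency are placeholders, not measurements; the 70 rollouts match the seed count in the PR description.

```python
import math


def estimated_wall_clock(n_rollouts, concurrency, seconds_per_rollout):
    """Rough wall-clock estimate: rollouts run in waves of `concurrency`."""
    waves = math.ceil(n_rollouts / concurrency)
    return waves * seconds_per_rollout


# 70 rollouts matches the seed count in the PR description; 90 s per rollout
# is a placeholder latency, not a measurement.
baseline = estimated_wall_clock(70, concurrency=4, seconds_per_rollout=90)
bumped = estimated_wall_clock(70, concurrency=24, seconds_per_rollout=90)
print(baseline, bumped, baseline / bumped)  # prints "1620 270 6.0"
```

The estimate is an upper bound on the benefit: once the endpoint starts throttling, extra concurrency stops helping, so the value is worth tuning against observed rate-limit errors.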

On switching to UK AI Inspect to speed up tau2

Short answer: it will not help. Inspect has the same fundamental loop (spec → generate → run → judge); switching frameworks does not reduce model-call count, which is the actual bottleneck. We also lose OpenInference auto-trace (8/8 → 1/8 observability on judge side) and the multi-turn auditor pressure model that makes tau2 telecom discriminative in the first place. Inspect-compatible export is on our roadmap so we can publish to that ecosystem without switching frameworks.

Happy to file a separate issue tracking the tau2-speedup playbook if useful.

- concept.md: Telecom customer service agent behavior specification derived from tau2-bench's main_policy.md. Covers 7 operational areas (customer lookup, billing, line management, data refueling, plan changes, roaming, tech support) with quality and safety expectations.
- telecom_tools.yaml: 14 agent tools (7 READ + 6 WRITE + 1 GENERIC) extracted from tau2-bench's telecom tools.py in p2m YAML format.

Pipeline: policy (15 behaviors) → design → seeds (50 prompts + 20 scenarios) → rollout (model + simulated tools, 12 turns) → judge (4 dimensions: workflow_violation, policy_adherence, communication_quality, escalation_judgment). Default target is azure/gpt-4o-mini; override via --set for multi-model correlation runs.

Documents motivation, file layout, running instructions, suggested model set, correlation workflow, and design decisions.

Supports --stages tau2,p2m,correlate for selective execution, --models for custom model sets, and --dry-run for preview. Generates per-model YAML configs since p2m has no --set flag.

Replace --set flag examples (not supported) with run_correlation.py usage, stage descriptions, and selective execution examples.

tau2 saves simulations to {DATA_DIR}/simulations/, where DATA_DIR is resolved by the tau2 package (typically <tau2-bench>/data/). The previous hardcoded path 'data/tau2/simulations/' was incorrect.

Uses gpt-5.4-nano with tiny sample sizes (3 behaviors, 3 prompts, 2 scenarios) to verify the p2m pipeline works end-to-end in ~90s.

target.trace was only checked inside the callable target path, so model-only targets silently ignored trace config. Add a call to phoenix.otel.register(auto_instrument=True) in the hosted session path so litellm calls emit spans to Phoenix automatically. Idempotent (module-level flag) and graceful when arize-phoenix-otel is not installed.
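
A minimal sketch of the idempotent, optional registration described above. The helper name `ensure_phoenix_tracing` and the flag name are hypothetical; `phoenix.otel.register(auto_instrument=True)` is the call named in the commit message. This is not necessarily how the PR structures it.

```python
_PHOENIX_REGISTERED = False  # module-level flag keeps registration idempotent


def ensure_phoenix_tracing():
    """Register Phoenix auto-instrumentation at most once; no-op if unavailable."""
    global _PHOENIX_REGISTERED
    if _PHOENIX_REGISTERED:
        return True
    try:
        from phoenix.otel import register  # provided by arize-phoenix-otel
    except ImportError:
        return False  # tracing is optional, so a missing package is not an error
    register(auto_instrument=True)
    _PHOENIX_REGISTERED = True
    return True
```

Calling the helper from every session entry point is safe because repeat calls return early once registration has succeeded.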

- Change default_model and all stage models from gpt-5.4-mini to gpt-5.4-nano (gpt-5.4-mini is not deployed)
- Fix toolset path: bare filename instead of full relative path (p2m resolves relative to config file directory)
- Add Phoenix tracing to rollout target in both configs

Bug fixes:
- Add missing shutil import for binary resolution
- Resolve tau2/p2m binaries via shutil.which() with venv fallback
- Write temp configs next to source (not in results/) so relative paths resolve correctly
- Fix score parsing: verdict.dimensions (not verdicts/scores)
- Change TAU2_USER_LLM to gpt-5.4-nano (gpt-5.4-mini not deployed)
- Move global statement to function top

Cost/progress improvements:
- Stream subprocess output to terminal instead of capture_output=True (tau2 progress bars and per-sim costs are now visible)
- Add elapsed time tracking per command and per stage
- Add model progress counters (e.g. 'tau2 model 1/2: azure/gpt-5.4')
- Add post-stage cost summaries (tau2: sim count + USD, p2m: tokens)
- Add pre-run confirmation prompts with cost/time estimates
- Add --yes/-y flag to skip confirmation (for CI/automated runs)

- Add --max-concurrency flag to control tau2 parallelism (was hardcoded)
- Add shutil.which guard with helpful install instructions on tau2 missing
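
The binary lookup and guard from the two commits above might look roughly like this. Both function names are hypothetical. The venv fallback assumes console scripts sit next to the running interpreter, and the install hint is left to the caller rather than guessed here.

```python
import shutil
import sys
from pathlib import Path


def resolve_binary(name):
    """Return a path to `name` from PATH, else from the active venv, else None."""
    found = shutil.which(name)
    if found:
        return found
    # Fallback: look next to the running interpreter (bin/ or Scripts/ in a venv).
    candidate = Path(sys.executable).parent / name
    return str(candidate) if candidate.is_file() else None


def require_binary(name, install_hint):
    """Resolve `name` or exit with an actionable message."""
    path = resolve_binary(name)
    if path is None:
        sys.exit(f"'{name}' not found on PATH or in the active venv. {install_hint}")
    return path
```

A call such as `require_binary("tau2", "See the README prerequisites section.")` fails fast before any paid API stage starts.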

Stages: tau2, p2m, correlate (selectable via workflow_dispatch). Caches tau2 simulation results and p2m artifacts across runs.

- Add models.yaml with endpoint definitions, quick/full presets, and 9-model inventory with per-model preset membership
- Rename run_correlation.py → run_comparison.py
- Add load_models_config(), get_preset_models(), get_preset_overrides() to read models.yaml and resolve preset parameters
- Add --preset CLI arg (quick/full) that sets models, trials, concurrency from models.yaml
- Rename CLI args: --tau2-trials → --trials, --tau2-user-llm → --user-model, --max-concurrency → --concurrency, --p2m-seed-count → --test-cases
- Remove hardcoded DEFAULT_MODELS list (replaced by models.yaml)
- Rename constants: TAU2_NUM_TRIALS → DEFAULT_TRIALS, TAU2_USER_LLM → DEFAULT_USER_MODEL, TAU2_MAX_CONCURRENCY → DEFAULT_CONCURRENCY
- Eliminate global mutation pattern (was: global TAU2_NUM_TRIALS)

- concept: → behavior: with inline description (absorb concept.md)
- factors: → pipeline.test_set.stratify.dimensions:
- pipeline.policy: → pipeline.systematize:
- behavior_count → behavior_category_count
- pipeline.seeds: → pipeline.test_set:
- pipeline.rollout: → pipeline.inference:
- auditor: → tester:
- max_turns: 12 → 5
- suite name: telecom-tau2-correlation (drop -v1 suffix)
- Delete concept.md (content now inline in behavior.description)
- Update run_comparison.py config path and terminology

models.yaml maps each model to a region endpoint key, which resolves to a region-specific env var (e.g. AZURE_API_BASE_WESTUS2). Each subprocess inherits a tailored env dict with AZURE_API_BASE set to the correct URL for that model's region.
- _model_entry(), resolve_endpoint_url(), resolve_endpoint_env_var() for the lookup chain: model → endpoint key → env var → URL
- validate_endpoints() exits early with actionable diagnostics when required env vars are missing
- make_model_env() builds per-subprocess env with AZURE_API_BASE set
- run_tau2() and run_p2m() accept models_config and pass env to run_cmd()
- main() calls validate_endpoints() before expensive API stages
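
The lookup chain (model → endpoint key → env var → URL) can be sketched as below. The function names come from the commit message, but the signatures, the `MODELS_CONFIG` layout, and the return-a-list behavior of `validate_endpoints()` are assumptions made for illustration. The commit describes the real function as exiting early, and the real models.yaml schema may differ.

```python
import os

# Hypothetical stand-in for a parsed models.yaml; the real schema may differ.
MODELS_CONFIG = {
    "endpoints": {"westus2": "AZURE_API_BASE_WESTUS2"},
    "models": {"azure/gpt-5.4": {"endpoint": "westus2"}},
}


def resolve_endpoint_env_var(config, model):
    """model -> endpoint key -> name of the env var holding the base URL."""
    endpoint_key = config["models"][model]["endpoint"]
    return config["endpoints"][endpoint_key]


def resolve_endpoint_url(config, model, environ=os.environ):
    """Follow the chain one step further: env var name -> URL (None if unset)."""
    return environ.get(resolve_endpoint_env_var(config, model))


def validate_endpoints(config, models, environ=os.environ):
    """Return the env var names that are required but missing or empty."""
    needed = {resolve_endpoint_env_var(config, m) for m in models}
    return sorted(v for v in needed if not environ.get(v))


def make_model_env(config, model, environ=os.environ):
    """Copy the environment and point AZURE_API_BASE at the model's region."""
    url = resolve_endpoint_url(config, model, environ)
    if not url:
        raise KeyError(f"{resolve_endpoint_env_var(config, model)} is not set")
    env = dict(environ)
    env["AZURE_API_BASE"] = url
    return env
```

Passing the resulting dict as `env=` to `subprocess.run` keeps the parent process environment untouched, so models in different regions can run back to back without leaking settings.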

Add discover_tau2_results() and discover_p2m_results() to detect per-model output files that already exist on disk. Replace the simple confirm_stage() with plan_and_confirm() which shows 'Done' vs 'To run' model lists and returns only the pending models. main() now skips already-completed models within a stage, merges existing results with new results, and extracts suite_name earlier so p2m discovery can use it.

--force re-runs all models even if results already exist on disk, bypassing auto-discovery. After each stage completes, save intermediate tau2_rewards.json and p2m_scores.json to results/. The correlate stage loads these automatically when run separately (--stages=correlate), enabling multi-session workflows where tau2 and p2m run independently.

_progress_line() shows '[3/9] elapsed 24m, ETA ~48m' based on average time per completed model. First model shows just '[1/N]' since no timing data is available yet. Applied to both run_tau2() and run_p2m() loop headers.

run_p2m() now accepts test_cases, max_turns, and judge_model parameters. These are resolved from CLI args and preset overrides in models.yaml (test_cases, max_turns, judge_model fields). When test_cases is set, both prompt and scenario sample_size are patched (scenario = test_cases // 3, min 1). Delete eval_config_smoke.yaml — its role is replaced by the 'quick' preset in models.yaml.

- Rewrite README with preset-based quick start, CLI reference, stage docs, multi-endpoint setup, and current model inventory
- Fix CI workflow: rename run_correlation.py to run_comparison.py, replace --max-concurrency/--tau2-trials with --concurrency/--trials, add --preset input option, add multi-endpoint env vars, add --force

- Import and call load_dotenv() to pick up .env from repo root
- Add TAU2_DATA_DIR constant with env var override support
- Fix DEFAULT_USER_MODEL from non-existent gpt-5.4-nano to gpt-5.4-mini

Multi-endpoint infrastructure for running tau2 across Azure regions:
- models.yaml: add api_keys and user_simulator sections per endpoint
- resolve_api_key_env_var(): look up API key env var by model endpoint
- resolve_user_model(): pick user-sim model co-located on agent endpoint
- validate_endpoints(): check both base URL and API key env vars
- validate_tau2_data(): verify tau2 domain data exists before running
- make_model_env(): set AZURE_API_KEY and TAU2_DATA_DIR in subprocess env
- discover_tau2_results(): validate JSON has actual simulations
- run_tau2(): use per-endpoint user simulator, use TAU2_DATA_DIR paths
- .gitignore: exclude local data/ symlink

Quick validation script that tests each configured Azure endpoint by sending a minimal chat completion request. Supports --list to show endpoint configuration without making API calls.

- .env.example: add per-region API key/base vars and TAU2_DATA_DIR
- README.md: add data directory setup instructions with symlink example

Filter stderr from tau2 subprocesses to suppress repetitive litellm 'model isn't mapped yet' warnings (show only first occurrence) and drop litellm promo/feedback lines entirely.

p2m only supports systematize, test_set, inference, and judge. The 'design' stage caused 'Unknown stage(s): design' on every run.

After suppressing noisy litellm/loguru lines, adjacent blank lines were still passing through, creating long stretches of empty output. Track whether the last line was suppressed and skip blank lines that immediately follow.
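
The filtering rules from the two stderr commits above (show a repeated warning once, drop promo lines, swallow blank lines that trail a suppressed line) can be sketched as a small generator. The function name and marker arguments are hypothetical, and this version also swallows a run of several blank lines after a suppressed line, which may be slightly broader than the PR's behavior.

```python
def filter_stderr(lines, dedupe_markers, drop_markers):
    """Yield lines, showing deduped warnings once and dropping noise entirely."""
    seen = set()
    last_suppressed = False
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            # A blank line right after a suppressed line is part of the noise.
            if not last_suppressed:
                yield line
            continue
        hit = next((m for m in dedupe_markers if m in text), None)
        if any(m in text for m in drop_markers) or (hit is not None and hit in seen):
            last_suppressed = True
            continue
        if hit is not None:
            seen.add(hit)  # first occurrence passes through
        last_suppressed = False
        yield line
```

Because it is a generator over any iterable of lines, it can wrap a subprocess's stderr pipe directly and emit output as it arrives.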

- Default: 5 → 10
- Quick preset: 10 → 30
- Full preset: 24 → 40

Azure OpenAI deployments easily handle 30-40 concurrent requests. With 114 tau2 tasks (57 scenarios × 2 trials), higher concurrency cuts wall-clock time significantly.

p2m's pipeline uses multiple models across stages (systematize, test_set, tester, judge) that all share a single AZURE_API_BASE env var. When the inference target model is on a different endpoint (e.g. westus2) than the pipeline models (e.g. default), overriding AZURE_API_BASE globally caused all pipeline stages to fail with DeploymentNotFound. Fix: keep AZURE_API_BASE pointing at the default endpoint for pipeline models. Route only the target model to its specific endpoint by passing a _P2M_MODEL_ROUTING JSON env var to _p2m_shim.py, which monkey-patches litellm.acompletion to inject per-model api_base and api_key.
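
A sketch of the routing idea: wrap an async completion function so that only models listed in the routing table get a different `api_base` and `api_key`. The helper names and the JSON shape of `_P2M_MODEL_ROUTING` are assumptions; the actual `_p2m_shim.py` may be organized differently.

```python
import functools
import json
import os


def load_routing(environ=os.environ):
    """Parse the routing table; assumed shape: {model: {"api_base": ..., "api_key": ...}}."""
    return json.loads(environ.get("_P2M_MODEL_ROUTING", "{}"))


def with_routing(acompletion, routing):
    """Wrap an async completion function so routed models get their own endpoint."""

    @functools.wraps(acompletion)
    async def routed(*args, **kwargs):
        model = kwargs.get("model") or (args[0] if args else None)
        for key, value in routing.get(model, {}).items():
            kwargs.setdefault(key, value)  # explicit caller arguments still win
        return await acompletion(*args, **kwargs)

    return routed


# In a shim this would be applied before the pipeline starts, e.g.:
#   litellm.acompletion = with_routing(litellm.acompletion, load_routing())
```

Models absent from the table fall through unchanged, which is what keeps the pipeline models on the default AZURE_API_BASE.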

- run_p2m() now calls _p2m_shim.py instead of p2m directly
- New make_p2m_env() builds routing table instead of overriding AZURE_API_BASE
- Quick preset concurrency: 30 → 20 (less likely to hit rate limits)

When AZURE_API_BASE is not in the environment (only region-specific variants exist), p2m pipeline models (systematize, test_set, judge) fail with DeploymentNotFound because they have no default endpoint.
- models.yaml: add pipeline_endpoint pointing to australiaeast
- make_p2m_env(): fall back to pipeline_endpoint for AZURE_API_BASE
- validate_endpoints(): check pipeline endpoint env vars are set

tau2 appends simulations to existing JSON files across runs, causing accumulated sim counts that don't match the requested trial count. collect_tau2_rewards() then averages over all accumulated sims, producing misleading reward scores.
- discover_tau2_results(): accept expected_trials param, remove files where sim count doesn't match (stale from prior runs)
- main(): pass trials count to discover_tau2_results()

quick: 20→5, full: 40→10

Resume: partial result files are kept on disk instead of deleted. The PTY now connects stdin so tau2's interactive 'resume?' prompt is auto-answered with 'y'. discover_tau2_results() uses the JSON tasks list to compute expected sim count (n_tasks × trials); only files with MORE sims than expected are treated as stale.

Error reporting: after a tau2 failure, _summarize_tau2_run() reads the partial output file and captured ERROR lines to produce a grouped summary (e.g. '12× empty model response, 3× rate-limited') instead of the previous generic 'tau2 failed, skipping'.
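
The stale-versus-partial rule above can be sketched as a small classifier. The function name and the assumed JSON layout (top-level "tasks" and "simulations" lists) are illustrative guesses, not the actual tau2 file format or the PR's discover_tau2_results() implementation.

```python
import json
from pathlib import Path


def classify_tau2_result(path, trials):
    """Classify a result file as 'complete', 'partial', 'stale', or 'empty'."""
    data = json.loads(Path(path).read_text())
    # Assumed layout: top-level "tasks" and "simulations" lists.
    n_sims = len(data.get("simulations", []))
    expected = len(data.get("tasks", [])) * trials
    if n_sims == 0:
        return "empty"
    if n_sims > expected:
        return "stale"  # accumulated across runs; averaging over it would mislead
    return "complete" if n_sims == expected else "partial"  # partial files can be resumed
```

Treating only the over-full case as stale is what lets a crashed run resume from its partial file instead of starting over.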

- Add --tau2-retries flag (default 3) for resilience against tau2 crashes
- Print tau2 completion table after simulation stage
- Report Spearman p-values, sample sizes, and significance markers
- Show both full and filtered (>=50% completion) correlation results
- Suppress OTEL/gRPC exporter noise when Phoenix isn't running

tau2 upstream has two unhandled exceptions (JSONDecodeError, ValueError for empty AssistantMessage) that crash the entire ThreadPoolExecutor, killing all in-flight tasks. With concurrency=10, each crash wastes ~10 tasks. Lowering to 2 limits blast radius per crash; bumping retries from 3 to 10 gives the resume loop enough attempts to grind through error-prone models like gpt-oss-120b (2% completion at concurrency=10, 3 retries).

- Auto-log each run to logs/run_YYYYMMDD_HHMMSS.log (+ --log-file flag)
- .gitignore: exclude logs/ directory
- After tau2 exits 0, verify actual sim count matches expected before declaring success; retry on incomplete runs
- Reduce DEFAULT_TAU2_RETRIES from 10 to 3: investigation showed diminishing returns (~9 new sims per retry for failing models)

- Fix concept.md ghost reference: point to eval_config.yaml instead
- Fix mini preset judge_model (gpt-5.4, not gpt-5.4-mini)
- Fix tau2 results path (data/simulations/, not results/tau2/)
- Fix p2m artifacts path (no -v1 suffix in suite name)
- Add --log-file to CLI options table
- Soften motivation framing: the study checks ranking agreement and does not make a trust claim
- Add 'Inspecting results' section with data completeness checks, correlation output interpretation, and p2m artifact navigation

Loads results/correlation_results.json and raw simulation data to produce styled tables, grouped bar charts, a Spearman heatmap, scatter plots per dimension, reward distribution histograms, and per-task disagreement analysis. Charts saved to results/ as PNGs.

generate_report.py produces a self-contained HTML file with all charts embedded as base64 PNGs. Supports --out and --no-open flags. README updated with report generation section and file table entries.

- Add _recompute_correlations() helper for subset Spearman ρ
- build_report() now accepts min_sims and subtitle parameters
- Filter out models with fewer than min_sims tau2 simulations
- main() generates report.html (all models) + report_filtered.html
- Replace --no-open with --open (browser stays closed by default)
- Add --min-sims flag (default: 50)

- Add 'report' to valid_stages and default stages
- Report stage generates both full and filtered HTML reports
- Runs automatically after correlate stage

- Document automatic report stage in pipeline
- Document two-report output (all + filtered)
- Document --open and --min-sims flags

… section
- Single report.html with: data status, full analysis, filtered analysis, reward distributions
- Data status section shows all tau2/p2m models with coverage gaps
- Filtered section excludes models with <min_sims tau2 simulations
- If no models are filtered, shows 'no filtering needed' message
- Simplify main() and run_comparison report stage for single output

The correlate stage was a leftover from when correlation analysis was done inside the pipeline. It has been fully replaced by the standalone generate_report.py script that reads artifacts from disk.
- Remove _build_model_correlation_results() and supporting imports
- Remove backward-compat block that skipped missing correlate data
- Remove 'correlate' from valid_stages and default stage list
- Update docstring to reflect tau2 → p2m → report pipeline

Enrich the correlation report with two new sections:

1. Completion percentages in the data status table:
   - tau2 Complete: sims / (tasks × trials), color-coded green/orange
   - p2m Complete: scored / expected, color-coded green/orange
   - Track task counts and p2m progress from disk artifacts
2. Eval Configuration metadata box (shown at top of report):
   - Suite name, behavior, trials per task
   - Judge model, user simulator models
   - Full list of target models under evaluation

Also fix trials resolution to use max across presets (matches full/default runs) instead of picking the first preset found.

tangym added 22 commits May 20, 2026 00:36

Add README for telecom tau2 correlation example

f362ac6

Add orchestration script for tau2/p2m correlation study

d452fab

Update README with orchestration script docs

32ff036

Fix tau2 output path to use DATA_DIR instead of hardcoded relative path

591b590

Add minimal smoke-test config for quick pipeline verification

d4dc460

docs(telecom): add prerequisites section with tau2 install instructions

a1aff6f

feat(telecom): add --max-concurrency CLI arg and tau2 binary check

e34ae3b

ci: add manual workflow for tau2-p2m correlation study

4b3224b

chore(telecom): gitignore generated per-model config files

55dbe36

feat(telecom): add progress display with elapsed time and ETA

770021c

tangym force-pushed the yemingtang/telecom-tau2-correlation branch from 5bcff83 to cce2def Compare May 20, 2026 03:36

tangym added 4 commits May 20, 2026 08:12

fix: add load_dotenv, TAU2_DATA_DIR, and correct DEFAULT_USER_MODEL

63966f0

feat: add endpoint connectivity smoke test

7e7ec61

docs: add env var and data setup documentation

22c3f5f

tangym force-pushed the yemingtang/telecom-tau2-correlation branch 2 times, most recently from 314b3bd to 15cd2f4 Compare May 20, 2026 08:25

chore: suppress noisy litellm cost warnings in subprocess output

428f99b

tangym force-pushed the yemingtang/telecom-tau2-correlation branch from 15cd2f4 to 428f99b Compare May 20, 2026 08:29

tangym and others added 25 commits May 20, 2026 08:32

fix(eval_config): remove unsupported 'design' pipeline stage

44d96b1

perf: increase tau2 concurrency defaults

a33810a

fix: use p2m shim for multi-endpoint routing, moderate concurrency

fd5c7e3

fix: correct p2m CLI entry point (cli, not main)

640c6b6

Lower default concurrency to avoid rate limits

48f39d3

fix: replace unavailable gpt-5.4-nano with gpt-5.4-mini in pipeline

78f9c13

feat: add mini preset (3 models, 70 test cases, 4 trials)

0251454

docs: update README with mini preset, retry, and correlation enhancements

020fae8

Add HTML report generator + document report tools in README

7406763

run_comparison: add report stage to pipeline

508aa06

README: update report generation section

caed9ef

github-actions Bot requested review from AaronAspinwall123, changliu2 and jakepresent June 13, 2026 07:25

tangym wants to merge 52 commits into main from yemingtang/telecom-tau2-correlation
