feat: benchmark evidence + maintained provider price table (#103, #156, #207) #224
Implements the "benchmark evidence & cost-claim artifacts" group as one PR.

#156 - Maintained provider price table:
- chainweaver/cost.py: PriceSnap, dated PROVIDER_PRICES table, lookup_price, CostProfile.from_provider (blended per-token cost), and provider/model/price_as_of fields surfaced on every CostReport. compute_cost_report now accepts provider/model. No live HTTP; snapshots ship in-tree.
- New CostProfileError for unknown (provider, model) pairs.
- scripts/refresh_prices.py + .github/workflows/update-prices.yml open a maintainer-reviewed monthly PR; prices are never auto-merged.

#103 - Correctness benchmark (benchmarks/bench_correctness.py):
- Seeded LLMCorruptionProfile injects the five corruption types (field hallucination, data loss, type corruption, schema drift, routing variance); CorrectnessReport aggregates them. Compiled execution shows zero corruption by construction. Three scenarios + a corruption-compounds-with-length table.

#207 - Aggregate report (benchmarks/report.py):
- build_report packages latency, decisions-avoided, cost-avoided (via #156), correctness (via #103), environment metadata, and caveats into versioned results/latest.{json,md}. tests/test_benchmark_artifacts.py guards the format (the CI gate #207 asks for) and the zero-corruption invariant.

Docs: README cost-avoided section (real $ figure) + error-table row, benchmarks/README correctness + report sections, AGENTS.md repo map.

Validation: ruff check + ruff format --check + mypy all clean; pytest 1183 passed, coverage 91.46%.

https://claude.ai/code/session_01KBU8gNNWLVswrH9WP83F3y
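To illustrate why a seeded profile gives reproducible numbers and why corruption compounds with chain length, here is a minimal sketch. It is not the code in benchmarks/bench_correctness.py: `analytic_clean_rate`, `simulate_clean_rate`, the 5% per-step corruption probability, and the assumption that steps are corrupted independently are all hypothetical choices made for this example.

```python
import random


def analytic_clean_rate(p_corrupt: float, steps: int) -> float:
    """Probability that all `steps` hops stay clean, assuming independent hops."""
    return (1.0 - p_corrupt) ** steps


def simulate_clean_rate(p_corrupt: float, steps: int, runs: int, seed: int) -> float:
    """Fraction of simulated runs in which no hop was corrupted.

    A dedicated random.Random(seed) keeps the result reproducible and
    independent of the global RNG state.
    """
    rng = random.Random(seed)
    clean = 0
    for _ in range(runs):
        # A run is clean only if every hop draws at or above the corruption probability.
        if all(rng.random() >= p_corrupt for _ in range(steps)):
            clean += 1
    return clean / runs


if __name__ == "__main__":
    for steps in (1, 5, 10, 20):
        print(
            steps,
            round(analytic_clean_rate(0.05, steps), 3),
            simulate_clean_rate(0.05, steps, runs=2000, seed=42),
        )
```

Re-running with the same seed returns the same simulated rate, and the clean rate falls as the number of steps grows. A path that never injects corruption (probability 0.0) stays at 1.0 for any length.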
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'ChainWeaver microbenchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.25.
| Benchmark suite | Current: 2d4c871 | Previous: 3e3ac00 | Ratio |
|---|---|---|---|
| compiled_overhead_ms_n5_llm500_tool50 | 0.33671400001367147 ms | 0.2267619999827275 ms | 1.48 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @dgenio
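The ratio in the table is the current measurement divided by the previous one, and the alert fires because that ratio exceeds the 1.25 threshold. A small sketch of the arithmetic follows; the name `regression_ratio` is made up for this note and is not part of github-action-benchmark.

```python
def regression_ratio(current_ms: float, previous_ms: float) -> float:
    """Ratio above 1 means the current run is slower (lower is better)."""
    return current_ms / previous_ms


ALERT_THRESHOLD = 1.25  # threshold quoted in the alert above

ratio = regression_ratio(0.33671400001367147, 0.2267619999827275)
print(f"{ratio:.2f}", ratio > ALERT_THRESHOLD)  # 1.48 True
```

The absolute difference between the two measurements is roughly 0.11 ms.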
Pull request overview
This PR adds reproducible “evidence artifacts” to support README/docs claims around cost-avoided, latency, and correctness: a maintained provider price snapshot table (offline), a seeded correctness/data-corruption benchmark, and an aggregate benchmark report emitted as committed latest.{json,md} artifacts.
Changes:
- Add an in-repo, dated provider/model price snapshot table with lookup + CostProfile.from_provider(), and surface the snapshot date on cost reports.
- Add a seeded correctness benchmark that simulates common LLM-mediated corruption modes vs compiled flow execution.
- Add an aggregate benchmark report generator (benchmarks/report.py) plus committed benchmarks/results/latest.{json,md} and CI format guards.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| chainweaver/cost.py | Introduces PriceSnap, PROVIDER_PRICES, lookup_price(), CostProfile.from_provider(), and extends compute_cost_report()/CostReport.__str__() to support priced reporting. |
| chainweaver/exceptions.py | Adds CostProfileError for unknown (provider, model) pricing lookups. |
| chainweaver/__init__.py | Exports new cost symbols (PriceSnap, PROVIDER_PRICES, lookup_price, CostProfileError) via __all__. |
| README.md | Documents priced cost-avoided reporting and adds CostProfileError to the error table. |
| scripts/refresh_prices.py | Adds a maintainer-facing staleness/reporting helper for price snapshots (no runtime network). |
| .github/workflows/update-prices.yml | Adds a scheduled workflow that can open a human-reviewed “refresh prices” PR. |
| benchmarks/bench_correctness.py | Adds the seeded correctness/data-corruption benchmark and JSON/table output. |
| benchmarks/report.py | Adds aggregate report generation (environment + latency + cost + correctness + caveats) and artifact writing. |
| benchmarks/README.md | Documents how to run correctness and aggregate report benchmarks and how artifacts are used/guarded. |
| benchmarks/results/latest.json | Adds committed “latest” machine-readable benchmark artifact for docs/README evidence. |
| benchmarks/results/latest.md | Adds committed “latest” human-readable benchmark artifact for docs/README evidence. |
| tests/test_cost.py | Adds tests for the price table, lookup behavior, provider-derived profiles, and provider-priced cost reports. |
| tests/test_benchmark_artifacts.py | Adds CI guards validating report/correctness artifact shapes and reproducibility. |
| tests/fixtures/public_api.json | Updates the public API snapshot for new exported symbols/fields. |
| AGENTS.md | Updates the repo map to include the new cost/benchmark/workflow additions. |
Address the Copilot review on #224:
- Validate output_fraction is in [0.0, 1.0] in PriceSnap.blended_cost_per_token_usd (raise ValueError); out-of-range values previously produced negative/inflated blended costs.
- Require provider/model together-or-neither in compute_cost_report so a partial config no longer silently returns an unpriced default report.
- Only render the "Priced against" line when price_as_of is also present, avoiding a confusing "(as of None)".
- Expose PROVIDER_PRICES as a read-only MappingProxyType so callers cannot mutate the shared in-process price table.
- Drop the unused CostProfile import from the README cost example.
- Redefine the benchmark determinism_rate as the most-common-outcome frequency / runs (consistency across runs) instead of a correctness alias; compiled execution is 1.0 by construction.

Refresh the public-API snapshot (PROVIDER_PRICES is now a mappingproxy) and regenerate benchmarks/results/latest.{json,md}. Tests added for every fix; ruff, ruff format, mypy, and pytest all pass.
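The first and fourth fixes can be sketched as follows. This is not the code in chainweaver/cost.py: `PriceSnapSketch`, its field names, the linear blending formula, and the prices are assumptions made for illustration. Only the `[0.0, 1.0]` bound, the ValueError, and the use of MappingProxyType come from the commit message.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class PriceSnapSketch:
    """Hypothetical stand-in for PriceSnap; field names are assumptions."""

    input_usd_per_mtok: float
    output_usd_per_mtok: float
    as_of: str

    def blended_cost_per_token_usd(self, output_fraction: float = 0.5) -> float:
        # Reject fractions outside [0, 1]: a linear blend would otherwise
        # weight one side negatively or above 100%.
        if not 0.0 <= output_fraction <= 1.0:
            raise ValueError(
                f"output_fraction must be in [0.0, 1.0], got {output_fraction}"
            )
        per_mtok = (
            (1.0 - output_fraction) * self.input_usd_per_mtok
            + output_fraction * self.output_usd_per_mtok
        )
        return per_mtok / 1_000_000


# Placeholder figures, not real provider prices.
_PRICES = {
    ("example-provider", "example-model"): PriceSnapSketch(2.0, 6.0, "2026-05-01"),
}
# Read-only view: item assignment through PRICES raises TypeError.
PRICES = MappingProxyType(_PRICES)
```

A MappingProxyType is a view rather than a copy, so it blocks mutation by callers who only hold the proxy, while code with access to the underlying dict can still change it.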
Audit follow-up to the determinism_rate fix.

benchmark_compiled_correctness counted successful_runs only when the output also matched the truth value, while benchmark_naive_correctness counts any run that executed without failure. The values coincide for the compiled path (success implies correctness by construction), but the field then meant two different things across approaches. Gate compiled successful_runs on result.success alone so the metric is directly comparable; correctness-by-construction stays asserted via corruption_rate, data_integrity_score, and determinism_rate.

Also document determinism_rate on CorrectnessReport: it measures outcome *consistency* (frequency of the most common (final_value, routing) outcome), not end-to-end correctness, so future readers don't assume the old meaning.

https://claude.ai/code/session_01CF3z555AAGbJCCJwAerMpJ
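The consistency definition of determinism_rate can be sketched in a few lines. This is an illustration of the stated formula (most-common-outcome frequency divided by runs), not the benchmark's actual implementation. The standalone function and the sample outcomes are hypothetical.

```python
from collections import Counter
from collections.abc import Hashable, Sequence


def determinism_rate(outcomes: Sequence[Hashable]) -> float:
    """Frequency of the most common outcome divided by the number of runs."""
    if not outcomes:
        raise ValueError("need at least one run")
    _, top_count = Counter(outcomes).most_common(1)[0]
    return top_count / len(outcomes)


# Each outcome is a (final_value, routing) pair, as described above.
compiled = [(42, "a->b")] * 5
naive = [(42, "a->b"), (42, "a->b"), (41, "a->b"), (42, "a->c"), (42, "a->b")]
print(determinism_rate(compiled))  # 1.0
print(determinism_rate(naive))     # 0.6
```

Under this definition a path that returns the same wrong answer on every run also scores 1.0, which is why the metric is documented as consistency rather than correctness.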
Summary
Implements the triage report's recommended "benchmark evidence & cost-claim artifacts" group as a single PR: a maintained provider price table, a data-corruption correctness benchmark, and an aggregate benchmark report that packages reproducible numbers for README/docs. These three issues cross-reference each other (#207 explicitly depends on #103 and #156), share the benchmarks/ + cost area, and touch nothing in the executor hot path.
Changes
#156 - Maintained provider price table
- chainweaver/cost.py: PriceSnap (dated input/output per-Mtok), PROVIDER_PRICES snapshot table (OpenAI / Anthropic / Google / Bedrock, as_of 2026-05-01), lookup_price(), and CostProfile.from_provider(), which derives a blended per-token cost. CostProfile gains provider/model/price_as_of; every CostReport now surfaces the snapshot date. compute_cost_report(..., provider=, model=) builds the profile from the table. No live HTTP; snapshots ship in-tree and always work offline. PROVIDER_PRICES is exposed as a read-only MappingProxyType so the shared table cannot be mutated at runtime.
- chainweaver/exceptions.py: new CostProfileError for unknown (provider, model) pairs (exported in __all__, added to the README error table).
- scripts/refresh_prices.py + .github/workflows/update-prices.yml: monthly maintainer-reviewed price-refresh PR, never auto-merged.
#103 - Correctness benchmark
- benchmarks/bench_correctness.py: seeded LLMCorruptionProfile injects the five corruption types (field hallucination, data loss, type corruption, schema drift, routing variance); CorrectnessReport aggregates them. Compiled execution reports zero corruption by construction. Three scenarios (numeric / data-enrichment / long-chain) plus a "corruption compounds with chain length" table; human table + JSON output. determinism_rate measures outcome consistency (frequency of the most common (final_value, routing) outcome across runs), not end-to-end correctness, and is documented as such on CorrectnessReport.
#207 - Aggregate report
- benchmarks/report.py: build_report() packages latency, model-decisions-avoided, cost-avoided (via #156), correctness (via #103), environment metadata (Python / ChainWeaver / OS / commit SHA), and explicit caveats into versioned benchmarks/results/latest.{json,md}.
Docs: README cost-avoided section (real $ figure) + error-table row; benchmarks/README.md correctness + report sections; AGENTS.md repo map (cost.py, scripts/, benchmarks/, workflows).
Testing
- ruff check chainweaver/ tests/ examples/: All checks passed
- ruff format --check chainweaver/ tests/ examples/: 162 files already formatted
- python -m mypy chainweaver/ tests/: no issues in 130 files
- python -m pytest tests/ -v: 1187 passed, 2 skipped, coverage 91.46%
- tests/test_cost.py (price table, lookup, from_provider, provider-priced reports, output_fraction bounds, partial provider/model rejection, MappingProxyType immutability) and tests/test_benchmark_artifacts.py (correctness invariants, seed reproducibility, compounding, report format, which is the CI format gate #207 asks for, plus naive-vs-compiled determinism_rate assertions)
Benchmark/script files (benchmarks/, scripts/) are also ruff- and mypy-clean, though they sit outside the CI-gated paths.
Review-cycle hardening
Applied after the initial review pass (commits 34f66fe, 2d4c871):
- output_fraction is validated to [0.0, 1.0]; compute_cost_report rejects a partial (provider, model) pair instead of silently mispricing.
- CostReport.__str__ no longer renders (as of None) when no snapshot date is present.
- PROVIDER_PRICES made immutable (MappingProxyType); a stray unused import was removed from the README example.
- determinism_rate was relabelled/documented as a consistency metric (it was previously mislabelled), and successful_runs now has identical semantics across the naive and compiled paths (gated on execution success alone) so the column is directly comparable.
Related Issues
Closes #103, #156, #207.
Checklist
- (AGENTS.md and docs/agent-context/)
- (__all__ + AGENTS.md; public-API snapshot fixture regenerated)
Scope notes (Mode B)
- #156 monthly workflow ships as a conservative scaffold: scripts/refresh_prices.py reports snapshot staleness and exits 0; it does not wire live provider scraping (pricing pages are hostile to scraping and the issue requires the in-tree snapshot to keep working regardless). peter-evans/create-pull-request opens a review PR only when a maintainer wires a real scraper or edits the table. This honors "humans must verify pricing / never auto-merge."
- benchmarks/results/latest.{json,md} record the commit they were generated against and are regenerated on demand. They are not refreshed for review-cycle commits that leave every correctness metric unchanged (only sub-ms latency jitter would churn).
- #149 (GitHub Action, partially shipped), #78 (runtime ChainObserver), #217 (RedactionPolicy docs-sync).
https://claude.ai/code/session_01KBU8gNNWLVswrH9WP83F3y
Generated by Claude Code