feat: benchmark evidence + maintained provider price table (#103, #156, #207) by dgenio · Pull Request #224 · dgenio/ChainWeaver

dgenio · 2026-05-30T12:57:50Z

Summary

Implements the triage report's recommended "benchmark evidence & cost-claim artifacts" group as a single PR: a maintained provider price table, a data-corruption correctness benchmark, and an aggregate benchmark report that packages reproducible numbers for README/docs. These three issues cross-reference each other (#207 explicitly depends on #103 and #156), share the benchmarks/ + cost area, and touch nothing in the executor hot path.

Changes

#156 — Maintained provider price table

chainweaver/cost.py: PriceSnap (dated input/output per-Mtok), PROVIDER_PRICES snapshot table (OpenAI / Anthropic / Google / Bedrock, as_of 2026-05-01), lookup_price(), and CostProfile.from_provider() which derives a blended per-token cost. CostProfile gains provider / model / price_as_of; every CostReport now surfaces the snapshot date. compute_cost_report(..., provider=, model=) builds the profile from the table. No live HTTP — snapshots ship in-tree and always work offline. PROVIDER_PRICES is exposed as a read-only MappingProxyType so the shared table cannot be mutated at runtime.
chainweaver/exceptions.py: new CostProfileError for unknown (provider, model) pairs (exported in __all__, added to the README error table).
scripts/refresh_prices.py + .github/workflows/update-prices.yml: monthly maintainer-reviewed price-refresh PR, never auto-merged.
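The price-table design above can be illustrated with a minimal sketch. This is not the code in chainweaver/cost.py: the field names, the `(provider, model)` tuple key, the default `output_fraction`, and the placeholder prices are all assumptions made for illustration. Only the names `PriceSnap`, `PROVIDER_PRICES`, `lookup_price`, `CostProfileError`, and `blended_cost_per_token_usd` come from the PR.

```python
from dataclasses import dataclass
from datetime import date
from types import MappingProxyType
from typing import Mapping


class CostProfileError(Exception):
    """Raised when a (provider, model) pair has no price snapshot."""


@dataclass(frozen=True)
class PriceSnap:
    # Field names are assumptions; the PR only says "dated input/output per-Mtok".
    input_usd_per_mtok: float
    output_usd_per_mtok: float
    as_of: date

    def blended_cost_per_token_usd(self, output_fraction: float = 0.5) -> float:
        """Weighted average of input and output price, converted to USD per token."""
        per_mtok = (
            (1.0 - output_fraction) * self.input_usd_per_mtok
            + output_fraction * self.output_usd_per_mtok
        )
        return per_mtok / 1_000_000


# Placeholder entry with made-up prices, NOT real provider pricing.
_PRICES = {
    ("example-provider", "example-model"): PriceSnap(2.0, 6.0, date(2026, 5, 1)),
}

# Read-only view: item assignment on the proxy raises TypeError.
PROVIDER_PRICES: Mapping[tuple[str, str], PriceSnap] = MappingProxyType(_PRICES)


def lookup_price(provider: str, model: str) -> PriceSnap:
    """Return the snapshot for a known pair, or raise CostProfileError."""
    try:
        return PROVIDER_PRICES[(provider, model)]
    except KeyError:
        raise CostProfileError(
            f"no price snapshot for {provider!r}/{model!r}"
        ) from None
```

Because the table is plain in-tree data behind a read-only proxy, a lookup never needs the network, and a caller cannot accidentally change prices for every other caller in the same process.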

#103 — Correctness benchmark — benchmarks/bench_correctness.py: seeded LLMCorruptionProfile injects the five corruption types (field hallucination, data loss, type corruption, schema drift, routing variance); CorrectnessReport aggregates them. Compiled execution reports zero corruption by construction. Three scenarios (numeric / data-enrichment / long-chain) plus a "corruption compounds with chain length" table; human table + JSON output. determinism_rate measures outcome consistency (frequency of the most common (final_value, routing) outcome across runs), not end-to-end correctness, and is documented as such on CorrectnessReport.
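To make the "corruption compounds with chain length" idea concrete, here is a small seeded simulation. It is a sketch under a simplifying assumption that each step is corrupted independently with the same probability; the function names and the per-step rate are hypothetical and are not taken from benchmarks/bench_correctness.py.

```python
import random


def corrupted_fraction(
    chain_length: int, per_step_rate: float, runs: int, seed: int
) -> float:
    """Fraction of simulated runs in which at least one step was corrupted."""
    rng = random.Random(seed)  # seeded so repeated calls give identical results
    corrupted = 0
    for _ in range(runs):
        # A run counts as corrupted if any single step is corrupted.
        if any(rng.random() < per_step_rate for _ in range(chain_length)):
            corrupted += 1
    return corrupted / runs


def expected_corrupted_fraction(chain_length: int, per_step_rate: float) -> float:
    """Closed form under the independence assumption: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - per_step_rate) ** chain_length


if __name__ == "__main__":
    # The 5% per-step rate is an arbitrary illustrative value.
    for n in (1, 5, 10, 20):
        simulated = corrupted_fraction(n, per_step_rate=0.05, runs=10_000, seed=42)
        expected = expected_corrupted_fraction(n, 0.05)
        print(f"n={n:2d}  simulated={simulated:.3f}  expected={expected:.3f}")
```

Fixing the seed is what makes the numbers reproducible from run to run, which is the property the PR's seed-reproducibility tests check for the real benchmark.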

#207 — Aggregate report — benchmarks/report.py: build_report() packages latency, model-decisions-avoided, cost-avoided (via #156), correctness (via #103), environment metadata (Python / ChainWeaver / OS / commit SHA), and explicit caveats into versioned benchmarks/results/latest.{json,md}.
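The packaging step described above might look roughly like the following. This is a hedged sketch, not the contents of benchmarks/report.py: the function names, the `schema_version` field, and the dictionary layout are assumptions, and the ChainWeaver version is omitted so the example has no dependency on the package.

```python
import json
import platform
from pathlib import Path
from typing import Any


def build_report_sketch(
    sections: dict[str, Any], commit_sha: str, caveats: list[str]
) -> dict[str, Any]:
    """Bundle benchmark sections with environment metadata and caveats."""
    return {
        "schema_version": 1,  # hypothetical field standing in for "versioned"
        "environment": {
            "python": platform.python_version(),
            "os": platform.system(),
            "commit": commit_sha,
        },
        "sections": sections,
        "caveats": caveats,
    }


def render_markdown(report: dict[str, Any]) -> str:
    """Render a short human-readable summary of the report."""
    env = report["environment"]
    lines = [
        "# Benchmark report",
        "",
        f"Environment: Python {env['python']}, {env['os']}, commit {env['commit']}",
        "",
        "## Caveats",
        "",
    ]
    lines.extend(f"- {caveat}" for caveat in report["caveats"])
    return "\n".join(lines) + "\n"


def write_artifacts(report: dict[str, Any], out_dir: Path) -> None:
    """Write latest.json and latest.md side by side."""
    out_dir.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(report, indent=2, sort_keys=True) + "\n"
    (out_dir / "latest.json").write_text(payload, encoding="utf-8")
    (out_dir / "latest.md").write_text(render_markdown(report), encoding="utf-8")
```

Sorting keys and fixing the indent keeps the JSON diff small when the artifact is regenerated, which helps when the files are committed to the repository.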

Docs — README cost-avoided section (real $ figure) + error-table row; benchmarks/README.md correctness + report sections; AGENTS.md repo map (cost.py, scripts/, benchmarks/, workflows).

Testing

Linting passes (ruff check chainweaver/ tests/ examples/) — All checks passed
Formatting check passes (ruff format --check chainweaver/ tests/ examples/) — 162 files already formatted
Type checking passes (python -m mypy chainweaver/ tests/) — no issues in 130 files
All existing tests pass (python -m pytest tests/ -v) — 1187 passed, 2 skipped, coverage 91.46%
New tests added for new functionality: tests/test_cost.py (price table, lookup, from_provider, provider-priced reports, output_fraction bounds, partial provider/model rejection, MappingProxyType immutability) and tests/test_benchmark_artifacts.py (correctness invariants, seed reproducibility, compounding, report format (the CI format gate #207 asks for), plus naive-vs-compiled determinism_rate assertions)

Benchmark/script files (benchmarks/, scripts/) are also ruff- and mypy-clean, though they sit outside the CI-gated paths.

Review-cycle hardening

Applied after the initial review pass (commits 34f66fe, 2d4c871):

output_fraction is validated to [0.0, 1.0]; compute_cost_report rejects a partial (provider, model) pair instead of silently mispricing.
CostReport.__str__ no longer renders (as of None) when no snapshot date is present.
PROVIDER_PRICES made immutable (MappingProxyType); a stray unused import was removed from the README example.
determinism_rate was relabelled/documented as a consistency metric (it was previously mislabelled), and successful_runs now has identical semantics across the naive and compiled paths (gated on execution success alone) so the column is directly comparable.
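The input checks in this list can be sketched as small standalone helpers. The helper names are hypothetical, and the exception type for a partial (provider, model) pair is an assumption; the PR states only that out-of-range output_fraction raises ValueError and that a partial pair is rejected.

```python
from datetime import date
from typing import Optional


def validate_output_fraction(output_fraction: float) -> float:
    """Reject values outside [0.0, 1.0] (NaN is rejected as well)."""
    if not 0.0 <= output_fraction <= 1.0:
        raise ValueError(
            f"output_fraction must be in [0.0, 1.0], got {output_fraction!r}"
        )
    return output_fraction


def resolve_pricing_key(
    provider: Optional[str], model: Optional[str]
) -> Optional[tuple[str, str]]:
    """Return (provider, model), or None when neither is given.

    Raising ValueError for a partial pair is an assumption of this sketch.
    """
    if provider is None and model is None:
        return None
    if provider is None or model is None:
        raise ValueError("provider and model must be supplied together or not at all")
    return (provider, model)


def priced_against_line(
    provider: Optional[str], model: Optional[str], price_as_of: Optional[date]
) -> str:
    """Render the pricing line only when a snapshot date is present."""
    if provider is None or model is None or price_as_of is None:
        return ""
    return f"Priced against {provider}/{model} (as of {price_as_of.isoformat()})"
```

Failing loudly on a partial pair avoids the quieter failure mode the review flagged, where a caller believes a report is provider-priced when it is not.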

Related Issues

Closes #103, #156, #207.

Checklist

Code follows project conventions (see AGENTS.md and docs/agent-context/)
Public API changes are documented (README + __all__ + AGENTS.md; public-API snapshot fixture regenerated)
No secrets or credentials included

Scope notes (Mode B)

Documented delta: the #156 monthly workflow ships as a conservative scaffold. scripts/refresh_prices.py reports snapshot staleness and exits 0; it does not wire live provider scraping (pricing pages are hostile to scraping, and the issue requires the in-tree snapshot to keep working regardless). peter-evans/create-pull-request opens a review PR only when a maintainer wires a real scraper or edits the table. This honors "humans must verify pricing / never auto-merge."
benchmarks/results/latest.{json,md} record the commit they were generated against and are regenerated on demand. They are not refreshed for review-cycle commits that leave every correctness metric unchanged (only sub-ms latency jitter would churn).
Adjacent issues left for follow-up (not bundled): #149 (publish the chainweaver-action reusable GitHub Action wrapping chainweaver check; partially shipped), #78 (runtime ChainObserver with auto-flow suggestion), #217 (RedactionPolicy.redact_step_record / redact_execution_result helpers; docs-sync).
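The "report staleness and exit 0" behaviour described for scripts/refresh_prices.py can be approximated as follows. This is an illustrative sketch only: the function names, the 45-day threshold, and the snapshot dictionary are assumptions, not the script's actual contents.

```python
from datetime import date


def staleness_days(as_of: date, today: date) -> int:
    """Age of a snapshot in whole days."""
    return (today - as_of).days


def staleness_report(
    snapshots: dict[str, date], today: date, max_age_days: int = 45
) -> list[str]:
    """One line per snapshot, flagged STALE past the (assumed) threshold."""
    lines = []
    for name, as_of in sorted(snapshots.items()):
        age = staleness_days(as_of, today)
        status = "STALE" if age > max_age_days else "ok"
        lines.append(f"{name}: as_of {as_of.isoformat()} ({age} days old) {status}")
    return lines


def main() -> int:
    # Placeholder entry; a real script would read the in-tree price table.
    snapshots = {"example-provider/example-model": date(2026, 5, 1)}
    for line in staleness_report(snapshots, today=date.today()):
        print(line)
    return 0  # report only: staleness never fails the workflow


if __name__ == "__main__":
    raise SystemExit(main())
```

Returning 0 unconditionally keeps the scheduled workflow green while still surfacing stale entries in its log for a maintainer to act on.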

https://claude.ai/code/session_01KBU8gNNWLVswrH9WP83F3y

Generated by Claude Code

#207)

Implements the "benchmark evidence & cost-claim artifacts" group as one PR.

#156: Maintained provider price table
- chainweaver/cost.py: PriceSnap, dated PROVIDER_PRICES table, lookup_price, CostProfile.from_provider (blended per-token cost), and provider/model/price_as_of fields surfaced on every CostReport. compute_cost_report now accepts provider/model. No live HTTP; snapshots ship in-tree.
- New CostProfileError for unknown (provider, model) pairs.
- scripts/refresh_prices.py + .github/workflows/update-prices.yml open a maintainer-reviewed monthly PR; prices are never auto-merged.

#103: Correctness benchmark (benchmarks/bench_correctness.py)
- Seeded LLMCorruptionProfile injects the five corruption types (field hallucination, data loss, type corruption, schema drift, routing variance); CorrectnessReport aggregates them. Compiled execution shows zero corruption by construction. Three scenarios + a corruption-compounds-with-length table.

#207: Aggregate report (benchmarks/report.py)
- build_report packages latency, decisions-avoided, cost-avoided (via #156), correctness (via #103), environment metadata, and caveats into versioned results/latest.{json,md}. tests/test_benchmark_artifacts.py guards the format (the CI gate #207 asks for) and the zero-corruption invariant.

Docs: README cost-avoided section (real $ figure) + error-table row, benchmarks/README correctness + report sections, AGENTS.md repo map.

Validation: ruff check + ruff format --check + mypy all clean; pytest 1183 passed, coverage 91.46%.

https://claude.ai/code/session_01KBU8gNNWLVswrH9WP83F3y

github-actions

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'ChainWeaver microbenchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.25.

| Benchmark suite | Current: `2d4c871` | Previous: `3e3ac00` | Ratio |
| --- | --- | --- | --- |
| `compiled_overhead_ms_n5_llm500_tool50` | `0.33671400001367147` ms | `0.2267619999827275` ms | `1.48` |
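The ratio in the table can be reproduced from the two measurements. The sketch below assumes a lower-is-better metric and that the alert fires when current / previous exceeds the threshold, which matches the wording of the alert; the helper names are hypothetical and this is not github-action-benchmark's own code.

```python
def regression_ratio(current_ms: float, previous_ms: float) -> float:
    """Ratio of the current measurement to the previous one (lower is better)."""
    return current_ms / previous_ms


def is_regression(current_ms: float, previous_ms: float, threshold: float = 1.25) -> bool:
    """True when the current result is worse than previous by more than threshold."""
    return regression_ratio(current_ms, previous_ms) > threshold


current = 0.33671400001367147   # ms at commit 2d4c871
previous = 0.2267619999827275   # ms at commit 3e3ac00

ratio = regression_ratio(current, previous)
print(f"{ratio:.2f}")                    # 1.48
print(is_regression(current, previous))  # True
```

The absolute difference here is roughly 0.11 ms, so the alert is about a relative change in a sub-millisecond measurement.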

This comment was automatically generated by workflow using github-action-benchmark.

CC: @dgenio

Copilot

Pull request overview

This PR adds reproducible “evidence artifacts” to support README/docs claims around cost-avoided, latency, and correctness: a maintained provider price snapshot table (offline), a seeded correctness/data-corruption benchmark, and an aggregate benchmark report emitted as committed latest.{json,md} artifacts.

Changes:

Add an in-repo, dated provider/model price snapshot table with lookup + CostProfile.from_provider(), and surface the snapshot date on cost reports.
Add a seeded correctness benchmark that simulates common LLM-mediated corruption modes vs compiled flow execution.
Add an aggregate benchmark report generator (benchmarks/report.py) plus committed benchmarks/results/latest.{json,md} and CI format guards.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| `chainweaver/cost.py` | Introduces `PriceSnap`, `PROVIDER_PRICES`, `lookup_price()`, `CostProfile.from_provider()`, and extends `compute_cost_report()`/`CostReport.__str__()` to support priced reporting. |
| `chainweaver/exceptions.py` | Adds `CostProfileError` for unknown `(provider, model)` pricing lookups. |
| `chainweaver/__init__.py` | Exports new cost symbols (`PriceSnap`, `PROVIDER_PRICES`, `lookup_price`, `CostProfileError`) via `__all__`. |
| `README.md` | Documents priced cost-avoided reporting and adds `CostProfileError` to the error table. |
| `scripts/refresh_prices.py` | Adds a maintainer-facing staleness/reporting helper for price snapshots (no runtime network). |
| `.github/workflows/update-prices.yml` | Adds a scheduled workflow that can open a human-reviewed “refresh prices” PR. |
| `benchmarks/bench_correctness.py` | Adds the seeded correctness/data-corruption benchmark and JSON/table output. |
| `benchmarks/report.py` | Adds aggregate report generation (environment + latency + cost + correctness + caveats) and artifact writing. |
| `benchmarks/README.md` | Documents how to run correctness and aggregate report benchmarks and how artifacts are used/guarded. |
| `benchmarks/results/latest.json` | Adds committed “latest” machine-readable benchmark artifact for docs/README evidence. |
| `benchmarks/results/latest.md` | Adds committed “latest” human-readable benchmark artifact for docs/README evidence. |
| `tests/test_cost.py` | Adds tests for the price table, lookup behavior, provider-derived profiles, and provider-priced cost reports. |
| `tests/test_benchmark_artifacts.py` | Adds CI guards validating report/correctness artifact shapes and reproducibility. |
| `tests/fixtures/public_api.json` | Updates the public API snapshot for new exported symbols/fields. |
| `AGENTS.md` | Updates the repo map to include the new cost/benchmark/workflow additions. |

Address the Copilot review on #224:
- Validate output_fraction is in [0.0, 1.0] in PriceSnap.blended_cost_per_token_usd (raise ValueError); out-of-range values previously produced negative/inflated blended costs.
- Require provider/model together-or-neither in compute_cost_report so a partial config no longer silently returns an unpriced default report.
- Only render the "Priced against" line when price_as_of is also present, avoiding a confusing "(as of None)".
- Expose PROVIDER_PRICES as a read-only MappingProxyType so callers cannot mutate the shared in-process price table.
- Drop the unused CostProfile import from the README cost example.
- Redefine the benchmark determinism_rate as the most-common-outcome frequency / runs (consistency across runs) instead of a correctness alias; compiled execution is 1.0 by construction.

Refresh the public-API snapshot (PROVIDER_PRICES is now a mappingproxy) and regenerate benchmarks/results/latest.{json,md}. Tests added for every fix; ruff, ruff format, mypy, and pytest all pass.

Audit follow-up to the determinism_rate fix. benchmark_compiled_correctness counted successful_runs only when the output also matched the truth value, while benchmark_naive_correctness counts any run that executed without failure. The values coincide for the compiled path (success implies correctness by construction), but the field then meant two different things across approaches. Gate compiled successful_runs on result.success alone so the metric is directly comparable; correctness-by-construction stays asserted via corruption_rate, data_integrity_score, and determinism_rate. Also document determinism_rate on CorrectnessReport: it measures outcome *consistency* (frequency of the most common (final_value, routing) outcome), not end-to-end correctness, so future readers don't assume the old meaning. https://claude.ai/code/session_01CF3z555AAGbJCCJwAerMpJ
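The two metrics discussed in this commit message can be sketched as follows. `RunOutcome` and both helper functions are hypothetical stand-ins rather than the benchmark's real types, and returning 0.0 for an empty run list is an assumption of the sketch.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Hashable, Sequence


@dataclass(frozen=True)
class RunOutcome:
    success: bool            # did the run execute without failure?
    final_value: Hashable    # the value the run produced
    routing: tuple[str, ...]  # the sequence of steps the run took


def determinism_rate(runs: Sequence[RunOutcome]) -> float:
    """Frequency of the most common (final_value, routing) outcome across runs."""
    if not runs:
        return 0.0  # assumption: no runs means no measurable consistency
    counts = Counter((run.final_value, run.routing) for run in runs)
    return counts.most_common(1)[0][1] / len(runs)


def successful_runs(runs: Sequence[RunOutcome]) -> int:
    """Count runs gated on execution success alone, not on correctness."""
    return sum(1 for run in runs if run.success)
```

Note that a pipeline producing the same wrong answer on every run would still score 1.0 here, which is why the metric is documented as consistency rather than end-to-end correctness.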

Copilot AI review requested due to automatic review settings May 30, 2026 12:57

Copilot started reviewing on behalf of dgenio May 30, 2026 12:57

github-actions Bot reviewed May 30, 2026

Copilot AI reviewed May 30, 2026

claude added 2 commits May 30, 2026 21:10

dgenio merged commit 741a4b4 into main May 30, 2026
15 checks passed

dgenio deleted the claude/github-issues-triage-SnIJB branch May 30, 2026 22:08

dgenio merged 3 commits into main from claude/github-issues-triage-SnIJB

3 participants
