feat: benchmark evidence + maintained provider price table (#103, #156, #207) #224

Merged
dgenio merged 3 commits into main from claude/github-issues-triage-SnIJB
May 30, 2026

Conversation

@dgenio
Owner

@dgenio dgenio commented May 30, 2026

Summary

Implements the triage report's recommended "benchmark evidence & cost-claim artifacts" group as a single PR: a maintained provider price table, a data-corruption correctness benchmark, and an aggregate benchmark report that packages reproducible numbers for README/docs. These three issues cross-reference each other (#207 explicitly depends on #103 and #156), share the benchmarks/ + cost area, and touch nothing in the executor hot path.

Changes

#156 — Maintained provider price table

  • chainweaver/cost.py: PriceSnap (dated input/output per-Mtok), PROVIDER_PRICES snapshot table (OpenAI / Anthropic / Google / Bedrock, as_of 2026-05-01), lookup_price(), and CostProfile.from_provider() which derives a blended per-token cost. CostProfile gains provider / model / price_as_of; every CostReport now surfaces the snapshot date. compute_cost_report(..., provider=, model=) builds the profile from the table. No live HTTP — snapshots ship in-tree and always work offline. PROVIDER_PRICES is exposed as a read-only MappingProxyType so the shared table cannot be mutated at runtime.
  • chainweaver/exceptions.py: new CostProfileError for unknown (provider, model) pairs (exported in __all__, added to the README error table).
  • scripts/refresh_prices.py + .github/workflows/update-prices.yml: monthly maintainer-reviewed price-refresh PR, never auto-merged.
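
A minimal sketch of how such an offline price table could be laid out, for readers who have not opened chainweaver/cost.py. The names PriceSnap, PROVIDER_PRICES, lookup_price, CostProfileError, and blended_cost_per_token_usd come from this PR; the field names, the (provider, model) tuple key, the default output_fraction, the exception base class, and the prices themselves are assumptions made up for illustration, not the shipped values.

```python
from dataclasses import dataclass
from datetime import date
from types import MappingProxyType
from typing import Mapping


class CostProfileError(Exception):
    """Unknown (provider, model) pair. The base class here is an assumption."""


@dataclass(frozen=True)
class PriceSnap:
    # Field names are hypothetical; the PR only says "dated input/output per-Mtok".
    input_per_mtok_usd: float
    output_per_mtok_usd: float
    as_of: date

    def blended_cost_per_token_usd(self, output_fraction: float = 0.25) -> float:
        # Out-of-range fractions would yield negative or inflated costs, so reject them.
        if not 0.0 <= output_fraction <= 1.0:
            raise ValueError(f"output_fraction must be in [0.0, 1.0], got {output_fraction}")
        per_mtok = (
            (1.0 - output_fraction) * self.input_per_mtok_usd
            + output_fraction * self.output_per_mtok_usd
        )
        return per_mtok / 1_000_000


# Placeholder entry with invented prices; the real table ships in-tree.
_PRICES = {
    ("example-provider", "example-model"): PriceSnap(2.0, 8.0, date(2026, 5, 1)),
}

# Read-only view: callers can read the shared table but not mutate it.
PROVIDER_PRICES: Mapping[tuple[str, str], PriceSnap] = MappingProxyType(_PRICES)


def lookup_price(provider: str, model: str) -> PriceSnap:
    try:
        return PROVIDER_PRICES[(provider, model)]
    except KeyError:
        raise CostProfileError(
            f"no price snapshot for ({provider!r}, {model!r})"
        ) from None
```

Because the table is a plain in-process mapping, a lookup never touches the network, which is what lets the snapshots work offline.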

#103 — Correctness benchmark (benchmarks/bench_correctness.py): seeded LLMCorruptionProfile injects the five corruption types (field hallucination, data loss, type corruption, schema drift, routing variance); CorrectnessReport aggregates them. Compiled execution reports zero corruption by construction. Three scenarios (numeric / data-enrichment / long-chain) plus a "corruption compounds with chain length" table; human table + JSON output. determinism_rate measures outcome consistency (frequency of the most common (final_value, routing) outcome across runs), not end-to-end correctness, and is documented as such on CorrectnessReport.
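
To make the determinism_rate definition concrete, here is a small sketch of the metric as described above: the share of runs that land on the most common (final_value, routing) outcome. The function signature and the choice to return 0.0 for an empty run list are assumptions; the benchmark's own implementation may differ.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple

Outcome = Tuple[Hashable, Hashable]  # (final_value, routing)


def determinism_rate(outcomes: Sequence[Outcome]) -> float:
    """Frequency of the most common outcome divided by the number of runs."""
    if not outcomes:
        return 0.0  # assumption: no runs means no measurable consistency
    _, top_count = Counter(outcomes).most_common(1)[0]
    return top_count / len(outcomes)


# Hypothetical runs: the naive path drifts in both value and routing.
naive_runs = [(42, "a->b"), (42, "a->b"), (41, "a->b"), (42, "a->c")]
compiled_runs = [(42, "a->b")] * 4
# Consistently wrong output still scores 1.0: this is consistency, not correctness.
consistently_wrong = [(41, "a->b")] * 3

print(determinism_rate(naive_runs))          # 0.5
print(determinism_rate(compiled_runs))       # 1.0
print(determinism_rate(consistently_wrong))  # 1.0
```

The third case is why the metric is documented as a consistency measure: a run set that always produces the same wrong answer is perfectly deterministic.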

#207 — Aggregate report (benchmarks/report.py): build_report() packages latency, model-decisions-avoided, cost-avoided (via #156), correctness (via #103), environment metadata (Python / ChainWeaver / OS / commit SHA), and explicit caveats into versioned benchmarks/results/latest.{json,md}.
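
As a rough illustration of the packaging step, the sketch below collects environment metadata and writes a JSON file plus a Markdown summary side by side. The helper names collect_environment and write_report, the dictionary keys, and the Markdown layout are all hypothetical; the ChainWeaver version is left out because this sketch does not assume how the package exposes it.

```python
import json
import platform
import subprocess
import tempfile
from pathlib import Path
from typing import Any


def collect_environment() -> dict[str, str]:
    """Gather Python, OS, and commit SHA; fall back to 'unknown' outside a git checkout."""
    try:
        sha = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        sha = "unknown"
    return {
        "python": platform.python_version(),
        "os": platform.system(),
        "commit": sha,
    }


def write_report(report: dict[str, Any], out_dir: Path) -> None:
    """Write latest.json (machine-readable) and latest.md (human-readable)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "latest.json").write_text(
        json.dumps(report, indent=2, sort_keys=True) + "\n"
    )
    lines = ["# Benchmark report", "", "## Environment", ""]
    lines += [f"- {key}: {value}" for key, value in report["environment"].items()]
    lines += ["", "## Caveats", ""]
    lines += [f"- {caveat}" for caveat in report["caveats"]]
    (out_dir / "latest.md").write_text("\n".join(lines) + "\n")


# Section payloads are left empty here rather than inventing benchmark numbers.
report = {
    "environment": collect_environment(),
    "latency": {},
    "cost": {},
    "correctness": {},
    "caveats": ["Placeholder caveat for illustration only."],
}

with tempfile.TemporaryDirectory() as tmp:
    write_report(report, Path(tmp))
    round_trip = json.loads((Path(tmp) / "latest.json").read_text())
    markdown = (Path(tmp) / "latest.md").read_text()
```

Sorting keys and ending each file with a newline keeps regenerated artifacts diff-friendly, which matters when the files are committed and guarded by a format test.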

Docs — README cost-avoided section (real $ figure) + error-table row; benchmarks/README.md correctness + report sections; AGENTS.md repo map (cost.py, scripts/, benchmarks/, workflows).

Testing

  • Linting passes (ruff check chainweaver/ tests/ examples/) — All checks passed
  • Formatting check passes (ruff format --check chainweaver/ tests/ examples/) — 162 files already formatted
  • Type checking passes (python -m mypy chainweaver/ tests/) — no issues in 130 files
  • All existing tests pass (python -m pytest tests/ -v) — 1187 passed, 2 skipped, coverage 91.46%
  • New tests added for new functionality — tests/test_cost.py (price table, lookup, from_provider, provider-priced reports, output_fraction bounds, partial provider/model rejection, MappingProxyType immutability) and tests/test_benchmark_artifacts.py (correctness invariants, seed reproducibility, compounding, report format, i.e. the CI format gate that #207 asks for, plus naive-vs-compiled determinism_rate assertions)

Benchmark/script files (benchmarks/, scripts/) are also ruff- and mypy-clean, though they sit outside the CI-gated paths.

Review-cycle hardening

Applied after the initial review pass (commits 34f66fe, 2d4c871):

  • output_fraction is validated to [0.0, 1.0]; compute_cost_report rejects a partial (provider, model) pair instead of silently mispricing.
  • CostReport.__str__ no longer renders (as of None) when no snapshot date is present.
  • PROVIDER_PRICES made immutable (MappingProxyType); a stray unused import was removed from the README example.
  • determinism_rate was relabelled/documented as a consistency metric (it was previously mislabelled), and successful_runs now has identical semantics across the naive and compiled paths (gated on execution success alone) so the column is directly comparable.
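
The partial-pair rejection and the "(as of None)" fix can be sketched as two small guards. Both helper names are hypothetical, the use of ValueError for a partial pair is an assumption (the PR does not state the exception type), and the exact wording of the rendered line is illustrative.

```python
from typing import Optional


def wants_priced_report(provider: Optional[str], model: Optional[str]) -> bool:
    """Together-or-neither: reject a partial (provider, model) pair outright."""
    if (provider is None) != (model is None):
        raise ValueError("provider and model must be given together or not at all")
    return provider is not None


def priced_against_line(
    provider: Optional[str], model: Optional[str], price_as_of: Optional[str]
) -> Optional[str]:
    """Return the snapshot line only when every part of it is known."""
    if provider is None or model is None or price_as_of is None:
        return None  # omit the line instead of printing "(as of None)"
    return f"Priced against {provider}/{model} (as of {price_as_of})"
```

Failing fast on a partial pair is the safer default: silently falling back to an unpriced report would look like a valid result while ignoring what the caller asked for.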

Related Issues

Closes #103, #156, #207.

Checklist

  • Code follows project conventions (see AGENTS.md and docs/agent-context/)
  • Public API changes are documented (README + __all__ + AGENTS.md; public-API snapshot fixture regenerated)
  • No secrets or credentials included

Scope notes (Mode B)

https://claude.ai/code/session_01KBU8gNNWLVswrH9WP83F3y


Generated by Claude Code


Implements the "benchmark evidence & cost-claim artifacts" group as one PR.

#156 — Maintained provider price table:
- chainweaver/cost.py: PriceSnap, dated PROVIDER_PRICES table, lookup_price,
  CostProfile.from_provider (blended per-token cost), and provider/model/
  price_as_of fields surfaced on every CostReport. compute_cost_report now
  accepts provider/model. No live HTTP — snapshots ship in-tree.
- New CostProfileError for unknown (provider, model) pairs.
- scripts/refresh_prices.py + .github/workflows/update-prices.yml open a
  maintainer-reviewed monthly PR; prices are never auto-merged.

#103 — Correctness benchmark (benchmarks/bench_correctness.py):
- Seeded LLMCorruptionProfile injects the five corruption types (field
  hallucination, data loss, type corruption, schema drift, routing variance);
  CorrectnessReport aggregates them. Compiled execution shows zero corruption
  by construction. Three scenarios + a corruption-compounds-with-length table.
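
One way to see why corruption compounds with chain length is a seeded simulation under a simplifying assumption: each step corrupts the payload independently with the same probability. That independence model, the function names, and the parameters are assumptions for illustration; the real LLMCorruptionProfile may model the five corruption types differently.

```python
import random


def simulate_corrupted_fraction(
    steps: int, per_step_rate: float, runs: int, seed: int
) -> float:
    """Fraction of runs in which at least one step corrupted the payload."""
    rng = random.Random(seed)  # local seeded RNG keeps results reproducible
    corrupted = 0
    for _ in range(runs):
        if any(rng.random() < per_step_rate for _ in range(steps)):
            corrupted += 1
    return corrupted / runs


def expected_corrupted_fraction(steps: int, per_step_rate: float) -> float:
    """Closed form under the independence assumption: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - per_step_rate) ** steps


for n in (1, 5, 10):
    simulated = simulate_corrupted_fraction(n, 0.05, runs=10_000, seed=7)
    print(n, round(expected_corrupted_fraction(n, 0.05), 3), round(simulated, 3))
```

A compiled flow corresponds to a per-step rate of zero, so both functions return 0.0 regardless of chain length.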

#207 — Aggregate report (benchmarks/report.py):
- build_report packages latency, decisions-avoided, cost-avoided (via #156),
  correctness (via #103), environment metadata, and caveats into versioned
  results/latest.{json,md}. tests/test_benchmark_artifacts.py guards the
  format (the CI gate #207 asks for) and the zero-corruption invariant.

Docs: README cost-avoided section (real $ figure) + error-table row,
benchmarks/README correctness + report sections, AGENTS.md repo map.

Validation: ruff check + ruff format --check + mypy all clean;
pytest 1183 passed, coverage 91.46%.

https://claude.ai/code/session_01KBU8gNNWLVswrH9WP83F3y
Copilot AI review requested due to automatic review settings May 30, 2026 12:57
Contributor

@github-actions github-actions Bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'ChainWeaver microbenchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.25.

| Benchmark suite | Current: 2d4c871 | Previous: 3e3ac00 | Ratio |
| --- | --- | --- | --- |
| compiled_overhead_ms_n5_llm500_tool50 | 0.33671400001367147 ms | 0.2267619999827275 ms | 1.48 |
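
The ratio in the alert can be reproduced from the two timings. The sketch assumes a lower-is-better metric and a strict greater-than comparison against the 1.25 threshold quoted above; the exact comparison used by the workflow is not shown in this thread.

```python
ALERT_THRESHOLD = 1.25  # threshold quoted in the alert


def regression_ratio(current_ms: float, previous_ms: float) -> float:
    """For a lower-is-better timing, a ratio above 1.0 means the commit got slower."""
    return current_ms / previous_ms


ratio = regression_ratio(0.33671400001367147, 0.2267619999827275)
is_regression = ratio > ALERT_THRESHOLD
print(f"{ratio:.2f}", is_regression)  # prints "1.48 True"
```

The absolute difference is about 0.11 ms on a sub-millisecond measurement, so a ratio alert on this benchmark can be sensitive to runner noise.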

This comment was automatically generated by workflow using github-action-benchmark.

CC: @dgenio

Copilot AI left a comment

Pull request overview

This PR adds reproducible “evidence artifacts” to support README/docs claims around cost-avoided, latency, and correctness: a maintained provider price snapshot table (offline), a seeded correctness/data-corruption benchmark, and an aggregate benchmark report emitted as committed latest.{json,md} artifacts.

Changes:

  • Add an in-repo, dated provider/model price snapshot table with lookup + CostProfile.from_provider(), and surface the snapshot date on cost reports.
  • Add a seeded correctness benchmark that simulates common LLM-mediated corruption modes vs compiled flow execution.
  • Add an aggregate benchmark report generator (benchmarks/report.py) plus committed benchmarks/results/latest.{json,md} and CI format guards.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| chainweaver/cost.py | Introduces PriceSnap, PROVIDER_PRICES, lookup_price(), CostProfile.from_provider(), and extends compute_cost_report()/CostReport.__str__() to support priced reporting. |
| chainweaver/exceptions.py | Adds CostProfileError for unknown (provider, model) pricing lookups. |
| chainweaver/__init__.py | Exports new cost symbols (PriceSnap, PROVIDER_PRICES, lookup_price, CostProfileError) via __all__. |
| README.md | Documents priced cost-avoided reporting and adds CostProfileError to the error table. |
| scripts/refresh_prices.py | Adds a maintainer-facing staleness/reporting helper for price snapshots (no runtime network). |
| .github/workflows/update-prices.yml | Adds a scheduled workflow that can open a human-reviewed “refresh prices” PR. |
| benchmarks/bench_correctness.py | Adds the seeded correctness/data-corruption benchmark and JSON/table output. |
| benchmarks/report.py | Adds aggregate report generation (environment + latency + cost + correctness + caveats) and artifact writing. |
| benchmarks/README.md | Documents how to run correctness and aggregate report benchmarks and how artifacts are used/guarded. |
| benchmarks/results/latest.json | Adds committed “latest” machine-readable benchmark artifact for docs/README evidence. |
| benchmarks/results/latest.md | Adds committed “latest” human-readable benchmark artifact for docs/README evidence. |
| tests/test_cost.py | Adds tests for the price table, lookup behavior, provider-derived profiles, and provider-priced cost reports. |
| tests/test_benchmark_artifacts.py | Adds CI guards validating report/correctness artifact shapes and reproducibility. |
| tests/fixtures/public_api.json | Updates the public API snapshot for new exported symbols/fields. |
| AGENTS.md | Updates the repo map to include the new cost/benchmark/workflow additions. |

claude added 2 commits May 30, 2026 21:10
Address the Copilot review on #224:
- Validate output_fraction is in [0.0, 1.0] in
  PriceSnap.blended_cost_per_token_usd (raise ValueError); out-of-range
  values previously produced negative/inflated blended costs.
- Require provider/model together-or-neither in compute_cost_report so a
  partial config no longer silently returns an unpriced default report.
- Only render the "Priced against" line when price_as_of is also present,
  avoiding a confusing "(as of None)".
- Expose PROVIDER_PRICES as a read-only MappingProxyType so callers cannot
  mutate the shared in-process price table.
- Drop the unused CostProfile import from the README cost example.
- Redefine the benchmark determinism_rate as the most-common-outcome
  frequency / runs (consistency across runs) instead of a correctness
  alias; compiled execution is 1.0 by construction.

Refresh the public-API snapshot (PROVIDER_PRICES is now a mappingproxy)
and regenerate benchmarks/results/latest.{json,md}. Tests added for every
fix; ruff, ruff format, mypy, and pytest all pass.

Audit follow-up to the determinism_rate fix. benchmark_compiled_correctness
counted successful_runs only when the output also matched the truth value,
while benchmark_naive_correctness counts any run that executed without
failure. The values coincide for the compiled path (success implies
correctness by construction), but the field then meant two different things
across approaches. Gate compiled successful_runs on result.success alone so
the metric is directly comparable; correctness-by-construction stays
asserted via corruption_rate, data_integrity_score, and determinism_rate.

Also document determinism_rate on CorrectnessReport: it measures outcome
*consistency* (frequency of the most common (final_value, routing) outcome),
not end-to-end correctness, so future readers don't assume the old meaning.

https://claude.ai/code/session_01CF3z555AAGbJCCJwAerMpJ
@dgenio dgenio merged commit 741a4b4 into main May 30, 2026
15 checks passed
@dgenio dgenio deleted the claude/github-issues-triage-SnIJB branch May 30, 2026 22:08

Development

Successfully merging this pull request may close these issues.

Add correctness benchmark: data corruption in naive LLM chaining vs compiled flows

3 participants