feat(benchmark): median-of-N regression gate for make benchmark-compare (#69 goal 3)#84

Merged: skuenzli merged 4 commits into main from feat/issue-69-median-of-n-gate on May 25, 2026


@skuenzli
Contributor

Summary

Verification

End-to-end smoke test on this branch:

| N | test_simple_policy_allow Δ | Result |
|---|---|---|
| 1 | +83.8% (single-run noise on 144μs benchmark) | FAIL |
| 5 | +1.2% | PASS |

At N=5 all 26 benchmarks pass within ±2.4% of baseline. The N=1 vs N=5 contrast is the punchline of goal 3.
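
To make the N=1 vs N=5 contrast concrete, here is a small arithmetic sketch. The per-run timings are invented for illustration (only the roughly 144μs scale and the +83.8% single-run figure come from this PR), and treating 144μs as the baseline median is an assumption.

```python
from statistics import mean, median

# ASSUMPTION: baseline median of 144 μs, taken from the "144μs benchmark" note.
baseline_us = 144.0

# HYPOTHETICAL timings for five runs of one fast benchmark, in microseconds.
# Four runs sit near the baseline; the third is a tail outlier.
runs_us = [145.0, 143.5, 264.7, 146.2, 144.8]


def delta_pct(current: float, baseline: float) -> float:
    """Relative change versus baseline, in percent (positive means slower)."""
    return (current - baseline) / baseline * 100.0


# N=1: if the only run happens to be the outlier, the gate sees a huge regression.
single_run_delta = delta_pct(runs_us[2], baseline_us)

# N=5: the median ignores the single outlier; the mean is still pulled upward.
median_delta = delta_pct(median(runs_us), baseline_us)
mean_delta = delta_pct(mean(runs_us), baseline_us)

print(f"single run: {single_run_delta:+.1f}%")
print(f"median of 5: {median_delta:+.1f}%")
print(f"mean of 5: {mean_delta:+.1f}%")
```

With these made-up numbers the median stays close to baseline while the mean is still dragged up by the one slow run, which is consistent with gating on median Δ only.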

8 unit tests added in tests/unit/test_benchmark_compare.py covering pass / fail / faster-than-baseline / benchmark-missing-from-baseline / benchmark-missing-from-current / N=1 / missing run files / missing baseline.
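
The comparison these tests exercise can be sketched roughly as follows. This is not the code in aggregate.py: the function names, the 5% threshold, and the choice to report missing benchmarks as SKIP are all assumptions made for illustration. The only detail taken as given is the pytest-benchmark JSON layout, where each entry under `benchmarks` carries a `name` and a `stats.median` value.

```python
import json
from pathlib import Path
from statistics import median


def load_medians(path):
    """Map benchmark name -> stats.median from one pytest-benchmark JSON file."""
    data = json.loads(Path(path).read_text())
    return {b["name"]: b["stats"]["median"] for b in data["benchmarks"]}


def compare_current(baseline, runs, threshold_pct=5.0):
    """Compare the median across runs against the baseline median.

    baseline: {name: median}; runs: one {name: median} dict per run file.
    threshold_pct is a HYPOTHETICAL default; the PR does not state the value.
    Returns (rows, ok), where rows are (name, delta_pct_or_None, status).
    """
    if not runs:
        raise ValueError("no run files found")
    rows, ok = [], True
    names = sorted(set(baseline) | {name for run in runs for name in run})
    for name in names:
        samples = [run[name] for run in runs if name in run]
        if name not in baseline or not samples:
            # ASSUMPTION: a benchmark missing on either side is reported, not failed.
            rows.append((name, None, "SKIP"))
            continue
        delta = (median(samples) - baseline[name]) / baseline[name] * 100.0
        passed = delta <= threshold_pct  # faster than baseline always passes
        ok = ok and passed
        rows.append((name, delta, "PASS" if passed else "FAIL"))
    return rows, ok


# Usage with in-memory data: one slow outlier in run 1 does not fail the gate at N=3.
baseline = {"a": 100.0, "b": 100.0}
runs = [{"a": 101.0, "b": 190.0}, {"a": 99.0, "b": 101.0}, {"a": 100.0, "b": 102.0}]
rows, ok = compare_current(baseline, runs)
for name, delta, status in rows:
    print(name, "n/a" if delta is None else f"{delta:+.1f}%", status)
```

In a real run, `baseline` would come from `load_medians` on baseline.json and `runs` from applying it to each run file; passing only the first run above would fail benchmark `b`, which mirrors the N=1 case.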

Test plan

  • pytest tests/unit/test_benchmark_compare.py — 8/8 pass
  • make benchmark-compare (default N=5) — PASS on all 26 benchmarks
  • BENCHMARK_RUNS=1 make benchmark-compare — FAILs as expected (single-run noise on fast benchmarks)
  • Reviewer: confirm tests/benchmark/results/current/ stays out of git status after a run (gitignored)
  • Reviewer: confirm that dropping the mean threshold is the desired behavior; the out-of-scope note in #69 (benchmark process improvements: release builds, refreshed baseline, multi-run signal) flagged it for removal once goal 3 lands

🤖 Generated with Claude Code

skuenzli and others added 4 commits May 25, 2026 14:15
Replaces the single pytest-benchmark run + median:5%/mean:15% gate with an
N=5 release-mode multi-run gate that compares median Δ across runs against
baseline.json. Removes the false-regression risk from any single-run tail
outlier — the same noise that motivated #69. Mean threshold dropped (noisy
by design); only median Δ is gated.

- tests/benchmark/run_current.sh: HEAD-only N-run release-mode runner;
  writes ephemeral tests/benchmark/results/current/run<N>.json (gitignored,
  wiped each invocation).
- aggregate.py --compare-current: loads run*.json, computes per-benchmark
  median μs, compares against baseline.json's stats.median, prints PASS/FAIL
  table on both pass and fail.
- make benchmark-compare: rewired to invoke the runner + comparator.
  BENCHMARK_RUNS=N honored.
- 8 unit tests covering pass/fail/faster-than-baseline/missing-benchmark/
  N=1/missing-files cases.

Verified end-to-end: N=5 at this branch tip — all 26 benchmarks PASS within
±2.4% of baseline. Same code at N=1 had test_simple_policy_allow at +83.8%
(single-run noise on a 144μs benchmark), illustrating exactly what the
multi-run aggregation fixes.

Closes #69 goal 3 (goals 1, 2 landed in PR #72; historical record in
PR #71/#73).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skuenzli skuenzli merged commit d4844ad into main May 25, 2026
7 checks passed
@skuenzli skuenzli self-assigned this May 25, 2026

Development

Successfully merging this pull request may close these issues.

benchmark process improvements: release builds, refreshed baseline, multi-run signal
