feat(benchmark): median-of-N regression gate for make benchmark-compare (#69 goal 3) #84
Merged
Replaces the single pytest-benchmark run + median:5%/mean:15% gate with an N=5 release-mode multi-run gate that compares median Δ across runs against baseline.json. Removes the false-regression risk from any single-run tail outlier, the same noise that motivated #69. Mean threshold dropped (noisy by design); only median Δ is gated.

- tests/benchmark/run_current.sh: HEAD-only N-run release-mode runner; writes ephemeral tests/benchmark/results/current/run<N>.json (gitignored, wiped each invocation).
- aggregate.py --compare-current: loads run*.json, computes per-benchmark median μs, compares against baseline.json's stats.median, prints PASS/FAIL table on both pass and fail.
- make benchmark-compare: rewired to invoke the runner + comparator. BENCHMARK_RUNS=N honored.
- 8 unit tests covering pass/fail/faster-than-baseline/missing-benchmark/N=1/missing-files cases.

Verified end-to-end: N=5 at this branch tip, all 26 benchmarks PASS within ±2.4% of baseline. Same code at N=1 had test_simple_policy_allow at +83.8% (single-run noise on a 144μs benchmark), illustrating exactly what the multi-run aggregation fixes.

Closes #69 goal 3 (goals 1, 2 landed in PR #72; historical record in PR #71/#73).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
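For readers who want the comparison step spelled out, here is a minimal sketch of what a median-of-N comparator could look like. It is not the code in aggregate.py: the function names (`load_medians_us`, `compare_current`) are hypothetical, the 5% threshold is passed in as a parameter, and the handling of benchmarks missing from one side (reported as MISSING, not gated) is an assumption. The only external fact relied on is the pytest-benchmark JSON layout, where each entry under `benchmarks` carries a `name` and a `stats.median` in seconds.

```python
import json
import statistics
from pathlib import Path


def load_medians_us(path: Path) -> dict[str, float]:
    """Map benchmark name -> median in microseconds from one pytest-benchmark JSON file."""
    data = json.loads(path.read_text())
    # pytest-benchmark stores stats in seconds; convert to microseconds.
    return {b["name"]: b["stats"]["median"] * 1e6 for b in data["benchmarks"]}


def compare_current(baseline_path: Path, current_dir: Path, threshold_pct: float = 5.0):
    """Return (rows, ok); each row is (name, baseline_us, current_us, delta_pct, status)."""
    baseline = load_medians_us(baseline_path)
    run_files = sorted(current_dir.glob("run*.json"))
    if not run_files:
        raise FileNotFoundError(f"no run*.json files in {current_dir}")

    # Gather each benchmark's per-run medians across all N runs.
    per_run: dict[str, list[float]] = {}
    for run_file in run_files:
        for name, median_us in load_medians_us(run_file).items():
            per_run.setdefault(name, []).append(median_us)

    rows, ok = [], True
    for name in sorted(baseline.keys() | per_run.keys()):
        if name not in baseline or name not in per_run:
            # Assumed policy for this sketch: report, but do not gate.
            rows.append((name, baseline.get(name), None, None, "MISSING"))
            continue
        current_us = statistics.median(per_run[name])  # median across runs
        delta_pct = (current_us - baseline[name]) / baseline[name] * 100.0
        # Only slowdowns beyond the threshold fail; faster-than-baseline passes.
        status = "FAIL" if delta_pct > threshold_pct else "PASS"
        ok = ok and status == "PASS"
        rows.append((name, baseline[name], current_us, delta_pct, status))
    return rows, ok
```

A caller would print `rows` as the PASS/FAIL table and exit non-zero when `ok` is false.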
Summary
Replaces `make benchmark-compare`'s single pytest-benchmark run + `--benchmark-compare-fail=median:5%,mean:15%` with an N=5 release-mode multi-run gate that compares median Δ across runs against `tests/benchmark/results/baseline.json`. Closes goal 3 of #69 (benchmark process improvements: release builds, refreshed baseline, multi-run signal). Goals 1 and 2 landed in #72 (feat(benchmark): release-mode targets and v4.8.0 median baseline); the historical record is in #71 (feat(benchmark): add benchmark-history capture and initial record) and #73 (chore(benchmark): record 6th historical state for PR #70).

- `tests/benchmark/run_current.sh` runs N release-mode benchmarks at HEAD into an ephemeral, gitignored `results/current/`.
- New `aggregate.py --compare-current` loads them, computes the per-benchmark median, gates on median Δ > 5%, and prints a PASS/FAIL table on every run.
- `BENCHMARK_RUNS=N` is honored end-to-end (default 5).

Verification
End-to-end smoke test on this branch:
At N=1, the `test_simple_policy_allow` Δ was +83.8% (single-run noise on a 144μs benchmark). At N=5 all 26 benchmarks pass within ±2.4% of baseline. The N=1 vs N=5 contrast is the punchline of goal 3.
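To make the N=1 vs N=5 contrast concrete, here is a small illustration of how a median across runs absorbs one tail outlier. The 144μs baseline and the +83.8% outlier come from the numbers above; the other four run values are hypothetical and chosen only for the example, as is the `delta_pct` helper.

```python
import statistics

BASELINE_US = 144.0    # size of the benchmark quoted above
THRESHOLD_PCT = 5.0    # median gate


def delta_pct(current_us: float, baseline_us: float = BASELINE_US) -> float:
    """Percent change of a measurement relative to the baseline."""
    return (current_us - baseline_us) / baseline_us * 100.0


# One outlier run at +83.8%, plus four hypothetical quiet runs (μs).
outlier_us = BASELINE_US * 1.838
runs_us = [outlier_us, 143.1, 145.2, 144.8, 146.0]

single_run_delta = delta_pct(runs_us[0])                    # what an unlucky N=1 gate sees
median_of_n_delta = delta_pct(statistics.median(runs_us))   # what the N=5 gate sees

for label, delta in (("N=1", single_run_delta), ("N=5", median_of_n_delta)):
    verdict = "FAIL" if delta > THRESHOLD_PCT else "PASS"
    print(f"{label}: {delta:+.1f}% -> {verdict}")
```

With these values the single run fails the 5% gate while the median of five stays under 1%, because the outlier lands at the end of the sorted list and never becomes the middle element.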
8 unit tests added in `tests/unit/test_benchmark_compare.py` covering pass / fail / faster-than-baseline / benchmark-missing-from-baseline / benchmark-missing-from-current / N=1 / missing run files / missing baseline.

Test plan
- `pytest tests/unit/test_benchmark_compare.py`: 8/8 pass
- `make benchmark-compare` (default N=5): PASS on all 26 benchmarks
- `BENCHMARK_RUNS=1 make benchmark-compare`: FAILs as expected (single-run noise on fast benchmarks)
- `tests/benchmark/results/current/` stays out of `git status` after a run (gitignored)
- Dropping the `mean` threshold is the desired behavior: the out-of-scope note in #69 (benchmark process improvements: release builds, refreshed baseline, multi-run signal) flagged it for removal once goal 3 lands

🤖 Generated with Claude Code