feat(benchmark): median-of-N regression gate for make benchmark-compare (#69 goal 3) by skuenzli · Pull Request #84 · k9securityio/cedar-py

skuenzli · 2026-05-25T21:49:18Z

Summary

Replaces make benchmark-compare's single pytest-benchmark run + --benchmark-compare-fail=median:5%,mean:15% with an N=5 release-mode multi-run gate that compares median Δ across runs against tests/benchmark/results/baseline.json. Closes goal 3 of #69 ("benchmark process improvements: release builds, refreshed baseline, multi-run signal"). Goals 1 and 2 landed in #72; the historical record is in #71/#73.
New tests/benchmark/run_current.sh runs N release-mode benchmarks at HEAD into an ephemeral, gitignored tests/benchmark/results/current/. New aggregate.py --compare-current loads those runs, computes the per-benchmark median, fails any benchmark whose median Δ against the baseline exceeds 5%, and prints a PASS/FAIL table on every run.
Mean threshold dropped: the mean is noisy by design, per #69's out-of-scope note. Only the median is gated.
BENCHMARK_RUNS=N honored end-to-end (default 5).
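
The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code in aggregate.py: the function names (load_medians, compare_current, format_table) are hypothetical, the file layout assumed is pytest-benchmark's JSON output (a top-level "benchmarks" list whose entries carry "name" and "stats.median"), and skipping benchmarks that appear in no run is an assumption made only for this sketch.

```python
import json
import statistics
from pathlib import Path

THRESHOLD_PCT = 5.0  # gate from the PR description: fail when median Δ > 5%


def load_medians(path):
    """Return {benchmark name: median} from one pytest-benchmark JSON file.

    Assumes the standard pytest-benchmark layout; the real baseline.json
    may be shaped differently.
    """
    data = json.loads(Path(path).read_text())
    return {b["name"]: b["stats"]["median"] for b in data["benchmarks"]}


def compare_current(baseline, runs, threshold_pct=THRESHOLD_PCT):
    """Compare the median across runs with the baseline, per benchmark.

    baseline: {name: median}; runs: list of {name: median}, one per run.
    Returns (name, delta_pct, passed) rows sorted by name.
    """
    rows = []
    for name in sorted(baseline):
        samples = [run[name] for run in runs if name in run]
        if not samples:
            continue  # hypothetical choice: skip benchmarks absent from every run
        current = statistics.median(samples)
        delta_pct = (current - baseline[name]) / baseline[name] * 100.0
        rows.append((name, delta_pct, delta_pct <= threshold_pct))
    return rows


def format_table(rows):
    """Render the rows as a plain-text PASS/FAIL table."""
    lines = [f"{'benchmark':<32} {'median Δ':>10}  result"]
    for name, delta_pct, passed in rows:
        status = "PASS" if passed else "FAIL"
        lines.append(f"{name:<32} {delta_pct:>+9.1f}%  {status}")
    return "\n".join(lines)
```

A caller would load baseline.json once, load each run file with load_medians, and exit non-zero if any row has passed set to False. Only slowdowns trip the gate in this sketch; a benchmark faster than baseline has a negative Δ and passes.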

Verification

End-to-end smoke test on this branch:

N	`test_simple_policy_allow` Δ	Result
1	+83.8% (single-run noise on 144μs benchmark)	FAIL
5	+1.2%	PASS

At N=5 all 26 benchmarks pass within ±2.4% of baseline. The N=1 vs N=5 contrast is the punchline of goal 3.
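
Why the median across runs absorbs a single tail outlier can be shown with a few lines of arithmetic. The timings below are invented for illustration and are not measurements from this PR; only the 5% threshold comes from the description above.

```python
import statistics


def delta_pct(current, baseline):
    """Percentage change of current relative to baseline."""
    return (current - baseline) / baseline * 100.0


# Invented timings in microseconds; the third run is a tail outlier.
baseline_us = 100.0
runs_us = [101.0, 99.0, 180.0, 100.5, 102.0]

# Gating on the outlier run alone (the N=1 situation).
single_run_delta = delta_pct(runs_us[2], baseline_us)

# Gating on the median of all five runs (the N=5 situation).
median_delta = delta_pct(statistics.median(runs_us), baseline_us)

for label, delta in [("N=1 (outlier run)", single_run_delta),
                     ("N=5 (median)", median_delta)]:
    verdict = "PASS" if delta <= 5.0 else "FAIL"
    print(f"{label:<18} {delta:+.1f}%  {verdict}")
# prints:
# N=1 (outlier run)  +80.0%  FAIL
# N=5 (median)       +1.0%  PASS
```

The outlier moves to the end of the sorted samples and never becomes the middle value, so one noisy run out of five cannot trip the gate by itself.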

8 unit tests added in tests/unit/test_benchmark_compare.py covering pass / fail / faster-than-baseline / benchmark-missing-from-baseline / benchmark-missing-from-current / N=1 / missing run files / missing baseline.
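
The eight cases listed above could be exercised along these lines. This is a sketch against a hypothetical stand-in comparator named gate, not the tests in tests/unit/test_benchmark_compare.py. The status strings NOT_IN_BASELINE and NOT_IN_CURRENT and the choice to raise ValueError on missing inputs are assumptions; the PR does not say how the real comparator reports those cases.

```python
import statistics


def gate(baseline, runs, threshold_pct=5.0):
    """Hypothetical comparator: returns {benchmark name: status string}."""
    if not baseline:
        raise ValueError("baseline is missing or empty")
    if not runs:
        raise ValueError("no run files found")
    names = set(baseline) | {name for run in runs for name in run}
    statuses = {}
    for name in sorted(names):
        samples = [run[name] for run in runs if name in run]
        if name not in baseline:
            statuses[name] = "NOT_IN_BASELINE"  # assumed label
        elif not samples:
            statuses[name] = "NOT_IN_CURRENT"  # assumed label
        else:
            median = statistics.median(samples)
            delta_pct = (median - baseline[name]) / baseline[name] * 100.0
            statuses[name] = "PASS" if delta_pct <= threshold_pct else "FAIL"
    return statuses


BASELINE = {"bench_a": 100.0}  # invented value


def test_pass_within_threshold():
    assert gate(BASELINE, [{"bench_a": 103.0}] * 5) == {"bench_a": "PASS"}


def test_fail_over_threshold():
    assert gate(BASELINE, [{"bench_a": 110.0}] * 5) == {"bench_a": "FAIL"}


def test_faster_than_baseline_passes():
    assert gate(BASELINE, [{"bench_a": 80.0}] * 5) == {"bench_a": "PASS"}


def test_benchmark_missing_from_baseline():
    runs = [{"bench_a": 100.0, "bench_new": 50.0}]
    assert gate(BASELINE, runs)["bench_new"] == "NOT_IN_BASELINE"


def test_benchmark_missing_from_current():
    baseline = {"bench_a": 100.0, "bench_gone": 50.0}
    assert gate(baseline, [{"bench_a": 100.0}])["bench_gone"] == "NOT_IN_CURRENT"


def test_single_run():
    assert gate(BASELINE, [{"bench_a": 104.0}]) == {"bench_a": "PASS"}


def test_missing_run_files():
    try:
        gate(BASELINE, [])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError when no runs are given")


def test_missing_baseline():
    try:
        gate({}, [{"bench_a": 100.0}])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError when the baseline is empty")
```

The functions follow pytest's test_* naming so pytest would collect them, but they use only plain assertions and can also be called directly.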

Test plan

pytest tests/unit/test_benchmark_compare.py — 8/8 pass
make benchmark-compare (default N=5) — PASS on all 26 benchmarks
BENCHMARK_RUNS=1 make benchmark-compare — FAILs as expected (single-run noise on fast benchmarks)
Reviewer: confirm tests/benchmark/results/current/ stays out of git status after a run (gitignored)
Reviewer: confirm the dropped mean threshold is the desired behavior; #69's out-of-scope note flagged it for removal once goal 3 lands

🤖 Generated with Claude Code

Replaces the single pytest-benchmark run + median:5%/mean:15% gate with an N=5 release-mode multi-run gate that compares median Δ across runs against baseline.json. Removes the false-regression risk from any single-run tail outlier, the same noise that motivated #69. Mean threshold dropped (noisy by design); only median Δ is gated.

- tests/benchmark/run_current.sh: HEAD-only N-run release-mode runner; writes ephemeral tests/benchmark/results/current/run<N>.json (gitignored, wiped each invocation).
- aggregate.py --compare-current: loads run*.json, computes per-benchmark median μs, compares against baseline.json's stats.median, prints PASS/FAIL table on both pass and fail.
- make benchmark-compare: rewired to invoke the runner + comparator. BENCHMARK_RUNS=N honored.
- 8 unit tests covering pass/fail/faster-than-baseline/missing-benchmark/N=1/missing-files cases.

Verified end-to-end: N=5 at this branch tip, all 26 benchmarks PASS within ±2.4% of baseline. Same code at N=1 had test_simple_policy_allow at +83.8% (single-run noise on a 144μs benchmark), illustrating exactly what the multi-run aggregation fixes.

Closes #69 goal 3 (goals 1, 2 landed in PR #72; historical record in PR #71/#73).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

skuenzli and others added 4 commits May 25, 2026 14:15

docs: init median of N gate in release process task

e23cb2a

docs: clarify median of N gate in release process task

0199e3e

docs: plan median of N gate in release process task

bc63f01

skuenzli merged commit d4844ad into main May 25, 2026
7 checks passed

skuenzli self-assigned this May 25, 2026

skuenzli mentioned this pull request May 28, 2026

docs: README section for is_authorized_partial + CLAUDE.md follow-ups #86

Merged

3 tasks

skuenzli merged 4 commits into main from feat/issue-69-median-of-n-gate
