The behavior table on /suite/[id]/compare shows a single rate delta per row, computed as last.rate - first.rate at viewer/src/lib/server/data.ts:895. The delta is rendered with a % glyph and arrow whenever |d| > 0.05 at viewer/src/routes/suite/[suite_id]/compare/+page.svelte:64-79.
When per-behavior sample counts differ between runs — which happens any time a customer changes test_set.sample_size, dimensions, target, the policy text, or any of the test_set prompt templates (cache key composition: p2m/core/artifact_cache.py:632-707) — the rate delta becomes statistically unreliable.
Worked example
3/6 (50%) vs 9/17 (53%) renders today as +3% ▲, but:
- Wilson 95% CIs: [19%, 81%] and [31%, 74%]; the second lies entirely inside the first
- Two-proportion z-test: z ≈ 0.124, p ≈ 0.90
The +2.94 pp delta is statistically indistinguishable from zero. Even a much larger gap (3/6 vs 13/17 = 76%, a +26 pp delta) yields z ≈ 1.21, p ≈ 0.23 with a pooled two-proportion z-test, which is still not significant at conventional thresholds.
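For reference, the figures in the worked example can be reproduced with a short sketch like the one below. The function names (`wilsonInterval`, `normalCdf`, `twoProportionZTest`) are hypothetical and do not exist in the viewer today. The z-test here assumes the pooled-variance form, and the normal CDF uses the Abramowitz-Stegun 7.1.26 approximation to avoid pulling in a stats dependency.

```typescript
// Illustrative helpers only; none of these names come from the viewer code.

interface Interval {
  lo: number;
  hi: number;
}

/** Wilson score interval for k hits out of n samples (z = 1.96 gives ~95%). */
function wilsonInterval(k: number, n: number, z: number = 1.96): Interval {
  if (n <= 0) return { lo: 0, hi: 1 }; // no data: the rate is unconstrained
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { lo: Math.max(0, center - half), hi: Math.min(1, center + half) };
}

/** Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation. */
function normalCdf(x: number): number {
  const t = 1 / (1 + (0.3275911 * Math.abs(x)) / Math.SQRT2);
  const poly =
    t * (0.254829592 +
    t * (-0.284496736 +
    t * (1.421413741 +
    t * (-1.453152027 +
    t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-(x * x) / 2); // erf(|x| / sqrt(2))
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

/** Pooled two-proportion z-test; returns z and the two-sided p-value. */
function twoProportionZTest(
  k1: number, n1: number, k2: number, n2: number
): { z: number; p: number } | null {
  if (n1 <= 0 || n2 <= 0) return null; // a rate is undefined for an empty run
  const pooled = (k1 + k2) / (n1 + n2);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  if (se === 0) return null; // all hits or all misses in both runs: z is undefined
  const z = (k2 / n2 - k1 / n1) / se;
  return { z, p: 2 * (1 - normalCdf(Math.abs(z))) };
}
```

Calling `twoProportionZTest(3, 6, 9, 17)` gives z ≈ 0.124 and p ≈ 0.90, matching the 3/6 vs 9/17 case above. With counts this small an exact test may be more appropriate; the z-test is used here only because it is the one the issue names.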
Two compounding UX bugs
- The unit shown is % but the value is percentage points (rate_b - rate_a), not a percent change. +3% reads as "3 percent of something" when it really means "3 percentage points".
- The widget hides n for both runs. A reviewer cannot tell that +3 pp came from 3/6 vs 9/17 (noise) rather than 300/600 vs 900/1700 (real signal).
Proposed fixes (any subset)
- Change deltaText at +page.svelte:65 to print +3 pp instead of +3%.
- When n_first and n_last differ by more than 20%, annotate the delta with (n: 6 → 17).
- Compute a two-proportion z-test (or Wilson interval overlap) and suppress the colored arrow when the delta is not statistically significant.
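A rough sketch of how the three fixes could combine in one formatting helper follows. `RunCell`, `DeltaDisplay`, and `formatDelta` are hypothetical names, and the real row shape in data.ts and the real deltaText in +page.svelte may look different. The `significant` flag is assumed to come from whichever test is chosen (z-test or Wilson overlap).

```typescript
// Hypothetical types; the actual fields in data.ts may be named differently.
interface RunCell {
  hits: number; // numerator of the rate
  n: number;    // number of samples behind the rate
}

interface DeltaDisplay {
  text: string;       // e.g. "+3 pp"
  annotation: string; // e.g. "(n: 6 → 17)", or "" when sample sizes are comparable
  showArrow: boolean; // colored arrow only for statistically significant deltas
}

function formatDelta(
  first: RunCell,
  last: RunCell,
  significant: boolean
): DeltaDisplay | null {
  // A rate is undefined for an empty run, so there is nothing to render.
  if (first.n === 0 || last.n === 0) return null;

  // Difference of two rates, expressed in percentage points.
  const deltaPp = (last.hits / last.n - first.hits / first.n) * 100;
  const rounded = Math.round(deltaPp);
  const sign = rounded > 0 ? "+" : "";

  // Relative gap in sample size, as in the acceptance criteria.
  const nGap = Math.abs(first.n - last.n) / Math.max(first.n, last.n);

  return {
    text: `${sign}${rounded} pp`,
    annotation: nGap > 0.2 ? `(n: ${first.n} → ${last.n})` : "",
    showArrow: significant,
  };
}
```

For 3/6 vs 9/17 with `significant = false`, this yields the text "+3 pp", the annotation "(n: 6 → 17)", and no arrow.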
Acceptance criteria
- Delta text reads pp, not %.
- Delta widget includes n annotation when |n_a - n_b| / max(n_a, n_b) > 0.2.
- Unit test in viewer/src/lib/server/data.test.ts (or a new test file) covers the 3/6 vs 9/17 case and the 0/0 vs k/n corner case.
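The two test cases named above might look roughly like this. `rateDeltaPp` is a hypothetical stand-in for the computation at data.ts:895, and returning `null` for an empty run is an assumption about the desired behavior rather than something the issue specifies. The checks are framework-free; in the viewer they would be ported to whatever test runner data.test.ts uses.

```typescript
// Hypothetical helper standing in for the delta computation in data.ts.
function rateDeltaPp(
  kFirst: number, nFirst: number, kLast: number, nLast: number
): number | null {
  // 0/0 evaluates to NaN in JavaScript; return null so the UI never renders "NaN pp".
  if (nFirst === 0 || nLast === 0) return null;
  return (kLast / nLast - kFirst / nFirst) * 100;
}

// Minimal assertion helper so the sketch does not depend on a test framework.
function check(cond: boolean, msg: string): void {
  if (!cond) throw new Error(`check failed: ${msg}`);
}

// 3/6 vs 9/17: a small delta backed by very few samples.
const small = rateDeltaPp(3, 6, 9, 17);
check(small !== null && Math.abs(small - 2.94) < 0.01, "3/6 vs 9/17 is about +2.94 pp");

// 0/0 vs k/n: an empty run on either side has no defined rate.
check(rateDeltaPp(0, 0, 9, 17) === null, "0/0 vs k/n yields no delta");
check(rateDeltaPp(9, 17, 0, 0) === null, "k/n vs 0/0 yields no delta");
```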
Context
Flagged during audit of PR #160 (the callable-target label-truncation fix). PR #160 itself does not touch the delta logic — this issue is a pre-existing UX gap on main. See related issue #163 for the per-cell X/N display.