compare view: per-behavior rate delta is misleading when n differs across runs #162

@changliu2

The behavior table on /suite/[id]/compare shows a single rate delta per row, computed as last.rate - first.rate at viewer/src/lib/server/data.ts:895. The delta is rendered with a % glyph and arrow whenever |d| > 0.05 at viewer/src/routes/suite/[suite_id]/compare/+page.svelte:64-79.

When per-behavior sample counts differ between runs — which happens any time a customer changes test_set.sample_size, dimensions, target, the policy text, or any of the test_set prompt templates (cache key composition: p2m/core/artifact_cache.py:632-707) — the rate delta becomes statistically unreliable.

Worked example

3/6 (50%) vs 9/17 (53%) renders today as +3% ▲, but:

  • Wilson 95% CIs: [19%, 81%] and [31%, 74%]; the second lies entirely inside the first
  • Two-proportion z-test: z ≈ 0.124, p ≈ 0.90

The +2.94 pp delta is statistically indistinguishable from zero. Even a much larger gap (3/6 vs 13/17 = 76%, a +26 pp delta) yields z ≈ 1.2, p ≈ 0.23 with a pooled standard error, which is still not significant at conventional thresholds.
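To make the worked example reproducible, here is a minimal TypeScript sketch of both checks. The `Counts` shape and all function names are illustrative assumptions, not identifiers from `viewer/src/lib/server/data.ts`. The z-test uses a pooled standard error, and the normal CDF uses the Abramowitz-Stegun erf approximation, which should be accurate enough for a UI gate.

```typescript
// Sketch only: names are hypothetical, not taken from the viewer codebase.
interface Counts {
  hits: number; // samples that exhibited the behavior
  n: number;    // total samples for this behavior in the run
}

const Z_95 = 1.959964; // two-sided 95% standard normal quantile

// Wilson score interval for a binomial proportion, as fractions in [0, 1].
function wilsonInterval({ hits, n }: Counts, z: number = Z_95): [number, number] {
  if (n === 0) return [0, 1]; // no data: the rate is unconstrained
  const p = hits / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// erf via Abramowitz-Stegun 7.1.26 (absolute error around 1.5e-7).
function erf(y: number): number {
  const sign = y < 0 ? -1 : 1;
  const a = Math.abs(y);
  const t = 1 / (1 + 0.3275911 * a);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-a * a));
}

// Standard normal cumulative distribution function.
function normalCdf(x: number): number {
  return 0.5 * (1 + erf(x / Math.SQRT2));
}

// Two-sided two-proportion z-test with a pooled standard error.
// Returns null when the test is undefined (an empty run or zero variance).
function twoProportionZTest(a: Counts, b: Counts): { z: number; p: number } | null {
  if (a.n === 0 || b.n === 0) return null;
  const pooled = (a.hits + b.hits) / (a.n + b.n);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / a.n + 1 / b.n));
  if (se === 0) return null; // every sample hit, or every sample missed
  const z = (b.hits / b.n - a.hits / a.n) / se;
  return { z, p: 2 * (1 - normalCdf(Math.abs(z))) };
}
```

Calling `twoProportionZTest({ hits: 3, n: 6 }, { hits: 9, n: 17 })` should reproduce the z and p values quoted for the first comparison. Returning `null` rather than a number for undefined cases leaves it to the caller to decide how to render an empty run.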

Two compounding UX bugs

  1. The unit shown is % but the value is percentage points (rate_b - rate_a), not a percent change. +3% reads as "3 percent of something" when it really means "3 percentage points".
  2. The widget hides n for both runs. A reviewer cannot tell that +3 pp came from 3/6 vs 9/17 (noise) rather than 300/600 vs 900/1700 (a roughly tenfold tighter estimate).

Proposed fixes (any subset)

  1. Change deltaText at +page.svelte:65 to print +3 pp instead of +3%.
  2. When n_first and n_last differ by more than 20% of the larger count, annotate the delta with (n: 6 → 17).
  3. Compute a two-proportion z-test (or Wilson interval overlap) and suppress the colored arrow when the delta is not statistically significant.
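A rough sketch of how fixes 1 and 2 might combine in a single formatter, with fix 3 passed in as a precomputed flag. The `RunRate` and `DeltaLabel` types and the `deltaLabel` name are hypothetical, and the sketch assumes `rate` is stored as a fraction in [0, 1]. The actual `deltaText` in `+page.svelte` may be shaped differently.

```typescript
// Hypothetical input shape; the real row type in data.ts may differ.
interface RunRate {
  rate: number; // assumed to be a fraction in [0, 1]
  n: number;    // per-behavior sample count for the run
}

interface DeltaLabel {
  text: string;       // e.g. "+3 pp (n: 6 → 17)"
  showArrow: boolean; // colored arrow only when the delta is significant
}

// `significant` is expected to come from a z-test or interval-overlap check
// computed elsewhere; this function only decides how to render the result.
function deltaLabel(first: RunRate, last: RunRate, significant: boolean): DeltaLabel {
  if (first.n === 0 || last.n === 0) {
    return { text: "n/a", showArrow: false }; // no rate to compare against
  }
  const pp = Math.round((last.rate - first.rate) * 100);
  const sign = pp > 0 ? "+" : pp < 0 ? "-" : "";
  let text = `${sign}${Math.abs(pp)} pp`;

  // Annotate when the sample counts differ by more than 20% of the larger one.
  const nGap = Math.abs(first.n - last.n) / Math.max(first.n, last.n);
  if (nGap > 0.2) {
    text += ` (n: ${first.n} → ${last.n})`;
  }
  return { text, showArrow: significant && pp !== 0 };
}
```

For the worked example, `deltaLabel({ rate: 3 / 6, n: 6 }, { rate: 9 / 17, n: 17 }, false)` yields the text `+3 pp (n: 6 → 17)` with no arrow. Keeping the significance decision outside the formatter means the choice between a z-test and Wilson overlap can change without touching the rendering code.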

Acceptance criteria

  • Delta text reads pp, not %.
  • Delta widget includes n annotation when |n_a - n_b| / max(n_a, n_b) > 0.2.
  • Unit test in viewer/src/lib/server/data.test.ts (or a new test file) covers the 3/6 vs 9/17 case and the 0/0 vs k/n corner case.
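The two cases named in the last criterion could be exercised along these lines. This is a framework-agnostic sketch: `rateDeltaPp`, `Tally`, and `check` are hypothetical stand-ins, and returning `null` for an empty run is an assumed behavior rather than what `data.ts` does today. A real test would call the actual delta function through the viewer's own test runner.

```typescript
interface Tally {
  hits: number; // samples that exhibited the behavior
  n: number;    // total samples in the run
}

// Hypothetical helper: delta in percentage points, or null when either
// run has no samples and the rate is therefore undefined.
function rateDeltaPp(first: Tally, last: Tally): number | null {
  if (first.n === 0 || last.n === 0) return null;
  return (last.hits / last.n - first.hits / first.n) * 100;
}

// Minimal assertion helper so the sketch needs no test framework.
function check(cond: boolean, msg: string): void {
  if (!cond) throw new Error(`failed: ${msg}`);
}

// Case 1: 3/6 vs 9/17 should give about +2.94 pp.
const small = rateDeltaPp({ hits: 3, n: 6 }, { hits: 9, n: 17 });
check(small !== null && Math.abs(small - 2.94) < 0.01, "3/6 vs 9/17 is about +2.94 pp");

// Case 2: 0/0 vs k/n must not produce NaN or a misleading number.
check(rateDeltaPp({ hits: 0, n: 0 }, { hits: 4, n: 10 }) === null, "0/0 vs k/n has no delta");
```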

Context

Flagged during audit of PR #160 (the callable-target label-truncation fix). PR #160 itself does not touch the delta logic — this issue is a pre-existing UX gap on main. See related issue #163 for the per-cell X/N display.

Labels: design, enhancement, follow-up
