compare view: per-behavior rate delta is misleading when n differs across runs #162

The behavior table on `/suite/[id]/compare` shows a single rate delta per row, computed as `last.rate - first.rate` at `viewer/src/lib/server/data.ts:895`. The delta is rendered with a `%` glyph and arrow whenever `|d| > 0.05` at `viewer/src/routes/suite/[suite_id]/compare/+page.svelte:64-79`.

When per-behavior sample counts differ between runs — which happens any time a customer changes `test_set.sample_size`, `dimensions`, `target`, the policy text, or any of the test_set prompt templates (cache key composition: `p2m/core/artifact_cache.py:632-707`) — the rate delta becomes statistically unreliable.

## Worked example

`3/6 (50%)` vs `9/17 (53%)` renders today as `+3% ▲`, but:
- Wilson 95% CIs: `[19%, 81%]` and `[31%, 74%]`; the second lies entirely inside the first
- Two-proportion z-test: `z ≈ 0.124`, `p ≈ 0.90`

The `+2.94 pp` delta is statistically indistinguishable from zero. Even a much larger gap (`3/6` vs `13/17 = 76%`, a +26 pp delta) yields `z ≈ 1.2`, `p ≈ 0.23` with the pooled two-proportion test, which is still not significant at conventional thresholds.
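
For anyone who wants to reproduce the statistics above, here is a minimal sketch of the Wilson score interval and a pooled two-proportion z-test. The helper names (`wilsonInterval`, `normalCdf`, `twoProportionZ`) are hypothetical and do not exist in `data.ts`; the normal CDF uses the Abramowitz-Stegun 7.1.26 approximation to `erf`, which is an assumption made to avoid a stats dependency.

```typescript
// Sketch only: helper names are hypothetical, not taken from data.ts.

/** Wilson score interval for k successes out of n trials (z = 1.96 for 95%). */
function wilsonInterval(k: number, n: number, z = 1.96): [number, number] {
  if (n <= 0) return [0, 1]; // no data: the interval is uninformative
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

/** Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation. */
function normalCdf(x: number): number {
  const t = 1 / (1 + (0.3275911 * Math.abs(x)) / Math.SQRT2);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-(x * x) / 2); // erf(|x| / sqrt(2))
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

/** Pooled two-proportion z-test; returns z and the two-sided p-value. */
function twoProportionZ(k1: number, n1: number, k2: number, n2: number): { z: number; p: number } {
  if (n1 <= 0 || n2 <= 0) return { z: 0, p: 1 }; // an empty run carries no evidence
  const pooled = (k1 + k2) / (n1 + n2);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  if (se === 0) return { z: 0, p: 1 }; // both runs are all-pass or all-fail
  const z = (k2 / n2 - k1 / n1) / se;
  return { z, p: 2 * (1 - normalCdf(Math.abs(z))) };
}
```

Calling `twoProportionZ(3, 6, 9, 17)` gives `z` of roughly 0.124 and `p` of roughly 0.90, matching the figures for the `3/6` vs `9/17` case.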

## Two compounding UX bugs

1. The unit shown is `%` but the value is **percentage points** (`rate_b - rate_a`), not a percent change. `+3%` reads as "3 percent of something" when it really means "3 percentage points".
2. The widget hides `n` for both runs. A reviewer cannot tell that `+3 pp` came from `3/6` vs `9/17` (noise) rather than `3000/6000` vs `9000/17000` (real signal, `z ≈ 3.9`).

## Proposed fixes (any subset)

1. Change `deltaText` at `+page.svelte:65` to print `+3 pp` instead of `+3%`.
2. When `n_first` and `n_last` differ by more than 20% of the larger count, annotate the delta with `(n: 6 → 17)`.
3. Compute a two-proportion z-test (or Wilson interval overlap) and suppress the colored arrow when the delta is not statistically significant.
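
The three fixes could be combined roughly as follows. This is a sketch under assumptions: `RunCell`, `DeltaView`, and `formatDelta` are hypothetical names rather than the ones in `+page.svelte`, the significance decision is passed in as a boolean so the choice of test stays open, and the existing `|d| > 0.05` arrow threshold is left out for brevity.

```typescript
// Sketch only: RunCell, DeltaView and formatDelta are hypothetical names.
interface RunCell {
  k: number; // samples exhibiting the behavior
  n: number; // total samples for this behavior in the run
}

interface DeltaView {
  text: string; // e.g. "+3 pp (n: 6 → 17)"
  showArrow: boolean; // false when the delta should not be colored
}

function formatDelta(first: RunCell, last: RunCell, significant: boolean, nGap = 0.2): DeltaView {
  if (first.n === 0 || last.n === 0) {
    // Assumption: an empty run has no defined rate, so no delta is shown.
    return { text: "n/a", showArrow: false };
  }
  // Fix 1: report the difference in percentage points, not percent.
  const pp = Math.round((last.k / last.n - first.k / first.n) * 100);
  let text = `${pp > 0 ? "+" : ""}${pp} pp`;

  // Fix 2: annotate when the sample counts differ by more than nGap of the larger one.
  const relGap = Math.abs(first.n - last.n) / Math.max(first.n, last.n);
  if (relGap > nGap) text += ` (n: ${first.n} → ${last.n})`;

  // Fix 3: only color the arrow when the caller's test says the delta is significant.
  return { text, showArrow: significant };
}

// formatDelta({ k: 3, n: 6 }, { k: 9, n: 17 }, false).text === "+3 pp (n: 6 → 17)"
```

Passing `significant` in, rather than computing it inside the formatter, keeps the rendering code independent of whether a z-test or Wilson interval overlap ends up being chosen.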

## Acceptance criteria

- Delta text reads `pp`, not `%`.
- Delta widget includes `n` annotation when `|n_a - n_b| / max(n_a, n_b) > 0.2`.
- Unit test in `viewer/src/lib/server/data.test.ts` (or a new test file) covers the `3/6 vs 9/17` case and the `0/0 vs k/n` corner case.
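
The two cases named in the last criterion can be pinned down with plain assertions before they are ported to the project's test runner. The sketch below uses a hypothetical `rateDelta` helper as a stand-in for the logic in `data.ts`, and it assumes the desired behavior for `0/0` is "no delta" (`null`) rather than `NaN`; that choice is not specified in this issue.

```typescript
// Sketch only: rateDelta is a hypothetical stand-in for the delta logic in data.ts.
interface Counts {
  k: number; // samples exhibiting the behavior
  n: number; // total samples
}

/** Delta in percentage points, or null when either run has no samples. */
function rateDelta(first: Counts, last: Counts): number | null {
  if (first.n === 0 || last.n === 0) return null; // assumption: avoid NaN from 0/0
  return (last.k / last.n - first.k / first.n) * 100;
}

function check(cond: boolean, msg: string): void {
  if (!cond) throw new Error(`assertion failed: ${msg}`);
}

// Case 1: 3/6 vs 9/17 is about +2.94 pp.
const small = rateDelta({ k: 3, n: 6 }, { k: 9, n: 17 });
check(small !== null && Math.abs(small - 2.94) < 0.01, "3/6 vs 9/17 is about +2.94 pp");

// Case 2: 0/0 vs k/n (in either order) yields no delta instead of NaN.
check(rateDelta({ k: 0, n: 0 }, { k: 9, n: 17 }) === null, "0/0 vs k/n has no delta");
check(rateDelta({ k: 9, n: 17 }, { k: 0, n: 0 }) === null, "k/n vs 0/0 has no delta");
```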

## Context

Flagged during audit of PR #160 (the callable-target label-truncation fix). PR #160 itself does not touch the delta logic — this issue is a pre-existing UX gap on `main`. See related issue #163 for the per-cell `X/N` display.
