v3 Slice 3: Stats core - bootstrapped CIs, marginals, single-source metrics, export #16

## Parent

#13

## What to build

The statistics layer is decoupled from collection: it reads persisted results only and never calls a provider. It establishes the single source of truth for every reported number.

- Shared metric helpers (domain accuracy, composite) that every downstream consumer must use - no second computation path may exist.
- Bootstrapped 95% confidence intervals on the headline conjunctive accuracy.
- Per-part marginal accuracies and per-skill / per-bundle breakdowns (which causal sub-skills fail).
- Refusal and invalid reported as separate columns, never folded into accuracy.
- `stats` (console summary) and `export` (JSON) CLI commands.
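A minimal sketch of how the shared helpers and the bootstrapped interval could fit together. It assumes each scored item is reduced to a tuple of per-part booleans and uses a percentile bootstrap. The names `conjunctive_accuracy`, `marginal_accuracies`, and `bootstrap_ci` are hypothetical, as are the resample count and seed defaults, and how refusal/invalid rows are kept out of `rows` is left to the real implementation.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Sequence[bool]  # assumed shape: one bool per part of a scored item


def conjunctive_accuracy(rows: Sequence[Row]) -> float:
    """Fraction of items where every part is correct."""
    if not rows:
        raise ValueError("no scored rows")
    return sum(all(parts) for parts in rows) / len(rows)


def marginal_accuracies(rows: Sequence[Row]) -> List[float]:
    """Per-part accuracy, ignoring whether the other parts were correct."""
    if not rows:
        raise ValueError("no scored rows")
    n_parts = len(rows[0])
    return [sum(r[i] for r in rows) / len(rows) for i in range(n_parts)]


def bootstrap_ci(
    rows: Sequence[Row],
    metric: Callable[[Sequence[Row]], float] = conjunctive_accuracy,
    n_resamples: int = 2000,   # hypothetical default
    level: float = 0.95,
    seed: int = 0,             # fixed seed keeps the interval reproducible
) -> Tuple[float, float]:
    """Percentile bootstrap: resample items with replacement, take quantiles."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = sorted(
        metric([rows[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo = estimates[int(alpha * n_resamples)]
    hi = estimates[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lo, hi
```

Passing the shared helper in as `metric` means the interval is computed over the same function that produces the point estimate, which is one way to keep a single computation path.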

## Acceptance criteria

- [ ] Metrics run over synthetic persisted results with no provider calls
- [ ] CI brackets the point estimate; each per-part marginal is >= the conjunctive accuracy (sanity invariant)
- [ ] Refusal/invalid column counts reconcile exactly with raw `stop_reason` counts
- [ ] Only one accuracy computation path exists (consumers call the shared helper)
- [ ] `export` output is deterministic from the results files
- [ ] pytest green
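
The reconciliation and determinism criteria above could be checked roughly as follows. This is a sketch under stated assumptions: records are taken to be dicts with a `stop_reason` key, the values `"refusal"` and `"invalid"` are placeholders for whatever the persisted results actually use, and `outcome_columns`, `reconciles`, and `export_json` are hypothetical names.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List, Mapping

# Assumed stop_reason values; the real persisted results may use other labels.
NON_SCORED = ("refusal", "invalid")


def outcome_columns(records: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Tally records into separate scored / refusal / invalid columns."""
    columns = {"scored": 0, "refusal": 0, "invalid": 0}
    for record in records:
        reason = record["stop_reason"]
        columns[reason if reason in NON_SCORED else "scored"] += 1
    return columns


def reconciles(columns: Mapping[str, int], records: List[Mapping[str, str]]) -> bool:
    """Compare the columns against an independent raw stop_reason count."""
    raw = Counter(r["stop_reason"] for r in records)
    return (
        columns["refusal"] == raw.get("refusal", 0)
        and columns["invalid"] == raw.get("invalid", 0)
        and sum(columns.values()) == len(records)
    )


def export_json(summary: Mapping[str, object]) -> str:
    """Serialize with sorted keys so equal summaries give identical text."""
    return json.dumps(summary, sort_keys=True, indent=2) + "\n"
```

Sorting keys removes dict insertion order as a source of variation. Any timestamps or unseeded randomness in the summary would still break determinism and would need to be kept out of the export.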

## Blocked by

- #15

