
v3 Slice 3: Stats core - bootstrapped CIs, marginals, single-source metrics, export #16

Description

@mark-allwyn

Parent

#13

What to build

The statistics layer is decoupled from collection: it reads persisted results only and never calls a provider. It establishes the single source of truth for every reported number.

  • Shared metric helpers (domain accuracy, composite) that every downstream consumer must use - no second computation path may exist.
  • Bootstrapped 95% confidence intervals on the headline conjunctive accuracy.
  • Per-part marginal accuracies and per-skill / per-bundle breakdowns (which causal sub-skills fail).
  • Refusal and invalid reported as separate columns, never folded into accuracy.
  • stats (console summary) and export (JSON) CLI commands.
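The metric helpers listed above could be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the record shape (a `status` field plus a `parts` mapping of part name to pass/fail), the function names, and the choice of a percentile bootstrap are all hypothetical. It also assumes that "never folded into accuracy" means refusals and invalid responses are excluded from the accuracy denominator and counted separately.

```python
import random
from collections import Counter

# Hypothetical record shape (not specified in the issue): one dict per persisted result.
RESULTS = [
    {"status": "ok", "parts": {"a": True, "b": True}},
    {"status": "ok", "parts": {"a": True, "b": False}},
    {"status": "ok", "parts": {"a": False, "b": True}},
    {"status": "ok", "parts": {"a": True, "b": True}},
    {"status": "refusal", "parts": {}},
    {"status": "invalid", "parts": {}},
]


def scored(results):
    """Rows that count toward accuracy (assumption: refusal/invalid are excluded)."""
    return [r for r in results if r["status"] == "ok"]


def conjunctive_accuracy(results):
    """Fraction of scored rows in which every part is correct."""
    rows = scored(results)
    if not rows:
        return float("nan")
    return sum(all(r["parts"].values()) for r in rows) / len(rows)


def marginal_accuracies(results):
    """Per-part accuracy over the same scored rows."""
    rows = scored(results)
    names = sorted({name for r in rows for name in r["parts"]})
    return {
        name: sum(r["parts"].get(name, False) for r in rows) / len(rows)
        for name in names
    }


def status_counts(results):
    """Refusal and invalid are reported as their own columns."""
    return Counter(r["status"] for r in results)


def bootstrap_ci(results, metric=conjunctive_accuracy,
                 n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval; a fixed seed keeps the output reproducible."""
    rows = scored(results)
    if not rows:
        return float("nan"), float("nan")
    rng = random.Random(seed)
    estimates = sorted(
        metric(rng.choices(rows, k=len(rows))) for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    low = estimates[int(alpha * (n_resamples - 1))]
    high = estimates[int((1 - alpha) * (n_resamples - 1))]
    return low, high
```

Because `bootstrap_ci` takes the metric as an argument and defaults to the shared helper, the resampling loop does not introduce a second accuracy computation path. Each marginal is at least the conjunctive accuracy by construction, since a row that passes every part also passes each part individually.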

Acceptance criteria

  • Metrics run over synthetic persisted results with no provider calls
  • CI brackets the point estimate; each per-part marginal is >= the conjunctive accuracy (sanity invariant)
  • Refusal/invalid column counts reconcile exactly with raw stop_reason counts
  • Only one accuracy computation path exists (consumers call the shared helper)
  • export output is deterministic from the results files
  • pytest green
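Two of the criteria above (column reconciliation and deterministic export) might be checked with pytest-style tests along these lines. This is a sketch only: the `stop_reason` values, the mapping from `stop_reason` to a reporting column, and the names `column_counts` and `export_counts` are assumptions made for illustration and do not come from the issue.

```python
import json
from collections import Counter

# Hypothetical mapping from a raw stop_reason to a reporting column.
COLUMN_FOR_STOP_REASON = {
    "end_turn": "answered",
    "refusal": "refusal",
    "max_tokens": "invalid",
}

# Synthetic persisted results; no provider is called anywhere in this sketch.
SYNTHETIC = [
    {"id": "q1", "stop_reason": "end_turn"},
    {"id": "q2", "stop_reason": "refusal"},
    {"id": "q3", "stop_reason": "max_tokens"},
    {"id": "q4", "stop_reason": "end_turn"},
]


def column_counts(results):
    """Count rows per reporting column."""
    return Counter(COLUMN_FOR_STOP_REASON[r["stop_reason"]] for r in results)


def export_counts(results):
    """Serialize the counts; sorted keys make the output independent of dict order."""
    counts = column_counts(results)
    payload = {
        "n_total": len(results),
        "n_answered": counts["answered"],
        "n_refusal": counts["refusal"],
        "n_invalid": counts["invalid"],
    }
    return json.dumps(payload, sort_keys=True, indent=2)


def test_columns_reconcile_with_stop_reasons():
    raw = Counter(r["stop_reason"] for r in SYNTHETIC)
    columns = column_counts(SYNTHETIC)
    assert columns["refusal"] == raw["refusal"]
    assert columns["invalid"] == raw["max_tokens"]
    assert sum(columns.values()) == len(SYNTHETIC)


def test_export_is_deterministic():
    # Same records in a different order must produce byte-identical output.
    assert export_counts(SYNTHETIC) == export_counts(list(reversed(SYNTHETIC)))
```

Serializing with `sort_keys=True` and avoiding timestamps or random identifiers in the payload is one simple way to keep the export reproducible from the results files alone.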

Blocked by


Labels: ready-for-agent (Ready for autonomous agent pickup)
