
feat(statistics): consume projection statistics in the bundle pipeline#295

Draft
jcoludar wants to merge 7 commits into main from feat/projection-statistics

@jcoludar

Collaborator

Summary

The protspace_web half of the projection-statistics MVP (issue #219). The engine
(protspace) computes per-projection cluster-validity + faithfulness statistics and bakes them into
the .parquetbundle as an optional fifth part; this PR makes the web side produce and tolerate
that part. Rendering of the statistics is a deliberate follow-up — out of scope here.

Engine PR (lands first): tsenoner/protspace#61. This PR depends on it and on a
stats-bearing protspace release. It is safe to merge ahead of that release: the prep step
feature-probes the protspace stats subcommand and no-ops if absent — statistics simply don't
appear yet, and nothing else changes.

What's in this PR

  • Bundle reader (@protspace/core data-loader + @protspace/utils): accept 3–5 parts.
    The reader now branches on an empty settings slot (statistics-without-settings ⇒ a zero-byte
    4th part) rather than the raw delimiter count, so the optional fifth statistics.parquet is read
    without error. The statistics part is parsed-but-ignored for now; createParquetBundle
    re-export still drops it (documented).
  • Prep service (services/protspace-prep): a best-effort stats step that runs after the
    core bundle is produced and outside the pipeline timeout budget, so it can never cost the job
    or lose the bundle. It has:
    • its own nested timeout (stats_timeout_seconds), caught locally so it never reaches the
      parent handler;
    • a bounded, non-latching version probe for protspace stats (single-flight lock + hard
      timeout + kill of a hung subprocess; transient spawn errors are not latched, so a later job
      retries);
    • an atomic re-bundle: it writes a sibling temp .parquetbundle and swaps it in with os.replace,
      so a stats timeout/kill mid-write can't corrupt the already-shipped bundle.
  • Frontend: a new computing_statistics SSE stage wired through FastaPrepStage and the explore
    runtime (progress creep stops, the bar holds at 95%, and the label reads "Computing statistics…").
  • Tests: the stale "five-part bundles are rejected" assertion is inverted to accept them;
    added 5-part round-trip + zero-byte settings-slot coverage; services/protspace-prep/tests
    exercise the success / timeout / probe-absent paths.
  • OpenSpec: openspec/changes/add-projection-statistics/ (proposal, design, spec, tasks).
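
As a rough illustration of the "nested timeout, caught locally" idea in the prep-service bullet above, here is a minimal sketch. It assumes an asyncio-based pipeline, which this PR description does not confirm; `run_best_effort` and its parameters are hypothetical names, not the service's actual API.

```python
import asyncio
from typing import Awaitable, Callable


async def run_best_effort(
    step: Callable[[], Awaitable[object]],
    timeout_seconds: float,
) -> bool:
    """Run an optional step with its own timeout; never raise to the caller.

    Returns True if the step finished in time, False if it timed out or failed.
    The caller keeps the already-produced core bundle either way.
    """
    try:
        # The step gets its own budget, separate from any outer pipeline timeout.
        await asyncio.wait_for(step(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Caught here so the timeout never reaches the parent handler.
        return False
    except Exception:
        # Any other failure of the optional step is swallowed as well.
        return False
    return True
```

A caller would await this after the core bundle has been written and simply skip the re-bundle when it returns False.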

Verification

  • @protspace/core bundle suite: 24 passed (incl. the inverted 5-part + zero-byte-slot tests).
  • protspace-prep pipeline suite: 19 passed.
  • Pre-commit (lint-staged && quality && docs:annotations:check && docs:build): green.

Data-format change

Additive, backward compatible — existing 3- and 4-part bundles read and write unchanged.
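
To make the additive layout concrete, the sketch below splits a bundle into three to five parts and treats a zero-byte fourth part as "no settings". It is written in Python purely for illustration (the real reader is the TypeScript data-loader), and the delimiter value, `BundleParts`, and `split_bundle` are assumptions rather than the project's actual names.

```python
from typing import NamedTuple, Optional

# Hypothetical delimiter for illustration; the real bundle format defines its own.
DELIMITER = b"--PART--"


class BundleParts(NamedTuple):
    core: list[bytes]             # the three mandatory parts
    settings: Optional[bytes]     # optional 4th part, None when absent or empty
    statistics: Optional[bytes]   # optional 5th part


def split_bundle(data: bytes) -> BundleParts:
    parts = data.split(DELIMITER)
    if not 3 <= len(parts) <= 5:
        raise ValueError(f"expected 3-5 bundle parts, got {len(parts)}")
    core = parts[:3]
    # Branch on whether the settings slot is empty, not on the raw part count:
    # a 5-part bundle without settings carries a zero-byte 4th part.
    settings_slot = parts[3] if len(parts) >= 4 else b""
    settings = settings_slot or None
    statistics = parts[4] if len(parts) == 5 else None
    return BundleParts(core, settings, statistics)
```

With this shape, 3- and 4-part inputs produce the same `core` and `settings` values they would have without the fifth slot, which is what keeps the change backward compatible.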

Refs #219

The protspace_web half of the projection-statistics MVP (engine PR:
tsenoner/protspace#61; tracking issue #219). The prep service folds
engine-computed stats into the bundle, the reader accepts the optional fifth
part, and the UI surfaces the new stage.

- data-loader/bundle.ts (+ @protspace/utils bundle-writer): accept 3-5 parts;
  branch on an empty settings slot (statistics-without-settings) rather than the
  raw delimiter count. The statistics part is parsed-but-ignored for now —
  rendering is a committed follow-up, out of scope here.
- services/protspace-prep: a best-effort `stats` step AFTER the core bundle and
  OUTSIDE the pipeline timeout, with its own nested timeout, a bounded
  non-latching `protspace stats` version probe (lock + timeout + kill, no
  false-latch on transient errors), and an atomic temp-bundle + os.replace
  re-bundle so a stats timeout/kill can never corrupt or lose the shipped bundle.
- app: a `computing_statistics` SSE stage (progress 95%, "Computing statistics…").
- openspec/changes/add-projection-statistics: proposal, design, spec, tasks.
- Invert the stale "five-part bundles are rejected" test; add round-trip and
  zero-byte settings-slot coverage.
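
The temp-bundle + os.replace pattern mentioned above can be sketched as follows. This is a generic illustration, not the service's code: `atomic_write_bytes` is a hypothetical helper, and the real step produces the temp bundle by invoking the engine rather than writing bytes directly.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Replace `target` with `payload` without ever exposing a partial file."""
    # The temp file is a sibling of the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(target.parent), prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        # Readers see either the old file or the new one, never a mix.
        os.replace(tmp_name, target)
    except BaseException:
        # On failure (including interruption) drop the temp file; the target is untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

If the writer is killed before `os.replace` runs, the original bundle is still intact and only a stray temp file may remain.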

Depends on the engine PR (tsenoner/protspace#61) and a stats-bearing protspace
release; the prep step feature-probes the subcommand and no-ops if absent, so
this is safe to merge ahead of the release (stats simply don't appear yet).
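
A bounded, non-latching feature probe along these lines might look like the sketch below. The class name, the choice to latch any completed run (zero or non-zero exit), and the probe command itself are assumptions; the exact invocation used to detect `protspace stats` is not given here, so the command is passed in by the caller.

```python
import subprocess
import sys
import threading
from typing import Optional, Sequence


class FeatureProbe:
    """Check once whether a subcommand is usable, without hanging or false-latching."""

    def __init__(self, command: Sequence[str], timeout_seconds: float = 5.0) -> None:
        self._command = list(command)
        self._timeout = timeout_seconds
        self._lock = threading.Lock()         # single-flight: one probe at a time
        self._result: Optional[bool] = None   # None means "not yet known"

    def available(self) -> bool:
        with self._lock:
            if self._result is not None:
                return self._result
            try:
                # subprocess.run kills the child and reaps it if the timeout expires.
                completed = subprocess.run(
                    self._command,
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                    timeout=self._timeout,
                )
            except (OSError, subprocess.TimeoutExpired):
                # Spawn error or hang: report "absent" for now but do not cache it,
                # so a later job probes again.
                return False
            # Only a completed run is cached.
            self._result = completed.returncode == 0
            return self._result


if __name__ == "__main__":
    probe = FeatureProbe([sys.executable, "-c", "pass"])
    print(probe.available())
```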

Refs #219

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CI's `quality:ci` runs `format:check` before lint/quality and prettier
flagged this openspec doc (markdown list-continuation indentation).
Pure formatting — no content change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tsenoner marked this pull request as draft June 24, 2026 13:51
jcoludar and others added 5 commits June 24, 2026 18:25
…ction metadata

Phase 1B of route-projection-statistics. The engine now folds per-projection
faithfulness (kNN-overlap / trustworthiness / continuity) into each projection's
info_json under a `quality` object. The projection-metadata panel flattened
info_json only one level, so a nested `quality` rendered as a raw JSON.stringify
blob.

- Extract the metadata-row building into a pure, tested helper
  (projection-metadata-helpers.ts), matching the package's *-helpers pattern.
- Expand info_json.quality into discrete per-metric rows: each shows its value
  plus compact provenance (distance metric, k); a skipped metric (value null)
  renders as N/A with its marker; a flat scalar shape is tolerated.
- projection-metadata.ts now delegates to the helper (no behavior change beyond
  the quality expansion).

Tests: projection-metadata-helpers.test.ts (flatten, quality expansion, skip,
flat-scalar). Full core suite green (1073).
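
The quality expansion can be pictured with the sketch below. It is a Python illustration only (the actual helper is projection-metadata-helpers.ts), and the field names `value`, `metric`, `k`, and `skipped` are hypothetical stand-ins for whatever the engine writes under `info_json.quality`.

```python
from typing import Any


def format_value(value: Any) -> str:
    """Render a metric value; a null value becomes N/A."""
    if value is None:
        return "N/A"
    if isinstance(value, (int, float)):
        return f"{value:.3f}"
    return str(value)


def quality_rows(quality: dict[str, Any]) -> list[tuple[str, str]]:
    """Expand a nested quality object into one (label, text) row per metric."""
    rows: list[tuple[str, str]] = []
    for name, entry in quality.items():
        if not isinstance(entry, dict):
            # Flat scalar shape: tolerate `{"trustworthiness": 0.93}`.
            rows.append((name, format_value(entry)))
            continue
        value = entry.get("value")
        text = format_value(value)
        if value is None and entry.get("skipped"):
            # Skipped metric: show N/A plus its marker.
            text += f" ({entry['skipped']})"
        else:
            # Compact provenance: distance metric and k, when present.
            provenance = ", ".join(
                f"{key}={entry[key]}" for key in ("metric", "k") if key in entry
            )
            if provenance:
                text += f" ({provenance})"
        rows.append((name, text))
    return rows
```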

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gn/spec/tasks)

The design for routing each statistic to the bundle part whose existing frontend
consumer matches its granularity (tsenoner's PR #61 review): faithfulness →
projections_metadata.info_json.quality, per-protein cluster/silhouette →
protein_annotations, aggregate validity → statistics.parquet. Includes the
4-lens fan-out review outcomes and the phased plan (Phase 1 low-risk routing;
Phase 2 per-protein annotations behind a flag).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bundle

Phase 2A of route-projection-statistics. `protspace stats` now enriches the
annotations parquet in place with per-protein cluster-membership + silhouette
columns when given -a; the prep re-bundle step passes the annotations path so the
following `bundle -a` carries those columns (and faithfulness rides in
projections_metadata, both from the same stats call). Still best-effort: any stats
failure leaves the core bundle untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gs part

Phase 2A.4 of route-projection-statistics. The stats step now writes the
auto-generated cluster-membership legend styles (`stats --settings-out
cluster_styles.json`) and the re-bundle folds them in (`bundle --settings ...`)
so clusters are colored when selected. The --settings flag is only added when the
styles file exists, so an older engine without --settings-out degrades gracefully
(columns still ship, just without pre-baked colors). Still best-effort.
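
The graceful-degradation rule above amounts to a small guard when building the re-bundle command. In this sketch only the `--settings` flag comes from the commit message; `with_optional_settings` and the `base_args` parameter are hypothetical.

```python
from pathlib import Path
from typing import Sequence


def with_optional_settings(base_args: Sequence[str], styles_path: Path) -> list[str]:
    """Append `--settings <path>` only when the styles file was actually written."""
    args = list(base_args)
    if styles_path.is_file():
        # A newer engine produced the styles file, so fold it into the bundle.
        args += ["--settings", str(styles_path)]
    # Otherwise return the arguments unchanged: columns ship without pre-baked colors.
    return args
```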

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… section

Phase 2B of route-projection-statistics. The color-by dropdown already
auto-discovers the computed `cluster_<proj>` / `silhouette_<proj>` annotation
columns (no allowlist hides them, and content-based inference types membership as
categorical and silhouette as continuous). This adds a dedicated "Statistics"
section to `groupAnnotations` so the ~12 computed columns (one pair per projection)
don't flood the catch-all "Other" group.

Test: cluster_/silhouette_ columns land in Statistics (sorted), non-computed labels
stay in Other.
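
The grouping rule can be sketched as below. This is a simplified Python illustration with only two groups; the real `groupAnnotations` is TypeScript and has other sections, and `group_computed_columns` is a hypothetical name.

```python
COMPUTED_PREFIXES = ("cluster_", "silhouette_")


def group_computed_columns(columns: list[str]) -> dict[str, list[str]]:
    """Route computed statistic columns to "Statistics"; everything else to "Other"."""
    groups: dict[str, list[str]] = {"Statistics": [], "Other": []}
    for name in columns:
        # str.startswith accepts a tuple of prefixes.
        key = "Statistics" if name.startswith(COMPUTED_PREFIXES) else "Other"
        groups[key].append(name)
    # The commit's test expects the Statistics section to be sorted.
    groups["Statistics"].sort()
    return groups
```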

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>