feat(stats): projection statistics (cluster-validity + faithfulness)#61

Open
jcoludar wants to merge 6 commits into main from feat/projection-statistics


@jcoludar

Collaborator

Summary

Adds a protspace.stats subsystem that computes per-projection quality statistics at
preparation time and bakes them into the .parquetbundle as an optional fifth part. This is the
engine half of the projection-statistics MVP (tracking issue: tsenoner/protspace_web#219;
related: #31). The protspace_web PR that consumes it follows separately — this PR lands first.

Today the pipeline (embed → project → annotate → bundle) ships coordinates with zero quality
metrics, so judging a projection is purely visual. This change answers two questions per projection:

  • Cluster structure (cluster_validity): KMeans with a distance-to-chord elbow estimate of
    K, scored by silhouette, Davies–Bouldin, Calinski–Harabasz on the projection coords.
  • Projection faithfulness (faithfulness): kNN-overlap and trustworthiness / continuity
    between the source embedding and the projection — i.e. how much the reduction distorted the
    neighbourhood structure.
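
To make the faithfulness idea concrete, here is a minimal pure-Python sketch of a kNN-overlap score. It is an illustration under assumptions, not the code in stats/metrics/faithfulness.py: the function names `knn_indices` and `knn_overlap` are hypothetical, Euclidean distance is assumed, and the brute-force neighbour search is only suitable for toy inputs.

```python
import math


def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest other points."""
    neighbours = []
    for i, p in enumerate(points):
        # Exclude the point itself by index, then rank the rest by distance.
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbours.append(set(others[:k]))
    return neighbours


def knn_overlap(embedding, projection, k):
    """Mean fraction of each point's k nearest neighbours shared by both spaces."""
    high = knn_indices(embedding, k)
    low = knn_indices(projection, k)
    return sum(len(h & l) / k for h, l in zip(high, low)) / len(embedding)


# Two tight groups of three points each (toy data).
embedding = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
# A deliberately bad "projection" that interleaves the two groups.
scrambled = [(0.0, 0.0), (10.0, 0.0), (1.0, 0.0), (11.0, 0.0), (2.0, 0.0), (12.0, 0.0)]

identity_score = knn_overlap(embedding, embedding, k=2)   # neighbourhoods fully preserved
scrambled_score = knn_overlap(embedding, scrambled, k=2)  # neighbourhoods mostly destroyed
```

An identity projection keeps every neighbourhood intact and scores 1.0; the interleaved layout scores well below that, which is the kind of distortion the faithfulness family is meant to expose.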

What's in this PR

  • src/protspace/stats/ — a generalized Statistic contract (each statistic declares whether it
    needs the embedding and returns one or more StatRows) + a lazy STATISTICS registry mirroring
    the existing REDUCERS pattern. sklearn imports stay function-local to preserve CLI startup.
    • stats/cluster/kmeans_elbow.py — KMeans sweep K ∈ [2, min(round(√n), 50)], elbow via
      perpendicular deviation from the first→last inertia chord (argmax index → K).
    • stats/metrics/validity.py — silhouette (seeded sample above threshold) / DB / CH at that K.
    • stats/metrics/faithfulness.py — kNN-overlap@k and trustworthiness/continuity vs the embedding.
    • stats/driver.py: compute_statistics(...) iterating registered statistics per projection,
      isolating per-statistic and per-projection failures (a bad reduction is logged and skipped,
      never sinks the report).
  • Tidy long-format output — an 8-column table (space_kind, space_name, stat_family, label_kind, metric, metric_kind, value, extra_json). New statistics add rows, not columns.
  • Bundle I/O (data/io/bundle.py) — an optional fifth part statistics.parquet. Layout
    core(3) + settings? + statistics?; when statistics is present without settings, a zero-byte
    settings slot keeps the fifth position unambiguous. write_bundle / read_bundle /
    extract_bundle_to_dir and replace_settings_in_bundle are all updated — the last so
    protspace style no longer silently drops a trailing stats part.
  • Wiring: ReductionPipeline.run (the one stage holding embeddings and projections)
    computes stats behind prepare --stats/--no-stats; new protspace stats -i emb.h5 -p project_dir -o statistics.parquet for the discrete path; bundle -s/--statistics folds a stats parquet in.
  • No new dependency — scikit-learn (KMeans, silhouette/DB/CH, manifold.trustworthiness) is
    already a core dep.
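
The K sweep and the distance-to-chord elbow described above can be sketched as follows. This is a hedged illustration rather than the contents of stats/cluster/kmeans_elbow.py: `candidate_ks` and `elbow_by_chord` are hypothetical names, the inertia values are made up, and rescaling both axes to [0, 1] before measuring the perpendicular distance is an assumption the PR text does not confirm.

```python
import math


def candidate_ks(n_points, k_cap=50):
    """K range from the PR description: 2 .. min(round(sqrt(n)), 50)."""
    upper = min(round(math.sqrt(n_points)), k_cap)
    return list(range(2, upper + 1))


def elbow_by_chord(ks, inertias):
    """Pick the K whose (K, inertia) point lies farthest from the first-to-last chord."""
    # Assumption: normalise both axes so neither K nor inertia dominates the distance.
    k_span = (ks[-1] - ks[0]) or 1
    i_span = (inertias[0] - inertias[-1]) or 1
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(v - inertias[-1]) / i_span for v in inertias]
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    norm = math.hypot(dx, dy) or 1.0
    # Perpendicular distance of each point from the line through the first and last points.
    distances = [abs(dx * (y - ys[0]) - dy * (x - xs[0])) / norm for x, y in zip(xs, ys)]
    return ks[distances.index(max(distances))]


ks = candidate_ks(64)                            # sqrt(64) = 8, so K runs from 2 to 8
inertias = [1000, 400, 120, 100, 90, 85, 80]     # made-up inertia curve, one value per K
best_k = elbow_by_chord(ks, inertias)
```

In the real pipeline the inertias would come from fitting KMeans once per candidate K; the elbow step itself only needs the resulting curve.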

Robustness hardening (post adversarial review)

This branch was put through a multi-agent review; the confirmed findings are fixed here:

  • Faithfulness no longer silently drops trustworthiness/continuity for small n. trustworthiness
    requires n_neighbors < n/2 (strict); k is now clamped to (n-1)//2 instead of n-2.
  • cluster_validity scores the full projection; faithfulness scores the embedding-aligned
    subset. Previously clustering could be scored on the id-intersection subset only.
  • High-dim metric is threaded through (default_metric) so faithfulness uses the run's metric
    rather than defaulting to euclidean for PCA/MDS/PaCMAP.
  • Multi-embedding attribution — exact id-set cover wins immediately (no ambiguous tie-break); a
    source column maps each projection to its own embedding.
  • Precomputed (n×n) similarity/distance matrices are skipped for faithfulness (not embeddings).
  • kNN self-exclusion is tie-safe on coincident points (filter self, not column-0 slice).
  • n_clusters reports the achieved distinct-cluster count (KMeans can collapse on coincident
    points), keeping the requested K in extra.
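
Two of the fixes above are easy to show on toy data: the k clamp and the tie-safe self-exclusion. The sketch below is illustrative only; `clamp_k`, `neighbours_by_slice` and `neighbours_by_filter` are hypothetical helpers, not functions from this branch.

```python
import math


def clamp_k(k, n_points):
    """Largest k not above the requested k that still satisfies k < n/2 (strict)."""
    return min(k, (n_points - 1) // 2)


def _ranked(points, i):
    # Python's sort is stable, so ties in distance keep their index order.
    return sorted(range(len(points)), key=lambda j: math.dist(points[i], points[j]))


def neighbours_by_slice(points, i):
    """Fragile: assumes the query point always sorts into position 0."""
    return _ranked(points, i)[1:]


def neighbours_by_filter(points, i):
    """Tie-safe: drop the query point's own index wherever it lands in the ranking."""
    return [j for j in _ranked(points, i) if j != i]


# Points 0 and 1 coincide, so point 0 ties with point 1 at distance zero.
points = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]
sliced = neighbours_by_slice(points, 1)     # still contains index 1 (the point itself)
filtered = neighbours_by_filter(points, 1)  # contains only the other two points
```

With n = 10, a clamp of n-2 would allow k = 8, which violates k < n/2; (n-1)//2 gives 4, which satisfies it for any n.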

Tests

  • tests/test_stats.py, tests/test_stats_bundle.py, tests/test_stats_cli.py — known-answer
    numeric fixtures (blob separation; faithfulness on identity vs random projections; label-permutation
    alignment), the 8-column schema, the 5-part bundle round-trip, and the protspace style round-trip.
  • Full suite: 534 passed locally.

Scope (MVP) & non-goals

In: per-projection cluster_validity (unsupervised/elbow) + faithfulness, baked at prep time,
carried in the bundle. Explicitly out (non-breaking future expansions): embedding-space
cluster-validity, annotation-feature label sources, on-demand recompute, the broader
ProtSpaceExtractor pair/edge/set analyses (future typed bundle parts), and frontend rendering. The
registry + long-format table leave seams for the scalar expansions.

Data-format change: additive, backward compatible — existing 3- and 4-part bundles read and
write unchanged.

Refs tsenoner/protspace_web#219, #31

Add a protspace.stats package computing per-projection statistics, baked
into the .parquetbundle as an optional fifth part:

- cluster_validity: KMeans + distance-to-chord elbow -> silhouette,
  Davies-Bouldin, Calinski-Harabasz on the projection coordinates.
- faithfulness: kNN-overlap + trustworthiness/continuity vs the source
  embedding (high-dim metric from the reducer; large-n sampling guard).

Tidy long-format table (8 cols: space_kind, space_name, stat_family,
label_kind, metric, metric_kind, value, extra_json) — new statistics add
rows, not columns. Registry mirrors the lazy REDUCERS pattern; sklearn
imports stay function-local to preserve CLI startup.
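
A rough picture of "rows, not columns": the sketch below builds a few long-format records with the eight column names listed above. Everything else is an assumption for illustration: `LongRow` and `make_row` are hypothetical (they are not the PR's StatRow), and the field values such as "projection", "kmeans_elbow", "score" and the numbers are placeholders, not real output.

```python
import json
from dataclasses import asdict, dataclass

COLUMNS = (
    "space_kind", "space_name", "stat_family", "label_kind",
    "metric", "metric_kind", "value", "extra_json",
)


@dataclass(frozen=True)
class LongRow:  # hypothetical stand-in for one row of the tidy table
    space_kind: str
    space_name: str
    stat_family: str
    label_kind: str
    metric: str
    metric_kind: str
    value: float
    extra_json: str


def make_row(space_name, stat_family, metric, value, **extra):
    """Build one row; anything metric-specific goes into extra_json."""
    return LongRow(
        space_kind="projection",      # placeholder value
        space_name=space_name,
        stat_family=stat_family,
        label_kind="kmeans_elbow",    # placeholder value
        metric=metric,
        metric_kind="score",          # placeholder value
        value=float(value),
        extra_json=json.dumps(extra, sort_keys=True),
    )


rows = [
    make_row("umap_2d", "cluster_validity", "silhouette", 0.5, k=4),
    make_row("umap_2d", "cluster_validity", "davies_bouldin", 1.0, k=4),
]
# A new statistic is just another row; the schema stays the same.
rows.append(make_row("umap_2d", "cluster_validity", "calinski_harabasz", 100.0, k=4))
table = [asdict(r) for r in rows]
```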

Bundle I/O carries an optional 5th statistics part (core+settings?+stats?)
with a zero-byte settings slot keeping it unambiguous; read_bundle keeps
its 2-tuple shape (new read_statistics_from_bundle accessor) and
replace_settings_in_bundle preserves a trailing stats part so
`protspace style` is non-lossy.
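
The role of the zero-byte settings slot can be illustrated with a toy byte layout. This is a sketch under loud assumptions: the delimiter below is made up, `join_parts` and `split_parts` are hypothetical and are not the real write_bundle / read_bundle, and the parts are short byte strings rather than Parquet files.

```python
DELIMITER = b"---HYPOTHETICAL-DELIMITER---"  # assumption: not the real bundle delimiter


def join_parts(core, settings=None, statistics=None):
    """Lay out core(3) + settings? + statistics? as one blob."""
    if len(core) != 3:
        raise ValueError("expected exactly three core parts")
    parts = list(core)
    if statistics is not None:
        # Statistics must sit in the fifth position, so pad settings with zero bytes.
        parts.append(settings if settings is not None else b"")
        parts.append(statistics)
    elif settings is not None:
        parts.append(settings)
    return DELIMITER.join(parts)


def split_parts(blob):
    """Recover (core, settings, statistics); an empty settings slot reads as None."""
    parts = blob.split(DELIMITER)
    core = parts[:3]
    settings = parts[3] if len(parts) > 3 and parts[3] else None
    statistics = parts[4] if len(parts) > 4 else None
    return core, settings, statistics


core = [b"annotations", b"metadata", b"coordinates"]
stats_only = join_parts(core, statistics=b"stats")
legacy = join_parts(core)
```

Because the empty slot is always written when statistics is present, a reader can tell a four-part bundle (settings only) from a five-part one (statistics with or without settings) by position alone.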

Wiring: ReductionPipeline computes stats (best-effort, never fatal) behind
prepare --stats/--no-stats; new `protspace stats` subcommand for the
discrete path; `bundle -s/--statistics` folds a stats parquet in.

Refs tsenoner/protspace_web#219

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jcoludar

Collaborator Author

Web counterpart (consumes this, lands after a stats-bearing release): tsenoner/protspace_web#295. Tracking issue: tsenoner/protspace_web#219.

CI's `ruff format --check` flagged 9 files that were committed without
running `ruff format` (`ruff check` lint passed, but the formatter check
is a separate CI step). Pure formatting — no behavior change.
Stats suite still 30 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tsenoner

Owner

Where to put the information:

  • Cluster Elbow Kx -> save as an annotation
  • Cluster Silhouette Kx -> save as annotation with x being variable
  • kNN-overlap + trustworthiness + continuity -> metadata per dimensionality reduction method (store with the other infos)
  • silhouette, Davies–Bouldin, Calinski–Harabasz -> separate parquet file (as currently)

jcoludar and others added 4 commits June 24, 2026 16:35
…lity)

Phase 1A of route-projection-statistics: carry each statistic in the bundle
part whose existing frontend consumer matches its granularity, instead of one
opaque fifth part.

- StatRow gains a `destination` (default "statistics_part", not a tidy-table
  column); StatsReport.partition() groups rows by destination and to_arrow()
  serialises only the statistics_part bucket -- the fifth part is now aggregate
  cluster-validity only.
- Faithfulness rows (kNN-overlap / trustworthiness / continuity, incl. the skip
  row) are marked destination="projection_metadata".
- New stats/carriage.py route_faithfulness_to_metadata() folds those rows into
  each projection's info_json.quality (per-metric value + k/metric/sampling
  provenance; NaN skip value -> null so info_json stays valid JSON). Wired into
  ReductionPipeline._compute_statistics before create_output serialises info_json.
- `protspace stats` stays a pure aggregate-only producer (faithfulness no longer
  written to statistics.parquet); the prep stats+bundle path is unaffected.
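
The NaN-to-null handling matters because Python's json module would otherwise emit a bare NaN token, which is not valid JSON. A minimal sketch of the folding step, assuming a hypothetical `fold_quality` helper and a made-up shape for the per-metric entries (this is not route_faithfulness_to_metadata itself):

```python
import json
import math


def fold_quality(info_json, metrics):
    """Merge per-metric entries into info_json under 'quality', mapping NaN to null."""
    info = json.loads(info_json) if info_json else {}
    quality = {}
    for name, entry in metrics.items():
        value = entry["value"]
        if isinstance(value, float) and math.isnan(value):
            value = None  # serialised as JSON null
        quality[name] = {**entry, "value": value}
    info["quality"] = quality
    # allow_nan=False makes json.dumps raise if any NaN slipped through.
    return json.dumps(info, allow_nan=False)


existing = json.dumps({"n_neighbors": 15})  # stand-in for the reducer's existing info
metrics = {
    "knn_overlap": {"value": 0.5, "k": 10},               # placeholder numbers
    "trustworthiness": {"value": float("nan"), "k": 10},  # a skipped metric
}
enriched = fold_quality(existing, metrics)
```

The existing reducer info is preserved alongside the new quality key, and the skipped metric survives as an explicit null rather than disappearing.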

Tests: destination/partition/to_arrow restriction; faithfulness routing incl.
skip row; carriage router (provenance, NaN->null, info_json round-trip,
multi-embedding); end-to-end `protspace stats` aggregate-only. Existing
fifth-part tests updated to the narrowed contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…thfulness

The deployed prep pipeline builds bundles via standalone `protspace project` +
`stats` + `bundle` subprocesses, not the in-process `prepare` pipeline. After the
Phase-1A routing, `protspace stats` wrote faithfulness nowhere (only aggregate
validity → statistics.parquet), so the prep path lost it.

`protspace stats` now folds faithfulness into `projections_metadata.parquet` in
place (parses each row's info_json, injects `quality`, preserves all other columns
and the reducer's existing info, re-serialises). The existing `protspace bundle -p`
then carries the enriched metadata into the bundle's 2nd part with no bundle/prep
code change. statistics.parquet stays aggregate-only.

This matches the spec scenario "the standalone stats path recomputes and merges it
into projections_metadata" and makes Phase 1 deliver faithfulness end-to-end in the
production prep flow.

Tests: stats rewrites metadata.info_json.quality (columns/rows preserved, reducer
info kept); end-to-end `stats` → `bundle -p` ships a bundle whose
projections_metadata carries quality while the fifth part stays validity-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n columns

Phase 2A of route-projection-statistics (tsenoner #61 review bullets 1-2): surface
the elbow-K labelling and per-point silhouette as per-protein annotation columns
so the frontend color-by control renders them with no new UI.

- New AnnotationColumn output type (name, kind categorical|numeric, values keyed by
  identifier); StatsReport carries an annotation_columns channel and add() routes
  mixed StatRow / AnnotationColumn lists.
- ClusterValidityStatistic emits `cluster_<projection>` (non-numeric "cluster N"
  labels → categorical inference) and `silhouette_<projection>` (per-point
  silhouette_samples over the full labelled set → numeric). Per-point silhouette is
  O(n^2) with no subsample path, so it has its own hard-ceiling skip guard; both are
  gated by the cluster_annotations param and emitted only for a genuine (>=2)
  clustering with aligned ids.
- carriage.merge_annotation_columns joins the columns onto the annotations frame by
  identifier (absent proteins get no value); wired into the prepare pipeline before
  create_output's .astype(str) so typing survives.
- `protspace stats` gains -a/--annotations: enriches the annotations parquet in
  place with the computed columns (stringified to match the prepare path), so the
  prep `project -> stats -a -> bundle -a` flow carries them. Without -a the
  expensive per-protein computation is skipped.
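
For intuition on why the per-point silhouette needs a ceiling guard, here is a pure-Python sketch of the computation and of the two per-protein columns. It is an assumption-laden illustration: the branch uses scikit-learn's silhouette_samples, whereas `silhouette_per_point` below is a hypothetical hand-rolled equivalent with Euclidean distance, and the `max_points` ceiling, identifiers and projection name are placeholders.

```python
import math


def silhouette_per_point(points, labels, max_points=5000):
    """Per-point silhouette (b - a) / max(a, b); returns None when skipped."""
    n = len(points)
    if n > max_points:
        return None  # hard ceiling: every point is compared with every other, O(n^2)
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        return None  # not a genuine clustering
    scores = []
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


ids = ["P1", "P2", "P3", "P4"]  # placeholder identifiers
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_per_point(points, labels)

# Columns keyed by identifier, named after a placeholder projection "umap_2d".
cluster_column = {pid: f"cluster {label}" for pid, label in zip(ids, labels)}
silhouette_column = dict(zip(ids, scores))
```

The nested loops make the quadratic cost explicit, which is why the column is skipped above a size ceiling rather than subsampled.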

Tests: validity per-protein outputs + ceiling guard + disable; carriage join +
annotations-table typing; stats -a enrichment; end-to-end stats -a -> bundle -a
ships cluster_/silhouette_ columns in the bundle's annotations part. Auto-styling
(colored-without-manual-step) is the next increment; columns already color via the
default palette when selected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 2A.4 of route-projection-statistics. Generate a full LegendPersistedSettings
envelope per cluster-membership column so clusters are colored when selected with no
manual styling step.

- carriage.build_cluster_legend_settings: for each categorical AnnotationColumn build
  a complete envelope the frontend's sanitizeLegendSettingsEntry accepts —
  maxVisibleValues / shapeSize / sortMode / hiddenValues / enableDuplicateStackUI /
  selectedPaletteId + categories keyed by the exact label with a Kelly-palette
  color, zOrder and shape. Numeric (silhouette) columns keep the default ramp.
- prepare path: BaseProcessor.save_output gains settings=; the pipeline builds the
  cluster styles from the report and writes them into the bundle's settings part.
- prep path: `protspace stats --settings-out <json>` writes the styles; `protspace
  bundle --settings <json>` folds them into the settings part.

Tests: envelope validity (every required field/type + distinct palette colors);
end-to-end stats --settings-out -> bundle --settings styles clusters in the settings
part.

Deferred (follow-up): preserving the generated cluster styles across a later
`protspace style` rewrite (replace_settings_in_bundle) — a rare re-style path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jcoludar

Collaborator Author

Routing is implemented across this PR and protspace_web#295. Where each statistic lands:

  • kNN-overlap, trustworthiness, continuity → projection metadata (info_json.quality), shown per-metric in the projection panel.
  • silhouette, Davies–Bouldin, Calinski–Harabasz → statistics.parquet (now the only thing in that part).
  • cluster elbow-K label → per-protein cluster_ annotation column (categorical, auto-styled with a Kelly palette).
  • per-point silhouette → per-protein silhouette_ column (numeric ramp).

Computed columns appear in color-by automatically, grouped under a new "Statistics" section. protspace stats handles carriage: faithfulness into projections_metadata, per-protein columns into the annotations parquet (-a), and the cluster legend styles via --settings-out for bundle --settings. The prep pipeline picks all of it up. Design + phased plan: the route-projection-statistics openspec change in the web repo. Both PRs green.

Defaults chosen:

  • Naming: cluster_ / silhouette_, elbow K only (no sweep), K stored in metadata, not the column name (keeps names stable if K shifts on re-run). Your review said "Kx with x variable" — if K belongs in the visible label, that's a small change.
  • Coloring: clusters are colored when the column is selected, not forced as the initial color-by. Making a cluster column the default initial selection is separate frontend work (initial-view).

Deferred (not blocking):

  • Re-style round-trip: running protspace style on a routed bundle doesn't yet preserve the generated cluster styles — replace_settings_in_bundle rewrites settings wholesale.
  • protspace prepare has no --no-stats-annotations toggle to skip the O(n²) per-point silhouette; the prep path is gated by -a. Defaulted on.
  • statistics.parquet validity scalars are carried but not surfaced as a UI table.
  • The prep service still pins protspace>=0.6; needs bumping to the release that includes this once it's cut. Older engines degrade gracefully — columns ship unstyled.

Sequencing: I folded the routing directly into #61/#295 rather than landing the opaque fifth part first and stacking. Can split the per-protein annotations into a separate PR if that's easier to review.

