feat(stats): projection statistics (cluster-validity + faithfulness) by jcoludar · Pull Request #61 · tsenoner/protspace

jcoludar · 2026-06-24T06:51:23Z

Summary

Adds a protspace.stats subsystem that computes per-projection quality statistics at
preparation time and bakes them into the .parquetbundle as an optional fifth part. This is the
engine half of the projection-statistics MVP (tracking issue: tsenoner/protspace_web#219;
related: #31). The protspace_web PR that consumes it follows separately — this PR lands first.

Today the pipeline (embed → project → annotate → bundle) ships coordinates with zero quality
metrics, so judging a projection is purely visual. This change answers two questions per projection:

Cluster structure — cluster_validity: KMeans with a distance-to-chord elbow estimate of
K, scored by silhouette, Davies–Bouldin, Calinski–Harabasz on the projection coords.
Projection faithfulness — faithfulness: kNN-overlap and trustworthiness / continuity
between the source embedding and the projection — i.e. how much the reduction distorted the
neighbourhood structure.
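
To make the faithfulness idea concrete, here is a minimal brute-force sketch of kNN-overlap@k. It is not the code in this PR: the function names are hypothetical, Euclidean distance is assumed in both spaces, and trustworthiness/continuity are left out.

```python
import math
from typing import List, Sequence, Set

Point = Sequence[float]


def knn_indices(points: Sequence[Point], k: int) -> List[Set[int]]:
    """Return, for each point, the index set of its k nearest neighbours."""
    n = len(points)
    result = []
    for i in range(n):
        # Exclude the point itself by index; slicing off the first sorted hit
        # would drop the wrong row when coincident points tie at distance 0.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(points[i], points[j]))
        result.append(set(others[:k]))
    return result


def knn_overlap(high: Sequence[Point], low: Sequence[Point], k: int) -> float:
    """Mean fraction of each point's k neighbours shared by both spaces."""
    if len(high) != len(low):
        raise ValueError("embedding and projection must have the same rows")
    n = len(high)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    hi = knn_indices(high, k)
    lo = knn_indices(low, k)
    return sum(len(a & b) for a, b in zip(hi, lo)) / (n * k)


# Toy data: two tight pairs in 3-D, projected by dropping the last axis.
embedding = [(0, 0, 0), (1, 0, 0), (5, 5, 5), (6, 5, 5)]
faithful = [p[:2] for p in embedding]
scrambled = [(0, 0), (5, 5), (1, 0), (6, 5)]  # rows 1 and 2 swapped
```

A projection that keeps every neighbourhood scores 1.0 and one that breaks all of them scores 0.0, which is the range the identity-vs-random test fixtures rely on.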

What's in this PR

src/protspace/stats/ — a generalized Statistic contract (each statistic declares whether it
needs the embedding and returns one or more StatRows) + a lazy STATISTICS registry mirroring
the existing REDUCERS pattern. sklearn imports stay function-local to preserve CLI startup.
- stats/cluster/kmeans_elbow.py — KMeans sweep K ∈ [2, min(round(√n), 50)], elbow via
  perpendicular deviation from the first→last inertia chord (argmax index → K).
- stats/metrics/validity.py — silhouette (seeded sample above threshold) / DB / CH at that K.
- stats/metrics/faithfulness.py — kNN-overlap@k and trustworthiness/continuity vs the embedding.
- stats/driver.py — compute_statistics(...) iterating registered statistics per projection,
  isolating per-statistic and per-projection failures (a bad reduction is logged and skipped,
  never sinks the report).
Tidy long-format output — an 8-column table (space_kind, space_name, stat_family, label_kind, metric, metric_kind, value, extra_json). New statistics add rows, not columns.
Bundle I/O (data/io/bundle.py) — an optional fifth part statistics.parquet. Layout
core(3) + settings? + statistics?; when statistics is present without settings, a zero-byte
settings slot keeps the fifth position unambiguous. write_bundle / read_bundle /
extract_bundle_to_dir and replace_settings_in_bundle are all updated — the last so
protspace style no longer silently drops a trailing stats part.
Wiring — ReductionPipeline.run (the one stage holding embeddings and projections)
computes stats behind prepare --stats/--no-stats; new protspace stats -i emb.h5 -p project_dir -o statistics.parquet for the discrete path; bundle -s/--statistics folds a stats parquet in.
No new dependency — scikit-learn (KMeans, silhouette/DB/CH, manifold.trustworthiness) is
already a core dep.
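
The elbow rule described for kmeans_elbow.py can be sketched without scikit-learn once the inertias are known. This is an illustration under assumptions, not the module itself: `k_range` and `elbow_k` are hypothetical names, and no axis normalisation is applied before measuring distance to the chord (the PR text does not say whether the real code normalises).

```python
import math
from typing import List, Sequence


def k_range(n: int) -> List[int]:
    """Candidate K values: 2 .. min(round(sqrt(n)), 50), inclusive."""
    upper = min(round(math.sqrt(n)), 50)
    return list(range(2, upper + 1))


def elbow_k(ks: Sequence[int], inertias: Sequence[float]) -> int:
    """Pick the K whose (K, inertia) point lies farthest from the chord
    joining the first and last points of the inertia curve."""
    if not ks or len(ks) != len(inertias):
        raise ValueError("ks and inertias must be non-empty and equal length")
    if len(ks) < 3:
        return ks[0]  # no interior point, so no elbow to find
    x0, y0 = ks[0], inertias[0]
    dx, dy = ks[-1] - x0, inertias[-1] - y0
    norm = math.hypot(dx, dy)
    if norm == 0:
        return ks[0]
    # Perpendicular distance of each point from the first->last chord.
    dists = [abs(dy * (x - x0) - dx * (y - y0)) / norm
             for x, y in zip(ks, inertias)]
    best = max(range(len(ks)), key=dists.__getitem__)
    return ks[best]


# Made-up inertia curve that flattens after K = 4.
chosen = elbow_k([2, 3, 4, 5, 6], [100.0, 40.0, 10.0, 8.0, 7.0])
```

In a real sweep the inertias would come from fitting KMeans once per candidate K; the selection step is independent of how they were produced.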

Robustness hardening (post adversarial review)

This branch was put through a multi-agent review; the confirmed findings are fixed here:

Faithfulness no longer silently drops trustworthiness/continuity for small n. trustworthiness
requires n_neighbors < n/2 (strict); k is now clamped to (n-1)//2 instead of n-2.
cluster_validity scores the full projection; faithfulness scores the embedding-aligned
subset. Previously clustering could be scored on the id-intersection subset only.
High-dim metric is threaded through (default_metric) so faithfulness uses the run's metric
rather than defaulting to euclidean for PCA/MDS/PaCMAP.
Multi-embedding attribution — exact id-set cover wins immediately (no ambiguous tie-break); a
source column maps each projection to its own embedding.
Precomputed (n×n) similarity/distance matrices are skipped for faithfulness (not embeddings).
kNN self-exclusion is tie-safe on coincident points (filter self, not column-0 slice).
n_clusters reports the achieved distinct-cluster count (KMeans can collapse on coincident
points), keeping the requested K in extra.
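
The clamp in the first fix follows from integer arithmetic: k < n/2 is the same as 2k <= n - 1, so the largest valid k is (n-1)//2. A small sketch, with `clamp_k` as a hypothetical helper name:

```python
def clamp_k(k_requested: int, n: int) -> int:
    """Largest k <= k_requested that satisfies the strict bound k < n / 2.

    Returns 0 when no positive k is valid (n <= 2), which a caller could
    treat as "skip trustworthiness/continuity for this projection".
    """
    return max(0, min(k_requested, (n - 1) // 2))


# With n = 10 the old bound n - 2 = 8 violates k < n / 2; the new one gives 4.
small_n_k = clamp_k(15, 10)
```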

Tests

tests/test_stats.py, tests/test_stats_bundle.py, tests/test_stats_cli.py — known-answer
numeric fixtures (blob separation; faithfulness on identity vs random projections; label-permutation
alignment), the 8-column schema, the 5-part bundle round-trip, and the protspace style round-trip.
Full suite: 534 passed locally.
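
For readers who want a picture of the 8-column schema these tests pin down, here is a rough sketch. The column names come from this PR; the `TidyRow` class, the field types, and every value below are illustrative placeholders, not the project's StatRow or real results.

```python
import json
from dataclasses import asdict, dataclass

COLUMNS = (
    "space_kind", "space_name", "stat_family", "label_kind",
    "metric", "metric_kind", "value", "extra_json",
)


@dataclass(frozen=True)
class TidyRow:
    """One long-format row; field order mirrors COLUMNS."""
    space_kind: str
    space_name: str
    stat_family: str
    label_kind: str
    metric: str
    metric_kind: str
    value: float
    extra_json: str = "{}"


# Placeholder rows: a new statistic is just another row with a new
# `metric` value, so the column set never changes.
rows = [
    TidyRow("projection", "UMAP_2", "cluster_validity", "elbow",
            "silhouette", "score", 0.5, json.dumps({"k": 4})),
    TidyRow("projection", "UMAP_2", "cluster_validity", "elbow",
            "davies_bouldin", "score", 1.25, json.dumps({"k": 4})),
]
records = [asdict(r) for r in rows]
```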

Scope (MVP) & non-goals

In: per-projection cluster_validity (unsupervised/elbow) + faithfulness, baked at prep time,
carried in the bundle. Explicitly out (non-breaking future expansions): embedding-space
cluster-validity, annotation-feature label sources, on-demand recompute, the broader
ProtSpaceExtractor pair/edge/set analyses (future typed bundle parts), and frontend rendering. The
registry + long-format table leave seams for the scalar expansions.

Data-format change: additive, backward compatible — existing 3- and 4-part bundles read and
write unchanged.
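
The slot rule behind that compatibility claim can be sketched as follows. This is a simplified model that treats a bundle as a list of byte parts; `assemble_parts` and `interpret_parts` are hypothetical names, and the real delimiter handling in data/io/bundle.py is not shown.

```python
from typing import List, Optional, Tuple


def assemble_parts(core: List[bytes],
                   settings: Optional[bytes] = None,
                   statistics: Optional[bytes] = None) -> List[bytes]:
    """Lay out parts as core(3) + settings? + statistics?."""
    if len(core) != 3:
        raise ValueError("expected exactly three core parts")
    parts = list(core)
    if statistics is not None:
        # Statistics must sit in the fifth slot, so an absent settings
        # part is written as a zero-byte placeholder in the fourth.
        parts.append(settings if settings is not None else b"")
        parts.append(statistics)
    elif settings is not None:
        parts.append(settings)
    return parts


def interpret_parts(
    parts: List[bytes],
) -> Tuple[List[bytes], Optional[bytes], Optional[bytes]]:
    """Recover (core, settings, statistics) from a 3-, 4- or 5-part list."""
    if len(parts) not in (3, 4, 5):
        raise ValueError(f"unexpected part count: {len(parts)}")
    core = parts[:3]
    # An empty fourth part is the placeholder, i.e. "no settings".
    settings = parts[3] if len(parts) >= 4 and parts[3] else None
    statistics = parts[4] if len(parts) == 5 else None
    return core, settings, statistics


core = [b"annotations", b"metadata", b"coords"]
stats_only = assemble_parts(core, statistics=b"stats")
```

Because the part count alone never changes meaning for 3- and 4-part inputs, older bundles pass through both functions untouched.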

Refs tsenoner/protspace_web#219, #31

Add a protspace.stats package computing per-projection statistics, baked into the .parquetbundle as an optional fifth part:

- cluster_validity: KMeans + distance-to-chord elbow -> silhouette, Davies-Bouldin, Calinski-Harabasz on the projection coordinates.
- faithfulness: kNN-overlap + trustworthiness/continuity vs the source embedding (high-dim metric from the reducer; large-n sampling guard).

Tidy long-format table (8 cols: space_kind, space_name, stat_family, label_kind, metric, metric_kind, value, extra_json): new statistics add rows, not columns. Registry mirrors the lazy REDUCERS pattern; sklearn imports stay function-local to preserve CLI startup.

Bundle I/O carries an optional 5th statistics part (core+settings?+stats?) with a zero-byte settings slot keeping it unambiguous; read_bundle keeps its 2-tuple shape (new read_statistics_from_bundle accessor) and replace_settings_in_bundle preserves a trailing stats part so `protspace style` is non-lossy.

Wiring: ReductionPipeline computes stats (best-effort, never fatal) behind prepare --stats/--no-stats; new `protspace stats` subcommand for the discrete path; `bundle -s/--statistics` folds a stats parquet in.

Refs tsenoner/protspace_web#219

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jcoludar · 2026-06-24T06:54:41Z

Web counterpart (consumes this, lands after a stats-bearing release): tsenoner/protspace_web#295. Tracking issue: tsenoner/protspace_web#219.

CI's `ruff format --check` flagged 9 files that were committed without running `ruff format` (`ruff check` lint passed, but the formatter check is a separate CI step). Pure formatting, no behavior change. Stats suite still 30 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

tsenoner · 2026-06-24T13:38:45Z

Where to put the information:

Cluster Elbow Kx -> save as an annotation
Cluster Silhouette Kx -> save as annotation with x being variable
kNN-overlap + trustworthiness + continuity -> metadata per dimensionality reduction method (store with the other infos)
silhouette, Davies–Bouldin, Calinski–Harabasz -> separate parquet file (as currently)

…lity)

Phase 1A of route-projection-statistics: carry each statistic in the bundle part whose existing frontend consumer matches its granularity, instead of one opaque fifth part.

- StatRow gains a `destination` (default "statistics_part", not a tidy-table column); StatsReport.partition() groups rows by destination and to_arrow() serialises only the statistics_part bucket -- the fifth part is now aggregate cluster-validity only.
- Faithfulness rows (kNN-overlap / trustworthiness / continuity, incl. the skip row) are marked destination="projection_metadata".
- New stats/carriage.py route_faithfulness_to_metadata() folds those rows into each projection's info_json.quality (per-metric value + k/metric/sampling provenance; NaN skip value -> null so info_json stays valid JSON). Wired into ReductionPipeline._compute_statistics before create_output serialises info_json.
- `protspace stats` stays a pure aggregate-only producer (faithfulness no longer written to statistics.parquet); the prep stats+bundle path is unaffected.

Tests: destination/partition/to_arrow restriction; faithfulness routing incl. skip row; carriage router (provenance, NaN->null, info_json round-trip, multi-embedding); end-to-end `protspace stats` aggregate-only. Existing fifth-part tests updated to the narrowed contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
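
The NaN -> null detail matters because Python's json module emits the bare token NaN by default, which strict JSON parsers reject. A minimal sketch of the folding step, assuming a flat metric -> value mapping (the real `quality` block also carries provenance, and `fold_quality` is a hypothetical name, not route_faithfulness_to_metadata itself):

```python
import json
import math
from typing import Dict, Optional


def fold_quality(info_json: str, metrics: Dict[str, float]) -> str:
    """Merge metrics into info["quality"], mapping NaN to JSON null."""
    info = json.loads(info_json) if info_json else {}
    quality: Dict[str, Optional[float]] = {}
    for name, value in metrics.items():
        is_nan = isinstance(value, float) and math.isnan(value)
        quality[name] = None if is_nan else value
    info["quality"] = quality
    # allow_nan=False makes json.dumps raise instead of writing NaN,
    # so a missed NaN fails loudly rather than corrupting the metadata.
    return json.dumps(info, allow_nan=False)


# Existing reducer info is preserved; the skipped metric becomes null.
enriched = fold_quality(
    '{"n_neighbors": 15}',
    {"trustworthiness": 0.9, "continuity": float("nan")},
)
```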

…thfulness

The deployed prep pipeline builds bundles via standalone `protspace project` + `stats` + `bundle` subprocesses, not the in-process `prepare` pipeline. After the Phase-1A routing, `protspace stats` wrote faithfulness nowhere (only aggregate validity → statistics.parquet), so the prep path lost it.

`protspace stats` now folds faithfulness into `projections_metadata.parquet` in place (parses each row's info_json, injects `quality`, preserves all other columns and the reducer's existing info, re-serialises). The existing `protspace bundle -p` then carries the enriched metadata into the bundle's 2nd part with no bundle/prep code change. statistics.parquet stays aggregate-only.

This matches the spec scenario "the standalone stats path recomputes and merges it into projections_metadata" and makes Phase 1 deliver faithfulness end-to-end in the production prep flow.

Tests: stats rewrites metadata.info_json.quality (columns/rows preserved, reducer info kept); end-to-end `stats` → `bundle -p` ships a bundle whose projections_metadata carries quality while the fifth part stays validity-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…n columns

Phase 2A of route-projection-statistics (tsenoner #61 review bullets 1-2): surface the elbow-K labelling and per-point silhouette as per-protein annotation columns so the frontend color-by control renders them with no new UI.

- New AnnotationColumn output type (name, kind categorical|numeric, values keyed by identifier); StatsReport carries an annotation_columns channel and add() routes mixed StatRow / AnnotationColumn lists.
- ClusterValidityStatistic emits `cluster_<projection>` (non-numeric "cluster N" labels → categorical inference) and `silhouette_<projection>` (per-point silhouette_samples over the full labelled set → numeric). Per-point silhouette is O(n^2) with no subsample path, so it has its own hard-ceiling skip guard; both are gated by the cluster_annotations param and emitted only for a genuine (>=2) clustering with aligned ids.
- carriage.merge_annotation_columns joins the columns onto the annotations frame by identifier (absent proteins get no value); wired into the prepare pipeline before create_output's .astype(str) so typing survives.
- `protspace stats` gains -a/--annotations: enriches the annotations parquet in place with the computed columns (stringified to match the prepare path), so the prep `project -> stats -a -> bundle -a` flow carries them. Without -a the expensive per-protein computation is skipped.

Tests: validity per-protein outputs + ceiling guard + disable; carriage join + annotations-table typing; stats -a enrichment; end-to-end stats -a -> bundle -a ships cluster_/silhouette_ columns in the bundle's annotations part.

Auto-styling (colored-without-manual-step) is the next increment; columns already color via the default palette when selected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Phase 2A.4 of route-projection-statistics. Generate a full LegendPersistedSettings envelope per cluster-membership column so clusters are colored when selected with no manual styling step.

- carriage.build_cluster_legend_settings: for each categorical AnnotationColumn build a complete envelope the frontend's sanitizeLegendSettingsEntry accepts: maxVisibleValues / shapeSize / sortMode / hiddenValues / enableDuplicateStackUI / selectedPaletteId + categories keyed by the exact label with a Kelly-palette color, zOrder and shape. Numeric (silhouette) columns keep the default ramp.
- prepare path: BaseProcessor.save_output gains settings=; the pipeline builds the cluster styles from the report and writes them into the bundle's settings part.
- prep path: `protspace stats --settings-out <json>` writes the styles; `protspace bundle --settings <json>` folds them into the settings part.

Tests: envelope validity (every required field/type + distinct palette colors); end-to-end stats --settings-out -> bundle --settings styles clusters in the settings part.

Deferred (follow-up): preserving the generated cluster styles across a later `protspace style` rewrite (replace_settings_in_bundle), a rare re-style path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jcoludar · 2026-06-24T17:53:28Z

Routing is implemented across this PR and protspace_web#295. Where each statistic lands:

kNN-overlap, trustworthiness, continuity → projection metadata (info_json.quality), shown per-metric in the projection panel.
silhouette, Davies–Bouldin, Calinski–Harabasz → statistics.parquet (now the only thing in that part).
cluster elbow-K label → per-protein cluster_ annotation column (categorical, auto-styled with a Kelly palette).
per-point silhouette → per-protein silhouette_ column (numeric ramp).

Computed columns appear in color-by automatically, grouped under a new "Statistics" section. protspace stats handles carriage: faithfulness into projections_metadata, per-protein columns into the annotations parquet (-a), and the cluster legend styles via --settings-out for bundle --settings. The prep pipeline picks all of it up. Design + phased plan: the route-projection-statistics openspec change in the web repo. Both PRs green.

Defaults chosen:

Naming: cluster_ / silhouette_, elbow K only (no sweep), K stored in metadata, not the column name (keeps names stable if K shifts on re-run). Your review said "Kx with x variable" — if K belongs in the visible label, that's a small change.
Coloring: clusters are colored when the column is selected, not forced as the initial color-by. Making a cluster column the default initial selection is separate frontend work (initial-view).

Deferred (not blocking):

Re-style round-trip: running protspace style on a routed bundle doesn't yet preserve the generated cluster styles — replace_settings_in_bundle rewrites settings wholesale.
protspace prepare has no --no-stats-annotations toggle to skip the O(n²) per-point silhouette; the prep path is gated by -a. Defaulted on.
statistics.parquet validity scalars are carried but not surfaced as a UI table.
The prep service still pins protspace>=0.6; needs bumping to the release that includes this once it's cut. Older engines degrade gracefully — columns ship unstyled.

Sequencing: I folded the routing directly into #61/#295 rather than landing the opaque fifth part first and stacking. Can split the per-protein annotations into a separate PR if that's easier to review.

jcoludar mentioned this pull request Jun 24, 2026

feat(statistics): consume projection statistics in the bundle pipeline tsenoner/protspace_web#295

Draft

jcoludar mentioned this pull request Jun 24, 2026

[FEATURE] Implementing statistics based on exploration from #216 tsenoner/protspace_web#219

Open

1 task

jcoludar and others added 4 commits June 24, 2026 16:35

jcoludar wants to merge 6 commits into main from feat/projection-statistics
