feat(stats): projection statistics (cluster-validity + faithfulness) #61
jcoludar wants to merge 6 commits into
Add a protspace.stats package computing per-projection statistics, baked into the .parquetbundle as an optional fifth part:

- cluster_validity: KMeans + distance-to-chord elbow -> silhouette, Davies-Bouldin, Calinski-Harabasz on the projection coordinates.
- faithfulness: kNN-overlap + trustworthiness/continuity vs the source embedding (high-dim metric from the reducer; large-n sampling guard).

Tidy long-format table (8 cols: space_kind, space_name, stat_family, label_kind, metric, metric_kind, value, extra_json): new statistics add rows, not columns. Registry mirrors the lazy REDUCERS pattern; sklearn imports stay function-local to preserve CLI startup.

Bundle I/O carries an optional 5th statistics part (core+settings?+stats?) with a zero-byte settings slot keeping it unambiguous; read_bundle keeps its 2-tuple shape (new read_statistics_from_bundle accessor) and replace_settings_in_bundle preserves a trailing stats part so `protspace style` is non-lossy.

Wiring: ReductionPipeline computes stats (best-effort, never fatal) behind prepare --stats/--no-stats; new `protspace stats` subcommand for the discrete path; `bundle -s/--statistics` folds a stats parquet in.

Refs tsenoner/protspace_web#219

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Web counterpart (consumes this, lands after a stats-bearing release): tsenoner/protspace_web#295. Tracking issue: tsenoner/protspace_web#219.
CI's `ruff format --check` flagged 9 files that were committed without running `ruff format` (`ruff check` lint passed, but the formatter check is a separate CI step). Pure formatting — no behavior change. Stats suite still 30 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Where to put the information:
…lity)

Phase 1A of route-projection-statistics: carry each statistic in the bundle part whose existing frontend consumer matches its granularity, instead of one opaque fifth part.

- StatRow gains a `destination` (default "statistics_part", not a tidy-table column); StatsReport.partition() groups rows by destination and to_arrow() serialises only the statistics_part bucket -- the fifth part is now aggregate cluster-validity only.
- Faithfulness rows (kNN-overlap / trustworthiness / continuity, incl. the skip row) are marked destination="projection_metadata".
- New stats/carriage.py route_faithfulness_to_metadata() folds those rows into each projection's info_json.quality (per-metric value + k/metric/sampling provenance; NaN skip value -> null so info_json stays valid JSON). Wired into ReductionPipeline._compute_statistics before create_output serialises info_json.
- `protspace stats` stays a pure aggregate-only producer (faithfulness no longer written to statistics.parquet); the prep stats+bundle path is unaffected.

Tests: destination/partition/to_arrow restriction; faithfulness routing incl. skip row; carriage router (provenance, NaN->null, info_json round-trip, multi-embedding); end-to-end `protspace stats` aggregate-only. Existing fifth-part tests updated to the narrowed contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
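The NaN -> null step matters because Python's `json.dumps` writes a bare `NaN` token by default, which strict JSON parsers reject. A minimal sketch of the folding, under stated assumptions: `fold_quality` and the row keys (`metric`, `value`, `extra`) are hypothetical names chosen for illustration, not the actual `route_faithfulness_to_metadata()` signature or the real `quality` layout.

```python
import json
import math


def fold_quality(info_json, rows):
    """Merge faithfulness rows into an info_json string under a "quality" key.

    Hypothetical helper: the row shape here is an assumption for illustration.
    """
    info = json.loads(info_json) if info_json else {}
    quality = info.setdefault("quality", {})
    for row in rows:
        value = row["value"]
        # A skipped metric carries NaN; map it to None so it serialises as null.
        if isinstance(value, float) and math.isnan(value):
            value = None
        # Provenance (k, metric, sampling) rides along next to the value.
        quality[row["metric"]] = {"value": value, **row.get("extra", {})}
    # allow_nan=False makes any NaN that slipped through fail loudly
    # instead of producing invalid JSON.
    return json.dumps(info, allow_nan=False)


example = fold_quality(
    '{"n_components": 2}',
    [
        {"metric": "trustworthiness", "value": 0.9, "extra": {"k": 10}},
        {"metric": "continuity", "value": float("nan")},
    ],
)
```

The existing reducer info (`n_components` in this example) is preserved because the function only adds or updates the `quality` key.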
…thfulness

The deployed prep pipeline builds bundles via standalone `protspace project` + `stats` + `bundle` subprocesses, not the in-process `prepare` pipeline. After the Phase-1A routing, `protspace stats` wrote faithfulness nowhere (only aggregate validity → statistics.parquet), so the prep path lost it.

`protspace stats` now folds faithfulness into `projections_metadata.parquet` in place (parses each row's info_json, injects `quality`, preserves all other columns and the reducer's existing info, re-serialises). The existing `protspace bundle -p` then carries the enriched metadata into the bundle's 2nd part with no bundle/prep code change. statistics.parquet stays aggregate-only.

This matches the spec scenario "the standalone stats path recomputes and merges it into projections_metadata" and makes Phase 1 deliver faithfulness end-to-end in the production prep flow.

Tests: stats rewrites metadata.info_json.quality (columns/rows preserved, reducer info kept); end-to-end `stats` → `bundle -p` ships a bundle whose projections_metadata carries quality while the fifth part stays validity-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n columns

Phase 2A of route-projection-statistics (tsenoner #61 review bullets 1-2): surface the elbow-K labelling and per-point silhouette as per-protein annotation columns so the frontend color-by control renders them with no new UI.

- New AnnotationColumn output type (name, kind categorical|numeric, values keyed by identifier); StatsReport carries an annotation_columns channel and add() routes mixed StatRow / AnnotationColumn lists.
- ClusterValidityStatistic emits `cluster_<projection>` (non-numeric "cluster N" labels → categorical inference) and `silhouette_<projection>` (per-point silhouette_samples over the full labelled set → numeric). Per-point silhouette is O(n^2) with no subsample path, so it has its own hard-ceiling skip guard; both are gated by the cluster_annotations param and emitted only for a genuine (>=2) clustering with aligned ids.
- carriage.merge_annotation_columns joins the columns onto the annotations frame by identifier (absent proteins get no value); wired into the prepare pipeline before create_output's .astype(str) so typing survives.
- `protspace stats` gains -a/--annotations: enriches the annotations parquet in place with the computed columns (stringified to match the prepare path), so the prep `project -> stats -a -> bundle -a` flow carries them. Without -a the expensive per-protein computation is skipped.

Tests: validity per-protein outputs + ceiling guard + disable; carriage join + annotations-table typing; stats -a enrichment; end-to-end stats -a -> bundle -a ships cluster_/silhouette_ columns in the bundle's annotations part.

Auto-styling (colored-without-manual-step) is the next increment; columns already color via the default palette when selected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
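The join-by-identifier behaviour can be illustrated with a small sketch. It assumes an `AnnotationColumn` with only the three fields named in the commit (name, kind, values keyed by identifier) and represents the annotations table as a list of dicts rather than a data frame; the `id_key` parameter and the exact function signature are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationColumn:
    """Per-protein output column (field names follow the commit message)."""
    name: str
    kind: str  # "categorical" or "numeric"
    values: dict = field(default_factory=dict)  # identifier -> value


def merge_annotation_columns(rows, columns, id_key="identifier"):
    """Left-join computed columns onto annotation rows by identifier.

    Proteins absent from a column get None (no value) rather than being dropped.
    """
    merged = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not mutated
        for col in columns:
            new_row[col.name] = col.values.get(row[id_key])
        merged.append(new_row)
    return merged


annotations = [{"identifier": "P1"}, {"identifier": "P2"}, {"identifier": "P3"}]
cluster = AnnotationColumn("cluster_umap", "categorical",
                           {"P1": "cluster 0", "P2": "cluster 1"})
silhouette = AnnotationColumn("silhouette_umap", "numeric",
                              {"P1": 0.71, "P2": 0.42})
result = merge_annotation_columns(annotations, [cluster, silhouette])
```

Here `P3` keeps its row but has no cluster or silhouette value, which mirrors the "absent proteins get no value" rule.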
Phase 2A.4 of route-projection-statistics. Generate a full LegendPersistedSettings envelope per cluster-membership column so clusters are colored when selected with no manual styling step.

- carriage.build_cluster_legend_settings: for each categorical AnnotationColumn build a complete envelope the frontend's sanitizeLegendSettingsEntry accepts: maxVisibleValues / shapeSize / sortMode / hiddenValues / enableDuplicateStackUI / selectedPaletteId + categories keyed by the exact label with a Kelly-palette color, zOrder and shape. Numeric (silhouette) columns keep the default ramp.
- prepare path: BaseProcessor.save_output gains settings=; the pipeline builds the cluster styles from the report and writes them into the bundle's settings part.
- prep path: `protspace stats --settings-out <json>` writes the styles; `protspace bundle --settings <json>` folds them into the settings part.

Tests: envelope validity (every required field/type + distinct palette colors); end-to-end stats --settings-out -> bundle --settings styles clusters in the settings part.

Deferred (follow-up): preserving the generated cluster styles across a later `protspace style` rewrite (replace_settings_in_bundle), a rare re-style path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Routing is implemented across this PR and protspace_web#295. Where each statistic lands:
Computed columns appear in color-by automatically, grouped under a new "Statistics" section. protspace stats handles carriage: faithfulness into projections_metadata, per-protein columns into the annotations parquet (-a), and the cluster legend styles via --settings-out for bundle --settings. The prep pipeline picks all of it up. Design + phased plan: the route-projection-statistics openspec change in the web repo. Both PRs green. Defaults chosen:
Deferred (not blocking):
Sequencing: I folded the routing directly into #61/#295 rather than landing the opaque fifth part first and stacking. Can split the per-protein annotations into a separate PR if that's easier to review.
Summary
Adds a `protspace.stats` subsystem that computes per-projection quality statistics at preparation time and bakes them into the `.parquetbundle` as an optional fifth part. This is the engine half of the projection-statistics MVP (tracking issue: tsenoner/protspace_web#219; related: #31). The `protspace_web` PR that consumes it follows separately; this PR lands first.

Today the pipeline (`embed → project → annotate → bundle`) ships coordinates with zero quality metrics, so judging a projection is purely visual. This change answers two questions per projection:

- `cluster_validity`: KMeans with a distance-to-chord elbow estimate of K, scored by silhouette, Davies–Bouldin, Calinski–Harabasz on the projection coords.
- `faithfulness`: kNN-overlap and trustworthiness / continuity between the source embedding and the projection, i.e. how much the reduction distorted the neighbourhood structure.
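To make the faithfulness family concrete, here is a minimal sketch rather than the PR's implementation. The function names are hypothetical, the kNN search is a brute-force Euclidean one, and continuity is computed as trustworthiness with the two spaces swapped, which is a common convention and an assumption about what this PR does. The scikit-learn import is kept function-local, mirroring the pattern described for the package.

```python
import numpy as np


def clamp_k(k, n):
    """Largest usable k <= requested k, given that sklearn's trustworthiness
    requires n_neighbors < n / 2. May return 0 for tiny n; callers should skip."""
    return min(k, (n - 1) // 2)


def knn_indices(x, k):
    """Brute-force Euclidean k nearest neighbours per row, excluding self."""
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbour
    return np.argsort(dists, axis=1)[:, :k]


def knn_overlap(high, low, k):
    """Mean fraction of each point's k neighbours shared by both spaces."""
    k = clamp_k(k, len(high))
    if k < 1:
        raise ValueError("too few points for a kNN comparison")
    hi, lo = knn_indices(high, k), knn_indices(low, k)
    shared = [len(set(a.tolist()) & set(b.tolist())) for a, b in zip(hi, lo)]
    return float(np.mean(shared)) / k


def trust_and_continuity(high, low, k):
    """Trustworthiness and continuity via scikit-learn (imported lazily)."""
    from sklearn.manifold import trustworthiness

    k = clamp_k(k, len(high))
    trust = trustworthiness(high, low, n_neighbors=k)
    cont = trustworthiness(low, high, n_neighbors=k)  # roles swapped
    return trust, cont
```

An identity projection should score a kNN-overlap of 1.0, while an unrelated random projection should score far lower, which is the kind of known-answer check the test suite describes.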
What's in this PR
- `src/protspace/stats/`: a generalized `Statistic` contract (each statistic declares whether it needs the embedding and returns one or more `StatRow`s) + a lazy `STATISTICS` registry mirroring the existing `REDUCERS` pattern. sklearn imports stay function-local to preserve CLI startup.
- `stats/cluster/kmeans_elbow.py`: KMeans sweep K ∈ [2, min(round(√n), 50)], elbow via perpendicular deviation from the first→last inertia chord (argmax index → K).
- `stats/metrics/validity.py`: silhouette (seeded sample above threshold) / DB / CH at that K.
- `stats/metrics/faithfulness.py`: kNN-overlap@k and trustworthiness/continuity vs the embedding.
- `stats/driver.py`: `compute_statistics(...)` iterating registered statistics per projection, isolating per-statistic and per-projection failures (a bad reduction is logged and skipped, never sinks the report).
- Tidy long-format table (8 cols: `space_kind, space_name, stat_family, label_kind, metric, metric_kind, value, extra_json`). New statistics add rows, not columns.
- Bundle I/O (`data/io/bundle.py`): an optional fifth part `statistics.parquet`. Layout `core(3) + settings? + statistics?`; when statistics is present without settings, a zero-byte settings slot keeps the fifth position unambiguous. `write_bundle` / `read_bundle` / `extract_bundle_to_dir` and `replace_settings_in_bundle` are all updated, the last so that `protspace style` no longer silently drops a trailing stats part.
- Wiring: `ReductionPipeline.run` (the one stage holding embeddings and projections) computes stats behind `prepare --stats/--no-stats`; new `protspace stats -i emb.h5 -p project_dir -o statistics.parquet` for the discrete path; `bundle -s/--statistics` folds a stats parquet in.
- … `manifold.trustworthiness`) is already a core dep.
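The elbow rule listed for `kmeans_elbow.py` can be sketched in a few lines. This is an illustration under assumptions, not the module's code: `candidate_ks` and `elbow_k` are hypothetical names, the inertias are assumed to come from a KMeans sweep that has already run, and the sketch measures distance on raw (K, inertia) values because any axis rescaling the real implementation may apply is not stated here.

```python
import math


def candidate_ks(n, cap=50):
    """K values to sweep: 2 .. min(round(sqrt(n)), cap), inclusive."""
    upper = min(round(math.sqrt(n)), cap)
    return list(range(2, upper + 1))


def elbow_k(ks, inertias):
    """Pick the K whose (K, inertia) point lies farthest from the chord
    joining the first and last points of the sweep."""
    x0, y0 = ks[0], inertias[0]
    x1, y1 = ks[-1], inertias[-1]
    dx, dy = x1 - x0, y1 - y0
    chord_len = math.hypot(dx, dy)
    if chord_len == 0:
        return ks[0]  # degenerate sweep: a single candidate
    # Perpendicular distance from each point to the line through the endpoints.
    dists = [abs(dy * (x - x0) - dx * (y - y0)) / chord_len
             for x, y in zip(ks, inertias)]
    return ks[dists.index(max(dists))]


# Inertia drops sharply up to K=4, then flattens.
best = elbow_k([2, 3, 4, 5, 6], [100.0, 40.0, 15.0, 12.0, 10.0])
```

For the sample sweep the distances peak at K=4, so `best` is 4. The endpoints always have distance zero, so the rule can only select an interior K unless the sweep is degenerate.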
Robustness hardening (post adversarial review)
This branch was put through a multi-agent review; the confirmed findings are fixed here:

- `trustworthiness` requires `n_neighbors < n/2` (strict); k is now clamped to `(n-1)//2` instead of `n-2`.
- `cluster_validity` scores the full projection; `faithfulness` scores the embedding-aligned subset. Previously clustering could be scored on the id-intersection subset only.
- … `default_metric`) so faithfulness uses the run's metric rather than defaulting to euclidean for PCA/MDS/PaCMAP.
- … source column maps each projection to its own embedding.
- `n_clusters` reports the achieved distinct-cluster count (KMeans can collapse on coincident points), keeping the requested K in `extra`.

Tests
`tests/test_stats.py`, `tests/test_stats_bundle.py`, `tests/test_stats_cli.py`: known-answer numeric fixtures (blob separation; faithfulness on identity vs random projections; label-permutation alignment), the 8-column schema, the 5-part bundle round-trip, and the `protspace style` round-trip.

Scope (MVP) & non-goals
In: per-projection `cluster_validity` (unsupervised/elbow) + `faithfulness`, baked at prep time, carried in the bundle. Explicitly out (non-breaking future expansions): embedding-space cluster-validity, annotation-feature label sources, on-demand recompute, the broader `ProtSpaceExtractor` pair/edge/set analyses (future typed bundle parts), and frontend rendering. The registry + long-format table leave seams for the scalar expansions.
Data-format change: additive, backward compatible. Existing 3- and 4-part bundles read and write unchanged.
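The positional layout `core(3) + settings? + statistics?` can be illustrated with a small sketch. It models a bundle purely as an ordered list of byte strings and deliberately ignores how parts are delimited on disk, which this description does not specify; `bundle_parts` and `split_parts` are hypothetical helpers, not functions from `data/io/bundle.py`.

```python
def bundle_parts(core, settings=None, statistics=None):
    """Assemble the ordered part list: 3 core parts, then optional extras.

    When statistics is present without settings, a zero-byte placeholder
    fills the fourth slot so statistics always sits in the fifth position.
    """
    if len(core) != 3:
        raise ValueError("a bundle has exactly three core parts")
    parts = list(core)
    if settings is not None or statistics is not None:
        parts.append(settings if settings is not None else b"")
    if statistics is not None:
        parts.append(statistics)
    return parts


def split_parts(parts):
    """Inverse of bundle_parts; an empty fourth slot reads back as no settings."""
    core = parts[:3]
    settings = parts[3] if len(parts) > 3 and parts[3] else None
    statistics = parts[4] if len(parts) > 4 else None
    return core, settings, statistics


core = [b"annotations", b"projections_metadata", b"projections_data"]
stats_only = bundle_parts(core, statistics=b"stats")
```

Under this model a 3-part or 4-part list round-trips exactly as before, and `stats_only` has five entries with an empty fourth slot, which is why older bundles need no migration.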
Refs tsenoner/protspace_web#219, #31