[FEATURE] Cluster Validation Metrics & Automated Group Detection #31

Description

@tsenoner

Problem

ProtSpace visualizes protein embeddings but provides no quantitative measure of how well biological annotations separate in embedding space, nor any way to automatically discover natural groupings.

Proposed Solution

1. Cluster Validation Metrics

Compute standard clustering quality scores for any selected feature/annotation:

  • Silhouette Score — cohesion vs. separation per point (higher = better)
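  • Davies-Bouldin Index (DBI) — avg similarity of each cluster to its most similar cluster, i.e. within-cluster scatter relative to centroid separation (lower = better)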
  • Calinski-Harabasz Index (CH) — between/within variance ratio (higher = better)
  • ARI / NMI (Adjusted Rand Index / Normalized Mutual Information) — agreement between cluster assignments and ground-truth labels (external validation, higher = better)

Key: Metrics should be computed on the original high-dimensional embeddings, not on UMAP/t-SNE projections, which distort distances.
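A minimal sketch of how the scores listed above could be computed with scikit-learn. The function name `cluster_validation_metrics` and the synthetic data are assumptions for illustration only; in ProtSpace the inputs would be the raw embedding matrix and the values of the selected annotation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    normalized_mutual_info_score,
    silhouette_score,
)


def cluster_validation_metrics(embeddings, labels):
    """Internal validation scores for one annotation (hypothetical helper)."""
    return {
        "silhouette": float(silhouette_score(embeddings, labels)),
        "davies_bouldin": float(davies_bouldin_score(embeddings, labels)),
        "calinski_harabasz": float(calinski_harabasz_score(embeddings, labels)),
    }


# Synthetic stand-in for high-dimensional protein embeddings (not real data).
X, annotation = make_blobs(
    n_samples=300, n_features=64, centers=3, cluster_std=1.0, random_state=0
)

# Internal validation: how well does the annotation separate in embedding space?
internal = cluster_validation_metrics(X, annotation)

# External validation: compare an unsupervised clustering to the annotation.
predicted = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
external = {
    "ari": float(adjusted_rand_score(annotation, predicted)),
    "nmi": float(normalized_mutual_info_score(annotation, predicted)),
}

print(internal)
print(external)
```

The internal scores need only embeddings and one label set, while ARI and NMI compare two label sets, so they apply once a clustering exists to compare against the annotation.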

2. Automated Optimal k Detection

  • Elbow Method — inertia vs. k, detect the knee
  • Silhouette Analysis — avg silhouette for k = 2..k_max
  • Gap Statistic — compare to null reference
  • BIC/AIC via GMM — information criterion minimization
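Two of the criteria above, silhouette analysis and BIC via GMM, could be scanned over a k range roughly as follows. This is a sketch under assumptions: `scan_k` is a hypothetical helper, the data is synthetic, and the diagonal covariance is a choice made here to keep the GMM stable in high dimensions. The inertia values are collected as input for an elbow plot, but knee detection and the gap statistic are not shown.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def scan_k(embeddings, k_min=2, k_max=10, seed=0):
    """Collect inertia, mean silhouette and GMM BIC for each k (hypothetical helper)."""
    rows = []
    for k in range(k_min, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
        gmm = GaussianMixture(
            n_components=k, covariance_type="diag", random_state=seed
        ).fit(embeddings)
        rows.append(
            {
                "k": k,
                "inertia": float(km.inertia_),  # input for an elbow plot
                "silhouette": float(silhouette_score(embeddings, km.labels_)),
                "bic": float(gmm.bic(embeddings)),  # lower is better
            }
        )
    return rows


# Synthetic stand-in for embeddings (not real data).
X, _ = make_blobs(
    n_samples=400, n_features=16, centers=4, cluster_std=1.0, random_state=1
)
rows = scan_k(X, k_min=2, k_max=8)

# Silhouette is maximized, BIC is minimized.
best_by_silhouette = max(rows, key=lambda r: r["silhouette"])["k"]
best_by_bic = min(rows, key=lambda r: r["bic"])["k"]
print(best_by_silhouette, best_by_bic)
```

The two criteria need not agree, so reporting both alongside the plots leaves the final choice of k to the user.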

3. Auto-Clustering

Run K-means or GMM with the detected k, or HDBSCAN (which takes no k and infers the number of clusters from density), and store the resulting labels as a new feature column in the parquetbundle for immediate visualization.

CLI Integration

protspace-local -i embeddings.h5 -o out/ --cluster-metrics --feature pfam
protspace-local -i embeddings.h5 -o out/ --auto-cluster --k-range 2:20

Output: metrics as JSON/CSV, elbow/silhouette plots as PNG/SVG, new auto_cluster feature in the parquetbundle.
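One possible shape for the JSON/CSV metric output, sketched with the standard library only. The file names, the long-format CSV layout, and the `write_metrics` helper are assumptions, not an agreed format, and the numbers are placeholders rather than real results.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_metrics(metrics, out_dir, stem="cluster_metrics"):
    """Write {feature: {metric: value}} as JSON and long-format CSV (hypothetical)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / f"{stem}.json"
    csv_path = out_dir / f"{stem}.csv"

    json_path.write_text(json.dumps(metrics, indent=2))

    with csv_path.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["feature", "metric", "value"])
        for feature, scores in metrics.items():
            for name, value in scores.items():
                writer.writerow([feature, name, value])
    return json_path, csv_path


# Placeholder values for illustration only, not measured results.
example = {
    "pfam": {"silhouette": 0.25, "davies_bouldin": 1.5, "calinski_harabasz": 100.0}
}

with tempfile.TemporaryDirectory() as tmp:
    json_path, csv_path = write_metrics(example, tmp)
    loaded = json.loads(json_path.read_text())
    csv_lines = csv_path.read_text().splitlines()

print(loaded)
print(csv_lines)
```

A long-format CSV (one row per feature and metric) stays stable when more metrics are added later, whereas a wide layout would change its header.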

Implementation Notes

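  • All validation metrics (silhouette, DBI, CH, ARI, NMI) are available in sklearn.metrics (scikit-learn is already a dependency); knee detection and the gap statistic are not part of scikit-learn and would need custom code or an extra dependency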
  • HDBSCAN via sklearn.cluster.HDBSCAN (sklearn ≥ 1.3)
  • Web UI integration (metrics panel, interactive plots, auto-cluster button) → separate protspace_web issue
