Problem
ProtSpace visualizes protein embeddings but provides no quantitative measure of how well biological annotations separate in embedding space, nor any way to automatically discover natural groupings.
Proposed Solution
1. Cluster Validation Metrics
Compute standard clustering quality scores for any selected feature/annotation:
- Silhouette Score — cohesion vs. separation per point (higher = better)
- Davies-Bouldin Index (DBI) — avg cluster overlap (lower = better)
- Calinski-Harabasz Index (CH) — between/within variance ratio (higher = better)
- ARI / NMI — agreement with ground-truth labels (external validation)
Key: Metrics should be computed on the original high-dimensional embeddings, not on UMAP/t-SNE projections, which distort distances.
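A minimal sketch of how the scores listed above could be computed with scikit-learn. The function names `annotation_separation` and `labeling_agreement` are hypothetical, and the synthetic `make_blobs` data only stands in for real embeddings and annotation labels; it is not ProtSpace's loader or API.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    normalized_mutual_info_score,
    silhouette_score,
)


def annotation_separation(embeddings, labels):
    """Internal validation: how well one annotation separates in embedding space."""
    return {
        "silhouette": float(silhouette_score(embeddings, labels)),
        "davies_bouldin": float(davies_bouldin_score(embeddings, labels)),
        "calinski_harabasz": float(calinski_harabasz_score(embeddings, labels)),
    }


def labeling_agreement(reference_labels, cluster_labels):
    """External validation: agreement between two labelings of the same proteins."""
    return {
        "ari": float(adjusted_rand_score(reference_labels, cluster_labels)),
        "nmi": float(normalized_mutual_info_score(reference_labels, cluster_labels)),
    }


# Synthetic stand-in (assumption): 300 "proteins", 64-dim embeddings, 3 "families".
X, y = make_blobs(n_samples=300, n_features=64, centers=3, random_state=0)

internal = annotation_separation(X, y)   # computed on the raw embeddings
external = labeling_agreement(y, y)      # identical labelings, so perfect agreement
```

The internal scores need the embedding matrix, while ARI and NMI only compare two label vectors, so the two groups are kept in separate functions here.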
2. Automated Optimal k Detection
- Elbow Method — inertia vs. k, detect the knee
- Silhouette Analysis — avg silhouette for k = 2..k_max
- Gap Statistic — compare to null reference
- BIC/AIC via GMM — information criterion minimization
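The scan over k could look roughly like the sketch below, which records K-means inertia (for an elbow plot), the average silhouette, and the GMM BIC for each k. The function name `scan_k`, the diagonal covariance choice, and the synthetic data are assumptions for illustration. Knee detection and the gap statistic are left out, since scikit-learn has no built-in function for either.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def scan_k(embeddings, k_min=2, k_max=10, seed=0):
    """Collect per-k statistics that the selection methods above rely on."""
    rows = []
    for k in range(k_min, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
        # Diagonal covariances (assumption) keep the parameter count manageable
        # for high-dimensional embeddings.
        gmm = GaussianMixture(
            n_components=k, covariance_type="diag", random_state=seed
        ).fit(embeddings)
        rows.append(
            {
                "k": k,
                "inertia": float(km.inertia_),  # y-axis of the elbow plot
                "silhouette": float(silhouette_score(embeddings, km.labels_)),
                "bic": float(gmm.bic(embeddings)),  # lower is better
            }
        )
    return rows


# Synthetic stand-in (assumption) for real embeddings.
X, _ = make_blobs(n_samples=300, n_features=16, centers=4, random_state=0)

scan = scan_k(X, k_min=2, k_max=8)
best_k_silhouette = max(scan, key=lambda r: r["silhouette"])["k"]
best_k_bic = min(scan, key=lambda r: r["bic"])["k"]
```

The criteria can disagree on real data, so reporting the full table alongside the chosen k is probably more useful than returning a single number.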
3. Auto-Clustering
Run K-means or GMM with the detected k, or HDBSCAN (which infers the number of clusters itself and takes no k), and store the results as a new feature column in the parquetbundle for immediate visualization.
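As a rough illustration of the storage step, the sketch below clusters with K-means and attaches the labels to an annotation table as a new column. The parquetbundle layout is not described here, so a plain pandas DataFrame with a hypothetical `identifier` column stands in for it, and `add_auto_cluster` is an illustrative helper, not an existing ProtSpace function.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-ins (assumption): embeddings plus a minimal annotation table.
X, _ = make_blobs(n_samples=120, n_features=32, centers=3, random_state=1)
annotations = pd.DataFrame({"identifier": [f"P{i:05d}" for i in range(len(X))]})


def add_auto_cluster(table, embeddings, k, column="auto_cluster", seed=0):
    """Return a copy of `table` with K-means labels stored as a categorical feature."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
    out = table.copy()
    # Stored as strings so the column reads as a categorical annotation
    # rather than a numeric value.
    out[column] = pd.Series(labels, index=out.index).astype(str)
    return out


clustered = add_auto_cluster(annotations, X, k=3)
```

Row order in the table must match row order in the embedding matrix; a real implementation would presumably align the two by protein identifier first.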
CLI Integration
protspace-local -i embeddings.h5 -o out/ --cluster-metrics --feature pfam
protspace-local -i embeddings.h5 -o out/ --auto-cluster --k-range 2:20
Output: metrics as JSON/CSV, elbow/silhouette plots as PNG/SVG, new auto_cluster feature in the parquetbundle.
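One way the proposed flags might be parsed is sketched below with `argparse`. This is a standalone illustration, not ProtSpace's actual argument parser: the `parse_k_range` helper, the `dest` names, and the `MIN:MAX` validation rules are assumptions.

```python
import argparse


def parse_k_range(text):
    """Parse a 'MIN:MAX' string such as '2:20' into an inclusive (k_min, k_max) pair."""
    try:
        lo_text, hi_text = text.split(":")
        k_min, k_max = int(lo_text), int(hi_text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected MIN:MAX, got {text!r}")
    if k_min < 2 or k_max < k_min:
        raise argparse.ArgumentTypeError("need 2 <= MIN <= MAX")
    return k_min, k_max


def build_parser():
    parser = argparse.ArgumentParser(prog="protspace-local")
    parser.add_argument("-i", dest="input", required=True)
    parser.add_argument("-o", dest="output", required=True)
    # Proposed flags from this issue.
    parser.add_argument("--cluster-metrics", action="store_true")
    parser.add_argument("--feature")
    parser.add_argument("--auto-cluster", action="store_true")
    parser.add_argument("--k-range", type=parse_k_range, default=(2, 20))
    return parser


args = build_parser().parse_args(
    ["-i", "embeddings.h5", "-o", "out/", "--auto-cluster", "--k-range", "2:20"]
)
```

Validating the range at parse time means a malformed `--k-range` fails before any embeddings are loaded.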
Implementation Notes
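- All validation metrics and GMM-based BIC/AIC are available via scikit-learn (already a dependency); the gap statistic and knee detection are not built in and would need a small custom implementation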
- HDBSCAN via sklearn.cluster.HDBSCAN (sklearn ≥ 1.3)
- Web UI integration (metrics panel, interactive plots, auto-cluster button) → separate protspace_web issue