Problem
Duplicate embeddings can end up at different positions in the UMAP projection. Because UMAP treats every input row as its own point, randomly initializing it and then moving it during optimization, identical inputs aren't guaranteed to map to the same location in the dimensionality-reduced (DR) space.
Proposed fix
- Compute UMAP only on the set of unique embeddings.
- Map each duplicate to the projected position of its matching unique embedding.
- Add a UMAP parameter flag to toggle this behavior on/off (default: off, to preserve current behavior).
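The three steps above can be sketched as a thin wrapper around whatever projection routine is already in use. This is only an illustration under stated assumptions: `project_with_dedup`, the `dedup` flag name, and the `fit_transform` callable are hypothetical names, not part of any existing API, and `np.unique` is used here for brevity even though it sorts the rows rather than hashing them.

```python
import numpy as np

def project_with_dedup(embeddings, fit_transform, dedup=False):
    """Project embeddings, optionally forcing duplicates onto one point.

    `fit_transform` is any callable mapping an (n, d) array to an (n, k)
    array. `dedup` is a hypothetical name for the proposed flag.
    """
    embeddings = np.asarray(embeddings)
    if not dedup:
        # Default (flag off): current behavior, every row is projected.
        return np.asarray(fit_transform(embeddings))

    # unique_rows[inverse] reconstructs the original array row by row.
    unique_rows, inverse = np.unique(embeddings, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # the inverse shape has varied across NumPy versions

    # Project only the unique rows, then copy each position to its duplicates.
    projected_unique = np.asarray(fit_transform(unique_rows))
    return projected_unique[inverse]
```

Assuming the umap-learn package, the callable could be something like `umap.UMAP().fit_transform`; with the flag on, rows that are exactly equal share one projected position, and the output keeps the original row order.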
Requirements
Deduplication must scale: it should stay fast even at around 1M embeddings. Hashing each embedding (for example, hashing the byte representation of the array) should allow deduplication in roughly linear time, instead of the quadratic cost of pairwise comparison.
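A minimal sketch of the hashing idea, assuming the embeddings are a 2-D NumPy array: each row's raw bytes serve as a dictionary key, so every row is hashed and looked up once. The function name `dedup_by_bytes` is hypothetical. Two caveats: only bit-identical rows are merged (for instance, `0.0` and `-0.0` have different bytes), and whether a Python-level loop is fast enough at 1M rows would need to be measured.

```python
import numpy as np

def dedup_by_bytes(embeddings):
    """Return (unique_rows, inverse) such that unique_rows[inverse]
    reproduces the input. Rows are matched on their raw bytes."""
    embeddings = np.ascontiguousarray(embeddings)
    first_seen = {}  # row bytes -> index into the unique set
    keep = []        # original index of the first occurrence of each row
    inverse = np.empty(len(embeddings), dtype=np.intp)

    for i, row in enumerate(embeddings):
        key = row.tobytes()  # the dict hashes these bytes
        j = first_seen.get(key)
        if j is None:
            # First time this exact row appears: register it as unique.
            j = len(keep)
            first_seen[key] = j
            keep.append(i)
        inverse[i] = j

    return embeddings[keep], inverse
```

Unique rows come back in order of first appearance, and `inverse` is the lookup table needed to copy each projected position back to its duplicates.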