Duplicate embeddings not collapsed to same point in UMAP projection #53

**Problem**
Duplicate embeddings end up at different positions in the UMAP projection. Because UMAP gives each point its own initial position and then moves the points with stochastic updates, identical inputs aren't guaranteed to map to the same location in the dimensionality-reduced (DR) space.

**Proposed fix**
1. Compute UMAP only on the set of unique embeddings.
2. Map each duplicate to the projected position of its matching unique embedding.
3. Add a UMAP parameter flag to toggle this behavior on/off (default: off, to preserve current behavior).
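
A minimal sketch of the three steps above, assuming NumPy arrays as input. The names `project_with_dedup`, `project_fn`, and `collapse_duplicates` are hypothetical and not part of any existing API; `project_fn` stands in for whatever callable runs UMAP in this project, and the demo uses a random projector only to mimic UMAP's non-determinism.

```python
import numpy as np


def project_with_dedup(embeddings, project_fn, collapse_duplicates=False):
    """Project embeddings, optionally forcing duplicates onto one point.

    project_fn is a hypothetical stand-in for the UMAP call: it takes an
    (n, d) array and returns an (n, k) array of projected coordinates.
    """
    embeddings = np.asarray(embeddings)
    if not collapse_duplicates:
        # Default: current behavior, every row is projected independently.
        return np.asarray(project_fn(embeddings))

    # Step 1: project only the unique rows.
    unique_rows, inverse = np.unique(embeddings, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    unique_proj = np.asarray(project_fn(unique_rows))

    # Step 2: every original row takes the position of its unique representative.
    return unique_proj[inverse]


# Demo with a random projector (assumption: mimics a stochastic layout).
rng = np.random.default_rng(0)


def noisy_projector(x):
    return rng.normal(size=(len(x), 2))


data = np.array([[1.0, 2.0], [3.0, 4.0], [1.0, 2.0]])
coords = project_with_dedup(data, noisy_projector, collapse_duplicates=True)
```

With the flag on, rows 0 and 2 of `coords` are identical because they share one projected point. `np.unique(axis=0)` sorts the rows, so this version is O(n log n) rather than linear.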

**Requirements**
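Deduplication must scale efficiently and stay fast even at ~1M embeddings. Hashing each embedding (e.g., hashing the byte representation of the array) should allow deduplication in roughly linear time, avoiding pairwise comparison. Note that byte-level hashing only catches exact duplicates: embeddings that differ in any bit are treated as distinct.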
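
One possible way to do the hashing, sketched with a plain Python dict keyed on each row's bytes. The function name `dedup_by_bytes` is hypothetical. The dict hashes each key and compares bytes on collision, so matches are exact and the expected cost is linear in the number of rows.

```python
import numpy as np


def dedup_by_bytes(embeddings):
    """Return (unique_idx, inverse) for exact, bitwise-duplicate rows.

    unique_idx: index of the first occurrence of each distinct row.
    inverse:    for every row, the position of its representative in unique_idx.
    """
    embeddings = np.ascontiguousarray(embeddings)
    first_seen = {}   # row bytes -> position in unique_idx
    unique_idx = []
    inverse = np.empty(len(embeddings), dtype=np.intp)

    for i, row in enumerate(embeddings):
        key = row.tobytes()          # byte representation of one embedding
        j = first_seen.get(key)
        if j is None:
            j = len(unique_idx)
            first_seen[key] = j
            unique_idx.append(i)
        inverse[i] = j

    return np.asarray(unique_idx, dtype=np.intp), inverse


data = np.array([[1.0, 2.0], [3.0, 4.0], [1.0, 2.0], [5.0, 6.0], [3.0, 4.0]])
unique_idx, inverse = dedup_by_bytes(data)
```

Here `data[unique_idx]` is the set to pass to UMAP, and indexing the projected result with `inverse` restores one position per original row. Because the comparison is bitwise, `0.0` and `-0.0` count as different values.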


