Duplicate embeddings not collapsed to same point in UMAP projection #53

Description

@tsenoner

Problem
Duplicate embeddings end up at different positions in the UMAP projection. UMAP treats each input row as a separate point, and its initialization and layout optimization involve randomness, so identical inputs aren't guaranteed to map to the same location in DR space.

Proposed fix

  1. Compute UMAP only on the set of unique embeddings.
  2. Map each duplicate to the projected position of its matching unique embedding.
  3. Add a UMAP parameter flag to toggle this behavior on/off (default: off, to preserve current behavior).
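
A minimal sketch of the three steps, not an actual implementation from this repo. The function name `project_with_dedup`, the flag name `collapse_duplicates`, and the `project` callable are all hypothetical; `project` stands in for whatever maps an `(n, d)` array to `(n, k)` coordinates.

```python
import numpy as np


def project_with_dedup(embeddings, project, collapse_duplicates=False):
    """Hypothetical wrapper around a projection callable.

    `project` is assumed to map an (n, d) array to (n, k) coordinates.
    """
    embeddings = np.asarray(embeddings)
    if not collapse_duplicates:
        # Step 3 (default off): keep current behavior, project every row.
        return np.asarray(project(embeddings))

    # Step 1: keep one representative per distinct row.
    # `inverse[i]` is the index in `unique_rows` of the row equal to embeddings[i].
    unique_rows, inverse = np.unique(embeddings, axis=0, return_inverse=True)

    # Step 2: project only the unique rows, then copy each projected
    # position back to every row that shares the same embedding.
    coords = np.asarray(project(unique_rows))
    return coords[inverse.reshape(-1)]
```

With umap-learn, `project` could be something like `umap.UMAP(n_components=2).fit_transform`. `np.unique(..., axis=0)` sorts the rows, so this version is fine for illustrating the flow but is not the linear-time hashing approach asked for under Requirements.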

Requirements
Deduplication must scale efficiently and stay fast even at ~1M embeddings. Hashing each embedding (e.g., hashing the byte representation of the array) should allow deduplication in roughly linear time, avoiding pairwise comparisons.
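
One way the hashing idea could look, as a sketch under assumptions: the helper name `dedupe_by_hash` is hypothetical, rows are assumed to expose `.tobytes()` (as NumPy arrays do) and to share one dtype and shape, and a 16-byte BLAKE2b digest is assumed to be collision-free in practice.

```python
import hashlib


def dedupe_by_hash(embeddings):
    """Group identical rows in a single pass over the data.

    Returns (first_indices, inverse):
      first_indices[k] is the index of the first row with the k-th distinct content,
      inverse[i] is the position in first_indices of the row matching row i.
    """
    seen = {}           # digest -> position in first_indices
    first_indices = []
    inverse = []
    for i, row in enumerate(embeddings):
        # Hash the raw bytes of the row; equal bytes give equal digests.
        digest = hashlib.blake2b(row.tobytes(), digest_size=16).digest()
        pos = seen.get(digest)
        if pos is None:
            # First time this content appears: register a new unique row.
            pos = len(first_indices)
            seen[digest] = pos
            first_indices.append(i)
        inverse.append(pos)
    return first_indices, inverse
```

Each row is hashed once and looked up in a dict, so the cost grows roughly linearly with the number of embeddings. Because the comparison is byte-exact, values that compare equal numerically but differ in bytes (for example `0.0` and `-0.0`) count as distinct. Storing only the digest keeps memory use small compared with keeping every row's bytes as a key.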
