feat: Semantic cluster merging via pgvector #4

@leo-aa88

Summary

Use pgvector embeddings to merge or relate clusters that are semantically similar but differ in text (for example, different templates or minor wording changes), improving the quality of explain/timeline/compare output.
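As a rough illustration of how candidate pairs could be pulled from pgvector, the sketch below builds a self-join query using pgvector's cosine distance operator `<=>` (similarity is taken as `1 - distance`). The table name `clusters`, the columns `id` and `embedding`, and the helper `candidate_pairs_query` are assumptions made for this example, not names from the project.

```python
from typing import Dict, Tuple


def candidate_pairs_query(min_similarity: float) -> Tuple[str, Dict[str, float]]:
    """Build a parameterized query listing cluster pairs above a similarity floor.

    Assumes a hypothetical table `clusters(id, embedding vector)`.
    """
    if not 0.0 <= min_similarity <= 1.0:
        raise ValueError("min_similarity must be between 0 and 1")

    sql = """
        SELECT a.id AS left_id,
               b.id AS right_id,
               1 - (a.embedding <=> b.embedding) AS similarity
        FROM clusters AS a
        JOIN clusters AS b ON a.id < b.id          -- each unordered pair once
        WHERE a.embedding IS NOT NULL
          AND b.embedding IS NOT NULL
          AND 1 - (a.embedding <=> b.embedding) >= %(min_similarity)s
        ORDER BY similarity DESC, left_id, right_id -- stable, repeatable order
    """
    return sql, {"min_similarity": min_similarity}
```

With a psycopg-style cursor this would be run as `cur.execute(*candidate_pairs_query(0.9))`. The explicit `ORDER BY` is there so that repeated runs over the same data return pairs in the same order.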

Motivation

  • README Roadmap lists semantic cluster merging.
  • Fingerprinting alone can split one incident across multiple clusters when normalization does not fully align.

Scope (proposal)

  • Define the merge policy: when to merge vs. link (similarity threshold, minimum cluster size, and whether this runs only as post-ingest analysis).
  • Add a pipeline hook after clustering, or an optional analysis pass for explain / clusters.
  • Respect determinism goals: document the behavior when embeddings are disabled.
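One possible shape for the merge-vs-link decision is sketched below. It is only a starting point for discussion: `MergePolicy`, `Cluster`, `decide`, and the threshold values are all hypothetical, and the choice to return `"none"` when an embedding is missing is just one way to keep behavior deterministic when embeddings are disabled.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional, Sequence


@dataclass(frozen=True)
class MergePolicy:
    # Illustrative values only; real defaults would need tuning on sample data.
    merge_threshold: float = 0.95   # at or above: treat as the same cluster
    link_threshold: float = 0.85    # at or above: keep separate, record a link
    min_cluster_size: int = 2       # smaller clusters are never merged or linked


@dataclass(frozen=True)
class Cluster:
    cluster_id: str
    size: int
    embedding: Optional[Sequence[float]]  # None when embeddings are disabled


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def decide(a: Cluster, b: Cluster, policy: MergePolicy = MergePolicy()) -> str:
    """Return "merge", "link", or "none" for a pair of clusters."""
    # Without embeddings, fall back to fingerprint-only clustering.
    if a.embedding is None or b.embedding is None:
        return "none"
    if min(a.size, b.size) < policy.min_cluster_size:
        return "none"
    similarity = cosine_similarity(a.embedding, b.embedding)
    if similarity >= policy.merge_threshold:
        return "merge"
    if similarity >= policy.link_threshold:
        return "link"
    return "none"
```

Keeping "link" as a separate outcome lets borderline pairs be surfaced in explain output without changing any cluster's counts.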

Acceptance criteria

  • Measurable reduction in near-duplicate clusters on sample data without breaking grounded counts (or with clearly documented semantics if counts change).
  • Tests that do not require a live embedding API where possible (fixtures / mocks).
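A minimal sketch of how both criteria might be tested offline follows. `FixtureEmbedder` returns canned vectors and raises on unknown text, so a test can never reach a live API, and `merge_counts` shows one way to check that merging preserves the total event count. Both names and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping, Sequence, Tuple


class FixtureEmbedder:
    """Stand-in for an embedding client, backed by a fixed text-to-vector table."""

    def __init__(self, vectors: Mapping[str, Sequence[float]]) -> None:
        self._vectors = dict(vectors)

    def embed(self, text: str) -> List[float]:
        # KeyError on unknown text makes a missing fixture fail loudly.
        return list(self._vectors[text])


def merge_counts(
    counts: Mapping[str, int], merge_pairs: Iterable[Tuple[str, str]]
) -> Dict[str, int]:
    """Union clusters named in merge_pairs and sum their event counts."""
    parent = {cluster_id: cluster_id for cluster_id in counts}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right in merge_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Keep the lexicographically smaller id so results are deterministic.
            keep, drop = sorted((root_left, root_right))
            parent[drop] = keep

    merged: Dict[str, int] = {}
    for cluster_id, count in counts.items():
        root = find(cluster_id)
        merged[root] = merged.get(root, 0) + count
    return merged


def test_merge_preserves_total_count() -> None:
    counts = {"a": 3, "b": 2, "c": 5}
    merged = merge_counts(counts, [("b", "a")])
    assert merged == {"a": 5, "c": 5}
    assert sum(merged.values()) == sum(counts.values())
```

The test function is written in pytest style but uses only plain assertions, so it runs under any test runner.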

Risks

  • Merging can hide distinct errors if thresholds are wrong; make thresholds configurable, with conservative defaults.

Labels: enhancement
