chore: investigate embeddings or stronger standardization to improve dedupe recall #5

Description

@ericfitz

Summary

Follow-up from the sem-powered code-health toolchain design (docs/superpowers/specs/2026-06-18-sem-code-health-design.md).

The rebuilt dedupe skill detects duplication in two stages: a cheap, deterministic SQL pre-filter over SEM marker descriptions, followed by LLM subagents that verify the survivors. sem does not provide embeddings, and an LLM pass over every entity pair is too expensive at scale, so the pre-filter's recall depends on same-intent descriptions converging lexically. Two mechanisms push descriptions toward convergence: a write-side description standard (intent-not-mechanism, a canonical verb lexicon, canonical domain nouns, abstracted incidentals) and compare-side normalization (stemming, stopword removal, verb canonicalization, then a verb+subject match combined with Jaccard similarity).
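
As a rough illustration of the compare-side normalization described above, here is a minimal Python sketch. It is not the skill's actual implementation (which is a SQL pre-filter): the stopword list, the verb lexicon, the crude suffix-stripping `stem`, the 0.5 threshold, and the assumption that a description leads with its verb followed by its subject are all hypothetical stand-ins.

```python
import re

# Hypothetical tables for illustration only; the real verb lexicon and
# stopword list belong to the design spec and are not reproduced here.
STOPWORDS = {"a", "an", "the", "of", "to", "for", "from", "in", "on", "and", "by", "with"}
RAW_VERB_LEXICON = {"fetch": "get", "retrieve": "get", "load": "get",
                    "remove": "delete", "check": "validate"}


def stem(token: str) -> str:
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    # Drop a trailing "e" so "remove" and "removes" share one stem.
    if token.endswith("e") and len(token) > 3:
        token = token[:-1]
    return token


# Key and value are both stemmed so lookups happen in stem space.
VERB_CANON = {stem(k): stem(v) for k, v in RAW_VERB_LEXICON.items()}


def normalize(description: str) -> list[str]:
    """Stem, drop stopwords, then canonicalize the leading verb."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    kept = [s for s in (stem(t) for t in tokens) if s not in STOPWORDS]
    if kept:
        # Assumes the description starts with its verb.
        kept[0] = VERB_CANON.get(kept[0], kept[0])
    return kept


def jaccard(a: list[str], b: list[str]) -> float:
    """Jaccard similarity over the two token sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def is_candidate(d1: str, d2: str, threshold: float = 0.5) -> bool:
    """Verb+subject gate first, then a Jaccard threshold on the rest."""
    t1, t2 = normalize(d1), normalize(d2)
    if len(t1) < 2 or len(t2) < 2:
        return False
    if t1[:2] != t2[:2]:  # (verb, subject) must match exactly
        return False
    return jaccard(t1, t2) >= threshold
```

With these hypothetical tables, `is_candidate("Fetches the user record from the database", "Retrieve user records")` passes the gate, while `is_candidate("Load user record", "Obtain the user record")` does not, because "obtain" is missing from the lexicon. The second case is the kind of false negative this issue is about.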

This works when descriptions converge, but lexical matching has a recall ceiling: genuinely duplicate entities described with divergent vocabulary can still slip through as false negatives. This issue tracks investigating stronger approaches once the toolchain is in use against a real codebase (the tmi server).

Areas to investigate

  • Embeddings for semantic similarity. Cost/latency of embedding SEM descriptions (or entity bodies via sem_context) and clustering by cosine similarity vs. the current lexical pre-filter. Where would vectors live (local store? extend the dedupe SQLite DB with a vector index)? Does it justify the added dependency?
  • Recall baseline. Measure the lexical pre-filter's actual recall against a hand-labeled duplicate set from tmi, to quantify how many false negatives the current approach produces before investing in embeddings.
  • Stronger standardization as a cheaper alternative to embeddings: an accumulating per-project glossary of canonical domain nouns that sem-annotate consults and grows; a stricter (closed) verb taxonomy; optional structured fields in the marker.
  • Hybrid: keep the lexical pre-filter for high-confidence clusters, and run an embedding recall pass only over the entities that the lexical filter leaves unclustered.
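
To make the hybrid option concrete, here is a minimal Python sketch of an embedding recall pass restricted to the lexical filter's leftovers. Everything here is an assumption for illustration: `vectors` is a hypothetical mapping from entity ID to an embedding produced by some model that is not specified in this issue, the 0.85 threshold is arbitrary, and the sketch compares leftovers only with each other (whether they should also be compared against already-clustered entities is an open question).

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embedding_recall_pass(
    lexical_clusters: list[set[str]],
    vectors: dict[str, list[float]],
    threshold: float = 0.85,  # hypothetical cutoff, to be tuned on labeled data
) -> list[tuple[str, str]]:
    """Return candidate pairs among entities the lexical pass left unclustered."""
    clustered = set().union(*lexical_clusters)
    leftovers = sorted(e for e in vectors if e not in clustered)
    return [
        (a, b)
        for a, b in combinations(leftovers, 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    ]
```

Each returned pair would go to a verifier subagent, the same as a lexical candidate. The pairwise loop is quadratic in the number of leftovers, which is tolerable only if the lexical pass already clusters most entities; a vector index would replace it otherwise.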

Notes

  • Decision context and the current standard live in the design spec's "Description content standard (load-bearing for dedup recall)" section.
  • Recall is the priority here (false negatives are permanent misses; false positives only cost one verifier subagent), so any approach should be evaluated on recall first.
  • Deferred until the base toolchain (sem-annotate, dedupe, sem-auto) is implemented and validated against tmi.
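
The recall-first evaluation described in these notes could be scored along the following lines. This is a sketch under assumptions: the hand-labeled set is taken to be a list of unordered entity-ID pairs, and the function name and return shape are hypothetical. The IDs in the usage note are placeholders, not results from tmi.

```python
def pairwise_recall(
    labeled_pairs: list[tuple[str, str]],
    candidate_pairs: list[tuple[str, str]],
) -> dict:
    """Score a pre-filter's candidate pairs against hand-labeled duplicates.

    Pairs are treated as unordered, so ("a", "b") matches ("b", "a").
    """
    truth = {frozenset(p) for p in labeled_pairs}
    found = {frozenset(p) for p in candidate_pairs}
    if not truth:
        raise ValueError("labeled_pairs must not be empty")
    missed = truth - found
    return {
        # Share of true duplicate pairs the pre-filter surfaced.
        "recall": len(truth & found) / len(truth),
        # False negatives: permanent misses, the costly kind of error.
        "missed": sorted(tuple(sorted(p)) for p in missed),
        # False positives: each costs one verifier subagent call.
        "extra_verifier_calls": len(found - truth),
    }
```

For example, with placeholder labels `[("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]` and candidates `[("b", "a"), ("c", "d"), ("x", "y")]`, the result reports a recall of 0.5, two missed pairs, and one extra verifier call. Running the same scorer on the lexical pre-filter and on any embedding or hybrid variant would give directly comparable numbers.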

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request

