chore: investigate embeddings or stronger standardization to improve dedupe recall #5

## Summary

Follow-up from the sem-powered code-health toolchain design (`docs/superpowers/specs/2026-06-18-sem-code-health-design.md`).

The rebuilt `dedupe` skill detects duplication in two stages: a cheap, deterministic SQL pre-filter over SEM marker descriptions, followed by LLM subagents that verify the survivors. No embeddings are available (sem does not provide them), and an LLM pass over every entity pair is too expensive at scale, so the pre-filter's recall depends on **lexical convergence**: descriptions of same-intent entities must end up sharing vocabulary. Two measures encourage that convergence: a write-side description standard (intent-not-mechanism, a canonical verb lexicon, canonical domain nouns, abstracted incidentals) and compare-side normalization (stem, stopword-drop, verb-canonicalize, then verb+subject match plus Jaccard similarity).
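To make the compare-side steps concrete, here is a minimal Python sketch of one plausible reading of that pipeline. It is not the skill's actual implementation (which runs as SQL): the lexicon, the stopword list, the crude suffix stemmer, the 0.5 threshold, and the assumption that the verb is the first token and the subject the second are all hypothetical placeholders.

```python
import re

# Hypothetical verb lexicon and stopword list; the real ones are defined
# by the description standard in the design spec.
VERB_CANON = {"fetch": "get", "retrieve": "get", "load": "get",
              "verify": "validate", "check": "validate"}
STOPWORDS = {"a", "an", "the", "of", "for", "to", "from", "by", "in",
             "and", "given"}


def stem(token):
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(description):
    """Lowercase, tokenize, drop stopwords, stem, canonicalize the verb."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    tokens = [stem(t) for t in tokens if t not in STOPWORDS]
    if tokens:
        # Assumption: descriptions lead with the verb.
        tokens[0] = VERB_CANON.get(tokens[0], tokens[0])
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two token lists, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def is_candidate_pair(desc_a, desc_b, threshold=0.5):
    """Verb+subject must match, then Jaccard must clear the threshold."""
    ta, tb = normalize(desc_a), normalize(desc_b)
    if len(ta) < 2 or len(tb) < 2:
        return False
    if ta[:2] != tb[:2]:  # assumed positions: verb first, subject second
        return False
    return jaccard(ta, tb) >= threshold
```

Under these assumptions, "Fetches the user record by id" and "Retrieve user records given an id" normalize to the same token list and pass the filter, while a description that leads with a different canonical verb is rejected before Jaccard is computed.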

This works when descriptions converge, but lexical matching has a recall ceiling: genuinely duplicate entities described with divergent vocabulary can still slip through as false negatives. This issue tracks investigating stronger approaches once the toolchain is in use against a real codebase (tmi server).

## Areas to investigate

- [ ] **Embeddings for semantic similarity.** What is the cost/latency of embedding SEM descriptions (or entity bodies via `sem_context`) and clustering by cosine similarity, compared with the current lexical pre-filter? Where would vectors live (a local store, or the dedupe SQLite DB extended with a vector index)? Does the recall gain justify the added dependency?
- [ ] **Measure the lexical pre-filter's actual recall** against a hand-labeled duplicate set from tmi, to quantify how many false negatives the current approach produces before investing in embeddings.
- [ ] **Stronger standardization** as a cheaper alternative to embeddings: an accumulating per-project glossary of canonical domain nouns that `sem-annotate` consults and grows; a stricter (closed) verb taxonomy; optional structured fields in the marker.
- [ ] **Hybrid:** lexical pre-filter for high-confidence clusters + embedding recall pass only for entities that the lexical filter leaves unclustered.
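
As a rough illustration of the hybrid option, the sketch below keeps the lexical pairs as they are and runs a cosine-similarity pass only over entities the lexical filter left unclustered. Everything here is assumed for illustration: the entity ids, the precomputed `vectors` mapping (no embedding source has been chosen), the 0.85 threshold, and the brute-force pairwise comparison, which is quadratic in the number of leftover entities.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_candidates(lexical_pairs, vectors, threshold=0.85):
    """Lexical pairs first, then an embedding pass over unclustered entities.

    lexical_pairs: (entity_id, entity_id) tuples from the lexical pre-filter.
    vectors: hypothetical mapping of entity_id -> embedding vector.
    """
    clustered = {entity for pair in lexical_pairs for entity in pair}
    leftover = sorted(e for e in vectors if e not in clustered)
    extra = [
        (a, b)
        for a, b in combinations(leftover, 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    ]
    return list(lexical_pairs) + extra
```

Every returned pair would still go to a verifier subagent; the embedding pass only widens the candidate set.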

## Notes

- Decision context and the current standard live in the design spec's "Description content standard (load-bearing for dedup recall)" section.
- Recall is the priority here (false negatives are permanent misses; false positives only cost one verifier subagent), so any approach should be evaluated on recall first.
- Deferred until the base toolchain (`sem-annotate`, `dedupe`, `sem-auto`) is implemented and validated against tmi.
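
For the recall-first evaluation, a sketch of the measurement could look like the following. It assumes a hand-labeled set of duplicate pairs and the pre-filter's candidate pairs are both available as pairs of entity ids; the function name and the ids are hypothetical.

```python
def prefilter_recall(labeled_duplicates, prefilter_pairs):
    """Return (recall, missed_pairs) for a pre-filter against labeled truth.

    Pairs are unordered, so ("a", "b") and ("b", "a") count as the same pair.
    """
    truth = {frozenset(p) for p in labeled_duplicates}
    found = {frozenset(p) for p in prefilter_pairs}
    if not truth:
        raise ValueError("labeled duplicate set is empty")
    missed = sorted(tuple(sorted(p)) for p in truth - found)
    return len(truth & found) / len(truth), missed
```

The `missed` list is the set of false negatives, which is the quantity the recall priority above asks about; extra pairs surfaced by the pre-filter do not affect the score.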
