Summary
Follow-up from the sem-powered code-health toolchain design (docs/superpowers/specs/2026-06-18-sem-code-health-design.md).
The rebuilt dedupe skill detects duplication with a cheap deterministic SQL pre-filter over SEM marker descriptions, then verifies survivors with LLM subagents. Because no embeddings are available (sem does not provide them) and an LLM pass over every entity pair is too expensive at scale, the pre-filter relies on lexical convergence of same-intent descriptions to achieve recall. We mitigate this with a write-side description standard (intent-not-mechanism, a canonical verb lexicon, canonical domain nouns, abstracted incidentals) plus compare-side normalization (stem, stopword-drop, verb-canonicalize, verb+subject match + Jaccard).
This helps, but lexical matching has a recall ceiling: genuinely duplicate entities described with divergent vocabulary can still slip through as false negatives. This issue tracks investigating stronger approaches once the toolchain is in use against a real codebase (tmi server).
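To make the compare-side normalization concrete, here is a minimal Python sketch of the stem, stopword-drop, verb-canonicalize, verb+subject match and Jaccard steps. Everything named in it is an assumption for illustration: the stopword list, the verb lexicon, the crude suffix stemmer, the rule that the first token is the verb and the second the subject, and the 0.5 threshold are all hypothetical and not taken from the design spec. The real pre-filter runs in SQL; this only mirrors the logic.

```python
import re

# Hypothetical word lists for illustration; the real lexicon lives in the design spec.
STOPWORDS = {"a", "an", "the", "of", "for", "to", "from", "by", "and", "in", "on", "with"}
# Keys are stems (see stem() below), values are canonical verbs.
CANONICAL_VERBS = {"fetch": "get", "retriev": "get", "load": "get"}


def stem(token: str) -> str:
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    # Drop a trailing "e" so "profile" and "profiles" share one stem.
    if token.endswith("e") and len(token) > 3:
        token = token[:-1]
    return token


def normalize(description: str):
    """Return (verb, subject, token_set) for one marker description."""
    tokens = [stem(t) for t in re.findall(r"[a-z0-9_]+", description.lower())]
    tokens = [t for t in tokens if t not in STOPWORDS]
    if not tokens:
        return "", "", frozenset()
    # Assumption: descriptions are verb-first, so token 0 is the verb.
    verb = CANONICAL_VERBS.get(tokens[0], tokens[0])
    subject = tokens[1] if len(tokens) > 1 else ""
    return verb, subject, frozenset([verb, *tokens[1:]])


def jaccard(a, b) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_candidate(desc_a: str, desc_b: str, threshold: float = 0.5) -> bool:
    """Assumption: a pair survives only if verb+subject match AND Jaccard passes."""
    verb_a, subj_a, set_a = normalize(desc_a)
    verb_b, subj_b, set_b = normalize(desc_b)
    return (verb_a, subj_a) == (verb_b, subj_b) and jaccard(set_a, set_b) >= threshold


# Same intent, convergent vocabulary: the pair reaches the verifier.
print(is_candidate("Fetches the user profile from cache",
                   "Retrieve a user profile from the cache"))      # True
# Plausibly the same intent, divergent vocabulary: a false negative.
print(is_candidate("Fetches the user profile from cache",
                   "Look up account details in the local store"))  # False
```

The second call illustrates the recall ceiling: no amount of stemming or verb canonicalization connects "look up account details" to "get user profile" unless the write-side standard forces both authors onto the same words.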
Areas to investigate
- […] sem_context) and clustering by cosine similarity vs. the current lexical pre-filter. Where would vectors live (local store? extend the dedupe SQLite DB with a vector index)? Does it justify the added dependency?
- […] sem-annotate consults and grows; a stricter (closed) verb taxonomy; optional structured fields in the marker.
Notes
- Decision context and the current standard live in the design spec's "Description content standard (load-bearing for dedup recall)" section.
- Recall is the priority here (false negatives are permanent misses; false positives only cost one verifier subagent), so any approach should be evaluated on recall first.
- Deferred until the base toolchain (sem-annotate, dedupe, sem-auto) is implemented and validated against tmi.
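The recall-first evaluation rule from the Notes could be applied roughly as follows. This is a sketch under stated assumptions: it presumes a hand-labeled set of known duplicate pairs exists for the target codebase, and the function names, entity IDs, and approach names are all hypothetical. Each approach is scored on recall first, with verifier subagent calls as the tie-breaker, since a false positive only costs one verifier call.

```python
def evaluate_prefilter(candidates, known_duplicates):
    """Score one pre-filter: recall against labeled duplicates, plus verifier cost."""
    # Pairs are unordered, so compare them as frozensets.
    cand = {frozenset(pair) for pair in candidates}
    truth = {frozenset(pair) for pair in known_duplicates}
    found = cand & truth
    return {
        "recall": len(found) / len(truth) if truth else 1.0,
        # False negatives: permanent misses, listed for inspection.
        "missed": sorted(tuple(sorted(pair)) for pair in truth - cand),
        # Every candidate pair costs one verifier subagent.
        "verifier_calls": len(cand),
        "wasted_verifier_calls": len(cand - truth),
    }


def pick_best(results):
    """Prefer higher recall; break ties with fewer verifier calls."""
    return max(results, key=lambda name: (results[name]["recall"],
                                          -results[name]["verifier_calls"]))


# Hypothetical labeled duplicates and candidate sets from two approaches.
known = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
results = {
    "lexical": evaluate_prefilter([("a", "b"), ("c", "d"), ("x", "y")], known),
    "alternative": evaluate_prefilter(
        [("b", "a"), ("c", "d"), ("e", "f"), ("g", "h"), ("x", "y"), ("p", "q")], known
    ),
}
print(results["lexical"]["recall"], results["lexical"]["missed"])
print(pick_best(results))
```

In this toy data the alternative approach doubles the verifier calls but recovers both missed pairs, so it wins under a recall-first ordering.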