Skip to content

fix(embedding-search): guard cosineSimilarity against mismatched-length vectors#451

Open
tirth8205 wants to merge 2 commits into
Egonex-AI:mainfrom
tirth8205:fix/embedding-search-cosine-length
Open

fix(embedding-search): guard cosineSimilarity against mismatched-length vectors#451
tirth8205 wants to merge 2 commits into
Egonex-AI:mainfrom
tirth8205:fix/embedding-search-cosine-length

Conversation

@tirth8205

Copy link
Copy Markdown
Contributor

Problem

  • cosineSimilarity loops only over a.length and dereferences b[i] without verifying the lengths match. Its docstring promises a defined numeric result ('Returns 0 if either vector has zero magnitude'), but the function silently returns garbage when the two vectors differ in length. I verified with the exact code: cosineSimilarity([1,0,0,0.5],[1,0,0]) returns NaN (because b[3] is undefined, so a[3]*b[3] is NaN), and cosineSimilarity([1,0,0],[1,0,0,5]) returns 1 (the extra dimension of b is silently ignored, overstating similarity). In SemanticSearchEngine.search, a NaN similarity then fails the 'similarity >= threshold' check and the node is silently dropped from results with no error — so a single mis-sized stored embedding (e.g. one produced by a different embedding model/version, which is realistic when an embeddings map is loaded from disk and reused across runs) makes that node…

Fix

  • Guard against length mismatch at the top of the function so the contract is honoured (return 0 for incompatible vectors instead of NaN): if (a.length !== b.length) return 0; placed before the loop. (Optionally, the caller in SemanticSearchEngine.search could skip such embeddings, but returning 0 keeps the pure function's numeric contract intact.)

Testing

Adds unit test(s) that fail before the change and pass after. The full core test suite, eslint, and tsc --noEmit all pass locally on this branch.

Found via a static correctness audit of the semantic/embedding search.

🤖 Generated with Claude Code

…th vectors

cosineSimilarity looped over a.length and dereferenced b[i] without
checking the lengths match, so mismatched vectors returned NaN (extra
dims in a) or silently overstated similarity (extra dims in b). A NaN
score then failed the threshold check in SemanticSearchEngine.search,
silently dropping the node. Guard with an early 'a.length !== b.length'
return 0 to honour the documented numeric contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@thejesh23 thejesh23 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few concerns.

1. Silent guard hides the upstream bug. A length mismatch means an embedding in the map was produced by a different model/dimension than queryEmbedding — that is a corrupted index, not a normal data point. Returning 0 turns it into "this node is just dissimilar," so it silently disappears from results forever and nobody notices the index is stale. At minimum a one-time console.warn (or a counter on the engine) the first time it fires would surface the real problem; otherwise a user re-running search after a model upgrade gets quietly degraded recall with no signal.

2. Root cause not addressed in SemanticSearchEngine. The realistic source of mismatch (per the PR body) is embeddings persisted from a prior run/model being loaded alongside a fresh queryEmbedding. search() is the natural place to detect this — e.g. record the expected dimension on first call / from the constructor and reject or skip non-conforming stored embeddings up front, instead of paying a .length check inside the hot cosineSimilarity loop for every node on every query. The pure-function guard is fine as a belt; the engine-level check is the suspenders.

3. Test gap: the search() path itself. Both new asserts are on cosineSimilarity directly. There's no test that SemanticSearchEngine.search() with one mis-sized stored embedding (a) doesn't throw, (b) drops only that node, and (c) still returns the correctly-sized neighbours in score order. That's the actual user-visible behavior this PR is defending, and the 1 - similarity scoring + >= threshold interaction (mis-sized now scores 1, i.e. worst, which is the intended outcome but worth pinning).

Nit: the JSDoc still only documents the zero-magnitude case — worth adding "...or if the vectors have different lengths" so the next reader doesn't re-discover this.

Document and test the length-mismatch guard in cosineSimilarity:
- Expand JSDoc to cover the different-lengths case (returns 0).
- Comment why a stale/corrupt-index mismatch is intentionally swallowed
  as 0 in the pure helper, with a TODO to surface it engine-side
  (warn-once / skipped-count) as a follow-up.
- Add SemanticSearchEngine.search() tests: one mis-sized stored
  embedding does not throw, correctly-sized neighbours return in score
  order, the mismatched node ranks last at the default threshold (0)
  and is filtered out under a positive threshold, pinning the
  `1 - similarity` + `>= threshold` interaction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tirth8205

Copy link
Copy Markdown
Contributor Author

Addressed point by point.

3 (test gap, search() path): Added three SemanticSearchEngine.search() tests that build an engine with one mis-sized stored embedding (n2, length 3 vs the 4-dim query) alongside correctly-sized neighbours:

  • search() does not throw with a mismatched stored embedding.
  • At the default threshold (0) the mismatched node is included but ranked last — its similarity 0 satisfies 0 >= 0 and scores 1 - 0 = 1, so the ordering comes back ["n1", "n3", "n2"]. This pins the 1 - similarity + >= threshold interaction you flagged.
  • Under a positive threshold (0.5) the mismatched node is dropped (0 >= 0.5 fails) while the correctly-sized neighbours remain in score order. This makes the "drops only that node" behavior explicit and threshold-dependent.

nit (JSDoc): Updated to "Returns 0 if either vector has zero magnitude or if the two vectors have different lengths."

1 (silent guard hides corrupt index): Agreed the mismatch shouldn't vanish without a trail. I added a comment + TODO in cosineSimilarity documenting that a length mismatch indicates a stale/corrupt index and is intentionally swallowed as 0, and why a warning doesn't belong in this pure helper: it's invoked once per node per query in search()'s hot loop and has no state to dedupe a "warn once". Per your own framing ("a counter on the engine"), the actual signal belongs at the engine level — see below.

2 (detect at engine level): I left this as a follow-up rather than implementing it here. It's a stateful design change (record an expected dimension, then decide skip vs warn vs throw on non-conforming stored vectors) that expands past this PR's NaN-safety scope, and the per-call .length compare is O(1) and negligible against the O(dim) dot/magnitude loop that follows, so the perf motivation is thin. I added a comment at the embeddings.get site in search() cross-referencing the TODO so the engine-level validation (record expected dim, skip/flag mismatched stored embeddings, warn-once counter) has a clear home. Happy to spin that into its own PR.

Full core suite green (696 tests); tsc --noEmit clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants