
fix(core): use (type, id) keys in vector search hydration to prevent id collisions #986

Merged
phernandez merged 1 commit into main from fix/982-vector-search-id-collision on Jun 12, 2026

Conversation

@phernandez

Member

Root cause

Entity, observation, and relation rows in search_index carry ids from independent auto-increment sequences, so rows of different types routinely share the same numeric id — guaranteed in young databases. In src/basic_memory/repository/search_repository_base.py:

  1. _search_vector_only parsed each vector hit's chunk_key (e.g. entity:4:0) but discarded the type, keying similarity_by_si_id / chunks_by_si_id by bare id, which also collapsed colliding hits into one map slot.
  2. _fetch_search_index_rows_by_ids fetched WHERE id IN (...) and keyed its result dict by bare row.id — whichever row the database returned last clobbered the other, so the clobbered hit hydrated against the wrong row or was silently dropped (e.g. an entity vanishing entirely under a search_item_types filter when a relation shared its id).
  3. _search_hybrid fused FTS and vector results on bare row id, merging unrelated rows of different types and granting them a spurious dual-source fusion bonus.

The FTS-filter branch already guarded exactly this with (id, type) tuples — the primary vector lookup path and hybrid fusion missed the same treatment.
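
The clobbering described above can be reproduced with plain dicts. This is a minimal sketch, not the repository code: the `Row` class and the sample titles are hypothetical stand-ins for `search_index` rows, and only the keying behavior is meant to match the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for a search_index row (not the real model)."""
    type: str
    id: int
    title: str


# Two rows of different types sharing id=4, as independent sequences allow.
rows = [Row("entity", 4, "Project notes"), Row("relation", 4, "relates_to")]

# Keyed by bare id: the row seen last overwrites the earlier one.
by_bare_id = {row.id: row for row in rows}

# Keyed by (type, id): both rows keep their own slot.
by_type_and_id = {(row.type, row.id): row for row in rows}

# A vector hit for the entity hydrates against the relation row under bare-id
# keying, and against the correct row under (type, id) keying.
hit_type, hit_id = "entity", 4
wrong_row = by_bare_id[hit_id]
right_row = by_type_and_id[(hit_type, hit_id)]
```

Which row wins under bare-id keying depends only on iteration order, which is why the symptom varies with the order the database returns rows.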

Fix

Introduce a SearchIndexKey = tuple[str, int] type alias and key every map in the vector/hybrid retrieval path by (type, id): the similarity and chunk maps in _search_vector_only, the _fetch_search_index_rows_by_ids result, the FTS-filter allowed-keys set, and the rows/fts/vec/fused score maps in _search_hybrid. SQL stays unchanged; bare ids are deduplicated before the IN query and rows are discriminated by row.type when building dict keys. The fix lives in the shared base class, so both SQLite and Postgres backends are covered.
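
A rough sketch of the keying scheme, under stated assumptions: the `chunk_key` layout is taken to be `type:id:suffix` based on the `entity:4:0` example, and `parse_chunk_key`, `fetch_rows_by_keys`, and the in-memory `table` are hypothetical names that stand in for the real methods and the `WHERE id IN (...)` query.

```python
from typing import NamedTuple

SearchIndexKey = tuple[str, int]  # (type, id), the alias named in this PR


class Row(NamedTuple):
    """Hypothetical stand-in for a search_index row."""
    type: str
    id: int


def parse_chunk_key(chunk_key: str) -> SearchIndexKey:
    """Split e.g. 'entity:4:0' into ('entity', 4), keeping the type.

    Assumes a 'type:id:suffix' layout; the suffix is ignored here.
    """
    row_type, row_id, _suffix = chunk_key.split(":", 2)
    return (row_type, int(row_id))


def fetch_rows_by_keys(
    keys: list[SearchIndexKey], all_rows: list[Row]
) -> dict[SearchIndexKey, Row]:
    """Fetch by deduplicated bare ids, then key the result by (type, id)."""
    ids = {row_id for _, row_id in keys}  # bare ids, deduplicated for the IN list
    fetched = [row for row in all_rows if row.id in ids]  # stands in for the SQL
    return {(row.type, row.id): row for row in fetched}


table = [Row("entity", 4), Row("relation", 4), Row("entity", 5)]
hits = ["entity:4:0", "relation:4:0", "entity:4:1"]
keys = [parse_chunk_key(hit) for hit in hits]
rows_by_key = fetch_rows_by_keys(keys, table)
hydrated = [rows_by_key.get(key) for key in keys]
```

Each hit looks up its own `(type, id)` key, so the entity and relation hits with id 4 hydrate against different rows even though the query itself only filters on id.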

Test evidence

  • test_sqlite_vector_search_survives_cross_type_id_collision (real SQLite + sqlite-vec harness): indexes an entity row and a relation row sharing id=7, syncs vectors for both, asserts vector retrieval returns both rows with correct types, and that a search_item_types=[ENTITY] filter still returns the entity.
  • test_cross_type_id_collision_keeps_both_results (hybrid fusion): an FTS entity and a vector relation sharing id=1 stay distinct with their single-source scores — no cross-type merge or fusion bonus.
  • Both tests fail on main without the src change (verified by stashing the fix: SQLite test returns 1 merged result instead of 2; hybrid test fuses the rows into one) and pass with it.
  • Existing mocked vector tests (test_vector_threshold.py, test_vector_pagination.py) updated for tuple-keyed fetch results.
  • 415 passed / 19 skipped across tests/repository/, search service, semantic search, and search schema suites; ruff and ty are clean on src, tests, and test-int.
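
The hybrid case can be illustrated with a toy fusion function. The scoring rule below (best single-source score plus a flat bonus when both sources agree) is an assumption made for illustration only; it is not the fusion formula used by `_search_hybrid`, and `fuse` and `bonus` are hypothetical names.

```python
def fuse(fts_scores: dict, vec_scores: dict, bonus: float = 0.25) -> dict:
    """Toy fusion: best single-source score, plus a bonus for dual-source keys."""
    fused = {}
    for key in fts_scores.keys() | vec_scores.keys():
        score = max(fts_scores.get(key, 0.0), vec_scores.get(key, 0.0))
        if key in fts_scores and key in vec_scores:
            score += bonus  # only keys present in both sources earn the bonus
        fused[key] = score
    return fused


# Bare ids: an FTS entity and a vector relation sharing id=1 merge into one
# result and pick up a bonus neither earned.
merged = fuse({1: 0.5}, {1: 0.25})

# (type, id) keys: the two rows stay distinct with single-source scores.
distinct = fuse({("entity", 1): 0.5}, {("relation", 1): 0.25})
```

With bare ids the result has one entry scored 0.75; with `(type, id)` keys it has two entries scored 0.5 and 0.25, which is the shape the hybrid regression test asserts.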

Fixes #982

🤖 Generated with Claude Code

…id collisions

Root cause: entity, observation, and relation rows in search_index carry
ids from independent auto-increment sequences, so rows of different types
routinely share the same numeric id (guaranteed in young databases).
_search_vector_only parsed each vector hit's chunk_key (e.g. 'entity:4:0')
but discarded the type, and _fetch_search_index_rows_by_ids keyed its
result dict by bare row.id with no type discrimination. Whichever row the
database returned last clobbered the other in the dict; the clobbered hit
then hydrated against the wrong row or found None and was silently
dropped from results. The FTS-filter branch already guarded this with
(id, type) tuples, but the primary vector lookup path and the hybrid
fusion maps missed the same treatment.

Fix: introduce a SearchIndexKey = tuple[str, int] alias and key every map
in the vector/hybrid retrieval path by (type, id) — the similarity and
chunk maps in _search_vector_only, the _fetch_search_index_rows_by_ids
result, the FTS-filter allowed keys, and the rows/fts/vec/fused score
maps in _search_hybrid. The SQL stays unchanged; bare ids are deduped
before the IN query and rows are discriminated by row.type when building
dict keys.

Tests: end-to-end SQLite regression test indexes an entity row and a
relation row sharing id 7, syncs vectors for both, and asserts vector
search returns both rows (and that the entity survives a
search_item_types filter); a hybrid fusion unit test asserts an entity
and relation sharing id 1 stay distinct with single-source scores. Both
fail without the fix. Existing mocked vector tests updated for tuple keys.

Fixes #982

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@phernandez phernandez merged commit 253e240 into main Jun 12, 2026
23 checks passed
@phernandez phernandez deleted the fix/982-vector-search-id-collision branch June 12, 2026 14:03
Development

Successfully merging this pull request may close these issues.

Vector search silently drops results when entity and relation search_index ids collide

1 participant