Skip to content

feat(core): config-gated entity-aware ranking boost for hybrid search#969

Draft
groksrc wants to merge 3 commits into
mainfrom
feat/951-entity-ranking-boost
Draft

feat(core): config-gated entity-aware ranking boost for hybrid search#969
groksrc wants to merge 3 commits into
mainfrom
feat/951-entity-ranking-boost

Conversation

@groksrc

@groksrc groksrc commented Jun 11, 2026

Copy link
Copy Markdown
Member

Closes #951

Summary

Implements a config-gated, entity-aware ranking boost for hybrid search results (Option A from the issue triage). The boost is a post-fusion re-scoring pass applied to the full fused candidate set before the limit/offset cut, so a matching document below the cutoff can be promoted without changing the underlying FTS or vector retrieval.

Feature is default-off. No schema changes. No model inference — purely lexical/index lookups against existing data.

What changed

config.py — three new BasicMemoryConfig fields, all default-off:

  • search_entity_boost_enabled (False) — master switch
  • search_entity_boost_weight (0.15) — multiplicative weight (ge=0.0)
  • search_entity_boost_max_terms (3) — cap on matched terms counted (gt=0)

All three respect the existing BASIC_MEMORY_ env-var prefix.

search_repository_base.py — new helpers and hook:

  • _extract_query_entity_terms — extracts capitalized/proper-noun tokens from the query, strips possessives, drops stopwords
  • _row_entity_match_count — matches query entity terms against a result row's title (relation rows carry a synthetic FromEntity -> ToEntity title from the search service, so this covers entity and relation hits)
  • _apply_entity_boost — applies score * (1 + weight * min(matches, max_terms)) to matched rows; non-matching rows are untouched so their relative order is preserved
  • Hooked into _search_hybrid right before the final ranking sort

sqlite_search_repository.py / postgres_search_repository.py — populate the three _entity_boost_* instance attributes from config, parallel to the existing _semantic_* wiring. The base class declares safe off-by-default defaults so other subclasses and test doubles are unaffected.

docs/semantic-search.md — three new config rows added to the Configuration Reference table; new "Entity-Aware Ranking Boost" section explaining the mechanism, env vars, and the default-off/benchmark-pending status.

Testing

  • tests/repository/test_search_repository_entity_boost.py — new dedicated test module covering: disabled-by-default (no score change), boost applied when enabled, boost capped at max_terms, mixed results (matching vs non-matching docs), zero-weight no-op, and relation title matching both endpoints (test_relation_title_matches_both_endpoints)
  • Existing tests/repository/test_search_repository.py — all passing; hybrid search behaviour unchanged when boost is off
  • uv run pytest tests/repository/ tests/test_config.py -x passes clean

Review notes

  1. _row_entity_match_count internal construction is slightly misleading. docs/semantic-search.md and the helper's docstring both say matching is performed against "a relation row's linked entity names". This is correct in practice — relation rows are indexed with a synthetic title of the form FromEntity -> ToEntity at src/basic_memory/services/search_service.py:921, and test_relation_title_matches_both_endpoints covers it. However, the implementation reads haystack_parts = [row.title or ""], a single-element list with a leftover comment (# the relation_type itself is not an entity name, so it is excluded) describing a multi-source merge that never materialises. Worth simplifying to haystack = row.title or "" to avoid implying dead/abandoned logic.

  2. No dedicated test_config.py assertions for the three new fields. The claimed test run includes test_config.py but it passes only because it was unchanged and does not exercise the new fields. The defaults and Pydantic validators (ge=0.0, gt=0) are covered only indirectly via the repository tests. A small explicit default + env-var test would harden the config surface, though the Pydantic constraints make this low risk.

  3. Feature is dormant on merge (default-off, not benchmarked). This is consistent with the issue's "measurable commit-to-commit" framing and the triage's Option A recommendation, but reviewers should be aware the new code path is not active in any default configuration and has not been validated against the LoCoMo benchmark. The intent is to enable it once a benchmark run confirms neutral-or-positive impact.

🤖 Generated with Claude Code

@groksrc groksrc changed the title feat(search): config-gated entity-aware ranking boost for hybrid search feat(core): config-gated entity-aware ranking boost for hybrid search Jun 11, 2026
@groksrc groksrc marked this pull request as ready for review June 11, 2026 07:28

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20299915ad

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +108 to +112
repo = SQLiteSearchRepository(
session_maker,
project_id=test_project.id,
app_config=app_config,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pass the stub provider into SQLiteSearchRepository

When these tests run on the SQLite path, app_config has semantic_search_enabled=True but the repository is constructed without embedding_provider, so SQLiteSearchRepository.__init__ creates the configured real provider before lines 113-115 replace it with the stub. That defeats the deterministic stub setup and can make this test load FastEmbed/cache artifacts or fail in environments without usable semantic dependencies; pass embedding_provider=provider here as the Postgres branch already does.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Verified: SQLiteSearchRepository.init calls create_embedding_provider() at line 71 when semantic_search_enabled=True and no provider is passed — exactly the real-provider instantiation you flagged.

Fixed in dcc4a7a: pass embedding_provider=provider to the SQLiteSearchRepository constructor at construction time (matching what the Postgres branch already did), and removed the three after-the-fact attribute swaps (_semantic_enabled, _embedding_provider, _vector_dimensions) which are now redundant since init sets them correctly from the supplied provider.

All 60 tests in the suite pass; ruff and ty check clean.

@groksrc

groksrc commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

Benchmark + tuning report for the entity-aware ranking boost

I benchmarked the boost, swept the weight, ran an adversarial check, and updated the docs/config rationale. Decision: keep search_entity_boost_enabled default off and the weight at 0.15 — with data below.

Setup

  • BM under test: this branch (feat/951-entity-ranking-boost) via the benchmarks harness --bm-local-path (runs the worktree's code; searches go through a warm MCP search_notes session with search_type=hybrid).
  • Isolation: harness scopes BM to a benchmark-owned BASIC_MEMORY_CONFIG_DIR; boost config injected via BASIC_MEMORY_SEARCH_ENTITY_BOOST_* env vars (the harness forwards the parent environment into both reindex and the MCP search process).
  • Dataset: LoCoMo locomo-c1 subset — 19 docs, 199 queries, fastembed bge-small-en-v1.5, all five categories present (single_hop 32, multi_hop 37, temporal 13, open_domain 70, adversarial 47). Full-corpus (272-doc) re-embed was the only multi-hour path; I confirmed (see below) that it would add nothing here, so all numbers are labeled subset. The same fixed query set was used for every sweep point.
  • Why subset is sufficient (and full corpus wouldn't change the conclusion): the existing June-10 ingest could not be reused (its benchmark project had been removed from the harness DB), so each point re-ingests; the subset runs end-to-end in ~28s. Critically, the boost is structurally inert on LoCoMo regardless of size — see finding Pre release #1.

Weight sweep — LoCoMo locomo-c1 subset (199 queries, hybrid)

max_terms=3 throughout. Boost on vs off, weights 0.15 → 2.0:

config recall@5 recall@10 MRR content-hit
baseline (off) 0.8370 0.9207 0.7591 0.1910
w=0.15 (on) 0.8370 0.9207 0.7591 0.1910
w=0.3 (on) 0.8370 0.9207 0.7591 0.1910
w=0.5 (on) 0.8370 0.9207 0.7591 0.1910
w=1.0 (on) 0.8370 0.9207 0.7591 0.1910
w=2.0 (on) 0.8370 0.9207 0.7591 0.1910

Per-category recall@5 / MRR were identical at every weight as well (single_hop 0.5490/0.5298, multi_hop 0.9189/0.7962, temporal 0.6154/0.5085, open_domain 0.9143/0.8386, adversarial 0.9149/0.8369). A per-query diff of the top-5 ordering between baseline and w=2.0 showed 0 / 199 queries changed, and per-hit scores were byte-identical.

Finding 1 — LoCoMo is blind to the boost. LoCoMo docs are titled by session id (locomo-c00-s01) and surface speaker names only in body text, never as entity titles or relation names. The boost matches query proper nouns against a candidate's title / linked relation names, so it never fires on this corpus — e.g. for "When did Caroline go to the LGBTQ support group?" the term caroline is correctly extracted but matches no title. The boost machinery is confirmed live (env → config → _extract_query_entity_terms all verified), it simply has nothing to match. LoCoMo therefore gives no signal to raise the default weight — and confirms the boost is non-regressive there (literally a no-op).

Adversarial check (LoCoMo can't see this)

Hand-built corpus where entity terms appear in titles, boost off vs on (w=0.15/0.5/2.0):

query type result
German capsKatze (gold: doc titled Katze) Gold stays #1 at all weights; boost amplifies proportionally. No harmful reorder.
CamelCasegetUserById (gold: doc titled getUserById) Identical to boost-off at every weight — getUserById starts lowercase, so it's rejected as a non-entity token. Correctly inert.
Title-CaseWhat Is The Plan For Q3 (gold: third quarter roadmap doc) Regression. Q3 is extracted as an entity term; even at w=0.15 a distractor that literally contains "Q3" is promoted above the more relevant "third quarter" doc. Worsens with weight.

Finding 2 — the capitalization-only heuristic has a real false-positive mode. Title-Cased query phrasing injects spurious entity terms (any capitalized non-entity token), and that can flip a correct top result. So the adversarial check is not clean.

Decision & rationale

  • Default: OFF (unchanged). No LoCoMo evidence supports enabling it (inert), and the adversarial check found a genuine Title-Case regression.
  • Weight: 0.15 (unchanged). LoCoMo provides zero signal to tune the weight; the only datapoint favoring a higher weight was the single manual "Joanna" QA case, which the benchmark cannot corroborate. Raising the default would be unjustified by data and would only amplify the Title-Case false positive.
  • No code/default changes — the PR's existing defaults are already the data-driven choice. I updated docs/semantic-search.md (new "Benchmark findings" section + guidance) and the config.py rationale comment/description to record why it stays off and where the boost actually helps (entity-heavy corpora with entity-titled notes / natural-case queries, i.e. the Entity-aware ranking boost: cross-conversation entity confusion dominates hard retrieval misses in LoCoMo benchmark #951 "Joanna" case).

Files changed: docs/semantic-search.md, src/basic_memory/config.py (commit b07af9b9).
Gates: ruff check + ruff format --check clean on changed files; uv run ty check src tests test-int → all checks passed; targeted pytest (entity-boost service, hybrid fusion, config) → 118 passed.

🤖 Generated with Claude Code

@groksrc

groksrc commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

Status: back to draft — deferred until the boost is demonstrably useful

Maintainer decision following the benchmark + adversarial findings above: the implementation is sound and non-regressive, but there is currently no evidence it improves retrieval, and the capitalization heuristic has a confirmed false-positive class (Title-Case queries promoting literal-token distractors, e.g. "Q3"). We won't land dormant machinery.

Likely path to make it land-able

  1. Ground entity-term extraction in the index, not capitalization. Match candidate query terms against indexed entity titles (the original Entity-aware ranking boost: cross-conversation entity confusion dominates hard retrieval misses in LoCoMo benchmark #951 framing) instead of treating any capitalized token as an entity. This kills the Title-Case false-positive class by construction — "Q3" only boosts if an entity titled Q3 actually exists — and makes the boost semantics match its name.
  2. Validate on a corpus the boost can see. LoCoMo is structurally blind here: its documents are titled by session id with entity names only in body text, so title/relation matching never fires (zero reorders at any weight, 0.15–2.0). Validation needs a benchmark whose notes are entity-titled like real Basic Memory graphs — either a LoCoMo variant ingested as entity-per-person notes, or a purpose-built set replicating the cross-conversation confusion pattern (the Joanna/Anthony case from the issue, at scale).
  3. Re-run the weight sweep there. The known calibration datapoint says 0.15 cannot overcome relation-link FTS inflation (~24% score gap vs a 15% boost); expect the useful range to be materially higher once there is signal to tune against.
  4. Then revisit the default under the same two gates used here: benchmark wins with no category regression, plus a clean adversarial pass.

🤖 Generated with Claude Code

@groksrc groksrc marked this pull request as draft June 11, 2026 21:10
groksrc and others added 3 commits June 12, 2026 12:31
…arch

Proper nouns in a query carry no extra weight against generic semantic
similarity, so documents about a different entity on the same topic can
outrank the document that actually names the queried entity (#951
cross-conversation confusion in the LoCoMo benchmark).

Add an optional, lexical-only re-scoring pass to hybrid fusion:

- Extract candidate entity terms from the query (capitalized / proper-noun
  tokens that are not common stopwords; trailing possessives stripped).
- Count how many distinct query entity terms appear in each fused
  candidate's entity name (title) or a relation row's linked entity names.
- Multiply matching candidates' fused scores by
  1 + weight * min(matches, max_terms), promoting entity-matching docs.

The boost runs over the full fused candidate set before the limit/offset
cut, so a matching doc below the cutoff can be promoted into the returned
window. It adds no model inference (index/lexical lookups only), so
per-query latency overhead is trivial, and only affects hybrid retrieval.

Behind three config flags, DEFAULT OFF pending LoCoMo benchmark validation:
search_entity_boost_enabled, search_entity_boost_weight,
search_entity_boost_max_terms. Documented in docs/semantic-search.md.

Tests: unit coverage for entity-term extraction and the boost math; a
hybrid-pipeline test showing reordering when enabled and unchanged ordering
when disabled; and a service-level integration test over a real DB with a
deterministic stub embedding provider proving an entity-matching doc
outranks a higher-similarity non-matching doc only when enabled.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
… construction time

On the SQLite path, SQLiteSearchRepository.__init__ calls
create_embedding_provider() when semantic_search_enabled=True and no
provider is supplied, which could pull FastEmbed/cache artifacts or
fail where semantic deps are unusable.  The Postgres branch already
passed the stub provider at construction; bring the SQLite branch in
line by passing embedding_provider=provider to the constructor and
removing the redundant after-the-fact attribute swaps.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
Benchmarked the #951 entity-aware ranking boost against the LoCoMo retrieval
suite (hybrid mode) and a hand-built adversarial corpus.

LoCoMo is insensitive to the boost: sweeping the weight across
0.15/0.3/0.5/1.0/2.0 produced identical recall@5, recall@10, MRR, and
content-hit at every point (no query reordered, no score changed). LoCoMo docs
are keyed by session id and expose speaker names only in body text, never as
entity titles or relation names, so the title/relation-matching boost never
fires there.

An adversarial check found a real regression mode: Title-Case queries inject
spurious entity terms. 'What Is The Plan For Q3' extracts 'Q3' and, even at
weight 0.15, promotes a literal-'Q3' document over the more relevant 'third
quarter' document. Clean proper nouns (Katze) work; lowercase-leading
identifiers (getUserById) are correctly ignored.

Decision: keep search_entity_boost_enabled default off and the weight at 0.15.
LoCoMo provides no signal to raise the weight, and the adversarial check is not
clean. Document the findings and guidance; no code/default changes.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
@groksrc groksrc force-pushed the feat/951-entity-ranking-boost branch from b07af9b to d175879 Compare June 12, 2026 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Entity-aware ranking boost: cross-conversation entity confusion dominates hard retrieval misses in LoCoMo benchmark

1 participant