70 changes: 70 additions & 0 deletions docs/semantic-search.md
@@ -107,6 +107,76 @@ All settings are fields on `BasicMemoryConfig` and can be set via environment va
| `semantic_embedding_document_input_type` | `BASIC_MEMORY_SEMANTIC_EMBEDDING_DOCUMENT_INPUT_TYPE` | Auto for known LiteLLM models | Optional LiteLLM `input_type` for indexed document/passages. |
| `semantic_embedding_query_input_type` | `BASIC_MEMORY_SEMANTIC_EMBEDDING_QUERY_INPUT_TYPE` | Auto for known LiteLLM models | Optional LiteLLM `input_type` for search queries. |
| `semantic_vector_k` | `BASIC_MEMORY_SEMANTIC_VECTOR_K` | `100` | Candidate count for vector nearest-neighbour retrieval. Higher values improve recall at the cost of latency. |
| `search_entity_boost_enabled` | `BASIC_MEMORY_SEARCH_ENTITY_BOOST_ENABLED` | `false` | Enable the entity-aware ranking boost in hybrid search (see below). Default off: benchmark-validated as inert on LoCoMo and prone to Title-Case false positives. |
| `search_entity_boost_weight` | `BASIC_MEMORY_SEARCH_ENTITY_BOOST_WEIGHT` | `0.15` | Per-matched-term multiplier strength for the entity boost. A candidate matching N query entity terms is scaled by `1 + weight * min(N, max_terms)`. |
| `search_entity_boost_max_terms` | `BASIC_MEMORY_SEARCH_ENTITY_BOOST_MAX_TERMS` | `3` | Maximum number of distinct matched entity terms that contribute to the boost, bounding the multiplier. |

## Entity-Aware Ranking Boost

Hybrid search fuses keyword (FTS) and vector similarity, but proper nouns in a query
carry no special weight against generic semantic similarity. As a result, a document
about a *different* entity on the same topic can outrank the document that actually
names the queried entity — e.g. "What are Joanna's hobbies?" surfacing a generic
hobbies note ahead of Joanna's note (see
[#951](https://github.com/basicmachines-co/basic-memory/issues/951)).

When `search_entity_boost_enabled=true`, hybrid retrieval performs a final,
lexical-only re-scoring pass:

1. It extracts candidate entity terms from the query — capitalized / proper-noun
tokens that are not common stopwords (e.g. `Joanna`, `Anthony`, `NASA`).
2. For each fused candidate, it counts how many distinct query entity terms appear in
the candidate's entity name (its title) or in a relation row's linked entity names.
3. Matching candidates have their fused score multiplied by
`1 + weight * min(matches, max_terms)`, so an entity-matching document can be
promoted above a higher-similarity non-matching one.

The boost adds **no model inference**: it tokenizes the query once and each fused
candidate's title, then intersects the two token sets, so the added per-query work
scales with the number of fused candidates. It only affects `hybrid` retrieval; `text`
and `vector` modes are unchanged. Non-matching candidates keep their original scores,
so ordering among them is preserved.
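
As a rough illustration of step 3, the multiplier can be sketched in a few lines of Python. The function name `entity_boost_multiplier` and the two fused scores are hypothetical and chosen only to show a reordering; the formula and the defaults (`0.15`, `3`) come from the settings table above.

```python
def entity_boost_multiplier(matches: int, weight: float = 0.15, max_terms: int = 3) -> float:
    """Return the score multiplier for a candidate matching `matches` entity terms."""
    if matches <= 0:
        # Non-matching candidates keep their fused score unchanged.
        return 1.0
    # The cap bounds the multiplier at 1 + weight * max_terms.
    return 1.0 + weight * min(matches, max_terms)


# Hypothetical fused scores: the generic note starts ahead of the entity note.
generic_note = 0.65 * entity_boost_multiplier(0)  # title names no query entity
joanna_note = 0.60 * entity_boost_multiplier(1)   # title contains "Joanna"

print(joanna_note > generic_note)   # the entity-matching note now ranks first
print(entity_boost_multiplier(5))   # five matches are capped at max_terms = 3
```

With the default settings the multiplier never exceeds `1 + 0.15 * 3`, so a single candidate gains at most about 45% over its fused score.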

```bash
export BASIC_MEMORY_SEARCH_ENTITY_BOOST_ENABLED=true
# Optional tuning:
export BASIC_MEMORY_SEARCH_ENTITY_BOOST_WEIGHT=0.15
export BASIC_MEMORY_SEARCH_ENTITY_BOOST_MAX_TERMS=3
```

> **Default off.** This setting is disabled by default. See the benchmark
> findings below for why the default stays off and where the boost helps.

### Benchmark findings

The boost was benchmarked against LoCoMo (the
[basic-memory-benchmarks](https://github.com/basicmachines-co/basic-memory-benchmarks)
retrieval suite, hybrid mode) and a hand-built adversarial corpus. Two results
drove the decision to keep the default **off** and leave the weight at `0.15`:

1. **LoCoMo is insensitive to the boost.** Sweeping the weight across
`0.15, 0.3, 0.5, 1.0, 2.0` produced *identical* recall@5, recall@10, MRR, and
content-hit at every point — no query reordered, no score changed. LoCoMo's
documents are titled by conversation/session id and expose speaker names only
in body text, never as entity titles or relation names. Because the boost
matches query proper nouns against a candidate's **title or linked relation
names**, it never fires on this corpus. LoCoMo therefore provides no signal to
raise the weight, and the boost neither helps nor harms it.

2. **A capitalization-only heuristic has false positives.** On a corpus where
entity terms appear in titles, the boost correctly promotes the right document
for clean proper nouns (e.g. `Katze`) and is correctly inert on
lowercase-leading identifiers (e.g. `getUserById`, ignored). But **Title-Case
queries can regress**: a query like `What Is The Plan For Q3` extracts `Q3` as
an entity term, and even at weight `0.15` it promotes a document that
*literally* contains "Q3" above the more relevant document that says "third
quarter". Since entity detection is lexical (capitalization, no NER), any
capitalized non-entity token in a query is a potential false positive.
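
Finding 1 can be reproduced in miniature. The sketch below mirrors the title-side matching described in this PR (same token pattern, same possessive stripping); the helper name `title_match_count` and the session-style title `conv-26 session 3` are assumptions for illustration, not actual LoCoMo titles.

```python
import re

# Same token pattern as the PR: ASCII letters plus apostrophes/hyphens.
TOKEN = re.compile(r"[A-Za-z][A-Za-z'\-]*")


def title_match_count(title: str, entity_terms: set[str]) -> int:
    """Count distinct query entity terms that appear as tokens in a title."""
    tokens: set[str] = set()
    for match in TOKEN.finditer(title):
        token = match.group(0).lower()
        # Strip a trailing possessive so "Joanna's" matches "joanna".
        if token.endswith("'s"):
            token = token[:-2]
        tokens.add(token)
    return len(entity_terms & tokens)


terms = {"joanna"}
print(title_match_count("Joanna's Hobbies", terms))   # entity-titled note
print(title_match_count("conv-26 session 3", terms))  # hypothetical session-id title
```

A title keyed by session id contributes no entity tokens, so the match count is zero and the multiplier stays at 1 regardless of the configured weight.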

**Guidance.** Enable the boost only on entity-heavy corpora where your queries
name entities that are themselves note titles or linked relations (the #951
"Joanna" case). Prefer natural-case queries (`What are Joanna's hobbies?`) over
Title-Cased phrasing, which can inject spurious entity terms. Leave it off for
conversational / body-text-keyed corpora like LoCoMo, where it cannot help.
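
To see why natural-case queries are preferred, here is a minimal sketch of the capitalization heuristic. It follows the extraction logic in this PR, but `extract_entity_terms` is a hypothetical standalone name and `STOPWORDS` is an abbreviated list used only for this example.

```python
import re

TOKEN = re.compile(r"[A-Za-z][A-Za-z'\-]*")
# Abbreviated stopword list for illustration; the list in the PR is longer.
STOPWORDS = frozenset({"what", "are", "is", "the", "for", "who", "and"})


def extract_entity_terms(query: str) -> set[str]:
    """Return lowercased tokens that start with an uppercase letter."""
    terms: set[str] = set()
    for match in TOKEN.finditer(query):
        token = match.group(0)
        if not token[0].isupper():
            continue  # lowercase-leading tokens are treated as generic words
        normalized = token.lower()
        if normalized.endswith("'s"):
            normalized = normalized[:-2]  # "Joanna's" -> "joanna"
        if normalized in STOPWORDS or len(normalized) < 2:
            continue
        terms.add(normalized)
    return terms


print(sorted(extract_entity_terms("What are Joanna's hobbies?")))  # ['joanna']
print(sorted(extract_entity_terms("What Are Joanna's Hobbies?")))  # ['hobbies', 'joanna']
```

The Title-Cased variant adds `hobbies` as an entity term, so any candidate whose title contains "Hobbies" would also be boosted, which is the kind of spurious match the guidance warns about.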

## Embedding Providers

36 changes: 36 additions & 0 deletions src/basic_memory/config.py
@@ -346,6 +346,42 @@ def __init__(self, **data: Any) -> None: ...
"Valid values: text, vector, hybrid. "
"When unset, defaults to 'hybrid' if semantic search is enabled, otherwise 'text'.",
)
    # Entity-aware ranking boost (hybrid retrieval).
    # Trigger: proper nouns in a query (e.g. "Joanna") carry no extra weight against
    # generic semantic similarity, so documents from the wrong conversation can outrank
    # the gold document during hybrid fusion (#951).
    # Why: entities are first-class in Basic Memory, so a candidate whose title or linked
    # relation names contain a query proper noun is a stronger answer than a same-topic
    # document about a different entity.
    # Outcome: when enabled, hybrid fusion multiplies a candidate's fused score by a small
    # bonus for each distinct query entity term it matches lexically (no model inference).
    # Default OFF: LoCoMo benchmarking showed the boost is inert there (its docs are keyed
    # by session id, not entity titles) and an adversarial check found Title-Case queries
    # can inject spurious entity terms (e.g. "Q3") that regress ranking. See
    # docs/semantic-search.md "Benchmark findings".
    search_entity_boost_enabled: bool = Field(
        default=False,
        description="Enable entity-aware ranking boost in hybrid search. When enabled, "
        "hybrid candidates whose title or linked relation names contain a proper-noun "
        "term from the query are boosted in the final ranking. Lexical-only; adds no "
        "model inference. Default off: benchmark-validated as inert on LoCoMo and prone "
        "to Title-Case false positives (see docs/semantic-search.md).",
    )
    search_entity_boost_weight: float = Field(
        default=0.15,
        description="Per-matched-term multiplier strength for the entity-aware ranking "
        "boost. A candidate matching N distinct query entity terms has its fused score "
        "multiplied by (1 + weight * min(N, search_entity_boost_max_terms)). "
        "Only applies when search_entity_boost_enabled is true.",
        ge=0.0,
    )
    search_entity_boost_max_terms: int = Field(
        default=3,
        description="Maximum number of distinct matched entity terms that contribute to "
        "the entity-aware ranking boost, bounding the multiplier so a single candidate "
        "cannot run away with the ranking.",
        gt=0,
    )

    # Database connection pool configuration (Postgres only)
    db_pool_size: int = Field(
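
The `ge=0.0` and `gt=0` constraints on the two tuning fields can be pictured with a small stand-in. This is a sketch using a plain dataclass, not the project's Pydantic model; `EntityBoostSettings` and `max_multiplier` are hypothetical names introduced only to show the bounds the constraints imply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityBoostSettings:
    """Hypothetical stand-in for the three new config fields."""

    enabled: bool = False
    weight: float = 0.15
    max_terms: int = 3

    def __post_init__(self) -> None:
        # Mirrors ge=0.0 on the weight field.
        if self.weight < 0.0:
            raise ValueError("weight must be >= 0.0")
        # Mirrors gt=0 on the max-terms field.
        if self.max_terms <= 0:
            raise ValueError("max_terms must be > 0")

    @property
    def max_multiplier(self) -> float:
        """Upper bound on the boost multiplier for any single candidate."""
        return 1.0 + self.weight * self.max_terms


defaults = EntityBoostSettings()
print(defaults.enabled, defaults.max_multiplier)
```

Because the weight cannot be negative and the cap is at least one term, the multiplier for a matching candidate is always at least 1, so the boost can promote a candidate but never demote it.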
3 changes: 3 additions & 0 deletions src/basic_memory/repository/postgres_search_repository.py
@@ -67,6 +67,9 @@ def __init__(
        self._semantic_postgres_prepare_concurrency = (
            self._app_config.semantic_postgres_prepare_concurrency
        )
        self._entity_boost_enabled = self._app_config.search_entity_boost_enabled
        self._entity_boost_weight = self._app_config.search_entity_boost_weight
        self._entity_boost_max_terms = self._app_config.search_entity_boost_max_terms
        self._embedding_provider = embedding_provider
        self._vector_dimensions = 384
        self._vector_tables_initialized = False
173 changes: 173 additions & 0 deletions src/basic_memory/repository/search_repository_base.py
@@ -45,6 +45,64 @@
# the vector/hybrid retrieval path must key rows by (type, id) to avoid collisions.
type SearchIndexKey = tuple[str, int]

# --- Entity-aware ranking boost (#951) ---

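# Match word tokens (allowing internal apostrophes/hyphens) so we can inspect
# their capitalization to detect proper-noun-like query terms. Only ASCII letters,
# apostrophes, and hyphens are matched, so a digit or non-ASCII character ends a token.
_ENTITY_TERM_TOKEN_PATTERN = re.compile(r"[A-Za-z][A-Za-z'\-]*")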

# Common capitalized sentence-starters and interrogatives that look like proper
# nouns but are not entity references. Kept lowercase for case-insensitive checks.
# Intentionally small: a candidate term only boosts a row when it actually matches
# that row's title/relation names, so a stray non-entity term simply does nothing.
_ENTITY_TERM_STOPWORDS = frozenset(
    {
        "a",
        "an",
        "and",
        "are",
        "as",
        "at",
        "be",
        "but",
        "by",
        "do",
        "does",
        "for",
        "from",
        "has",
        "have",
        "how",
        "i",
        "in",
        "is",
        "it",
        "of",
        "on",
        "or",
        "the",
        "their",
        "they",
        "this",
        "to",
        "was",
        "we",
        "were",
        "what",
        "when",
        "where",
        "which",
        "who",
        "whom",
        "whose",
        "why",
        "will",
        "with",
        "you",
        "your",
    }
)


@dataclass
class VectorSyncBatchResult:
@@ -166,6 +224,13 @@ class SearchRepositoryBase(ABC):
    _vector_dimensions: int
    _vector_tables_initialized: bool

    # Entity-aware ranking boost (#951). Defaults keep the feature off for any
    # subclass or test double that does not explicitly configure it. Concrete
    # backends overwrite these from BasicMemoryConfig in their __init__.
    _entity_boost_enabled: bool = False
    _entity_boost_weight: float = 0.0
    _entity_boost_max_terms: int = 1

    def __init__(self, session_maker: async_sessionmaker[AsyncSession], project_id: int):
        """Initialize with session maker and project_id filter.

@@ -2147,6 +2212,105 @@ async def _fetch_search_index_rows_by_ids(
    # Shared semantic search: hybrid score-based fusion
    # ------------------------------------------------------------------

    # --- Entity-aware ranking boost (#951) ---

    @staticmethod
    def _extract_query_entity_terms(search_text: Optional[str]) -> set[str]:
        """Extract candidate entity (proper-noun) terms from a query string.

        Heuristic, lexical only (no model inference): a token is a candidate entity
        term when its first letter is uppercase (Title-Case or ALL-CAPS), it is not a
        common stopword, and it is at least two characters long after possessive
        stripping. The result is lowercased so downstream matching is case-insensitive.

        Examples:
            "What are Joanna's hobbies?" -> {"joanna"}
            "Who is Anthony?"            -> {"anthony"}
            "Deborah and Jolene"         -> {"deborah", "jolene"}
            "what is the weather"        -> set()  (no proper nouns)
        """
        if not search_text:
            return set()

        terms: set[str] = set()
        for match in _ENTITY_TERM_TOKEN_PATTERN.finditer(search_text):
            token = match.group(0)
            # Trigger: token begins with an uppercase letter (Title-Case or ALL-CAPS).
            # Why: proper nouns and named entities are conventionally capitalized; this
            # is the cheapest reliable signal without a NER model.
            # Outcome: tokens that start with a lowercase letter are ignored as generic
            # terms.
            if not token[0].isupper():
                continue
            normalized = token.lower()
            # Strip a trailing possessive so "Joanna's" matches the entity "Joanna".
            if normalized.endswith("'s"):
                normalized = normalized[:-2]
            if normalized in _ENTITY_TERM_STOPWORDS:
                continue
            # Single characters (e.g. a stray "I") carry no entity signal.
            if len(normalized) < 2:
                continue
            terms.add(normalized)
        return terms

    @staticmethod
    def _row_entity_match_count(row: SearchIndexRow, entity_terms: set[str]) -> int:
        """Count distinct query entity terms that a candidate row references.

        Matches against the row's own entity name (title) and the names embedded in
        a relation row's title (``"From -> To"``). These are the fields where Basic
        Memory's first-class entity names surface, so a match here is strong evidence
        the candidate is about the queried entity rather than a same-topic document.
        """
        if not entity_terms:
            return 0

        haystack_parts = [row.title or ""]
        # Relation rows encode linked entity names in their title ("From -> To");
        # the relation_type itself is not an entity name, so it is excluded.
        haystack = " ".join(part for part in haystack_parts if part)
        if not haystack:
            return 0

        haystack_tokens: set[str] = set()
        for match in _ENTITY_TERM_TOKEN_PATTERN.finditer(haystack):
            token = match.group(0).lower()
            # Mirror the query-side possessive stripping so a doc titled
            # "Joanna's Hobbies" matches the query entity term "joanna".
            if token.endswith("'s"):
                token = token[:-2]
            haystack_tokens.add(token)
        return len(entity_terms & haystack_tokens)

    def _apply_entity_boost(
        self,
        fused_scores: dict[SearchIndexKey, float],
        rows_by_key: dict[SearchIndexKey, SearchIndexRow],
        entity_terms: set[str],
    ) -> dict[SearchIndexKey, float]:
        """Multiply fused scores by a per-matched-term bonus for entity-matching rows.

        Trigger: entity boosting is enabled and the query contains proper-noun terms.
        Why: a candidate whose entity/relation names contain a queried proper noun is a
        stronger answer than a generic same-topic document (#951 cross-conversation
        confusion).
        Outcome: ``score * (1 + weight * min(matches, max_terms))``. Rows that match no
        query entity term are returned unchanged, so relative order among non-matching
        rows is preserved.
        """
        if not self._entity_boost_enabled or not entity_terms or self._entity_boost_weight <= 0:
            return fused_scores

        boosted: dict[SearchIndexKey, float] = {}
        for row_key, score in fused_scores.items():
            row = rows_by_key.get(row_key)
            matches = self._row_entity_match_count(row, entity_terms) if row is not None else 0
            if matches <= 0:
                boosted[row_key] = score
                continue
            capped_matches = min(matches, self._entity_boost_max_terms)
            boosted[row_key] = score * (1.0 + self._entity_boost_weight * capped_matches)
        return boosted

    async def _search_hybrid(
        self,
        *,
@@ -2250,6 +2414,15 @@ async def _search_hybrid(
            f = fts_scores.get(row_key, 0.0)
            fused_scores[row_key] = max(v, f) + FUSION_BONUS * min(v, f)

        # Entity-aware ranking boost (#951): runs over the full fused candidate set
        # before the limit/offset cut, so a boosted entity-matching candidate can be
        # promoted into the returned window. No-op when the feature is disabled or the
        # query contains no proper-noun terms, preserving the existing ordering.
        entity_terms = (
            self._extract_query_entity_terms(query_text) if self._entity_boost_enabled else set()
        )
        fused_scores = self._apply_entity_boost(fused_scores, rows_by_key, entity_terms)

        ranked = sorted(fused_scores.items(), key=lambda item: item[1], reverse=True)
        output: list[SearchIndexRow] = []
        for row_key, fused_score in ranked[offset : offset + limit]:
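
The reason the boost runs before the limit/offset cut can be illustrated with a standalone sketch. `FUSION_BONUS = 0.3`, the candidate titles, and the scores are assumptions chosen for illustration, and titles are tokenized with a plain `split()` rather than the regex used in the hunk above.

```python
FUSION_BONUS = 0.3  # assumed value, for illustration only


def fuse(vector_score: float, fts_score: float) -> float:
    """Score-based fusion as in the hunk: max plus a bonus on the weaker signal."""
    return max(vector_score, fts_score) + FUSION_BONUS * min(vector_score, fts_score)


def rank(
    candidates: dict[str, tuple[float, float]],
    entity_terms: set[str],
    *,
    weight: float = 0.15,
    max_terms: int = 3,
    limit: int = 2,
    boost: bool = True,
) -> list[str]:
    """Fuse, optionally boost, sort, then apply the limit."""
    scored: dict[str, float] = {}
    for title, (v, f) in candidates.items():
        score = fuse(v, f)
        matches = len(entity_terms & set(title.lower().split()))
        if boost and matches > 0:
            score *= 1.0 + weight * min(matches, max_terms)
        scored[title] = score
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [title for title, _ in ranked[:limit]]


candidates = {
    "hobbies overview": (0.70, 0.20),
    "weekend activities": (0.68, 0.10),
    "joanna profile": (0.60, 0.30),
}
print(rank(candidates, {"joanna"}, boost=False))
print(rank(candidates, {"joanna"}, boost=True))
```

Without the boost, the entity-matching candidate sits third and falls outside a window of two. With the boost applied to the full fused set, it moves to the top before the slice is taken. Boosting only the already-sliced window could not recover it.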
3 changes: 3 additions & 0 deletions src/basic_memory/repository/sqlite_search_repository.py
@@ -55,6 +55,9 @@ def __init__(
        self._semantic_embedding_sync_batch_size = (
            self._app_config.semantic_embedding_sync_batch_size
        )
        self._entity_boost_enabled = self._app_config.search_entity_boost_enabled
        self._entity_boost_weight = self._app_config.search_entity_boost_weight
        self._entity_boost_max_terms = self._app_config.search_entity_boost_max_terms
        self._embedding_provider = embedding_provider
        self._sqlite_vec_load_lock = asyncio.Lock()
        self._sqlite_prepare_write_lock = asyncio.Lock()