feat(mcp): add rebuild_embeddings tool (issue #39) by bertheto · Pull Request #42 · ScottRBK/forgetful

bertheto · 2026-05-05T22:22:07Z

Summary

Adds a user-scoped, non-destructive rebuild_embeddings MCP tool, complementing the existing offline --re-embed CLI which performs a global storage reset (reset_embedding_storage() + full re-embed). Closes #39.

The new tool lets a memory client (LLM agent, dashboard, ops script) recover a small set of memories whose embedding is missing or stale, or force-refresh the embeddings of a single project, without touching other users' data and without resetting global vector storage.

Why a new tool instead of reusing `re_embed_all`?

re_embed_all is intentionally destructive: it resets the entire vec_memories table (SQLite) or column (Postgres) before re-embedding. This is correct for an offline migration but unsafe as a multi-tenant runtime tool.
Issue Feature request: add ebuild_embeddings MCP tool for existing memories #39 specifically asks for a way to rebuild embeddings for an existing memory set without recreating it. The previous workaround (delete + create) loses id, links, created_at, and provenance.
We deliberately keep the existing --re-embed CLI/code path untouched.

Design

Tool surface

Registered in both registries used by the project:

app/routes/mcp/memory_tools.py (FastMCP-decorated @mcp.tool());
app/routes/mcp/tool_adapters.py + app/routes/mcp/tool_metadata_registry.py (simplified registry).

Parameters:

Param	Type	Default	Notes
`memory_ids`	`list[int] \| None`	`None`	Explicit ids, already user-scoped server-side
`project_id`	`int \| None`	`None`	Single project owned by caller
`only_missing`	`bool`	`True`	When True, rebuild only memories whose embedding is missing/stale

Returns a dict with: total_candidates, rebuilt_ids, skipped_ids, failed (list of {memory_id, reason}) so the caller can act on partial success.

Persistence layer

Three new methods on MemoryRepository:

count_memories_for_targeted_rebuild(user_id, memory_ids?, project_id?, only_missing)
get_memories_for_targeted_rebuild(user_id, limit, offset, ...)
upsert_targeted_embeddings(user_id, updates)

All filters are AND-combined; user_id is mandatory and never optional. upsert_targeted_embeddings performs server-side ownership validation against memories.user_id before writing — a malicious caller cannot poison another user's vector by passing arbitrary ids.

SQLite (vec_memories virtual table is keyed by memory_id and rejects duplicates): idempotent DELETE + INSERT per id.
Postgres (memories.embedding inline column): idempotent UPDATE.

Service layer

ReEmbeddingService.rebuild_targeted reuses the existing pipeline (build_embedding_text + EmbeddingsAdapter.generate_embedding) so embeddings stay dimensionally consistent with the rest of the corpus. reset_embedding_storage() is never called in this path.

Tests

File	Tests	Notes
`tests/integration/test_targeted_rebuild_sqlite.py`	4	Real in-memory SQLite + `sqlite-vec`, deterministic offline embedding adapter; covers gap-repair, user-scoping, no-reset guarantee, repository-level ownership rejection
`tests/integration/test_re_embedding_service.py`	+4	Service-level coverage with mocked repo: empty candidate set, partial-ownership skips, per-memory failure recorded in `result.failed`, `reset_embedding_storage` never called
`tests/e2e_sqlite/test_re_embedding_sqlite.py`	+3	End-to-end with FastEmbed when available; fixture now skips cleanly instead of crashing when Hugging Face is unreachable (offline runners)

Local run (offline): 418 passed, 1 unrelated pre-existing flake (test_list_projects_multiple). The new test_targeted_rebuild_sqlite.py module is fully offline-safe and ran green in CI on this branch.

Backward compatibility

re_embed_all and the CLI --re-embed flag are unchanged.
New methods on MemoryRepository are additive; existing implementations were extended, not modified.
No DB schema migration required.
Existing MCP tools' contracts are unchanged.

Test plan

pytest tests/integration/test_targeted_rebuild_sqlite.py -> 4 passed
pytest tests/integration/test_re_embedding_service.py -> 12 passed (8 existing + 4 new)
pytest tests/e2e_sqlite/test_re_embedding_sqlite.py -> 8 skipped offline (HF unreachable), runs cleanly when FastEmbed model is cached
ruff check on all changed files -> All checks passed
Manually exercised on a 6 MB SQLite DB (forgetful_ai==0.4.0 HTTP runtime) with a real Cursor MCP client; rebuild_embeddings with only_missing=True repaired exactly the gap memories, no other vectors touched.

Closes #39

Adds a user-scoped, non-destructive MCP tool to rebuild embeddings on a memory subset, complementing the existing offline `--re-embed` CLI which performs a global storage reset. The new `rebuild_embeddings` tool: - is registered in both `memory_tools.py` (FastMCP) and the simplified tool registry (`tool_metadata_registry.py` + `tool_adapters.py`); - never calls `reset_embedding_storage()`; - supports `memory_ids`, `project_id`, and `only_missing` filters; - returns a structured result (`total_candidates`, `rebuilt_ids`, `skipped_ids`, `failed`) so callers can act on partial success. Persistence layer changes: - new `MemoryRepository` methods `count_memories_for_targeted_rebuild`, `get_memories_for_targeted_rebuild`, `upsert_targeted_embeddings`; - SQLite implementation: idempotent DELETE-then-INSERT into `vec_memories`, server-side ownership check; - Postgres implementation: idempotent UPDATE on `memories.embedding`, same ownership check. Service layer changes: - `ReEmbeddingService.rebuild_targeted` reuses the same embedding pipeline (`build_embedding_text` + `EmbeddingsAdapter`) so vectors stay dimensionally consistent with the rest of the corpus. Tests: - 4 new integration tests on a real in-memory SQLite stack with a deterministic offline embedding adapter, covering: gap repair with `only_missing=True`, user-scoping (cross-user attempts blocked at the repository), no-op when `reset_embedding_storage` would be called, and `upsert_targeted_embeddings` rejecting unowned ids. - 4 additional integration tests on the service layer (mocked repo) covering empty candidate set, partial-ownership skips, and per-memory embedding failures recorded in `result.failed`. - 3 new e2e SQLite tests gated by FastEmbed availability (skipped when Hugging Face is unreachable). - `tests/e2e_sqlite/test_re_embedding_sqlite.py` fixture now skips cleanly instead of crashing when the FastEmbed model cannot be downloaded. Closes ScottRBK#39

ScottRBK · 2026-05-07T12:47:58Z

Thanks for tackling this — the structural choices are solid: reusing ReEmbeddingService rather than introducing a new service, keeping re_embed_all untouched, server-side ownership filtering, and the (rebuilt, skipped, failed) partial-success shape are all the right moves. The integration test coverage in test_targeted_rebuild_sqlite.py is also a good model to mirror.

Before getting into individual line-level changes, I want to raise a design question that might reshape some of the surface area, then list the concrete blockers and follow-ups.

Design discussion: rethinking `only_missing`

The premise behind only_missing is that some memories may have a missing embedding while others have a stale one. But in our codebase, embeddings are a mandatory invariant: MemoryService.create_memory writes the embedding atomically with the row (see app/services/memory_service.py:170-191), so no application code path can produce an embedding-less memory. The only ways to end up with one are:

Out-of-band writes (migrations, direct SQL, alternate ingest paths) — i.e. invariant violations
Mid-migration state from reset_embedding_storage() during re_embed_all

The expected lifecycle event — embedding model upgrade, dimension change, provider switch — produces stale embeddings, not missing ones. A user running rebuild_embeddings() with only_missing=True (the current default) after a model swap will see zero rebuilds, because their existing rows aren't missing, they're just wrong. They'd have to know to pass only_missing=False, which the metadata describes as "force a full refresh" — language that implies overriding a safety check rather than picking the actually-correct mode for the common case.

I'd like to propose dropping only_missing entirely. The tool unconditionally rebuilds every memory in the targeted set. Bypass recovery → caller passes the specific IDs the bypass left behind. Model swap → caller passes a project_id or omits scope (global within user). This is consistent with the actual invariant, removes the API trap, and the failed list still tells the caller anything that didn't succeed. We can always add a skip_if_current=False optimization flag later (with explicit dimension-mismatch detection) without breaking compatibility.

Happy to discuss if you see a use case I'm missing.

Blockers — must address before merge

1. Pagination silently skips memories when scope shrinks during processing

app/services/re_embedding_service.py:166-174

The loop uses range(0, total, batch_size) with total captured once before any writes. Under only_missing=True, each successful batch removes its rows from the filter set, so the next iteration's OFFSET batch_size skips over rows that haven't been processed yet. Concrete: 5 missing memories with batch_size=2 rebuilds IDs 1, 2, and 5 — IDs 3 and 4 silently fall through.

The tests don't catch this because their batch sizes always exceed the test data.

If we keep an only_missing-style filter, the fix needs to be cursor-based (keyset pagination). Replace the offset parameter on get_memories_for_targeted_rebuild with after_id, have the service track last_seen_id across batches, and ensure ORDER BY id ASC. This works for both filter-stable and filter-shrinking cases. If we drop only_missing per the design discussion above, the simpler fix (loop until empty result, no offset) also works.

2. Auto-link recomputation is missing

app/services/re_embedding_service.py:rebuild_targeted

Issue #39 explicitly required: "Also re-compute linked_memory_ids (Zettelkasten auto-linking) since it depends on the embedding." The current implementation regenerates the embedding but never calls find_similar_memories or create_links_batch. After a rebuild, the memory's vector neighbourhood has shifted but its links are still pointing at where it used to be.

Mirror the canonical pattern from MemoryService.create_memory:177-191: after each successful upsert, call memory_repo.find_similar_memories(...) then memory_repo.create_links_batch(...). Use the additive approach for now — create_links_batch already dedupes via its IntegrityError catch, so it's safe to call repeatedly. A future "replace" mode (delete prior auto-links first) would require an auto_link flag on the link rows to distinguish auto-created from manually-created links, which is a schema change out of scope for this PR.

Major fixes

3. Postgres UPDATE missing `user_id` predicate (TOCTOU window)

app/repositories/postgres/memory_repository.py:1057-1076

Ownership is checked via SELECT … WHERE user_id == :user_id and the result is held in an in-memory owned_ids set; the subsequent UPDATE filters only on id. That's a check-then-act pattern with a small window for cross-tenant writes if memories.user_id is ever mutated concurrently. The SQLite path at :1086-1107 has the same pattern.

Belt-and-braces fix: add user_id to the UPDATE/DELETE WHERE clauses themselves so ownership enforcement is atomic with the write. The in-memory ownership set can stay as an optimization, the WHERE-clause guard is the safety net.

For Postgres, this is a one-line addition. SQLite needs a subquery against memories since vec_memories has no user_id column directly.

4. Postgres `str(user_id)` deviation from existing convention

app/repositories/postgres/memory_repository.py:977, 986, 1060

Three new occurrences cast user_id to str. Existing methods in the same file pass the UUID object directly (:402, :412, :715). Functionally equivalent today via SQLAlchemy's type coercion, but the cast looks copy-pasted from the SQLite implementation (where the column is TEXT) and breaks the file's convention. Drop the str() calls — same behaviour, matches surrounding code.

Scoping semantics

5. `memory_ids` and `project_id` should follow tiered scoping

The issue spec used elif logic: explicit IDs win, else project, else global. The current implementation AND-combines the filters, so rebuild_embeddings(memory_ids=[1,2,3], project_id=5) silently drops any of those IDs not in project 5. That's a "wait, why didn't all my IDs rebuild?" debugging trap.

Suggested behaviour:

memory_ids provided alone → rebuild exactly those rows
project_id provided alone → rebuild memories in that project
Neither provided → global within user
Both provided: validate that every memory_id belongs to project_id. If any doesn't, raise ToolError("VALIDATION_ERROR: …"). Treat project_id as a guard rail, not a silent filter.
memory_ids that don't resolve at all (not found / not owned / obsolete) → append to failed with a clear reason rather than silently dropping.

6. `memory_ids=[]` should reject, not silently mean "no filter"

postgres/memory_repository.py:980, sqlite/memory_repository.py:1000

if memory_ids: is falsy for both None and []. A caller filtering down to an empty list ("rebuild nothing") gets "rebuild everything in scope" instead — combined with only_missing=False after a model swap, that's potentially thousands of unwanted embedding API calls.

Add an early validation in the adapter:

if memory_ids is not None and len(memory_ids) == 0:
    raise ToolError("VALIDATION_ERROR: memory_ids must be non-empty if provided; pass null for global scope")

Group this with the memory_ids-vs-project_id validation from item 5 at the top of the adapter method, before any DB work.

Cleanup

7. Drop `total_candidates` from the result type

app/services/re_embedding_service.py:26-37

Captured once before work begins; callers will reasonably assume it equals len(rebuilt) + len(skipped) + len(failed), but that invariant breaks under (a) the pagination bug, (b) concurrent writes between the count and the loop, and (c) batch-level persistence failures. Rather than rename or recompute, drop the field. The three outcome lists already let the caller compute any sum they want, honestly. Keep the count in the existing log line for observability, just not in the return shape.

8. Wording: "missing/stale" in user-facing docs

tool_metadata_registry.py:728-731, 757-759 and tool_adapters.py:447

If the design discussion above lands on dropping only_missing, the wording goes away with it. If we keep it in some form, replace "missing/stale" with "without an embedding" — the implementation has no concept of staleness, only row-presence.

Acknowledged but parked

We noticed that app/routes/mcp/memory_tools.py contains a parallel @mcp.tool()-decorated copy of rebuild_embeddings that isn't wired into the live tool surface (the live path is meta_tools → ToolRegistry → tool_adapters.py). This isn't unique to your PR — the entire _tools.py family is leftover scaffolding from a pre-meta-tools iteration of the architecture. We're going to clean those up in a separate sweep, so nothing for you to do here.

Suggested ordering

If you want to tackle this in pieces:

Design discussion on only_missing (let's settle this before you change implementation)
Blockers 1 + 2 (pagination + auto-links) in one commit
Majors 3 + 4 (Postgres UPDATE WHERE + str(user_id)) in one commit
Scoping items 5 + 6 in one commit
Cleanup items 7 + 8 in one commit

Tests for items 1, 5, 6 will need new coverage — particularly a multi-batch test with batch_size smaller than the candidate set, the both-passed validation rejection, and the empty-list rejection.

bertheto · 2026-05-08T21:32:31Z

Thanks for the thorough review - the framing of only_missing as an API trap is convincing, and the pagination + auto-link issues are real bugs we missed. Going to address everything in your suggested order.

Design discussion: agreed, dropping only_missing entirely.

Your invariant analysis is the right lens: MemoryService.create_memory makes embedding-less rows an out-of-band anomaly, and the realistic lifecycle event (model swap / dimension change) produces stale rows, not missing ones. Defaulting to only_missing=True would silently no-op the common case. Dropping the flag, treating "rebuild this set" as the unconditional contract, and deferring skip_if_current to a future PR (with explicit dimension detection) is cleaner.

Plan: 5 commits on this branch, same PR.

Drop only_missing, switch pagination to keyset on id ASC (after_id cursor, robust to concurrent inserts mid-loop), add auto-link recompute mirroring MemoryService.create_memory:177-191 after each successful upsert_targeted_embeddings so the new vector is already visible to find_similar_memories.
Postgres UPDATE / SQLite DELETE+INSERT WHERE clauses get user_id (atomic ownership), drop str(user_id) casts in the Postgres path to match :402, :412, :715.
Tiered scoping with hard validation: memory_ids=[] rejects, memory_ids + project_id is validated up-front, unresolved IDs land in failed with a reason.
Drop total_candidates from the result type (keep it in the log line), update "missing/stale" wording in metadata.
New tests: multi-batch with batch_size < candidate set, both-passed validation rejection, empty-list rejection, ownership-WHERE atomic guard.

Will keep the parallel @mcp.tool() copy in memory_tools.py in sync until your separate sweep removes the _tools.py family.

Pushing additively to feature/rebuild-embeddings-issue-39. Will ping when ready for re-review.

Rebuild targeted memories unconditionally, page by id cursor, and recompute additive auto-links after vectors are rewritten.

Close the targeted rebuild check-then-write window by carrying user ownership into the write predicates and align the Postgres UUID handling with existing repository conventions.

Reject ambiguous empty memory id lists, enforce project guards for explicit ids, and report unresolved ids in partial-success results.

Remove total_candidates from the MCP result contract while keeping candidate counts in logs for observability.

Add focused regression coverage for keyset pagination, cross-user vector safety, and auto-link recomputation ordering.

bertheto · 2026-05-08T21:49:14Z

Pushed the review follow-up as 5 additive commits in the order suggested:

59a8158 - drop only_missing, switch targeted rebuild pagination to after_id keyset, and recompute additive auto-links after successful embedding writes.
b423490 - add ownership predicates to targeted vector writes and align Postgres UUID handling with existing repository conventions.
2458d5c - validate explicit scopes (memory_ids=[], memory_ids + project_id) and report unresolved explicit IDs in failed instead of silently dropping them.
9489738 - remove total_candidates from the result contract while keeping the candidate count in logs.
78f1321 - add regression coverage for multi-batch keyset rebuilds, foreign-vector safety, and auto-link recomputation ordering.

Validation:

uv run pytest tests/integration/test_re_embedding_service.py tests/integration/test_targeted_rebuild_sqlite.py tests/e2e_sqlite/test_re_embedding_sqlite.py -q - 23 passed, 7 skipped (FastEmbed model unavailable/offline).
uv run pytest tests/integration -q - 439 passed, 1 failed: existing upstream test_project_service.py::test_list_projects_multiple ordering assertion (project-0 vs project-2), same pre-existing flake observed before these changes.
only_missing grep - no matches.
total_candidates remains only in the internal log line, not in the return shape.

Ready for re-review.

bertheto added 5 commits May 8, 2026 23:36

fix(rebuild_embeddings): drop only_missing and refresh links

59a8158

Rebuild targeted memories unconditionally, page by id cursor, and recompute additive auto-links after vectors are rewritten.

fix(rebuild_embeddings): enforce ownership during vector writes

b423490

Close the targeted rebuild check-then-write window by carrying user ownership into the write predicates and align the Postgres UUID handling with existing repository conventions.

fix(rebuild_embeddings): validate explicit scopes

2458d5c

Reject ambiguous empty memory id lists, enforce project guards for explicit ids, and report unresolved ids in partial-success results.

fix(rebuild_embeddings): simplify result shape

9489738

Remove total_candidates from the MCP result contract while keeping candidate counts in logs for observability.

test(rebuild_embeddings): cover review regressions

78f1321

Add focused regression coverage for keyset pagination, cross-user vector safety, and auto-link recomputation ordering.

ScottRBK merged commit 1cb9e0e into ScottRBK:main May 9, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mcp): add rebuild_embeddings tool (issue #39)#42

feat(mcp): add rebuild_embeddings tool (issue #39)#42
ScottRBK merged 6 commits into
ScottRBK:mainfrom
bertheto:feature/rebuild-embeddings-issue-39

bertheto commented May 5, 2026

Uh oh!

ScottRBK commented May 7, 2026

Uh oh!

bertheto commented May 8, 2026 •

edited

Loading

Uh oh!

bertheto commented May 8, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bertheto commented May 5, 2026

Summary

Why a new tool instead of reusing re_embed_all?

Design

Tool surface

Persistence layer

Service layer

Tests

Backward compatibility

Test plan

Uh oh!

ScottRBK commented May 7, 2026

Design discussion: rethinking only_missing

Blockers — must address before merge

1. Pagination silently skips memories when scope shrinks during processing

2. Auto-link recomputation is missing

Major fixes

3. Postgres UPDATE missing user_id predicate (TOCTOU window)

4. Postgres str(user_id) deviation from existing convention

Scoping semantics

5. memory_ids and project_id should follow tiered scoping

6. memory_ids=[] should reject, not silently mean "no filter"

Cleanup

7. Drop total_candidates from the result type

8. Wording: "missing/stale" in user-facing docs

Acknowledged but parked

Suggested ordering

Uh oh!

bertheto commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bertheto commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Why a new tool instead of reusing `re_embed_all`?

Design discussion: rethinking `only_missing`

3. Postgres UPDATE missing `user_id` predicate (TOCTOU window)

4. Postgres `str(user_id)` deviation from existing convention

5. `memory_ids` and `project_id` should follow tiered scoping

6. `memory_ids=[]` should reject, not silently mean "no filter"

7. Drop `total_candidates` from the result type

bertheto commented May 8, 2026 •

edited

Loading

bertheto commented May 8, 2026 •

edited

Loading