feat(mcp): add rebuild_embeddings tool (issue #39) #42
Adds a user-scoped, non-destructive MCP tool to rebuild embeddings on a memory subset, complementing the existing offline `--re-embed` CLI which performs a global storage reset.

The new `rebuild_embeddings` tool:
- is registered in both `memory_tools.py` (FastMCP) and the simplified tool registry (`tool_metadata_registry.py` + `tool_adapters.py`);
- never calls `reset_embedding_storage()`;
- supports `memory_ids`, `project_id`, and `only_missing` filters;
- returns a structured result (`total_candidates`, `rebuilt_ids`, `skipped_ids`, `failed`) so callers can act on partial success.

Persistence layer changes:
- new `MemoryRepository` methods `count_memories_for_targeted_rebuild`, `get_memories_for_targeted_rebuild`, `upsert_targeted_embeddings`;
- SQLite implementation: idempotent DELETE-then-INSERT into `vec_memories`, server-side ownership check;
- Postgres implementation: idempotent UPDATE on `memories.embedding`, same ownership check.

Service layer changes:
- `ReEmbeddingService.rebuild_targeted` reuses the same embedding pipeline (`build_embedding_text` + `EmbeddingsAdapter`) so vectors stay dimensionally consistent with the rest of the corpus.

Tests:
- 4 new integration tests on a real in-memory SQLite stack with a deterministic offline embedding adapter, covering: gap repair with `only_missing=True`, user-scoping (cross-user attempts blocked at the repository), no-op when `reset_embedding_storage` would be called, and `upsert_targeted_embeddings` rejecting unowned ids.
- 4 additional integration tests on the service layer (mocked repo) covering empty candidate set, partial-ownership skips, and per-memory embedding failures recorded in `result.failed`.
- 3 new e2e SQLite tests gated by FastEmbed availability (skipped when Hugging Face is unreachable).
- `tests/e2e_sqlite/test_re_embedding_sqlite.py` fixture now skips cleanly instead of crashing when the FastEmbed model cannot be downloaded.

Closes ScottRBK#39
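For readers who want a concrete picture of the SQLite write path described in this commit message, here is a minimal sketch of an ownership-checked, idempotent DELETE-then-INSERT. It is not the PR's code: the schema is a simplified assumption (a plain `vec_memories` table stands in for the sqlite-vec virtual table, `user_id` is assumed to be text, and embeddings are opaque blobs), and the return shape `(written, rejected)` is hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE memories (id INTEGER PRIMARY KEY, user_id TEXT NOT NULL);
        -- Plain table standing in for the sqlite-vec virtual table.
        CREATE TABLE vec_memories (memory_id INTEGER PRIMARY KEY, embedding BLOB);
        INSERT INTO memories VALUES (1, 'alice'), (2, 'bob');
        """
    )
    return conn


def upsert_targeted_embeddings(conn, user_id, updates):
    """Write embeddings only for memories owned by `user_id`.

    `updates` is a list of (memory_id, embedding_blob) pairs.
    Returns (written_ids, rejected_ids).
    """
    written, rejected = [], []
    with conn:  # single transaction: commit on success, roll back on error
        for memory_id, blob in updates:
            owned = conn.execute(
                "SELECT 1 FROM memories WHERE id = ? AND user_id = ?",
                (memory_id, user_id),
            ).fetchone()
            if owned is None:
                rejected.append(memory_id)  # unowned or unknown id: never written
                continue
            # DELETE first so a repeated call replaces the row instead of
            # failing on the duplicate key.
            conn.execute("DELETE FROM vec_memories WHERE memory_id = ?", (memory_id,))
            conn.execute(
                "INSERT INTO vec_memories (memory_id, embedding) VALUES (?, ?)",
                (memory_id, blob),
            )
            written.append(memory_id)
    return written, rejected
```

Calling the function twice with the same id leaves exactly one row per memory, which is the idempotency property the commit message refers to; ids that belong to another user are reported back rather than written.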
Thanks for tackling this — the structural choices are solid: reusing Before getting into individual line-level changes, I want to raise a design question that might reshape some of the surface area, then list the concrete blockers and follow-ups. Design discussion: rethinking
Thanks for the thorough review - the framing of Design discussion: agreed, dropping Your invariant analysis is the right lens: Plan: 5 commits on this branch, same PR.
Will keep the parallel Pushing additively to
Rebuild targeted memories unconditionally, page by id cursor, and recompute additive auto-links after vectors are rewritten.
Close the targeted rebuild check-then-write window by carrying user ownership into the write predicates and align the Postgres UUID handling with existing repository conventions.
Reject ambiguous empty memory id lists, enforce project guards for explicit ids, and report unresolved ids in partial-success results.
Remove total_candidates from the MCP result contract while keeping candidate counts in logs for observability.
Add focused regression coverage for keyset pagination, cross-user vector safety, and auto-link recomputation ordering.
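Two of the ideas in the commits above, paging by an id cursor and carrying user ownership into the write predicate, can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: the table layout (an inline `embedding` column on `memories`), the helper names `iter_memory_pages` and `write_embedding_if_owned`, and positive integer ids starting above 0 are all hypothetical.

```python
import sqlite3


def iter_memory_pages(conn, user_id, page_size):
    """Yield pages of (id, content) rows for one user using keyset pagination."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, content FROM memories "
            "WHERE user_id = ? AND id > ? ORDER BY id LIMIT ?",
            (user_id, last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # the cursor is the last id seen, not an offset


def write_embedding_if_owned(conn, user_id, memory_id, blob):
    """Update the embedding only if the row belongs to `user_id`.

    Ownership is part of the UPDATE predicate itself, so there is no gap
    between a separate ownership check and the write.
    """
    cur = conn.execute(
        "UPDATE memories SET embedding = ? WHERE id = ? AND user_id = ?",
        (blob, memory_id, user_id),
    )
    return cur.rowcount == 1
```

The general motivation for a cursor over `OFFSET` is that an offset-based scan can skip rows when the set being scanned changes while it is processed, whereas `id > last_id` always resumes from the last row actually seen.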
Pushed the review follow-up as 5 additive commits in the order suggested:
Validation:
Ready for re-review.
Summary
Adds a user-scoped, non-destructive `rebuild_embeddings` MCP tool, complementing the existing offline `--re-embed` CLI, which performs a global storage reset (`reset_embedding_storage()` + full re-embed). Closes #39.

The new tool lets a memory client (LLM agent, dashboard, ops script) recover a small set of memories whose embedding is missing or stale, or force-refresh the embeddings of a single project, without touching other users' data and without resetting global vector storage.

Why a new tool instead of reusing `re_embed_all`?

- `re_embed_all` is intentionally destructive: it resets the entire `vec_memories` table (SQLite) or column (Postgres) before re-embedding. This is correct for an offline migration but unsafe as a multi-tenant runtime tool.
- `id`, `links`, `created_at`, and provenance.
- `--re-embed` CLI/code path untouched.

Design
Tool surface

Registered in both registries used by the project:

- `app/routes/mcp/memory_tools.py` (FastMCP-decorated `@mcp.tool()`);
- `app/routes/mcp/tool_adapters.py` + `app/routes/mcp/tool_metadata_registry.py` (simplified registry).

Parameters:

| Parameter | Type | Default |
|---|---|---|
| `memory_ids` | `list[int] \| None` | `None` |
| `project_id` | `int \| None` | `None` |
| `only_missing` | `bool` | `True` |

Returns a dict with: `total_candidates`, `rebuilt_ids`, `skipped_ids`, `failed` (list of `{memory_id, reason}`) so the caller can act on partial success.

Persistence layer
Three new methods on `MemoryRepository`:

- `count_memories_for_targeted_rebuild(user_id, memory_ids?, project_id?, only_missing)`
- `get_memories_for_targeted_rebuild(user_id, limit, offset, ...)`
- `upsert_targeted_embeddings(user_id, updates)`

All filters are AND-combined; `user_id` is mandatory. `upsert_targeted_embeddings` performs server-side ownership validation against `memories.user_id` before writing, so a malicious caller cannot poison another user's vector by passing arbitrary ids.

- SQLite (`vec_memories` virtual table is keyed by `memory_id` and rejects duplicates): idempotent `DELETE` + `INSERT` per id.
- Postgres (`memories.embedding` inline column): idempotent `UPDATE`.

Service layer
`ReEmbeddingService.rebuild_targeted` reuses the existing pipeline (`build_embedding_text` + `EmbeddingsAdapter.generate_embedding`) so embeddings stay dimensionally consistent with the rest of the corpus. `reset_embedding_storage()` is never called in this path.

Tests
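As a rough picture of what the service-layer tests exercise, the sketch below pairs a deterministic offline embedder with a loop that records partial success. Everything here is a hypothetical stand-in: `DeterministicEmbedder`, `RebuildResult`, and this free-standing `rebuild_targeted` are illustrative names, the real adapter may be asynchronous, and only the field names `rebuilt_ids`, `skipped_ids`, and `failed` (with `{memory_id, reason}` entries) are taken from the description.

```python
import hashlib
from dataclasses import dataclass, field


class DeterministicEmbedder:
    """Offline stand-in: derives a fixed-size vector from a SHA-256 digest."""

    def __init__(self, dimensions=8):
        self.dimensions = dimensions  # must be <= 32 (SHA-256 digest length)

    def generate_embedding(self, text):
        if not text.strip():
            raise ValueError("empty embedding text")
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[: self.dimensions]]


@dataclass
class RebuildResult:
    rebuilt_ids: list = field(default_factory=list)
    skipped_ids: list = field(default_factory=list)
    failed: list = field(default_factory=list)


def rebuild_targeted(memories, owned_ids, embedder):
    """Embed each owned memory; record skips and per-memory failures.

    `memories` is a list of (memory_id, text) pairs and `owned_ids` is the
    set of ids the caller owns. One failing memory does not abort the rest.
    """
    result = RebuildResult()
    vectors = {}
    for memory_id, text in memories:
        if memory_id not in owned_ids:
            result.skipped_ids.append(memory_id)
            continue
        try:
            vectors[memory_id] = embedder.generate_embedding(text)
        except Exception as exc:
            result.failed.append({"memory_id": memory_id, "reason": str(exc)})
            continue
        result.rebuilt_ids.append(memory_id)
    return result, vectors
```

Because the embedder depends only on the input text, the same memory always produces the same vector, which lets assertions compare results exactly without downloading a model.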
- `tests/integration/test_targeted_rebuild_sqlite.py`: sqlite-vec, deterministic offline embedding adapter; covers gap-repair, user-scoping, no-reset guarantee, repository-level ownership rejection
- `tests/integration/test_re_embedding_service.py`: `result.failed`, `reset_embedding_storage` never called
- `tests/e2e_sqlite/test_re_embedding_sqlite.py`

Local run (offline): 418 passed, 1 unrelated pre-existing flake (`test_list_projects_multiple`). The new `test_targeted_rebuild_sqlite.py` module is fully offline-safe and ran green in CI on this branch.

Backward compatibility
- `re_embed_all` and the CLI `--re-embed` flag are unchanged.
- The new `MemoryRepository` methods are additive; existing implementations were extended, not modified.

Test plan
- `pytest tests/integration/test_targeted_rebuild_sqlite.py` -> 4 passed
- `pytest tests/integration/test_re_embedding_service.py` -> 12 passed (8 existing + 4 new)
- `pytest tests/e2e_sqlite/test_re_embedding_sqlite.py` -> 8 skipped offline (HF unreachable), runs cleanly when FastEmbed model is cached
- `ruff check` on all changed files -> All checks passed
- (`forgetful_ai==0.4.0` HTTP runtime) with a real Cursor MCP client; `rebuild_embeddings` with `only_missing=True` repaired exactly the gap memories, no other vectors touched.

Closes #39