local-first · 97.4% R@5 on LongMemEval raw · zero API calls
Every memory framework right now is racing in the same direction.
Mem0 promises perfect recall. MemPalace promises verbatim retention. Letta hands an LLM the whole context and asks it to manage itself. The benchmarks they all compete on — recall@K, hit-rate, MRR — measure one thing: how rarely does your agent lose a fact?
But agents in production don't die of losing facts. They die of keeping them. The password rotated three months ago, still suggested. The customer who exercised right-to-deletion, still in the recommender. The job title wrong since 2023 because the supersede never landed. The OTP from last Tuesday, permanently embedded next to a real preference. Memory systems fail by overgrowth, not by attrition. And no one is benchmarking that side.
The Greeks had a name for the missing operation.
Lethe (Λήθη) — one of the five rivers of Hades. Souls drank from it before reincarnation, leaving the former life behind.
The Greek word for truth — ἀλήθεια / aletheia — derives from
lethe itself.
Memory is what survives Lethe. Truth is what survives memory.
In the myth, Lethe is a river — surface, current, bed. Everything in the water has a depth. A leaf floats; a stone sinks; some things are weighed down enough to disappear.
Graph stores answer what is connected to what. Vector stores answer what is semantically similar. Neither answers the question agent memory actually faces: how deep is this fact, right now?
We built the simplest mental model that fits: every fact has one
number — depth. Every operation is a force on it.

    depth     state                          how it got there
    ─────────────────────────────────────────────────────────────
    +∞        pinned, immune to gravity      .pin()
    = 1.0     just inscribed, on surface     .inscribe()
    ∈ (0, 1)  sinking under gravity          .consolidate()
    = 0       submerged, present but mute    .surrender(mode="release")
    < 0       erased from disk               .surrender(mode="purge")
    ─────────────────────────────────────────────────────────────

No weight. No alive flag. No superseded_at column. One number,
one axis, one mental model.
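
To make the table concrete, here is a minimal sketch that maps a depth value to the region it falls in. The function names `depth_state` and `is_recallable` are hypothetical and are not part of the Lethe API. Reading "mute" as "skipped by recall" is an assumption, as is rejecting finite values above 1.0, which the table does not describe.

```python
import math


def depth_state(depth: float) -> str:
    """Name the region of the depth axis that a value falls in."""
    if math.isnan(depth):
        raise ValueError("depth must not be NaN")
    if depth == math.inf:
        return "pinned"      # immune to gravity
    if depth < 0:
        return "erased"      # purged from disk
    if depth == 0:
        return "submerged"   # present but mute
    if depth < 1.0:
        return "sinking"     # decaying under consolidation
    if depth == 1.0:
        return "surface"     # just inscribed
    # Finite values above 1.0 are not described in the table.
    raise ValueError(f"depth {depth!r} is outside the documented regions")


def is_recallable(depth: float) -> bool:
    """Assumption: 'mute' means recall skips anything at depth <= 0."""
    return depth_state(depth) in {"pinned", "surface", "sinking"}


for d in (math.inf, 1.0, 0.4, 0.0, -1.0):
    print(d, depth_state(d), is_recallable(d))
```

The point of the sketch is that no second field is needed: every state check is a comparison on one number.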
We benchmark on two axes. The conventional one: can a memory system find a fact when you need it? The one we propose: can it let go of a fact when you ask? Most frameworks score on the first; Lethe scores on both.
Recall axis: 500 LongMemEval questions, run under MemPalace's own
evaluation methodology with the same all-MiniLM-L6-v2 embedder and zero API calls.
| System | R@1 | R@5 | R@10 | Wall |
|---|---|---|---|---|
| MemPalace (raw) | 80.6% | 96.6% | 98.2% | 12 min |
| Lethe v1 | 85.4% | 97.4% | 99.0% | 14 min |
Lethe leads at every K; the gap is 6× wider at R@1 than at R@5
(+4.8 pp vs +0.8 pp). A single depth axis beats a palace of
wings, rooms, and drawers — most clearly where it matters: at #1.
1000 generated cases across five families — supersession, decay, amnesia, purge, drift — each probing one structural property a memory system must exhibit to be safe in production. Pass / fail is exact substring matching on top-k recall, no LLM judge, deterministic. Full methodology: docs/forgeteval.md.
| System | super | decay | amnesia | purge | drift | Overall |
|---|---|---|---|---|---|---|
| Lethe v1 | 100% | 100% | 98% | 100% | 99% | 99.3% (993 / 1000) |
| LangMem (LangGraph) | 100% | 100% | 98% | 100% | 99% | 99.5% (995 / 1000) |
| Mem0 (2.0.2) | 100% | 100% | 70% | 75% | 100% | 88.8% |
| MemPalace | 0% | 0% | 0% | 0% | 0% | 0% (no forgetting primitives) |
The template suite is near-saturated for vector-store-backed
systems with adaptive eviction: Lethe and LangGraph's InMemoryStore
both clear 99 %. Mem0's gap is on amnesia (70 %) and purge (75 %),
the two families that probe width-control and identifier precision.
MemPalace returns 0 because the API has no deletion primitives.
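
The template suite's pass/fail rule (exact substring matching on top-k recall, no LLM judge) can be sketched as below. This is an illustration only: the `Case` fields and the `passes` function are hypothetical names, the real case schema lives in docs/forgeteval.md, and case-sensitive matching is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    query: str
    # Substrings that must still surface in the top-k results.
    must_contain: list[str] = field(default_factory=list)
    # Substrings that must no longer surface (the forgotten fact).
    must_not_contain: list[str] = field(default_factory=list)


def passes(case: Case, top_k: list[str]) -> bool:
    """Deterministic check: every kept fact appears, no forgotten fact does."""
    kept = all(any(s in hit for hit in top_k) for s in case.must_contain)
    leaked = any(s in hit for hit in top_k for s in case.must_not_contain)
    return kept and not leaked


supersession = Case(
    query="Where does Alice work?",
    must_contain=["OpenAI"],
    must_not_contain=["Anthropic"],
)

clean = ["Alice now at OpenAI."]
stale = ["Alice now at OpenAI.", "Alice works at Anthropic."]
print(passes(supersession, clean), passes(supersession, stale))
```

A system with no forgetting primitive keeps returning the stale fact, so under a rule like this it fails every case that names a forbidden substring.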
For more discriminative comparison we run ForgetEval-Adv, a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted, oracle-validated) covering 10 attack categories (substring traps, prefix collisions, paraphrase supersession, negation, temporal qualifiers, shared attributes, compound facts, identifier obfuscation, cross-lingual identifiers, recursive supersession). See docs/forgeteval_adversarial.md.
| System | adversarial overall | trade-off shape |
|---|---|---|
| Lethe v1 | 244 / 385 (63.4 %) | 82 % prefix_collision, 0 % cross_lingual |
| Mem0 v2.0.2 | 263 / 385 (68.3 %) | multi-signal scoring, weaker identifier precision |
| LangGraph | 242 / 385 (62.9 %) | 0 % cross_lingual, no native edit primitive |
| MemPalace | 0 / 385 ( 0.0 %) | no deletion primitives |
| Lethe + LLM | 353 / 385 (91.7 %) | recovers cross_lingual + intent-aware deletion |
| LangGraph + LLM | 359 / 385 (93.2 %) | same hook, high-recall backbone |
The three deterministic systems cluster in a 63–68 % band with mutually overlapping Wilson CIs — the bench reads the trade-off, not a winner. The discriminative signal is per-category: deterministic stores hold the lexical/temporal categories but fail canonicalization (Lethe 0 % cross_lingual, 5 % identifier_obfuscation).
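
The overlap claim can be checked directly from the pass counts in the table. The sketch below computes Wilson score intervals with the standard formula; the 95 % level (z = 1.96) is an assumption, since the confidence level is not stated here.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


# Pass counts out of 385 for the three deterministic systems.
passed = {"Lethe v1": 244, "Mem0 v2.0.2": 263, "LangGraph": 242}
intervals = {name: wilson_interval(k, 385) for name, k in passed.items()}

for name, (lo, hi) in intervals.items():
    print(f"{name:12s} {lo:.3f} .. {hi:.3f}")

names = list(intervals)
all_overlap = all(
    overlaps(intervals[a], intervals[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
)
print("mutually overlapping:", all_overlap)
```

At this level the intervals come out at roughly 58 to 68 % for Lethe, 64 to 73 % for Mem0, and 58 to 68 % for LangGraph, so every pair overlaps.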
The +LLM rows use the optional `llm: Callable[[str], str]` hook on the
adapter, wired to DeepSeek-V3 via SiliconFlow. Cost: ~$0.17 for a
full 385-case run. The recall hot path stays LLM-free; only the
three mutation operations (supersede, purge, release) consult
the model. The lift travels across backends (+28.3 pt for Lethe,
+30.3 pt for LangGraph), so it is the placement of the hook, not the
storage engine, that earns it.
Some attack categories need semantic understanding that the engine
deliberately does not provide (compound_fact across all three systems,
identifier_obfuscation for Lethe / LangMem). The
`LetheAdapter(llm=...)` hook routes those decisions to an LLM at
mutation time and leaves recall untouched.
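
The placement of the hook can be illustrated with a toy store. This is a sketch under stated assumptions, not the `LetheAdapter` implementation: the class `HookedStore`, the prompt wording, and the stub `fake_llm` are all hypothetical. Only the hook's type, `Callable[[str], str]`, comes from the text above.

```python
from typing import Callable, Optional


class HookedStore:
    """Toy store: recall never calls the model, purge may consult it."""

    def __init__(self, llm: Optional[Callable[[str], str]] = None):
        self.facts: dict[int, str] = {}
        self.llm = llm
        self._next_id = 1

    def inscribe(self, text: str) -> int:
        mid = self._next_id
        self.facts[mid] = text
        self._next_id += 1
        return mid

    def recall(self, query: str) -> list[str]:
        # Hot path: plain token overlap, no model call.
        terms = set(query.lower().split())
        return [t for t in self.facts.values() if terms & set(t.lower().split())]

    def purge(self, request: str) -> list[int]:
        # Mutation path: optionally ask the model for the canonical identifier.
        needle = request
        if self.llm is not None:
            needle = self.llm(f"Return only the canonical identifier in: {request}").strip()
        doomed = [mid for mid, text in self.facts.items() if needle.lower() in text.lower()]
        for mid in doomed:
            del self.facts[mid]
        return doomed


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model; records each prompt it receives."""
    calls.append(prompt)
    return "alice@acme.io"


plain = HookedStore()
plain.inscribe("Contact: alice@acme.io")
plain_deleted = plain.purge("alice at acme dot io")    # obfuscated identifier

hooked = HookedStore(llm=fake_llm)
hooked.inscribe("Contact: alice@acme.io")
hits = hooked.recall("alice@acme.io")                  # recall path
hooked_deleted = hooked.purge("alice at acme dot io")  # mutation path
print(plain_deleted, hooked_deleted, len(calls))
```

The obfuscated identifier slips past the deterministic store, while the hooked store resolves it with one model call made at purge time and none at recall time.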
Reproduce:

    py bench/forgeteval/run.py --adapter {lethe|mem0|langmem|mempalace} --suite {template|adversarial}

ForgetEval is downstream of the depth model, and the depth model is downstream of ForgetEval. A failing `purge_gdpr` case in early runs forced `recall(lexical=True)` into the core as a first-class primitive. Both tables above reflect that loop.
- One physical axis: `depth`. Every state (pinned, surfaced, sinking, submerged, erased) is a numeric region. No status flags.
- Single SQLite file. Three sub-tables (`memory`, `memory_vec`, `memory_fts`) keyed by shared `rowid`, plus an append-only `event` log and a `supersession` edge table. No external services.
- Two retrieval primitives. `recall(query)` is RRF-blended vec + BM25; `recall(query, lexical=True)` is pure BM25. Purge uses the second: deleting `alice@acme.io` is a lexical lookup by identifier, not a semantic search for "similar customers."
- Verifiable forgetting. Every signed purge returns an Ed25519-signed receipt anchored to a Merkle root over the event log. Tamper with any past event afterwards and the receipt fails verification. No other open-source memory framework can produce this proof because none of them keep the log to anchor to.
- Time-travel built in. `recall(query, at=T)` reconstructs depth state at any past timestamp from the event log.
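
For readers unfamiliar with RRF, the blend named in the retrieval bullet can be sketched as follows. This is the textbook reciprocal rank fusion formula, not Lethe's code: the function name `rrf_blend` is hypothetical, and the constant `k = 60` is the value commonly used in the literature, assumed here because the engine's actual constant is not stated.

```python
def rrf_blend(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


vec_ranking = ["a", "b", "c"]    # order from the vector index
bm25_ranking = ["c", "a", "d"]   # order from the full-text index

blended = rrf_blend([vec_ranking, bm25_ranking])
lexical_only = rrf_blend([bm25_ranking])
print(blended, lexical_only)
```

Documents ranked well by both lists rise to the top of the blend. Passing only the BM25 ranking leaves its order unchanged, which mirrors the role of `lexical=True`: an identifier lookup is not diluted by semantic neighbours.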

    pip install "pylethe[embed,crypto,mcp]"

The PyPI distribution name is `pylethe` (the `lethe` slot was already
taken on PyPI by an unrelated package); the import remains
`from lethe import Lethe`.
Library:

    from lethe import Lethe

    agent = Lethe("./agent.db")
    mid = agent.inscribe("Alice works at Anthropic.")

    # Each call below shows one primitive on its own; they are not meant to run in sequence.
    agent.surrender(mid, mode="release")                          # depth → 0
    agent.surrender({"old": mid, "new": "Alice now at OpenAI."},
                    mode="supersede")                             # old sinks, new surfaces
    agent.surrender(mid, mode="purge")                            # erased from disk
    agent.pin(mid)                                                # depth → +∞

CLI, one subcommand per primitive:

    lethe inscribe "Alice works at Anthropic."
    lethe recall "Where does Alice work?"
    lethe supersede 1 --new "Alice now at OpenAI."
    lethe blame "Alice's job"
    lethe consolidate
    lethe ingest ~/notes                        # batch: *.md *.txt *.rst

    # Verifiable purge
    lethe keygen
    lethe --db agent.db purge --signed 42       # emits receipt JSON
    lethe verify-receipt receipt.json --db agent.db --db-check

The database defaults to `~/.lethe/agent.db`. Pass `--json` on any subcommand for
machine-readable output.
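
The signed-purge commands rest on the property stated earlier: a receipt anchored to a Merkle root over the event log stops verifying once any past event changes. The sketch below shows only that tamper-evidence step, using SHA-256 from the standard library. The event fields, the leaf and node prefixes, and the odd-level duplication rule are assumptions for illustration; Lethe's actual tree layout and its Ed25519 signing step are not reproduced here.

```python
import hashlib
import json


def leaf_hash(event: dict) -> bytes:
    """Hash one event in a canonical JSON form (sorted keys, compact separators)."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(b"\x00" + payload).digest()


def merkle_root(events: list[dict]) -> bytes:
    """Fold leaf hashes pairwise until a single root remains."""
    level = [leaf_hash(e) for e in events]
    if not level:
        return hashlib.sha256(b"").digest()
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical event log; the real schema is not shown in this README.
log = [
    {"seq": 1, "op": "inscribe", "id": 42},
    {"seq": 2, "op": "consolidate"},
    {"seq": 3, "op": "purge", "id": 42},
]
root_at_purge = merkle_root(log)

tampered = [dict(e) for e in log]
tampered[0]["id"] = 7  # rewrite history after the receipt was issued

print(merkle_root(log) == root_at_purge)       # unchanged log reproduces the root
print(merkle_root(tampered) == root_at_purge)  # edited log does not
```

A receipt that signs `root_at_purge` therefore commits to the whole history up to the purge, which is why a store without an event log has nothing to anchor such a proof to.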
MCP — eleven tools exposed over stdio (every core operation plus signed-purge receipts). Add to Claude Desktop / Claude Code / Cursor:

    {
      "mcpServers": {
        "lethe": {
          "command": "python",
          "args": ["-m", "lethe.mcp_server"],
          "env": {"LETHE_DB": "/absolute/path/to/agent.db"}
        }
      }
    }

Runnable cookbook in `recipes/` for the common patterns:
OTP TTL, GDPR purge with cryptographic receipt, belief
revision via supersession, pinning user preferences, and
time-travel debugging. Each recipe is a self-contained ~40-line
script that runs without fastembed, for example `python recipes/02_gdpr_purge_receipt.py`.
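
As background for the time-travel pattern, `recall(query, at=T)` is described as reconstructing depth state from the event log. A minimal sketch of that replay idea follows. The `Event` record and the `depths_at` function are hypothetical: the real event schema is not documented in this README, and storing the post-event depth on each record is an assumption made to keep the replay short.

```python
from typing import NamedTuple


class Event(NamedTuple):
    ts: float        # when the event happened (seconds)
    memory_id: int
    depth: float     # depth of the memory after this event


def depths_at(log: list[Event], at: float) -> dict[int, float]:
    """Replay events up to and including time `at`; later events are ignored."""
    state: dict[int, float] = {}
    for ev in sorted(log, key=lambda e: e.ts):
        if ev.ts > at:
            break
        state[ev.memory_id] = ev.depth
    return state


log = [
    Event(ts=100.0, memory_id=1, depth=1.0),   # inscribed
    Event(ts=200.0, memory_id=1, depth=0.0),   # released
    Event(ts=300.0, memory_id=2, depth=1.0),   # a second fact inscribed
]

print(depths_at(log, at=150.0))
print(depths_at(log, at=250.0))
```

Because the log is append-only, the same replay gives the same answer no matter when it is run, which is what makes past states debuggable.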
v1.0.0-alpha. Core implemented and tested:

    $ pytest tests
    14 passed in 0.65s

Roadmap (next):
- Human-curated adversarial ForgetEval. Substring traps, prefix collisions, paraphrase chains. Template-generated 1000-case is the floor, not the ceiling.
- Receipt-verification benchmark family. Does the system produce auditable proof of deletion? A new ForgetEval axis no other framework even attempts.
- Adaptive consolidation policies. `consolidate()` uses one fixed decay law; we want per-domain policies, since financial records decay slower than chat memory.
- Production-density distractor corpora. Synthetic office-trivia fillers replaced with real long-form text (Wikipedia, code, emails) for a tougher recall environment.
- Pluggable retrieval backends. A `Backend` protocol so the default SQLite + vec0 + FTS5 stack can be swapped for Postgres + pgvector, Pinecone, Weaviate, or a custom store. The depth axis and surrender / recall semantics stay identical; only the storage layer changes.
- Optional LLM hooks at inscribe and consolidation time. Entity extraction at inscribe, semantic deduplication, and LLM-guided consolidation policies (which facts to promote, which to release). The recall path stays LLM-free — determinism and latency are non-negotiable on the hot path.
📄 ForgetEval: Benchmarking the Forgetting Axis of Agent Memory Systems — Dongxu Yang, DeepLethe, 2026.
Full methodology, formal model of the depth axis (Propositions 1–4), 1000-case ForgetEval, 5-seed variance, distractor sweep, component ablations, and LongMemEval-S comparison. 23 pages, MIT.

    @misc{yang2026forgeteval,
      author       = {Yang, Dongxu},
      title        = {ForgetEval: Benchmarking the Forgetting Axis of
                      Agent Memory Systems},
      year         = {2026},
      howpublished = {\url{https://github.com/deeplethe/lethe/blob/main/paper/paper.pdf}},
    }

MIT. See LICENSE.
