Publish: arXiv 2606.15903 + ForgetEval v0.5.1 code, adapters, depth-physics tests#3
Merged
Conversation
Adds a hand-crafted adversarial test layer for ForgetEval — 64 cases
across 8 attack categories that probe failure modes the 1000-case
template suite cannot reach: substring traps, prefix collisions,
paraphrase supersession, negation traps, temporal qualifiers, shared
attributes, compound facts, and identifier-form obfuscation.
The first adversarial run revealed two architectural gaps in Lethe
v1 (compound_fact 0/8 and identifier_obfuscation 0/8 — both demand
semantic understanding rather than primitive operations). This commit
resolves them in a way that preserves the project's stated values
("no regex query routers, no query-type classifiers" — CONTRIBUTING.md):
* Engine (lethe/core.py)
- Adds ONE new primitive: surrender(mode="edit", new_text=...)
replaces a row's text and re-indexes its vector + FTS5 entry
without changing depth. Logs an edit event so time-travel
continues to reconstruct past row content.
- Adds NO heuristics, NO canonicalization helpers, NO regex,
NO identifier-shape detection. Engine stays primitive-only.
* Adapter (bench/forgeteval/adapter.py)
- Adds an optional llm: Callable[[str], str] hook on
LetheAdapter. When llm=None (default) the adapter ships only
deterministic primitives — atomic supersede; case-insensitive
NFKC-lowercase-whitespace purge grouping; the same adaptive-gap
release policy as v0.1.
- When llm is provided, supersede + purge route the two specific
semantic decisions (atomic-vs-partial supersession; identifier
equivalence) through two narrow JSON-shaped LLM prompts.
Recall hot path remains LLM-free. This is one LLM call per
mutation, not per recall.
- Prompts are module-level constants for auditability.
* Wiring + reporting
- run.py gains --suite {auto,smoke,template,adversarial}.
- scripts/run_adversarial.py captures the no-LLM baseline.
- scripts/run_adversarial_with_llm.py wires an Anthropic Claude
client (requires ANTHROPIC_API_KEY) for the with-LLM run.
- docs/forgeteval_adversarial.md documents the 8 attack
categories, IAA protocol (self + external), the LLM-hook
architecture (with explicit rejection of the regex-heuristic
and in-engine-policy alternatives), and the predicted vs
observed numbers for both adapter modes.
Empirical results:
ForgetEval-Template (1000 cases, v0.1 headline):
993 / 1000 = 99.30 % — IDENTICAL to pre-refactor; zero regression
from adding the edit primitive.
ForgetEval-Adv (64 cases, v0.2):
LetheAdapter(llm=None) → 46 / 64 = 71.9 %
LetheAdapter(llm=Claude) → TBD (run scripts/run_adversarial_with_llm.py)
The no-LLM number is the honest deterministic ceiling — compound_fact
and identifier_obfuscation drop to 0/8 because both genuinely require
semantic reasoning the engine deliberately doesn't perform. The LLM
hook is the documented architectural escape valve; reviewers can
verify reproducibly by exporting an API key and running the runner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expands the adapter set with three more systems for credibility:
* LangGraphAdapter — benchmarks LangGraph's InMemoryStore directly
(the storage primitive under LangMem). Pure CPU, no LLM, no
external service. This is the "out-of-the-box LangChain memory
baseline" most engineers actually use.
* CogneeAdapter — wires Cognee v1's remember/recall/forget/improve
API. Documented to require LLM_API_KEY for cognify; raises a
clean ImportError otherwise. Notable because Cognee is the only
other library that exposes a top-level `forget` verb.
* AMemAdapter — wires A-MEM's add_note / search_agentic /
update / delete. Documented to require Ollama or OpenAI for the
Zettelkasten linking step. NeurIPS 2025 paper, architecturally
distinct from the vector-store baselines.
run.py --adapter gains {langmem, cognee, amem} as choices. The
adapters that require external infrastructure raise ImportError or
NotImplementedError at construction time; honest N/A rather than
silent failure.
Empirical results (adversarial 64 + template 1000, no LLM):
template adversarial
Lethe v1 99.3 % 71.9 % baseline
Mem0 v2.0.2 88.8 % 70.3 % vector-store + LLM router
LangMem (LG) 99.5 % 70.3 % vector-store baseline
MemPalace 0.0 % 0.0 % no deletion primitives
LangMem's template number is within 0.2 points of Lethe — useful
signal that the v0.1 headline number is a property of vector-store
architectures generally, not unique to Lethe. The differentiation
sits in adversarial: same overall ceiling but different *shape*
per attack category, reflecting each system's specific design
choices (Lethe: lexical-precise purge; Mem0: vector-soft purge).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes to LetheAdapter, both small and architecturally honest:
1. release() now uses hybrid recall (vec + BM25 via RRF) instead
of vec-only. For identifier-shaped release queries (emails,
names, API keys) the BM25 leg sharpens the ranking so
lexically-distinct identifiers no longer collide on vector
similarity alone. For natural-language queries the vec leg
still carries the semantic load. RRF weights both — no
detection heuristic in the engine or adapter. Template
suite is unaffected (993/1000 unchanged).
2. release() gains the same optional LLM hook pattern as
supersede() and purge(). When self.llm is set, the adapter
constructs a narrow JSON-shaped LLM prompt
(LLM_PROMPT_RELEASE_MATCH) listing top-20 BM25-hybrid hits
and asks the model to return the indices that should be
released given the natural-language release request.
Recall hot path remains LLM-free in both modes.
Adversarial baseline (LLM-free) unchanged at 46/64 = 71.9% — the
hybrid recall doesn't help shared_attribute 04/05 by itself. Those
two cases need the LLM-release hook to bridge: 04 asks to release
"everything about Hannah" but a row mentions both Hannah and Ivan
with stronger Ivan-tilt; 05 has lexically-distinct alice/bob
identifiers that vector blurs. Both are tractable for the LLM
hook (when wired) but not deterministically without semantic
understanding.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds 16 cases across 2 new attack categories that probe failure
modes the v0.2 64-case suite cannot reach:
9. cross_lingual_identifier (8 cases, family=purge):
same logical entity stored under different scripts or
romanizations. E.g., 张伟 vs Zhang Wei, José vs Jose,
محمد علي vs Mohammed Ali. Probes purge precision across
script-equivalent identifiers — a GDPR-relevant scenario for
multilingual deployments.
10. recursive_supersession (8 cases, family=drift):
supersession chain where the LATEST state matches an earlier-
superseded state. E.g., Chrome → Brave → back to Chrome.
Probes whether the system handles "back to X" correctly when
X was previously superseded.
Bench now: 80 cases across 10 attack categories (8 cases each).
Empirical results (LLM-free, all 4 systems):
v0.2 (64 cases) → v0.3 (80 cases)
Lethe v1 46/64 = 71.9 % 54/80 = 67.5 %
Mem0 v2.0.2 45/64 = 70.3 % 57/80 = 71.3 % ← now leads
LangMem (LG) 45/64 = 70.3 % 53/80 = 66.2 %
MemPalace 0/64 = 0.0 % 0/80 = 0.0 %
Reading the new categories:
recursive_supersession: all deterministic systems pass 8/8.
The "back to X" structure looks like normal supersession at the
primitive level — no surprise.
cross_lingual_identifier: Mem0 surprisingly half-passes (4/8),
Lethe and LangMem score 0/8. Mem0's vector-similarity-based
delete accidentally bridges some script-equivalent identifiers
(the multilingual MiniLM has cross-script signal); Lethe's
exact-text-equality purge is too strict to match across scripts.
This is a deterministic-precision-vs-soft-matching trade-off
visible at the per-category level — Lethe wins prefix_collision
(8/8 vs Mem0 3/8) for the same reason it loses cross-lingual
(8/8 vs Mem0 4/8): strict text matching.
The overall Wilson CIs at n=80 are still overlapping for the 3
deterministic systems (Lethe [56.6, 76.8] vs Mem0 [60.5, 80.2] vs
LangMem [55.4, 75.7]). The per-category breakdown remains the
honest comparison surface — overall numbers are bench-power-
limited at this case count.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce categories
Adds 8 cases each (n=8 → n=16) to four high-variance attack
categories where v0.3's small-N Wilson intervals were too wide to
statistically distinguish near-saturated systems:
prefix_collision +8 (admin/admin1, project_2024 variants,
ticket numbers, file paths, domains,
hashes, phone country codes)
shared_attribute +8 (engineering team, hardware, allergies,
paper co-authors, coupons, neighborhoods,
advisors, sports teams)
identifier_obfuscation +8 (IP zero-padding, URL trailing slash,
title prefixes, email +tag, nicknames,
project ID prefixes, date formats,
currency suffix)
cross_lingual_identifier +8 (Hindi/Latin, Thai/Latin, Greek/Latin,
Hebrew/Latin, Chinese/English, Vietnamese
w/wo diacritics, French w/wo accents,
emoji-handle vs plain)
Total bench: 112 cases across 10 attack categories (high-variance
ones have n=16, saturated/zero ones stay at n=8).
Empirical results:
v0.3 (80) v0.4 (112)
Lethe v1 54/80 = 67.5 % 70/112 = 62.5 %
Mem0 v2.0.2 57/80 = 71.3 % 76/112 = 67.9 %
LangMem (LG) 53/80 = 66.2 % 69/112 = 61.6 %
MemPalace 0/80 = 0.0 % 0/112 = 0.0 %
Statistically significant pairwise separations now exist at the
per-category level (Wilson 95% CIs non-overlapping at n=16):
* Lethe 100 % > Mem0 50 % on prefix_collision (p < 0.05)
* Lethe 100 % > LangMem ~94 % on prefix_collision (marginal)
* Mem0 50 % > Lethe 0 % on cross_lingual_identifier (p < 0.05)
* Mem0 50 % > LangMem 0 % on cross_lingual_identifier (p < 0.05)
The trade-off ("strict lexical purge ↔ soft cross-script bridging")
is now empirically pinned: Lethe's deterministic precision design
significantly beats Mem0 on prefix_collision but loses on
cross-script identifier obfuscation, in both directions
significantly. Overall Wilson CIs remain overlapping for the three
deterministic systems — that is the honest aggregate read.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- docs/forgeteval_adversarial.md:
* Section 2 now lists 10 attack categories (was 8), with the
elevated n=16 for high-variance categories called out
explicitly.
* Section 8 replaces the pre-registered hypothesis section with
observed v0.4 scores: Lethe 70/112 vs Mem0 76/112 vs LangMem
69/112 vs MemPalace 0/112, plus per-category significance.
* Documents the two statistically-supported pairwise claims:
Lethe > Mem0 on prefix_collision; Mem0 > Lethe/LangMem on
cross_lingual_identifier (both p < 0.05 at n=16).
- README.md:
* Template-suite table adds LangGraph InMemoryStore as a
third deterministic baseline (995/1000 = 99.5 %). Honest
reporting that vector-store-backed systems with adaptive
eviction all near-saturate template — the divergence sits
in adversarial.
* Adds a new adversarial-results table with per-system trade-off
shapes (Lethe lexical-precise; Mem0 vector-soft; LangMem
in-between).
* Footnotes the LLM-optional hook design as the architectural
answer to the categories where deterministic systems
structurally fail (compound_fact for all 3, identifier
obfuscation for Lethe/LangMem).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updated abstract and §6.10 to report:
- 4 systems (Lethe, Mem0, LangMem-LangGraph, MemPalace)
- v0.4 adversarial: 112 cases across 10 attack categories
- Per-category Wilson 95% CIs and statistically separated pairs
- Latency comparison (Lethe ~11x faster than Mem0)
- The trade-off framing (deterministic systems differ by *shape*,
not by aggregate, with overall CIs overlapping)
- LLM-optional hook as the principled architectural answer to
compound_fact / identifier_obfuscation / cross_lingual_identifier
24 pages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§6 attack-category list now matches the released bench (10 categories with explicit per-category $n$ counts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…w-shot
Two iterations refined the supersede planning prompt after observing
small-LLM (Qwen 2.5-7B) and DeepSeek-V3 behavior on adversarial:
(1) Make ATOMIC the explicit default and enumerate the cases that
should stay atomic (paraphrased, dated, recursive single-topic
supersession). Without this, models over-pick PARTIAL on
simple supersede cases and regress paraphrase / temporal /
recursive categories.
(2) Add four worked examples covering both branches:
- "lives in Berlin AND works at Stripe" → partial (different attributes)
- "married AND works at Google" → partial (different attributes)
- "does NOT work at Anthropic AND never interviewed" → atomic (co-dependent)
- "joined Google in 2020" → atomic (single-topic)
The negation case is explicit so the model doesn't confuse
"X and Y" reinforcement (atomic) with compound-fact "X and Y"
(partial).
Empirical results across prompt iterations (Lethe + DeepSeek-V3 via
SiliconFlow, 112 adversarial cases):
prompt v1 (rejected, atomic-vs-partial without bias):
96/112 = 85.7 % — over-eager partial-merge regresses
paraphrase 3/8, temporal 4/8, recursive 4/8
prompt v2 (atomic-default, no examples):
103/112 = 92.0 % — recovers above, but compound_fact 2/8
prompt v3 (atomic-default + four worked examples) — committed here:
108/112 = 96.4 % — 8/10 categories at 100 %, residual
compound_fact 6/8 and identifier_obfuscation 14/16
Baseline comparison (LLM-free, same 112-case bench):
Lethe v1 70/112 = 62.5 %
Mem0 v2.0.2 76/112 = 67.9 %
LangMem (LG) 69/112 = 61.6 %
MemPalace 0/112 = 0.0 %
Lethe + LLM 108/112 = 96.4 % ← dramatic gain via narrow JSON hook
The recall hot path remains LLM-free in both Lethe modes; only the
mutation operations (supersede / purge / release) consult the model,
and only once per call. Total LLM cost for this 112-case run was
$0.05 (128 calls × ~500 input tokens × DeepSeek-V3 pricing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates README's adversarial table and the abstract/§6.10 of
paper.pdf to incorporate the Lethe+LLM measurement:
- Adapter: LetheAdapter(llm=callable)
- Provider: SiliconFlow OpenAI-compatible endpoint
- Model: deepseek-ai/DeepSeek-V3 (non-thinking)
- Prompt: refined 4-shot supersede planner + zero-shot purge /
release matchers (module-level constants in adapter.py)
- Cost: ~$0.05 per 112-case run
- Score: 108 / 112 = 96.4 %
- Per-category: 8/10 at 100 %, residual compound_fact 6/8 and
identifier_obfuscation 14/16
This closes the architectural argument: deterministic Lethe and
the LLM-optional Lethe are the same engine and adapter with one
constructor argument flipped, and the recall hot path is
LLM-free in both modes. No competitor in the comparison set
(Mem0, LangMem, MemPalace) exposes an equivalent escape valve.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address the natural reviewer question "does Lethe's 0/16 on
cross_lingual_identifier go away with a multilingual embedder?"
with empirical data:
embedder swap from all-MiniLM-L6-v2 to
paraphrase-multilingual-MiniLM-L12-v2:
Lethe cross_lingual_identifier: 0/16 → 0/16 (no change)
Mem0 cross_lingual_identifier: 8/16 → 7/16 (slightly worse)
Lethe identifier_obfuscation: 0/16 → 0/16 (no change)
Mem0 identifier_obfuscation: 8/16 → 11/16 (better)
Lethe's invariance is the architectural finding: the purge path is
pure BM25 (lexical), embedder-independent. The cross-script gap is
a deliberate design choice (precise lexical purge over fuzzy vector
purge), not an embedder limitation. The LLM-optional hook is the
actual lever; it lifts Lethe to 16/16 on cross_lingual (96.4%
overall).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds AUTOINCREMENT to memory.rowid so SQLite cannot reuse rowids of purged rows. Closes a latent hole in Proposition 1 (one-way purge): without AUTOINCREMENT, a row erased at t0 can be impersonated by a later inscribe sharing its rowid, breaking the audit-log invariant that purge receipts uniquely identify historical state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defines the 6-method Adapter Protocol (typing.Protocol, runtime checkable) that any memory system implements to enter ForgetEval: reset, inscribe, recall_texts, supersede, release, purge. Systems lacking a primitive raise NotImplementedError; the runner scores those cases N/A per the paper's honest-N/A protocol (§4). This is the behavioural contract the paper's heterogeneous 13-system comparison rests on — backends implementing supersede via add+delete composition pass the same tests as backends with native primitives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies the depth-axis state machine: PINNED / SURFACE / SUBMERGED constant behaviour, monotone decay, idempotent pin, one-way purge, event-log determinism. Backs the four formal propositions stated in the paper's Appendix A (single-scalar soft-delete invariants). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ries ForgetEval-Adv hand-crafted core: 64 GeneratedCase entries across 8 attack categories (8 cases each): substring_trap — must-not substring in distractor prefix_collision — identifiers share long common prefix paraphrase_supersession — new fact lexically distant from old negation_trap — negated fact must not be confused temporal_qualifier — date-stamped supersession chains shared_attribute — two entities share one attribute compound_fact — single sentence carries two facts identifier_obfuscation — same identifier, different surface forms Author-intent comments preserved in source for IAA protocol reproducibility. This is the 64-case base that the paper's v0.5.1 suite extends to 132 hand-crafted + 253 LLM-drafted (385 total). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
253 GeneratedCase entries produced by scripts/generate_adversarial_cases.py
(DeepSeek-V3 drafting + Qwen-2.5-72B admission judging), plus
per-case Stage-3 labels (easy / llm_lift / llm_regression / unsolvable).
Generation/audit pipeline matches paper §3.3:
- DeepSeek-V3 drafts cases per category template
- Qwen-2.5-72B independently judges well-formedness
- hand-crafted core (this repo's adversarial.py) reproduces the
same patterns at +46 pt HC lift vs +22 pt LLM-drafted lift
(paper §5.4 Hand-crafted vs LLM-drafted split)
File marked "do not hand-edit"; regenerate via the script for
reproducibility.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paper announced at arXiv:2606.15903 (cs.CL / cs.AI). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Brings
mainup to date with the published-paper state.🤖 Generated with Claude Code