Publish: arXiv 2606.15903 + ForgetEval v0.5.1 code, adapters, depth-physics tests by WaylandYang · Pull Request #3 · deeplethe/lethe

WaylandYang · 2026-06-16T03:40:39Z

Brings main up to date with the published-paper state.

arXiv badge + CITATION.cff (arXiv:2606.15903)
AUTOINCREMENT schema (one-way-purge invariant)
6-method Adapter Protocol
Hand-crafted (64) + LLM-generated (253) adversarial layers
Depth-physics smoke tests
gitignore .coverage

🤖 Generated with Claude Code

Adds a hand-crafted adversarial test layer for ForgetEval — 64 cases across 8 attack categories that probe failure modes the 1000-case template suite cannot reach: substring traps, prefix collisions, paraphrase supersession, negation traps, temporal qualifiers, shared attributes, compound facts, and identifier-form obfuscation. The first adversarial run revealed two architectural gaps in Lethe v1 (compound_fact 0/8 and identifier_obfuscation 0/8 — both demand semantic understanding rather than primitive operations). This commit resolves them in a way that preserves the project's stated values ("no regex query routers, no query-type classifiers" — CONTRIBUTING.md): * Engine (lethe/core.py) - Adds ONE new primitive: surrender(mode="edit", new_text=...) replaces a row's text and re-indexes its vector + FTS5 entry without changing depth. Logs an edit event so time-travel continues to reconstruct past row content. - Adds NO heuristics, NO canonicalization helpers, NO regex, NO identifier-shape detection. Engine stays primitive-only. * Adapter (bench/forgeteval/adapter.py) - Adds an optional llm: Callable[[str], str] hook on LetheAdapter. When llm=None (default) the adapter ships only deterministic primitives — atomic supersede; case-insensitive NFKC-lowercase-whitespace purge grouping; the same adaptive-gap release policy as v0.1. - When llm is provided, supersede + purge route the two specific semantic decisions (atomic-vs-partial supersession; identifier equivalence) through two narrow JSON-shaped LLM prompts. Recall hot path remains LLM-free. This is one LLM call per mutation, not per recall. - Prompts are module-level constants for auditability. * Wiring + reporting - run.py gains --suite {auto,smoke,template,adversarial}. - scripts/run_adversarial.py captures the no-LLM baseline. - scripts/run_adversarial_with_llm.py wires an Anthropic Claude client (requires ANTHROPIC_API_KEY) for the with-LLM run. - docs/forgeteval_adversarial.md documents the 8 attack categories, IAA protocol (self + external), the LLM-hook architecture (with explicit rejection of the regex-heuristic and in-engine-policy alternatives), and the predicted vs observed numbers for both adapter modes. Empirical results: ForgetEval-Template (1000 cases, v0.1 headline): 993 / 1000 = 99.30 % — IDENTICAL to pre-refactor; zero regression from adding the edit primitive. ForgetEval-Adv (64 cases, v0.2): LetheAdapter(llm=None) → 46 / 64 = 71.9 % LetheAdapter(llm=Claude) → TBD (run scripts/run_adversarial_with_llm.py) The no-LLM number is the honest deterministic ceiling — compound_fact and identifier_obfuscation drop to 0/8 because both genuinely require semantic reasoning the engine deliberately doesn't perform. The LLM hook is the documented architectural escape valve; reviewers can verify reproducibly by exporting an API key and running the runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Expands the adapter set with three more systems for credibility: * LangGraphAdapter — benchmarks LangGraph's InMemoryStore directly (the storage primitive under LangMem). Pure CPU, no LLM, no external service. This is the "out-of-the-box LangChain memory baseline" most engineers actually use. * CogneeAdapter — wires Cognee v1's remember/recall/forget/improve API. Documented to require LLM_API_KEY for cognify; raises a clean ImportError otherwise. Notable because Cognee is the only other library that exposes a top-level `forget` verb. * AMemAdapter — wires A-MEM's add_note / search_agentic / update / delete. Documented to require Ollama or OpenAI for the Zettelkasten linking step. NeurIPS 2025 paper, architecturally distinct from the vector-store baselines. run.py --adapter gains {langmem, cognee, amem} as choices. The adapters that require external infrastructure raise ImportError or NotImplementedError at construction time; honest N/A rather than silent failure. Empirical results (adversarial 64 + template 1000, no LLM): template adversarial Lethe v1 99.3 % 71.9 % baseline Mem0 v2.0.2 88.8 % 70.3 % vector-store + LLM router LangMem (LG) 99.5 % 70.3 % vector-store baseline MemPalace 0.0 % 0.0 % no deletion primitives LangMem's template number is within 0.2 points of Lethe — useful signal that the v0.1 headline number is a property of vector-store architectures generally, not unique to Lethe. The differentiation sits in adversarial: same overall ceiling but different *shape* per attack category, reflecting each system's specific design choices (Lethe: lexical-precise purge; Mem0: vector-soft purge). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two changes to LetheAdapter, both small and architecturally honest: 1. release() now uses hybrid recall (vec + BM25 via RRF) instead of vec-only. For identifier-shaped release queries (emails, names, API keys) the BM25 leg sharpens the ranking so lexically-distinct identifiers no longer collide on vector similarity alone. For natural-language queries the vec leg still carries the semantic load. RRF weights both — no detection heuristic in the engine or adapter. Template suite is unaffected (993/1000 unchanged). 2. release() gains the same optional LLM hook pattern as supersede() and purge(). When self.llm is set, the adapter constructs a narrow JSON-shaped LLM prompt (LLM_PROMPT_RELEASE_MATCH) listing top-20 BM25-hybrid hits and asks the model to return the indices that should be released given the natural-language release request. Recall hot path remains LLM-free in both modes. Adversarial baseline (LLM-free) unchanged at 46/64 = 71.9% — the hybrid recall doesn't help shared_attribute 04/05 by itself. Those two cases need the LLM-release hook to bridge: 04 asks to release "everything about Hannah" but a row mentions both Hannah and Ivan with stronger Ivan-tilt; 05 has lexically-distinct alice/bob identifiers that vector blurs. Both are tractable for the LLM hook (when wired) but not deterministically without semantic understanding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds 16 cases across 2 new attack categories that probe failure modes the v0.2 64-case suite cannot reach: 9. cross_lingual_identifier (8 cases, family=purge): same logical entity stored under different scripts or romanizations. E.g., 张伟 vs Zhang Wei, José vs Jose, محمد علي vs Mohammed Ali. Probes purge precision across script-equivalent identifiers — a GDPR-relevant scenario for multilingual deployments. 10. recursive_supersession (8 cases, family=drift): supersession chain where the LATEST state matches an earlier- superseded state. E.g., Chrome → Brave → back to Chrome. Probes whether the system handles "back to X" correctly when X was previously superseded. Bench now: 80 cases across 10 attack categories (8 cases each). Empirical results (LLM-free, all 4 systems): v0.2 (64 cases) → v0.3 (80 cases) Lethe v1 46/64 = 71.9 % 54/80 = 67.5 % Mem0 v2.0.2 45/64 = 70.3 % 57/80 = 71.3 % ← now leads LangMem (LG) 45/64 = 70.3 % 53/80 = 66.2 % MemPalace 0/64 = 0.0 % 0/80 = 0.0 % Reading the new categories: recursive_supersession: all deterministic systems pass 8/8. The "back to X" structure looks like normal supersession at the primitive level — no surprise. cross_lingual_identifier: Mem0 surprisingly half-passes (4/8), Lethe and LangMem score 0/8. Mem0's vector-similarity-based delete accidentally bridges some script-equivalent identifiers (the multilingual MiniLM has cross-script signal); Lethe's exact-text-equality purge is too strict to match across scripts. This is a deterministic-precision-vs-soft-matching trade-off visible at the per-category level — Lethe wins prefix_collision (8/8 vs Mem0 3/8) for the same reason it loses cross-lingual (8/8 vs Mem0 4/8): strict text matching. The overall Wilson CIs at n=80 are still overlapping for the 3 deterministic systems (Lethe [56.6, 76.8] vs Mem0 [60.5, 80.2] vs LangMem [55.4, 75.7]). The per-category breakdown remains the honest comparison surface — overall numbers are bench-power- limited at this case count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ce categories Adds 8 cases each (n=8 → n=16) to four high-variance attack categories where v0.3's small-N Wilson intervals were too wide to statistically distinguish near-saturated systems: prefix_collision +8 (admin/admin1, project_2024 variants, ticket numbers, file paths, domains, hashes, phone country codes) shared_attribute +8 (engineering team, hardware, allergies, paper co-authors, coupons, neighborhoods, advisors, sports teams) identifier_obfuscation +8 (IP zero-padding, URL trailing slash, title prefixes, email +tag, nicknames, project ID prefixes, date formats, currency suffix) cross_lingual_identifier +8 (Hindi/Latin, Thai/Latin, Greek/Latin, Hebrew/Latin, Chinese/English, Vietnamese w/wo diacritics, French w/wo accents, emoji-handle vs plain) Total bench: 112 cases across 10 attack categories (high-variance ones have n=16, saturated/zero ones stay at n=8). Empirical results: v0.3 (80) v0.4 (112) Lethe v1 54/80 = 67.5 % 70/112 = 62.5 % Mem0 v2.0.2 57/80 = 71.3 % 76/112 = 67.9 % LangMem (LG) 53/80 = 66.2 % 69/112 = 61.6 % MemPalace 0/80 = 0.0 % 0/112 = 0.0 % Statistically significant pairwise separations now exist at the per-category level (Wilson 95% CIs non-overlapping at n=16): * Lethe 100 % > Mem0 50 % on prefix_collision (p < 0.05) * Lethe 100 % > LangMem ~94 % on prefix_collision (marginal) * Mem0 50 % > Lethe 0 % on cross_lingual_identifier (p < 0.05) * Mem0 50 % > LangMem 0 % on cross_lingual_identifier (p < 0.05) The trade-off ("strict lexical purge ↔ soft cross-script bridging") is now empirically pinned: Lethe's deterministic precision design significantly beats Mem0 on prefix_collision but loses on cross-script identifier obfuscation, in both directions significantly. Overall Wilson CIs remain overlapping for the three deterministic systems — that is the honest aggregate read. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- docs/forgeteval_adversarial.md: * Section 2 now lists 10 attack categories (was 8), with the elevated n=16 for high-variance categories called out explicitly. * Section 8 replaces the pre-registered hypothesis section with observed v0.4 scores: Lethe 70/112 vs Mem0 76/112 vs LangMem 69/112 vs MemPalace 0/112, plus per-category significance. * Documents the two statistically-supported pairwise claims: Lethe > Mem0 on prefix_collision; Mem0 > Lethe/LangMem on cross_lingual_identifier (both p < 0.05 at n=16). - README.md: * Template-suite table adds LangGraph InMemoryStore as a third deterministic baseline (995/1000 = 99.5 %). Honest reporting that vector-store-backed systems with adaptive eviction all near-saturate template — the divergence sits in adversarial. * Adds a new adversarial-results table with per-system trade-off shapes (Lethe lexical-precise; Mem0 vector-soft; LangMem in-between). * Footnotes the LLM-optional hook design as the architectural answer to the categories where deterministic systems structurally fail (compound_fact for all 3, identifier obfuscation for Lethe/LangMem). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Updated abstract and §6.10 to report: - 4 systems (Lethe, Mem0, LangMem-LangGraph, MemPalace) - v0.4 adversarial: 112 cases across 10 attack categories - Per-category Wilson 95% CIs and statistically separated pairs - Latency comparison (Lethe ~11x faster than Mem0) - The trade-off framing (deterministic systems differ by *shape*, not by aggregate, with overall CIs overlapping) - LLM-optional hook as the principled architectural answer to compound_fact / identifier_obfuscation / cross_lingual_identifier 24 pages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

§6 attack-category list now matches the released bench (10 categories with explicit per-category $n$ counts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…w-shot Two iterations refined the supersede planning prompt after observing small-LLM (Qwen 2.5-7B) and DeepSeek-V3 behavior on adversarial: (1) Make ATOMIC the explicit default and enumerate the cases that should stay atomic (paraphrased, dated, recursive single-topic supersession). Without this, models over-pick PARTIAL on simple supersede cases and regress paraphrase / temporal / recursive categories. (2) Add four worked examples covering both branches: - "lives in Berlin AND works at Stripe" → partial (different attributes) - "married AND works at Google" → partial (different attributes) - "does NOT work at Anthropic AND never interviewed" → atomic (co-dependent) - "joined Google in 2020" → atomic (single-topic) The negation case is explicit so the model doesn't confuse "X and Y" reinforcement (atomic) with compound-fact "X and Y" (partial). Empirical results across prompt iterations (Lethe + DeepSeek-V3 via SiliconFlow, 112 adversarial cases): prompt v1 (rejected, atomic-vs-partial without bias): 96/112 = 85.7 % — over-eager partial-merge regresses paraphrase 3/8, temporal 4/8, recursive 4/8 prompt v2 (atomic-default, no examples): 103/112 = 92.0 % — recovers above, but compound_fact 2/8 prompt v3 (atomic-default + four worked examples) — committed here: 108/112 = 96.4 % — 8/10 categories at 100 %, residual compound_fact 6/8 and identifier_obfuscation 14/16 Baseline comparison (LLM-free, same 112-case bench): Lethe v1 70/112 = 62.5 % Mem0 v2.0.2 76/112 = 67.9 % LangMem (LG) 69/112 = 61.6 % MemPalace 0/112 = 0.0 % Lethe + LLM 108/112 = 96.4 % ← dramatic gain via narrow JSON hook The recall hot path remains LLM-free in both Lethe modes; only the mutation operations (supersede / purge / release) consult the model, and only once per call. Total LLM cost for this 112-case run was $0.05 (128 calls × ~500 input tokens × DeepSeek-V3 pricing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Updates README's adversarial table and the abstract/§6.10 of paper.pdf to incorporate the Lethe+LLM measurement: - Adapter: LetheAdapter(llm=callable) - Provider: SiliconFlow OpenAI-compatible endpoint - Model: deepseek-ai/DeepSeek-V3 (non-thinking) - Prompt: refined 4-shot supersede planner + zero-shot purge / release matchers (module-level constants in adapter.py) - Cost: ~$0.05 per 112-case run - Score: 108 / 112 = 96.4 % - Per-category: 8/10 at 100 %, residual compound_fact 6/8 and identifier_obfuscation 14/16 This closes the architectural argument: deterministic Lethe and the LLM-optional Lethe are the same engine and adapter with one constructor argument flipped, and the recall hot path is LLM-free in both modes. No competitor in the comparison set (Mem0, LangMem, MemPalace) exposes an equivalent escape valve. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address the natural reviewer question "does Lethe's 0/16 on cross_lingual_identifier go away with a multilingual embedder?" with empirical data: embedder swap from all-MiniLM-L6-v2 to paraphrase-multilingual-MiniLM-L12-v2: Lethe cross_lingual_identifier: 0/16 → 0/16 (no change) Mem0 cross_lingual_identifier: 8/16 → 7/16 (slightly worse) Lethe identifier_obfuscation: 0/16 → 0/16 (no change) Mem0 identifier_obfuscation: 8/16 → 11/16 (better) Lethe's invariance is the architectural finding: the purge path is pure BM25 (lexical), embedder-independent. The cross-script gap is a deliberate design choice (precise lexical purge over fuzzy vector purge), not an embedder limitation. The LLM-optional hook is the actual lever; it lifts Lethe to 16/16 on cross_lingual (96.4% overall). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds AUTOINCREMENT to memory.rowid so SQLite cannot reuse rowids of purged rows. Closes a latent hole in Proposition 1 (one-way purge): without AUTOINCREMENT, a row erased at t0 can be impersonated by a later inscribe sharing its rowid, breaking the audit-log invariant that purge receipts uniquely identify historical state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Defines the 6-method Adapter Protocol (typing.Protocol, runtime checkable) that any memory system implements to enter ForgetEval: reset, inscribe, recall_texts, supersede, release, purge. Systems lacking a primitive raise NotImplementedError; the runner scores those cases N/A per the paper's honest-N/A protocol (§4). This is the behavioural contract the paper's heterogeneous 13-system comparison rests on — backends implementing supersede via add+delete composition pass the same tests as backends with native primitives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verifies the depth-axis state machine: PINNED / SURFACE / SUBMERGED constant behaviour, monotone decay, idempotent pin, one-way purge, event-log determinism. Backs the four formal propositions stated in the paper's Appendix A (single-scalar soft-delete invariants). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ries ForgetEval-Adv hand-crafted core: 64 GeneratedCase entries across 8 attack categories (8 cases each): substring_trap — must-not substring in distractor prefix_collision — identifiers share long common prefix paraphrase_supersession — new fact lexically distant from old negation_trap — negated fact must not be confused temporal_qualifier — date-stamped supersession chains shared_attribute — two entities share one attribute compound_fact — single sentence carries two facts identifier_obfuscation — same identifier, different surface forms Author-intent comments preserved in source for IAA protocol reproducibility. This is the 64-case base that the paper's v0.5.1 suite extends to 132 hand-crafted + 253 LLM-drafted (385 total). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

253 GeneratedCase entries produced by scripts/generate_adversarial_cases.py (DeepSeek-V3 drafting + Qwen-2.5-72B admission judging), plus per-case Stage-3 labels (easy / llm_lift / llm_regression / unsolvable). Generation/audit pipeline matches paper §3.3: - DeepSeek-V3 drafts cases per category template - Qwen-2.5-72B independently judges well-formedness - hand-crafted core (this repo's adversarial.py) reproduces the same patterns at +46 pt HC lift vs +22 pt LLM-drafted lift (paper §5.4 Hand-crafted vs LLM-drafted split) File marked "do not hand-edit"; regenerate via the script for reproducibility. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Paper announced at arXiv:2606.15903 (cs.CL / cs.AI). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

WaylandYang and others added 18 commits May 14, 2026 00:28

docs(paper): paper.pdf — abstract & §6.10 fully synced to v0.4

de25154

§6 attack-category list now matches the released bench (10 categories with explicit per-category $n$ counts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: gitignore .coverage test artifact

60d2d2a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs: add arXiv ID 2606.15903 to README badge + CITATION.cff

62a0cbf

Paper announced at arXiv:2606.15903 (cs.CL / cs.AI). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

WaylandYang merged commit d3826cb into main Jun 16, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Publish: arXiv 2606.15903 + ForgetEval v0.5.1 code, adapters, depth-physics tests#3

Publish: arXiv 2606.15903 + ForgetEval v0.5.1 code, adapters, depth-physics tests#3
WaylandYang merged 18 commits into
mainfrom
dev

WaylandYang commented Jun 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

WaylandYang commented Jun 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant