From 3f826a53884d1cc3a808214a76413c648f46d371 Mon Sep 17 00:00:00 2001
From: Wayland Yang
Date: Tue, 16 Jun 2026 12:00:29 +0800
Subject: [PATCH] docs: sync README ForgetEval-Adv table to published 385-case numbers

The adversarial table cited stale v0.4 numbers (112-case, Lethe+LLM
96.4%, $0.05) that disagreed with the now-public paper
(arXiv:2606.15903, 385-case v0.5.1). Updated to the canonical in-house
385 reference: Lethe 244/385 (63.4%), Mem0 263/385 (68.3%), LangGraph
242/385 (62.9%), MemPalace 0/385, Lethe+LLM 353/385 (91.7%),
LangGraph+LLM 359/385 (93.2%), cost ~$0.17.

Also reframes the deterministic cluster honestly (63-68% band,
overlapping Wilson CIs) and notes the +28pt hook lift travels across
backends, matching the paper's control-plane-placement thesis.

LongMemEval table left as-is (different "raw" setup than the paper's
session-granularity appendix).

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 README.md | 55 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/README.md b/README.md
index f245780..2c0c8b8 100644
--- a/README.md
+++ b/README.md
@@ -122,32 +122,35 @@ the two families that probe width-control and identifier precision.
 MemPalace returns 0 because the API has no deletion primitives.
 
 For more discriminative comparison we run **ForgetEval-Adv**, a
-112-case hand-crafted layer covering 10 attack categories
-(substring traps, prefix collisions, paraphrase supersession,
-negation, temporal qualifiers, shared attributes, compound facts,
-identifier obfuscation, cross-lingual identifiers, recursive
-supersession). See [docs/forgeteval_adversarial.md](docs/forgeteval_adversarial.md).
-
-| System | adversarial overall | wall / case | trade-off shape |
-|---------------------|---------------------:|-------------:|------------------------------------------------|
-| **Lethe v1** | 70 / 112 (62.5 %) | ~48 ms | 100 % prefix_collision, 0 % cross_lingual |
-| Mem0 v2.0.2 | 76 / 112 (67.9 %) | ~527 ms | 50 % prefix_collision, 50 % cross_lingual |
-| LangMem (LangGraph) | 69 / 112 (61.6 %) | ~56 ms | 94 % prefix_collision, 0 % cross_lingual |
-| MemPalace | 0 / 112 ( 0.0 %) | ~167 ms | no deletion primitives |
-| **Lethe + LLM** | **108 / 112 (96.4 %)** | ~2.2 s (mutations only) | 100 % cross_lingual, 100 % shared_attribute; 8 / 10 categories at 100 % |
-
-The Lethe+LLM row uses the optional `llm: Callable[[str], str]`
-hook on `LetheAdapter` wired to DeepSeek-V3 via SiliconFlow.
-Cost: ~$0.05 for a full 112-case run. The recall hot path
-remains LLM-free; only the three mutation operations (`supersede`,
-`purge`, `release`) consult the model.
-
-Statistically separated per-category claims at p < 0.05 (non-
-overlapping Wilson 95 % CIs at n=16):
-**Lethe > Mem0 on prefix_collision** (lexical-precise purge wins);
-**Mem0 > Lethe / LangMem on cross_lingual_identifier** (vector-soft
-matching wins). Overall Wilson CIs of the three deterministic
-systems overlap — the bench reads the trade-off, not a winner.
+385-case adversarial layer (132 hand-crafted + 253 LLM-drafted,
+oracle-validated) covering 10 attack categories (substring traps,
+prefix collisions, paraphrase supersession, negation, temporal
+qualifiers, shared attributes, compound facts, identifier
+obfuscation, cross-lingual identifiers, recursive supersession).
+See [docs/forgeteval_adversarial.md](docs/forgeteval_adversarial.md).
+
+| System | adversarial overall | trade-off shape |
+|---------------------|-----------------------:|-------------------------------------------------|
+| **Lethe v1** | 244 / 385 (63.4 %) | 82 % prefix_collision, 0 % cross_lingual |
+| Mem0 v2.0.2 | 263 / 385 (68.3 %) | multi-signal scoring, weaker identifier precision |
+| LangGraph | 242 / 385 (62.9 %) | 0 % cross_lingual, no native edit primitive |
+| MemPalace | 0 / 385 ( 0.0 %) | no deletion primitives |
+| **Lethe + LLM** | **353 / 385 (91.7 %)** | recovers cross_lingual + intent-aware deletion |
+| **LangGraph + LLM** | **359 / 385 (93.2 %)** | same hook, high-recall backbone |
+
+The three deterministic systems cluster in a **63–68 % band** with
+mutually overlapping Wilson CIs — the bench reads the trade-off, not
+a winner. The discriminative signal is per-category: deterministic
+stores hold the lexical/temporal categories but fail canonicalization
+(Lethe 0 % cross_lingual, 5 % identifier_obfuscation).
+
+The +LLM rows use the optional `llm: Callable[[str], str]` hook on the
+adapter, wired to DeepSeek-V3 via SiliconFlow. Cost: **~$0.17 for a
+full 385-case run**. The recall hot path stays LLM-free; only the
+three mutation operations (`supersede`, `purge`, `release`) consult
+the model — and the +28 pt lift travels across backends (Lethe and
+LangGraph alike), so it is the *placement* of the hook, not the
+storage engine, that earns it.
 
 For attack categories that need semantic understanding the engine
 deliberately doesn't provide (compound_fact across all 3 systems,