From 3f826a53884d1cc3a808214a76413c648f46d371 Mon Sep 17 00:00:00 2001
From: Wayland Yang <wayland0916@gmail.com>
Date: Tue, 16 Jun 2026 12:00:29 +0800
Subject: [PATCH] docs: sync README ForgetEval-Adv table to published 385-case
 numbers

The adversarial table cited stale v0.4 numbers (112-case, Lethe+LLM
96.4%, $0.05) that disagreed with the now-public paper
(arXiv:2606.15903, 385-case v0.5.1). Updated to the canonical
in-house 385 reference:

  Lethe 244/385 (63.4%), Mem0 263/385 (68.3%),
  LangGraph 242/385 (62.9%), MemPalace 0/385,
  Lethe+LLM 353/385 (91.7%), LangGraph+LLM 359/385 (93.2%),
  cost ~$0.17.

Also reframes the deterministic cluster as a 63-68% band with
overlapping Wilson CIs, and notes that the hook lift (+28.3pt on
Lethe, +30.3pt on LangGraph) travels across backends, matching the
paper's control-plane-placement thesis. Drops the wall/case column
and adds a LangGraph+LLM row. LongMemEval table left as-is (it uses
a different "raw" setup than the paper's session-granularity
appendix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
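The overlapping-CI claim for the three deterministic systems can be
sanity-checked from the counts in the commit message. A minimal sketch,
assuming the standard Wilson score interval at z = 1.96 (the paper's
exact procedure may differ); `wilson_ci` and `overlaps` are illustrative
helpers, not part of the repo:

```python
from math import sqrt


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is ~95 %)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Pass counts out of 385, taken from the commit message above.
counts = {"Lethe": 244, "Mem0": 263, "LangGraph": 242}
intervals = {name: wilson_ci(k, 385) for name, k in counts.items()}

for name, (lo, hi) in intervals.items():
    print(f"{name:10s} {lo:.3f} .. {hi:.3f}")

names = list(intervals)
all_overlap = all(
    overlaps(intervals[a], intervals[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
)
print("pairwise overlap:", all_overlap)
```

Under that assumption the three intervals overlap pairwise, which is
consistent with reading the 63-68% band as a trade-off rather than a
ranking.
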
 README.md | 55 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/README.md b/README.md
index f245780..2c0c8b8 100644
--- a/README.md
+++ b/README.md
@@ -122,32 +122,35 @@ the two families that probe width-control and identifier precision.
 MemPalace returns 0 because the API has no deletion primitives.
 
 For more discriminative comparison we run **ForgetEval-Adv**, a
-112-case hand-crafted layer covering 10 attack categories
-(substring traps, prefix collisions, paraphrase supersession,
-negation, temporal qualifiers, shared attributes, compound facts,
-identifier obfuscation, cross-lingual identifiers, recursive
-supersession).  See [docs/forgeteval_adversarial.md](docs/forgeteval_adversarial.md).
-
-| System              | adversarial overall  | wall / case  | trade-off shape                                |
-|---------------------|---------------------:|-------------:|------------------------------------------------|
-| **Lethe v1**        |  70 / 112 (62.5 %)   |  ~48 ms      | 100 % prefix_collision, 0 % cross_lingual      |
-| Mem0 v2.0.2         |  76 / 112 (67.9 %)   |  ~527 ms     | 50 % prefix_collision, 50 % cross_lingual      |
-| LangMem (LangGraph) |  69 / 112 (61.6 %)   |  ~56 ms      | 94 % prefix_collision, 0 % cross_lingual       |
-| MemPalace           |   0 / 112 ( 0.0 %)   |  ~167 ms     | no deletion primitives                         |
-| **Lethe + LLM**     | **108 / 112 (96.4 %)** | ~2.2 s (mutations only)  | 100 % cross_lingual, 100 % shared_attribute; 8 / 10 categories at 100 % |
-
-The Lethe+LLM row uses the optional `llm: Callable[[str], str]`
-hook on `LetheAdapter` wired to DeepSeek-V3 via SiliconFlow.
-Cost: ~$0.05 for a full 112-case run.  The recall hot path
-remains LLM-free; only the three mutation operations (`supersede`,
-`purge`, `release`) consult the model.
-
-Statistically separated per-category claims at p < 0.05 (non-
-overlapping Wilson 95 % CIs at n=16):
-**Lethe > Mem0 on prefix_collision** (lexical-precise purge wins);
-**Mem0 > Lethe / LangMem on cross_lingual_identifier** (vector-soft
-matching wins).  Overall Wilson CIs of the three deterministic
-systems overlap — the bench reads the trade-off, not a winner.
+385-case adversarial layer (132 hand-crafted + 253 LLM-drafted,
+oracle-validated) covering 10 attack categories (substring traps,
+prefix collisions, paraphrase supersession, negation, temporal
+qualifiers, shared attributes, compound facts, identifier
+obfuscation, cross-lingual identifiers, recursive supersession).
+See [docs/forgeteval_adversarial.md](docs/forgeteval_adversarial.md).
+
+| System              | adversarial overall    | trade-off shape                                 |
+|---------------------|-----------------------:|-------------------------------------------------|
+| **Lethe v1**        |  244 / 385 (63.4 %)    | 82 % prefix_collision, 0 % cross_lingual         |
+| Mem0 v2.0.2         |  263 / 385 (68.3 %)    | multi-signal scoring, weaker identifier precision |
+| LangGraph           |  242 / 385 (62.9 %)    | 0 % cross_lingual, no native edit primitive      |
+| MemPalace           |    0 / 385 ( 0.0 %)    | no deletion primitives                           |
+| **Lethe + LLM**     | **353 / 385 (91.7 %)** | recovers cross_lingual + intent-aware deletion   |
+| **LangGraph + LLM** | **359 / 385 (93.2 %)** | same hook, high-recall backbone                  |
+
+The three deterministic systems cluster in a **63–68 % band** with
+mutually overlapping Wilson CIs — the bench reads the trade-off, not
+a winner.  The discriminative signal is per-category: deterministic
+stores hold the lexical/temporal categories but fail canonicalization
+(Lethe 0 % cross_lingual, 5 % identifier_obfuscation).
+
+The +LLM rows use the optional `llm: Callable[[str], str]` hook on the
+adapter, wired to DeepSeek-V3 via SiliconFlow.  Cost: **~$0.17 for a
+full 385-case run**.  The recall hot path stays LLM-free; only the
+three mutation operations (`supersede`, `purge`, `release`) consult
+the model.  The lift travels across backends (Lethe +28.3 pt,
+LangGraph +30.3 pt), so it is the *placement* of the hook, not the
+storage engine, that earns it.
 
 For attack categories that need semantic understanding the engine
 deliberately doesn't provide (compound_fact across all 3 systems,