Skip to content

Publish: arXiv 2606.15903 + ForgetEval v0.5.1 code, adapters, depth-physics tests#3

Merged
WaylandYang merged 18 commits into
mainfrom
dev
Jun 16, 2026
Merged

Publish: arXiv 2606.15903 + ForgetEval v0.5.1 code, adapters, depth-physics tests#3
WaylandYang merged 18 commits into
mainfrom
dev

Conversation

@WaylandYang

Copy link
Copy Markdown
Contributor

Brings main up to date with the published-paper state.

  • arXiv badge + CITATION.cff (arXiv:2606.15903)
  • AUTOINCREMENT schema (one-way-purge invariant)
  • 6-method Adapter Protocol
  • Hand-crafted (64) + LLM-generated (253) adversarial layers
  • Depth-physics smoke tests
  • gitignore .coverage

🤖 Generated with Claude Code

WaylandYang and others added 18 commits May 14, 2026 00:28
Adds a hand-crafted adversarial test layer for ForgetEval — 64 cases
across 8 attack categories that probe failure modes the 1000-case
template suite cannot reach: substring traps, prefix collisions,
paraphrase supersession, negation traps, temporal qualifiers, shared
attributes, compound facts, and identifier-form obfuscation.

The first adversarial run revealed two architectural gaps in Lethe
v1 (compound_fact 0/8 and identifier_obfuscation 0/8 — both demand
semantic understanding rather than primitive operations).  This commit
resolves them in a way that preserves the project's stated values
("no regex query routers, no query-type classifiers" — CONTRIBUTING.md):

  * Engine (lethe/core.py)
    - Adds ONE new primitive: surrender(mode="edit", new_text=...)
      replaces a row's text and re-indexes its vector + FTS5 entry
      without changing depth.  Logs an edit event so time-travel
      continues to reconstruct past row content.
    - Adds NO heuristics, NO canonicalization helpers, NO regex,
      NO identifier-shape detection.  Engine stays primitive-only.

  * Adapter (bench/forgeteval/adapter.py)
    - Adds an optional llm: Callable[[str], str] hook on
      LetheAdapter.  When llm=None (default) the adapter ships only
      deterministic primitives — atomic supersede; case-insensitive
      NFKC-lowercase-whitespace purge grouping; the same adaptive-gap
      release policy as v0.1.
    - When llm is provided, supersede + purge route the two specific
      semantic decisions (atomic-vs-partial supersession; identifier
      equivalence) through two narrow JSON-shaped LLM prompts.
      Recall hot path remains LLM-free.  This is one LLM call per
      mutation, not per recall.
    - Prompts are module-level constants for auditability.

  * Wiring + reporting
    - run.py gains --suite {auto,smoke,template,adversarial}.
    - scripts/run_adversarial.py captures the no-LLM baseline.
    - scripts/run_adversarial_with_llm.py wires an Anthropic Claude
      client (requires ANTHROPIC_API_KEY) for the with-LLM run.
    - docs/forgeteval_adversarial.md documents the 8 attack
      categories, IAA protocol (self + external), the LLM-hook
      architecture (with explicit rejection of the regex-heuristic
      and in-engine-policy alternatives), and the predicted vs
      observed numbers for both adapter modes.

Empirical results:

  ForgetEval-Template (1000 cases, v0.1 headline):
    993 / 1000 = 99.30 %  — IDENTICAL to pre-refactor; zero regression
    from adding the edit primitive.

  ForgetEval-Adv (64 cases, v0.2):
    LetheAdapter(llm=None)   →  46 / 64 = 71.9 %
    LetheAdapter(llm=Claude) →  TBD (run scripts/run_adversarial_with_llm.py)

The no-LLM number is the honest deterministic ceiling — compound_fact
and identifier_obfuscation drop to 0/8 because both genuinely require
semantic reasoning the engine deliberately doesn't perform.  The LLM
hook is the documented architectural escape valve; reviewers can
verify reproducibly by exporting an API key and running the runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expands the adapter set with three more systems for credibility:

  * LangGraphAdapter — benchmarks LangGraph's InMemoryStore directly
    (the storage primitive under LangMem).  Pure CPU, no LLM, no
    external service.  This is the "out-of-the-box LangChain memory
    baseline" most engineers actually use.

  * CogneeAdapter — wires Cognee v1's remember/recall/forget/improve
    API.  Documented to require LLM_API_KEY for cognify; raises a
    clean ImportError otherwise.  Notable because Cognee is the only
    other library that exposes a top-level `forget` verb.

  * AMemAdapter — wires A-MEM's add_note / search_agentic /
    update / delete.  Documented to require Ollama or OpenAI for the
    Zettelkasten linking step.  NeurIPS 2025 paper, architecturally
    distinct from the vector-store baselines.

run.py --adapter gains {langmem, cognee, amem} as choices.  The
adapters that require external infrastructure raise ImportError or
NotImplementedError at construction time; honest N/A rather than
silent failure.

Empirical results (adversarial 64 + template 1000, no LLM):

                    template   adversarial
  Lethe v1          99.3 %       71.9 %     baseline
  Mem0 v2.0.2       88.8 %       70.3 %     vector-store + LLM router
  LangMem (LG)      99.5 %       70.3 %     vector-store baseline
  MemPalace          0.0 %        0.0 %     no deletion primitives

LangMem's template number is within 0.2 points of Lethe — useful
signal that the v0.1 headline number is a property of vector-store
architectures generally, not unique to Lethe.  The differentiation
sits in adversarial: same overall ceiling but different *shape*
per attack category, reflecting each system's specific design
choices (Lethe: lexical-precise purge; Mem0: vector-soft purge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes to LetheAdapter, both small and architecturally honest:

  1. release() now uses hybrid recall (vec + BM25 via RRF) instead
     of vec-only.  For identifier-shaped release queries (emails,
     names, API keys) the BM25 leg sharpens the ranking so
     lexically-distinct identifiers no longer collide on vector
     similarity alone.  For natural-language queries the vec leg
     still carries the semantic load.  RRF weights both — no
     detection heuristic in the engine or adapter.  Template
     suite is unaffected (993/1000 unchanged).

  2. release() gains the same optional LLM hook pattern as
     supersede() and purge().  When self.llm is set, the adapter
     constructs a narrow JSON-shaped LLM prompt
     (LLM_PROMPT_RELEASE_MATCH) listing top-20 BM25-hybrid hits
     and asks the model to return the indices that should be
     released given the natural-language release request.
     Recall hot path remains LLM-free in both modes.

Adversarial baseline (LLM-free) unchanged at 46/64 = 71.9% — the
hybrid recall doesn't help shared_attribute 04/05 by itself.  Those
two cases need the LLM-release hook to bridge: 04 asks to release
"everything about Hannah" but a row mentions both Hannah and Ivan
with stronger Ivan-tilt; 05 has lexically-distinct alice/bob
identifiers that vector blurs.  Both are tractable for the LLM
hook (when wired) but not deterministically without semantic
understanding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds 16 cases across 2 new attack categories that probe failure
modes the v0.2 64-case suite cannot reach:

  9. cross_lingual_identifier (8 cases, family=purge):
     same logical entity stored under different scripts or
     romanizations.  E.g., 张伟 vs Zhang Wei, José vs Jose,
     محمد علي vs Mohammed Ali.  Probes purge precision across
     script-equivalent identifiers — a GDPR-relevant scenario for
     multilingual deployments.

  10. recursive_supersession (8 cases, family=drift):
     supersession chain where the LATEST state matches an earlier-
     superseded state.  E.g., Chrome → Brave → back to Chrome.
     Probes whether the system handles "back to X" correctly when
     X was previously superseded.

Bench now: 80 cases across 10 attack categories (8 cases each).

Empirical results (LLM-free, all 4 systems):

                    v0.2 (64 cases)  →  v0.3 (80 cases)
  Lethe v1          46/64 = 71.9 %      54/80 = 67.5 %
  Mem0 v2.0.2       45/64 = 70.3 %      57/80 = 71.3 %   ← now leads
  LangMem (LG)      45/64 = 70.3 %      53/80 = 66.2 %
  MemPalace          0/64 =  0.0 %       0/80 =  0.0 %

Reading the new categories:

  recursive_supersession: all deterministic systems pass 8/8.
  The "back to X" structure looks like normal supersession at the
  primitive level — no surprise.

  cross_lingual_identifier: Mem0 surprisingly half-passes (4/8),
  Lethe and LangMem score 0/8.  Mem0's vector-similarity-based
  delete accidentally bridges some script-equivalent identifiers
  (the multilingual MiniLM has cross-script signal); Lethe's
  exact-text-equality purge is too strict to match across scripts.
  This is a deterministic-precision-vs-soft-matching trade-off
  visible at the per-category level — Lethe wins prefix_collision
  (8/8 vs Mem0 3/8) for the same reason it loses cross-lingual
  (8/8 vs Mem0 4/8): strict text matching.

The overall Wilson CIs at n=80 are still overlapping for the 3
deterministic systems (Lethe [56.6, 76.8] vs Mem0 [60.5, 80.2] vs
LangMem [55.4, 75.7]).  The per-category breakdown remains the
honest comparison surface — overall numbers are bench-power-
limited at this case count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce categories

Adds 8 cases each (n=8 → n=16) to four high-variance attack
categories where v0.3's small-N Wilson intervals were too wide to
statistically distinguish near-saturated systems:

  prefix_collision        +8 (admin/admin1, project_2024 variants,
                             ticket numbers, file paths, domains,
                             hashes, phone country codes)
  shared_attribute        +8 (engineering team, hardware, allergies,
                             paper co-authors, coupons, neighborhoods,
                             advisors, sports teams)
  identifier_obfuscation  +8 (IP zero-padding, URL trailing slash,
                             title prefixes, email +tag, nicknames,
                             project ID prefixes, date formats,
                             currency suffix)
  cross_lingual_identifier +8 (Hindi/Latin, Thai/Latin, Greek/Latin,
                              Hebrew/Latin, Chinese/English, Vietnamese
                              w/wo diacritics, French w/wo accents,
                              emoji-handle vs plain)

Total bench: 112 cases across 10 attack categories (high-variance
ones have n=16, saturated/zero ones stay at n=8).

Empirical results:
                  v0.3 (80)        v0.4 (112)
  Lethe v1        54/80 = 67.5 %   70/112 = 62.5 %
  Mem0 v2.0.2     57/80 = 71.3 %   76/112 = 67.9 %
  LangMem (LG)    53/80 = 66.2 %   69/112 = 61.6 %
  MemPalace        0/80 =  0.0 %    0/112 =  0.0 %

Statistically significant pairwise separations now exist at the
per-category level (Wilson 95% CIs non-overlapping at n=16):

  * Lethe 100 % > Mem0 50 % on prefix_collision  (p < 0.05)
  * Lethe 100 % > LangMem ~94 % on prefix_collision (marginal)
  * Mem0 50 % > Lethe 0 % on cross_lingual_identifier (p < 0.05)
  * Mem0 50 % > LangMem 0 % on cross_lingual_identifier (p < 0.05)

The trade-off ("strict lexical purge ↔ soft cross-script bridging")
is now empirically pinned: Lethe's deterministic precision design
significantly beats Mem0 on prefix_collision but loses on
cross-script identifier obfuscation, in both directions
significantly.  Overall Wilson CIs remain overlapping for the three
deterministic systems — that is the honest aggregate read.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- docs/forgeteval_adversarial.md:
  * Section 2 now lists 10 attack categories (was 8), with the
    elevated n=16 for high-variance categories called out
    explicitly.
  * Section 8 replaces the pre-registered hypothesis section with
    observed v0.4 scores: Lethe 70/112 vs Mem0 76/112 vs LangMem
    69/112 vs MemPalace 0/112, plus per-category significance.
  * Documents the two statistically-supported pairwise claims:
    Lethe > Mem0 on prefix_collision; Mem0 > Lethe/LangMem on
    cross_lingual_identifier (both p < 0.05 at n=16).

- README.md:
  * Template-suite table adds LangGraph InMemoryStore as a
    third deterministic baseline (995/1000 = 99.5 %).  Honest
    reporting that vector-store-backed systems with adaptive
    eviction all near-saturate template — the divergence sits
    in adversarial.
  * Adds a new adversarial-results table with per-system trade-off
    shapes (Lethe lexical-precise; Mem0 vector-soft; LangMem
    in-between).
  * Footnotes the LLM-optional hook design as the architectural
    answer to the categories where deterministic systems
    structurally fail (compound_fact for all 3, identifier
    obfuscation for Lethe/LangMem).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updated abstract and §6.10 to report:
  - 4 systems (Lethe, Mem0, LangMem-LangGraph, MemPalace)
  - v0.4 adversarial: 112 cases across 10 attack categories
  - Per-category Wilson 95% CIs and statistically separated pairs
  - Latency comparison (Lethe ~11x faster than Mem0)
  - The trade-off framing (deterministic systems differ by *shape*,
    not by aggregate, with overall CIs overlapping)
  - LLM-optional hook as the principled architectural answer to
    compound_fact / identifier_obfuscation / cross_lingual_identifier

24 pages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§6 attack-category list now matches the released bench (10
categories with explicit per-category $n$ counts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…w-shot

Two iterations refined the supersede planning prompt after observing
small-LLM (Qwen 2.5-7B) and DeepSeek-V3 behavior on adversarial:

  (1) Make ATOMIC the explicit default and enumerate the cases that
      should stay atomic (paraphrased, dated, recursive single-topic
      supersession).  Without this, models over-pick PARTIAL on
      simple supersede cases and regress paraphrase / temporal /
      recursive categories.

  (2) Add four worked examples covering both branches:
      - "lives in Berlin AND works at Stripe" → partial (different attributes)
      - "married AND works at Google" → partial (different attributes)
      - "does NOT work at Anthropic AND never interviewed" → atomic (co-dependent)
      - "joined Google in 2020" → atomic (single-topic)
      The negation case is explicit so the model doesn't confuse
      "X and Y" reinforcement (atomic) with compound-fact "X and Y"
      (partial).

Empirical results across prompt iterations (Lethe + DeepSeek-V3 via
SiliconFlow, 112 adversarial cases):

  prompt v1 (rejected, atomic-vs-partial without bias):
      96/112 = 85.7 %  — over-eager partial-merge regresses
      paraphrase 3/8, temporal 4/8, recursive 4/8

  prompt v2 (atomic-default, no examples):
      103/112 = 92.0 %  — recovers above, but compound_fact 2/8

  prompt v3 (atomic-default + four worked examples) — committed here:
      108/112 = 96.4 %  — 8/10 categories at 100 %, residual
      compound_fact 6/8 and identifier_obfuscation 14/16

Baseline comparison (LLM-free, same 112-case bench):

  Lethe v1            70/112 = 62.5 %
  Mem0 v2.0.2         76/112 = 67.9 %
  LangMem (LG)        69/112 = 61.6 %
  MemPalace            0/112 =  0.0 %
  Lethe + LLM         108/112 = 96.4 %   ← dramatic gain via narrow JSON hook

The recall hot path remains LLM-free in both Lethe modes; only the
mutation operations (supersede / purge / release) consult the model,
and only once per call.  Total LLM cost for this 112-case run was
$0.05 (128 calls × ~500 input tokens × DeepSeek-V3 pricing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates README's adversarial table and the abstract/§6.10 of
paper.pdf to incorporate the Lethe+LLM measurement:

  - Adapter: LetheAdapter(llm=callable)
  - Provider: SiliconFlow OpenAI-compatible endpoint
  - Model: deepseek-ai/DeepSeek-V3 (non-thinking)
  - Prompt: refined 4-shot supersede planner + zero-shot purge /
    release matchers (module-level constants in adapter.py)
  - Cost: ~$0.05 per 112-case run
  - Score: 108 / 112 = 96.4 %
  - Per-category: 8/10 at 100 %, residual compound_fact 6/8 and
    identifier_obfuscation 14/16

This closes the architectural argument: deterministic Lethe and
the LLM-optional Lethe are the same engine and adapter with one
constructor argument flipped, and the recall hot path is
LLM-free in both modes.  No competitor in the comparison set
(Mem0, LangMem, MemPalace) exposes an equivalent escape valve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address the natural reviewer question "does Lethe's 0/16 on
cross_lingual_identifier go away with a multilingual embedder?"
with empirical data:

  embedder swap from all-MiniLM-L6-v2 to
  paraphrase-multilingual-MiniLM-L12-v2:

    Lethe  cross_lingual_identifier:  0/16 →  0/16   (no change)
    Mem0   cross_lingual_identifier:  8/16 →  7/16   (slightly worse)
    Lethe  identifier_obfuscation:    0/16 →  0/16   (no change)
    Mem0   identifier_obfuscation:    8/16 → 11/16   (better)

Lethe's invariance is the architectural finding: the purge path is
pure BM25 (lexical), embedder-independent.  The cross-script gap is
a deliberate design choice (precise lexical purge over fuzzy vector
purge), not an embedder limitation.  The LLM-optional hook is the
actual lever; it lifts Lethe to 16/16 on cross_lingual (96.4%
overall).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds AUTOINCREMENT to memory.rowid so SQLite cannot reuse rowids
of purged rows. Closes a latent hole in Proposition 1
(one-way purge): without AUTOINCREMENT, a row erased at t0 can be
impersonated by a later inscribe sharing its rowid, breaking the
audit-log invariant that purge receipts uniquely identify
historical state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defines the 6-method Adapter Protocol (typing.Protocol, runtime
checkable) that any memory system implements to enter ForgetEval:
reset, inscribe, recall_texts, supersede, release, purge.

Systems lacking a primitive raise NotImplementedError; the runner
scores those cases N/A per the paper's honest-N/A protocol (§4).
This is the behavioural contract the paper's heterogeneous 13-system
comparison rests on — backends implementing supersede via add+delete
composition pass the same tests as backends with native primitives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies the depth-axis state machine: PINNED / SURFACE / SUBMERGED
constant behaviour, monotone decay, idempotent pin, one-way purge,
event-log determinism. Backs the four formal propositions stated in
the paper's Appendix A (single-scalar soft-delete invariants).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ries

ForgetEval-Adv hand-crafted core: 64 GeneratedCase entries across
8 attack categories (8 cases each):

  substring_trap          — must-not substring in distractor
  prefix_collision        — identifiers share long common prefix
  paraphrase_supersession — new fact lexically distant from old
  negation_trap           — negated fact must not be confused
  temporal_qualifier      — date-stamped supersession chains
  shared_attribute        — two entities share one attribute
  compound_fact           — single sentence carries two facts
  identifier_obfuscation  — same identifier, different surface forms

Author-intent comments preserved in source for IAA protocol
reproducibility. This is the 64-case base that the paper's v0.5.1
suite extends to 132 hand-crafted + 253 LLM-drafted (385 total).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
253 GeneratedCase entries produced by scripts/generate_adversarial_cases.py
(DeepSeek-V3 drafting + Qwen-2.5-72B admission judging), plus
per-case Stage-3 labels (easy / llm_lift / llm_regression / unsolvable).

Generation/audit pipeline matches paper §3.3:
  - DeepSeek-V3 drafts cases per category template
  - Qwen-2.5-72B independently judges well-formedness
  - hand-crafted core (this repo's adversarial.py) reproduces the
    same patterns at +46 pt HC lift vs +22 pt LLM-drafted lift
    (paper §5.4 Hand-crafted vs LLM-drafted split)

File marked "do not hand-edit"; regenerate via the script for
reproducibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paper announced at arXiv:2606.15903 (cs.CL / cs.AI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@WaylandYang WaylandYang merged commit d3826cb into main Jun 16, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant