An RL environment that trains LLM agents to use the current fact, not the stale one.
by Vedant Patel · vrin.cloud · vedant@vrin.cloud
Quickstart · Results · How it works · Contributing · Cite · Contact
Across a long, multi-session interaction, facts change: a user moves, a price updates, a policy is replaced. Current memory systems and long-context models are good at recalling what they were told, and bad at dropping what is no longer true — so an assistant keeps booking flights to your old city. We call the correct handling of such updates supersession.
Benchmarks have started to measure this (FAMA, MemoryArena, MemoryAgentBench),
but they only ever score a frozen model. Supersede turns that measurement
into a training reward: a multi-session environment where facts are
superseded over time and the agent is rewarded for acting on the
currently-valid version and penalized for relying on a superseded one — with
turn-level credit, on the verifiers /
prime-rl rails the field
already trains on.
To our knowledge it is the first trainable environment whose verifiable reward is temporal fact-currency — and we use it to train the gap down, not just measure it.
- For labs & memory products: a ready-made RL environment + verifier for a known, unsolved production failure (assistants that cite your old job, address, or preference).
- For research: the first work, to our knowledge, to make supersession-correctness a learning signal rather than an eval-only metric.
| Prior work | What it does | What it does not |
|---|---|---|
| FAMA / Memora | Metric for using current vs. stale memory | Eval only, frozen models |
| MemAgent | RL for memory agents | Rewards final answer only, not fact-currency |
| LongRLVR | Verifiable reward on evidence relevance | No notion of temporal validity |
| MemoryAgentBench | Has a conflict-resolution eval task | Eval only, not a training environment |
Supersede sits in the intersection none of them occupy: a trainable environment whose verifiable reward is temporal fact-currency.
# 1. install (offline core + dev tools; no API key needed)
uv venv && source .venv/bin/activate
uv pip install -e ".[env,dev]"
# 2. verify it works
pytest # -> 21 passed
# 3. see the temporal core decide a supersession, no model required
python - <<'PY'
from supersede import Fact, detect_conflict
old = Fact(subject="Alice", predicate="lives in", object="Boston")
new = Fact(subject="Alice", predicate="lives in", object="Denver")
print(detect_conflict(new, [old]).strategy) # -> supersede
PY

Run the full environment against a model (needs an OpenAI key):
prime env install supersede
prime eval run supersede -m openai/gpt-4.1-mini -a '{"max_examples": 78}'

The environment auto-downloads the LongMemEval knowledge-update data (MIT
license) on first run.
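For intuition about the supersession decision shown in step 3, here is a minimal, package-free sketch. It is not the `supersede` implementation: `SketchFact` and `sketch_detect` are hypothetical names, and the rule used (same subject and predicate, different object) is only an assumption about the simplest form such a check could take.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SketchFact:
    """Hypothetical stand-in for a fact triple; not the package's Fact class."""
    subject: str
    predicate: str
    object: str


def sketch_detect(new: SketchFact, known: Iterable[SketchFact]) -> str:
    """Classify `new` against known facts (assumed rule, for illustration only).

    "supersede": a known fact fills the same (subject, predicate) slot
                 with a different object
    "duplicate": the slot is already filled with the same object
    "add":       nothing known about this slot yet
    """
    same_slot = [
        f for f in known
        if (f.subject, f.predicate) == (new.subject, new.predicate)
    ]
    if any(f.object != new.object for f in same_slot):
        return "supersede"
    return "duplicate" if same_slot else "add"


old = SketchFact("Alice", "lives in", "Boston")
new = SketchFact("Alice", "lives in", "Denver")
print(sketch_detect(new, [old]))
```

A real bi-temporal model would also track when each fact became valid and when it was replaced; this sketch leaves that out to keep the decision rule visible.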
The problem — bounded memory breaks supersession, even at the frontier.
On LongMemEval knowledge-update (n=78), swapping full context for a bounded,
self-maintained memory drops accuracy sharply, and the gap survives on the
strongest model:
| Model | Full context | Bounded memory |
|---|---|---|
| gpt-4.1-mini | 82% | 63% |
| gpt-4.1 | 91% | 64% |
| gpt-5.4 | 92% | 77% |
Even gpt-5.4 loses 15 points (paired McNemar p = 0.0033). The bottleneck is memory maintenance, not comprehension — and it doesn't close with a bigger model, or with a bigger memory (see the paper for the scale study).
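A paired McNemar test looks only at the questions on which the two conditions disagree. The sketch below computes the exact two-sided variant from the two discordant counts. The counts in the usage line are made up for illustration and are not the paper's data, and the paper may use a different variant of the test (for example a chi-square approximation).

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant pair counts.

    b: questions answered correctly under condition A only
    c: questions answered correctly under condition B only
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
p = mcnemar_exact(b=10, c=2)
print(f"p = {p:.4f}")
```

Questions that both conditions get right, or both get wrong, do not enter the statistic, which is why the test suits a paired comparison on the same 78 questions.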
The fix — training closes part of the gap. We can't fine-tune the proprietary models above, so we train a small open model (Qwen2.5-3B) on this environment with GRPO and evaluate it on the same real questions, which are held out from training. Its accuracy nearly doubles (9.0% to 16.7%) and rises monotonically across checkpoints:
| Checkpoint | Held-out oracle accuracy |
|---|---|
| base (untrained) | 9.0% |
| GRPO step 150 | 12.8% |
| GRPO step 175 | 16.7% |
Trained on synthetic episodes, improving on real held-out conversations — i.e. a learned skill, not memorization. It is a proof of mechanism on a small model (still far from the full-context ceiling, and the curve was still rising when training ran out of hard examples), not a finished policy.
The agent sees one session at a time and maintains a capped notes memory; it never re-sees raw sessions. After the last session, it answers a query that depends on the current value of a fact that changed along the way.
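The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `rollout.py`: `run_bounded_rollout`, `NoteWriter`, and `keep_latest_city` are hypothetical names, the note writer stands in for the model, and enforcing the cap by truncation is an assumption about how the budget might be applied.

```python
from typing import Callable, Sequence

# Hypothetical signature: (current notes, new session text) -> rewritten notes.
NoteWriter = Callable[[str, str], str]


def run_bounded_rollout(
    sessions: Sequence[str], write_notes: NoteWriter, budget: int = 300
) -> str:
    """Feed sessions one at a time; only the capped notes carry over."""
    notes = ""
    for session in sessions:
        # The writer sees the current notes and one session, never earlier sessions.
        notes = write_notes(notes, session)[:budget]  # assumed: hard character cap
    return notes


def keep_latest_city(notes: str, session: str) -> str:
    """Toy note writer: overwrite the stored city when a session states a new one."""
    marker = "lives in "
    if marker in session:
        city = session.split(marker, 1)[1].rstrip(".")
        return f"Alice lives in {city}"
    return notes


sessions = ["Alice lives in Boston.", "Weather chat.", "Alice lives in Denver."]
final_notes = run_bounded_rollout(sessions, keep_latest_city)
print(final_notes)
```

Because the final query is answered from the notes alone, a writer that appends instead of overwriting would either exceed the budget or keep the stale city next to the current one.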
Reward
- answered_current (+1): the final answer conveys the current/gold value (programmatic, ungameable matcher; no judge model needed).
- stale_penalty (−1): the answer asserts a known superseded value. Active only when the task ships stale_values (synthetic timelines; LongMemEval is gold-only).
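The two reward terms can be sketched as follows. This is a simplified illustration, not the matcher in `reward.py`: `sketch_reward` and its helpers are hypothetical, matching is reduced to normalized phrase containment, and summing the two terms into one scalar is an assumption.

```python
import re
from typing import Optional, Sequence


def _norm(text: str) -> str:
    """Lowercase and keep only alphanumeric tokens so punctuation is ignored."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def _mentions(answer: str, value: str) -> bool:
    # Assumed matcher: whole-phrase containment after normalization.
    needle = _norm(value)
    return bool(needle) and f" {needle} " in f" {_norm(answer)} "


def sketch_reward(
    answer: str, gold: str, stale_values: Optional[Sequence[str]] = None
) -> float:
    reward = 0.0
    if _mentions(answer, gold):
        reward += 1.0  # answered_current
    if stale_values and any(_mentions(answer, s) for s in stale_values):
        reward -= 1.0  # stale_penalty, active only when stale values are supplied
    return reward


print(sketch_reward("She lives in Denver now.", "Denver", ["Boston"]))
print(sketch_reward("She lives in Boston.", "Denver", ["Boston"]))
print(sketch_reward("She lives in Boston.", "Denver"))
```

Containment cannot tell asserting a value from merely mentioning it: "she moved from Boston to Denver" scores zero here, so a production matcher needs to be stricter about what counts as relying on the stale value.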
load_environment arguments
| arg | default | meaning |
|---|---|---|
| question_type | knowledge-update | LongMemEval subset |
| max_examples | None | cap on tasks |
| budget | 300 | character cap on the agent's notes memory |
| full_context | False | upper-bound mode: all sessions in context, single turn |
| mode | eval | eval (real data) or train (procedural curriculum) |
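One way to keep evaluation commands consistent with the table above is to build the `-a` JSON string from a dictionary of overrides. The helper below is a hypothetical convenience, not part of the package, and it assumes that `-a` forwards JSON keys to `load_environment`, as the `max_examples` example in the quickstart suggests.

```python
import json
import shlex

# Defaults as documented in the arguments table.
DEFAULTS = {
    "question_type": "knowledge-update",
    "max_examples": None,
    "budget": 300,
    "full_context": False,
    "mode": "eval",
}


def env_args_json(**overrides) -> str:
    """Return a JSON object holding only the arguments that differ from the defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown argument(s): {sorted(unknown)}")
    changed = {k: v for k, v in overrides.items() if v != DEFAULTS[k]}
    return json.dumps(changed)


# Full-context upper bound on all 78 knowledge-update questions.
args = env_args_json(max_examples=78, full_context=True)
print("prime eval run supersede -m openai/gpt-4.1-mini -a", shlex.quote(args))
```

Rejecting unknown keys up front catches typos such as `max_example` before a run starts.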
src/supersede/
models.py # bi-temporal Fact (subject, predicate, object, validity, supersession)
temporal.py # conflict detection + supersession logic
timeline.py # synthetic fact-mutation timeline generator
rollout.py # framework-agnostic bounded-memory rollout state machine
reward.py # answer matching + answered_current / stale_penalty rewards
dataset.py # LongMemEval + synthetic task loaders
env.py # verifiers MultiTurnEnv wrapper (load_environment)
environments/supersede/ # Environments Hub package (prime env push)
scripts/ # eval harnesses (LongMemEval, validation)
docs/findings/ # empirical results, with caveats
tests/ # offline tests (temporal, rollout, reward) — 21 passing
Contributions are welcome — bug reports, harder training episodes, new model
results, and reward refinements especially. See
CONTRIBUTING.md; good first issues are labeled in the
issue tracker. Please run
ruff check and pytest before opening a PR.
@misc{patel2026supersede,
title = {Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents},
author = {Patel, Vedant},
year = {2026},
doi = {10.5281/zenodo.20837384},
url = {https://doi.org/10.5281/zenodo.20837384},
note = {Vrin. https://github.com/Vrin-cloud/supersede}
}

- Questions / ideas: GitHub Discussions
- Bugs: open an issue
- Security: see SECURITY.md
- Email: vedant@vrin.cloud — Vedant Patel (vrin.cloud)
- Social: X / Twitter · LinkedIn
Apache-2.0. Built by Vrin on
verifiers, the
Prime Intellect Environments Hub, and
LongMemEval — thanks to their
maintainers for the open tooling and data.