Ferinjoque/lethe

Lethe

A persistent village of LLM-powered agents whose memories decay, distort, and mutate the way human memory actually does.

Named after the river of forgetting in Greek myth. The dead drank from it and lost the memory of their lives.

Inspect, Timeline mode: one Ebbinghaus decay curve per memory, each trail fading from full fidelity at creation down to where it stands now.


Most generative-agent systems give their agents perfect memory. Lethe gives them a broken one, on the bet that a worse memory makes a more interesting mind: that the specific failures of human recall (forgetting, misattributing, conflating, confabulating under stress) are not noise but the raw material of gossip, belief, and myth.

The hypothesis

The hypothesis has a weak form that I think the evidence now supports, and a strong form worth arguing about.

Weak form. A memory architecture with Ebbinghaus decay, fidelity-weighted retrieval, and LLM-driven distortion produces emergent social structure (rumour cascades, shared beliefs, named doctrines) from purely bottom-up dynamics, with none of those outcomes scripted.

Strong form. Bounded memory produces more emergent structure than perfect memory would, and the failure modes of human memory are the generative substrate for culture. This is the interesting claim, and I try below to be clear about where the evidence stops and the speculation begins.

Compared to prior generative agents

| | Stanford Generative Agents (2023) | Lethe |
|---|---|---|
| Memory | Perfect append-only log | Ebbinghaus decay + fidelity drift |
| Error modes | None, never misremembers | Drift, misattribution, conflation (LLM rewrites the stored text) |
| Rumours | Not modelled | Propagation graph with hop-by-hop mutation lineage |
| Affect | None | Valence/arousal mood with decay, contagion, trait drift |
| Trauma | None | Core memories that resist decay and intrude on recall |
| Belief | Reflection memories | Propositions with confidence, confirmation bias, social transmission |
| Religion | None | A shared unexplained belief across 3+ agents crystallises into a named doctrine |
| Generations | None | Agents age, retire, and seed successors with decayed hearsay |
| Observer | Read-only | An invisible "god" with a full intervention API |
| Continuity | Session-bound | A SQLite-backed daemon keeps ticking while you are away |

The two systems ask different questions. Stanford asked whether agents can simulate believable social life. They can. Lethe asks what happens to believable social life when the agents cannot trust their own memories.

Frontend

Three views over the same living world. Vanilla JS, D3 vendored locally, no build step.

| Observe | Inspect | Intervene |
|---|---|---|
| Village map and social graph, with a replay scrubber over world history. | Per-agent thoughts, an Ebbinghaus decay-curve timeline, memory autopsies, the belief strip and doctrine pane. | The god console (whisper, scarcity, nudge, rumour, stranger) and what-if branching. |

Memory model

Every memory carries content, an embedding, an importance, a fidelity, an emotional valence, a distortion level, and a provenance chain of the agents it passed through. Retrieval is ACT-R flavoured: score = (0.4 recency + 0.3 importance + 0.3 relevance) weighted by the square root of fidelity. Fidelity decays each tick, slowed by importance and emotional weight (a flashbulb effect).
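
The scoring rule above can be written out as a small function. This is a minimal sketch, not the project's code: the `MemoryScoreInputs` container is hypothetical, every component is assumed to be normalised to the range 0 to 1, and "weighted by the square root of fidelity" is read here as a plain multiplication.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryScoreInputs:
    """Hypothetical container; each field is assumed to lie in [0, 1]."""
    recency: float
    importance: float
    relevance: float
    fidelity: float


def retrieval_score(m: MemoryScoreInputs) -> float:
    """Weighted sum from the text, scaled by sqrt(fidelity) (assumed to multiply)."""
    base = 0.4 * m.recency + 0.3 * m.importance + 0.3 * m.relevance
    return base * math.sqrt(m.fidelity)


# Same memory at full fidelity and after it has faded to a quarter.
fresh = MemoryScoreInputs(recency=1.0, importance=0.5, relevance=0.5, fidelity=1.0)
faded = MemoryScoreInputs(recency=1.0, importance=0.5, relevance=0.5, fidelity=0.25)
print(retrieval_score(fresh), retrieval_score(faded))
```

Under this reading, the square root softens the penalty: a memory at a quarter of its fidelity still scores half of what it did when fresh, so degraded memories keep surfacing in recall.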

When fidelity crosses a threshold, the model is asked to rewrite the stored content, not to flag it as uncertain. The memory corrupts the way reconsolidation corrupts a real one.

flowchart LR
    E[event] --> M["memory<br/>fidelity 1.0"]
    M -->|decays each tick| D{fidelity}
    D -->|"below 0.70"| DR["drift<br/>details blur, wrong day"]
    D -->|"below 0.40"| MI["misattribute<br/>wrong person acted"]
    D -->|"below 0.10"| CO["conflate<br/>two memories merge"]
    DR --> R["may seed a<br/>new rumour"]
    MI --> R
    CO --> R
    R -->|gossip, one hop| M2["another agent's<br/>memory"]
    M2 -.->|decays too| D
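
The three thresholds in the flowchart can be read as a lookup from fidelity to distortion mode. The sketch below uses the threshold values from the diagram; the function name, the mode strings, and the rule that only the most severe unlocked mode applies are assumptions made for illustration, since the diagram does not say whether the modes are exclusive.

```python
from typing import Optional

# Threshold values come from the flowchart; the names are illustrative.
DRIFT_BELOW = 0.70
MISATTRIBUTE_BELOW = 0.40
CONFLATE_BELOW = 0.10


def distortion_mode(fidelity: float) -> Optional[str]:
    """Return the most severe distortion mode unlocked at this fidelity, or None.

    Assumption: modes are mutually exclusive and the most severe one wins.
    """
    if fidelity < CONFLATE_BELOW:
        return "conflate"
    if fidelity < MISATTRIBUTE_BELOW:
        return "misattribute"
    if fidelity < DRIFT_BELOW:
        return "drift"
    return None


for f in (0.95, 0.55, 0.25, 0.05):
    print(f, distortion_mode(f))
```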

Belief and religion follow from the same loop. During periodic consolidation ("sleep"), recent memories reflect into beliefs: propositions with a confidence that updates with a confirmation-bias multiplier and transmits in conversation. When three or more agents independently hold a confident belief about something unexplained (traced to a god intervention or a bad distortion rather than a witnessed event), the system names it a doctrine.

flowchart LR
    MEM[memories] -->|consolidation| BEL["belief<br/>+ confidence"]
    BEL -->|conversation| ADO["others adopt it<br/>confirmation bias"]
    ADO -->|"3+ share an<br/>unexplained belief"| DOC["named doctrine<br/>emergent religion"]

No agent is ever told to be religious. The doctrine is bottom-up.
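
The crystallisation rule can be sketched as a grouping step. Everything named here is hypothetical: the `Belief` record, the `unexplained` flag standing in for the provenance check, and the 0.7 confidence cut-off, which is an assumed value for "confident". The sketch also groups by exact proposition text and does not model independence between holders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    agent: str
    proposition: str
    confidence: float
    unexplained: bool  # assumed flag: provenance traces to an intervention or distortion


def find_doctrines(beliefs, min_agents=3, min_confidence=0.7):
    """Map each proposition to its holders when enough agents hold it confidently."""
    holders = defaultdict(set)
    for b in beliefs:
        if b.unexplained and b.confidence >= min_confidence:
            holders[b.proposition].add(b.agent)
    return {p: sorted(a) for p, a in holders.items() if len(a) >= min_agents}


# Made-up beliefs for illustration only.
beliefs = [
    Belief("agent_a", "something watches the granary", 0.80, True),
    Belief("agent_b", "something watches the granary", 0.90, True),
    Belief("agent_c", "something watches the granary", 0.75, True),
    Belief("agent_d", "rumours damage trust", 0.90, False),
    Belief("agent_e", "something watches the granary", 0.30, True),
]
print(find_doctrines(beliefs))
```

Here three agents clear the confidence bar on the same unexplained proposition, so it qualifies; the witnessed belief and the low-confidence holder are ignored.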

Mood, trauma, and a note on what kind of memory this is

Agents carry a valence/arousal state they cannot fully override. It decays toward a personality baseline, spreads by contagion after conversations, and biases recall (a low-valence agent retrieves more negative memories, which lowers valence further). Intense experiences become core memories that decay 10x slower and intrude on recall. Sustained distress drifts personality traits. Coping styles (approach, avoidance, displacement, sublimation) derive from Big Five and shape behaviour.
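
Two of the mood mechanics above, decay toward a baseline and contagion after a conversation, can be sketched as simple interpolation steps. The function names, the rates, and the valence range of -1 to 1 are all assumptions; the text does not give the update rules the project uses.

```python
def clamp(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Keep valence inside an assumed [-1, 1] range."""
    return max(lo, min(hi, x))


def decay_toward_baseline(valence: float, baseline: float, rate: float = 0.1) -> float:
    """Move valence a fraction `rate` of the way back to the personality baseline."""
    return clamp(valence + rate * (baseline - valence))


def apply_contagion(valence: float, partner_valence: float, strength: float = 0.2) -> float:
    """After a conversation, pull valence part of the way toward the partner's."""
    return clamp(valence + strength * (partner_valence - valence))


# A distressed agent relaxes slightly each tick; a calm one is dragged down by contact.
print(decay_toward_baseline(-0.8, baseline=0.0))
print(apply_contagion(0.0, partner_valence=-0.8))
```

With contagion applied on every conversation, one distressed agent is enough to lower a whole group's valence over time.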

One honest limitation: all of this is explicit, retrievable memory. The agent knows what hurt it and can describe it. That is closer to clinical "explicit traumatic memory" than to the implicit, pre-narrative trauma that much of human suffering runs on, the kind with no story the person can access. Modelling that would mean hidden state driving behaviour with no corresponding memory record. See open problems.


Findings

Deterministic experiments use a stub model (a fixed canned response): zero cost, fully reproducible from a seed, and they isolate the machinery (decay, mood, trust, core-memory formation, emergence) at the price of empty belief text. Semantic findings need a real model and are reported separately. Raw CSVs live in findings/; the analysis tool is scripts/analyze_experiments.py. Effect sizes are Cohen's d. Sample sizes (n = 5 to 15 worlds per condition) are descriptive, not yet inferential.
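
For readers unfamiliar with the effect size used here, this is a minimal sketch of Cohen's d with a pooled standard deviation. It is the textbook formula, not a copy of scripts/analyze_experiments.py, which may compute it differently. When both groups have zero variance, as in a perfect-separation result, d is undefined and the function raises.

```python
import math
import statistics


def cohens_d(treatment, baseline):
    """Cohen's d using the pooled sample standard deviation of the two groups."""
    n1, n2 = len(treatment), len(baseline)
    v1, v2 = statistics.variance(treatment), statistics.variance(baseline)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    if pooled_sd == 0:
        raise ValueError("pooled standard deviation is zero; d is undefined")
    return (statistics.mean(treatment) - statistics.mean(baseline)) / pooled_sd


# Toy numbers: means 4 and 3, both sample variances 4, so pooled SD is 2.
print(cohens_d([2, 4, 6], [1, 3, 5]))
```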

| Experiment | Question | Result |
|---|---|---|
| 1. Trauma vs control | Does emotional charge, not information, decide what marks a memory? | Yes. Trauma is the sole driver of core-memory formation (perfect separation, control variance zero). Same facts delivered mildly leave no permanent trace. |
| 1. (at 120 ticks) | What happens once memory has time to decay? | Every condition, including the untouched baseline, converges on an information cascade and forms a doctrine. The cascade needs no divine seed. |
| 2-4. Knob sweeps | Do distortion rate, rumour capacity, meeting frequency change outcomes? | Null at 40 to 50 ticks, for one shared reason (below). |
| 5. Generational | Does a religion outlive its founders? | Yes. Doctrines survive retirement; beliefs accumulate; collective mood heals as the wounded generation departs. |

Three results worth stating plainly:

Trauma marks, information does not. Four conditions deliver the same grain-theft rumour and whispers with different emotional charge. Only the high-trauma condition produces core memories (exactly 3.0 per world, zero variance). What you were told does not mark you. How much it hurt does. Trauma also depresses the whole village's mood through contagion (valence falls from -0.61 in the neutral condition to -0.80 under high trauma, d about -1.5), spreading from three whispered-to agents to everyone.

The interesting dynamics live past tick 80. Experiments 2 to 4 swept knobs that only act on distorted memories. At 40 to 50 ticks no memory has decayed below the drift threshold yet, so distortion rate is exactly zero and the knobs are inert. I was tuning a machine that had not switched on. This is the single most important methodological lesson here: short demos miss everything.

Given time, the architecture invents religion on its own. At 120 ticks all four conditions reach the same arc and form a doctrine. The untouched baseline gains the most emergence (+166%) because it started lowest.

| Condition | Emergence (40t to 120t) | Arc shift | Doctrines |
|---|---|---|---|
| high trauma | 0.238 to 0.468 | isolation to information cascade | 1 |
| low trauma | 0.237 to 0.471 | isolation to information cascade | 1 |
| neutral | 0.228 to 0.472 | isolation to information cascade | 1 |
| no god | 0.178 to 0.474 | isolation to information cascade | 1 |
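
The +166% figure for the untouched baseline follows directly from the emergence values in the table. A short check of the arithmetic, using only those numbers (the helper name is illustrative):

```python
# Emergence at 40 ticks and at 120 ticks, copied from the table.
emergence = {
    "high trauma": (0.238, 0.468),
    "low trauma": (0.237, 0.471),
    "neutral": (0.228, 0.472),
    "no god": (0.178, 0.474),
}


def relative_gain(start: float, end: float) -> float:
    """Percentage change from start to end."""
    return (end - start) / start * 100.0


gains = {name: relative_gain(*values) for name, values in emergence.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.0f}%")
```

This prints +97%, +99%, +107%, and +166% in table order. All four conditions end within 0.006 of each other, so the differences in gain come almost entirely from the starting values.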

What the intervention changes long-term is not whether a religion forms but the village interior: intervention worlds end up with more beliefs, lower trust (d about -5), more persistent depression, and under high trauma twelve permanent core-memory scars no other condition carries. Bounded memory generates the myth on its own. What the invisible hand leaves behind is not the myth but the wound.

Generational survival, in detail (Experiment 5)

Five high-trauma worlds run to 120 ticks, past the generational turnover at 100.

  • Doctrines crystallise in 100% of worlds and survive the retirement of the generation that formed them.
  • Beliefs accumulate across generations, from 5 at tick 40 to 13 at tick 120.
  • Collective trauma partially heals. Mean valence recovers from -0.80 to -0.62 as the wound-carrying founders retire and successors arrive unscarred. Trust is unchanged: the social structure survives the handover, only the mood lifts. The grief is not resolved so much as buried with the people who carried it.

Running on a real model

The stub cannot show meaning: every agent converges on the same text. To see the real thing I ran the high-trauma scenario live on gpt-4o-mini. The target was 100+ ticks, far enough into the distortion regime and past the generational turnover to watch confabulated rumours feed doctrine formation. It reached tick 60 before the model's free daily token tier ran out (which is also why ticking slowed near the end: a budget limit, not a bug). So this is a partial run, and the gap between where it stopped and where it was headed is the open question below.

What the model produced that the stub never could:

  • Rumours mutate with real content. "Bob was seen carrying sacks from the communal grain store at night" had, by its first retelling, grown the detail "just before the Winter Festival." A specific that was never there.
  • Characters stay in character. Dmitri the elder: "the whispers are a cancer we must address; we need to gather our neighbours." Bob the accused: "I'm tired of this. Everyone's whispering and I can't just stand by while they twist the truth."
  • Confabulation is total. As fidelity decayed, one agent's memories filled with people who do not exist (Steve, Jake, Jordan, a dozen invented names) and relocated village events to "the city carnival." She lost her own name in one remembered greeting. This is exactly the gap-filling that distortion is meant to model, and it arose unprompted.
  • Supernatural attribution begins. By tick 50 Alice held: "the threat to our village is more likely to be a dark force or entity rather than just human mischief." Carol reached for "a deeper secret beyond the grain shortages." The first stones of a religion.

And the result I did not expect, which is the most interesting thing in the project:

A capable model resists mythologising. The proto-myth never reached the three-agent threshold. Two of five agents made the supernatural leap; the other three consolidated the same material into sober sociology ("rumours damage trust"). gpt-4o-mini is pulled by its training toward rational-actor narratives, away from "a spirit did this," even when the evidence in front of it is genuinely inexplicable.

If that holds, it is a real and slightly unsettling claim: the more capable and better aligned the model, the harder it is to make its agents superstitious, and so the harder to grow a religion in silico. The stub (no reasoning at all) crystallises doctrines readily. The smart model resists. The substrate for shared myth may be something a sufficiently rational mind actively suppresses.


What is assumed, and might be wrong

Every result above is bounded by these modelling choices.

  • Fidelity is a single scalar. Real forgetting is multidimensional: you can keep the gist and lose the source. Lethe collapses that into one number.
  • Distortion equals an LLM rewrite. I assume a model corrupting text is a fair stand-in for reconsolidation. It looks right, which is not the same as validated.
  • The god-attribution metric measures provenance, not content. It counts beliefs whose source traces to an intervention or a distortion, not beliefs that are semantically supernatural. Do not over-read it.
  • The proto-myth detector matches on a text prefix. A brittle proxy for semantic convergence. It almost certainly undercounts emergent religion in the real-model runs.
  • One tick is read as roughly one day. An interpretive convenience, not a modelled quantity.
  • Belief is an explicit proposition with a scalar confidence. This ignores implicit, procedural, and somatic knowing entirely.

Open problems and things that would be interesting

The gap I most want to close:

  • Rich meaning inside the distortion regime has never been observed. The stub gives the distortion cascade at 120 ticks but with empty content. The real model gives rich content but exhausts a free token tier by tick 60, before the cascade fully engages. The single most interesting run, a real model carried past tick 100 watching genuine confabulated rumours feed genuine doctrine formation, sits in the unobserved intersection. It needs a paid tier, memory pruning to cut per-tick token cost, or a cheaper fast model for routine ticks. This is the next thing to do.

Other directions, roughly in order of appeal:

  • Does a less capable model mythologise faster? The rationalism finding predicts an inverse relationship between model capability and superstition. A sweep across model sizes would test it directly and is cheap.
  • Convergent or divergent myth? Many seeds of one scenario on a real model. Do the villages invent the same god or different ones? Either answer is interesting.
  • Implicit, unconscious trauma. Everything here is explicit and retrievable. Real trauma often is not: it shapes behaviour through channels the person cannot narrate. Modelling that means hidden state that drives behaviour with no memory record, an agent that is afraid and cannot say of what.
  • Statistical rigour. Current n is descriptive. A proper study wants n of 30 or more per condition with significance tests.
  • Competing gods. Two observers intervening against each other, conflict as a first-class event. Turns the toy into a small social-dynamics game.

Architecture

lethe/
  memory/        decay, distortion, ACT-R retrieval, consolidation, beliefs
  social/        relationship graph, rumour propagation with mutation lineage
  agents/        Big-Five personality, mood, coping, goals, system prompt
  world/         tick engine, in-process runner, scheduler, emergence and arc,
                 religion, generations, consequence graph, what-if branching
  llm/           free-tier router with fallback and a cost guard; local embeddings
  persistence/   SQLite schema (WAL, thread-local connections), snapshots
  god/           intervention API: whisper, introduce, scarcity, nudge, rumour
  api/           FastAPI and SSE; serves the frontend and the world endpoints
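
The persistence line mentions WAL mode and thread-local connections. For readers who have not used that pattern, here is a minimal sketch with the standard library. The `get_connection` helper is hypothetical and is not taken from the lethe/persistence package; it only illustrates the general technique of one SQLite connection per thread with write-ahead logging enabled.

```python
import sqlite3
import threading

_local = threading.local()


def get_connection(db_path: str) -> sqlite3.Connection:
    """Return this thread's connection to db_path, creating it on first use."""
    conns = getattr(_local, "conns", None)
    if conns is None:
        conns = _local.conns = {}
    if db_path not in conns:
        conn = sqlite3.connect(db_path)
        # WAL lets readers proceed while a single writer appends to the log.
        conn.execute("PRAGMA journal_mode=WAL")
        conns[db_path] = conn
    return conns[db_path]
```

By default `sqlite3` refuses to use a connection from a thread other than the one that created it, so each thread holding its own connection avoids that error without extra locking.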

Quickstart

git clone <repo> && cd lethe
pip install -e ".[dev]"          # or: uv sync

cp .env.example .env             # add at least one provider key

pytest                           # 331 unit and integration tests, ~81% coverage
pytest e2e/ --no-cov             # 23 Playwright E2E (run: playwright install chromium first)

python scripts/scenario_seed.py  # seed "The Grain Thief": five agents, one rumour

# One process serves the API and frontend and ticks the world in place:
LETHE_AUTORUN=1 uvicorn lethe.api.server:app --port 8090
# open http://localhost:8090

On Windows the repo convention is to call the venv interpreter directly, for example ./.venv/Scripts/python.exe -m pytest.

Reproducing the experiments

# Full deterministic sweep, about 20 to 40 minutes, $0, CPU only.
# Use 120 ticks, not 40: the short runs miss the distortion regime entirely.
python scripts/experiment.py --worlds 8 --ticks 120 --stub --seed 1 \
  --scenario high_trauma low_trauma neutral no_god --out findings/exp1_trauma_120t.csv

# Per-condition means and Cohen's d for any result CSV:
python scripts/analyze_experiments.py --csv findings/exp1_trauma_120t.csv --baseline no_god

A real-model run is the same command without --stub. It costs API tokens and self-limits via a daily cost guard (LETHE_OPENAI_MAX_DAILY_COST_USD, default $1).

Run it 24/7, and the free-tier model stack

docker compose up -d --build     # http://localhost:8090, state in a named volume

fly.toml is included for a Fly.io deploy that keeps one machine alive. The adaptive scheduler spreads a daily token budget across active hours and slows overnight, so a single free tier can sustain the world for days.

The router tries providers in order and falls back on any failure or rate limit:

  1. OpenAI gpt-4o-mini, primary, 2.5M tokens per day on the free tier
  2. Gemma 4 31B and 26B (Google AI Studio), fallback
  3. OpenRouter (Llama 3.3 70B free), optional
  4. Groq, fastest inference
  5. Cerebras, high token budget
  6. Ollama, local last resort, fully private, $0

At least one key is required for a live world. The research harness needs none (--stub). A per-day cost guard falls back to free providers once spend crosses the cap, so a live run cannot quietly run up a bill.
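
The fallback order and the cost guard can be sketched together. This is an illustration under stated assumptions, not the router in lethe/llm: the `Provider` record, the `free` flag, and the `complete` function are hypothetical, and the $1 default cap is borrowed from the LETHE_OPENAI_MAX_DAILY_COST_USD default mentioned earlier.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Provider:
    """Hypothetical provider record; `call` takes a prompt and returns text."""
    name: str
    call: Callable[[str], str]
    free: bool = True


def complete(
    prompt: str,
    providers: Sequence[Provider],
    spent_today_usd: float = 0.0,
    daily_cap_usd: float = 1.0,
) -> Tuple[str, str]:
    """Try providers in order and return (provider name, response text)."""
    over_cap = spent_today_usd >= daily_cap_usd
    errors = []
    for provider in providers:
        if over_cap and not provider.free:
            continue  # cost guard: skip paid providers once the cap is reached
        try:
            return provider.name, provider.call(prompt)
        except Exception as exc:  # any failure or rate limit moves to the next one
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Keeping the order in a plain sequence means the priority list above is data, not control flow, so reordering providers needs no code change.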


License

MIT
