Ontology-grounded memory primitive for LLM agents. Tiered salience decay, closed-world fact extraction, per-agent filtered views. Pure-stdlib storage, no vector store, no graph database.
ontocontext is a small library you wire into your own LLM agent or multi-agent system to give it selective, decaying long-term context — the kind of memory that remembers a user's allergy across an entire session, lets a "where's the bathroom?" question fade after three turns, and lets each subagent see only the slice of state it needs.
The repository also ships a museum-chatbot example as a reference integration, but the library is the product. The example is in examples/museum_chatbot/.
| | ontocontext | Vector-store memory (mem0, etc.) |
|---|---|---|
| What it remembers | Only what your ontology declares | Whatever the LLM thinks is salient |
| Forgetting | Declarative per-tier decay + bounded cardinality with recency/salience eviction | LLM-judged DELETE; otherwise accumulates |
| Multi-agent slicing | First-class: filter by concept prefix or class | Bolt-on at the orchestrator |
| Auditability | Read one JSON file to know what the system can remember | Inspect emitted facts across thousands of turns |
| Cost per turn | One LLM call (extraction) | Two-plus (extraction + reconcile) |
| Storage | dict[str, Fact] in process | Vector DB + optional graph DB |
If your domain is bounded enough that you can name the categories of things worth remembering, ontocontext is a much smaller, more predictable lever. If it isn't, pair it with a vector store — or use mem0. There's a longer comparison in docs/oc_vs_mem0.md.
pip install ontocontext
# or
uv add ontocontext
Python 3.10+. Runtime dependencies: requests (used by the bundled OpenRouter extractor; swappable) and rapidfuzz (used by the optional fuzzy-merging matcher; pluggable).
from ontocontext import MemoryStore, extract_facts, filtered_block
The minimum per-turn loop. No museum framing; bring your own ontology.
sequenceDiagram
autonumber
actor User
participant App as Your app
participant Ext as extract_facts<br/>(extractor LLM)
participant Mem as MemoryStore
participant Chat as Chat LLM
User->>App: user_msg
App->>Ext: user_msg + ontology
Ext-->>App: facts: [{concept, value,<br/>polarity, evidence}]
loop for each fact
App->>Mem: upsert(concept, value, polarity, evidence)
end
App->>Mem: tick() — decay all facts,<br/>prune below threshold
Mem-->>App: context_block()
App->>Chat: system_prompt(memory) + user_msg
Chat-->>App: bot reply
App-->>User: bot reply
Two LLM calls per turn: one cheap, structured extraction call (steps 2–3) and one chat completion (steps 7–8). Everything between is deterministic.
import json
from ontocontext import MemoryStore, extract_facts

with open("your_ontology.json") as f:
    ontology = json.load(f)

memory = MemoryStore(ontology)

def handle_turn(user_msg: str, llm_call) -> str:
    # 1. Extract ontology-mapped facts from the user turn.
    facts = extract_facts(
        user_msg=user_msg,
        bot_msg=None,  # optionally pass last assistant turn for confirmations
        ontology=ontology,
        model="openai/gpt-4o-mini",  # any OpenRouter model
    )
    # 2. Upsert. The store handles negation, capacity/eviction, and reinforcement.
    for f in facts:
        memory.upsert(
            concept=f["concept"],
            value=f["value"],
            polarity=f["polarity"],
            evidence=f["evidence"],
        )
    # 3. Decay everything one turn; prune below threshold.
    memory.tick()
    # 4. Inject the current memory block into your agent's system prompt.
    system_prompt = (
        "You are <agent role>.\n\n"
        f"What we know about the user:\n{memory.context_block()}"
    )
    # 5. Hand off to whatever LLM you use.
    return llm_call(system_prompt, user_msg)

That's the whole integration. MemoryStore is pure in-process state; serialize it yourself if sessions need to survive restarts (see Persistence).
from ontocontext import (
MemoryStore, # tiered fact store with per-turn decay
Fact, # dataclass for one stored fact
extract_facts, # closed-world fact extractor (OpenRouter)
filtered_block, # per-agent context view, filtered by concept/class
build_concept_catalog, # render ontology as extractor catalog text
EXTRACTOR_SYSTEM, # extractor system-prompt template with {catalog} slot
DECAY, # dict: per-class decay multipliers
PRUNE_THRESHOLD, # float: salience floor for non-permanent facts
)
All other symbols (modules, helpers prefixed with _) are internal and may change without notice.
| Class | Per-turn decay | Examples |
|---|---|---|
| permanent | 1.00 (none) | disability, language preference, allergy |
| long_term | 0.99 | preferences, repeated interests |
| session | 0.85 | "with my daughter today", "I have 1 hour" |
| ephemeral | 0.55 | "I need a coffee", transient questions |
Below salience = 0.10, non-permanent facts are pruned. Permanent facts are removed only by explicit negation.
stateDiagram-v2
direction LR
[*] --> alive : upsert() — salience = salience_weight
alive --> alive : tick() — salience × decay
alive --> alive : re-mention — salience + bump
alive --> pruned : salience < 0.10
alive --> pruned : polarity negated
alive --> evicted : concept over cardinality
pruned --> [*]
evicted --> [*]
note right of alive
permanent ×1.00 — never pruned automatically
long_term ×0.99 — ~230 turns to threshold
session ×0.85 — ~14 turns to threshold
ephemeral ×0.55 — ~4 turns to threshold
eviction — recency drops the oldest,
salience drops the weakest
end note
The ontology is the contract between the extractor (what it is allowed to emit) and the memory store (how each fact decays and updates). Concept IDs are arbitrary strings; convention is Domain.Subtype.
flowchart LR
O[("ontology.json<br/>concepts + metadata")]
U([user message]) --> E
E[extract_facts<br/>LLM extractor]
S[MemoryStore<br/>upsert + tick]
CB[/context_block /<br/>filtered_block/]
O -. "concept catalog<br/>(label, examples)" .-> E
O -. "per-concept rules<br/>(persistence_class,<br/>salience_weight,<br/>cardinality, eviction)" .-> S
E -- "facts:<br/>concept, value,<br/>polarity, evidence" --> S
S --> CB
classDef contract fill:#fef3c7,stroke:#d97706,stroke-width:2px
class O contract
The ontology sits between the two consumers as a single source of truth: the extractor reads it to know what it's allowed to emit, and the store reads it to know how each emitted fact should age and update. If a concept isn't in the ontology, the extractor can't emit it, and the store wouldn't know how to handle it if it did.
{
"version": "0.2",
"description": "Closed ontology for <your domain>",
"concepts": {
"Domain.SomeConcept": {
"label": "Human-readable name shown in the LLM prompt",
"examples": ["I want X", "give me Y"],
"persistence_class": "permanent | long_term | session | ephemeral",
"salience_weight": 0.0,
"cardinality": 1,
"eviction": "recency | salience"
}
}
}
| Field | What it controls |
|---|---|
| label | Text rendered in the context block injected into your prompt. Write it for the LLM, not the developer. |
| examples | Few-shot anchors handed to the extractor. 2–3 short, varied examples are enough. |
| persistence_class | Which decay tier the fact lives in, i.e. when it fades. |
| salience_weight | Starting salience when first asserted; also caps reinforcement. Higher = survives more decay turns. |
| cardinality | How many values may coexist under this concept: a positive int (1, 5, …) or "unlimited". |
| eviction | When the concept is over capacity, which fact loses: recency (drop the oldest) or salience (drop the weakest). |
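Because the ontology is hand-written JSON, it can help to lint it before handing it to the store. The following is a minimal sketch that checks only the constraints stated in the table above; `validate_ontology` is a hypothetical helper, not a library function, and it assumes 0.2-style concepts that declare cardinality and eviction explicitly.

```python
VALID_CLASSES = {"permanent", "long_term", "session", "ephemeral"}
VALID_EVICTION = {"recency", "salience"}


def validate_ontology(ontology: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks fine."""
    problems = []
    for cid, spec in ontology.get("concepts", {}).items():
        if spec.get("persistence_class") not in VALID_CLASSES:
            problems.append(f"{cid}: unknown persistence_class")
        card = spec.get("cardinality")
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_pos_int = isinstance(card, int) and not isinstance(card, bool) and card > 0
        if not (is_pos_int or card == "unlimited"):
            problems.append(f"{cid}: cardinality must be a positive int or 'unlimited'")
        if spec.get("eviction") not in VALID_EVICTION:
            problems.append(f"{cid}: eviction must be 'recency' or 'salience'")
        weight = spec.get("salience_weight")
        if isinstance(weight, bool) or not isinstance(weight, (int, float)):
            problems.append(f"{cid}: salience_weight must be a number")
    return problems


if __name__ == "__main__":
    sample = {"concepts": {"Position.CurrentArea": {
        "label": "Where the visitor is", "examples": ["I'm in the lobby"],
        "persistence_class": "session", "salience_weight": 0.8,
        "cardinality": 1, "eviction": "recency"}}}
    print(validate_ontology(sample))
```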
Changed in 0.2.0. Memory shape is declared on three independent axes. The old update_policy field is deprecated but still honoured; see Migrating.
- persistence_class (the time axis): how fast a fact decays. Unchanged.
- cardinality (the count axis): how many values may live under one concept at once. 1 keeps a single value; an integer K keeps a bounded set; "unlimited" accumulates.
- eviction (the conflict axis): when a concept exceeds its cardinality, recency drops the oldest fact and salience drops the weakest.
Two worked concepts:
| concept | persistence | cardinality | eviction | behaviour |
|---|---|---|---|---|
| Position.CurrentArea | session | 1 | recency | only the latest place; a new area overwrites the old |
| ArtInterest.Medium | long_term | 8 | salience | up to eight interests; the weakest is dropped when a ninth arrives |
cardinality: 1, eviction: recency is exactly the old superseded behaviour — now expressible alongside bounded plurality ("keep the strongest 5"), which update_policy could not express.
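To make the two eviction rules concrete, here is a toy model of what happens when a concept goes over capacity. It is a sketch of the described behaviour, not the store's actual code: `ToyFact` and `evict_over_capacity` are hypothetical, and "oldest" is assumed to mean the fact with the earliest last-mentioned turn.

```python
from dataclasses import dataclass


@dataclass
class ToyFact:
    value: str
    salience: float
    last_turn: int  # assumption: recency is judged by the last turn the fact was mentioned


def evict_over_capacity(facts, cardinality, eviction):
    """Return the facts that survive once `cardinality` is enforced."""
    if cardinality == "unlimited" or len(facts) <= cardinality:
        return list(facts)
    if eviction == "recency":
        rank = lambda f: f.last_turn   # keep the most recently mentioned
    elif eviction == "salience":
        rank = lambda f: f.salience    # keep the strongest
    else:
        raise ValueError(f"unknown eviction rule: {eviction!r}")
    return sorted(facts, key=rank, reverse=True)[:cardinality]


if __name__ == "__main__":
    areas = [ToyFact("lobby", 0.6, 1), ToyFact("gallery 3", 0.8, 4)]
    print([f.value for f in evict_over_capacity(areas, 1, "recency")])
```

With cardinality 1 and recency, the newer area replaces the older one, which is the overwrite behaviour described for Position.CurrentArea.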
Concepts that still declare update_policy keep working and emit a DeprecationWarning at MemoryStore construction. Replace it with explicit cardinality/eviction using this mapping:
| legacy update_policy | cardinality | eviction |
|---|---|---|
| superseded | 1 | recency |
| mutable | "unlimited" | salience |
| monotonic | "unlimited" | salience |
(monotonic and mutable behaved identically in practice — both accumulated and never evicted — so both map to unlimited + salience.)
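The mapping is mechanical enough to script. A minimal sketch of a one-off migration over an ontology dict follows; `migrate_ontology` and `migrate_concept` are hypothetical helpers (the library does not ship them), and the sketch assumes explicit cardinality/eviction values should win if a concept already declares them.

```python
import copy

# Mapping copied from the table above.
LEGACY_MAP = {
    "superseded": (1, "recency"),
    "mutable": ("unlimited", "salience"),
    "monotonic": ("unlimited", "salience"),
}


def migrate_concept(spec: dict) -> dict:
    """Return a copy of one concept with update_policy replaced by cardinality/eviction."""
    out = dict(spec)
    policy = out.pop("update_policy", None)
    if policy is None:
        return out  # already in the 0.2 shape
    if policy not in LEGACY_MAP:
        raise ValueError(f"unknown update_policy: {policy!r}")
    cardinality, eviction = LEGACY_MAP[policy]
    out.setdefault("cardinality", cardinality)  # keep explicit values if present
    out.setdefault("eviction", eviction)
    return out


def migrate_ontology(ontology: dict) -> dict:
    """Return a migrated deep copy; the input is left untouched."""
    migrated = copy.deepcopy(ontology)
    migrated["concepts"] = {
        cid: migrate_concept(spec) for cid, spec in migrated.get("concepts", {}).items()
    }
    return migrated
```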
Heuristics for assigning a persistence class:
- permanent: things you must not forget mid-session, such as disabilities, language preference, allergies, account tier, jurisdiction.
- long_term: soft preferences worth carrying forward across turns and ideally across sessions, such as favorite styles, repeated requests, named entities the user cares about.
- session: bounds of this conversation, such as time budget, current goal, who's in the room.
- ephemeral: should vanish in a few turns if not re-mentioned, such as immediate needs, transient mood, just-asked questions.
When in doubt, choose the shorter lifetime. Over-retention causes prompt bloat and stale recommendations; under-retention is recoverable because the user usually re-states what matters.
A complete worked ontology lives in examples/museum_chatbot/ontology.json.
extract_facts() is a thin wrapper over OpenRouter. To swap, copy it and replace just the HTTP block. The system prompt and catalog builder are part of the public API:
import json
from ontocontext import build_concept_catalog, EXTRACTOR_SYSTEM
def extract_facts_anthropic(user_msg, bot_msg, ontology, client):
    system_prompt = EXTRACTOR_SYSTEM.format(catalog=build_concept_catalog(ontology))
    convo = f"USER: {user_msg}" + (f"\nASSISTANT: {bot_msg}" if bot_msg else "")
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        system=system_prompt,
        messages=[{"role": "user", "content": convo}],
        temperature=0,
        max_tokens=512,
    )
    data = json.loads(resp.content[0].text)
    # Keep the defensive validation from the bundled extractor verbatim:
    return [f for f in data.get("facts", [])
            if isinstance(f, dict)
            and f.get("concept") in ontology["concepts"]
            and f.get("polarity") in ("asserted", "negated")
            and f.get("value") and f.get("evidence")]

The validation block is provider-agnostic and load-bearing; keep it whatever LLM you use.
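To see what that filter does in isolation, the sketch below applies the same conditions to a raw model reply and additionally tolerates replies that are not valid JSON. The non-JSON handling is an assumption added here for illustration (whether the bundled extractor does the same is not stated), and `parse_and_filter` is a hypothetical name.

```python
import json


def parse_and_filter(reply_text: str, ontology: dict) -> list[dict]:
    """Parse an extractor reply and keep only well-formed, in-ontology facts."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return []  # assumption: treat an unparseable reply as "no facts this turn"
    if not isinstance(data, dict):
        return []
    # Same conditions as the validation block above.
    return [f for f in data.get("facts", [])
            if isinstance(f, dict)
            and f.get("concept") in ontology["concepts"]
            and f.get("polarity") in ("asserted", "negated")
            and f.get("value") and f.get("evidence")]


if __name__ == "__main__":
    ontology = {"concepts": {"SpecialNeed.Hearing": {}}}
    reply = json.dumps({"facts": [
        {"concept": "SpecialNeed.Hearing", "value": "deaf",
         "polarity": "asserted", "evidence": "I'm deaf"},
        {"concept": "Weather.Today", "value": "rainy",        # not in the ontology
         "polarity": "asserted", "evidence": "it's raining"},
    ]})
    print(parse_and_filter(reply, ontology))
```

Only the first fact survives; the out-of-ontology one is dropped silently, which is the closed-world behaviour the library is built around.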
memory.context_block() returns all surviving facts as one string. For a multi-agent system you typically want each subagent to see only the slice it needs. Three patterns work today.
Simplest. The orchestrator runs extraction once per turn, then every subagent receives the same context block. Works fine when agents are few and most concepts are cross-cutting.
shared_memory = MemoryStore(ontology)
# orchestrator does extract → upsert → tick once
context = shared_memory.context_block()
accessibility_agent.run(system=ACCESS_PROMPT + context, user=user_msg)
wayfinding_agent.run(system=WAY_PROMPT + context, user=user_msg)
flowchart TB
U([user turn]) --> EX[extract_facts]
EX --> MS[("MemoryStore\n(shared)")]
MS --> CB["context_block\n(all facts)"]
CB --> A1[agent 1]
CB --> A2[agent 2]
CB --> A3[agent 3]
Downside: prompt bloat scales with subagent count, and small models get distracted by irrelevant facts.
Think of each subagent as subscribing to specific kinds of memories. A wayfinding agent subscribes to SpecialNeed.* and ImmediateNeed.*. A recommendations agent subscribes to ArtInterest.* and VisitPlan.*. The subscription declaration sits with the agent definition, so the agent structurally cannot answer without seeing its relevant memory slice; no prompt wording is required to remind the LLM.
This matters more than it looks. Consider a wheelchair user who asks "where is the bathroom?" In a single-agent setup (Pattern A) the LLM has the wheelchair fact in context but has to remember to use it on every directions query, which LLMs do inconsistently. With a subscribed wayfinding agent, SpecialNeed.MobilityImpairment is always in the slice that agent sees — the cross-reference is forced by the architecture, not requested by the prompt.
Use the shipped filtered_block helper to render an agent's subscription:
from ontocontext import filtered_block
wayfinding_agent.run(
system=WAY_PROMPT + filtered_block(memory, concept_prefixes=["SpecialNeed.", "ImmediateNeed."]),
user=user_msg,
)
recommendations_agent.run(
system=REC_PROMPT + filtered_block(memory, concept_prefixes=["ArtInterest.", "VisitPlan."]),
user=user_msg,
)
accessibility_agent.run(
system=ACCESS_PROMPT + filtered_block(memory, concept_prefixes=["SpecialNeed."]),
user=user_msg,
)
Single source of truth (one extraction pass, one decay clock) with role-relevant slices. Privacy bonus: the recommendations agent never sees disability disclosures it doesn't need.
A clean way to make subscriptions explicit is to bind them to the agent definition itself:
AGENTS = {
"wayfinding": {"prefixes": ["SpecialNeed.", "ImmediateNeed."], "prompt": WAY_PROMPT},
"accessibility": {"prefixes": ["SpecialNeed."], "prompt": ACCESS_PROMPT},
"recommendations": {"prefixes": ["ArtInterest.", "VisitPlan."], "prompt": REC_PROMPT},
"immediate": {"classes": ["ephemeral"], "prompt": NEED_PROMPT},
}
for name, cfg in AGENTS.items():
    block = filtered_block(memory,
                           concept_prefixes=cfg.get("prefixes"),
                           classes=cfg.get("classes"))
    dispatch(name, system=cfg["prompt"] + block, user=user_msg)

The dict is the subscription contract: changing what an agent "knows about the user" is a one-line edit, reviewable in a PR, and impossible to forget in a prompt.
flowchart TB
U([user turn]) --> EX["extract_facts\n(one pass)"]
EX --> MS[("MemoryStore\n(shared)")]
MS --> FB1["filtered_block\nSpecialNeed.*, ImmediateNeed.*"]
MS --> FB2["filtered_block\nArtInterest.*, VisitPlan.*"]
MS --> FB3["filtered_block\nSpecialNeed.*"]
FB1 --> WA[wayfinding_agent]
FB2 --> RA[recommendations_agent]
FB3 --> AA[accessibility_agent]
classDef filter fill:#ede9fe,stroke:#7c3aed,stroke-width:1px
class FB1,FB2,FB3 filter
For fully independent subagents (different domains, different LLMs, different lifecycles) give each its own MemoryStore and its own ontology fragment. Run extraction once per agent. More expensive, maximally isolated. Reach for it only when state really must not cross.
flowchart TB
U([user turn])
U --> EX1["extract_facts\n(ontology A)"]
U --> EX2["extract_facts\n(ontology B)"]
EX1 --> MS1[("MemoryStore A\n+ ontology A")]
EX2 --> MS2[("MemoryStore B\n+ ontology B")]
MS1 --> AG1[agent A]
MS2 --> AG2[agent B]
MemoryStore.facts is a dict[str, Fact] of plain dataclasses, so serialization is one line per direction.
from dataclasses import asdict
from ontocontext import Fact, MemoryStore
# Save (e.g. on session end)
snapshot = {
"turn": memory.turn,
"facts": [asdict(f) for f in memory.facts.values()
if f.persistence_class in ("permanent", "long_term")],
}
# Load (e.g. on session start)
memory = MemoryStore(ontology)
memory.turn = snapshot["turn"]
for d in snapshot["facts"]:
    f = Fact(**d)
    memory.facts[f.id] = f
flowchart LR
subgraph SN["Session N"]
M1[("MemoryStore")]
end
SNAP["snapshot\n(permanent + long_term only)"]
subgraph SN1["Session N+1"]
M2[("MemoryStore\n(restored)")]
end
GONE(["session + ephemeral\ndropped intentionally"])
M1 -->|"asdict()"| SNAP
M1 -.->|"not saved"| GONE
SNAP -->|"Fact(**d)"| M2
classDef dropped fill:#fee2e2,stroke:#dc2626,stroke-width:1px
class GONE dropped
Round-tripping session and ephemeral facts across sessions is almost always wrong — those tiers exist precisely because they should not survive.
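The snapshot above is a plain dict, so writing it to disk is a matter of passing it through `json`. The sketch below shows the full round trip with a stand-in dataclass so it runs without the library; `DemoFact` and its fields are hypothetical (only `id` and `persistence_class` are known from the snippet above), and the real Fact may carry more or different fields.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DemoFact:  # stand-in for ontocontext.Fact; field names are assumptions
    id: str
    concept: str
    value: str
    persistence_class: str
    salience: float


DURABLE = ("permanent", "long_term")  # tiers worth carrying across sessions


def dump_snapshot(turn: int, facts: dict[str, DemoFact]) -> str:
    """Serialize only the durable tiers to a JSON string."""
    kept = [asdict(f) for f in facts.values() if f.persistence_class in DURABLE]
    return json.dumps({"turn": turn, "facts": kept})


def load_snapshot(text: str) -> tuple[int, dict[str, DemoFact]]:
    """Rebuild the turn counter and the facts dict from a JSON string."""
    snap = json.loads(text)
    facts = {d["id"]: DemoFact(**d) for d in snap["facts"]}
    return snap["turn"], facts


if __name__ == "__main__":
    facts = {
        "f1": DemoFact("f1", "SpecialNeed.Hearing", "deaf", "permanent", 1.0),
        "f2": DemoFact("f2", "ImmediateNeed.Drink", "coffee", "ephemeral", 0.3),
    }
    turn, restored = load_snapshot(dump_snapshot(12, facts))
    print(turn, list(restored))
```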
The library reads two values from code:
| Knob | Where | Effect |
|---|---|---|
| DECAY | ontocontext.memory.DECAY | Per-turn multiplier per persistence class. Lower = faster forgetting. |
| PRUNE_THRESHOLD | ontocontext.memory.PRUNE_THRESHOLD (also reads PRUNE_THRESHOLD env var) | Salience floor below which non-permanent facts are dropped. |
| salience_weight | per-concept in your ontology JSON | Starting salience and reinforcement ceiling. |
| Reinforcement bump | memory.py (0.3 * base) | How much a re-mention raises salience. |
Decay defaults are tuned for one fact extraction per conversational turn. If your app extracts (and ticks) per websocket message, per token, or on a timer, move the decay factors closer to 1.0, or facts will be pruned long before the conversation has moved on.
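One way to pick the adjusted factors, sketched under the assumption that you tick a roughly constant number of times per conversational turn: take the k-th root of each per-turn multiplier, so that k ticks compound back to the original per-turn decay. `rescale_decay` is a hypothetical helper, and the multipliers are copied from the tier table; how you feed the result back into the library is up to your integration.

```python
# Per-turn multipliers copied from the tier table (not imported from ontocontext).
PER_TURN = {"permanent": 1.00, "long_term": 0.99, "session": 0.85, "ephemeral": 0.55}


def rescale_decay(per_turn: dict[str, float], ticks_per_turn: float) -> dict[str, float]:
    """Return per-tick multipliers that compound to the per-turn ones."""
    if ticks_per_turn <= 0:
        raise ValueError("ticks_per_turn must be positive")
    return {tier: factor ** (1.0 / ticks_per_turn) for tier, factor in per_turn.items()}


if __name__ == "__main__":
    # Example: the app ticks on every websocket message, about 4 per user turn.
    for tier, factor in rescale_decay(PER_TURN, 4).items():
        print(f"{tier}: {factor:.4f}")
```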
The bundled museum example also reads CHAT_MODEL, EXTRACTOR_MODEL, and HISTORY_SIZE from the environment, but those are conventions of the example, not the library.
By default, MemoryStore dedupes facts within a concept by exact lower-stripped string equality. That means value="photography" on turn 1 and value="photos" on turn 5 produce two facts under the same ArtInterest.Style concept, when you probably wanted one reinforced fact. Two kwargs on MemoryStore.__init__ enable optional fuzzy merging:
from ontocontext import MemoryStore
# Off by default: strict exact match, current behavior.
memory = MemoryStore(ontology)
# Opt in: rapidfuzz WRatio with a sane starting threshold.
memory = MemoryStore(ontology, similarity_threshold=0.80)
# Now "photography" + "photos" → 1 fact (reinforced)
# "Pablo Picasso" + "Picasso" → 1 fact
# "Monet" + "Picasso" → 2 facts (correctly rejected)
| Knob | Default | Effect |
|---|---|---|
| similarity_threshold | 1.0 | Cutoff in [0, 1]. 1.0 = strict exact match; < 1.0 enables fuzzy merging within a concept. ~0.80 works well with the default matcher. |
| similarity_fn | None | Optional (a, b) -> float callable. None uses rapidfuzz.fuzz.WRatio on lower-stripped strings. |
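Any callable with the `(a, b) -> float` shape can stand in for the default matcher. As an illustration, here is a stdlib-only similarity function built on `difflib.SequenceMatcher`. It is a sketch, not a drop-in equivalent: `difflib_similarity` is a hypothetical name, and its scores are on a different scale from WRatio, so the threshold would need re-tuning.

```python
from difflib import SequenceMatcher


def difflib_similarity(a: str, b: str) -> float:
    """Character-overlap similarity in [0, 1] on lower-stripped strings."""
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    return SequenceMatcher(None, a_norm, b_norm).ratio()


if __name__ == "__main__":
    for pair in [("Pablo Picasso", "Picasso"), ("Monet", "Picasso")]:
        print(pair, round(difflib_similarity(*pair), 2))

# Hypothetical wiring, following the kwargs in the table above:
# memory = MemoryStore(ontology, similarity_threshold=0.65, similarity_fn=difflib_similarity)
```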
Suppose the extractor sees a museum visitor talk about photography and Picasso across a session, naturally varying the phrasing turn over turn. With an unbounded ArtInterest.Style concept (cardinality "unlimited", since one user can have several styles), each new value is normally treated as a new style. That's the right semantics for genuinely different styles, but morphological variants of the same one should be a single reinforced fact.
The snippet below runs both configurations against the same input and prints memory.dump() after each:
from ontocontext import MemoryStore
ontology = {"concepts": {"ArtInterest.Style": {
"label": "Art style or medium",
"examples": ["I love X", "I'm into X"],
"persistence_class": "long_term",
"salience_weight": 0.7,
"cardinality": "unlimited",
"eviction": "salience",
}}}
# Values the extractor emits over five conversational turns:
turns = [
("photography", "I really love photography, especially mid-century stuff."),
("Picasso", "Picasso is incredible."),
("photos", "Honestly, I just adore old photos."),
("Pablo Picasso", "Anything by Pablo Picasso, really."),
("photographs", "These vintage photographs are amazing."),
]
def run(store):
    for value, evidence in turns:
        store.upsert("ArtInterest.Style", value, "asserted", evidence)
        store.tick()
    return store.dump()

print(run(MemoryStore(ontology)))                             # default
print(run(MemoryStore(ontology, similarity_threshold=0.80)))  # fuzzy

Default (similarity_threshold=1.0): five separate facts, each still just under its starting salience of 0.7 because none reinforced the others:
(turn 5) 5 fact(s):
• [ArtInterest.Style] photographs (sal=0.69, long_term) [from: "These vintage photographs are amazing."]
• [ArtInterest.Style] Pablo Picasso (sal=0.69, long_term) [from: "Anything by Pablo Picasso, really."]
• [ArtInterest.Style] photos (sal=0.68, long_term) [from: "Honestly, I just adore old photos."]
• [ArtInterest.Style] Picasso (sal=0.67, long_term) [from: "Picasso is incredible."]
• [ArtInterest.Style] photography (sal=0.67, long_term) [from: "I really love photography, especially mid-century stuff."]
Fuzzy (similarity_threshold=0.80) — two reinforced canonical facts. "photography" absorbed "photos" and "photographs" (now near-ceiling salience); "Picasso" absorbed "Pablo Picasso":
(turn 5) 2 fact(s):
• [ArtInterest.Style] photography (sal=0.99, long_term) [from: "These vintage photographs are amazing."]
• [ArtInterest.Style] Picasso (sal=0.88, long_term) [from: "Anything by Pablo Picasso, really."]
Three things worth noticing:
- The original value survives. "photography" was asserted first, so it stays, even though the most recent mention used "photographs". This keeps the context block stable across turns.
- The latest evidence wins. The audit string updates on every reinforcement so you can trace back to the most recent supporting utterance.
- Salience accumulates correctly. "photography" was mentioned three times in five turns (asserted once, reinforced twice) and is now near ceiling, exactly the signal a downstream agent needs.
The first canonical form wins — when "photos" reinforces an existing "photography", salience and evidence update but the stored value stays "photography". Negation also goes through the fuzzy match: upsert(..., "photos", polarity="negated") removes the stored "photography" fact if it's within threshold.
Embedding-based merging without changing the library. Swap similarity_fn for any callable that returns a [0, 1] similarity score — sentence-transformers, OpenAI embeddings, Qwen3 cosine, your own model. The library never sees the embedding stack:
import numpy as np
from sentence_transformers import SentenceTransformer
from ontocontext import MemoryStore

model = SentenceTransformer("all-MiniLM-L6-v2")
cache: dict[str, np.ndarray] = {}

def embed_cosine(a: str, b: str) -> float:
    # Encode each string once; with normalized vectors the dot product is the cosine.
    for s in (a, b):
        if s not in cache:
            cache[s] = model.encode(s, normalize_embeddings=True)
    # Cosine can dip below 0 for unrelated strings; clamp to stay in [0, 1].
    return max(0.0, float(np.dot(cache[a], cache[b])))

memory = MemoryStore(ontology, similarity_threshold=0.70, similarity_fn=embed_cosine)

This is the documented extension point for true synonym handling (car/automobile, cross-lingual paraphrases). The bundled rapidfuzz default catches character-overlap cases (morphology, typos, name partials) at zero install cost.
- Not a vector store. Retrieval is "everything currently above threshold," not query-conditioned. If a user asks "where can I get one?" three turns after mentioning coffee, the coffee fact must still be alive on its own salience — there is no semantic recall of pruned facts. Pair with a vector store if you need that.
- Not persistent by default. State lives in process memory; serialize yourself.
- Not multi-user. One MemoryStore instance per user/conversation. User isolation is your orchestrator's job.
- Not concurrency-safe. No locks; wrap with your own if you have async writers.
- Closed-world by design. The extractor will silently drop anything outside the ontology. This is the feature, not a bug — but it means evolving the ontology is a deliberate, reviewed act.
A complete reference integration lives in examples/museum_chatbot/ — a CLI chatbot that greets visitors at a modern art museum and selectively remembers what matters (special needs, art interests, time budgets, immediate needs).
# from the repo root
cp .secrets.example .secrets # paste your OPENROUTER_API_KEY
uv sync
make run
you> Hi! I'm visiting today, and I should mention I'm deaf.
you> /memory
you> I really love photography, especially mid-century stuff.
you> I only have about an hour.
you> I could really use a coffee.
you> /memory
you> Where can I have one?
you> /memory # coffee should already be decaying
you> [...a few more turns...]
you> /memory # coffee is gone, deafness and photography remain
Read examples/museum_chatbot/README.md for the full walkthrough. Nothing in examples/ is imported by the library; you do not need any of it to use ontocontext.
Issues and PRs are very welcome — especially:
- New example integrations (a different domain, a different LLM provider, a multi-agent orchestrator).
- Ontology evolution tooling: scripts that mine production logs for unanticipated concepts and propose ontology extensions.
- Persistence adapters (SQLite, Redis, Postgres) for the permanent and long_term tiers across sessions.
- A second extraction pass on bot replies to capture clarifications and confirmations.
- Better decay calibration for non-per-turn applications (token-based, time-based).
Open an issue to discuss design changes before sending a PR, or just open a PR for bug fixes, doc improvements, and obvious wins. If you build something on top of ontocontext, drop a link — I'd love to see it.
- Per-user persistence layer (SQLite) for the permanent and long_term tiers.
- Optional vector index over fact value strings for query-conditioned retrieval (so "where can I have one?" lights up the still-warm coffee fact even if it weren't already in the always-on block).
- A /forget API and a "what do you remember about me?" affordance, which matters for trust given that we're storing things like disability status.
- A second extraction pass on bot replies.
MIT — see LICENSE.