Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
120 changes: 120 additions & 0 deletions docs/design/mnemon-takeaways.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
# Takeaways from mnemon-dev/mnemon

**Reviewed:** 2026-05-21 — [mnemon-dev/mnemon](https://github.com/mnemon-dev/mnemon) at HEAD
**Author:** justrach (review session notes, not a roadmap commitment)

## What mnemon is

Persistent cross-session memory for LLM agents. Single Go binary + SQLite WAL. Four-graph knowledge store (temporal / entity / causal / semantic), intent-aware recall, importance-with-decay, automatic deduplication. Integrates with Claude Code, Codex, OpenClaw, Nanobot, NanoClaw via a markdown-installable harness.

272 stars, Go, MIT, actively maintained.

**The category is different from codedb.** Mnemon is *agent memory* (insights, decisions, context across sessions). Codedb is *code search* (sub-ms index over a single project's source tree). They're stack-complementary, like ACE in the previous spec at `docs/design/ace-integration.md`.

## The design idea worth stealing

### 1. LLM-Supervised vs LLM-Embedded — same pattern codedb already uses

> "Most memory tools embed their own LLM inside the pipeline. Mnemon takes a different approach: **your host LLM is the supervisor.** The binary handles deterministic computation (storage, graph indexing, search, decay); the LLM makes judgment calls (what to remember, how to link, when to forget). No middleman, no extra inference cost."

This is exactly the shape codedb has: codedb does the deterministic index work (trigram / word / outline / deps), and the agent makes judgment calls about what to query for. The shape is validated — mnemon explicitly contrasts it with Mem0/Letta (LLM-embedded) and Claude Code Memory (file injection).

**Takeaway:** keep the LLM-Supervised pattern as codedb's identifying architecture. Resist the temptation to bake an LLM into codedb (e.g., for the reader.md regeneration loop — leave that to the host agent).

### 2. Intent-native protocol — `remember / link / recall`

Mnemon's three primary verbs are:
- `remember` (write)
- `link` (graph edge)
- `recall` (read)

…not `INSERT`, `UPSERT`, `SELECT`. The argument is that command names should map to the LLM's cognitive vocabulary so the agent can use them without translation.

Codedb has a mix today: `codedb_search` and `codedb_outline` are operation-named; `codedb_callers` and `codedb_context` lean cognition-named.

**Takeaway:** when adding new MCP tools, prefer cognition-named verbs over operation-named ones. E.g. a future "who-calls-this-API-from-outside-this-package" tool should be `codedb_external_callers` (intent) not `codedb_xref_filter` (operation).

### 3. Effective Importance (EI) decay formula

```
EI = base_weight(importance) × access_factor × decay_factor × edge_factor

base_weight: imp 5 → 1.0, … 1 → 0.15
access_factor: max(1.0, log(1 + access_count))
decay_factor: 0.5 ^ (days_since_access / 30) — half-life of 30 days
edge_factor: 1.0 + 0.1 × min(edge_count, 5) — up to +0.5
```

Auto-pruning fires at >1000 active insights; immunity for importance ≥4 or access_count ≥3.

**Takeaway for codedb's reader.md staleness model:** today reader.md is binary `ready | stale | missing | malformed`. A graceful-decay analog would be:

```
freshness = 1.0 × decay(time_since_generation) × structural_match(source_hash_partial)

structural_match: 1.0 if hash exact match, 0.9 if same files but small edits,
0.5 if same files with significant edits, 0.0 if files renamed/removed
```

This would let reader.md remain "useful but aging" for a while instead of cliff-edging into stale on the first whitespace change. **Not a current priority** — the binary hash is simpler and conservative — but worth keeping in the design folder.

### 4. Hybrid extraction (regex + tech dictionary + LLM-assisted)

For entity extraction (binding insights to common terms like `Qdrant`, `Kubernetes`, `React`), mnemon uses:

1. Regex patterns (CamelCase, ALL_CAPS, file paths, URLs)
2. A 200+ entry technical dictionary
3. User-provided `--entities` flag
4. LLM-assisted causal-edge candidacy

Codedb's `extractContextCandidates` (in `handleContext`) already does (1) via CamelCase / snake_case / quoted-string heuristics. It could borrow (2) — a small technical-term dictionary would catch keywords like `WSGI`, `JIT`, `IPC`, `TLS` that the current pattern misses.

**Takeaway:** consider augmenting the keyword extractor with a small (~100-entry) curated tech dictionary. Cheap, deterministic, no LLM call. File as a P3 enhancement.

### 5. Lifecycle hooks — Prime / Remind / Nudge / Compact

Mnemon installs hooks at four phases:

| phase | trigger | mnemon action |
|---|---|---|
| Prime | session start | make skill, guideline, active store visible |
| Remind | user prompt arrives | decide whether recall could change this task |
| Nudge | mid-conversation | suggest writing important moments |
| Compact | before context compression | persist what would be lost |

**Takeaway for codedb:** ship a `codedb hooks install` mode that writes `.claude/hooks.json` entries for:
- `SessionStart`: print `codedb status` + reader.md staleness summary
- `Stop`: if reader.md was marked stale during the session, prompt the agent to regenerate before context-compact

Closes critical-review I06 (`codedb_status` doesn't surface reader.md state). Small, concrete, follow-up.

### 6. Skill + Guideline split

Mnemon ships **two** markdown files for agent integration:
- `SKILL.md` — the commands (what)
- `GUIDELINE.md` — the judgment (when, why)

The split is intentional: SKILL teaches syntax, GUIDELINE teaches taste. Pasting both into an agent's prompt is the markdown-installable harness.

Codedb has `docs/skills.md` (similar to SKILL.md) but no separate GUIDELINE. A short `docs/guideline.md` could codify things like:
- When to use `codedb_context` vs `codedb_search`
- When the reader.md prepend is helping vs noise (and how to tell)
- How to interpret "stale" hints
- When to write a new `.codedb/reader.md` vs let the existing one stay

**Takeaway:** add `docs/guideline.md` as a v0.2.5818 follow-up. ~150 lines max.

## What NOT to steal

- **Knowledge graph storage** (the four-graph model). Code is structural — it already has graphs (`codedb_callers`, `codedb_deps`). Adding a temporal/causal/semantic memory graph on top is the wrong shape for a code-search tool.
- **Auto-pruning + soft-delete**. Codedb's snapshot is a snapshot of the current source tree; "pruning" old code paths doesn't make sense.
- **The `remember / link / recall` API verbatim**. Codedb doesn't write user-authored facts; the agent doesn't author code memories via codedb. Skip.

## Concrete v0.2.5818 candidates (ranked by ROI)

1. **Lifecycle hooks installer** — `codedb hooks install` writes `.claude/hooks.json` with SessionStart + Stop checks. Closes I06. ~50 LOC + a tiny JSON template. **High value, low risk.**
2. **`docs/guideline.md`** — separate from skills.md, teaches when/why. **Pure docs.**
3. **Tech-dictionary keyword extraction** — augment `extractContextCandidates` with a 100-entry dict for terms regex misses. **~30 LOC.**
4. **Decay-style reader.md staleness** — design only for now; the binary hash protocol is fine for v0. **Design doc, no code.**

None of these are urgent. Tracking them here so the option stays open.
90 changes: 90 additions & 0 deletions docs/design/perf-context-token-cut-eval.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
# perf(context) token-cut — eval against v0.2.5817

**Date:** 2026-05-21 (after token-cut commit `1276bd4` + mnemon doc `858a8d5`)
**Branch:** `perf/codedb-context-token-cut`
**Question:** Does the deterministic 49% byte reduction on T1-shape `codedb_context` output translate to fewer agent calls in end-to-end use?

## Deterministic byte count

Same task, same corpus, both binaries:

```
$ codedb_context "find before_request decorator" /Users/.../flask
```

| | bytes | approx tokens |
|---|---:|---:|
| v0.2.5817 release | 2993 | ~750 |
| this branch (token-opt) | **1525** | **~380** |
| Δ | **−1468 B** | **−49%** |

Where the bytes came from: the entire "## Top sites (with ±2 lines of context)" section + 2 entries from "## Most-relevant files." Verified byte-level — the change is deterministic.

## Agent eval (n=3 per task, Sonnet 4.6)

### T1 flask "find before_request decorator" — *gate fires* (3 sym_refs)

| sample | token-opt | main_baseline (earlier eval) |
|---|---:|---:|
| A | **5** | 4 |
| B | **6** | 5 |
| C | **5** | 5 |
| **mean** | **5.33** | 4.67 |
| **median** | **5** | 5 |
| **best** | **5** | 4 |
| **worst** | **6** | 5 |
| **spread (max−min)** | **1** | 1 |

**Reading:** mean is 0.66 calls worse than main, but distribution is tighter (5/6/5 vs 4/5/5 — same spread, both bounded). Median ties. The 49% byte reduction did NOT cause the agent to need more calls — every sample landed at 5±1. This is **at-parity-or-noise**, with a real byte saving.

The earlier "post-callers" eval on `experiment/reader-md` had 4/4/7 (mean 5.0, one wild 7) — the token-opt has tighter variance, which is a positive sign.

### T2 regex "where is a pattern compiled" — *gate does NOT fire* (6+ sym_refs from NFA/DFA matches)

| sample | token-opt |
|---|---:|
| A | 19 |
| C | 16 |
| mean (n=2) | 17.5 |

Gate doesn't fire (verified by inspecting codedb_context output for T2: sym_refs.items.len = 6, > 3 threshold). Output is byte-identical to v0.2.5817 here, so any variance is pure agent noise. Comparable to v0.2.5817 baseline.

### T3 react "passive effects flush" — *gate does NOT fire* (many useEffect/useLayoutEffect matches)

| sample | token-opt |
|---|---:|
| A | 7 |
| B | 15 |
| C | 16 |
| mean | 12.67 |
| median | 15 |

Same situation as T2 — gate doesn't fire, output identical to v0.2.5817. The wide spread (7 to 16) is the same agent-variance pattern we've seen on T3 across all branches.

## Conclusion

The **−49% byte saving on T1-shape tasks is real and deterministic** (same input → same shorter output). The end-to-end agent eval shows:

- **T1 (where the gate fires)**: at-parity-or-noise with main. Median ties (5=5), mean 0.66 worse but with tighter spread. The cut byte content was redundant on this task shape — the agent didn't need it.
- **T2/T3 (gate doesn't fire)**: byte-identical to v0.2.5817 → only sampling noise differentiates. Numbers vary as before.

The token cut is a free win on narrow-symbol tasks. For agents on small-context models (Haiku, Sonnet on tight context), this matters more than the n=3 agent-call eval can show — the saved tokens stay in the agent's context window for the rest of the session.

## Correctness

9/9 runs across the matrix returned correct answers (decorator name, file, execution site, function — all matched across every sample). No quality regression.

## Threats to validity

- n=3 is still small. Confidence interval on T1 mean is ±~1 call.
- Sonnet 4.6 only; no Haiku or Opus comparison.
- The T2/T3 numbers are essentially measuring agent variance, not the branch — they're "doesn't get worse" sanity checks, not headlines.
- The 49% byte figure was measured on a single T1 task; other T1-shape tasks in real workloads may see different ratios depending on how many Top sites snippets the composer would have emitted.

## Recommendation

Ship the token cut. It's a deterministic, opt-out-free improvement that:
- Cuts 49% of output bytes on the most common narrow-lookup task shape
- Causes no measurable harm at n=3 on the same task
- Cannot affect wider tasks (gate is symbol-count-conditional)
- Pairs with the mnemon-takeaways doc to round out PR #491
17 changes: 15 additions & 2 deletions src/mcp.zig
Original file line number Diff line number Diff line change
Expand Up @@ -1952,14 +1952,27 @@ fn handleContext(io: std.Io, alloc: std.mem.Allocator, args: *const std.json.Obj
out.appendSlice(alloc, "\n(no content matches — try codedb_search or codedb_word for narrower queries)\n") catch {};
return;
}
// Token-efficiency gate: when symbol_definitions already inlined ≥3 bodies
// (the inline_bodies branch in the Symbol definitions section), the agent
// has the function bodies in-band. "Top sites" snippets then duplicate
// information at high token cost. Trim Most-relevant to 3 entries and
// skip the snippet body entirely in that case.
const have_inline_bodies = sym_refs.items.len > 0 and sym_refs.items.len <= 3;
const display_top_n = if (have_inline_bodies) @min(top_n, @as(usize, 3)) else top_n;
w.print("\n## Most-relevant files\n", .{}) catch {};
for (ranked.items[0..top_n]) |f| {
for (ranked.items[0..display_top_n]) |f| {
w.print("- {s} ({d} matches)\n", .{ f.path, f.hits }) catch {};
}
if (have_inline_bodies) {
// Symbol definitions + Callers already give the agent enough — skip
// the snippet body to save ~500-1000 tokens per call. This is the
// T1 flask shape (3 symbol defs, all with inline bodies).
return;
}
w.print("\n## Top sites (with ±2 lines of context)\n", .{}) catch {};
explorer.mu.lockShared();
defer explorer.mu.unlockShared();
for (ranked.items[0..top_n]) |f| {
for (ranked.items[0..display_top_n]) |f| {
// Fetch full file content once per file, then slice ±2 lines around
// each hit. Indexed cache hits common files in ~µs; arena owns the
// dupe so we don't leak.
Expand Down
Loading