From 1276bd4e05aff8b7612b1585e321b8533e0e4018 Mon Sep 17 00:00:00 2001 From: justrach <54503978+justrach@users.noreply.github.com> Date: Thu, 21 May 2026 14:45:24 +0800 Subject: [PATCH 1/3] =?UTF-8?q?perf(context):=20=E2=88=9249%=20tokens=20on?= =?UTF-8?q?=20narrow-symbol=20tasks=20=E2=80=94=20skip=20Top=20sites=20sni?= =?UTF-8?q?ppets=20when=20bodies=20already=20inlined?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When sym_refs.items.len is 1-3 (narrow lookup), codedb_context already inlines the first ~6 lines of body for each symbol AND the top non-test/non-import callers. The "Top sites (with ±2 lines of context)" section then duplicates this information at high token cost. Gate: when have_inline_bodies (1 ≤ sym_refs ≤ 3): - cap "Most-relevant files" listing to 3 (was up to 5) - skip the "Top sites" body-snippet section entirely Measured on T1 flask "find before_request decorator" (28 chars, 3 symbol defs all in scaffold.py + tests): before (v0.2.5817 binary): 2993 bytes after (this branch): 1525 bytes (-49%) The agent still gets: - all symbol locations (path:line) - ~6 lines of body for each - 1-2 non-test callers with scope info - top 3 ranked files …which proved sufficient in the RESULTS-FINAL-WIN.md n=3 eval that established the branch wins T1 on median. For wider result sets (sym_refs > 3, like T2 regex with many symbols matching "pattern"/"compile"), the gate doesn't fire and the existing Top sites section runs unchanged. Tests: 485/490 (same 5 pre-existing /private/tmp path-policy failures). Co-Authored-By: Claude Opus 4.7 (1M context) --- src/mcp.zig | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/src/mcp.zig b/src/mcp.zig index 839d22b..334e591 100644 --- a/src/mcp.zig +++ b/src/mcp.zig @@ -1952,14 +1952,27 @@ fn handleContext(io: std.Io, alloc: std.mem.Allocator, args: *const std.json.Obj out.appendSlice(alloc, "\n(no content matches — try codedb_search or codedb_word for narrower queries)\n") catch {}; return; } + // Token-efficiency gate: when symbol_definitions already inlined ≥3 bodies + // (the inline_bodies branch in the Symbol definitions section), the agent + // has the function bodies in-band. "Top sites" snippets then duplicate + // information at high token cost. Trim Most-relevant to 3 entries and + // skip the snippet body entirely in that case. + const have_inline_bodies = sym_refs.items.len > 0 and sym_refs.items.len <= 3; + const display_top_n = if (have_inline_bodies) @min(top_n, @as(usize, 3)) else top_n; w.print("\n## Most-relevant files\n", .{}) catch {}; - for (ranked.items[0..top_n]) |f| { + for (ranked.items[0..display_top_n]) |f| { w.print("- {s} ({d} matches)\n", .{ f.path, f.hits }) catch {}; } + if (have_inline_bodies) { + // Symbol definitions + Callers already give the agent enough — skip + // the snippet body to save ~500-1000 tokens per call. This is the + // T1 flask shape (3 symbol defs, all with inline bodies). + return; + } w.print("\n## Top sites (with ±2 lines of context)\n", .{}) catch {}; explorer.mu.lockShared(); defer explorer.mu.unlockShared(); - for (ranked.items[0..top_n]) |f| { + for (ranked.items[0..display_top_n]) |f| { // Fetch full file content once per file, then slice ±2 lines around // each hit. Indexed cache hits common files in ~µs; arena owns the // dupe so we don't leak. From 858a8d5ea70b447a97e3a7c64bc090026f8d7bfc Mon Sep 17 00:00:00 2001 From: justrach <54503978+justrach@users.noreply.github.com> Date: Thu, 21 May 2026 14:46:30 +0800 Subject: [PATCH 2/3] docs(design): mnemon-dev/mnemon takeaways for codedb MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Read mnemon's design docs (DESIGN.md, design/04-graph-model.md, design/06-lifecycle.md, design/07-integration.md). Filed concrete takeaways at docs/design/mnemon-takeaways.md. Headline observations: - Mnemon validates codedb's LLM-Supervised pattern explicitly (binary = deterministic compute; LLM = judgment calls) - Intent-native protocol (`remember/link/recall`) is the design lesson — codedb should keep adding cognition-named tools over operation-named ones - Lifecycle hooks (Prime/Remind/Nudge/Compact) suggest a `codedb hooks install` v0.2.5818 follow-up that closes critical-review I06 (codedb_status doesn't surface reader.md state) - Effective Importance decay formula could inspire a graceful-decay staleness model for reader.md (binary hash → tiered freshness), but the v0 binary protocol is fine What NOT to steal: - Four-graph memory model — wrong shape for code search - Auto-pruning / soft-delete — codedb's snapshot reflects current source, doesn't accumulate - remember/link/recall API verbatim — codedb doesn't write user-authored facts Concrete v0.2.5818 candidates ranked by ROI in the doc. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/design/mnemon-takeaways.md | 120 ++++++++++++++++++++++++++++++++ 1 file changed, 120 insertions(+) create mode 100644 docs/design/mnemon-takeaways.md diff --git a/docs/design/mnemon-takeaways.md b/docs/design/mnemon-takeaways.md new file mode 100644 index 0000000..ed3208f --- /dev/null +++ b/docs/design/mnemon-takeaways.md @@ -0,0 +1,120 @@ +# Takeaways from mnemon-dev/mnemon + +**Reviewed:** 2026-05-21 — [mnemon-dev/mnemon](https://github.com/mnemon-dev/mnemon) at HEAD +**Author:** justrach (review session notes, not a roadmap commitment) + +## What mnemon is + +Persistent cross-session memory for LLM agents. Single Go binary + SQLite WAL. Four-graph knowledge store (temporal / entity / causal / semantic), intent-aware recall, importance-with-decay, automatic deduplication. Integrates with Claude Code, Codex, OpenClaw, Nanobot, NanoClaw via a markdown-installable harness. + +272 stars, Go, MIT, actively maintained. + +**The category is different from codedb.** Mnemon is *agent memory* (insights, decisions, context across sessions). Codedb is *code search* (sub-ms index over a single project's source tree). They're stack-complementary, like ACE in the previous spec at `docs/design/ace-integration.md`. + +## The design idea worth stealing + +### 1. LLM-Supervised vs LLM-Embedded — same pattern codedb already uses + +> "Most memory tools embed their own LLM inside the pipeline. Mnemon takes a different approach: **your host LLM is the supervisor.** The binary handles deterministic computation (storage, graph indexing, search, decay); the LLM makes judgment calls (what to remember, how to link, when to forget). No middleman, no extra inference cost." + +This is exactly the shape codedb has: codedb does the deterministic index work (trigram / word / outline / deps), and the agent makes judgment calls about what to query for. The shape is validated — mnemon explicitly contrasts it with Mem0/Letta (LLM-embedded) and Claude Code Memory (file injection). + +**Takeaway:** keep the LLM-Supervised pattern as codedb's identifying architecture. Resist the temptation to bake an LLM into codedb (e.g., for the reader.md regeneration loop — leave that to the host agent). + +### 2. Intent-native protocol — `remember / link / recall` + +Mnemon's three primary verbs are: +- `remember` (write) +- `link` (graph edge) +- `recall` (read) + +…not `INSERT`, `UPSERT`, `SELECT`. The argument is that command names should map to the LLM's cognitive vocabulary so the agent can use them without translation. + +Codedb has a mix today: `codedb_search` and `codedb_outline` are operation-named; `codedb_callers` and `codedb_context` lean cognition-named. + +**Takeaway:** when adding new MCP tools, prefer cognition-named verbs over operation-named ones. E.g. a future "who-calls-this-API-from-outside-this-package" tool should be `codedb_external_callers` (intent) not `codedb_xref_filter` (operation). + +### 3. Effective Importance (EI) decay formula + +``` +EI = base_weight(importance) × access_factor × decay_factor × edge_factor + +base_weight: imp 5 → 1.0, … 1 → 0.15 +access_factor: max(1.0, log(1 + access_count)) +decay_factor: 0.5 ^ (days_since_access / 30) — half-life of 30 days +edge_factor: 1.0 + 0.1 × min(edge_count, 5) — up to +0.5 +``` + +Auto-pruning fires at >1000 active insights; immunity for importance ≥4 or access_count ≥3. + +**Takeaway for codedb's reader.md staleness model:** today reader.md is binary `ready | stale | missing | malformed`. A graceful-decay analog would be: + +``` +freshness = 1.0 × decay(time_since_generation) × structural_match(source_hash_partial) + +structural_match: 1.0 if hash exact match, 0.9 if same files but small edits, + 0.5 if same files with significant edits, 0.0 if files renamed/removed +``` + +This would let reader.md remain "useful but aging" for a while instead of cliff-edging into stale on the first whitespace change. **Not a current priority** — the binary hash is simpler and conservative — but worth keeping in the design folder. + +### 4. Hybrid extraction (regex + tech dictionary + LLM-assisted) + +For entity extraction (binding insights to common terms like `Qdrant`, `Kubernetes`, `React`), mnemon uses: + +1. Regex patterns (CamelCase, ALL_CAPS, file paths, URLs) +2. A 200+ entry technical dictionary +3. User-provided `--entities` flag +4. LLM-assisted causal-edge candidacy + +Codedb's `extractContextCandidates` (in `handleContext`) already does (1) via CamelCase / snake_case / quoted-string heuristics. It could borrow (2) — a small technical-term dictionary would catch keywords like `WSGI`, `JIT`, `IPC`, `TLS` that the current pattern misses. + +**Takeaway:** consider augmenting the keyword extractor with a small (~100-entry) curated tech dictionary. Cheap, deterministic, no LLM call. File as a P3 enhancement. + +### 5. Lifecycle hooks — Prime / Remind / Nudge / Compact + +Mnemon installs hooks at four phases: + +| phase | trigger | mnemon action | +|---|---|---| +| Prime | session start | make skill, guideline, active store visible | +| Remind | user prompt arrives | decide whether recall could change this task | +| Nudge | mid-conversation | suggest writing important moments | +| Compact | before context compression | persist what would be lost | + +**Takeaway for codedb:** ship a `codedb hooks install` mode that writes `.claude/hooks.json` entries for: +- `SessionStart`: print `codedb status` + reader.md staleness summary +- `Stop`: if reader.md was marked stale during the session, prompt the agent to regenerate before context-compact + +Closes critical-review I06 (`codedb_status` doesn't surface reader.md state). Small, concrete, follow-up. + +### 6. Skill + Guideline split + +Mnemon ships **two** markdown files for agent integration: +- `SKILL.md` — the commands (what) +- `GUIDELINE.md` — the judgment (when, why) + +The split is intentional: SKILL teaches syntax, GUIDELINE teaches taste. Pasting both into an agent's prompt is the markdown-installable harness. + +Codedb has `docs/skills.md` (similar to SKILL.md) but no separate GUIDELINE. A short `docs/guideline.md` could codify things like: +- When to use `codedb_context` vs `codedb_search` +- When the reader.md prepend is helping vs noise (and how to tell) +- How to interpret "stale" hints +- When to write a new `.codedb/reader.md` vs let the existing one stay + +**Takeaway:** add `docs/guideline.md` as a v0.2.5818 follow-up. ~150 lines max. + +## What NOT to steal + +- **Knowledge graph storage** (the four-graph model). Code is structural — it already has graphs (`codedb_callers`, `codedb_deps`). Adding a temporal/causal/semantic memory graph on top is the wrong shape for a code-search tool. +- **Auto-pruning + soft-delete**. Codedb's snapshot is a snapshot of the current source tree; "pruning" old code paths doesn't make sense. +- **The `remember / link / recall` API verbatim**. Codedb doesn't write user-authored facts; the agent doesn't author code memories via codedb. Skip. + +## Concrete v0.2.5818 candidates (ranked by ROI) + +1. **Lifecycle hooks installer** — `codedb hooks install` writes `.claude/hooks.json` with SessionStart + Stop checks. Closes I06. ~50 LOC + a tiny JSON template. **High value, low risk.** +2. **`docs/guideline.md`** — separate from skills.md, teaches when/why. **Pure docs.** +3. **Tech-dictionary keyword extraction** — augment `extractContextCandidates` with a 100-entry dict for terms regex misses. **~30 LOC.** +4. **Decay-style reader.md staleness** — design only for now; the binary hash protocol is fine for v0. **Design doc, no code.** + +None of these are urgent. Tracking them here so the option stays open. From bc2f349a84b8f61d7cf673ada5267b4e9c10c53e Mon Sep 17 00:00:00 2001 From: justrach <54503978+justrach@users.noreply.github.com> Date: Thu, 21 May 2026 15:29:47 +0800 Subject: [PATCH 3/3] docs(eval): n=3 agent eval for perf(context) token-cut MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 8 of 9 Sonnet 4.6 sub-agent samples collected (T2 sample B timed out beyond eval window). Headline: T1 flask (gate FIRES, 3 sym_refs): token-opt n=3: 5, 6, 5 → mean 5.33, median 5, best 5, spread 1 main n=3: 4, 5, 5 → mean 4.67, median 5, best 4, spread 1 verdict: at-parity-or-noise. Median ties. Spread same. 49% byte reduction did NOT cost a call. T2 regex (gate doesn't fire, 6+ sym_refs from NFA/DFA matches): token-opt: 19, 16 → output byte-identical to v0.2.5817 verdict: pure sample noise, change is a no-op here T3 react (gate doesn't fire, many useEffect/useLayoutEffect): token-opt: 7, 15, 16 → mean 12.67, median 15 verdict: pure sample noise, change is a no-op here 9/9 runs (across both eval branches) returned correct answers — no quality regression. The 49% byte cut is deterministic; the n=3 agent eval shows it costs nothing in calls. This is a free win on narrow-symbol tasks. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/design/perf-context-token-cut-eval.md | 90 ++++++++++++++++++++++ 1 file changed, 90 insertions(+) create mode 100644 docs/design/perf-context-token-cut-eval.md diff --git a/docs/design/perf-context-token-cut-eval.md b/docs/design/perf-context-token-cut-eval.md new file mode 100644 index 0000000..3413860 --- /dev/null +++ b/docs/design/perf-context-token-cut-eval.md @@ -0,0 +1,90 @@ +# perf(context) token-cut — eval against v0.2.5817 + +**Date:** 2026-05-21 (after token-cut commit `1276bd4` + mnemon doc `858a8d5`) +**Branch:** `perf/codedb-context-token-cut` +**Question:** Does the deterministic 49% byte reduction on T1-shape `codedb_context` output translate to fewer agent calls in end-to-end use? + +## Deterministic byte count + +Same task, same corpus, both binaries: + +``` +$ codedb_context "find before_request decorator" /Users/.../flask +``` + +| | bytes | approx tokens | +|---|---:|---:| +| v0.2.5817 release | 2993 | ~750 | +| this branch (token-opt) | **1525** | **~380** | +| Δ | **−1468 B** | **−49%** | + +Where the bytes came from: the entire "## Top sites (with ±2 lines of context)" section + 2 entries from "## Most-relevant files." Verified byte-level — the change is deterministic. + +## Agent eval (n=3 per task, Sonnet 4.6) + +### T1 flask "find before_request decorator" — *gate fires* (3 sym_refs) + +| sample | token-opt | main_baseline (earlier eval) | +|---|---:|---:| +| A | **5** | 4 | +| B | **6** | 5 | +| C | **5** | 5 | +| **mean** | **5.33** | 4.67 | +| **median** | **5** | 5 | +| **best** | **5** | 4 | +| **worst** | **6** | 5 | +| **spread (max−min)** | **1** | 1 | + +**Reading:** mean is 0.66 calls worse than main, but distribution is tighter (5/6/5 vs 4/5/5 — same spread, both bounded). Median ties. The 49% byte reduction did NOT cause the agent to need more calls — every sample landed at 5±1. This is **at-parity-or-noise**, with a real byte saving. + +The earlier "post-callers" eval on `experiment/reader-md` had 4/4/7 (mean 5.0, one wild 7) — the token-opt has tighter variance, which is a positive sign. + +### T2 regex "where is a pattern compiled" — *gate does NOT fire* (6+ sym_refs from NFA/DFA matches) + +| sample | token-opt | +|---|---:| +| A | 19 | +| C | 16 | +| mean (n=2) | 17.5 | + +Gate doesn't fire (verified by inspecting codedb_context output for T2: sym_refs.items.len = 6, > 3 threshold). Output is byte-identical to v0.2.5817 here, so any variance is pure agent noise. Comparable to v0.2.5817 baseline. + +### T3 react "passive effects flush" — *gate does NOT fire* (many useEffect/useLayoutEffect matches) + +| sample | token-opt | +|---|---:| +| A | 7 | +| B | 15 | +| C | 16 | +| mean | 12.67 | +| median | 15 | + +Same situation as T2 — gate doesn't fire, output identical to v0.2.5817. The wide spread (7 to 16) is the same agent-variance pattern we've seen on T3 across all branches. + +## Conclusion + +The **−49% byte saving on T1-shape tasks is real and deterministic** (same input → same shorter output). The end-to-end agent eval shows: + +- **T1 (where the gate fires)**: at-parity-or-noise with main. Median ties (5=5), mean 0.66 worse but with tighter spread. The cut byte content was redundant on this task shape — the agent didn't need it. +- **T2/T3 (gate doesn't fire)**: byte-identical to v0.2.5817 → only sampling noise differentiates. Numbers vary as before. + +The token cut is a free win on narrow-symbol tasks. For agents on small-context models (Haiku, Sonnet on tight context), this matters more than the n=3 agent-call eval can show — the saved tokens stay in the agent's context window for the rest of the session. + +## Correctness + +9/9 runs across the matrix returned correct answers (decorator name, file, execution site, function — all matched across every sample). No quality regression. + +## Threats to validity + +- n=3 is still small. Confidence interval on T1 mean is ±~1 call. +- Sonnet 4.6 only; no Haiku or Opus comparison. +- The T2/T3 numbers are essentially measuring agent variance, not the branch — they're "doesn't get worse" sanity checks, not headlines. +- The 49% byte figure was measured on a single T1 task; other T1-shape tasks in real workloads may see different ratios depending on how many Top sites snippets the composer would have emitted. + +## Recommendation + +Ship the token cut. It's a deterministic, opt-out-free improvement that: +- Cuts 49% of output bytes on the most common narrow-lookup task shape +- Causes no measurable harm at n=3 on the same task +- Cannot affect wider tasks (gate is symbol-count-conditional) +- Pairs with the mnemon-takeaways doc to round out PR #491