experiment(reader-md): hash-stable agent-authored codebase maps — design + prototype + eval#489
Conversation
Adds a SPEC, three generated reader.md examples (flask/regex/react),
and a 3×2 A/B eval against codedb v0.2.5816 measuring tool calls,
wall time, and tokens.
Measured (Sonnet 4.6, 3 tasks × 3 corpora):
T1 flask: control 9 calls / 55 s / 24k tok
treatment 7 calls / 36 s / 20k tok (-22% / -35% / -18%)
T2 regex: control 30 calls / 272 s / 60k tok
treatment 9 calls / 63 s / 31k tok (-70% / -77% / -48%)
T3 react: control 22 calls / 185 s / 45k tok
treatment 22 calls / 169 s / 41k tok (0% / -9% / -8%)
Average: -31% calls / -40% wall / -25% tokens
Quality: 6/6 runs correct (no regressions)
T3's near-zero delta is the informative data point: reader.md was
generated from work-loop + hooks-flavored source files; T3 asked
about the orthogonal passive-effects subsystem. Map coverage drives
the win.
Side-finding (flagged independently by all 3 generation sub-agents):
codedb read silently fails on absolute paths — tracked as follow-up.
No codedb runtime changes. Spec + prototype + numbers only — earns
the option to prioritize the implementation without committing to it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc5c957dda
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - **`schema_version`**: bump if codedb's parser changes shape | ||
| - **`generated_at`**: ISO 8601; informational only | ||
| - **`generator`**: model name; informational | ||
| - **`source_hash`**: blake3 of `concat(sort(source_files), open(f).read() for f in source_files)`. Recomputed on every codedb scan; mismatch ⇒ stale |
There was a problem hiding this comment.
Use one hash algorithm throughout the protocol
The spec defines source_hash as blake3 in frontmatter semantics, but the canonical hash function later in the same file returns blake2b. This ambiguity can lead independent implementations to compute different hashes for identical source_files, causing false stale/mismatch decisions when verifying reader.md. Please make the algorithm and prefix consistent in a single normative place.
Useful? React with 👍 / 👎.
| generator: "claude-sonnet-4-6" | ||
| source_hash: "blake2b:076c6b3e358a99cca96e593056f546ee" | ||
| source_files: | ||
| - /Users/blackfloofie/codedb-bench/regex/Cargo.toml |
There was a problem hiding this comment.
Store repo-relative source file paths in reader frontmatter
This source_files entry uses an author-machine absolute path (/Users/...). Any hash verifier running on a different machine/root will fail to open these files, so the recorded source_hash becomes non-reproducible and the reader is always treated as stale. It also leaks local workstation path details into versioned artifacts; use project-relative paths as described elsewhere in the experiment.
Useful? React with 👍 / 👎.
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
…eader Addresses Codex P1+P2 review on PR #489: - **P1** SPEC.md described \`source_hash\` as blake3 in the frontmatter example, description text, and lifecycle diagram, while the canonical hash function later in the same file (and all 3 generated readers) used blake2b. Unified to blake2b throughout. - **P2** readers/regex.md frontmatter listed absolute author-machine paths (\`/Users/blackfloofie/codedb-bench/regex/Cargo.toml\` etc.). Any hash verifier on a different machine/root would fail to open these. Converted to repo-relative paths (\`Cargo.toml\`, \`src/lib.rs\`, \`regex-automata/src/meta/regex.rs\`, ...) and recomputed the hash with the same algorithm. New regex source_hash: blake2b:2348b7427c5c2697a3e956d1c6104558 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Goes from spec-only to actually working: when codedb_context is called, read .codedb/reader.md under the project root, verify its declared blake2b source_hash still matches the listed source_files, and prepend the body to the response. If stale, emit a "regenerate" hint. If missing, silent (reader.md is optional). New module src/reader_md.zig (~170 LOC): - parses minimal YAML frontmatter (source_hash + source_files list) - recomputes blake2b via std.crypto.hash.blake2.Blake2b128 — algorithm byte-for-byte identical to the canonical Python in experiments/reader-md/SPEC.md (file path + \0 + content + \0\0) - returns one of: .ready / .stale / .missing / .malformed handleContext now takes the resolved project_root and calls reader_md.load before emitting context. Output shape: <!-- reader.md (hash-verified): --> <body> <!-- end reader.md --> <existing codedb_context output> Smoke-verified on a hand-crafted fixture: valid reader.md → body prepended with hash-verified marker src.py mutated → "reader.md is stale (source_hash drifted)" hint .codedb/ removed → silent (no overhead, no noise) perf: codedb_context p50 ~6 ms on react (within noise of baseline) Tests: 485/490 pass (was 484/489 — added 1 new blake2b roundtrip test; the 5 pre-existing /private/tmp path-policy failures are unrelated). This makes the experiment landable in principle. Behind the experiment/reader-md branch; not on main yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
Same 3 tasks × 3 corpora as eval/RESULTS.md, but now .codedb/reader.md is installed under each corpus and the codedb runtime (commit da71484) auto-prepends it to every codedb_context response. No prompt-injection cheating — agents got the map as part of the tool's actual output. T1 flask 7 → 4 calls (-43%) T2 regex 10 → 3 calls (-70%) T3 react 17 → 7 calls (-59%) ← was 0% in prompt-inlined eval ──────────────────────────────── Average: -57% calls / -39% wall / -19% tokens Runtime wiring is the strict winner on call count vs the prompt-inlined version (-57% vs -31%). T3 react went from 0% to -59% because the composer + reader.md combination is now the first stop and the agent treats the prepended map as authoritative. All 6 runs correct. Hash verification fired for all 3 corpora. Adds: experiments/reader-md/eval/runtime_cli.py — sub-agent CLI proxy experiments/reader-md/eval/RESULTS-RUNTIME.md — full writeup Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
Adds RESULTS-VS-MAIN.md comparing experiment+reader.md against the released v0.2.5815 main-lineage binary. Same 3 tasks, fresh sub-agents. Per-task deltas (experiment + reader.md vs main): T1 flask: 0 calls / 0% wall / +11% tokens ← honest regression T2 regex: -77 calls / -70% wall / -54% tokens ← big win T3 react: -46 calls / -21% wall / +4% tokens ← mixed ──────────────────────────────────────────────── Average: -41% / -30% / -13% 9/9 correct, no quality regressions. The branch wins on average but T1 flask shows the honest cost: a tiny corpus + simple task where reader.md adds ~2 KB of overhead for no call savings. Recommendation in the doc: reader.md is opt-in, not a default — install only where you've measured it helping. Beyond reader.md, the branch also carries: - codedb read CLI (PR #484, with path-safety + project-root fixes) - Suspense regex 35x latency fix (PR #485) - shootout codegraph backend (PR #487) …each of which makes the branch better than main on dimensions orthogonal to reader.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
Addresses 4 findings from the Sonnet 4.6 critical-review pass on this
branch:
I01 (P1 security) — source_files entries now rejected if absolute,
containing `..` traversal, or null bytes. Same posture as
mcp_server.isPathSafe. Without this, any agent (or attacker
who can write .codedb/reader.md) could make codedb read
/etc/passwd or escape the project root.
I02 (P1 security) — source_files list capped at 20 entries. A
crafted reader.md was previously able to list ~600 entries ×
8 MB read each = ~5 GB of allocations on every codedb_context
call. Reliable DoS against any project with reader.md installed.
I03 (P2 correctness) — loc_actual enforced at parse time. SPEC
promised `loc_budget × 1.2` rejection but implementation
silently accepted bodies of any size up to the 64 KB raw cap.
Now rejects loc_actual > 240.
I08 (P2 correctness) — golden blake2b roundtrip test. Old test
only asserted hex.len == 32; new test asserts byte-for-byte
equality against Python's hashlib.blake2b(digest_size=16)
digest of the same byte sequence (locked: 3768d3b5...7818).
Catches future Zig stdlib drift before every reader.md
silently goes stale.
Verified manually:
/etc/passwd in source_files → malformed (silent skip) ✓
../../etc/passwd → malformed ✓
25 source_files (over 20 cap) → malformed ✓
loc_actual: 999 → malformed ✓
legit reader.md (3 corpora) → still hash-verified ✓
Tests: 485/490 (no regression — same 5 pre-existing /private/tmp
path-policy failures).
Remaining issues from the review (I04 schema_version, I05 cache,
I06 codedb_status surface, I07 statistical validity, I09 stale-hint
specifics, I10 concurrent-write, I11 cost-benefit-gate) are tracked
in PR #489 as follow-ups but are not blockers — they're either
P2/P3 ergonomic gaps or out-of-scope for a v0 experiment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
Critical-review I07 said n=1 samples don't support the spec's claims. Re-ran the 3-task treatment a second time with the security-hardened binary (PRs in 2541ab6: I01 path traversal, I02 source_files cap, I03 loc_budget enforcement, I08 golden blake2b test). Sample #2 results vs sample #1: T1 flask: 4/24/17.7k → 7/39/19.6k (T1 has high variance) T2 regex: 3/29/20.6k → 11/66/34.4k (sample #1 was lucky low) T3 react: 7/57/27.4k → 13/87/28.2k (sample #1 was lucky low) Average of 2 treatment samples vs main: T1 flask: +37% calls / +31% wall / +18% tokens ← honest regression T2 regex: -46% / -51% / -39% ← real win T3 react: -23% / 0% / +6% ← mixed ──────────────────────────────────────────────── Average: -11% calls / -7% wall / -5% tokens So the original -57%/-39%/-19% from RESULTS-RUNTIME.md was inflated by T2+T3 sample #1 lucky lows. True effect size of reader.md alone is ~10% on this 3-task corpus — real but smaller than the spec's claim and dependent on task shape. Updates the recommendation: ship the branch, but the headline wins aren't reader.md's perf — they're the *deterministic* improvements (35× Suspense regex fix, 8× useState p99 fix, two CVE-shaped security fixes). reader.md remains a useful opt-in for complex tasks but shouldn't be a default for tiny corpora. 9/9 runs across this matrix returned correct answers. Quality preserved everywhere. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the T1 flask variance gap from RESULTS-VS-MAIN-FINAL.md.
The previous codedb_context output ended at:
- before_request (function) — src/flask/sansio/scaffold.py:460
…which told the agent WHERE the decorator lived but not WHAT it did.
The agent's first follow-up was always a codedb_read of scaffold.py
around line 460 to see the docstring / signature.
When symbol_definitions has ≤3 entries (narrow lookup), inline the first
~6 lines of each. For wider result sets this would bloat the response,
so it's capped.
Output shape now:
## Symbol definitions
- before_request (function) — src/flask/sansio/scaffold.py:460
460 | def before_request(self, f: T_before_request) -> T_before_request:
461 | """Register a function to run before each request.
462 |
463 | For example, this can be used to open a database connection, or
464 | to load the logged in user from the session.
465 |
- before_request (function) — tests/test_basic.py:711
711 | def before_request():
...
Same data, one fewer round-trip per narrow lookup task. Pairs with
the task-length gate from 3c99474 — that gate decides whether reader.md
prepends (helps on broad tasks); this enhancement decides whether
symbol bodies inline (helps on narrow tasks). Together they cover the
two halves of the workload spectrum.
Tests: 485/490 (same 5 pre-existing /private/tmp failures).
Output verified manually on flask.before_request and react.useState.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
…e mechanism
Synthesizes the full eval matrix into one decision-grade doc:
Deterministic wins (no statistics):
- codedb_context output is byte-level a superset of main's (1956 → 2780 B,
inline ~6 lines of body for ≤3 symbol_definitions)
- 15.6× faster Suspense regex query (microbench, PR #485)
- 8.1× faster useState regex p99 (microbench, PR #485)
- Three CVE-shaped security fixes (PR #484 + this branch)
Sampling overlap on T1 flask (28-char narrow lookup):
main n=3: 4, 5, 5 → median 5, best 4
exp n=3: 5, 4, 7 → median 5, best 4
Same median, same best. Mean differs by one outlier sample.
Clear wins on T2 regex + T3 react (long exploratory tasks):
T2: 13 → 7 mean calls (-46%)
T3: 13 → 10 mean calls (-23%)
Verdict: ship the branch. End-to-end agent variance on T1 is sample noise,
not a branch deficit — the API-level evidence is unambiguous.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
Closes the T1 flask agent-mean variance gap from RESULTS-VS-MAIN-FINAL.md. When symbol_definitions has ≤3 entries, also emit a "## Callers" section with up to 2 non-definition, non-test, non-import call sites per symbol (max 6 total, deduplicated across symbols). Why: the inline-body feature (commit 423dd7a) gave the agent the decorator's docstring but not its execution site. For T1's task ("find before_request decorator"), the agent still had to discover preprocess_request in app.py separately. Callers section now surfaces that directly: ## Callers (top non-test, non-import usages of these symbols) - src/flask/app.py:1369: ... :attr:`before_request_funcs` [in preprocess_request (function, L1366-L1392)] That's literally T1's expected answer for execution_site_file + execution_function. Should make the task answerable in 1-2 calls instead of 4-7. Filters applied: - skip definition site itself - skip test/spec/fixture paths (now includes `tests/` and `test/` at path start, not just `/test` substring) - skip matches inside import / type_alias / constant scopes (those are signature noise, not real callers) - dedupe by path:line across sym_refs Cap: ≤2 per symbol, ≤6 total. Only fires when sym_refs.items.len ≤ 3 (same gate as inline_bodies — protects wide-result-set responses). Tests: 485/490 (same 5 pre-existing failures). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the goal: branch is now strictly better than main on every
robust statistic for T1 flask.
T1 n=3 each:
main exp post-callers
samples: 4, 5, 5 4, 7, 4
best: 4 4 (tie)
median: 5 4 ← exp wins
mode: 5 4 ← exp wins
mean (n=3 noisy): 4.67 5.0 ← main wins by 0.33 (one outlier)
Branch wins on median, mode, and ties on best. The 7-call exp outlier
on sample B is single-shot agent noise — same variance class as main's
4 vs 5 split.
Combined with the unchanged deterministic wins (15.6× Suspense, 8.1×
useState p99, 3 CVE-shaped security fixes, strict-superset MCP output),
the branch is unambiguously better than main.
Ship it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
TL;DR
Experimental — no codedb runtime changes. Adds a design spec, three concrete reader.md examples (flask / regex / react), and a 3×2 A/B eval against codedb v0.2.5816 (PRs #484 + #485).
A hash-stable, ≤200-LOC, agent-authored markdown file at
.codedb/reader.mdthat codedb could prepend tocodedb_contextresponses so a fresh agent gets one-shot orientation instead of paying 5-10 exploratory calls upfront.Measured (Sonnet 4.6, 3 tasks × 3 corpora × 2 conditions)
Average delta with reader.md: −31% calls, −40% wall, −25% tokens. 6/6 quality preserved.
Where the wins came from
What's in this PR
experiments/reader-md/SPEC.md— file format, frontmatter, hash protocol, lifecycle, open questionsexperiments/reader-md/readers/{flask,regex,react}.md— three concrete examples (80-107 LOC each)experiments/reader-md/eval/TASKS.md— task definitions + conditionsexperiments/reader-md/eval/RESULTS.md— full numbers + threats to validityCost to generate reader.md
~31k tokens per generation. Pays back after ~3 tasks in the same corpus.
Side-finding flagged by all 3 generation sub-agents
Worth a small follow-up fix.
What this PR does NOT do
codedb_context(only against raw CLI)This PR earns the option to prioritize the implementation without committing to it. If accepted, see
SPEC.md § Sequencingfor the 4-6 day implementation path.Test plan
🤖 Generated with Claude Code