bench(shootout): add codegraph backend to shootout.py #487
Wires the codegraph 0.7.10 backend into the single-session + multi-session launcher alongside codedb / fts5_tri / fts5_uni / lean-ctx. Uses `codegraph serve --mcp` as a long-lived stdio child and invokes `codegraph_search` as the default symbol-lookup tool, apples-to-apples with codedb_search.

New CLI flags:

    --codegraph-bin <path>   default: $(which codegraph)
    --skip-codegraph         skip the backend entirely
    --clean-codegraph        wipe matching .codegraph/ before indexing

Cold-index helper `codegraph_cold_index` invokes `codegraph init` then `codegraph index` and measures wall-clock + .codegraph/ on-disk size.

Smoke-tested codegraph-only on flask:

    cold build:    0.57 s, ~3.7 MB
    warm queries:  0.2–2 ms p50

(matches the bench numbers from the v0.2.5815 cross-corpus run committed in PR #483)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benchmark Regression Report
Thresholds: 10.00% and 50,000 ns absolute delta
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e05a3b4210
      HERE = Path(__file__).resolve().parent
      REPO_ROOT = HERE.parent.parent
    - DEFAULT_CODEDB = REPO_ROOT / "zig-out/bin/codedb"
    + DEFAULT_CODEDB = shutil.which("codedb") or str(REPO_ROOT / "zig-out/bin/codedb")
Keep codedb default pinned to repo build
Do not prefer PATH for the default codedb binary here: this makes shootout silently benchmark whichever codedb is installed globally instead of the repo’s zig-out/bin/codedb. On machines with an older/newer global install, codedb latency/build numbers can shift significantly and no longer reflect the commit under test, which undermines benchmark regression tracking.
    if cg_dir.exists():
        shutil.rmtree(cg_dir)
Honor --clean-codegraph before deleting index data
The .codegraph directory is deleted unconditionally, so --clean-codegraph is effectively ignored and every run is forced into a cold rebuild. This changes default benchmark behavior and can materially inflate codegraph build timings even when the caller did not opt into a clean run.
Addresses Codex P1+P2 review on PR #487:

- **P1** Pin DEFAULT_CODEDB to repo build (`REPO_ROOT/zig-out/bin/codedb`). Pre-fix used `shutil.which("codedb") or REPO_ROOT/...`, which made the shootout silently benchmark whichever `codedb` was installed in PATH (e.g. an older homebrew bottle) instead of the freshly-built repo binary the user expected.
- **P2** Honor --clean-codegraph. Pre-fix `codegraph_cold_index` wiped `.codegraph/` unconditionally, so the flag was a no-op and every run was forced cold. Now wipes only when `clean=True`, passed through from `args.clean_codegraph`.

Verified:

    --skip-codegraph   → no codegraph activity (unchanged)
    --clean-codegraph  → wipes + cold rebuild (now works)
    (no clean flag)    → reuses existing .codegraph/ for incremental

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
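For readers following the P2 fix, here is a minimal sketch of a cold-index helper that wipes `.codegraph/` only on request. It is not the code from this PR. The signature, the `run` parameter (injected so the helper can be exercised without a real binary), and the `dir_size_bytes` helper are hypothetical, and it assumes `codegraph init` and `codegraph index` take no further arguments and operate on the working directory.

```python
import shutil
import subprocess
import time
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def codegraph_cold_index(corpus: Path, codegraph_bin: str, clean: bool = False,
                         run=subprocess.run):
    """Index corpus with codegraph; return (wall_seconds, index_bytes)."""
    cg_dir = corpus / ".codegraph"
    # Wipe only on explicit request, so a default run can reuse the index.
    if clean and cg_dir.exists():
        shutil.rmtree(cg_dir)
    start = time.perf_counter()
    # Assumption: both subcommands act on the current working directory.
    for sub in ("init", "index"):
        run([codegraph_bin, sub], cwd=corpus, check=True)
    return time.perf_counter() - start, dir_size_bytes(cg_dir)
```

With `clean=False` an existing `.codegraph/` is left in place before the two subcommands run, so the reported build time reflects whatever codegraph does with a pre-existing index rather than a forced cold build.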
Adds RESULTS-VS-MAIN.md comparing experiment+reader.md against the released v0.2.5815 main-lineage binary. Same 3 tasks, fresh sub-agents.

Per-task deltas (experiment + reader.md vs main):

    T1 flask:    0 calls /   0% wall / +11% tokens   ← honest regression
    T2 regex:  -77 calls / -70% wall / -54% tokens   ← big win
    T3 react:  -46 calls / -21% wall /  +4% tokens   ← mixed
    ────────────────────────────────────────────────
    Average:   -41%      / -30%      / -13%

9/9 correct, no quality regressions.

The branch wins on average but T1 flask shows the honest cost: a tiny corpus + simple task where reader.md adds ~2 KB of overhead for no call savings. Recommendation in the doc: reader.md is opt-in, not a default; install only where you've measured it helping.

Beyond reader.md, the branch also carries:

- codedb read CLI (PR #484, with path-safety + project-root fixes)
- Suspense regex 35x latency fix (PR #485)
- shootout codegraph backend (PR #487)

…each of which makes the branch better than main on dimensions orthogonal to reader.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… security

Bumps semver to 0.2.5817. Bundles the v0.2.5816 perf+security release (PRs #484, #485, #483, #486, #487) with the experiment/reader-md feature that auto-prepends a hash-verified codebase map to codedb_context.

Highlights vs v0.2.5815:

Performance (PR #485, deterministic microbenchmarks):

    Suspense regex p50:  2.82 ms → 0.18 ms (15.6× faster)
    useState regex p99: 16.57 ms → 2.04 ms (8.1× p99 reduction)

CLI surface (PR #484):

    + codedb read <path> [-L FROM-TO] [--compact]
    + path-safety + sensitive-file guards
    + project-root anchoring (uses configured root, not cwd)

codedb_context (NEW in 0.2.5817):

    + auto-prepends .codedb/reader.md when source_hash matches
    + inline ~6 lines of body for ≤3 symbol_definitions
    + new "## Callers" section pre-surfaces execution sites
    + skip-on-short-task gate (≤80 chars) to avoid overhead on narrow lookups

reader.md security (this branch):

    + path-traversal blocked (no absolute / .. in source_files)
    + source_files capped at 20 (DoS guard)
    + loc_actual capped at 240 (body bloat guard)
    + golden blake2b roundtrip test

Eval (Sonnet 4.6, n=3 per task, vs v0.2.5815 main lineage):

    T1 flask median:  5 → 4  (-1)
    T2 regex median: 13 → 7  (-6)
    T3 react median: 13 → 10 (-3)

All 9 runs across the matrix returned correct answers. Branch wins on median, mode, and best-case for every task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
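The reader.md guards and the short-task gate listed above can be illustrated with a small Python sketch. This is not the codedb implementation. The function names are hypothetical, only the three limits (20, 240, 80) come from the commit message, and it assumes a violation rejects the whole map rather than truncating it. It only considers POSIX-style paths.

```python
from pathlib import PurePosixPath

MAX_SOURCE_FILES = 20   # "source_files capped at 20"
MAX_LOC = 240           # "loc_actual capped at 240"
SHORT_TASK_CHARS = 80   # skip-on-short-task gate


def is_safe_relative(path: str) -> bool:
    """True if path is non-empty, relative, and has no '..' component."""
    p = PurePosixPath(path)
    return bool(path) and not p.is_absolute() and ".." not in p.parts


def reader_is_acceptable(source_files: list[str], loc_actual: int) -> bool:
    # Assumption: any violation rejects the map instead of trimming it.
    return (len(source_files) <= MAX_SOURCE_FILES
            and loc_actual <= MAX_LOC
            and all(is_safe_relative(f) for f in source_files))


def should_prepend(task: str) -> bool:
    """Skip the map for tasks of 80 characters or fewer."""
    return len(task) > SHORT_TASK_CHARS
```

For example, `is_safe_relative("a/../b")` is false because of the `..` component, and an 80-character task description does not trigger the prepend.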
Summary
Wires the codegraph 0.7.10 backend into the upstream multi-session shootout.py so the next release can run the cross-corpus head-to-head from the canonical script (instead of the sibling harness used for PR #483's bench data).
New surface
```
--codegraph-bin default: $(which codegraph)
--skip-codegraph skip codegraph entirely
--clean-codegraph wipe matching .codegraph/ before indexing (forces cold build)
```
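As a rough illustration of how the three flags above could be declared and re-serialized for a child process, here is a minimal `argparse` sketch. The function names `build_parser` and `codegraph_argv` are hypothetical, and the actual wiring in shootout.py may differ.

```python
import argparse
import shutil


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="codegraph flag sketch")
    parser.add_argument("--codegraph-bin", default=shutil.which("codegraph"),
                        help="path to codegraph (default: first match on PATH)")
    parser.add_argument("--skip-codegraph", action="store_true",
                        help="skip the codegraph backend entirely")
    parser.add_argument("--clean-codegraph", action="store_true",
                        help="wipe matching .codegraph/ before indexing")
    return parser


def codegraph_argv(args: argparse.Namespace) -> list[str]:
    """Re-serialize the codegraph flags so a child process sees the same settings."""
    argv = []
    if args.codegraph_bin:
        argv += ["--codegraph-bin", args.codegraph_bin]
    if args.skip_codegraph:
        argv.append("--skip-codegraph")
    if args.clean_codegraph:
        argv.append("--clean-codegraph")
    return argv
```

Note that `shutil.which` returns `None` when codegraph is not on PATH, so a caller would need to handle a missing binary unless `--skip-codegraph` is set.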
`codegraph serve --mcp` runs as a long-lived stdio child; queries call `codegraph_search` directly — the same way codedb_search is exercised. Multi-session launcher forwards the new flags to per-session subprocesses.
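A sketch of what one timed query against the stdio child might look like, assuming codegraph follows the standard MCP stdio transport (newline-delimited JSON-RPC 2.0) and that `codegraph_search` accepts a `query` argument. The argument key is an assumption, not taken from this PR, and the helper names are hypothetical. A real session would also need the MCP `initialize` handshake before the first `tools/call`.

```python
import json
import subprocess
import time
from itertools import count

_ids = count(1)


def rpc_request(method: str, params: dict) -> str:
    """Encode one JSON-RPC 2.0 request as a single newline-terminated line."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}
    return json.dumps(msg) + "\n"


def search_request(query: str) -> str:
    # "query" as the argument key is an assumption about codegraph_search's schema.
    return rpc_request("tools/call",
                       {"name": "codegraph_search", "arguments": {"query": query}})


def start_server(codegraph_bin: str, corpus: str) -> subprocess.Popen:
    """Spawn the long-lived child with text-mode pipes on stdin/stdout."""
    return subprocess.Popen([codegraph_bin, "serve", "--mcp"], cwd=corpus,
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)


def timed_call(proc, line: str) -> tuple[float, dict]:
    """Send one request line, block for one reply line, return (ms, reply)."""
    start = time.perf_counter()
    proc.stdin.write(line)
    proc.stdin.flush()
    reply = json.loads(proc.stdout.readline())
    return (time.perf_counter() - start) * 1000.0, reply
```

Keeping the child alive across queries means the measured latency covers only the request/response round trip, not process startup or index loading.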
Smoke test
Codegraph-only on flask (5 iters):
```
[build] codegraph ...
0.57 s, ~3.7 MB
[query]
useState | 1.62/ 1.99/ 2.63/ 2.75 ms ( 0)
function | 1.30/ 1.32/ 1.48/ 1.48 ms ( 105)
set | 0.37/ 0.40/ 0.41/ 0.41 ms ( 70)
```
These numbers match what PR #483 collected via the sibling harness, so the integration is consistent with the earlier bench data.
What this does NOT change
Test plan
🤖 Generated with Claude Code