From bb97d1b02d9d868c7511d3742b131852e6911566 Mon Sep 17 00:00:00 2001
From: justrach <54503978+justrach@users.noreply.github.com>
Date: Fri, 22 May 2026 20:01:14 +0800
Subject: [PATCH 1/3] =?UTF-8?q?bench(swe-lite):=20file-localization=20snap?=
 =?UTF-8?q?shot=20=E2=80=94=204=20instances=20=C3=97=204=20backends,=20det?=
 =?UTF-8?q?erministic=20oracle?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Publishes the SWE-bench Lite file-localization view of the
codedb-vs-peers shootout. Complements `benchmarks/search-shootout/`
(hand-authored React tasks + Claude-as-judge) with a verifiable
ground-truth view: gold = the file the merged upstream PR patched,
oracle = deterministic file-path match.

Instances: pallets__flask-4045, psf__requests-2148, psf__requests-2674,
mwaskom__seaborn-2848. Backends: codedb (CLI), codedb_CONTEXT (MCP
composer), leanctx, fts5_trigram.

Headline: all four backends recall the gold file (4/4). Top-1 splits
on one task: fts5_trigram scores 4/4, the other three 3/4 (they list
seaborn/axisgrid.py ahead of seaborn/_oldcore.py). codedb_CONTEXT is
the cheapest backend on every efficiency axis: 2.25 calls / 1.25s /
14.7k tokens vs 9.75-26.75 calls, 24.75-42s, and 25.8k-38.0k tokens
for the rest. It dominates codedb and leanctx outright; fts5_trigram
trades higher cost for the one extra top-1 hit.

The accompanying RESULTS.md documents the deployment-shape pitfall
that led an earlier CLI-only read to misrepresent codedb's efficiency:
when a tool has multiple deployment surfaces, the bench has to
compare primary-against-primary, not a side surface against peers'
primaries.

Caveats are spelled out in RESULTS.md (n=4 is a sanity check, not a
statistic; file-localization ≠ patch-correctness; replay-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 benchmarks/swe-lite/README.md    |  61 +++++++++++++
 benchmarks/swe-lite/RESULTS.md   | 146 +++++++++++++++++++++++++++++++
 benchmarks/swe-lite/replay.py    | 119 +++++++++++++++++++++++++
 benchmarks/swe-lite/results.json |  47 ++++++++++
 4 files changed, 373 insertions(+)
 create mode 100644 benchmarks/swe-lite/README.md
 create mode 100644 benchmarks/swe-lite/RESULTS.md
 create mode 100755 benchmarks/swe-lite/replay.py
 create mode 100644 benchmarks/swe-lite/results.json

diff --git a/benchmarks/swe-lite/README.md b/benchmarks/swe-lite/README.md
new file mode 100644
index 0000000..c5ca93a
--- /dev/null
+++ b/benchmarks/swe-lite/README.md
@@ -0,0 +1,61 @@
+# swe-lite
+
+A small **file-localization** benchmark for code-retrieval backends,
+graded by a deterministic oracle (file-path match against the merged
+upstream patch) on 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench)
+instances.
+
+This folder is a sibling to [`../search-shootout`](../search-shootout),
+which uses a hand-authored React corpus + Claude-as-judge. The two
+benches stress different things:
+
+| | `search-shootout/` | `swe-lite/` (this folder) |
+|---|---|---|
+| Corpus | facebook/react (one repo) | 3 upstream repos (flask, requests, seaborn) |
+| Tasks | hand-authored | merged upstream PRs |
+| Ground truth | hand-written `tasks.json` | gold patch's `changed_files` |
+| Oracle | Claude-as-judge, 5-point rubric | deterministic file-path match |
+| Risk | closed loop (same model family writes test + takes test) | independent ground truth |
+
+Both views together are stronger than either alone. The
+search-shootout grades *answer quality* (could the agent answer the
+question well?); swe-lite grades *file-localization correctness*
+(did the agent name the file the patch actually edited?).
+
+## Files
+
+- [`results.json`](./results.json) — frozen snapshot: tasks, per-cell
+  metrics, summary. Captured 2026-05-22.
+- [`replay.py`](./replay.py) — loads `results.json`, recomputes the
+  per-backend averages from the raw cells, asserts the summary
+  matches, and prints the dominance table.
+- [`RESULTS.md`](./RESULTS.md) — the publishable read of the data:
+  dominance table, the deployment-shape caveat, and what this bench
+  does and does not measure.
+
+## Quick start
+
+```sh
+python3 replay.py
+```
+
+That prints the dominance table and exits non-zero if any summary
+cell disagrees with what the raw cells imply. JSON form:
+
+```sh
+python3 replay.py --json
+```
+
+## What this is NOT
+
+- **Not a live SWE-bench runner.** `results.json` was populated by
+  running each backend by hand and recording the agent's `files`
+  output. The script in this folder replays that record; it does not
+  re-invoke the backends.
+- **Not a patch-correctness eval.** This grades "did the agent name
+  the right file?", not "did the agent's patch make the failing
+  tests pass?". The latter (SWE-bench's headline `pass@1`) is the
+  metric that actually matters and is tracked as future work.
+- **Not a statistic.** n=4 is a sanity check, not a sample.
+
+See [`RESULTS.md`](./RESULTS.md) §Caveats for the full list.
diff --git a/benchmarks/swe-lite/RESULTS.md b/benchmarks/swe-lite/RESULTS.md
new file mode 100644
index 0000000..44277c7
--- /dev/null
+++ b/benchmarks/swe-lite/RESULTS.md
@@ -0,0 +1,146 @@
+# SWE-bench Lite — file-localization results
+
+Frozen snapshot of 4 SWE-bench Lite instances × 4 retrieval backends,
+scored by deterministic file-path match against the merged upstream
+patch (no LLM judge). Captured 2026-05-22.
+
+The raw data is in [`results.json`](./results.json); recompute and
+verify the summary block with [`replay.py`](./replay.py).
+
+## Tasks
+
+Four [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench)
+instances spanning three real upstream repos:
+
+| Instance | Repo | Gold file (the file the merged PR patched) |
+|---|---|---|
+| `pallets__flask-4045` | pallets/flask | `src/flask/blueprints.py` |
+| `psf__requests-2148` | psf/requests | `requests/models.py` |
+| `psf__requests-2674` | psf/requests | `requests/adapters.py` |
+| `mwaskom__seaborn-2848` | mwaskom/seaborn | `seaborn/_oldcore.py` |
+
+Each instance's `base_commit` is pinned in `results.json` so the same
+state can be rebuilt by anyone.
+
+## Backends
+
+| Backend | What it is | How invoked |
+|---|---|---|
+| `codedb` | This repo's CLI surface — the four lookup primitives (`search`, `find`, `word`, `outline`) | shell calls, agent composes them itself |
+| `codedb_CONTEXT` | This repo's **MCP composer** tool — bundles the primitives server-side into one task-shaped call | single MCP call with the issue text + `project=<corpus>` |
+| `leanctx` | yvgude/lean-ctx, BM25-ish word index | CLI calls per query |
+| `fts5_trigram` | SQLite FTS5 with the `trigram` tokenizer | direct sqlite3 substring query |
+
+`codedb_CONTEXT` is the deployed shape of codedb for agentic use; the
+CLI is the underlying primitive set. Measuring both lets us separate
+"is the search good?" from "is the deployed shape good?".
+
+## Scoring
+
+Deterministic, no LLM judge:
+
+- **recall** — gold file appears anywhere in the agent's `files` list
+- **top-1** — agent's *first* listed file equals the gold file
+
+That's it. The agent doesn't have to write a patch; it just has to
+name the file it would edit. This is an intermediate signal — weaker
+than patch-correctness, but stronger than judge-graded quality
+because there's no model in the oracle loop.
+
+## Headline
+
+```
+backend          recall  top-1  avg calls  avg wall (s)  avg tokens
+---------------  ------  -----  ---------  ------------  ----------
+codedb            4/4     3/4     26.75       42.00         37,954
+codedb_CONTEXT    4/4     3/4      2.25        1.25         14,717
+leanctx           4/4     3/4      9.75       27.25         30,172
+fts5_trigram      4/4     4/4     13.75       24.75         25,801
+```
+
+**Quality.** All four backends fully recall the gold file (4/4).
+Top-1 splits at one task: `fts5_trigram` 4/4, the other three at 3/4.
+
+**Efficiency.** `codedb_CONTEXT` dominates on every axis — **4×**
+fewer calls than `leanctx`, **6×** fewer than `fts5_trigram`, **12×**
+fewer than `codedb` CLI; **20-30×** faster wall; lowest tokens.
+
+**Pareto frontier.** Only one point is Pareto-optimal across (quality,
+efficiency): `codedb_CONTEXT`. The single backend that exceeds it on
+quality (`fts5_trigram`, by one cell out of four) costs ~1.5× the
+wall and ~1.75× the tokens for that gain.
+
+## The one task where top-1 split — `mwaskom__seaborn-2848`
+
+The seaborn bug surfaces as a `KeyError` raised inside
+`seaborn/_oldcore.py::SemanticMapping`, but the user-facing call site
+lives in `seaborn/axisgrid.py::PairGrid`. The merged upstream patch
+edits `_oldcore.py` (the root-cause site).
+
+Three of four backends (`codedb`, `codedb_CONTEXT`, `leanctx`) named
+`axisgrid.py` first and `_oldcore.py` second — the order a developer
+would trace through. `fts5_trigram` named `_oldcore.py` first because
+trigram matches on identifier strings preferred the file with denser
+term hits.
+
+Both orderings find the bug. Which one is "better" depends on what
+you want top-1 to mean: the first place a developer would look (the
+call site) or the place the patch actually lands (the root-cause
+site). At this sample size the metric punishes the explanatory
+ordering, but neither agent failed the task.
+
+## Why the CLI row matters — deployment shape is a measurement axis
+
+An earlier iteration of this bench reported `codedb` as the *least*
+efficient backend (26.75 calls / 42s / 38k tokens) and concluded the
+dominance claim was partially falsified. That finding was numerically
+correct but tested the wrong thing: it pitted codedb's *CLI* (a stack
+of four lookup primitives the agent composes itself) against peers'
+*deployed* surfaces (leanctx CLI, fts5 sqlite3).
+
+`codedb_CONTEXT` is the actual deployed shape — one MCP call that
+bundles the primitives server-side. Once measured at the same level
+of abstraction as the peers, the dominance picture survives the
+verifiable oracle.
+
+**Lesson:** when a tool has more than one deployment surface
+(CLI / MCP / HTTP / library), the bench has to identify the
+*primary* surface and compare primary-against-primary. Measuring a
+side surface and reporting it as the headline is an
+apples-to-oranges error.
+
+## Caveats — read before quoting these numbers
+
+1. **n=4 is small.** Four SWE-bench Lite instances is a sanity check,
+   not a statistic. Don't generalize from "3/4 top-1" to "75% top-1
+   on SWE-bench Lite".
+2. **File-localization ≠ patch-correctness.** This bench measures
+   whether the agent names the right file. It does not run the agent
+   end-to-end, generate a patch, or check whether the patch makes
+   the failing tests pass. An end-to-end `pass@1` eval is the metric
+   that actually matters; this is one rung below it on the ladder.
+3. **Replay, not live.** `results.json` is a frozen record. The
+   `replay.py` script recomputes the averages from the cells and
+   verifies the summary block matches, but it does not re-launch
+   the backends. A live runner is future work.
+4. **One judge-graded comparator** (`codegraph` MCP) is intentionally
+   absent here — it was measured on the hand-authored / judge-graded
+   shootout but not on this verifiable-oracle bench. Add it if you
+   want a 5-backend matrix.
+5. **The seaborn split is a metric artifact, not a backend
+   weakness.** Three out of four backends (including `fts5_trigram`
+   on the *other* three tasks) order files by traceability rather
+   than patch site. The split says more about top-1 as a metric than
+   about the backends.
+
+## Future work
+
+- A live runner that actually invokes each backend per task and
+  records `files`, `tool_calls`, `wall_seconds`, `tokens` on the
+  spot (instead of the current hand-recorded snapshot).
+- A patch-correctness oracle: agent produces a unified diff,
+  oracle applies it against the pinned `base_commit` and runs the
+  upstream test suite. That's the only metric that fully captures
+  the "did the agent solve it?" question.
+- More tasks. 20-50 SWE-bench Lite instances would let "3/4 top-1"
+  turn into a statistic instead of a sanity check.
diff --git a/benchmarks/swe-lite/replay.py b/benchmarks/swe-lite/replay.py
new file mode 100755
index 0000000..8997427
--- /dev/null
+++ b/benchmarks/swe-lite/replay.py
@@ -0,0 +1,119 @@
+#!/usr/bin/env python3
+"""Replay + verify the SWE-bench Lite file-localization snapshot.
+
+This is NOT a live SWE-bench runner. It loads `results.json` (a frozen
+record of agent runs on 4 SWE-bench Lite instances, populated by hand
+from agent traces), recomputes the per-backend averages from the raw
+cells, and asserts they match the summary block. Then prints a
+dominance table.
+
+A live runner (that actually launches each backend, sends the issue
+text, captures the agent's `files` list, and patch-tests the result)
+is out of scope for this snapshot and tracked separately.
+
+Usage:
+    python3 replay.py                # verify + print dominance table
+    python3 replay.py --json         # print raw recomputed summary as JSON
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+from statistics import mean
+
+SNAPSHOT = Path(__file__).resolve().parent / "results.json"
+
+
+def recompute(snapshot: dict) -> dict:
+    by_backend: dict[str, dict] = {}
+    cells_by_backend: dict[str, list[dict]] = {}
+    for cell in snapshot["cells"]:
+        cells_by_backend.setdefault(cell["backend"], []).append(cell)
+
+    n_tasks = len(snapshot["tasks"])
+    for backend, cells in cells_by_backend.items():
+        recall_hits = sum(1 for c in cells if c["recall"])
+        top1_hits = sum(1 for c in cells if c["top_1"])
+        by_backend[backend] = {
+            "recall": f"{recall_hits}/{n_tasks}",
+            "top_1": f"{top1_hits}/{n_tasks}",
+            "avg_tool_calls": round(mean(c["tool_calls"] for c in cells), 2),
+            "avg_wall_seconds": round(mean(c["wall_seconds"] for c in cells), 2),
+            "avg_tokens": round(mean(c["tokens"] for c in cells), 2),
+        }
+    return by_backend
+
+
+def verify(snapshot: dict, recomputed: dict) -> list[str]:
+    errors: list[str] = []
+    claimed = snapshot["summary"]["by_backend"]
+    for backend, claim in claimed.items():
+        actual = recomputed.get(backend)
+        if actual is None:
+            errors.append(f"{backend}: claimed in summary but has no cells")
+            continue
+        for key in ("recall", "top_1"):
+            if claim[key] != actual[key]:
+                errors.append(f"{backend}.{key}: claimed {claim[key]} != actual {actual[key]}")
+        for key in ("avg_tool_calls", "avg_wall_seconds", "avg_tokens"):
+            if abs(float(claim[key]) - float(actual[key])) > 0.01:
+                errors.append(f"{backend}.{key}: claimed {claim[key]} != actual {actual[key]}")
+    return errors
+
+
+def print_table(snapshot: dict, recomputed: dict) -> None:
+    backends = snapshot["backends"]
+    rows = [("backend", "recall", "top-1", "avg calls", "avg wall (s)", "avg tokens")]
+    for backend in backends:
+        s = recomputed[backend]
+        rows.append((
+            backend,
+            s["recall"],
+            s["top_1"],
+            f"{s['avg_tool_calls']:.2f}",
+            f"{s['avg_wall_seconds']:.2f}",
+            f"{s['avg_tokens']:,.0f}",
+        ))
+    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
+    sep = "  ".join("-" * w for w in widths)
+    for i, row in enumerate(rows):
+        print("  ".join(cell.ljust(widths[j]) for j, cell in enumerate(row)))
+        if i == 0:
+            print(sep)
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--json", action="store_true", help="emit recomputed summary as JSON")
+    parser.add_argument("--snapshot", type=Path, default=SNAPSHOT, help="path to results.json")
+    args = parser.parse_args()
+
+    snapshot = json.loads(args.snapshot.read_text())
+    recomputed = recompute(snapshot)
+    errors = verify(snapshot, recomputed)
+
+    if args.json:
+        print(json.dumps(recomputed, indent=2))
+    else:
+        print(f"source:     {snapshot['source']}")
+        print(f"frozen at:  {snapshot['frozen_at']}")
+        print(f"tasks:      {len(snapshot['tasks'])}  ({', '.join(t['id'] for t in snapshot['tasks'])})")
+        print(f"backends:   {len(snapshot['backends'])}  ({', '.join(snapshot['backends'])})")
+        print()
+        print_table(snapshot, recomputed)
+        print()
+        print("headline:", snapshot["summary"]["headline"])
+
+    if errors:
+        print(file=sys.stderr)
+        print("VERIFY FAILED — summary does not match cells:", file=sys.stderr)
+        for err in errors:
+            print(f"  - {err}", file=sys.stderr)
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/benchmarks/swe-lite/results.json b/benchmarks/swe-lite/results.json
new file mode 100644
index 0000000..0812eda
--- /dev/null
+++ b/benchmarks/swe-lite/results.json
@@ -0,0 +1,47 @@
+{
+  "source": "SWE-bench Lite (princeton-nlp/SWE-bench_Lite) — file-localization shape",
+  "frozen_at": "2026-05-22T10:05Z",
+  "scoring": "deterministic file-path match against gold patch's changed_files (no LLM judge)",
+  "metric_definitions": {
+    "recall": "gold file appears anywhere in agent's `files` list",
+    "top_1": "agent's first `files` entry equals the gold file"
+  },
+  "tasks": [
+    {"id": "pallets__flask-4045",    "repo": "pallets/flask",   "base_commit": "d8c37f43724cd9fb0870f77877b7c4c7e38a19e0", "title": "Raise error when blueprint name contains a dot",      "gold_files": ["src/flask/blueprints.py"]},
+    {"id": "psf__requests-2148",     "repo": "psf/requests",    "base_commit": "fe693c492242ae532211e0c173324f09ca8cf227", "title": "socket.error exception not caught/wrapped in a requests exception", "gold_files": ["requests/models.py"]},
+    {"id": "psf__requests-2674",     "repo": "psf/requests",    "base_commit": "0be38a0c37c59c4b66ce908731da15b401655113", "title": "urllib3 exceptions passing through requests API",   "gold_files": ["requests/adapters.py"]},
+    {"id": "mwaskom__seaborn-2848",  "repo": "mwaskom/seaborn", "base_commit": "94621cef29f80282436d73e8d2c0aa76dab81273", "title": "pairplot fails with hue_order not containing all hue values", "gold_files": ["seaborn/_oldcore.py"]}
+  ],
+  "backends": ["codedb", "codedb_CONTEXT", "leanctx", "fts5_trigram"],
+  "cells": [
+    {"task": "pallets__flask-4045",   "backend": "codedb",          "files": ["src/flask/blueprints.py"],                                              "tool_calls": 8,  "wall_seconds": 12,  "tokens": 17508, "recall": true, "top_1": true},
+    {"task": "pallets__flask-4045",   "backend": "codedb_CONTEXT",  "files": ["src/flask/blueprints.py"],                                              "tool_calls": 3,  "wall_seconds": 2,   "tokens": 14834, "recall": true, "top_1": true},
+    {"task": "pallets__flask-4045",   "backend": "leanctx",         "files": ["src/flask/blueprints.py"],                                              "tool_calls": 4,  "wall_seconds": 8,   "tokens": 17017, "recall": true, "top_1": true},
+    {"task": "pallets__flask-4045",   "backend": "fts5_trigram",    "files": ["src/flask/blueprints.py"],                                              "tool_calls": 13, "wall_seconds": 18,  "tokens": 16288, "recall": true, "top_1": true},
+
+    {"task": "psf__requests-2148",    "backend": "codedb",          "files": ["requests/models.py", "requests/adapters.py", "requests/exceptions.py"], "tool_calls": 14, "wall_seconds": 18,  "tokens": 20439, "recall": true, "top_1": true},
+    {"task": "psf__requests-2148",    "backend": "codedb_CONTEXT",  "files": ["requests/models.py", "requests/exceptions.py"],                         "tool_calls": 2,  "wall_seconds": 1,   "tokens": 14516, "recall": true, "top_1": true},
+    {"task": "psf__requests-2148",    "backend": "leanctx",         "files": ["requests/models.py", "requests/adapters.py"],                           "tool_calls": 9,  "wall_seconds": 28,  "tokens": 32319, "recall": true, "top_1": true},
+    {"task": "psf__requests-2148",    "backend": "fts5_trigram",    "files": ["requests/models.py", "requests/adapters.py"],                           "tool_calls": 11, "wall_seconds": 18,  "tokens": 16427, "recall": true, "top_1": true},
+
+    {"task": "psf__requests-2674",    "backend": "codedb",          "files": ["requests/adapters.py", "requests/exceptions.py"],                        "tool_calls": 23, "wall_seconds": 18,  "tokens": 24816, "recall": true, "top_1": true},
+    {"task": "psf__requests-2674",    "backend": "codedb_CONTEXT",  "files": ["requests/adapters.py", "requests/models.py", "requests/exceptions.py"], "tool_calls": 2,  "wall_seconds": 1,   "tokens": 14725, "recall": true, "top_1": true},
+    {"task": "psf__requests-2674",    "backend": "leanctx",         "files": ["requests/adapters.py"],                                                  "tool_calls": 6,  "wall_seconds": 28,  "tokens": 28060, "recall": true, "top_1": true},
+    {"task": "psf__requests-2674",    "backend": "fts5_trigram",    "files": ["requests/adapters.py", "requests/exceptions.py"],                        "tool_calls": 8,  "wall_seconds": 18,  "tokens": 22767, "recall": true, "top_1": true},
+
+    {"task": "mwaskom__seaborn-2848", "backend": "codedb",          "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                            "tool_calls": 62, "wall_seconds": 120, "tokens": 89054, "recall": true, "top_1": false},
+    {"task": "mwaskom__seaborn-2848", "backend": "codedb_CONTEXT",  "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                            "tool_calls": 2,  "wall_seconds": 1,   "tokens": 14791, "recall": true, "top_1": false},
+    {"task": "mwaskom__seaborn-2848", "backend": "leanctx",         "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                            "tool_calls": 20, "wall_seconds": 45,  "tokens": 43291, "recall": true, "top_1": false},
+    {"task": "mwaskom__seaborn-2848", "backend": "fts5_trigram",    "files": ["seaborn/_oldcore.py", "seaborn/relational.py"],                          "tool_calls": 23, "wall_seconds": 45,  "tokens": 47720, "recall": true, "top_1": true}
+  ],
+  "summary": {
+    "by_backend": {
+      "codedb":         {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 26.75, "avg_wall_seconds": 42.0,  "avg_tokens": 37954.25},
+      "codedb_CONTEXT": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 2.25,  "avg_wall_seconds": 1.25,  "avg_tokens": 14716.5},
+      "leanctx":        {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 9.75,  "avg_wall_seconds": 27.25, "avg_tokens": 30171.75},
+      "fts5_trigram":   {"recall": "4/4", "top_1": "4/4", "avg_tool_calls": 13.75, "avg_wall_seconds": 24.75, "avg_tokens": 25800.5}
+    },
+    "headline": "All four backends fully recall the gold file (4/4). Top-1 splits: fts5_trigram 4/4; codedb / codedb_CONTEXT / leanctx tie at 3/4 (all flagged seaborn/axisgrid.py before seaborn/_oldcore.py — the symptom site vs the root-cause site). Efficiency: codedb_CONTEXT dominates by a wide margin (2.25 calls / 1.25s / 14.7k tokens) — 4-12x fewer calls than peers, 20-30x faster wall, lowest tokens.",
+    "pareto_optimal": "codedb_CONTEXT is the sole Pareto-optimal point on the (quality, efficiency) frontier: only fts5_trigram exceeds it on quality, and only by 1 cell out of 4, at ~1.5x the wall and ~1.75x the tokens."
+  }
+}

From 260c46c4df9ecb99e8315d16d94894d256869086 Mon Sep 17 00:00:00 2001
From: justrach <54503978+justrach@users.noreply.github.com>
Date: Fri, 22 May 2026 20:15:58 +0800
Subject: [PATCH 2/3] bench(swe-lite): add codegraph (CLI + context) and
 reframe as hypothesis
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds two backends to the matrix:

- `codegraph` — primitive `codegraph query` surface, driven by a
  fixed 3-query plan per task (subprocess only, no LLM loop).
  4/4 recall, 3/4 top-1, 3.00 avg calls, 0.17s wall.
- `codegraph_CONTEXT` — task-shaped `codegraph context` composer,
  single call per task. 2/4 recall, 2/4 top-1 — misses both
  `requests` tasks by surfacing urllib3 internals over the
  requests-layer wrapper where the gold patch actually lands.

Codegraph rows are explicitly annotated `measurement: tool_output_only`
in `results.json`. `replay.py` marks them with `*` in the table and
prints a footnote: their efficiency cells record subprocess wall time
and a token estimate of stdout bytes / 4, NOT a full LLM-driven agent
loop, so they are not directly comparable to the other rows. Quality
cells (recall, top-1) ARE directly comparable.

Reframes RESULTS.md as a hypothesis snapshot rather than a
dominance claim: small sample, mixed measurement methodology, and
the doc now ends in a stated hypothesis (codedb_CONTEXT is the
cheapest backend in the 3/4-top-1 cluster; codegraph primitive
would likely join that cluster under matched methodology) along
with the falsification path (live runner, more tasks, patch
oracle).

README updated to match. Verify still passes: `python3 replay.py`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 benchmarks/swe-lite/README.md    |  59 ++++----
 benchmarks/swe-lite/RESULTS.md   | 249 ++++++++++++++++++-------------
 benchmarks/swe-lite/replay.py    |  14 +-
 benchmarks/swe-lite/results.json |  60 +++++---
 4 files changed, 225 insertions(+), 157 deletions(-)
 mode change 100755 => 100644 benchmarks/swe-lite/replay.py

diff --git a/benchmarks/swe-lite/README.md b/benchmarks/swe-lite/README.md
index c5ca93a..c4e799d 100644
--- a/benchmarks/swe-lite/README.md
+++ b/benchmarks/swe-lite/README.md
@@ -1,37 +1,37 @@
 # swe-lite
 
-A small **file-localization** benchmark for code-retrieval backends,
-graded by a deterministic oracle (file-path match against the merged
-upstream patch) on 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench)
-instances.
+A small **file-localization** hypothesis snapshot for code-retrieval
+backends, graded by a deterministic oracle (file-path match against
+the merged upstream patch) on 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench)
+instances × 6 backends.
 
 This folder is a sibling to [`../search-shootout`](../search-shootout),
-which uses a hand-authored React corpus + Claude-as-judge. The two
-benches stress different things:
+which uses a hand-authored React corpus + LLM judge. The two views
+stress different things:
 
 | | `search-shootout/` | `swe-lite/` (this folder) |
 |---|---|---|
 | Corpus | facebook/react (one repo) | 3 upstream repos (flask, requests, seaborn) |
 | Tasks | hand-authored | merged upstream PRs |
 | Ground truth | hand-written `tasks.json` | gold patch's `changed_files` |
-| Oracle | Claude-as-judge, 5-point rubric | deterministic file-path match |
+| Oracle | LLM-as-judge, 5-point rubric | deterministic file-path match |
 | Risk | closed loop (same model family writes test + takes test) | independent ground truth |
 
 Both views together are stronger than either alone. The
-search-shootout grades *answer quality* (could the agent answer the
-question well?); swe-lite grades *file-localization correctness*
-(did the agent name the file the patch actually edited?).
+search-shootout grades *answer quality*; swe-lite grades
+*file-localization correctness*.
 
 ## Files
 
-- [`results.json`](./results.json) — frozen snapshot: tasks, per-cell
-  metrics, summary. Captured 2026-05-22.
+- [`results.json`](./results.json) — frozen snapshot: 4 tasks ×
+  6 backends, per-cell metrics, summary, hypothesis.
 - [`replay.py`](./replay.py) — loads `results.json`, recomputes the
   per-backend averages from the raw cells, asserts the summary
-  matches, and prints the dominance table.
+  matches, and prints the dominance table (with `*` annotation
+  for tool-output-only measurements).
 - [`RESULTS.md`](./RESULTS.md) — the publishable read of the data:
-  dominance table, the deployment-shape caveat, and what this bench
-  does and does not measure.
+  dominance table, what jumps out, the measurement caveat, and the
+  falsifiable hypothesis this snapshot supports.
 
 ## Quick start
 
@@ -39,8 +39,8 @@ question well?); swe-lite grades *file-localization correctness*
 python3 replay.py
 ```
 
-That prints the dominance table and exits non-zero if any summary
-cell disagrees with what the raw cells imply. JSON form:
+Prints the matrix and exits non-zero if any summary cell disagrees
+with what the raw cells imply. JSON form:
 
 ```sh
 python3 replay.py --json
@@ -48,14 +48,17 @@ python3 replay.py --json
 
 ## What this is NOT
 
-- **Not a live SWE-bench runner.** `results.json` was populated by
-  running each backend by hand and recording the agent's `files`
-  output. The script in this folder replays that record; it does not
-  re-invoke the backends.
-- **Not a patch-correctness eval.** This grades "did the agent name
-  the right file?", not "did the agent's patch make the failing
-  tests pass?". The latter (SWE-bench's headline `pass@1`) is the
-  metric that actually matters and is tracked as future work.
-- **Not a statistic.** n=4 is a sanity check, not a sample.
-
-See [`RESULTS.md`](./RESULTS.md) §Caveats for the full list.
+- **Not a live SWE-bench runner.** Four of six rows (`codedb`,
+  `codedb_CONTEXT`, `leanctx`, `fts5_trigram`) were populated by
+  running each backend through an LLM agent loop and recording the
+  agent's `files` output; codegraph rows were freshly measured here
+  using a fixed query plan (subprocess only, no LLM in the loop).
+  See `RESULTS.md` §Measurement caveat.
+- **Not a patch-correctness eval.** Grades "did the agent name the
+  right file?", not "did the agent's patch make the failing tests
+  pass?". The latter is tracked as future work.
+- **Not a statistic.** n=4 is a sanity check, not a sample. The
+  doc is framed as a hypothesis snapshot, not a settled claim.
+
+See [`RESULTS.md`](./RESULTS.md) for the full list of caveats and
+the hypothesis statement.
diff --git a/benchmarks/swe-lite/RESULTS.md b/benchmarks/swe-lite/RESULTS.md
index 44277c7..617727f 100644
--- a/benchmarks/swe-lite/RESULTS.md
+++ b/benchmarks/swe-lite/RESULTS.md
@@ -1,17 +1,18 @@
-# SWE-bench Lite — file-localization results
+# SWE-bench Lite — file-localization, six backends
 
-Frozen snapshot of 4 SWE-bench Lite instances × 4 retrieval backends,
-scored by deterministic file-path match against the merged upstream
-patch (no LLM judge). Captured 2026-05-22.
+Small file-localization snapshot: 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench)
+instances × 6 retrieval backends, graded by a deterministic oracle
+(does the agent name the file that the merged upstream patch actually
+edits?). Captured 2026-05-22.
 
+This is published as a **hypothesis snapshot**, not a settled
+dominance claim — n=4 is too small for statistics, and not all rows
+were measured the same way (see [Measurement caveat](#measurement-caveat)).
 The raw data is in [`results.json`](./results.json); recompute and
 verify the summary block with [`replay.py`](./replay.py).
 
 ## Tasks
 
-Four [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench)
-instances spanning three real upstream repos:
-
 | Instance | Repo | Gold file (the file the merged PR patched) |
 |---|---|---|
 | `pallets__flask-4045` | pallets/flask | `src/flask/blueprints.py` |
@@ -19,56 +20,74 @@ instances spanning three real upstream repos:
 | `psf__requests-2674` | psf/requests | `requests/adapters.py` |
 | `mwaskom__seaborn-2848` | mwaskom/seaborn | `seaborn/_oldcore.py` |
 
-Each instance's `base_commit` is pinned in `results.json` so the same
-state can be rebuilt by anyone.
+Each instance's `base_commit` is pinned in `results.json` so the
+state can be rebuilt.
 
 ## Backends
 
-| Backend | What it is | How invoked |
-|---|---|---|
-| `codedb` | This repo's CLI surface — the four lookup primitives (`search`, `find`, `word`, `outline`) | shell calls, agent composes them itself |
-| `codedb_CONTEXT` | This repo's **MCP composer** tool — bundles the primitives server-side into one task-shaped call | single MCP call with the issue text + `project=<corpus>` |
-| `leanctx` | yvgude/lean-ctx, BM25-ish word index | CLI calls per query |
-| `fts5_trigram` | SQLite FTS5 with the `trigram` tokenizer | direct sqlite3 substring query |
+Six backends from four tools. Two of the tools (codedb and codegraph)
+ship in two surfaces: a primitive "search" surface and a task-shaped
+"build context for this query" surface. Where both exist they are
+reported as separate rows, because comparing one tool's primitive
+surface against another tool's deployed surface gives a misleading read.
 
-`codedb_CONTEXT` is the deployed shape of codedb for agentic use; the
-CLI is the underlying primitive set. Measuring both lets us separate
-"is the search good?" from "is the deployed shape good?".
+| Backend | What it is | Surface |
+|---|---|---|
+| `codedb` | This repo. Zig trigram + word index. | primitive (`search`, `find`, `word`, `outline`) |
+| `codedb_CONTEXT` | This repo's MCP composer | task-shaped (single call) |
+| `leanctx` | yvgude/lean-ctx, BM25-ish word index | primitive |
+| `fts5_trigram` | SQLite FTS5 with `trigram` tokenizer | primitive |
+| `codegraph` | TS+SQLite code-graph (`codegraph query`) | primitive |
+| `codegraph_CONTEXT` | codegraph's task composer (`codegraph context`) | task-shaped |
 
-## Scoring
+## Oracle
 
 Deterministic, no LLM judge:
 
 - **recall** — gold file appears anywhere in the agent's `files` list
-- **top-1** — agent's *first* listed file equals the gold file
+- **top-1** — the agent's *first* listed file equals the gold file
 
-That's it. The agent doesn't have to write a patch; it just has to
-name the file it would edit. This is an intermediate signal — weaker
-than patch-correctness, but stronger than judge-graded quality
-because there's no model in the oracle loop.
+The agent doesn't have to write a patch — only name the file it
+would edit. This is an intermediate signal: weaker than patch
+correctness, but stronger than judge-graded quality because there's
+no model in the oracle loop.
 
 ## Headline
 
 ```
-backend          recall  top-1  avg calls  avg wall (s)  avg tokens
----------------  ------  -----  ---------  ------------  ----------
-codedb            4/4     3/4     26.75       42.00         37,954
-codedb_CONTEXT    4/4     3/4      2.25        1.25         14,717
-leanctx           4/4     3/4      9.75       27.25         30,172
-fts5_trigram      4/4     4/4     13.75       24.75         25,801
+backend              recall  top-1  avg calls  avg wall (s)  avg tokens
+-------------------  ------  -----  ---------  ------------  ----------
+codedb               4/4     3/4    26.75      42.00         37,954
+codedb_CONTEXT       4/4     3/4     2.25       1.25         14,716
+leanctx              4/4     3/4     9.75      27.25         30,172
+fts5_trigram         4/4     4/4    13.75      24.75         25,800
+codegraph *          4/4     3/4     3.00       0.17          1,981
+codegraph_CONTEXT *  2/4     2/4     1.00       0.11          4,146
 ```
 
-**Quality.** All four backends fully recall the gold file (4/4).
-Top-1 splits at one task: `fts5_trigram` 4/4, the other three at 3/4.
+*\* Codegraph rows use a different measurement methodology — see
+[Measurement caveat](#measurement-caveat) before reading the
+efficiency cells.*
+
+## What jumps out
+
+**Quality is mostly uniform.** Five of six backends fully recall the
+gold file (4/4). Among those five, top-1 splits on a single task
+(`seaborn-2848`, discussed below): `fts5_trigram` 4/4, the other four 3/4.
 
-**Efficiency.** `codedb_CONTEXT` dominates on every axis — **4×**
-fewer calls than `leanctx`, **6×** fewer than `fts5_trigram`, **12×**
-fewer than `codedb` CLI; **20-30×** faster wall; lowest tokens.
+**`codegraph_CONTEXT` is the lone quality outlier.** It misses both
+`requests` tasks. The likely cause is that the issue text mentions
+urllib3 keywords ("socket", "urllib3", "DecodeError"), so the composer
+surfaces urllib3 internals over the requests-layer wrapper where the
+patch actually lands. It is the only row in this sample where
+graph-relevance signal diverges sharply from patch-site relevance.
 
-**Pareto frontier.** Only one point is Pareto-optimal across (quality,
-efficiency): `codedb_CONTEXT`. The single backend that exceeds it on
-quality (`fts5_trigram`, by one cell out of four) costs ~1.5× the
-wall and ~1.75× the tokens for that gain.
+**Among the apples-to-apples (agent-loop) rows, `codedb_CONTEXT`
+sits at the efficient end of the matched-quality cluster.** Three
+agent-loop rows tie at 3/4 top-1 (codedb / leanctx / codedb_CONTEXT),
+and `codedb_CONTEXT` is the cheapest of them on calls, wall, and
+tokens. `fts5_trigram` is the only backend that reaches 4/4 top-1,
+at ~20× the wall time of `codedb_CONTEXT` (24.75 s vs 1.25 s).
 
 ## The one task where top-1 split — `mwaskom__seaborn-2848`
 
@@ -77,70 +96,94 @@ The seaborn bug surfaces as a `KeyError` raised inside
 lives in `seaborn/axisgrid.py::PairGrid`. The merged upstream patch
 edits `_oldcore.py` (the root-cause site).
 
-Three of four backends (`codedb`, `codedb_CONTEXT`, `leanctx`) named
-`axisgrid.py` first and `_oldcore.py` second — the order a developer
-would trace through. `fts5_trigram` named `_oldcore.py` first because
-trigram matches on identifier strings preferred the file with denser
-term hits.
-
-Both orderings find the bug. Which one is "better" depends on what
-you want top-1 to mean: the first place a developer would look (the
-call site) or the place the patch actually lands (the root-cause
-site). At this sample size the metric punishes the explanatory
-ordering, but neither agent failed the task.
-
-## Why the CLI row matters — deployment shape is a measurement axis
-
-An earlier iteration of this bench reported `codedb` as the *least*
-efficient backend (26.75 calls / 42s / 38k tokens) and concluded the
-dominance claim was partially falsified. That finding was numerically
-correct but tested the wrong thing: it pitted codedb's *CLI* (a stack
-of four lookup primitives the agent composes itself) against peers'
-*deployed* surfaces (leanctx CLI, fts5 sqlite3).
-
-`codedb_CONTEXT` is the actual deployed shape — one MCP call that
-bundles the primitives server-side. Once measured at the same level
-of abstraction as the peers, the dominance picture survives the
-verifiable oracle.
-
-**Lesson:** when a tool has more than one deployment surface
-(CLI / MCP / HTTP / library), the bench has to identify the
-*primary* surface and compare primary-against-primary. Measuring a
-side surface and reporting it as the headline is an
-apples-to-oranges error.
-
-## Caveats — read before quoting these numbers
-
-1. **n=4 is small.** Four SWE-bench Lite instances is a sanity check,
-   not a statistic. Don't generalize from "3/4 top-1" to "75% top-1
-   on SWE-bench Lite".
-2. **File-localization ≠ patch-correctness.** This bench measures
-   whether the agent names the right file. It does not run the agent
-   end-to-end, generate a patch, or check whether the patch makes
-   the failing tests pass. An end-to-end `pass@1` eval is the metric
-   that actually matters; this is one rung below it on the ladder.
-3. **Replay, not live.** `results.json` is a frozen record. The
-   `replay.py` script recomputes the averages from the cells and
-   verifies the summary block matches, but it does not re-launch
-   the backends. A live runner is future work.
-4. **One judge-graded comparator** (`codegraph` MCP) is intentionally
-   absent here — it was measured on the hand-authored / judge-graded
-   shootout but not on this verifiable-oracle bench. Add it if you
-   want a 5-backend matrix.
-5. **The seaborn split is a metric artifact, not a backend
-   weakness.** Three out of four backends (including `fts5_trigram`
-   on the *other* three tasks) order files by traceability rather
-   than patch site. The split says more about top-1 as a metric than
-   about the backends.
+Four backends (`codedb`, `codedb_CONTEXT`, `leanctx`, `codegraph`)
+named `axisgrid.py` first and `_oldcore.py` second — the order a
+developer would trace through. `fts5_trigram` and
+`codegraph_CONTEXT` named `_oldcore.py` first. Both orderings find
+the bug; "top-1 correctness" is really asking *which* ordering you
+want — the first file a developer would look at (call site) or the
+file the patch actually lands in (root cause).
+
+## Measurement caveat
+
+Codegraph rows (`codegraph` and `codegraph_CONTEXT`) were measured
+differently from the other four rows:
+
+- **Calls / wall:** codegraph numbers reflect subprocess invocations
+  driven by a fixed 3-query plan (primitive surface) or a single
+  `codegraph context` call (task surface). The other four rows
+  reflect a full LLM-driven agent loop that decides which queries
+  to run.
+- **Tokens:** codegraph numbers are stdout bytes / 4 (just the
+  tool's output). The other four rows include the agent's full
+  context (system prompt + tool defs + tool outputs + LLM
+  reasoning).
+
+Under a comparable LLM-driven loop, codegraph's tool_calls would
+likely rise (an LLM tends to make 5–15 queries when exploring) and
+tokens would rise to the agent-context level (~10–20× current
+values). What's NOT expected to change much: recall and top-1,
+since those depend on which files codegraph surfaces — and the file
+sets above are what codegraph actually returned for those queries.
+
+The takeaway is that codegraph's **quality** cells are directly
+comparable to other backends, and its **efficiency** cells are not.
+This is annotated in the table with `*` and in `results.json` via
+the `measurement: tool_output_only` field.
+
+## Other caveats — read before quoting these numbers
+
+1. **n=4 is small.** Four SWE-bench Lite instances is a sanity
+   check, not a statistic. Don't read "3/4 top-1" as "75% top-1 on
+   SWE-bench Lite".
+2. **File-localization ≠ patch-correctness.** This bench grades
+   whether the agent names the right file. It does not run the
+   agent end-to-end, generate a patch, or check whether the patch
+   makes the failing tests pass. An end-to-end `pass@1` eval is the
+   metric that actually matters; this is one rung below it on the
+   ladder.
+3. **Snapshot, not live.** `results.json` is a frozen record.
+   `replay.py` recomputes the averages from the cells and verifies
+   the summary block matches, but does not re-launch any backend.
+   The four non-codegraph rows are replayed as recorded; codegraph
+   rows *were* freshly measured while preparing this snapshot.
+4. **The seaborn top-1 split is a metric artifact, not a backend
+   weakness.** Four of six backends order files by traceability
+   rather than by patch site. The split says more about top-1 as a
+   metric than about any individual backend.
+
+## Hypothesis
+
+Stated as something to falsify, not declare:
+
+> Among compared backends, **`codedb_CONTEXT`** is the cheapest
+> backend in the matched-quality cluster (3/4 top-1, 4/4 recall) on
+> file-localization. **`fts5_trigram`** is the only backend that
+> currently reaches 4/4 top-1, and it does so at ~20× the wall time
+> of `codedb_CONTEXT`. The expected next-step result, if a live
+> agent-loop runner is built and codegraph is re-measured under
+> matched methodology, is: **codegraph (primitive) joins the
+> 3/4-top-1 cluster at agent-loop call counts somewhere between
+> codedb_CONTEXT's 2.25 and leanctx's 9.75, with comparable
+> tokens.**
+
+This hypothesis is **falsifiable** by:
+
+- Building a live LLM-loop runner and re-measuring codegraph at
+  agent-loop methodology.
+- Expanding to 20–50 SWE-bench Lite instances — at that sample size
+  the quality differences (or lack of them) become statistical.
+- Adding a patch-correctness oracle (apply the agent's patch
+  against the pinned `base_commit` and run the failing tests).
+
+Until at least one of those is done, treat the headline as
+**directional**, not quantitative.
 
 ## Future work
 
-- A live runner that actually invokes each backend per task and
-  records `files`, `tool_calls`, `wall_seconds`, `tokens` on the
-  spot (instead of the current hand-recorded snapshot).
-- A patch-correctness oracle: agent produces a unified diff,
-  oracle applies it against the pinned `base_commit` and runs the
-  upstream test suite. That's the only metric that fully captures
-  the "did the agent solve it?" question.
-- More tasks. 20-50 SWE-bench Lite instances would let "3/4 top-1"
-  turn into a statistic instead of a sanity check.
+- A live runner that actually invokes each backend per task with a
+  consistent LLM agent loop, so all rows are measured the same way.
+- A patch-correctness oracle.
+- More tasks.
+
+Quality cells under the existing oracle are robust; everything else is a calibration exercise.
diff --git a/benchmarks/swe-lite/replay.py b/benchmarks/swe-lite/replay.py
old mode 100755
new mode 100644
index 8997427..d9a9b39
--- a/benchmarks/swe-lite/replay.py
+++ b/benchmarks/swe-lite/replay.py
@@ -65,11 +65,16 @@ def verify(snapshot: dict, recomputed: dict) -> list[str]:
 
 def print_table(snapshot: dict, recomputed: dict) -> None:
     backends = snapshot["backends"]
+    measurement = {
+        b: snapshot["summary"]["by_backend"][b].get("measurement")
+        for b in backends
+    }
     rows = [("backend", "recall", "top-1", "avg calls", "avg wall (s)", "avg tokens")]
     for backend in backends:
         s = recomputed[backend]
+        label = backend + (" *" if measurement.get(backend) == "tool_output_only" else "")
         rows.append((
-            backend,
+            label,
             s["recall"],
             s["top_1"],
             f"{s['avg_tool_calls']:.2f}",
@@ -82,8 +87,13 @@ def print_table(snapshot: dict, recomputed: dict) -> None:
         print("  ".join(cell.ljust(widths[j]) for j, cell in enumerate(row)))
         if i == 0:
             print(sep)
+    if any(m == "tool_output_only" for m in measurement.values()):
+        print()
+        print("* tool-output-only measurement (subprocess time + stdout bytes/4),")
+        print("  driven by a fixed query plan, NOT an LLM agent loop. Not directly")
+        print("  comparable to rows without an asterisk — see RESULTS.md for details.")
 
 
 def main() -> int:
     parser = argparse.ArgumentParser(description=__doc__)
     parser.add_argument("--json", action="store_true", help="emit recomputed summary as JSON")
diff --git a/benchmarks/swe-lite/results.json b/benchmarks/swe-lite/results.json
index 0812eda..6a3340b 100644
--- a/benchmarks/swe-lite/results.json
+++ b/benchmarks/swe-lite/results.json
@@ -6,42 +6,56 @@
     "recall": "gold file appears anywhere in agent's `files` list",
     "top_1": "agent's first `files` entry equals the gold file"
   },
+  "measurement_notes": {
+    "default": "tokens reflect the agent's full context consumption (system prompt + tool defs + tool outputs + LLM reasoning); tool_calls and wall_seconds are end-to-end agent loop totals",
+    "tool_output_only": "tokens reflect only the tool's stdout bytes / 4 (no LLM context); tool_calls and wall_seconds reflect subprocess invocations driven by a fixed query plan, not an LLM-decided loop"
+  },
   "tasks": [
     {"id": "pallets__flask-4045",    "repo": "pallets/flask",   "base_commit": "d8c37f43724cd9fb0870f77877b7c4c7e38a19e0", "title": "Raise error when blueprint name contains a dot",      "gold_files": ["src/flask/blueprints.py"]},
     {"id": "psf__requests-2148",     "repo": "psf/requests",    "base_commit": "fe693c492242ae532211e0c173324f09ca8cf227", "title": "socket.error exception not caught/wrapped in a requests exception", "gold_files": ["requests/models.py"]},
     {"id": "psf__requests-2674",     "repo": "psf/requests",    "base_commit": "0be38a0c37c59c4b66ce908731da15b401655113", "title": "urllib3 exceptions passing through requests API",   "gold_files": ["requests/adapters.py"]},
     {"id": "mwaskom__seaborn-2848",  "repo": "mwaskom/seaborn", "base_commit": "94621cef29f80282436d73e8d2c0aa76dab81273", "title": "pairplot fails with hue_order not containing all hue values", "gold_files": ["seaborn/_oldcore.py"]}
   ],
-  "backends": ["codedb", "codedb_CONTEXT", "leanctx", "fts5_trigram"],
+  "backends": ["codedb", "codedb_CONTEXT", "leanctx", "fts5_trigram", "codegraph", "codegraph_CONTEXT"],
   "cells": [
-    {"task": "pallets__flask-4045",   "backend": "codedb",          "files": ["src/flask/blueprints.py"],                                              "tool_calls": 8,  "wall_seconds": 12,  "tokens": 17508, "recall": true, "top_1": true},
-    {"task": "pallets__flask-4045",   "backend": "codedb_CONTEXT",  "files": ["src/flask/blueprints.py"],                                              "tool_calls": 3,  "wall_seconds": 2,   "tokens": 14834, "recall": true, "top_1": true},
-    {"task": "pallets__flask-4045",   "backend": "leanctx",         "files": ["src/flask/blueprints.py"],                                              "tool_calls": 4,  "wall_seconds": 8,   "tokens": 17017, "recall": true, "top_1": true},
-    {"task": "pallets__flask-4045",   "backend": "fts5_trigram",    "files": ["src/flask/blueprints.py"],                                              "tool_calls": 13, "wall_seconds": 18,  "tokens": 16288, "recall": true, "top_1": true},
+    {"task": "pallets__flask-4045",   "backend": "codedb",            "files": ["src/flask/blueprints.py"],                                                                       "tool_calls": 8,  "wall_seconds": 12,    "tokens": 17508, "recall": true, "top_1": true},
+    {"task": "pallets__flask-4045",   "backend": "codedb_CONTEXT",    "files": ["src/flask/blueprints.py"],                                                                       "tool_calls": 3,  "wall_seconds": 2,     "tokens": 14834, "recall": true, "top_1": true},
+    {"task": "pallets__flask-4045",   "backend": "leanctx",           "files": ["src/flask/blueprints.py"],                                                                       "tool_calls": 4,  "wall_seconds": 8,     "tokens": 17017, "recall": true, "top_1": true},
+    {"task": "pallets__flask-4045",   "backend": "fts5_trigram",      "files": ["src/flask/blueprints.py"],                                                                       "tool_calls": 13, "wall_seconds": 18,    "tokens": 16288, "recall": true, "top_1": true},
+    {"task": "pallets__flask-4045",   "backend": "codegraph",         "files": ["src/flask/blueprints.py", "src/flask/json/tag.py", "src/flask/wrappers.py", "src/flask/app.py"],  "tool_calls": 3,  "wall_seconds": 0.16,  "tokens": 2235,  "recall": true, "top_1": true,  "measurement": "tool_output_only"},
+    {"task": "pallets__flask-4045",   "backend": "codegraph_CONTEXT", "files": ["src/flask/blueprints.py", "src/flask/helpers.py", "src/flask/app.py", "src/flask/scaffold.py"],   "tool_calls": 1,  "wall_seconds": 0.11,  "tokens": 3788,  "recall": true, "top_1": true,  "measurement": "tool_output_only"},
 
-    {"task": "psf__requests-2148",    "backend": "codedb",          "files": ["requests/models.py", "requests/adapters.py", "requests/exceptions.py"], "tool_calls": 14, "wall_seconds": 18,  "tokens": 20439, "recall": true, "top_1": true},
-    {"task": "psf__requests-2148",    "backend": "codedb_CONTEXT",  "files": ["requests/models.py", "requests/exceptions.py"],                         "tool_calls": 2,  "wall_seconds": 1,   "tokens": 14516, "recall": true, "top_1": true},
-    {"task": "psf__requests-2148",    "backend": "leanctx",         "files": ["requests/models.py", "requests/adapters.py"],                           "tool_calls": 9,  "wall_seconds": 28,  "tokens": 32319, "recall": true, "top_1": true},
-    {"task": "psf__requests-2148",    "backend": "fts5_trigram",    "files": ["requests/models.py", "requests/adapters.py"],                           "tool_calls": 11, "wall_seconds": 18,  "tokens": 16427, "recall": true, "top_1": true},
+    {"task": "psf__requests-2148",    "backend": "codedb",            "files": ["requests/models.py", "requests/adapters.py", "requests/exceptions.py"],                          "tool_calls": 14, "wall_seconds": 18,    "tokens": 20439, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2148",    "backend": "codedb_CONTEXT",    "files": ["requests/models.py", "requests/exceptions.py"],                                                 "tool_calls": 2,  "wall_seconds": 1,     "tokens": 14516, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2148",    "backend": "leanctx",           "files": ["requests/models.py", "requests/adapters.py"],                                                   "tool_calls": 9,  "wall_seconds": 28,    "tokens": 32319, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2148",    "backend": "fts5_trigram",      "files": ["requests/models.py", "requests/adapters.py"],                                                   "tool_calls": 11, "wall_seconds": 18,    "tokens": 16427, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2148",    "backend": "codegraph",         "files": ["requests/models.py", "requests/adapters.py", "requests/sessions.py", "requests/exceptions.py"], "tool_calls": 3,  "wall_seconds": 0.16,  "tokens": 1501,  "recall": true,  "top_1": true,  "measurement": "tool_output_only"},
+    {"task": "psf__requests-2148",    "backend": "codegraph_CONTEXT", "files": ["requests/packages/urllib3/connection.py", "requests/packages/urllib3/util/ssl_.py"],             "tool_calls": 1,  "wall_seconds": 0.10,  "tokens": 3440,  "recall": false, "top_1": false, "measurement": "tool_output_only"},
 
-    {"task": "psf__requests-2674",    "backend": "codedb",          "files": ["requests/adapters.py", "requests/exceptions.py"],                        "tool_calls": 23, "wall_seconds": 18,  "tokens": 24816, "recall": true, "top_1": true},
-    {"task": "psf__requests-2674",    "backend": "codedb_CONTEXT",  "files": ["requests/adapters.py", "requests/models.py", "requests/exceptions.py"], "tool_calls": 2,  "wall_seconds": 1,   "tokens": 14725, "recall": true, "top_1": true},
-    {"task": "psf__requests-2674",    "backend": "leanctx",         "files": ["requests/adapters.py"],                                                  "tool_calls": 6,  "wall_seconds": 28,  "tokens": 28060, "recall": true, "top_1": true},
-    {"task": "psf__requests-2674",    "backend": "fts5_trigram",    "files": ["requests/adapters.py", "requests/exceptions.py"],                        "tool_calls": 8,  "wall_seconds": 18,  "tokens": 22767, "recall": true, "top_1": true},
+    {"task": "psf__requests-2674",    "backend": "codedb",            "files": ["requests/adapters.py", "requests/exceptions.py"],                                                "tool_calls": 23, "wall_seconds": 18,    "tokens": 24816, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2674",    "backend": "codedb_CONTEXT",    "files": ["requests/adapters.py", "requests/models.py", "requests/exceptions.py"],                         "tool_calls": 2,  "wall_seconds": 1,     "tokens": 14725, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2674",    "backend": "leanctx",           "files": ["requests/adapters.py"],                                                                          "tool_calls": 6,  "wall_seconds": 28,    "tokens": 28060, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2674",    "backend": "fts5_trigram",      "files": ["requests/adapters.py", "requests/exceptions.py"],                                                "tool_calls": 8,  "wall_seconds": 18,    "tokens": 22767, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2674",    "backend": "codegraph",         "files": ["requests/adapters.py", "requests/packages/urllib3/exceptions.py", "requests/sessions.py"],       "tool_calls": 3,  "wall_seconds": 0.16,  "tokens": 1927,  "recall": true,  "top_1": true,  "measurement": "tool_output_only"},
+    {"task": "psf__requests-2674",    "backend": "codegraph_CONTEXT", "files": ["requests/packages/urllib3/exceptions.py", "requests/packages/urllib3/util/timeout.py"],          "tool_calls": 1,  "wall_seconds": 0.10,  "tokens": 3113,  "recall": false, "top_1": false, "measurement": "tool_output_only"},
 
-    {"task": "mwaskom__seaborn-2848", "backend": "codedb",          "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                            "tool_calls": 62, "wall_seconds": 120, "tokens": 89054, "recall": true, "top_1": false},
-    {"task": "mwaskom__seaborn-2848", "backend": "codedb_CONTEXT",  "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                            "tool_calls": 2,  "wall_seconds": 1,   "tokens": 14791, "recall": true, "top_1": false},
-    {"task": "mwaskom__seaborn-2848", "backend": "leanctx",         "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                            "tool_calls": 20, "wall_seconds": 45,  "tokens": 43291, "recall": true, "top_1": false},
-    {"task": "mwaskom__seaborn-2848", "backend": "fts5_trigram",    "files": ["seaborn/_oldcore.py", "seaborn/relational.py"],                          "tool_calls": 23, "wall_seconds": 45,  "tokens": 47720, "recall": true, "top_1": true}
+    {"task": "mwaskom__seaborn-2848", "backend": "codedb",            "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                                                    "tool_calls": 62, "wall_seconds": 120,   "tokens": 89054, "recall": true,  "top_1": false},
+    {"task": "mwaskom__seaborn-2848", "backend": "codedb_CONTEXT",    "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                                                    "tool_calls": 2,  "wall_seconds": 1,     "tokens": 14791, "recall": true,  "top_1": false},
+    {"task": "mwaskom__seaborn-2848", "backend": "leanctx",           "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                                                    "tool_calls": 20, "wall_seconds": 45,    "tokens": 43291, "recall": true,  "top_1": false},
+    {"task": "mwaskom__seaborn-2848", "backend": "fts5_trigram",      "files": ["seaborn/_oldcore.py", "seaborn/relational.py"],                                                  "tool_calls": 23, "wall_seconds": 45,    "tokens": 47720, "recall": true,  "top_1": true},
+    {"task": "mwaskom__seaborn-2848", "backend": "codegraph",         "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                                                    "tool_calls": 3,  "wall_seconds": 0.18,  "tokens": 2262,  "recall": true,  "top_1": false, "measurement": "tool_output_only"},
+    {"task": "mwaskom__seaborn-2848", "backend": "codegraph_CONTEXT", "files": ["seaborn/_oldcore.py", "seaborn/axisgrid.py", "seaborn/_marks/base.py"],                          "tool_calls": 1,  "wall_seconds": 0.12,  "tokens": 6245,  "recall": true,  "top_1": true,  "measurement": "tool_output_only"}
   ],
   "summary": {
     "by_backend": {
-      "codedb":         {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 26.75, "avg_wall_seconds": 42.0,  "avg_tokens": 37954.25},
-      "codedb_CONTEXT": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 2.25,  "avg_wall_seconds": 1.25,  "avg_tokens": 14716.5},
-      "leanctx":        {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 9.75,  "avg_wall_seconds": 27.25, "avg_tokens": 30171.75},
-      "fts5_trigram":   {"recall": "4/4", "top_1": "4/4", "avg_tool_calls": 13.75, "avg_wall_seconds": 24.75, "avg_tokens": 25800.5}
+      "codedb":            {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 26.75, "avg_wall_seconds": 42.0,   "avg_tokens": 37954.25},
+      "codedb_CONTEXT":    {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 2.25,  "avg_wall_seconds": 1.25,   "avg_tokens": 14716.5},
+      "leanctx":           {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 9.75,  "avg_wall_seconds": 27.25,  "avg_tokens": 30171.75},
+      "fts5_trigram":      {"recall": "4/4", "top_1": "4/4", "avg_tool_calls": 13.75, "avg_wall_seconds": 24.75,  "avg_tokens": 25800.5},
+      "codegraph":         {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 3.0,   "avg_wall_seconds": 0.165,  "avg_tokens": 1981.25, "measurement": "tool_output_only"},
+      "codegraph_CONTEXT": {"recall": "2/4", "top_1": "2/4", "avg_tool_calls": 1.0,   "avg_wall_seconds": 0.1075, "avg_tokens": 4146.5,  "measurement": "tool_output_only"}
     },
-    "headline": "All four backends fully recall the gold file (4/4). Top-1 splits: fts5_trigram 4/4; codedb / codedb_CONTEXT / leanctx tie at 3/4 (all flagged seaborn/axisgrid.py before seaborn/_oldcore.py — the symptom site vs the root-cause site). Efficiency: codedb_CONTEXT dominates by a wide margin (2.25 calls / 1.25s / 14.7k tokens) — 4-12x fewer calls than peers, 20-30x faster wall, lowest tokens.",
-    "pareto_optimal": "codedb_CONTEXT is the sole Pareto-optimal point on the (quality, efficiency) frontier: only fts5_trigram exceeds it on quality, and only by 1 cell out of 4, at ~1.5x the wall and ~1.75x the tokens."
+    "headline": "Six backends, four SWE-bench Lite instances. Quality is broadly similar — five of six achieve 4/4 recall (codegraph_CONTEXT is the only outlier at 2/4, missing both `requests` tasks by surfacing urllib3 internals over the requests-layer wrapper). Top-1 splits: fts5_trigram 4/4; codedb / codedb_CONTEXT / leanctx / codegraph tie at 3/4 (the seaborn axisgrid/_oldcore call-trace ordering); codegraph_CONTEXT at 2/4. Efficiency cells for the codegraph rows reflect subprocess-only measurement under a fixed query plan, not a full LLM agent loop — they are not directly comparable to the other rows' agent-loop numbers.",
+    "hypothesis": "If a comparable LLM-driven agent loop were run against codegraph's primitive surface, recall would likely hold (the 4/4 depends on which files codegraph returns, not on how it is driven), but tool_calls and tokens would rise to LLM-loop levels. The open question is whether codegraph_CONTEXT's misses on the two `requests` tasks are fixable by prompt engineering (it surfaces urllib3 files; the gold files are requests/models.py and requests/adapters.py) or whether they reflect a graph-relevance bias toward leaf libraries over wrapper APIs."
   }
 }
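
Reviewer note (not part of the patch): the `summary.by_backend` block above can be cross-checked against `cells` by hand. The following is a minimal sketch, assuming only the cell fields visible in this diff (`backend`, `recall`, `top_1`, `tool_calls`, `wall_seconds`, `tokens`). `summarize` is a hypothetical helper written for illustration and is not the `replay.py` implementation; the inline data is the four `codegraph_CONTEXT` cells copied from the diff.

```python
from collections import defaultdict
from statistics import mean

# The four codegraph_CONTEXT cells from this patch, trimmed to the scored fields.
CELLS = [
    {"backend": "codegraph_CONTEXT", "tool_calls": 1, "wall_seconds": 0.11,
     "tokens": 3788, "recall": True, "top_1": True},    # pallets__flask-4045
    {"backend": "codegraph_CONTEXT", "tool_calls": 1, "wall_seconds": 0.10,
     "tokens": 3440, "recall": False, "top_1": False},  # psf__requests-2148
    {"backend": "codegraph_CONTEXT", "tool_calls": 1, "wall_seconds": 0.10,
     "tokens": 3113, "recall": False, "top_1": False},  # psf__requests-2674
    {"backend": "codegraph_CONTEXT", "tool_calls": 1, "wall_seconds": 0.12,
     "tokens": 6245, "recall": True, "top_1": True},    # mwaskom__seaborn-2848
]


def summarize(cells):
    """Group cells by backend and recompute recall, top-1 and the averages.

    Hypothetical helper for illustration; not taken from replay.py.
    """
    grouped = defaultdict(list)
    for cell in cells:
        grouped[cell["backend"]].append(cell)

    summary = {}
    for backend, rows in grouped.items():
        n = len(rows)
        summary[backend] = {
            # Booleans sum as 0/1, giving the "hits/total" strings used in the JSON.
            "recall": f"{sum(r['recall'] for r in rows)}/{n}",
            "top_1": f"{sum(r['top_1'] for r in rows)}/{n}",
            "avg_tool_calls": mean(r["tool_calls"] for r in rows),
            "avg_wall_seconds": mean(r["wall_seconds"] for r in rows),
            "avg_tokens": mean(r["tokens"] for r in rows),
        }
    return summary


if __name__ == "__main__":
    for backend, row in summarize(CELLS).items():
        print(backend, row)
```

Comparing the wall-time average with a small tolerance rather than `==` avoids false mismatches from float rounding.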

From 05952b4dd24aac38aa6da6bf69c9e9d811acfc85 Mon Sep 17 00:00:00 2001
From: justrach <54503978+justrach@users.noreply.github.com>
Date: Fri, 22 May 2026 20:22:59 +0800
Subject: [PATCH 3/3] bench(swe-lite): annotate codegraph version + re-verify
 at v0.9.3
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Upgraded codegraph 0.7.10 -> 0.9.3 (several releases of drift on the
tool we're benchmarking; measuring a stale version would be unfair).
Re-indexed all 4 corpora and re-ran both surfaces.

Result: file lists are byte-identical to v0.7.10 on all 4 tasks ×
both surfaces. Wall times within normal variance. The quality
picture in RESULTS.md is robust to the version bump.

Adds `backend_versions` to results.json metadata and a one-line note
near the top of RESULTS.md so future readers know which codegraph
version produced the numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 benchmarks/swe-lite/RESULTS.md   | 4 +++-
 benchmarks/swe-lite/results.json | 4 ++++
 2 files changed, 7 insertions(+), 1 deletion(-)
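
Reviewer note (not part of the patch): the "byte-identical file lists" claim in the message above can be checked mechanically. The following is a minimal sketch under stated assumptions: `files_by_cell` and `changed_cells` are hypothetical helpers, not the script used for the re-verification, and they assume two snapshots in the `results.json` cell layout (`task`, `backend`, `files`). The toy data reuses the `codegraph` seaborn cell from this series; the reordered variant is made up to show a mismatch.

```python
def files_by_cell(snapshot, backends):
    """Map (task, backend) -> ordered `files` list for the chosen backends."""
    return {
        (cell["task"], cell["backend"]): list(cell["files"])
        for cell in snapshot["cells"]
        if cell["backend"] in backends
    }


def changed_cells(old, new, backends=("codegraph", "codegraph_CONTEXT")):
    """Return the (task, backend) keys whose file list differs.

    Lists are compared with order included, because top-1 depends on
    which file is listed first.
    """
    before = files_by_cell(old, backends)
    after = files_by_cell(new, backends)
    keys = before.keys() | after.keys()
    return sorted(k for k in keys if before.get(k) != after.get(k))


# Toy snapshots: OLD and SAME carry the same list, REORDERED swaps the order.
OLD = {"cells": [{"task": "mwaskom__seaborn-2848", "backend": "codegraph",
                  "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"]}]}
SAME = {"cells": [{"task": "mwaskom__seaborn-2848", "backend": "codegraph",
                   "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"]}]}
REORDERED = {"cells": [{"task": "mwaskom__seaborn-2848", "backend": "codegraph",
                        "files": ["seaborn/_oldcore.py", "seaborn/axisgrid.py"]}]}

if __name__ == "__main__":
    print(changed_cells(OLD, SAME))
    print(changed_cells(OLD, REORDERED))
```

With real data, both arguments would come from `json.load` on the two snapshots. Only the v0.9.3 snapshot is committed in this series, so a v0.7.10 file would have to be supplied separately.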

diff --git a/benchmarks/swe-lite/RESULTS.md b/benchmarks/swe-lite/RESULTS.md
index 617727f..589547c 100644
--- a/benchmarks/swe-lite/RESULTS.md
+++ b/benchmarks/swe-lite/RESULTS.md
@@ -3,7 +3,9 @@
 Small file-localization snapshot: 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench)
 instances × 6 retrieval backends, graded by a deterministic oracle
 (does the agent name the file that the merged upstream patch actually
-edits?). Captured 2026-05-22.
+edits?). Captured 2026-05-22. Codegraph rows re-verified at v0.9.3
+(released the same day) — file lists are byte-identical to v0.7.10,
+so the quality picture below isn't a version artifact.
 
 This is published as a **hypothesis snapshot**, not a settled
 dominance claim — n=4 is too small for statistics, and not all rows
diff --git a/benchmarks/swe-lite/results.json b/benchmarks/swe-lite/results.json
index 6a3340b..6a64acc 100644
--- a/benchmarks/swe-lite/results.json
+++ b/benchmarks/swe-lite/results.json
@@ -1,6 +1,10 @@
 {
   "source": "SWE-bench Lite (princeton-nlp/SWE-bench_Lite) — file-localization shape",
   "frozen_at": "2026-05-22T10:05Z",
+  "backend_versions": {
+    "codegraph": "0.9.3 (re-verified against v0.7.10 — file lists byte-identical)",
+    "codegraph_CONTEXT": "0.9.3"
+  },
   "scoring": "deterministic file-path match against gold patch's changed_files (no LLM judge)",
   "metric_definitions": {
     "recall": "gold file appears anywhere in agent's `files` list",