diff --git a/benchmarks/swe-lite/README.md b/benchmarks/swe-lite/README.md
new file mode 100644
index 0000000..c4e799d
--- /dev/null
+++ b/benchmarks/swe-lite/README.md
@@ -0,0 +1,64 @@
+# swe-lite
+
+A small **file-localization** hypothesis snapshot for code-retrieval
+backends, graded by a deterministic oracle (file-path match against
+the merged upstream patch) on 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench)
+instances × 6 backends.
+
+This folder is a sibling to [`../search-shootout`](../search-shootout),
+which pairs hand-authored tasks over the React repo with an LLM judge.
+The two views stress different things:
+
+| | `search-shootout/` | `swe-lite/` (this folder) |
+|---|---|---|
+| Corpus | facebook/react (one repo) | 3 upstream repos (flask, requests, seaborn) |
+| Tasks | hand-authored | merged upstream PRs |
+| Ground truth | hand-written `tasks.json` | gold patch's `changed_files` |
+| Oracle | LLM-as-judge, 5-point rubric | deterministic file-path match |
+| Risk | closed loop (same model family writes test + takes test) | independent ground truth |
+
+Both views together are stronger than either alone. The
+search-shootout grades *answer quality*; swe-lite grades
+*file-localization correctness*.
+
+## Files
+
+- [`results.json`](./results.json) — frozen snapshot: 4 tasks ×
+  6 backends, per-cell metrics, summary, hypothesis.
+- [`replay.py`](./replay.py) — loads `results.json`, recomputes the
+  per-backend averages from the raw cells, asserts the summary
+  matches, and prints the dominance table (with `*` annotation
+  for tool-output-only measurements).
+- [`RESULTS.md`](./RESULTS.md) — the publishable read of the data:
+  dominance table, what jumps out, the measurement caveat, and the
+  falsifiable hypothesis this snapshot supports.
+
+## Quick start
+
+```sh
+python3 replay.py
+```
+
+Prints the matrix and exits non-zero if any summary cell disagrees
+with what the raw cells imply. JSON form:
+
+```sh
+python3 replay.py --json
+```
+
+## What this is NOT
+
+- **Not a live SWE-bench runner.** Four of six rows (`codedb`,
+  `codedb_CONTEXT`, `leanctx`, `fts5_trigram`) were populated by
+  running each backend through an LLM agent loop and recording the
+  agent's `files` output; codegraph rows were freshly measured here
+  using a fixed query plan (subprocess only, no LLM in the loop).
+  See `RESULTS.md` §Measurement caveat.
+- **Not a patch-correctness eval.** It grades "did the agent name the
+  right file?", not "did the agent's patch make the failing tests
+  pass?". The latter is tracked as future work.
+- **Not a statistic.** n=4 is a sanity check, not a sample large enough
+  for inference. The doc is a hypothesis snapshot, not a settled claim.
+
+See [`RESULTS.md`](./RESULTS.md) for the full list of caveats and
+the hypothesis statement.
diff --git a/benchmarks/swe-lite/RESULTS.md b/benchmarks/swe-lite/RESULTS.md
new file mode 100644
index 0000000..589547c
--- /dev/null
+++ b/benchmarks/swe-lite/RESULTS.md
@@ -0,0 +1,191 @@
+# SWE-bench Lite — file-localization, six backends
+
+Small file-localization snapshot: 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench)
+instances × 6 retrieval backends, graded by a deterministic oracle
+(does the agent name the file that the merged upstream patch actually
+edits?). Captured 2026-05-22. Codegraph rows re-verified at v0.9.3
+(released the same day) — file lists are byte-identical to v0.7.10,
+so the quality picture below isn't a version artifact.
+
+This is published as a **hypothesis snapshot**, not a settled
+dominance claim — n=4 is too small for statistics, and not all rows
+were measured the same way (see [Measurement caveat](#measurement-caveat)).
+The raw data is in [`results.json`](./results.json); recompute and
+verify the summary block with [`replay.py`](./replay.py).
+
+## Tasks
+
+| Instance | Repo | Gold file (the file the merged PR patched) |
+|---|---|---|
+| `pallets__flask-4045` | pallets/flask | `src/flask/blueprints.py` |
+| `psf__requests-2148` | psf/requests | `requests/models.py` |
+| `psf__requests-2674` | psf/requests | `requests/adapters.py` |
+| `mwaskom__seaborn-2848` | mwaskom/seaborn | `seaborn/_oldcore.py` |
+
+Each instance's `base_commit` is pinned in `results.json` so the
+state can be rebuilt.
+
+## Backends
+
+Six backend rows from four tools; two of the tools (codedb and
+codegraph) ship in two surfaces: a primitive "search" surface and a
+task-shaped "build context for this query" surface. Both surfaces are
+reported separately when they exist, because comparing one tool's
+primitive surface against another tool's deployed surface is misleading.
+
+| Backend | What it is | Surface |
+|---|---|---|
+| `codedb` | This repo. Zig trigram + word index. | primitive (`search`, `find`, `word`, `outline`) |
+| `codedb_CONTEXT` | This repo's MCP composer | task-shaped (single call) |
+| `leanctx` | yvgude/lean-ctx, BM25-ish word index | primitive |
+| `fts5_trigram` | SQLite FTS5 with `trigram` tokenizer | primitive |
+| `codegraph` | TS+SQLite code-graph (`codegraph query`) | primitive |
+| `codegraph_CONTEXT` | codegraph's task composer (`codegraph context`) | task-shaped |
+
+## Oracle
+
+Deterministic, no LLM judge:
+
+- **recall** — gold file appears anywhere in the agent's `files` list
+- **top-1** — the agent's *first* listed file equals the gold file
+
+The agent doesn't have to write a patch — only name the file it
+would edit. This is an intermediate signal: weaker than patch
+correctness, but stronger than judge-graded quality because there's
+no model in the oracle loop.
+
+## Headline
+
+```
+backend              recall  top-1  avg calls  avg wall (s)  avg tokens
+-------------------  ------  -----  ---------  ------------  ----------
+codedb               4/4     3/4    26.75      42.00         37,954
+codedb_CONTEXT       4/4     3/4     2.25       1.25         14,716
+leanctx              4/4     3/4     9.75      27.25         30,172
+fts5_trigram         4/4     4/4    13.75      24.75         25,800
+codegraph *          4/4     3/4     3.00       0.17          1,981
+codegraph_CONTEXT *  2/4     2/4     1.00       0.11          4,146
+```
+
+*\* Codegraph rows use a different measurement methodology — see
+[Measurement caveat](#measurement-caveat) before reading the
+efficiency cells.*
+
+## What jumps out
+
+**Quality is mostly uniform.** Five of six backends recall the gold
+file on every task (4/4). Among those five, top-1 splits on one task
+(`seaborn-2848`, discussed below): `fts5_trigram` 4/4, the other four 3/4.
+
+**`codegraph_CONTEXT` is the lone quality outlier.** It misses both
+`requests` tasks because the issue text mentions urllib3 keywords
+("socket", "urllib3", "DecodeError"), and the composer surfaces
+urllib3 internals over the requests-layer wrapper where the patch
+actually lands. These two cells are the only ones in this sample where
+the graph-relevance signal diverges sharply from patch-site relevance.
+
+**Among the apples-to-apples (agent-loop) rows, `codedb_CONTEXT`
+sits at the efficient end of the matched-quality cluster.** It
+matches the 3/4-top-1 cluster (codedb / leanctx / codedb_CONTEXT)
+on quality and is the cheapest in that cluster on calls, wall time,
+and tokens. `fts5_trigram` is the only backend to reach 4/4 top-1,
+at ~20× the wall time of `codedb_CONTEXT` (24.75 s vs 1.25 s).
+
+## The one task where top-1 split — `mwaskom__seaborn-2848`
+
+The seaborn bug surfaces as a `KeyError` raised inside
+`seaborn/_oldcore.py::SemanticMapping`, but the user-facing call site
+lives in `seaborn/axisgrid.py::PairGrid`. The merged upstream patch
+edits `_oldcore.py` (the root-cause site).
+
+Four backends (`codedb`, `codedb_CONTEXT`, `leanctx`, `codegraph`)
+named `axisgrid.py` first and `_oldcore.py` second — the order a
+developer would trace through. `fts5_trigram` and
+`codegraph_CONTEXT` named `_oldcore.py` first. Both orderings find
+the bug; "top-1 correctness" is really asking *which* ordering you
+want — the first file a developer would look at (call site) or the
+file the patch actually lands in (root cause).
+
+## Measurement caveat
+
+Codegraph rows (`codegraph` and `codegraph_CONTEXT`) were measured
+differently from the other four rows:
+
+- **Calls / wall:** codegraph numbers reflect subprocess invocations
+  driven by a fixed 3-query plan (primitive surface) or a single
+  `codegraph context` call (task surface). The other four rows
+  reflect a full LLM-driven agent loop that decides which queries
+  to run.
+- **Tokens:** codegraph numbers are stdout bytes / 4 (just the
+  tool's output). The other four rows include the agent's full
+  context (system prompt + tool defs + tool outputs + LLM
+  reasoning).
+
+Under a comparable LLM-driven loop, codegraph's tool_calls would
+likely rise (an LLM tends to make 5–15 queries when exploring) and
+tokens would rise to the agent-context level (~10–20× current
+values). What's NOT expected to change much: recall and top-1,
+since those depend on which files codegraph surfaces — and the file
+sets above are what codegraph actually returned for those queries.
+
+The takeaway is that codegraph's **quality** cells are directly
+comparable to other backends, and its **efficiency** cells are not.
+This is annotated in the table with `*` and in `results.json` via
+the `measurement: tool_output_only` field.
+
+## Other caveats — read before quoting these numbers
+
+1. **n=4 is small.** Four SWE-bench Lite instances is a sanity
+   check, not a statistic. Don't read "3/4 top-1" as "75% top-1 on
+   SWE-bench Lite".
+2. **File-localization ≠ patch-correctness.** This bench grades
+   whether the agent names the right file. It does not run the
+   agent end-to-end, generate a patch, or check whether the patch
+   makes the failing tests pass. An end-to-end `pass@1` eval is the
+   metric that actually matters; this is one rung below it on the
+   ladder.
+3. **Snapshot, not live.** `results.json` is a frozen record.
+   `replay.py` recomputes the averages from the cells and verifies
+   the summary block matches, but does not re-launch the four
+   non-codegraph backends. Codegraph rows *were* freshly measured
+   while preparing this snapshot.
+4. **The seaborn top-1 split is a metric artifact, not a backend
+   weakness.** Four of six backends order files by traceability
+   rather than by patch site. The split says more about top-1 as a
+   metric than about any individual backend.
+
+## Hypothesis
+
+Stated as something to falsify, not declare:
+
+> Among compared backends, **`codedb_CONTEXT`** is the cheapest
+> backend in the matched-quality cluster (3/4 top-1, 4/4 recall) on
+> file-localization. **`fts5_trigram`** is the only backend that
+> currently reaches 4/4 top-1, and it does so at ~20× the wall time
+> of `codedb_CONTEXT`. The expected next-step result, if a live
+> agent-loop runner is built and codegraph is re-measured under
+> matched methodology, is: **codegraph (primitive) joins the
+> 3/4-top-1 cluster at agent-loop call counts somewhere between
+> codedb_CONTEXT's 2.25 and leanctx's 9.75, with comparable
+> tokens.**
+
+This hypothesis is **falsifiable** by:
+
+- Building a live LLM-loop runner and re-measuring codegraph at
+  agent-loop methodology.
+- Expanding to 20–50 SWE-bench Lite instances; at that sample size
+  the quality differences (or lack of them) can be tested statistically.
+- Adding a patch-correctness oracle (apply the agent's patch
+  against the pinned `base_commit` and run the failing tests).
+
+Until at least one of those has been done, treat the headline as
+**directional**, not quantitative.
+
+## Future work
+
+- A live runner that actually invokes each backend per task with a
+  consistent LLM agent loop, so all rows are measured the same way.
+- A patch-correctness oracle.
+- More tasks.
+- Recalibrating the efficiency cells: quality cells are already
+  comparable under the existing oracle, efficiency cells are not.
diff --git a/benchmarks/swe-lite/replay.py b/benchmarks/swe-lite/replay.py
new file mode 100644
index 0000000..d9a9b39
--- /dev/null
+++ b/benchmarks/swe-lite/replay.py
@@ -0,0 +1,127 @@
+#!/usr/bin/env python3
+"""Replay + verify the SWE-bench Lite file-localization snapshot.
+
+This is NOT a live SWE-bench runner. It loads `results.json` (a frozen
+record of agent runs on 4 SWE-bench Lite instances, populated by hand
+from agent traces), recomputes the per-backend averages from the raw
+cells, and asserts they match the summary block. Then prints a
+dominance table.
+
+A live runner (that actually launches each backend, sends the issue
+text, captures the agent's `files` list, and patch-tests the result)
+is out of scope for this snapshot and tracked separately.
+
+Usage:
+    python3 replay.py                # verify + print dominance table
+    python3 replay.py --json         # print raw recomputed summary as JSON
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+from statistics import mean
+
+SNAPSHOT = Path(__file__).resolve().parent / "results.json"
+
+
+def recompute(snapshot: dict) -> dict:
+    by_backend: dict[str, dict] = {}
+    cells_by_backend: dict[str, list[dict]] = {}
+    for cell in snapshot["cells"]:
+        cells_by_backend.setdefault(cell["backend"], []).append(cell)
+
+    n_tasks = len(snapshot["tasks"])
+    for backend, cells in cells_by_backend.items():
+        recall_hits = sum(1 for c in cells if c["recall"])
+        top1_hits = sum(1 for c in cells if c["top_1"])
+        by_backend[backend] = {
+            "recall": f"{recall_hits}/{n_tasks}",
+            "top_1": f"{top1_hits}/{n_tasks}",
+            "avg_tool_calls": round(mean(c["tool_calls"] for c in cells), 2),
+            "avg_wall_seconds": round(mean(c["wall_seconds"] for c in cells), 2),
+            "avg_tokens": round(mean(c["tokens"] for c in cells), 2),
+        }
+    return by_backend
+
+
+def verify(snapshot: dict, recomputed: dict) -> list[str]:
+    errors: list[str] = []
+    claimed = snapshot["summary"]["by_backend"]
+    for backend, claim in claimed.items():
+        actual = recomputed.get(backend)
+        if actual is None:
+            errors.append(f"{backend}: claimed in summary but has no cells")
+            continue
+        for key in ("recall", "top_1"):
+            if claim[key] != actual[key]:
+                errors.append(f"{backend}.{key}: claimed {claim[key]} != actual {actual[key]}")
+        for key in ("avg_tool_calls", "avg_wall_seconds", "avg_tokens"):
+            if abs(float(claim[key]) - float(actual[key])) > 0.01:
+                errors.append(f"{backend}.{key}: claimed {claim[key]} != actual {actual[key]}")
+    return errors
+
+
+def print_table(snapshot: dict, recomputed: dict) -> None:
+    backends = snapshot["backends"]
+    measurement = {
+        b: snapshot["summary"]["by_backend"][b].get("measurement")
+        for b in backends
+    }
+    rows = [("backend", "recall", "top-1", "avg calls", "avg wall (s)", "avg tokens")]
+    for backend in backends:
+        s = recomputed[backend]
+        label = backend + (" *" if measurement.get(backend) == "tool_output_only" else "")
+        rows.append((
+            label,
+            s["recall"],
+            s["top_1"],
+            f"{s['avg_tool_calls']:.2f}",
+            f"{s['avg_wall_seconds']:.2f}",
+            f"{s['avg_tokens']:,.0f}",
+        ))
+    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
+    sep = "  ".join("-" * w for w in widths)
+    for i, row in enumerate(rows):
+        print("  ".join(cell.ljust(widths[j]) for j, cell in enumerate(row)))
+        if i == 0:
+            print(sep)
+    if any(m == "tool_output_only" for m in measurement.values()):
+        print()
+        print("* tool-output-only measurement (subprocess time + stdout bytes/4),")
+        print("  driven by a fixed query plan, NOT an LLM agent loop. Not directly")
+        print("  comparable to rows without an asterisk — see RESULTS.md for details.")
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--json", action="store_true", help="emit recomputed summary as JSON")
+    parser.add_argument("--snapshot", type=Path, default=SNAPSHOT, help="path to results.json")
+    args = parser.parse_args()
+
+    snapshot = json.loads(args.snapshot.read_text(encoding="utf-8"))
+    recomputed = recompute(snapshot)
+    errors = verify(snapshot, recomputed)
+
+    if args.json:
+        print(json.dumps(recomputed, indent=2))
+    else:
+        print(f"source:     {snapshot['source']}")
+        print(f"frozen at:  {snapshot['frozen_at']}")
+        print(f"tasks:      {len(snapshot['tasks'])}  ({', '.join(t['id'] for t in snapshot['tasks'])})")
+        print(f"backends:   {len(snapshot['backends'])}  ({', '.join(snapshot['backends'])})")
+        print()
+        print_table(snapshot, recomputed)
+        print()
+        print("headline:", snapshot["summary"]["headline"])
+
+    if errors:
+        print(file=sys.stderr)
+        print("VERIFY FAILED — summary does not match cells:", file=sys.stderr)
+        for err in errors:
+            print(f"  - {err}", file=sys.stderr)
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/benchmarks/swe-lite/results.json b/benchmarks/swe-lite/results.json
new file mode 100644
index 0000000..6a64acc
--- /dev/null
+++ b/benchmarks/swe-lite/results.json
@@ -0,0 +1,65 @@
+{
+  "source": "SWE-bench Lite (princeton-nlp/SWE-bench_Lite) — file-localization shape",
+  "frozen_at": "2026-05-22T10:05Z",
+  "backend_versions": {
+    "codegraph": "0.9.3 (re-verified against v0.7.10 — file lists byte-identical)",
+    "codegraph_CONTEXT": "0.9.3"
+  },
+  "scoring": "deterministic file-path match against gold patch's changed_files (no LLM judge)",
+  "metric_definitions": {
+    "recall": "gold file appears anywhere in agent's `files` list",
+    "top_1": "agent's first `files` entry equals the gold file"
+  },
+  "measurement_notes": {
+    "default": "tokens reflect the agent's full context consumption (system prompt + tool defs + tool outputs + LLM reasoning); tool_calls and wall_seconds are end-to-end agent loop totals",
+    "tool_output_only": "tokens reflect only the tool's stdout bytes / 4 (no LLM context); tool_calls and wall_seconds reflect subprocess invocations driven by a fixed query plan, not an LLM-decided loop"
+  },
+  "tasks": [
+    {"id": "pallets__flask-4045",    "repo": "pallets/flask",   "base_commit": "d8c37f43724cd9fb0870f77877b7c4c7e38a19e0", "title": "Raise error when blueprint name contains a dot",      "gold_files": ["src/flask/blueprints.py"]},
+    {"id": "psf__requests-2148",     "repo": "psf/requests",    "base_commit": "fe693c492242ae532211e0c173324f09ca8cf227", "title": "socket.error exception not caught/wrapped in a requests exception", "gold_files": ["requests/models.py"]},
+    {"id": "psf__requests-2674",     "repo": "psf/requests",    "base_commit": "0be38a0c37c59c4b66ce908731da15b401655113", "title": "urllib3 exceptions passing through requests API",   "gold_files": ["requests/adapters.py"]},
+    {"id": "mwaskom__seaborn-2848",  "repo": "mwaskom/seaborn", "base_commit": "94621cef29f80282436d73e8d2c0aa76dab81273", "title": "pairplot fails with hue_order not containing all hue values", "gold_files": ["seaborn/_oldcore.py"]}
+  ],
+  "backends": ["codedb", "codedb_CONTEXT", "leanctx", "fts5_trigram", "codegraph", "codegraph_CONTEXT"],
+  "cells": [
+    {"task": "pallets__flask-4045",   "backend": "codedb",            "files": ["src/flask/blueprints.py"],                                                                       "tool_calls": 8,  "wall_seconds": 12,    "tokens": 17508, "recall": true, "top_1": true},
+    {"task": "pallets__flask-4045",   "backend": "codedb_CONTEXT",    "files": ["src/flask/blueprints.py"],                                                                       "tool_calls": 3,  "wall_seconds": 2,     "tokens": 14834, "recall": true, "top_1": true},
+    {"task": "pallets__flask-4045",   "backend": "leanctx",           "files": ["src/flask/blueprints.py"],                                                                       "tool_calls": 4,  "wall_seconds": 8,     "tokens": 17017, "recall": true, "top_1": true},
+    {"task": "pallets__flask-4045",   "backend": "fts5_trigram",      "files": ["src/flask/blueprints.py"],                                                                       "tool_calls": 13, "wall_seconds": 18,    "tokens": 16288, "recall": true, "top_1": true},
+    {"task": "pallets__flask-4045",   "backend": "codegraph",         "files": ["src/flask/blueprints.py", "src/flask/json/tag.py", "src/flask/wrappers.py", "src/flask/app.py"],  "tool_calls": 3,  "wall_seconds": 0.16,  "tokens": 2235,  "recall": true, "top_1": true,  "measurement": "tool_output_only"},
+    {"task": "pallets__flask-4045",   "backend": "codegraph_CONTEXT", "files": ["src/flask/blueprints.py", "src/flask/helpers.py", "src/flask/app.py", "src/flask/scaffold.py"],   "tool_calls": 1,  "wall_seconds": 0.11,  "tokens": 3788,  "recall": true, "top_1": true,  "measurement": "tool_output_only"},
+
+    {"task": "psf__requests-2148",    "backend": "codedb",            "files": ["requests/models.py", "requests/adapters.py", "requests/exceptions.py"],                          "tool_calls": 14, "wall_seconds": 18,    "tokens": 20439, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2148",    "backend": "codedb_CONTEXT",    "files": ["requests/models.py", "requests/exceptions.py"],                                                 "tool_calls": 2,  "wall_seconds": 1,     "tokens": 14516, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2148",    "backend": "leanctx",           "files": ["requests/models.py", "requests/adapters.py"],                                                   "tool_calls": 9,  "wall_seconds": 28,    "tokens": 32319, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2148",    "backend": "fts5_trigram",      "files": ["requests/models.py", "requests/adapters.py"],                                                   "tool_calls": 11, "wall_seconds": 18,    "tokens": 16427, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2148",    "backend": "codegraph",         "files": ["requests/models.py", "requests/adapters.py", "requests/sessions.py", "requests/exceptions.py"], "tool_calls": 3,  "wall_seconds": 0.16,  "tokens": 1501,  "recall": true,  "top_1": true,  "measurement": "tool_output_only"},
+    {"task": "psf__requests-2148",    "backend": "codegraph_CONTEXT", "files": ["requests/packages/urllib3/connection.py", "requests/packages/urllib3/util/ssl_.py"],             "tool_calls": 1,  "wall_seconds": 0.10,  "tokens": 3440,  "recall": false, "top_1": false, "measurement": "tool_output_only"},
+
+    {"task": "psf__requests-2674",    "backend": "codedb",            "files": ["requests/adapters.py", "requests/exceptions.py"],                                                "tool_calls": 23, "wall_seconds": 18,    "tokens": 24816, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2674",    "backend": "codedb_CONTEXT",    "files": ["requests/adapters.py", "requests/models.py", "requests/exceptions.py"],                         "tool_calls": 2,  "wall_seconds": 1,     "tokens": 14725, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2674",    "backend": "leanctx",           "files": ["requests/adapters.py"],                                                                          "tool_calls": 6,  "wall_seconds": 28,    "tokens": 28060, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2674",    "backend": "fts5_trigram",      "files": ["requests/adapters.py", "requests/exceptions.py"],                                                "tool_calls": 8,  "wall_seconds": 18,    "tokens": 22767, "recall": true,  "top_1": true},
+    {"task": "psf__requests-2674",    "backend": "codegraph",         "files": ["requests/adapters.py", "requests/packages/urllib3/exceptions.py", "requests/sessions.py"],       "tool_calls": 3,  "wall_seconds": 0.16,  "tokens": 1927,  "recall": true,  "top_1": true,  "measurement": "tool_output_only"},
+    {"task": "psf__requests-2674",    "backend": "codegraph_CONTEXT", "files": ["requests/packages/urllib3/exceptions.py", "requests/packages/urllib3/util/timeout.py"],          "tool_calls": 1,  "wall_seconds": 0.10,  "tokens": 3113,  "recall": false, "top_1": false, "measurement": "tool_output_only"},
+
+    {"task": "mwaskom__seaborn-2848", "backend": "codedb",            "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                                                    "tool_calls": 62, "wall_seconds": 120,   "tokens": 89054, "recall": true,  "top_1": false},
+    {"task": "mwaskom__seaborn-2848", "backend": "codedb_CONTEXT",    "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                                                    "tool_calls": 2,  "wall_seconds": 1,     "tokens": 14791, "recall": true,  "top_1": false},
+    {"task": "mwaskom__seaborn-2848", "backend": "leanctx",           "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                                                    "tool_calls": 20, "wall_seconds": 45,    "tokens": 43291, "recall": true,  "top_1": false},
+    {"task": "mwaskom__seaborn-2848", "backend": "fts5_trigram",      "files": ["seaborn/_oldcore.py", "seaborn/relational.py"],                                                  "tool_calls": 23, "wall_seconds": 45,    "tokens": 47720, "recall": true,  "top_1": true},
+    {"task": "mwaskom__seaborn-2848", "backend": "codegraph",         "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],                                                    "tool_calls": 3,  "wall_seconds": 0.18,  "tokens": 2262,  "recall": true,  "top_1": false, "measurement": "tool_output_only"},
+    {"task": "mwaskom__seaborn-2848", "backend": "codegraph_CONTEXT", "files": ["seaborn/_oldcore.py", "seaborn/axisgrid.py", "seaborn/_marks/base.py"],                          "tool_calls": 1,  "wall_seconds": 0.12,  "tokens": 6245,  "recall": true,  "top_1": true,  "measurement": "tool_output_only"}
+  ],
+  "summary": {
+    "by_backend": {
+      "codedb":            {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 26.75, "avg_wall_seconds": 42.0,   "avg_tokens": 37954.25},
+      "codedb_CONTEXT":    {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 2.25,  "avg_wall_seconds": 1.25,   "avg_tokens": 14716.5},
+      "leanctx":           {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 9.75,  "avg_wall_seconds": 27.25,  "avg_tokens": 30171.75},
+      "fts5_trigram":      {"recall": "4/4", "top_1": "4/4", "avg_tool_calls": 13.75, "avg_wall_seconds": 24.75,  "avg_tokens": 25800.5},
+      "codegraph":         {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 3.0,   "avg_wall_seconds": 0.165,  "avg_tokens": 1981.25, "measurement": "tool_output_only"},
+      "codegraph_CONTEXT": {"recall": "2/4", "top_1": "2/4", "avg_tool_calls": 1.0,   "avg_wall_seconds": 0.1075, "avg_tokens": 4146.5,  "measurement": "tool_output_only"}
+    },
+    "headline": "Six backends, four SWE-bench Lite instances. Quality is broadly similar — five of six achieve 4/4 recall (codegraph_CONTEXT is the only outlier at 2/4, missing both `requests` tasks by surfacing urllib3 internals over the requests-layer wrapper). Top-1 splits: fts5_trigram 4/4; codedb / codedb_CONTEXT / leanctx / codegraph tie at 3/4 (the seaborn axisgrid/_oldcore call-trace ordering); codegraph_CONTEXT at 2/4. Efficiency cells for the codegraph rows reflect subprocess-only measurement under a fixed query plan, not a full LLM agent loop — they are not directly comparable to the other rows' agent-loop numbers.",
+    "hypothesis": "If a comparable LLM-driven agent loop were run against codegraph's primitive surface, recall would likely hold (4/4 found on the deterministic file-path oracle is shape-independent), but tool_calls and tokens would rise to LLM-loop levels. The interesting open question is whether codegraph_CONTEXT's `requests`-task miss is fixable by prompt engineering (it surfaces urllib3, the gold file is requests/adapters.py / requests/models.py) or whether it reflects a graph-relevance bias toward leaf libraries over wrapper APIs."
+  }
+}