justrach · May 22, 2026 · chatgpt-codex-connector
diff --git a/benchmarks/swe-lite/README.md b/benchmarks/swe-lite/README.md
@@ -0,0 +1,64 @@
+# swe-lite
+
+A small **file-localization** hypothesis snapshot for code-retrieval
+backends, graded by a deterministic oracle (file-path match against
+the merged upstream patch) on 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench)
+instances × 6 backends.
+
+This folder is a sibling to [`../search-shootout`](../search-shootout),
+which uses hand-authored tasks over a React corpus and an LLM judge.
+The two views stress different things:
+
+| | `search-shootout/` | `swe-lite/` (this folder) |
+|---|---|---|
+| Corpus | facebook/react (one repo) | 3 upstream repos (flask, requests, seaborn) |
+| Tasks | hand-authored | merged upstream PRs |
+| Ground truth | hand-written `tasks.json` | gold patch's `changed_files` |
+| Oracle | LLM-as-judge, 5-point rubric | deterministic file-path match |
+| Risk | closed loop (same model family writes test + takes test) | independent ground truth |
+
+Both views together are stronger than either alone. The
+search-shootout grades *answer quality*; swe-lite grades
+*file-localization correctness*.
+
+## Files
+
+- [`results.json`](./results.json) — frozen snapshot: 4 tasks ×
+  6 backends, per-cell metrics, summary, hypothesis.
+- [`replay.py`](./replay.py) — loads `results.json`, recomputes the
+  per-backend averages from the raw cells, asserts the summary
+  matches, and prints the dominance table (with `*` annotation
+  for tool-output-only measurements).
+- [`RESULTS.md`](./RESULTS.md) — the publishable read of the data:
+  dominance table, what jumps out, the measurement caveat, and the
+  falsifiable hypothesis this snapshot supports.
+
+## Quick start
+
+```sh
+python3 replay.py
+```
+
+Prints the matrix and exits non-zero if any summary cell disagrees
+with what the raw cells imply. JSON form:
+
+```sh
+python3 replay.py --json
+```
+
+## What this is NOT
+
+- **Not a live SWE-bench runner.** Four of six rows (`codedb`,
+  `codedb_CONTEXT`, `leanctx`, `fts5_trigram`) were populated by
+  running each backend through an LLM agent loop and recording the
+  agent's `files` output; codegraph rows were freshly measured here
+  using a fixed query plan (subprocess only, no LLM in the loop).
+  See `RESULTS.md` §Measurement caveat.
+- **Not a patch-correctness eval.** Grades "did the agent name the
+  right file?", not "did the agent's patch make the failing tests
+  pass?". The latter is tracked as future work.
+- **Not a statistic.** n=4 is a sanity check, not a sample. The
+  doc is framed as a hypothesis snapshot, not a settled claim.
+
+See [`RESULTS.md`](./RESULTS.md) for the full list of caveats and
+the hypothesis statement.
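
Reviewer sketch, not part of the patch: the tool-output-only measurement the README describes (subprocess only, no LLM in the loop, with RESULTS.md estimating tokens as stdout bytes / 4) might look roughly like this. The helper name `measure_tool_call`, the stand-in command, and the choice to round down are assumptions for illustration; the actual harness and the codegraph invocations are not shown in this PR.

```python
import subprocess
import sys
import time


def measure_tool_call(argv: list[str]) -> dict:
    """Run one command; record wall time and a bytes/4 token estimate."""
    start = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True, check=False)
    wall = time.perf_counter() - start
    return {
        "wall_seconds": wall,
        "stdout_bytes": len(proc.stdout),
        # Heuristic quoted in RESULTS.md: tokens ~= stdout bytes / 4.
        # Rounding down is an assumption of this sketch.
        "tokens": len(proc.stdout) // 4,
    }


# Stand-in command (hypothetical): a real run would call the backend CLI.
cell = measure_tool_call([sys.executable, "-c", "print('x' * 399)"])
print(cell["stdout_bytes"], cell["tokens"])
```

Because only the tool's stdout is counted, a number produced this way leaves out the system prompt, tool definitions, and model reasoning that the agent-loop rows include.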
diff --git a/benchmarks/swe-lite/RESULTS.md b/benchmarks/swe-lite/RESULTS.md
@@ -0,0 +1,191 @@
+# SWE-bench Lite — file-localization, six backends
+
+Small file-localization snapshot: 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench)
+instances × 6 retrieval backends, graded by a deterministic oracle
+(does the agent name the file that the merged upstream patch actually
+edits?). Captured 2026-05-22. Codegraph rows re-verified at v0.9.3
+(released the same day) — file lists are byte-identical to v0.7.10,
+so the quality picture below isn't a version artifact.
+
+This is published as a **hypothesis snapshot**, not a settled
+dominance claim — n=4 is too small for statistics, and not all rows
+were measured the same way (see [Measurement caveat](#measurement-caveat)).
+The raw data is in [`results.json`](./results.json); recompute and
+verify the summary block with [`replay.py`](./replay.py).
+
+## Tasks
+
+| Instance | Repo | Gold file (the file the merged PR patched) |
+|---|---|---|
+| `pallets__flask-4045` | pallets/flask | `src/flask/blueprints.py` |
+| `psf__requests-2148` | psf/requests | `requests/models.py` |
+| `psf__requests-2674` | psf/requests | `requests/adapters.py` |
+| `mwaskom__seaborn-2848` | mwaskom/seaborn | `seaborn/_oldcore.py` |
+
+Each instance's `base_commit` is pinned in `results.json` so the
+state can be rebuilt.
+
+## Backends
+
+Six backend rows covering four tools, two of which (codedb and
+codegraph) ship in two surfaces: a primitive "search" surface and a
+task-shaped "build context for this query" surface. Both surfaces
+are reported separately when they exist; comparing one tool's
+primitive surface against another's deployed surface misleads.
+
+| Backend | What it is | Surface |
+|---|---|---|
+| `codedb` | This repo. Zig trigram + word index. | primitive (`search`, `find`, `word`, `outline`) |
+| `codedb_CONTEXT` | This repo's MCP composer | task-shaped (single call) |
+| `leanctx` | yvgude/lean-ctx, BM25-ish word index | primitive |
+| `fts5_trigram` | SQLite FTS5 with `trigram` tokenizer | primitive |
+| `codegraph` | TS+SQLite code-graph (`codegraph query`) | primitive |
+| `codegraph_CONTEXT` | codegraph's task composer (`codegraph context`) | task-shaped |
+
+## Oracle
+
+Deterministic, no LLM judge:
+
+- **recall** — gold file appears anywhere in the agent's `files` list
+- **top-1** — the agent's *first* listed file equals the gold file
+
+The agent doesn't have to write a patch — only name the file it
+would edit. This is an intermediate signal: weaker than patch
+correctness, but stronger than judge-graded quality because there's
+no model in the oracle loop.
+
+## Headline
+
+```
+backend              recall  top-1  avg calls  avg wall (s)  avg tokens
+-------------------  ------  -----  ---------  ------------  ----------
+codedb               4/4     3/4    26.75      42.00         37,954
+codedb_CONTEXT       4/4     3/4     2.25       1.25         14,716
+leanctx              4/4     3/4     9.75      27.25         30,172
+fts5_trigram         4/4     4/4    13.75      24.75         25,800
+codegraph *          4/4     3/4     3.00       0.17          1,981
+codegraph_CONTEXT *  2/4     2/4     1.00       0.11          4,146
+```
+
+*\* Codegraph rows use a different measurement methodology — see
+[Measurement caveat](#measurement-caveat) before reading the
+efficiency cells.*
+
+## What jumps out
+
+**Quality is mostly uniform.** Five of six backends fully recall the
+gold file (4/4). Top-1 splits across one task (`seaborn-2848`,
+discussed below): `fts5_trigram` 4/4, four others tied at 3/4.
+
+**`codegraph_CONTEXT` is the lone quality outlier.** It misses both
+`requests` tasks because the issue text mentions urllib3 keywords
+("socket", "urllib3", "DecodeError"), and the composer surfaces
+urllib3 internals over the requests-layer wrapper where the patch
+actually lands. This is the only cell where graph-relevance signal
+diverges sharply from patch-site relevance in this sample.
+
+**Among the apples-to-apples (agent-loop) rows, `codedb_CONTEXT`
+sits at the efficient end of the matched-quality cluster.** It ties
+`codedb` and `leanctx` on quality (4/4 recall, 3/4 top-1) and is the
+cheapest of the three on calls, wall time, and tokens.
+`fts5_trigram` is the only backend that reaches 4/4 top-1, at ~20×
+the wall time of `codedb_CONTEXT` (24.75 s vs 1.25 s).
+
+## The one task where top-1 split — `mwaskom__seaborn-2848`
+
+The seaborn bug surfaces as a `KeyError` raised inside
+`seaborn/_oldcore.py::SemanticMapping`, but the user-facing call site
+lives in `seaborn/axisgrid.py::PairGrid`. The merged upstream patch
+edits `_oldcore.py` (the root-cause site).
+
+Four backends (`codedb`, `codedb_CONTEXT`, `leanctx`, `codegraph`)
+named `axisgrid.py` first and `_oldcore.py` second — the order a
+developer would trace through. `fts5_trigram` and
+`codegraph_CONTEXT` named `_oldcore.py` first. Both orderings find
+the bug; "top-1 correctness" is really asking *which* ordering you
+want — the first file a developer would look at (call site) or the
+file the patch actually lands in (root cause).
+
+## Measurement caveat
+
+Codegraph rows (`codegraph` and `codegraph_CONTEXT`) were measured
+differently from the other four rows:
+
+- **Calls / wall:** codegraph numbers reflect subprocess invocations
+  driven by a fixed 3-query plan (primitive surface) or a single
+  `codegraph context` call (task surface). The other four rows
+  reflect a full LLM-driven agent loop that decides which queries
+  to run.
+- **Tokens:** codegraph numbers are stdout bytes / 4 (just the
+  tool's output). The other four rows include the agent's full
+  context (system prompt + tool defs + tool outputs + LLM
+  reasoning).
+
+Under a comparable LLM-driven loop, codegraph's tool_calls would
+likely rise (an LLM tends to make 5–15 queries when exploring) and
+tokens would rise to the agent-context level (~10–20× current
+values). What's NOT expected to change much: recall and top-1,
+since those depend on which files codegraph surfaces — and the file
+sets above are what codegraph actually returned for those queries.
+
+The takeaway is that codegraph's **quality** cells are directly
+comparable to other backends, and its **efficiency** cells are not.
+This is annotated in the table with `*` and in `results.json` via
+the `measurement: tool_output_only` field.
+
+## Other caveats — read before quoting these numbers
+
+1. **n=4 is small.** Four SWE-bench Lite instances is a sanity
+   check, not a statistic. Don't read "3/4 top-1" as "75% top-1 on
+   SWE-bench Lite".
+2. **File-localization ≠ patch-correctness.** This bench grades
+   whether the agent names the right file. It does not run the
+   agent end-to-end, generate a patch, or check whether the patch
+   makes the failing tests pass. An end-to-end `pass@1` eval is the
+   metric that actually matters; this is one rung below it on the
+   ladder.
+3. **Snapshot, not live.** `results.json` is a frozen record.
+   `replay.py` recomputes the averages from the cells and verifies
+   the summary block matches, but does not re-launch the four
+   non-codegraph backends. Codegraph rows *were* freshly measured
+   while preparing this snapshot.
+4. **The seaborn top-1 split is a metric artifact, not a backend
+   weakness.** Four of six backends order files by traceability
+   rather than by patch site. The split says more about top-1 as a
+   metric than about any individual backend.
+
+## Hypothesis
+
+Stated as something to falsify, not declare:
+
+> Among compared backends, **`codedb_CONTEXT`** is the cheapest
+> backend in the matched-quality cluster (3/4 top-1, 4/4 recall) on
+> file-localization. **`fts5_trigram`** is the only backend that
+> currently reaches 4/4 top-1, and it does so at ~20× the wall time
+> of `codedb_CONTEXT`. The expected next-step result, if a live
+> agent-loop runner is built and codegraph is re-measured under
+> matched methodology, is: **codegraph (primitive) joins the
+> 3/4-top-1 cluster at agent-loop call counts somewhere between
+> codedb_CONTEXT's 2.25 and leanctx's 9.75, with comparable
+> tokens.**
+
+This hypothesis is **falsifiable** by:
+
+- Building a live LLM-loop runner and re-measuring codegraph at
+  agent-loop methodology.
+- Expanding to 20–50 SWE-bench Lite instances; at that sample size
+  the quality differences (or lack of them) become statistically testable.
+- Adding a patch-correctness oracle (apply the agent's patch
+  against the pinned `base_commit` and run the failing tests).
+
+Until any of those is done, treat the headline as **directional**, not
+quantitative.
+
+## Future work
+
+- A live runner that actually invokes each backend per task with a
+  consistent LLM agent loop, so all rows are measured the same way.
+- A patch-correctness oracle.
+- More tasks.
+- Quality cells are reproducible under the existing oracle; the
+  items above mainly calibrate the efficiency cells.
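
Reviewer sketch, not part of the patch: the oracle defined in RESULTS.md (recall and top-1 as file-path matches) can be written in a few lines. The function name `grade` and the use of exact string comparison with no path normalization are assumptions; the file paths come from the seaborn discussion in RESULTS.md.

```python
def grade(files: list[str], gold: str) -> dict:
    """Deterministic oracle: compare the agent's file list to the gold file."""
    return {
        # recall: gold file appears anywhere in the list
        "recall": gold in files,
        # top-1: the first listed file is the gold file
        "top_1": bool(files) and files[0] == gold,
    }


gold = "seaborn/_oldcore.py"

# Ordering reported for four backends: call site first, root cause second.
call_site_first = grade(["seaborn/axisgrid.py", "seaborn/_oldcore.py"], gold)

# Ordering reported for the other two: root cause first (only the first
# entry is stated in RESULTS.md, so the list is left at one element).
root_cause_first = grade(["seaborn/_oldcore.py"], gold)

print(call_site_first)   # {'recall': True, 'top_1': False}
print(root_cause_first)  # {'recall': True, 'top_1': True}
```

Both orderings score full recall, and only the second scores top-1, which is the split that the seaborn section attributes to the metric rather than to any backend.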
diff --git a/benchmarks/swe-lite/replay.py b/benchmarks/swe-lite/replay.py
@@ -0,0 +1,127 @@
+#!/usr/bin/env python3
+"""Replay + verify the SWE-bench Lite file-localization snapshot.
+
+This is NOT a live SWE-bench runner. It loads `results.json` (a frozen
+record of agent runs on 4 SWE-bench Lite instances, populated by hand
+from agent traces), recomputes the per-backend averages from the raw
+cells, and asserts they match the summary block. Then prints a
+dominance table.
+
+A live runner (that actually launches each backend, sends the issue
+text, captures the agent's `files` list, and patch-tests the result)
+is out of scope for this snapshot and tracked separately.
+
+Usage:
+    python3 replay.py                # verify + print dominance table
+    python3 replay.py --json         # print raw recomputed summary as JSON
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+from statistics import mean
+
+SNAPSHOT = Path(__file__).resolve().parent / "results.json"
+
+
+def recompute(snapshot: dict) -> dict:
+    by_backend: dict[str, dict] = {}
+    cells_by_backend: dict[str, list[dict]] = {}
+    for cell in snapshot["cells"]:
+        cells_by_backend.setdefault(cell["backend"], []).append(cell)
+
+    n_tasks = len(snapshot["tasks"])
+    for backend, cells in cells_by_backend.items():
+        recall_hits = sum(1 for c in cells if c["recall"])
+        top1_hits = sum(1 for c in cells if c["top_1"])
+        by_backend[backend] = {
+            "recall": f"{recall_hits}/{n_tasks}",
+            "top_1": f"{top1_hits}/{n_tasks}",
+            "avg_tool_calls": round(mean(c["tool_calls"] for c in cells), 2),
+            "avg_wall_seconds": round(mean(c["wall_seconds"] for c in cells), 2),
+            "avg_tokens": round(mean(c["tokens"] for c in cells), 2),
+        }
+    return by_backend
+
+
+def verify(snapshot: dict, recomputed: dict) -> list[str]:
+    errors: list[str] = []
+    claimed = snapshot["summary"]["by_backend"]
+    for backend, claim in claimed.items():
+        actual = recomputed.get(backend)
+        if actual is None:
+            errors.append(f"{backend}: claimed in summary but has no cells")
+            continue
+        for key in ("recall", "top_1"):
+            if claim[key] != actual[key]:
+                errors.append(f"{backend}.{key}: claimed {claim[key]} != actual {actual[key]}")
+        for key in ("avg_tool_calls", "avg_wall_seconds", "avg_tokens"):
+            if abs(float(claim[key]) - float(actual[key])) > 0.01:  # tolerance for 2-decimal rounding
+                errors.append(f"{backend}.{key}: claimed {claim[key]} != actual {actual[key]}")
+    return errors
+
+
+def print_table(snapshot: dict, recomputed: dict) -> None:
+    backends = snapshot["backends"]
+    measurement = {
+        b: snapshot["summary"]["by_backend"][b].get("measurement")
+        for b in backends
+    }
+    rows = [("backend", "recall", "top-1", "avg calls", "avg wall (s)", "avg tokens")]
+    for backend in backends:
+        s = recomputed[backend]
+        label = backend + (" *" if measurement.get(backend) == "tool_output_only" else "")
+        rows.append((
+            label,
+            s["recall"],
+            s["top_1"],
+            f"{s['avg_tool_calls']:.2f}",
+            f"{s['avg_wall_seconds']:.2f}",
+            f"{s['avg_tokens']:,.0f}",
+        ))
+    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
+    sep = "  ".join("-" * w for w in widths)
+    for i, row in enumerate(rows):
+        print("  ".join(cell.ljust(widths[j]) for j, cell in enumerate(row)))
+        if i == 0:
+            print(sep)
+    if any(m == "tool_output_only" for m in measurement.values()):
+        print()
+        print("* tool-output-only measurement (subprocess time + stdout bytes/4),")
+        print("  driven by a fixed query plan, NOT an LLM agent loop. Not directly")
+        print("  comparable to rows without an asterisk — see RESULTS.md for details.")
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    parser.add_argument("--json", action="store_true", help="emit recomputed summary as JSON")
+    parser.add_argument("--snapshot", type=Path, default=SNAPSHOT, help="path to results.json")
+    args = parser.parse_args()
+
+    snapshot = json.loads(args.snapshot.read_text(encoding="utf-8"))
+    recomputed = recompute(snapshot)
+    errors = verify(snapshot, recomputed)
+
+    if args.json:
+        print(json.dumps(recomputed, indent=2))
+    else:
+        print(f"source:     {snapshot['source']}")
+        print(f"frozen at:  {snapshot['frozen_at']}")
+        print(f"tasks:      {len(snapshot['tasks'])}  ({', '.join(t['id'] for t in snapshot['tasks'])})")
+        print(f"backends:   {len(snapshot['backends'])}  ({', '.join(snapshot['backends'])})")
+        print()
+        print_table(snapshot, recomputed)
+        print()
+        print("headline:", snapshot["summary"]["headline"])
+
+    if errors:
+        print(file=sys.stderr)
+        print("VERIFY FAILED — summary does not match cells:", file=sys.stderr)
+        for err in errors:
+            print(f"  - {err}", file=sys.stderr)
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
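
Reviewer sketch, not part of the patch: `results.json` itself is not shown in this diff, so the shape below is inferred only from the keys `replay.py` reads. Every value is a made-up placeholder (including the backend name and task ids), and the per-cell `task` key is hypothetical because the script never reads it.

```python
import json
from statistics import mean

# Placeholder values only; nothing here is real benchmark data.
snapshot = {
    "source": "example",
    "frozen_at": "1970-01-01",
    "tasks": [{"id": "task-a"}, {"id": "task-b"}],
    "backends": ["example_backend"],
    "cells": [
        {"backend": "example_backend", "task": "task-a", "recall": True,
         "top_1": True, "tool_calls": 2, "wall_seconds": 1.0, "tokens": 100},
        {"backend": "example_backend", "task": "task-b", "recall": True,
         "top_1": False, "tool_calls": 3, "wall_seconds": 2.0, "tokens": 300},
    ],
    "summary": {
        "headline": "placeholder",
        "by_backend": {
            "example_backend": {
                "recall": "2/2", "top_1": "1/2", "avg_tool_calls": 2.5,
                "avg_wall_seconds": 1.5, "avg_tokens": 200.0,
            }
        },
    },
}

# Same idea as verify(): recompute one average from the cells and compare
# it with the claimed summary value within the 0.01 tolerance.
cells = [c for c in snapshot["cells"] if c["backend"] == "example_backend"]
claim = snapshot["summary"]["by_backend"]["example_backend"]
avg_calls = round(mean(c["tool_calls"] for c in cells), 2)
assert abs(claim["avg_tool_calls"] - avg_calls) <= 0.01
print(json.dumps(claim, indent=2))
```

Saved as JSON, a file of this shape should be loadable through the script's existing `--snapshot` option, which makes it a convenient fixture for checking that a deliberately wrong summary value causes a non-zero exit.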