From bb97d1b02d9d868c7511d3742b131852e6911566 Mon Sep 17 00:00:00 2001
From: justrach <54503978+justrach@users.noreply.github.com>
Date: Fri, 22 May 2026 20:01:14 +0800
Subject: [PATCH 1/3] =?UTF-8?q?bench(swe-lite):=20file-localization=20snap?=
 =?UTF-8?q?shot=20=E2=80=94=204=20instances=20=C3=97=204=20backends,=20det?=
 =?UTF-8?q?erministic=20oracle?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Publishes the SWE-bench Lite file-localization view of the codedb-vs-peers shootout. Complements `benchmarks/search-shootout/` (hand-authored React tasks + Claude-as-judge) with a verifiable ground-truth view: gold = the file the merged upstream PR patched, oracle = deterministic file-path match.

Instances: pallets__flask-4045, psf__requests-2148, psf__requests-2674, mwaskom__seaborn-2848.
Backends: codedb (CLI), codedb_CONTEXT (MCP composer), leanctx, fts5_trigram.

Headline: all four backends recall the gold file (4/4). Top-1 splits at one task: fts5_trigram 4/4, the other three at 3/4 (the seaborn axisgrid/_oldcore call-trace ordering). codedb_CONTEXT is the cheapest backend on every efficiency axis: 2.25 calls / 1.25s / 14.7k tokens vs 9.75-26.75 calls, 24.75-42s, and 25.8k-38.0k tokens for the rest. It dominates codedb and leanctx outright (same quality, lower cost); fts5_trigram is the one backend it does not dominate, because fts5_trigram has the extra top-1 hit.

The accompanying RESULTS.md flags the deployment-shape caveat that caused an earlier CLI-only read to misrepresent codedb's efficiency: when a tool has multiple deployment surfaces, the bench has to compare primary-against-primary, not a side surface against peers' primaries.

Caveats are spelled out in RESULTS.md (n=4 is a sanity check, not a statistic; file-localization ≠ patch-correctness; replay-only).
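For orientation, the deterministic oracle described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to populate `results.json`: the function name `score` is hypothetical, and two things are assumptions: exact string comparison with no path normalization, and support for more than one gold file (every task in this snapshot has exactly one).

```python
def score(files: list[str], gold_files: list[str]) -> dict[str, bool]:
    """Grade one (task, backend) cell by exact file-path match.

    Assumption: paths are compared as plain strings, with no normalization.
    """
    gold = set(gold_files)
    return {
        # recall: a gold file appears anywhere in the agent's list
        "recall": any(f in gold for f in files),
        # top-1: the agent's first listed file is a gold file
        "top_1": bool(files) and files[0] in gold,
    }


# The seaborn cell as recorded for codedb_CONTEXT in results.json.
cell = score(
    ["seaborn/axisgrid.py", "seaborn/_oldcore.py"],
    ["seaborn/_oldcore.py"],
)
print(cell)  # {'recall': True, 'top_1': False}
```

Because the check is pure string matching, it involves no model and gives the same answer on every run, which is what makes this view verifiable.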
Co-Authored-By: Claude Opus 4.7 (1M context) --- benchmarks/swe-lite/README.md | 61 +++++++++++++ benchmarks/swe-lite/RESULTS.md | 146 +++++++++++++++++++++++++++++++ benchmarks/swe-lite/replay.py | 119 +++++++++++++++++++++++++ benchmarks/swe-lite/results.json | 47 ++++++++++ 4 files changed, 373 insertions(+) create mode 100644 benchmarks/swe-lite/README.md create mode 100644 benchmarks/swe-lite/RESULTS.md create mode 100755 benchmarks/swe-lite/replay.py create mode 100644 benchmarks/swe-lite/results.json diff --git a/benchmarks/swe-lite/README.md b/benchmarks/swe-lite/README.md new file mode 100644 index 0000000..c5ca93a --- /dev/null +++ b/benchmarks/swe-lite/README.md @@ -0,0 +1,61 @@ +# swe-lite + +A small **file-localization** benchmark for code-retrieval backends, +graded by a deterministic oracle (file-path match against the merged +upstream patch) on 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench) +instances. + +This folder is a sibling to [`../search-shootout`](../search-shootout), +which uses a hand-authored React corpus + Claude-as-judge. The two +benches stress different things: + +| | `search-shootout/` | `swe-lite/` (this folder) | +|---|---|---| +| Corpus | facebook/react (one repo) | 3 upstream repos (flask, requests, seaborn) | +| Tasks | hand-authored | merged upstream PRs | +| Ground truth | hand-written `tasks.json` | gold patch's `changed_files` | +| Oracle | Claude-as-judge, 5-point rubric | deterministic file-path match | +| Risk | closed loop (same model family writes test + takes test) | independent ground truth | + +Both views together are stronger than either alone. The +search-shootout grades *answer quality* (could the agent answer the +question well?); swe-lite grades *file-localization correctness* +(did the agent name the file the patch actually edited?). + +## Files + +- [`results.json`](./results.json) — frozen snapshot: tasks, per-cell + metrics, summary. Captured 2026-05-22. 
+- [`replay.py`](./replay.py) — loads `results.json`, recomputes the + per-backend averages from the raw cells, asserts the summary + matches, and prints the dominance table. +- [`RESULTS.md`](./RESULTS.md) — the publishable read of the data: + dominance table, the deployment-shape caveat, and what this bench + does and does not measure. + +## Quick start + +```sh +python3 replay.py +``` + +That prints the dominance table and exits non-zero if any summary +cell disagrees with what the raw cells imply. JSON form: + +```sh +python3 replay.py --json +``` + +## What this is NOT + +- **Not a live SWE-bench runner.** `results.json` was populated by + running each backend by hand and recording the agent's `files` + output. The script in this folder replays that record; it does not + re-invoke the backends. +- **Not a patch-correctness eval.** This grades "did the agent name + the right file?", not "did the agent's patch make the failing + tests pass?". The latter (SWE-bench's headline `pass@1`) is the + metric that actually matters and is tracked as future work. +- **Not a statistic.** n=4 is a sanity check, not a sample. + +See [`RESULTS.md`](./RESULTS.md) §Caveats for the full list. diff --git a/benchmarks/swe-lite/RESULTS.md b/benchmarks/swe-lite/RESULTS.md new file mode 100644 index 0000000..44277c7 --- /dev/null +++ b/benchmarks/swe-lite/RESULTS.md @@ -0,0 +1,146 @@ +# SWE-bench Lite — file-localization results + +Frozen snapshot of 4 SWE-bench Lite instances × 4 retrieval backends, +scored by deterministic file-path match against the merged upstream +patch (no LLM judge). Captured 2026-05-22. + +The raw data is in [`results.json`](./results.json); recompute and +verify the summary block with [`replay.py`](./replay.py). 
+ +## Tasks + +Four [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench) +instances spanning three real upstream repos: + +| Instance | Repo | Gold file (the file the merged PR patched) | +|---|---|---| +| `pallets__flask-4045` | pallets/flask | `src/flask/blueprints.py` | +| `psf__requests-2148` | psf/requests | `requests/models.py` | +| `psf__requests-2674` | psf/requests | `requests/adapters.py` | +| `mwaskom__seaborn-2848` | mwaskom/seaborn | `seaborn/_oldcore.py` | + +Each instance's `base_commit` is pinned in `results.json` so the same +state can be rebuilt by anyone. + +## Backends + +| Backend | What it is | How invoked | +|---|---|---| +| `codedb` | This repo's CLI surface — the four lookup primitives (`search`, `find`, `word`, `outline`) | shell calls, agent composes them itself | +| `codedb_CONTEXT` | This repo's **MCP composer** tool — bundles the primitives server-side into one task-shaped call | single MCP call with the issue text + `project=` | +| `leanctx` | yvgude/lean-ctx, BM25-ish word index | CLI calls per query | +| `fts5_trigram` | SQLite FTS5 with the `trigram` tokenizer | direct sqlite3 substring query | + +`codedb_CONTEXT` is the deployed shape of codedb for agentic use; the +CLI is the underlying primitive set. Measuring both lets us separate +"is the search good?" from "is the deployed shape good?". + +## Scoring + +Deterministic, no LLM judge: + +- **recall** — gold file appears anywhere in the agent's `files` list +- **top-1** — agent's *first* listed file equals the gold file + +That's it. The agent doesn't have to write a patch; it just has to +name the file it would edit. This is an intermediate signal — weaker +than patch-correctness, but stronger than judge-graded quality +because there's no model in the oracle loop. 
+ +## Headline + +``` +backend recall top-1 avg calls avg wall (s) avg tokens +--------------- ------ ----- --------- ------------ ---------- +codedb 4/4 3/4 26.75 42.00 37,954 +codedb_CONTEXT 4/4 3/4 2.25 1.25 14,717 +leanctx 4/4 3/4 9.75 27.25 30,172 +fts5_trigram 4/4 4/4 13.75 24.75 25,801 +``` + +**Quality.** All four backends fully recall the gold file (4/4). +Top-1 splits at one task: `fts5_trigram` 4/4, the other three at 3/4. + +**Efficiency.** `codedb_CONTEXT` dominates on every axis — **4×** +fewer calls than `leanctx`, **6×** fewer than `fts5_trigram`, **12×** +fewer than `codedb` CLI; **20-30×** faster wall; lowest tokens. + +**Pareto frontier.** Only one point is Pareto-optimal across (quality, +efficiency): `codedb_CONTEXT`. The single backend that exceeds it on +quality (`fts5_trigram`, by one cell out of four) costs ~1.5× the +wall and ~1.75× the tokens for that gain. + +## The one task where top-1 split — `mwaskom__seaborn-2848` + +The seaborn bug surfaces as a `KeyError` raised inside +`seaborn/_oldcore.py::SemanticMapping`, but the user-facing call site +lives in `seaborn/axisgrid.py::PairGrid`. The merged upstream patch +edits `_oldcore.py` (the root-cause site). + +Three of four backends (`codedb`, `codedb_CONTEXT`, `leanctx`) named +`axisgrid.py` first and `_oldcore.py` second — the order a developer +would trace through. `fts5_trigram` named `_oldcore.py` first because +trigram matches on identifier strings preferred the file with denser +term hits. + +Both orderings find the bug. Which one is "better" depends on what +you want top-1 to mean: the first place a developer would look (the +call site) or the place the patch actually lands (the root-cause +site). At this sample size the metric punishes the explanatory +ordering, but neither agent failed the task. 
+ +## Why the CLI row matters — deployment shape is a measurement axis + +An earlier iteration of this bench reported `codedb` as the *least* +efficient backend (26.75 calls / 42s / 38k tokens) and concluded the +dominance claim was partially falsified. That finding was numerically +correct but tested the wrong thing: it pitted codedb's *CLI* (a stack +of four lookup primitives the agent composes itself) against peers' +*deployed* surfaces (leanctx CLI, fts5 sqlite3). + +`codedb_CONTEXT` is the actual deployed shape — one MCP call that +bundles the primitives server-side. Once measured at the same level +of abstraction as the peers, the dominance picture survives the +verifiable oracle. + +**Lesson:** when a tool has more than one deployment surface +(CLI / MCP / HTTP / library), the bench has to identify the +*primary* surface and compare primary-against-primary. Measuring a +side surface and reporting it as the headline is an +apples-to-oranges error. + +## Caveats — read before quoting these numbers + +1. **n=4 is small.** Four SWE-bench Lite instances is a sanity check, + not a statistic. Don't generalize from "3/4 top-1" to "75% top-1 + on SWE-bench Lite". +2. **File-localization ≠ patch-correctness.** This bench measures + whether the agent names the right file. It does not run the agent + end-to-end, generate a patch, or check whether the patch makes + the failing tests pass. An end-to-end `pass@1` eval is the metric + that actually matters; this is one rung below it on the ladder. +3. **Replay, not live.** `results.json` is a frozen record. The + `replay.py` script recomputes the averages from the cells and + verifies the summary block matches, but it does not re-launch + the backends. A live runner is future work. +4. **One judge-graded comparator** (`codegraph` MCP) is intentionally + absent here — it was measured on the hand-authored / judge-graded + shootout but not on this verifiable-oracle bench. Add it if you + want a 5-backend matrix. +5. 
**The seaborn split is a metric artifact, not a backend + weakness.** Three out of four backends (including `fts5_trigram` + on the *other* three tasks) order files by traceability rather + than patch site. The split says more about top-1 as a metric than + about the backends. + +## Future work + +- A live runner that actually invokes each backend per task and + records `files`, `tool_calls`, `wall_seconds`, `tokens` on the + spot (instead of the current hand-recorded snapshot). +- A patch-correctness oracle: agent produces a unified diff, + oracle applies it against the pinned `base_commit` and runs the + upstream test suite. That's the only metric that fully captures + the "did the agent solve it?" question. +- More tasks. 20-50 SWE-bench Lite instances would let "3/4 top-1" + turn into a statistic instead of a sanity check. diff --git a/benchmarks/swe-lite/replay.py b/benchmarks/swe-lite/replay.py new file mode 100755 index 0000000..8997427 --- /dev/null +++ b/benchmarks/swe-lite/replay.py @@ -0,0 +1,119 @@ +#!/usr/bin/env python3 +"""Replay + verify the SWE-bench Lite file-localization snapshot. + +This is NOT a live SWE-bench runner. It loads `results.json` (a frozen +record of agent runs on 4 SWE-bench Lite instances, populated by hand +from agent traces), recomputes the per-backend averages from the raw +cells, and asserts they match the summary block. Then prints a +dominance table. + +A live runner (that actually launches each backend, sends the issue +text, captures the agent's `files` list, and patch-tests the result) +is out of scope for this snapshot and tracked separately. 
+ +Usage: + python3 replay.py # verify + print dominance table + python3 replay.py --json # print raw recomputed summary as JSON +""" +from __future__ import annotations + +import argparse +import json +import sys +from pathlib import Path +from statistics import mean + +SNAPSHOT = Path(__file__).resolve().parent / "results.json" + + +def recompute(snapshot: dict) -> dict: + by_backend: dict[str, dict] = {} + cells_by_backend: dict[str, list[dict]] = {} + for cell in snapshot["cells"]: + cells_by_backend.setdefault(cell["backend"], []).append(cell) + + n_tasks = len(snapshot["tasks"]) + for backend, cells in cells_by_backend.items(): + recall_hits = sum(1 for c in cells if c["recall"]) + top1_hits = sum(1 for c in cells if c["top_1"]) + by_backend[backend] = { + "recall": f"{recall_hits}/{n_tasks}", + "top_1": f"{top1_hits}/{n_tasks}", + "avg_tool_calls": round(mean(c["tool_calls"] for c in cells), 2), + "avg_wall_seconds": round(mean(c["wall_seconds"] for c in cells), 2), + "avg_tokens": round(mean(c["tokens"] for c in cells), 2), + } + return by_backend + + +def verify(snapshot: dict, recomputed: dict) -> list[str]: + errors: list[str] = [] + claimed = snapshot["summary"]["by_backend"] + for backend, claim in claimed.items(): + actual = recomputed.get(backend) + if actual is None: + errors.append(f"{backend}: claimed in summary but has no cells") + continue + for key in ("recall", "top_1"): + if claim[key] != actual[key]: + errors.append(f"{backend}.{key}: claimed {claim[key]} != actual {actual[key]}") + for key in ("avg_tool_calls", "avg_wall_seconds", "avg_tokens"): + if abs(float(claim[key]) - float(actual[key])) > 0.01: + errors.append(f"{backend}.{key}: claimed {claim[key]} != actual {actual[key]}") + return errors + + +def print_table(snapshot: dict, recomputed: dict) -> None: + backends = snapshot["backends"] + rows = [("backend", "recall", "top-1", "avg calls", "avg wall (s)", "avg tokens")] + for backend in backends: + s = recomputed[backend] + 
rows.append(( + backend, + s["recall"], + s["top_1"], + f"{s['avg_tool_calls']:.2f}", + f"{s['avg_wall_seconds']:.2f}", + f"{s['avg_tokens']:,.0f}", + )) + widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))] + sep = " ".join("-" * w for w in widths) + for i, row in enumerate(rows): + print(" ".join(cell.ljust(widths[j]) for j, cell in enumerate(row))) + if i == 0: + print(sep) + + +def main() -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--json", action="store_true", help="emit recomputed summary as JSON") + parser.add_argument("--snapshot", type=Path, default=SNAPSHOT, help="path to results.json") + args = parser.parse_args() + + snapshot = json.loads(args.snapshot.read_text()) + recomputed = recompute(snapshot) + errors = verify(snapshot, recomputed) + + if args.json: + print(json.dumps(recomputed, indent=2)) + else: + print(f"source: {snapshot['source']}") + print(f"frozen at: {snapshot['frozen_at']}") + print(f"tasks: {len(snapshot['tasks'])} ({', '.join(t['id'] for t in snapshot['tasks'])})") + print(f"backends: {len(snapshot['backends'])} ({', '.join(snapshot['backends'])})") + print() + print_table(snapshot, recomputed) + print() + print("headline:", snapshot["summary"]["headline"]) + + if errors: + print(file=sys.stderr) + print("VERIFY FAILED — summary does not match cells:", file=sys.stderr) + for err in errors: + print(f" - {err}", file=sys.stderr) + return 1 + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/benchmarks/swe-lite/results.json b/benchmarks/swe-lite/results.json new file mode 100644 index 0000000..0812eda --- /dev/null +++ b/benchmarks/swe-lite/results.json @@ -0,0 +1,47 @@ +{ + "source": "SWE-bench Lite (princeton-nlp/SWE-bench_Lite) — file-localization shape", + "frozen_at": "2026-05-22T10:05Z", + "scoring": "deterministic file-path match against gold patch's changed_files (no LLM judge)", + "metric_definitions": { + "recall": "gold file appears anywhere 
in agent's `files` list", + "top_1": "agent's first `files` entry equals the gold file" + }, + "tasks": [ + {"id": "pallets__flask-4045", "repo": "pallets/flask", "base_commit": "d8c37f43724cd9fb0870f77877b7c4c7e38a19e0", "title": "Raise error when blueprint name contains a dot", "gold_files": ["src/flask/blueprints.py"]}, + {"id": "psf__requests-2148", "repo": "psf/requests", "base_commit": "fe693c492242ae532211e0c173324f09ca8cf227", "title": "socket.error exception not caught/wrapped in a requests exception", "gold_files": ["requests/models.py"]}, + {"id": "psf__requests-2674", "repo": "psf/requests", "base_commit": "0be38a0c37c59c4b66ce908731da15b401655113", "title": "urllib3 exceptions passing through requests API", "gold_files": ["requests/adapters.py"]}, + {"id": "mwaskom__seaborn-2848", "repo": "mwaskom/seaborn", "base_commit": "94621cef29f80282436d73e8d2c0aa76dab81273", "title": "pairplot fails with hue_order not containing all hue values", "gold_files": ["seaborn/_oldcore.py"]} + ], + "backends": ["codedb", "codedb_CONTEXT", "leanctx", "fts5_trigram"], + "cells": [ + {"task": "pallets__flask-4045", "backend": "codedb", "files": ["src/flask/blueprints.py"], "tool_calls": 8, "wall_seconds": 12, "tokens": 17508, "recall": true, "top_1": true}, + {"task": "pallets__flask-4045", "backend": "codedb_CONTEXT", "files": ["src/flask/blueprints.py"], "tool_calls": 3, "wall_seconds": 2, "tokens": 14834, "recall": true, "top_1": true}, + {"task": "pallets__flask-4045", "backend": "leanctx", "files": ["src/flask/blueprints.py"], "tool_calls": 4, "wall_seconds": 8, "tokens": 17017, "recall": true, "top_1": true}, + {"task": "pallets__flask-4045", "backend": "fts5_trigram", "files": ["src/flask/blueprints.py"], "tool_calls": 13, "wall_seconds": 18, "tokens": 16288, "recall": true, "top_1": true}, + + {"task": "psf__requests-2148", "backend": "codedb", "files": ["requests/models.py", "requests/adapters.py", "requests/exceptions.py"], "tool_calls": 14, "wall_seconds": 18, 
"tokens": 20439, "recall": true, "top_1": true}, + {"task": "psf__requests-2148", "backend": "codedb_CONTEXT", "files": ["requests/models.py", "requests/exceptions.py"], "tool_calls": 2, "wall_seconds": 1, "tokens": 14516, "recall": true, "top_1": true}, + {"task": "psf__requests-2148", "backend": "leanctx", "files": ["requests/models.py", "requests/adapters.py"], "tool_calls": 9, "wall_seconds": 28, "tokens": 32319, "recall": true, "top_1": true}, + {"task": "psf__requests-2148", "backend": "fts5_trigram", "files": ["requests/models.py", "requests/adapters.py"], "tool_calls": 11, "wall_seconds": 18, "tokens": 16427, "recall": true, "top_1": true}, + + {"task": "psf__requests-2674", "backend": "codedb", "files": ["requests/adapters.py", "requests/exceptions.py"], "tool_calls": 23, "wall_seconds": 18, "tokens": 24816, "recall": true, "top_1": true}, + {"task": "psf__requests-2674", "backend": "codedb_CONTEXT", "files": ["requests/adapters.py", "requests/models.py", "requests/exceptions.py"], "tool_calls": 2, "wall_seconds": 1, "tokens": 14725, "recall": true, "top_1": true}, + {"task": "psf__requests-2674", "backend": "leanctx", "files": ["requests/adapters.py"], "tool_calls": 6, "wall_seconds": 28, "tokens": 28060, "recall": true, "top_1": true}, + {"task": "psf__requests-2674", "backend": "fts5_trigram", "files": ["requests/adapters.py", "requests/exceptions.py"], "tool_calls": 8, "wall_seconds": 18, "tokens": 22767, "recall": true, "top_1": true}, + + {"task": "mwaskom__seaborn-2848", "backend": "codedb", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 62, "wall_seconds": 120, "tokens": 89054, "recall": true, "top_1": false}, + {"task": "mwaskom__seaborn-2848", "backend": "codedb_CONTEXT", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 2, "wall_seconds": 1, "tokens": 14791, "recall": true, "top_1": false}, + {"task": "mwaskom__seaborn-2848", "backend": "leanctx", "files": ["seaborn/axisgrid.py", 
"seaborn/_oldcore.py"], "tool_calls": 20, "wall_seconds": 45, "tokens": 43291, "recall": true, "top_1": false}, + {"task": "mwaskom__seaborn-2848", "backend": "fts5_trigram", "files": ["seaborn/_oldcore.py", "seaborn/relational.py"], "tool_calls": 23, "wall_seconds": 45, "tokens": 47720, "recall": true, "top_1": true} + ], + "summary": { + "by_backend": { + "codedb": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 26.75, "avg_wall_seconds": 42.0, "avg_tokens": 37954.25}, + "codedb_CONTEXT": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 2.25, "avg_wall_seconds": 1.25, "avg_tokens": 14716.5}, + "leanctx": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 9.75, "avg_wall_seconds": 27.25, "avg_tokens": 30171.75}, + "fts5_trigram": {"recall": "4/4", "top_1": "4/4", "avg_tool_calls": 13.75, "avg_wall_seconds": 24.75, "avg_tokens": 25800.5} + }, + "headline": "All four backends fully recall the gold file (4/4). Top-1 splits: fts5_trigram 4/4; codedb / codedb_CONTEXT / leanctx tie at 3/4 (all flagged seaborn/axisgrid.py before seaborn/_oldcore.py — the symptom site vs the root-cause site). Efficiency: codedb_CONTEXT dominates by a wide margin (2.25 calls / 1.25s / 14.7k tokens) — 4-12x fewer calls than peers, 20-30x faster wall, lowest tokens.", + "pareto_optimal": "codedb_CONTEXT is the sole Pareto-optimal point on the (quality, efficiency) frontier: only fts5_trigram exceeds it on quality, and only by 1 cell out of 4, at ~1.5x the wall and ~1.75x the tokens." 
+ } +} From 260c46c4df9ecb99e8315d16d94894d256869086 Mon Sep 17 00:00:00 2001 From: justrach <54503978+justrach@users.noreply.github.com> Date: Fri, 22 May 2026 20:15:58 +0800 Subject: [PATCH 2/3] bench(swe-lite): add codegraph (CLI + context) and reframe as hypothesis MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds two backends to the matrix: - `codegraph` — primitive `codegraph query` surface, driven by a fixed 3-query plan per task (subprocess only, no LLM loop). 4/4 recall, 3/4 top-1, 3.00 avg calls, 0.17s wall. - `codegraph_CONTEXT` — task-shaped `codegraph context` composer, single call per task. 2/4 recall, 2/4 top-1 — misses both `requests` tasks by surfacing urllib3 internals over the requests-layer wrapper where the gold patch actually lands. Codegraph rows are explicitly annotated `measurement: tool_output_only` in `results.json`. `replay.py` marks them with `*` in the table and prints a footnote: subprocess time + stdout bytes/4, NOT a full LLM-driven agent loop, so the efficiency cells are not directly comparable to the other rows. Quality cells (recall, top-1) ARE directly comparable. Reframes RESULTS.md as a hypothesis snapshot rather than a dominance claim: small sample, mixed measurement methodology, and the doc now ends in a stated hypothesis (codedb_CONTEXT is the cheapest backend in the 3/4-top-1 cluster; codegraph primitive would likely join that cluster under matched methodology) along with the falsification path (live runner, more tasks, patch oracle). README updated to match. Verify still passes: `python3 replay.py`. 
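As a rough illustration of what `tool_output_only` means, the sketch below times a single subprocess call and estimates tokens as stdout bytes divided by 4. It is a hedged sketch rather than the harness used for the codegraph rows: the helper name `measure_tool_output` is hypothetical, integer division is an assumption, and the command is a stand-in so the example runs without codegraph installed.

```python
import subprocess
import sys
import time


def measure_tool_output(cmd: list[str]) -> dict:
    """Time one subprocess call and estimate tokens from its stdout size."""
    start = time.perf_counter()
    proc = subprocess.run(cmd, capture_output=True, check=False)
    wall = time.perf_counter() - start
    return {
        "tool_calls": 1,
        "wall_seconds": wall,
        # Assumed heuristic: roughly 4 bytes of stdout per token.
        "tokens": len(proc.stdout) // 4,
    }


# Stand-in command; a real measurement would pass the backend's own
# command line here instead.
sample = measure_tool_output([sys.executable, "-c", "print('x' * 399)"])
print(sample["tool_calls"], sample["tokens"])
```

Nothing here accounts for prompt tokens, model latency, or retries, which is why the efficiency cells of these rows are not comparable to the agent-loop rows.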
Co-Authored-By: Claude Opus 4.7 (1M context) --- benchmarks/swe-lite/README.md | 59 ++++---- benchmarks/swe-lite/RESULTS.md | 249 ++++++++++++++++++------------- benchmarks/swe-lite/replay.py | 14 +- benchmarks/swe-lite/results.json | 60 +++++--- 4 files changed, 225 insertions(+), 157 deletions(-) mode change 100755 => 100644 benchmarks/swe-lite/replay.py diff --git a/benchmarks/swe-lite/README.md b/benchmarks/swe-lite/README.md index c5ca93a..c4e799d 100644 --- a/benchmarks/swe-lite/README.md +++ b/benchmarks/swe-lite/README.md @@ -1,37 +1,37 @@ # swe-lite -A small **file-localization** benchmark for code-retrieval backends, -graded by a deterministic oracle (file-path match against the merged -upstream patch) on 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench) -instances. +A small **file-localization** hypothesis snapshot for code-retrieval +backends, graded by a deterministic oracle (file-path match against +the merged upstream patch) on 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench) +instances × 6 backends. This folder is a sibling to [`../search-shootout`](../search-shootout), -which uses a hand-authored React corpus + Claude-as-judge. The two -benches stress different things: +which uses a hand-authored React corpus + LLM judge. The two views +stress different things: | | `search-shootout/` | `swe-lite/` (this folder) | |---|---|---| | Corpus | facebook/react (one repo) | 3 upstream repos (flask, requests, seaborn) | | Tasks | hand-authored | merged upstream PRs | | Ground truth | hand-written `tasks.json` | gold patch's `changed_files` | -| Oracle | Claude-as-judge, 5-point rubric | deterministic file-path match | +| Oracle | LLM-as-judge, 5-point rubric | deterministic file-path match | | Risk | closed loop (same model family writes test + takes test) | independent ground truth | Both views together are stronger than either alone. 
The -search-shootout grades *answer quality* (could the agent answer the -question well?); swe-lite grades *file-localization correctness* -(did the agent name the file the patch actually edited?). +search-shootout grades *answer quality*; swe-lite grades +*file-localization correctness*. ## Files -- [`results.json`](./results.json) — frozen snapshot: tasks, per-cell - metrics, summary. Captured 2026-05-22. +- [`results.json`](./results.json) — frozen snapshot: 4 tasks × + 6 backends, per-cell metrics, summary, hypothesis. - [`replay.py`](./replay.py) — loads `results.json`, recomputes the per-backend averages from the raw cells, asserts the summary - matches, and prints the dominance table. + matches, and prints the dominance table (with `*` annotation + for tool-output-only measurements). - [`RESULTS.md`](./RESULTS.md) — the publishable read of the data: - dominance table, the deployment-shape caveat, and what this bench - does and does not measure. + dominance table, what jumps out, the measurement caveat, and the + falsifiable hypothesis this snapshot supports. ## Quick start @@ -39,8 +39,8 @@ question well?); swe-lite grades *file-localization correctness* python3 replay.py ``` -That prints the dominance table and exits non-zero if any summary -cell disagrees with what the raw cells imply. JSON form: +Prints the matrix and exits non-zero if any summary cell disagrees +with what the raw cells imply. JSON form: ```sh python3 replay.py --json @@ -48,14 +48,17 @@ python3 replay.py --json ## What this is NOT -- **Not a live SWE-bench runner.** `results.json` was populated by - running each backend by hand and recording the agent's `files` - output. The script in this folder replays that record; it does not - re-invoke the backends. -- **Not a patch-correctness eval.** This grades "did the agent name - the right file?", not "did the agent's patch make the failing - tests pass?". 
The latter (SWE-bench's headline `pass@1`) is the - metric that actually matters and is tracked as future work. -- **Not a statistic.** n=4 is a sanity check, not a sample. - -See [`RESULTS.md`](./RESULTS.md) §Caveats for the full list. +- **Not a live SWE-bench runner.** Four of six rows (`codedb`, + `codedb_CONTEXT`, `leanctx`, `fts5_trigram`) were populated by + running each backend through an LLM agent loop and recording the + agent's `files` output; codegraph rows were freshly measured here + using a fixed query plan (subprocess only, no LLM in the loop). + See `RESULTS.md` §Measurement caveat. +- **Not a patch-correctness eval.** Grades "did the agent name the + right file?", not "did the agent's patch make the failing tests + pass?". The latter is tracked as future work. +- **Not a statistic.** n=4 is a sanity check, not a sample. The + doc is framed as a hypothesis snapshot, not a settled claim. + +See [`RESULTS.md`](./RESULTS.md) for the full list of caveats and +the hypothesis statement. diff --git a/benchmarks/swe-lite/RESULTS.md b/benchmarks/swe-lite/RESULTS.md index 44277c7..617727f 100644 --- a/benchmarks/swe-lite/RESULTS.md +++ b/benchmarks/swe-lite/RESULTS.md @@ -1,17 +1,18 @@ -# SWE-bench Lite — file-localization results +# SWE-bench Lite — file-localization, six backends -Frozen snapshot of 4 SWE-bench Lite instances × 4 retrieval backends, -scored by deterministic file-path match against the merged upstream -patch (no LLM judge). Captured 2026-05-22. +Small file-localization snapshot: 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench) +instances × 6 retrieval backends, graded by a deterministic oracle +(does the agent name the file that the merged upstream patch actually +edits?). Captured 2026-05-22. +This is published as a **hypothesis snapshot**, not a settled +dominance claim — n=4 is too small for statistics, and not all rows +were measured the same way (see [Measurement caveat](#measurement-caveat)). 
The raw data is in [`results.json`](./results.json); recompute and verify the summary block with [`replay.py`](./replay.py). ## Tasks -Four [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench) -instances spanning three real upstream repos: - | Instance | Repo | Gold file (the file the merged PR patched) | |---|---|---| | `pallets__flask-4045` | pallets/flask | `src/flask/blueprints.py` | @@ -19,56 +20,74 @@ instances spanning three real upstream repos: | `psf__requests-2674` | psf/requests | `requests/adapters.py` | | `mwaskom__seaborn-2848` | mwaskom/seaborn | `seaborn/_oldcore.py` | -Each instance's `base_commit` is pinned in `results.json` so the same -state can be rebuilt by anyone. +Each instance's `base_commit` is pinned in `results.json` so the +state can be rebuilt. ## Backends -| Backend | What it is | How invoked | -|---|---|---| -| `codedb` | This repo's CLI surface — the four lookup primitives (`search`, `find`, `word`, `outline`) | shell calls, agent composes them itself | -| `codedb_CONTEXT` | This repo's **MCP composer** tool — bundles the primitives server-side into one task-shaped call | single MCP call with the issue text + `project=` | -| `leanctx` | yvgude/lean-ctx, BM25-ish word index | CLI calls per query | -| `fts5_trigram` | SQLite FTS5 with the `trigram` tokenizer | direct sqlite3 substring query | +Six backends, three of which ship in two surfaces (a primitive +"search" surface and a task-shaped "build context for this query" +surface). Both surfaces are reported separately when they exist — +mixing a tool's primitive surface against another tool's deployed +surface gives a misleading read. -`codedb_CONTEXT` is the deployed shape of codedb for agentic use; the -CLI is the underlying primitive set. Measuring both lets us separate -"is the search good?" from "is the deployed shape good?". +| Backend | What it is | Surface | +|---|---|---| +| `codedb` | This repo. Zig trigram + word index. 
| primitive (`search`, `find`, `word`, `outline`) | +| `codedb_CONTEXT` | This repo's MCP composer | task-shaped (single call) | +| `leanctx` | yvgude/lean-ctx, BM25-ish word index | primitive | +| `fts5_trigram` | SQLite FTS5 with `trigram` tokenizer | primitive | +| `codegraph` | TS+SQLite code-graph (`codegraph query`) | primitive | +| `codegraph_CONTEXT` | codegraph's task composer (`codegraph context`) | task-shaped | -## Scoring +## Oracle Deterministic, no LLM judge: - **recall** — gold file appears anywhere in the agent's `files` list -- **top-1** — agent's *first* listed file equals the gold file +- **top-1** — the agent's *first* listed file equals the gold file -That's it. The agent doesn't have to write a patch; it just has to -name the file it would edit. This is an intermediate signal — weaker -than patch-correctness, but stronger than judge-graded quality -because there's no model in the oracle loop. +The agent doesn't have to write a patch — only name the file it +would edit. This is an intermediate signal: weaker than patch +correctness, but stronger than judge-graded quality because there's +no model in the oracle loop. ## Headline ``` -backend recall top-1 avg calls avg wall (s) avg tokens ---------------- ------ ----- --------- ------------ ---------- -codedb 4/4 3/4 26.75 42.00 37,954 -codedb_CONTEXT 4/4 3/4 2.25 1.25 14,717 -leanctx 4/4 3/4 9.75 27.25 30,172 -fts5_trigram 4/4 4/4 13.75 24.75 25,801 +backend recall top-1 avg calls avg wall (s) avg tokens +------------------- ------ ----- --------- ------------ ---------- +codedb 4/4 3/4 26.75 42.00 37,954 +codedb_CONTEXT 4/4 3/4 2.25 1.25 14,716 +leanctx 4/4 3/4 9.75 27.25 30,172 +fts5_trigram 4/4 4/4 13.75 24.75 25,800 +codegraph * 4/4 3/4 3.00 0.17 1,981 +codegraph_CONTEXT * 2/4 2/4 1.00 0.11 4,146 ``` -**Quality.** All four backends fully recall the gold file (4/4). -Top-1 splits at one task: `fts5_trigram` 4/4, the other three at 3/4. 
+*\* Codegraph rows use a different measurement methodology — see +[Measurement caveat](#measurement-caveat) before reading the +efficiency cells.* + +## What jumps out + +**Quality is mostly uniform.** Five of six backends fully recall the +gold file (4/4). Top-1 splits across one task (`seaborn-2848`, +discussed below): `fts5_trigram` 4/4, four others tied at 3/4. -**Efficiency.** `codedb_CONTEXT` dominates on every axis — **4×** -fewer calls than `leanctx`, **6×** fewer than `fts5_trigram`, **12×** -fewer than `codedb` CLI; **20-30×** faster wall; lowest tokens. +**`codegraph_CONTEXT` is the lone quality outlier.** It misses both +`requests` tasks because the issue text mentions urllib3 keywords +("socket", "urllib3", "DecodeError"), and the composer surfaces +urllib3 internals over the requests-layer wrapper where the patch +actually lands. This is the only cell where graph-relevance signal +diverges sharply from patch-site relevance in this sample. -**Pareto frontier.** Only one point is Pareto-optimal across (quality, -efficiency): `codedb_CONTEXT`. The single backend that exceeds it on -quality (`fts5_trigram`, by one cell out of four) costs ~1.5× the -wall and ~1.75× the tokens for that gain. +**Among the apples-to-apples (agent-loop) rows, `codedb_CONTEXT` +sits at the efficient end of the matched-quality cluster.** It +matches the 3/4-top-1 cluster (codedb / leanctx / codedb_CONTEXT) +on quality and is the cheapest in that cluster across calls, wall, +and tokens. `fts5_trigram` is the only backend that gets the +top-1-4/4 cell — at ~20× the wall time of `codedb_CONTEXT`. ## The one task where top-1 split — `mwaskom__seaborn-2848` @@ -77,70 +96,94 @@ The seaborn bug surfaces as a `KeyError` raised inside lives in `seaborn/axisgrid.py::PairGrid`. The merged upstream patch edits `_oldcore.py` (the root-cause site). 
-Three of four backends (`codedb`, `codedb_CONTEXT`, `leanctx`) named -`axisgrid.py` first and `_oldcore.py` second — the order a developer -would trace through. `fts5_trigram` named `_oldcore.py` first because -trigram matches on identifier strings preferred the file with denser -term hits. - -Both orderings find the bug. Which one is "better" depends on what -you want top-1 to mean: the first place a developer would look (the -call site) or the place the patch actually lands (the root-cause -site). At this sample size the metric punishes the explanatory -ordering, but neither agent failed the task. - -## Why the CLI row matters — deployment shape is a measurement axis - -An earlier iteration of this bench reported `codedb` as the *least* -efficient backend (26.75 calls / 42s / 38k tokens) and concluded the -dominance claim was partially falsified. That finding was numerically -correct but tested the wrong thing: it pitted codedb's *CLI* (a stack -of four lookup primitives the agent composes itself) against peers' -*deployed* surfaces (leanctx CLI, fts5 sqlite3). - -`codedb_CONTEXT` is the actual deployed shape — one MCP call that -bundles the primitives server-side. Once measured at the same level -of abstraction as the peers, the dominance picture survives the -verifiable oracle. - -**Lesson:** when a tool has more than one deployment surface -(CLI / MCP / HTTP / library), the bench has to identify the -*primary* surface and compare primary-against-primary. Measuring a -side surface and reporting it as the headline is an -apples-to-oranges error. - -## Caveats — read before quoting these numbers - -1. **n=4 is small.** Four SWE-bench Lite instances is a sanity check, - not a statistic. Don't generalize from "3/4 top-1" to "75% top-1 - on SWE-bench Lite". -2. **File-localization ≠ patch-correctness.** This bench measures - whether the agent names the right file. 
It does not run the agent - end-to-end, generate a patch, or check whether the patch makes - the failing tests pass. An end-to-end `pass@1` eval is the metric - that actually matters; this is one rung below it on the ladder. -3. **Replay, not live.** `results.json` is a frozen record. The - `replay.py` script recomputes the averages from the cells and - verifies the summary block matches, but it does not re-launch - the backends. A live runner is future work. -4. **One judge-graded comparator** (`codegraph` MCP) is intentionally - absent here — it was measured on the hand-authored / judge-graded - shootout but not on this verifiable-oracle bench. Add it if you - want a 5-backend matrix. -5. **The seaborn split is a metric artifact, not a backend - weakness.** Three out of four backends (including `fts5_trigram` - on the *other* three tasks) order files by traceability rather - than patch site. The split says more about top-1 as a metric than - about the backends. +Four backends (`codedb`, `codedb_CONTEXT`, `leanctx`, `codegraph`) +named `axisgrid.py` first and `_oldcore.py` second — the order a +developer would trace through. `fts5_trigram` and +`codegraph_CONTEXT` named `_oldcore.py` first. Both orderings find +the bug; "top-1 correctness" is really asking *which* ordering you +want — the first file a developer would look at (call site) or the +file the patch actually lands in (root cause). + +## Measurement caveat + +Codegraph rows (`codegraph` and `codegraph_CONTEXT`) were measured +differently from the other four rows: + +- **Calls / wall:** codegraph numbers reflect subprocess invocations + driven by a fixed 3-query plan (primitive surface) or a single + `codegraph context` call (task surface). The other four rows + reflect a full LLM-driven agent loop that decides which queries + to run. +- **Tokens:** codegraph numbers are stdout bytes / 4 (just the + tool's output). 
The other four rows include the agent's full + context (system prompt + tool defs + tool outputs + LLM + reasoning). + +Under a comparable LLM-driven loop, codegraph's tool_calls would +likely rise (an LLM tends to make 5–15 queries when exploring) and +tokens would rise to the agent-context level (~10–20× current +values). What's NOT expected to change much: recall and top-1, +since those depend on which files codegraph surfaces — and the file +sets above are what codegraph actually returned for those queries. + +The takeaway is that codegraph's **quality** cells are directly +comparable to other backends, and its **efficiency** cells are not. +This is annotated in the table with `*` and in `results.json` via +the `measurement: tool_output_only` field. + +## Other caveats — read before quoting these numbers + +1. **n=4 is small.** Four SWE-bench Lite instances is a sanity + check, not a statistic. Don't read "3/4 top-1" as "75% top-1 on + SWE-bench Lite". +2. **File-localization ≠ patch-correctness.** This bench grades + whether the agent names the right file. It does not run the + agent end-to-end, generate a patch, or check whether the patch + makes the failing tests pass. An end-to-end `pass@1` eval is the + metric that actually matters; this is one rung below it on the + ladder. +3. **Snapshot, not live.** `results.json` is a frozen record. + `replay.py` recomputes the averages from the cells and verifies + the summary block matches, but does not re-launch the four + non-codegraph backends. Codegraph rows *were* freshly measured + while preparing this snapshot. +4. **The seaborn top-1 split is a metric artifact, not a backend + weakness.** Four of six backends order files by traceability + rather than by patch site. The split says more about top-1 as a + metric than about any individual backend. 
+ +## Hypothesis + +Stated as something to falsify, not declare: + +> Among the agent-loop rows, **`codedb_CONTEXT`** is the cheapest +> backend in the matched-quality cluster (3/4 top-1, 4/4 recall) on +> file-localization. **`fts5_trigram`** is the only backend that +> currently reaches 4/4 top-1, and it does so at ~20× the wall time +> of `codedb_CONTEXT`. The expected next-step result, if a live +> agent-loop runner is built and codegraph is re-measured under +> matched methodology, is: **codegraph (primitive) joins the +> 3/4-top-1 cluster at agent-loop call counts somewhere between +> codedb_CONTEXT's 2.25 and leanctx's 9.75, with comparable +> tokens.** + +This hypothesis is **falsifiable** by: + +- Building a live LLM-loop runner and re-measuring codegraph under + agent-loop methodology. +- Expanding to 20–50 SWE-bench Lite instances — at that sample size + the quality differences (or lack of them) become statistically meaningful. +- Adding a patch-correctness oracle (apply the agent's patch + against the pinned `base_commit` and run the failing tests). + +Until at least one of those is done, treat the headline as **directional**, not +quantitative. ## Future work -- A live runner that actually invokes each backend per task and - records `files`, `tool_calls`, `wall_seconds`, `tokens` on the - spot (instead of the current hand-recorded snapshot). -- A patch-correctness oracle: agent produces a unified diff, - oracle applies it against the pinned `base_commit` and runs the - upstream test suite. That's the only metric that fully captures - the "did the agent solve it?" question. -- More tasks. 20-50 SWE-bench Lite instances would let "3/4 top-1" - turn into a statistic instead of a sanity check. +- A live runner that actually invokes each backend per task with a + consistent LLM agent loop, so all rows are measured the same way. +- A patch-correctness oracle. +- More tasks. +- Quality cells under the existing oracle are robust; everything + else is a calibration exercise.
diff --git a/benchmarks/swe-lite/replay.py b/benchmarks/swe-lite/replay.py old mode 100755 new mode 100644 index 8997427..d9a9b39 --- a/benchmarks/swe-lite/replay.py +++ b/benchmarks/swe-lite/replay.py @@ -65,11 +65,16 @@ def verify(snapshot: dict, recomputed: dict) -> list[str]: def print_table(snapshot: dict, recomputed: dict) -> None: backends = snapshot["backends"] + measurement = { + b: snapshot["summary"]["by_backend"][b].get("measurement") + for b in backends + } rows = [("backend", "recall", "top-1", "avg calls", "avg wall (s)", "avg tokens")] for backend in backends: s = recomputed[backend] + label = backend + (" *" if measurement.get(backend) == "tool_output_only" else "") rows.append(( - backend, + label, s["recall"], s["top_1"], f"{s['avg_tool_calls']:.2f}", @@ -82,8 +87,11 @@ def print_table(snapshot: dict, recomputed: dict) -> None: print(" ".join(cell.ljust(widths[j]) for j, cell in enumerate(row))) if i == 0: print(sep) - - + if any(m == "tool_output_only" for m in measurement.values()): + print() + print("* tool-output-only measurement (subprocess time + stdout bytes/4),") + print(" driven by a fixed query plan, NOT an LLM agent loop. 
Not directly") + print(" comparable to rows without an asterisk — see RESULTS.md for details.") def main() -> int: parser = argparse.ArgumentParser(description=__doc__) parser.add_argument("--json", action="store_true", help="emit recomputed summary as JSON") diff --git a/benchmarks/swe-lite/results.json b/benchmarks/swe-lite/results.json index 0812eda..6a3340b 100644 --- a/benchmarks/swe-lite/results.json +++ b/benchmarks/swe-lite/results.json @@ -6,42 +6,56 @@ "recall": "gold file appears anywhere in agent's `files` list", "top_1": "agent's first `files` entry equals the gold file" }, + "measurement_notes": { + "default": "tokens reflect the agent's full context consumption (system prompt + tool defs + tool outputs + LLM reasoning); tool_calls and wall_seconds are end-to-end agent loop totals", + "tool_output_only": "tokens reflect only the tool's stdout bytes / 4 (no LLM context); tool_calls and wall_seconds reflect subprocess invocations driven by a fixed query plan, not an LLM-decided loop" + }, "tasks": [ {"id": "pallets__flask-4045", "repo": "pallets/flask", "base_commit": "d8c37f43724cd9fb0870f77877b7c4c7e38a19e0", "title": "Raise error when blueprint name contains a dot", "gold_files": ["src/flask/blueprints.py"]}, {"id": "psf__requests-2148", "repo": "psf/requests", "base_commit": "fe693c492242ae532211e0c173324f09ca8cf227", "title": "socket.error exception not caught/wrapped in a requests exception", "gold_files": ["requests/models.py"]}, {"id": "psf__requests-2674", "repo": "psf/requests", "base_commit": "0be38a0c37c59c4b66ce908731da15b401655113", "title": "urllib3 exceptions passing through requests API", "gold_files": ["requests/adapters.py"]}, {"id": "mwaskom__seaborn-2848", "repo": "mwaskom/seaborn", "base_commit": "94621cef29f80282436d73e8d2c0aa76dab81273", "title": "pairplot fails with hue_order not containing all hue values", "gold_files": ["seaborn/_oldcore.py"]} ], - "backends": ["codedb", "codedb_CONTEXT", "leanctx", "fts5_trigram"], + 
"backends": ["codedb", "codedb_CONTEXT", "leanctx", "fts5_trigram", "codegraph", "codegraph_CONTEXT"], "cells": [ - {"task": "pallets__flask-4045", "backend": "codedb", "files": ["src/flask/blueprints.py"], "tool_calls": 8, "wall_seconds": 12, "tokens": 17508, "recall": true, "top_1": true}, - {"task": "pallets__flask-4045", "backend": "codedb_CONTEXT", "files": ["src/flask/blueprints.py"], "tool_calls": 3, "wall_seconds": 2, "tokens": 14834, "recall": true, "top_1": true}, - {"task": "pallets__flask-4045", "backend": "leanctx", "files": ["src/flask/blueprints.py"], "tool_calls": 4, "wall_seconds": 8, "tokens": 17017, "recall": true, "top_1": true}, - {"task": "pallets__flask-4045", "backend": "fts5_trigram", "files": ["src/flask/blueprints.py"], "tool_calls": 13, "wall_seconds": 18, "tokens": 16288, "recall": true, "top_1": true}, + {"task": "pallets__flask-4045", "backend": "codedb", "files": ["src/flask/blueprints.py"], "tool_calls": 8, "wall_seconds": 12, "tokens": 17508, "recall": true, "top_1": true}, + {"task": "pallets__flask-4045", "backend": "codedb_CONTEXT", "files": ["src/flask/blueprints.py"], "tool_calls": 3, "wall_seconds": 2, "tokens": 14834, "recall": true, "top_1": true}, + {"task": "pallets__flask-4045", "backend": "leanctx", "files": ["src/flask/blueprints.py"], "tool_calls": 4, "wall_seconds": 8, "tokens": 17017, "recall": true, "top_1": true}, + {"task": "pallets__flask-4045", "backend": "fts5_trigram", "files": ["src/flask/blueprints.py"], "tool_calls": 13, "wall_seconds": 18, "tokens": 16288, "recall": true, "top_1": true}, + {"task": "pallets__flask-4045", "backend": "codegraph", "files": ["src/flask/blueprints.py", "src/flask/json/tag.py", "src/flask/wrappers.py", "src/flask/app.py"], "tool_calls": 3, "wall_seconds": 0.16, "tokens": 2235, "recall": true, "top_1": true, "measurement": "tool_output_only"}, + {"task": "pallets__flask-4045", "backend": "codegraph_CONTEXT", "files": ["src/flask/blueprints.py", "src/flask/helpers.py", 
"src/flask/app.py", "src/flask/scaffold.py"], "tool_calls": 1, "wall_seconds": 0.11, "tokens": 3788, "recall": true, "top_1": true, "measurement": "tool_output_only"}, - {"task": "psf__requests-2148", "backend": "codedb", "files": ["requests/models.py", "requests/adapters.py", "requests/exceptions.py"], "tool_calls": 14, "wall_seconds": 18, "tokens": 20439, "recall": true, "top_1": true}, - {"task": "psf__requests-2148", "backend": "codedb_CONTEXT", "files": ["requests/models.py", "requests/exceptions.py"], "tool_calls": 2, "wall_seconds": 1, "tokens": 14516, "recall": true, "top_1": true}, - {"task": "psf__requests-2148", "backend": "leanctx", "files": ["requests/models.py", "requests/adapters.py"], "tool_calls": 9, "wall_seconds": 28, "tokens": 32319, "recall": true, "top_1": true}, - {"task": "psf__requests-2148", "backend": "fts5_trigram", "files": ["requests/models.py", "requests/adapters.py"], "tool_calls": 11, "wall_seconds": 18, "tokens": 16427, "recall": true, "top_1": true}, + {"task": "psf__requests-2148", "backend": "codedb", "files": ["requests/models.py", "requests/adapters.py", "requests/exceptions.py"], "tool_calls": 14, "wall_seconds": 18, "tokens": 20439, "recall": true, "top_1": true}, + {"task": "psf__requests-2148", "backend": "codedb_CONTEXT", "files": ["requests/models.py", "requests/exceptions.py"], "tool_calls": 2, "wall_seconds": 1, "tokens": 14516, "recall": true, "top_1": true}, + {"task": "psf__requests-2148", "backend": "leanctx", "files": ["requests/models.py", "requests/adapters.py"], "tool_calls": 9, "wall_seconds": 28, "tokens": 32319, "recall": true, "top_1": true}, + {"task": "psf__requests-2148", "backend": "fts5_trigram", "files": ["requests/models.py", "requests/adapters.py"], "tool_calls": 11, "wall_seconds": 18, "tokens": 16427, "recall": true, "top_1": true}, + {"task": "psf__requests-2148", "backend": "codegraph", "files": ["requests/models.py", "requests/adapters.py", "requests/sessions.py", "requests/exceptions.py"], 
"tool_calls": 3, "wall_seconds": 0.16, "tokens": 1501, "recall": true, "top_1": true, "measurement": "tool_output_only"}, + {"task": "psf__requests-2148", "backend": "codegraph_CONTEXT", "files": ["requests/packages/urllib3/connection.py", "requests/packages/urllib3/util/ssl_.py"], "tool_calls": 1, "wall_seconds": 0.10, "tokens": 3440, "recall": false, "top_1": false, "measurement": "tool_output_only"}, - {"task": "psf__requests-2674", "backend": "codedb", "files": ["requests/adapters.py", "requests/exceptions.py"], "tool_calls": 23, "wall_seconds": 18, "tokens": 24816, "recall": true, "top_1": true}, - {"task": "psf__requests-2674", "backend": "codedb_CONTEXT", "files": ["requests/adapters.py", "requests/models.py", "requests/exceptions.py"], "tool_calls": 2, "wall_seconds": 1, "tokens": 14725, "recall": true, "top_1": true}, - {"task": "psf__requests-2674", "backend": "leanctx", "files": ["requests/adapters.py"], "tool_calls": 6, "wall_seconds": 28, "tokens": 28060, "recall": true, "top_1": true}, - {"task": "psf__requests-2674", "backend": "fts5_trigram", "files": ["requests/adapters.py", "requests/exceptions.py"], "tool_calls": 8, "wall_seconds": 18, "tokens": 22767, "recall": true, "top_1": true}, + {"task": "psf__requests-2674", "backend": "codedb", "files": ["requests/adapters.py", "requests/exceptions.py"], "tool_calls": 23, "wall_seconds": 18, "tokens": 24816, "recall": true, "top_1": true}, + {"task": "psf__requests-2674", "backend": "codedb_CONTEXT", "files": ["requests/adapters.py", "requests/models.py", "requests/exceptions.py"], "tool_calls": 2, "wall_seconds": 1, "tokens": 14725, "recall": true, "top_1": true}, + {"task": "psf__requests-2674", "backend": "leanctx", "files": ["requests/adapters.py"], "tool_calls": 6, "wall_seconds": 28, "tokens": 28060, "recall": true, "top_1": true}, + {"task": "psf__requests-2674", "backend": "fts5_trigram", "files": ["requests/adapters.py", "requests/exceptions.py"], "tool_calls": 8, "wall_seconds": 18, "tokens": 
22767, "recall": true, "top_1": true}, + {"task": "psf__requests-2674", "backend": "codegraph", "files": ["requests/adapters.py", "requests/packages/urllib3/exceptions.py", "requests/sessions.py"], "tool_calls": 3, "wall_seconds": 0.16, "tokens": 1927, "recall": true, "top_1": true, "measurement": "tool_output_only"}, + {"task": "psf__requests-2674", "backend": "codegraph_CONTEXT", "files": ["requests/packages/urllib3/exceptions.py", "requests/packages/urllib3/util/timeout.py"], "tool_calls": 1, "wall_seconds": 0.10, "tokens": 3113, "recall": false, "top_1": false, "measurement": "tool_output_only"}, - {"task": "mwaskom__seaborn-2848", "backend": "codedb", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 62, "wall_seconds": 120, "tokens": 89054, "recall": true, "top_1": false}, - {"task": "mwaskom__seaborn-2848", "backend": "codedb_CONTEXT", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 2, "wall_seconds": 1, "tokens": 14791, "recall": true, "top_1": false}, - {"task": "mwaskom__seaborn-2848", "backend": "leanctx", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 20, "wall_seconds": 45, "tokens": 43291, "recall": true, "top_1": false}, - {"task": "mwaskom__seaborn-2848", "backend": "fts5_trigram", "files": ["seaborn/_oldcore.py", "seaborn/relational.py"], "tool_calls": 23, "wall_seconds": 45, "tokens": 47720, "recall": true, "top_1": true} + {"task": "mwaskom__seaborn-2848", "backend": "codedb", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 62, "wall_seconds": 120, "tokens": 89054, "recall": true, "top_1": false}, + {"task": "mwaskom__seaborn-2848", "backend": "codedb_CONTEXT", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 2, "wall_seconds": 1, "tokens": 14791, "recall": true, "top_1": false}, + {"task": "mwaskom__seaborn-2848", "backend": "leanctx", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 20, "wall_seconds": 45, 
"tokens": 43291, "recall": true, "top_1": false}, + {"task": "mwaskom__seaborn-2848", "backend": "fts5_trigram", "files": ["seaborn/_oldcore.py", "seaborn/relational.py"], "tool_calls": 23, "wall_seconds": 45, "tokens": 47720, "recall": true, "top_1": true}, + {"task": "mwaskom__seaborn-2848", "backend": "codegraph", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 3, "wall_seconds": 0.18, "tokens": 2262, "recall": true, "top_1": false, "measurement": "tool_output_only"}, + {"task": "mwaskom__seaborn-2848", "backend": "codegraph_CONTEXT", "files": ["seaborn/_oldcore.py", "seaborn/axisgrid.py", "seaborn/_marks/base.py"], "tool_calls": 1, "wall_seconds": 0.12, "tokens": 6245, "recall": true, "top_1": true, "measurement": "tool_output_only"} ], "summary": { "by_backend": { - "codedb": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 26.75, "avg_wall_seconds": 42.0, "avg_tokens": 37954.25}, - "codedb_CONTEXT": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 2.25, "avg_wall_seconds": 1.25, "avg_tokens": 14716.5}, - "leanctx": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 9.75, "avg_wall_seconds": 27.25, "avg_tokens": 30171.75}, - "fts5_trigram": {"recall": "4/4", "top_1": "4/4", "avg_tool_calls": 13.75, "avg_wall_seconds": 24.75, "avg_tokens": 25800.5} + "codedb": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 26.75, "avg_wall_seconds": 42.0, "avg_tokens": 37954.25}, + "codedb_CONTEXT": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 2.25, "avg_wall_seconds": 1.25, "avg_tokens": 14716.5}, + "leanctx": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 9.75, "avg_wall_seconds": 27.25, "avg_tokens": 30171.75}, + "fts5_trigram": {"recall": "4/4", "top_1": "4/4", "avg_tool_calls": 13.75, "avg_wall_seconds": 24.75, "avg_tokens": 25800.5}, + "codegraph": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 3.0, "avg_wall_seconds": 0.165, "avg_tokens": 1981.25, "measurement": "tool_output_only"}, + "codegraph_CONTEXT": {"recall": 
"2/4", "top_1": "2/4", "avg_tool_calls": 1.0, "avg_wall_seconds": 0.1075, "avg_tokens": 4146.5, "measurement": "tool_output_only"} }, - "headline": "All four backends fully recall the gold file (4/4). Top-1 splits: fts5_trigram 4/4; codedb / codedb_CONTEXT / leanctx tie at 3/4 (all flagged seaborn/axisgrid.py before seaborn/_oldcore.py — the symptom site vs the root-cause site). Efficiency: codedb_CONTEXT dominates by a wide margin (2.25 calls / 1.25s / 14.7k tokens) — 4-12x fewer calls than peers, 20-30x faster wall, lowest tokens.", - "pareto_optimal": "codedb_CONTEXT is the sole Pareto-optimal point on the (quality, efficiency) frontier: only fts5_trigram exceeds it on quality, and only by 1 cell out of 4, at ~1.5x the wall and ~1.75x the tokens." + "headline": "Six backends, four SWE-bench Lite instances. Quality is broadly similar — five of six achieve 4/4 recall (codegraph_CONTEXT is the only outlier at 2/4, missing both `requests` tasks by surfacing urllib3 internals over the requests-layer wrapper). Top-1 splits: fts5_trigram 4/4; codedb / codedb_CONTEXT / leanctx / codegraph tie at 3/4 (the seaborn axisgrid/_oldcore call-trace ordering); codegraph_CONTEXT at 2/4. Efficiency cells for the codegraph rows reflect subprocess-only measurement under a fixed query plan, not a full LLM agent loop — they are not directly comparable to the other rows' agent-loop numbers.", + "hypothesis": "If a comparable LLM-driven agent loop were run against codegraph's primitive surface, recall would likely hold (4/4 found on the deterministic file-path oracle is shape-independent), but tool_calls and tokens would rise to LLM-loop levels. The interesting open question is whether codegraph_CONTEXT's `requests`-task miss is fixable by prompt engineering (it surfaces urllib3, the gold file is requests/adapters.py / requests/models.py) or whether it reflects a graph-relevance bias toward leaf libraries over wrapper APIs." 
} } From 05952b4dd24aac38aa6da6bf69c9e9d811acfc85 Mon Sep 17 00:00:00 2001 From: justrach <54503978+justrach@users.noreply.github.com> Date: Fri, 22 May 2026 20:22:59 +0800 Subject: [PATCH 3/3] bench(swe-lite): annotate codegraph version + re-verify at v0.9.3 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Upgraded codegraph 0.7.10 -> 0.9.3 (five minor versions of drift on the tool we're benchmarking — unfair to measure stale). Re-indexed all 4 corpora and re-ran both surfaces. Result: file lists are byte-identical to v0.7.10 on all 4 tasks × both surfaces. Wall times within normal variance. The quality picture in RESULTS.md is robust to the version bump. Adds `backend_versions` to results.json metadata and a one-line note near the top of RESULTS.md so future readers know which codegraph version produced the numbers. Co-Authored-By: Claude Opus 4.7 (1M context) --- benchmarks/swe-lite/RESULTS.md | 4 +++- benchmarks/swe-lite/results.json | 4 ++++ 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/benchmarks/swe-lite/RESULTS.md b/benchmarks/swe-lite/RESULTS.md index 617727f..589547c 100644 --- a/benchmarks/swe-lite/RESULTS.md +++ b/benchmarks/swe-lite/RESULTS.md @@ -3,7 +3,9 @@ Small file-localization snapshot: 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench) instances × 6 retrieval backends, graded by a deterministic oracle (does the agent name the file that the merged upstream patch actually -edits?). Captured 2026-05-22. +edits?). Captured 2026-05-22. Codegraph rows re-verified at v0.9.3 +(released the same day) — file lists are byte-identical to v0.7.10, +so the quality picture below isn't a version artifact. 
This is published as a **hypothesis snapshot**, not a settled dominance claim — n=4 is too small for statistics, and not all rows diff --git a/benchmarks/swe-lite/results.json b/benchmarks/swe-lite/results.json index 6a3340b..6a64acc 100644 --- a/benchmarks/swe-lite/results.json +++ b/benchmarks/swe-lite/results.json @@ -1,6 +1,10 @@ { "source": "SWE-bench Lite (princeton-nlp/SWE-bench_Lite) — file-localization shape", "frozen_at": "2026-05-22T10:05Z", + "backend_versions": { + "codegraph": "0.9.3 (re-verified against v0.7.10 — file lists byte-identical)", + "codegraph_CONTEXT": "0.9.3" + }, "scoring": "deterministic file-path match against gold patch's changed_files (no LLM judge)", "metric_definitions": { "recall": "gold file appears anywhere in agent's `files` list",