diff --git a/benchmarks/swe-lite/README.md b/benchmarks/swe-lite/README.md new file mode 100644 index 0000000..c4e799d --- /dev/null +++ b/benchmarks/swe-lite/README.md @@ -0,0 +1,64 @@ +# swe-lite + +A small **file-localization** hypothesis snapshot for code-retrieval +backends, graded by a deterministic oracle (file-path match against +the merged upstream patch) on 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench) +instances × 6 backends. + +This folder is a sibling to [`../search-shootout`](../search-shootout), +which uses a hand-authored React corpus + LLM judge. The two views +stress different things: + +| | `search-shootout/` | `swe-lite/` (this folder) | +|---|---|---| +| Corpus | facebook/react (one repo) | 3 upstream repos (flask, requests, seaborn) | +| Tasks | hand-authored | merged upstream PRs | +| Ground truth | hand-written `tasks.json` | gold patch's `changed_files` | +| Oracle | LLM-as-judge, 5-point rubric | deterministic file-path match | +| Risk | closed loop (same model family writes test + takes test) | independent ground truth | + +Both views together are stronger than either alone. The +search-shootout grades *answer quality*; swe-lite grades +*file-localization correctness*. + +## Files + +- [`results.json`](./results.json) — frozen snapshot: 4 tasks × + 6 backends, per-cell metrics, summary, hypothesis. +- [`replay.py`](./replay.py) — loads `results.json`, recomputes the + per-backend averages from the raw cells, asserts the summary + matches, and prints the dominance table (with `*` annotation + for tool-output-only measurements). +- [`RESULTS.md`](./RESULTS.md) — the publishable read of the data: + dominance table, what jumps out, the measurement caveat, and the + falsifiable hypothesis this snapshot supports. + +## Quick start + +```sh +python3 replay.py +``` + +Prints the matrix and exits non-zero if any summary cell disagrees +with what the raw cells imply. 
JSON form: + +```sh +python3 replay.py --json +``` + +## What this is NOT + +- **Not a live SWE-bench runner.** Four of six rows (`codedb`, + `codedb_CONTEXT`, `leanctx`, `fts5_trigram`) were populated by + running each backend through an LLM agent loop and recording the + agent's `files` output; codegraph rows were freshly measured here + using a fixed query plan (subprocess only, no LLM in the loop). + See `RESULTS.md` §Measurement caveat. +- **Not a patch-correctness eval.** Grades "did the agent name the + right file?", not "did the agent's patch make the failing tests + pass?". The latter is tracked as future work. +- **Not a statistic.** n=4 is a sanity check, not a sample. The + doc is framed as a hypothesis snapshot, not a settled claim. + +See [`RESULTS.md`](./RESULTS.md) for the full list of caveats and +the hypothesis statement. diff --git a/benchmarks/swe-lite/RESULTS.md b/benchmarks/swe-lite/RESULTS.md new file mode 100644 index 0000000..589547c --- /dev/null +++ b/benchmarks/swe-lite/RESULTS.md @@ -0,0 +1,191 @@ +# SWE-bench Lite — file-localization, six backends + +Small file-localization snapshot: 4 [SWE-bench Lite](https://github.com/princeton-nlp/SWE-bench) +instances × 6 retrieval backends, graded by a deterministic oracle +(does the agent name the file that the merged upstream patch actually +edits?). Captured 2026-05-22. Codegraph rows re-verified at v0.9.3 +(released the same day) — file lists are byte-identical to v0.7.10, +so the quality picture below isn't a version artifact. + +This is published as a **hypothesis snapshot**, not a settled +dominance claim — n=4 is too small for statistics, and not all rows +were measured the same way (see [Measurement caveat](#measurement-caveat)). +The raw data is in [`results.json`](./results.json); recompute and +verify the summary block with [`replay.py`](./replay.py). 
+
+## Tasks
+
+| Instance | Repo | Gold file (the file the merged PR patched) |
+|---|---|---|
+| `pallets__flask-4045` | pallets/flask | `src/flask/blueprints.py` |
+| `psf__requests-2148` | psf/requests | `requests/models.py` |
+| `psf__requests-2674` | psf/requests | `requests/adapters.py` |
+| `mwaskom__seaborn-2848` | mwaskom/seaborn | `seaborn/_oldcore.py` |
+
+Each instance's `base_commit` is pinned in `results.json` so the
+state can be rebuilt.
+
+## Backends
+
+Six backend rows from four tools; codedb and codegraph each ship in
+two surfaces (a primitive "search" surface and a task-shaped "build
+context for this query" surface). Both surfaces are reported
+separately when they exist — mixing a tool's primitive surface
+against another tool's deployed surface gives a misleading read.
+
+| Backend | What it is | Surface |
+|---|---|---|
+| `codedb` | This repo. Zig trigram + word index. | primitive (`search`, `find`, `word`, `outline`) |
+| `codedb_CONTEXT` | This repo's MCP composer | task-shaped (single call) |
+| `leanctx` | yvgude/lean-ctx, BM25-ish word index | primitive |
+| `fts5_trigram` | SQLite FTS5 with `trigram` tokenizer | primitive |
+| `codegraph` | TS+SQLite code-graph (`codegraph query`) | primitive |
+| `codegraph_CONTEXT` | codegraph's task composer (`codegraph context`) | task-shaped |
+
+## Oracle
+
+Deterministic, no LLM judge:
+
+- **recall** — gold file appears anywhere in the agent's `files` list
+- **top-1** — the agent's *first* listed file equals the gold file
+
+The agent doesn't have to write a patch — only name the file it
+would edit. This is an intermediate signal: weaker than patch
+correctness, but stronger than judge-graded quality because there's
+no model in the oracle loop.
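The two oracle rules above can be sketched in a few lines. This is a minimal illustration, not code from this patch: the helper name `score` is hypothetical, and it assumes paths are compared as exact strings against the task's `gold_files` list.

```python
from typing import Sequence


def score(files: Sequence[str], gold_files: Sequence[str]) -> dict:
    """Grade one cell with the deterministic file-path oracle (hypothetical helper)."""
    gold = set(gold_files)
    return {
        # recall: a gold file appears anywhere in the agent's list
        "recall": any(f in gold for f in files),
        # top-1: the first listed file is a gold file (False for an empty list)
        "top_1": bool(files) and files[0] in gold,
    }


# seaborn-2848 shape: call site named first, gold (root-cause) file second
print(score(["seaborn/axisgrid.py", "seaborn/_oldcore.py"], ["seaborn/_oldcore.py"]))
# → {'recall': True, 'top_1': False}
```

Under these assumptions a cell that lists the gold file second still counts for recall but not for top-1, which is the pattern behind the 3/4 top-1 rows.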
+
+## Headline
+
+```
+backend              recall  top-1  avg calls  avg wall (s)  avg tokens
+-------------------  ------  -----  ---------  ------------  ----------
+codedb               4/4     3/4    26.75      42.00         37,954
+codedb_CONTEXT       4/4     3/4    2.25       1.25          14,716
+leanctx              4/4     3/4    9.75       27.25         30,172
+fts5_trigram         4/4     4/4    13.75      24.75         25,800
+codegraph *          4/4     3/4    3.00       0.17          1,981
+codegraph_CONTEXT *  2/4     2/4    1.00       0.11          4,146
+```
+
+*\* Codegraph rows use a different measurement methodology — see
+[Measurement caveat](#measurement-caveat) before reading the
+efficiency cells.*
+
+## What jumps out
+
+**Quality is mostly uniform.** Five of six backends fully recall the
+gold file (4/4). Top-1 splits across one task (`seaborn-2848`,
+discussed below): `fts5_trigram` 4/4, four others tied at 3/4.
+
+**`codegraph_CONTEXT` is the lone quality outlier.** It misses both
+`requests` tasks because the issue text mentions urllib3 keywords
+("socket", "urllib3", "DecodeError"), and the composer surfaces
+urllib3 internals over the requests-layer wrapper where the patch
+actually lands. This is the only cell where graph-relevance signal
+diverges sharply from patch-site relevance in this sample.
+
+**Among the apples-to-apples (agent-loop) rows, `codedb_CONTEXT`
+sits at the efficient end of the matched-quality cluster.** It
+matches the 3/4-top-1 cluster (codedb / leanctx / codedb_CONTEXT)
+on quality and is the cheapest in that cluster across calls, wall,
+and tokens. `fts5_trigram` is the only backend that gets the
+top-1-4/4 cell — at ~20× the wall time of `codedb_CONTEXT`.
+
+## The one task where top-1 split — `mwaskom__seaborn-2848`
+
+The seaborn bug surfaces as a `KeyError` raised inside
+`seaborn/_oldcore.py::SemanticMapping`, but the user-facing call site
+lives in `seaborn/axisgrid.py::PairGrid`. The merged upstream patch
+edits `_oldcore.py` (the root-cause site).
+ +Four backends (`codedb`, `codedb_CONTEXT`, `leanctx`, `codegraph`) +named `axisgrid.py` first and `_oldcore.py` second — the order a +developer would trace through. `fts5_trigram` and +`codegraph_CONTEXT` named `_oldcore.py` first. Both orderings find +the bug; "top-1 correctness" is really asking *which* ordering you +want — the first file a developer would look at (call site) or the +file the patch actually lands in (root cause). + +## Measurement caveat + +Codegraph rows (`codegraph` and `codegraph_CONTEXT`) were measured +differently from the other four rows: + +- **Calls / wall:** codegraph numbers reflect subprocess invocations + driven by a fixed 3-query plan (primitive surface) or a single + `codegraph context` call (task surface). The other four rows + reflect a full LLM-driven agent loop that decides which queries + to run. +- **Tokens:** codegraph numbers are stdout bytes / 4 (just the + tool's output). The other four rows include the agent's full + context (system prompt + tool defs + tool outputs + LLM + reasoning). + +Under a comparable LLM-driven loop, codegraph's tool_calls would +likely rise (an LLM tends to make 5–15 queries when exploring) and +tokens would rise to the agent-context level (~10–20× current +values). What's NOT expected to change much: recall and top-1, +since those depend on which files codegraph surfaces — and the file +sets above are what codegraph actually returned for those queries. + +The takeaway is that codegraph's **quality** cells are directly +comparable to other backends, and its **efficiency** cells are not. +This is annotated in the table with `*` and in `results.json` via +the `measurement: tool_output_only` field. + +## Other caveats — read before quoting these numbers + +1. **n=4 is small.** Four SWE-bench Lite instances is a sanity + check, not a statistic. Don't read "3/4 top-1" as "75% top-1 on + SWE-bench Lite". +2. 
**File-localization ≠ patch-correctness.** This bench grades
+   whether the agent names the right file. It does not run the
+   agent end-to-end, generate a patch, or check whether the patch
+   makes the failing tests pass. An end-to-end `pass@1` eval is the
+   metric that actually matters; this is one rung below it on the
+   ladder.
+3. **Snapshot, not live.** `results.json` is a frozen record.
+   `replay.py` recomputes the averages from the cells and verifies
+   the summary block matches, but does not re-launch the four
+   non-codegraph backends. Codegraph rows *were* freshly measured
+   while preparing this snapshot.
+4. **The seaborn top-1 split is a metric artifact, not a backend
+   weakness.** Four of six backends order files by traceability
+   rather than by patch site. The split says more about top-1 as a
+   metric than about any individual backend.
+
+## Hypothesis
+
+Stated as something to falsify, not declare:
+
+> Among the agent-loop-measured backends, **`codedb_CONTEXT`** is the
+> cheapest in the matched-quality cluster (3/4 top-1, 4/4 recall) on
+> file-localization. **`fts5_trigram`** is the only backend that
+> currently reaches 4/4 top-1, and it does so at ~20× the wall time
+> of `codedb_CONTEXT`. The expected next-step result, if a live
+> agent-loop runner is built and codegraph is re-measured under
+> matched methodology, is: **codegraph (primitive) joins the
+> 3/4-top-1 cluster at agent-loop call counts somewhere between
+> codedb_CONTEXT's 2.25 and leanctx's 9.75, with comparable
+> tokens.**
+
+This hypothesis is **falsifiable** by:
+
+- Building a live LLM-loop runner and re-measuring codegraph under
+  agent-loop methodology.
+- Expanding to 20–50 SWE-bench Lite instances — at that sample size
+  the quality differences (or lack of them) become statistically testable.
+- Adding a patch-correctness oracle (apply the agent's patch
+  against the pinned `base_commit` and run the failing tests).
+
+Until any of those is done, treat the headline as **directional**, not
+quantitative.
+
+## Future work
+
+- A live runner that actually invokes each backend per task with a
+  consistent LLM agent loop, so all rows are measured the same way.
+- A patch-correctness oracle.
+- More tasks.
+- Quality cells under the existing oracle are robust; everything
+  else is a calibration exercise.
diff --git a/benchmarks/swe-lite/replay.py b/benchmarks/swe-lite/replay.py
new file mode 100644
index 0000000..d9a9b39
--- /dev/null
+++ b/benchmarks/swe-lite/replay.py
@@ -0,0 +1,127 @@
+#!/usr/bin/env python3
+"""Replay + verify the SWE-bench Lite file-localization snapshot.
+
+This is NOT a live SWE-bench runner. It loads `results.json` (a frozen
+record of agent runs on 4 SWE-bench Lite instances, populated by hand
+from agent traces), recomputes the per-backend averages from the raw
+cells, and checks that they match the summary block. Then prints a
+dominance table.
+
+A live runner (that actually launches each backend, sends the issue
+text, captures the agent's `files` list, and patch-tests the result)
+is out of scope for this snapshot and tracked separately.
+ +Usage: + python3 replay.py # verify + print dominance table + python3 replay.py --json # print raw recomputed summary as JSON +""" +from __future__ import annotations + +import argparse +import json +import sys +from pathlib import Path +from statistics import mean + +SNAPSHOT = Path(__file__).resolve().parent / "results.json" + + +def recompute(snapshot: dict) -> dict: + by_backend: dict[str, dict] = {} + cells_by_backend: dict[str, list[dict]] = {} + for cell in snapshot["cells"]: + cells_by_backend.setdefault(cell["backend"], []).append(cell) + + n_tasks = len(snapshot["tasks"]) + for backend, cells in cells_by_backend.items(): + recall_hits = sum(1 for c in cells if c["recall"]) + top1_hits = sum(1 for c in cells if c["top_1"]) + by_backend[backend] = { + "recall": f"{recall_hits}/{n_tasks}", + "top_1": f"{top1_hits}/{n_tasks}", + "avg_tool_calls": round(mean(c["tool_calls"] for c in cells), 2), + "avg_wall_seconds": round(mean(c["wall_seconds"] for c in cells), 2), + "avg_tokens": round(mean(c["tokens"] for c in cells), 2), + } + return by_backend + + +def verify(snapshot: dict, recomputed: dict) -> list[str]: + errors: list[str] = [] + claimed = snapshot["summary"]["by_backend"] + for backend, claim in claimed.items(): + actual = recomputed.get(backend) + if actual is None: + errors.append(f"{backend}: claimed in summary but has no cells") + continue + for key in ("recall", "top_1"): + if claim[key] != actual[key]: + errors.append(f"{backend}.{key}: claimed {claim[key]} != actual {actual[key]}") + for key in ("avg_tool_calls", "avg_wall_seconds", "avg_tokens"): + if abs(float(claim[key]) - float(actual[key])) > 0.01: + errors.append(f"{backend}.{key}: claimed {claim[key]} != actual {actual[key]}") + return errors + + +def print_table(snapshot: dict, recomputed: dict) -> None: + backends = snapshot["backends"] + measurement = { + b: snapshot["summary"]["by_backend"][b].get("measurement") + for b in backends + } + rows = [("backend", "recall", "top-1", "avg 
calls", "avg wall (s)", "avg tokens")] + for backend in backends: + s = recomputed[backend] + label = backend + (" *" if measurement.get(backend) == "tool_output_only" else "") + rows.append(( + label, + s["recall"], + s["top_1"], + f"{s['avg_tool_calls']:.2f}", + f"{s['avg_wall_seconds']:.2f}", + f"{s['avg_tokens']:,.0f}", + )) + widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))] + sep = " ".join("-" * w for w in widths) + for i, row in enumerate(rows): + print(" ".join(cell.ljust(widths[j]) for j, cell in enumerate(row))) + if i == 0: + print(sep) + if any(m == "tool_output_only" for m in measurement.values()): + print() + print("* tool-output-only measurement (subprocess time + stdout bytes/4),") + print(" driven by a fixed query plan, NOT an LLM agent loop. Not directly") + print(" comparable to rows without an asterisk — see RESULTS.md for details.") +def main() -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--json", action="store_true", help="emit recomputed summary as JSON") + parser.add_argument("--snapshot", type=Path, default=SNAPSHOT, help="path to results.json") + args = parser.parse_args() + + snapshot = json.loads(args.snapshot.read_text()) + recomputed = recompute(snapshot) + errors = verify(snapshot, recomputed) + + if args.json: + print(json.dumps(recomputed, indent=2)) + else: + print(f"source: {snapshot['source']}") + print(f"frozen at: {snapshot['frozen_at']}") + print(f"tasks: {len(snapshot['tasks'])} ({', '.join(t['id'] for t in snapshot['tasks'])})") + print(f"backends: {len(snapshot['backends'])} ({', '.join(snapshot['backends'])})") + print() + print_table(snapshot, recomputed) + print() + print("headline:", snapshot["summary"]["headline"]) + + if errors: + print(file=sys.stderr) + print("VERIFY FAILED — summary does not match cells:", file=sys.stderr) + for err in errors: + print(f" - {err}", file=sys.stderr) + return 1 + return 0 + + +if __name__ == "__main__": + 
sys.exit(main()) diff --git a/benchmarks/swe-lite/results.json b/benchmarks/swe-lite/results.json new file mode 100644 index 0000000..6a64acc --- /dev/null +++ b/benchmarks/swe-lite/results.json @@ -0,0 +1,65 @@ +{ + "source": "SWE-bench Lite (princeton-nlp/SWE-bench_Lite) — file-localization shape", + "frozen_at": "2026-05-22T10:05Z", + "backend_versions": { + "codegraph": "0.9.3 (re-verified against v0.7.10 — file lists byte-identical)", + "codegraph_CONTEXT": "0.9.3" + }, + "scoring": "deterministic file-path match against gold patch's changed_files (no LLM judge)", + "metric_definitions": { + "recall": "gold file appears anywhere in agent's `files` list", + "top_1": "agent's first `files` entry equals the gold file" + }, + "measurement_notes": { + "default": "tokens reflect the agent's full context consumption (system prompt + tool defs + tool outputs + LLM reasoning); tool_calls and wall_seconds are end-to-end agent loop totals", + "tool_output_only": "tokens reflect only the tool's stdout bytes / 4 (no LLM context); tool_calls and wall_seconds reflect subprocess invocations driven by a fixed query plan, not an LLM-decided loop" + }, + "tasks": [ + {"id": "pallets__flask-4045", "repo": "pallets/flask", "base_commit": "d8c37f43724cd9fb0870f77877b7c4c7e38a19e0", "title": "Raise error when blueprint name contains a dot", "gold_files": ["src/flask/blueprints.py"]}, + {"id": "psf__requests-2148", "repo": "psf/requests", "base_commit": "fe693c492242ae532211e0c173324f09ca8cf227", "title": "socket.error exception not caught/wrapped in a requests exception", "gold_files": ["requests/models.py"]}, + {"id": "psf__requests-2674", "repo": "psf/requests", "base_commit": "0be38a0c37c59c4b66ce908731da15b401655113", "title": "urllib3 exceptions passing through requests API", "gold_files": ["requests/adapters.py"]}, + {"id": "mwaskom__seaborn-2848", "repo": "mwaskom/seaborn", "base_commit": "94621cef29f80282436d73e8d2c0aa76dab81273", "title": "pairplot fails with hue_order not 
containing all hue values", "gold_files": ["seaborn/_oldcore.py"]} + ], + "backends": ["codedb", "codedb_CONTEXT", "leanctx", "fts5_trigram", "codegraph", "codegraph_CONTEXT"], + "cells": [ + {"task": "pallets__flask-4045", "backend": "codedb", "files": ["src/flask/blueprints.py"], "tool_calls": 8, "wall_seconds": 12, "tokens": 17508, "recall": true, "top_1": true}, + {"task": "pallets__flask-4045", "backend": "codedb_CONTEXT", "files": ["src/flask/blueprints.py"], "tool_calls": 3, "wall_seconds": 2, "tokens": 14834, "recall": true, "top_1": true}, + {"task": "pallets__flask-4045", "backend": "leanctx", "files": ["src/flask/blueprints.py"], "tool_calls": 4, "wall_seconds": 8, "tokens": 17017, "recall": true, "top_1": true}, + {"task": "pallets__flask-4045", "backend": "fts5_trigram", "files": ["src/flask/blueprints.py"], "tool_calls": 13, "wall_seconds": 18, "tokens": 16288, "recall": true, "top_1": true}, + {"task": "pallets__flask-4045", "backend": "codegraph", "files": ["src/flask/blueprints.py", "src/flask/json/tag.py", "src/flask/wrappers.py", "src/flask/app.py"], "tool_calls": 3, "wall_seconds": 0.16, "tokens": 2235, "recall": true, "top_1": true, "measurement": "tool_output_only"}, + {"task": "pallets__flask-4045", "backend": "codegraph_CONTEXT", "files": ["src/flask/blueprints.py", "src/flask/helpers.py", "src/flask/app.py", "src/flask/scaffold.py"], "tool_calls": 1, "wall_seconds": 0.11, "tokens": 3788, "recall": true, "top_1": true, "measurement": "tool_output_only"}, + + {"task": "psf__requests-2148", "backend": "codedb", "files": ["requests/models.py", "requests/adapters.py", "requests/exceptions.py"], "tool_calls": 14, "wall_seconds": 18, "tokens": 20439, "recall": true, "top_1": true}, + {"task": "psf__requests-2148", "backend": "codedb_CONTEXT", "files": ["requests/models.py", "requests/exceptions.py"], "tool_calls": 2, "wall_seconds": 1, "tokens": 14516, "recall": true, "top_1": true}, + {"task": "psf__requests-2148", "backend": "leanctx", "files": 
["requests/models.py", "requests/adapters.py"], "tool_calls": 9, "wall_seconds": 28, "tokens": 32319, "recall": true, "top_1": true}, + {"task": "psf__requests-2148", "backend": "fts5_trigram", "files": ["requests/models.py", "requests/adapters.py"], "tool_calls": 11, "wall_seconds": 18, "tokens": 16427, "recall": true, "top_1": true}, + {"task": "psf__requests-2148", "backend": "codegraph", "files": ["requests/models.py", "requests/adapters.py", "requests/sessions.py", "requests/exceptions.py"], "tool_calls": 3, "wall_seconds": 0.16, "tokens": 1501, "recall": true, "top_1": true, "measurement": "tool_output_only"}, + {"task": "psf__requests-2148", "backend": "codegraph_CONTEXT", "files": ["requests/packages/urllib3/connection.py", "requests/packages/urllib3/util/ssl_.py"], "tool_calls": 1, "wall_seconds": 0.10, "tokens": 3440, "recall": false, "top_1": false, "measurement": "tool_output_only"}, + + {"task": "psf__requests-2674", "backend": "codedb", "files": ["requests/adapters.py", "requests/exceptions.py"], "tool_calls": 23, "wall_seconds": 18, "tokens": 24816, "recall": true, "top_1": true}, + {"task": "psf__requests-2674", "backend": "codedb_CONTEXT", "files": ["requests/adapters.py", "requests/models.py", "requests/exceptions.py"], "tool_calls": 2, "wall_seconds": 1, "tokens": 14725, "recall": true, "top_1": true}, + {"task": "psf__requests-2674", "backend": "leanctx", "files": ["requests/adapters.py"], "tool_calls": 6, "wall_seconds": 28, "tokens": 28060, "recall": true, "top_1": true}, + {"task": "psf__requests-2674", "backend": "fts5_trigram", "files": ["requests/adapters.py", "requests/exceptions.py"], "tool_calls": 8, "wall_seconds": 18, "tokens": 22767, "recall": true, "top_1": true}, + {"task": "psf__requests-2674", "backend": "codegraph", "files": ["requests/adapters.py", "requests/packages/urllib3/exceptions.py", "requests/sessions.py"], "tool_calls": 3, "wall_seconds": 0.16, "tokens": 1927, "recall": true, "top_1": true, "measurement": 
"tool_output_only"}, + {"task": "psf__requests-2674", "backend": "codegraph_CONTEXT", "files": ["requests/packages/urllib3/exceptions.py", "requests/packages/urllib3/util/timeout.py"], "tool_calls": 1, "wall_seconds": 0.10, "tokens": 3113, "recall": false, "top_1": false, "measurement": "tool_output_only"}, + + {"task": "mwaskom__seaborn-2848", "backend": "codedb", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 62, "wall_seconds": 120, "tokens": 89054, "recall": true, "top_1": false}, + {"task": "mwaskom__seaborn-2848", "backend": "codedb_CONTEXT", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 2, "wall_seconds": 1, "tokens": 14791, "recall": true, "top_1": false}, + {"task": "mwaskom__seaborn-2848", "backend": "leanctx", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 20, "wall_seconds": 45, "tokens": 43291, "recall": true, "top_1": false}, + {"task": "mwaskom__seaborn-2848", "backend": "fts5_trigram", "files": ["seaborn/_oldcore.py", "seaborn/relational.py"], "tool_calls": 23, "wall_seconds": 45, "tokens": 47720, "recall": true, "top_1": true}, + {"task": "mwaskom__seaborn-2848", "backend": "codegraph", "files": ["seaborn/axisgrid.py", "seaborn/_oldcore.py"], "tool_calls": 3, "wall_seconds": 0.18, "tokens": 2262, "recall": true, "top_1": false, "measurement": "tool_output_only"}, + {"task": "mwaskom__seaborn-2848", "backend": "codegraph_CONTEXT", "files": ["seaborn/_oldcore.py", "seaborn/axisgrid.py", "seaborn/_marks/base.py"], "tool_calls": 1, "wall_seconds": 0.12, "tokens": 6245, "recall": true, "top_1": true, "measurement": "tool_output_only"} + ], + "summary": { + "by_backend": { + "codedb": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 26.75, "avg_wall_seconds": 42.0, "avg_tokens": 37954.25}, + "codedb_CONTEXT": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 2.25, "avg_wall_seconds": 1.25, "avg_tokens": 14716.5}, + "leanctx": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 
9.75, "avg_wall_seconds": 27.25, "avg_tokens": 30171.75}, + "fts5_trigram": {"recall": "4/4", "top_1": "4/4", "avg_tool_calls": 13.75, "avg_wall_seconds": 24.75, "avg_tokens": 25800.5}, + "codegraph": {"recall": "4/4", "top_1": "3/4", "avg_tool_calls": 3.0, "avg_wall_seconds": 0.165, "avg_tokens": 1981.25, "measurement": "tool_output_only"}, + "codegraph_CONTEXT": {"recall": "2/4", "top_1": "2/4", "avg_tool_calls": 1.0, "avg_wall_seconds": 0.1075, "avg_tokens": 4146.5, "measurement": "tool_output_only"} + }, + "headline": "Six backends, four SWE-bench Lite instances. Quality is broadly similar — five of six achieve 4/4 recall (codegraph_CONTEXT is the only outlier at 2/4, missing both `requests` tasks by surfacing urllib3 internals over the requests-layer wrapper). Top-1 splits: fts5_trigram 4/4; codedb / codedb_CONTEXT / leanctx / codegraph tie at 3/4 (the seaborn axisgrid/_oldcore call-trace ordering); codegraph_CONTEXT at 2/4. Efficiency cells for the codegraph rows reflect subprocess-only measurement under a fixed query plan, not a full LLM agent loop — they are not directly comparable to the other rows' agent-loop numbers.", + "hypothesis": "If a comparable LLM-driven agent loop were run against codegraph's primitive surface, recall would likely hold (4/4 found on the deterministic file-path oracle is shape-independent), but tool_calls and tokens would rise to LLM-loop levels. The interesting open question is whether codegraph_CONTEXT's `requests`-task miss is fixable by prompt engineering (it surfaces urllib3, the gold file is requests/adapters.py / requests/models.py) or whether it reflects a graph-relevance bias toward leaf libraries over wrapper APIs." + } +}