[CodeGraph] Add structural-discovery ROI eval for MAP research #311

## Source

Local source note: `/Users/azalio/Downloads/Telegram Desktop/codegraph_the_open_source_knowledge_graph_that_makes_ai_coding_t.md`, extracted from Medium article "CodeGraph: The Open-Source Knowledge Graph That Makes AI Coding Tools Dramatically Cheaper" (`https://medium.com/kd-agentic/codegraph-the-open-source-knowledge-graph-that-makes-ai-coding-tools-dramatically-cheaper-190f8b89f8a7`).

Source-specific idea used here: CodeGraph claims token/tool-call/cost/speed wins by replacing mechanical file exploration with a prebuilt local code map. The actionable part for MAP is to add a controlled evaluation path that can prove whether a structural map actually reduces broad exploration while preserving localization quality.

## Relevant source takeaways

- The article reports aggregate reductions in tokens, tool calls, cost, and latency across several repos, with larger gains on larger/more interconnected codebases.
- It emphasizes that savings come from eliminating file-by-file exploration, not from changing the model.
- It calls out per-project variance: small or simple repos may see limited gains. MAP should therefore measure the effect before making a structural map part of the default workflow.
- It also notes that a code graph helps exploration, not reasoning-heavy design. MAP's benchmark should separate localization quality from downstream implementation correctness.

## Repo evidence

- `docs/ARCHITECTURE.md:25` documents per-subtask token accounting rolled into `token_accounting.json` with cost, cache-hit ratio, and advisory research ROI.
- `docs/USAGE.md:1734-1741` documents `mapify research-eval score`, which scores ResearchEvidence localization quality using file-level and line-overlap precision/recall/F1.
- #202 is closed and added advisory research ROI surfaces, but its body is about reporting research-agent/researcher cost and run-health metrics, not an A/B harness for structural-map vs grep-based discovery.
- #203 is closed and discourages repeated broad exploration after high-confidence research. It does not measure whether an alternate discovery provider reduced exploration.
- `src/mapify_cli/templates_src/agents/research-agent.md.jinja:30-38` says the research artifact is the compressed context that enters Actor's context window, and `:58-80` defines the output shape. That makes ResearchEvidence a good benchmark boundary.
- `src/mapify_cli/templates_src/codex/agents/researcher.toml.jinja:57-83` bounds research output and search strategy, but no benchmark compares different discovery methods under the same expected locations.
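For readers unfamiliar with the scoring mentioned above, the two granularities can be sketched as set overlap. This is an illustrative sketch only, not the code in `research_eval.py`: the `Location` type, the `score` function, and the choice to treat line ranges as inclusive sets of `(path, line)` pairs are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical evidence location; line bounds are 1-indexed and inclusive."""
    path: str
    start: int
    end: int


def _lines(locations):
    """Expand locations into a set of (path, line_number) pairs."""
    return {(loc.path, n) for loc in locations for n in range(loc.start, loc.end + 1)}


def _prf(predicted, expected):
    """Precision, recall, and F1 for two sets; empty inputs score 0.0."""
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def score(predicted, expected):
    """Score predicted locations at file level and at line-overlap level."""
    return {
        "file": _prf({loc.path for loc in predicted}, {loc.path for loc in expected}),
        "line": _prf(_lines(predicted), _lines(expected)),
    }
```

A prediction that finds the right file but only half of the expected range, plus one unrelated file, scores higher at file level than at line level, which is why both views are worth keeping.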

## Existing issue search

Commands/searches used:

- `gh issue list --state all --limit 100 --search "CodeGraph OR \"knowledge graph\" OR \"call graph\" OR tree-sitter OR symbols OR \"repo insight\" OR \"repository map\" OR \"token reduction\" OR \"research ROI\""` returned no direct CodeGraph/structural-map benchmark matches.
- `gh issue list --state all --limit 100 --search "affected_files research-agent token accounting"` returned no matches.
- `gh issue list --state all --limit 100 --search "tree-sitter"` returned no matches.
- #200 "Add localization-quality evaluation for research-agent outputs" is closed and covers scoring evidence quality, not tool-call/token comparison between discovery strategies.
- #202 "Report research ROI in token accounting and run-health artifacts" is closed and covers reporting research cost, not controlled benchmark cases.
- #289 "Token accounting dashboard" is closed and visualizes cost/trends, not an eval harness for code-map ROI.

## Why this is not already covered

MAP can already score localization quality and report token accounting, but there is no benchmark fixture that runs the same discovery task through two strategies and checks both quality and mechanical exploration cost. Without that, a structural-map provider could be added but remain unproven, or worse, optimize tokens while degrading evidence quality.

## Problem

It would be tempting to enable a code-map integration on the strength of external claims alone. MAP needs its own eval gate: does structural discovery reduce broad search, read, and tool calls for MAP-style ResearchEvidence without lowering file/line precision? The current research eval and token accounting pieces are adjacent but are not yet connected in a way that answers this question.

## Proposed slice

Add a deterministic benchmark/eval path for structural discovery ROI.

Concrete first slice:

- Extend `mapify research-eval` or add a sibling command that can compare two saved ResearchEvidence runs for the same fixture/task: baseline `glob_grep` vs structural-map/provider-backed discovery.
- Track at least: files scanned/read; broad search count, when available; returned location count; precision/recall/F1 against expected locations; estimated tokens, or recorded token usage if a transcript or token log is available; and wall-clock time, as an advisory metric only.
- Add fixture tasks representing MAP-relevant discovery: import chain impact, caller/reference lookup, route/entrypoint lookup if the fixture supports it, and ambiguous symbol names.
- Define pass criteria that forbid token-only wins: the structural-map arm must meet a localization quality floor and must not return stale/nonexistent paths.
- Surface the result in docs as an implementation gate for any future default/auto use of structural discovery.
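The comparison described in the list above could look roughly like the following. It is a minimal sketch under stated assumptions: `ArmResult`, `compare`, the field names, and the default `f1_floor` of 0.8 are hypothetical and not part of the existing `mapify` code. The pass decision depends only on quality checks, and cost deltas are reported alongside without being able to rescue a failing arm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArmResult:
    """Hypothetical summary of one saved ResearchEvidence run."""
    name: str
    f1: float                # line-level F1 against expected locations
    files_read: int
    broad_searches: int
    tokens: Optional[int]    # None when no token log was recorded
    stale_paths: int         # returned paths that do not exist in the fixture


def compare(baseline: ArmResult, candidate: ArmResult, f1_floor: float = 0.8) -> dict:
    """Report quality and cost separately; only quality decides pass/fail."""
    failures = []
    if candidate.f1 < f1_floor:
        failures.append(f"f1 {candidate.f1:.2f} is below floor {f1_floor:.2f}")
    if candidate.stale_paths > 0:
        failures.append(f"{candidate.stale_paths} stale or nonexistent path(s)")

    token_delta = None
    if baseline.tokens is not None and candidate.tokens is not None:
        token_delta = candidate.tokens - baseline.tokens

    return {
        "quality": {"baseline_f1": baseline.f1, "candidate_f1": candidate.f1},
        "cost": {
            # Negative deltas mean the candidate arm explored less.
            "files_read_delta": candidate.files_read - baseline.files_read,
            "broad_searches_delta": candidate.broad_searches - baseline.broad_searches,
            "tokens_delta": token_delta,
        },
        "passed": not failures,
        "failures": failures,
    }
```

With this shape, a structural-map run that reads far fewer files but falls below the quality floor still reports its cost savings, and still fails.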

## Acceptance criteria

- A test fixture can compare `glob_grep` ResearchEvidence vs structural-map ResearchEvidence for the same expected locations.
- The scorer reports quality metrics and exploration-cost metrics separately; token/tool-call reduction cannot mask lower precision/recall.
- Stale/missing-path outputs fail or are clearly flagged.
- The benchmark can run without provider credentials and without external network access.
- Docs explain when the result should block enabling structural-map-first behavior.
- The implementation reuses existing `research_eval.py` where practical instead of introducing an unrelated metric stack.
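The stale/missing-path criterion can be checked offline against the fixture checkout. The sketch below assumes that ResearchEvidence paths are relative to the fixture root; `find_stale_paths` is a hypothetical helper name, not an existing function in `research_eval.py`. Paths that escape the fixture root are treated as stale as well.

```python
from pathlib import Path


def find_stale_paths(paths, fixture_root):
    """Return the reported paths that are not regular files inside fixture_root."""
    root = Path(fixture_root).resolve()
    stale = []
    for raw in paths:
        candidate = (root / raw).resolve()
        # Reject anything that resolves outside the fixture, such as "../x.py".
        inside = root in candidate.parents
        if not (inside and candidate.is_file()):
            stale.append(raw)
    return stale
```

Because it only touches the local filesystem, a check like this needs neither provider credentials nor network access.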

## Guardrails

- Do not claim production cost savings without a recorded MAP benchmark.
- Do not use shadow-mode rollout; run explicit evals and gate behavior directly.
- Do not make the benchmark depend on live Claude/Codex credentials for its core pass/fail path.
- Do not optimize for token reduction alone; file/line evidence quality is non-negotiable.
- Do not paste long source text from the article into docs or issues; summarize and link.

