[CodeGraph] Add structural-discovery ROI eval for MAP research #311

Description

@azalio

Source

Local source note: /Users/azalio/Downloads/Telegram Desktop/codegraph_the_open_source_knowledge_graph_that_makes_ai_coding_t.md, extracted from Medium article "CodeGraph: The Open-Source Knowledge Graph That Makes AI Coding Tools Dramatically Cheaper" (https://medium.com/kd-agentic/codegraph-the-open-source-knowledge-graph-that-makes-ai-coding-tools-dramatically-cheaper-190f8b89f8a7).

Source-specific idea used here: CodeGraph claims token/tool-call/cost/speed wins by replacing mechanical file exploration with a prebuilt local code map. The actionable part for MAP is to add a controlled evaluation path that can prove whether a structural map actually reduces broad exploration while preserving localization quality.

Relevant source takeaways

  • The article reports aggregate reductions in tokens, tool calls, cost, and latency across several repos, with larger gains on larger/more interconnected codebases.
  • It emphasizes that savings come from eliminating file-by-file exploration, not from changing the model.
  • It calls out per-project variance; small/simple repos may see limited gains. MAP should therefore measure before making a structural map part of default workflow behavior.
  • It also notes that a code graph helps exploration, not reasoning-heavy design. MAP's benchmark should separate localization quality from downstream implementation correctness.

Repo evidence

  • docs/ARCHITECTURE.md:25 documents per-subtask token accounting rolled into token_accounting.json with cost, cache-hit ratio, and advisory research ROI.
  • docs/USAGE.md:1734-1741 documents mapify research-eval score, which scores ResearchEvidence localization quality using file-level and line-overlap precision/recall/F1.
  • #202 ("Report research ROI in token accounting and run-health artifacts") is closed and added advisory research ROI surfaces, but its body covers reporting research-agent/researcher cost and run-health metrics, not an A/B harness for structural-map vs grep-based discovery.
  • #203 ("Teach Actor to consume research evidence without repeating broad exploration") is closed and discourages repeated broad exploration after high-confidence research. It does not measure whether an alternate discovery provider reduced exploration.
  • src/mapify_cli/templates_src/agents/research-agent.md.jinja:30-38 says the research artifact is the compressed context that enters Actor's context window, and :58-80 defines the output shape. That makes ResearchEvidence a good benchmark boundary.
  • src/mapify_cli/templates_src/codex/agents/researcher.toml.jinja:57-83 bounds research output and search strategy, but no benchmark compares different discovery methods under the same expected locations.
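
For orientation, file-level and line-overlap precision/recall/F1 of the kind described for mapify research-eval score could be computed roughly as follows. This is a minimal sketch, not the code in research_eval.py: the `Location` type, the function names, and the choice to treat each (path, line) pair as one unit are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Location:
    """Hypothetical evidence location: a path plus an inclusive line range."""
    path: str
    start: int  # 1-indexed, inclusive
    end: int    # inclusive


def _prf(tp: float, n_pred: float, n_exp: float) -> Tuple[float, float, float]:
    """Precision, recall, F1 from a true-positive count and the two set sizes."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_exp if n_exp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def _lines(locs: Iterable[Location]) -> Set[Tuple[str, int]]:
    """Expand locations into a set of (path, line_number) pairs."""
    return {(loc.path, n) for loc in locs for n in range(loc.start, loc.end + 1)}


def file_level_prf(predicted, expected):
    """Score only whether the right files were named."""
    pred_files = {loc.path for loc in predicted}
    exp_files = {loc.path for loc in expected}
    return _prf(len(pred_files & exp_files), len(pred_files), len(exp_files))


def line_overlap_prf(predicted, expected):
    """Score how much of the predicted line ranges overlap the expected ones."""
    pred, exp = _lines(predicted), _lines(expected)
    return _prf(len(pred & exp), len(pred), len(exp))
```

Because both functions take plain lists of locations, the same expected set can score a glob_grep run and a structural-map run without either arm knowing how the other discovered its evidence.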

Existing issue search

Commands/searches used:

Why this is not already covered

MAP can already score localization quality and report token accounting, but there is no benchmark fixture that runs the same discovery task through two strategies and checks both quality and mechanical exploration cost. Without that, a structural-map provider could be added but remain unproven, or worse, optimize tokens while degrading evidence quality.

Problem

It would be tempting to enable a code-map integration on the strength of external claims alone. MAP needs its own eval gate: does structural discovery reduce broad search/read/tool calls for MAP-style ResearchEvidence without lowering file/line precision? The current research eval and token accounting pieces are adjacent, but they are not connected in a way that supports this decision.

Proposed slice

Add a deterministic benchmark/eval path for structural discovery ROI.

Concrete first slice:

  • Extend mapify research-eval or add a sibling command that can compare two saved ResearchEvidence runs for the same fixture/task: baseline glob_grep vs structural-map/provider-backed discovery.
  • Track at least: files scanned/read, broad search count when available, returned location count, precision/recall/F1 against expected locations, estimated tokens or recorded token usage if a transcript/token log is available, and wall-clock time as an advisory metric only.
  • Add fixture tasks representing MAP-relevant discovery: import chain impact, caller/reference lookup, route/entrypoint lookup if fixture supports it, and ambiguous symbol names.
  • Define pass criteria that forbid token-only wins: structural-map arm must meet a localization quality floor and must not return stale/nonexistent paths.
  • Surface the result in docs as an implementation gate for any future default/auto use of structural discovery.
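
One possible shape for the two-arm comparison is sketched below. It is illustrative only: `ArmMetrics`, its field names, and `compare_arms` are hypothetical and do not reflect an existing MAP schema. The sketch assumes each arm's metrics have already been extracted from a saved ResearchEvidence run, and it leaves optional metrics as `None` when they were not recorded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArmMetrics:
    """Metrics for one discovery arm on one fixture task (hypothetical schema)."""
    name: str                              # e.g. "glob_grep" or "structural_map"
    precision: float
    recall: float
    f1: float
    files_read: int
    returned_locations: int
    broad_searches: Optional[int] = None   # None when not recorded
    tokens: Optional[int] = None           # None when no token log is available
    wall_clock_s: Optional[float] = None   # advisory only


def _delta(base, cand):
    """Candidate minus baseline, or None if either side was not recorded."""
    if base is None or cand is None:
        return None
    return cand - base


def compare_arms(baseline: ArmMetrics, candidate: ArmMetrics) -> dict:
    """Report quality and exploration-cost deltas in separate sections."""
    return {
        "arms": {"baseline": baseline.name, "candidate": candidate.name},
        "quality": {
            k: _delta(getattr(baseline, k), getattr(candidate, k))
            for k in ("precision", "recall", "f1")
        },
        "cost": {
            k: _delta(getattr(baseline, k), getattr(candidate, k))
            for k in ("files_read", "broad_searches", "returned_locations", "tokens")
        },
        "advisory": {
            "wall_clock_s": _delta(baseline.wall_clock_s, candidate.wall_clock_s)
        },
    }
```

Keeping quality, cost, and advisory deltas in separate keys means no single blended score exists for a token saving to hide behind.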

Acceptance criteria

  • A test fixture can compare glob_grep ResearchEvidence vs structural-map ResearchEvidence for the same expected locations.
  • The scorer reports quality metrics and exploration-cost metrics separately; token/tool-call reduction cannot mask lower precision/recall.
  • Stale/missing-path outputs fail or are clearly flagged.
  • The benchmark can run without provider credentials and without external network access.
  • Docs explain when the result should block enabling structural-map-first behavior.
  • The implementation reuses existing research_eval.py where practical instead of introducing an unrelated metric stack.
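
The quality-floor and stale-path criteria above could be enforced by a small gate such as the following. This is a sketch under stated assumptions: `evaluate_gate`, `find_stale_paths`, the `f1_floor` default of 0.8, and the rule that the candidate must not score below the baseline are all hypothetical choices, not values taken from MAP. Cost metrics are deliberately not an input, so a token or tool-call reduction cannot change the verdict.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def find_stale_paths(
    paths: Iterable[str],
    repo_root: str = ".",
    path_exists: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Return returned paths that do not exist as files under repo_root."""
    if path_exists is None:
        root = Path(repo_root)
        path_exists = lambda p: (root / p).is_file()
    return sorted(p for p in set(paths) if not path_exists(p))


def evaluate_gate(
    candidate_f1: float,
    baseline_f1: float,
    returned_paths: Iterable[str],
    *,
    f1_floor: float = 0.8,   # hypothetical floor; tune per fixture
    repo_root: str = ".",
    path_exists: Optional[Callable[[str], bool]] = None,
) -> dict:
    """Pass only if quality holds and every returned path exists."""
    stale = find_stale_paths(returned_paths, repo_root, path_exists)
    reasons = []
    if stale:
        reasons.append(f"stale or nonexistent paths: {stale}")
    if candidate_f1 < f1_floor:
        reasons.append(f"F1 {candidate_f1:.3f} is below floor {f1_floor:.3f}")
    if candidate_f1 < baseline_f1:
        reasons.append(f"F1 {candidate_f1:.3f} is below baseline {baseline_f1:.3f}")
    return {"passed": not reasons, "reasons": reasons, "stale_paths": stale}
```

The injectable `path_exists` callable lets the gate run against an in-memory fixture listing, which keeps the core pass/fail path free of credentials and network access.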

Guardrails

  • Do not claim production cost savings without a recorded MAP benchmark.
  • Do not use shadow-mode rollout; run explicit evals and gate behavior directly.
  • Do not make the benchmark depend on live Claude/Codex credentials for its core pass/fail path.
  • Do not optimize for token reduction alone; file/line evidence quality is non-negotiable.
  • Do not paste long source text from the article into docs or issues; summarize and link.

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests