Source
Local source note: /Users/azalio/Downloads/Telegram Desktop/codegraph_the_open_source_knowledge_graph_that_makes_ai_coding_t.md, extracted from the Medium article "CodeGraph: The Open-Source Knowledge Graph That Makes AI Coding Tools Dramatically Cheaper" (https://medium.com/kd-agentic/codegraph-the-open-source-knowledge-graph-that-makes-ai-coding-tools-dramatically-cheaper-190f8b89f8a7).
Source-specific idea used here: CodeGraph claims token, tool-call, cost, and speed wins by replacing mechanical file exploration with a prebuilt local code map. The actionable part for MAP is to add a controlled evaluation path that can prove whether a structural map actually reduces broad exploration while preserving localization quality.
Relevant source takeaways
The article reports aggregate reductions in tokens, tool calls, cost, and latency across several repos, with larger gains on larger/more interconnected codebases.
It emphasizes that savings come from eliminating file-by-file exploration, not from changing the model.
It calls out per-project variance; small/simple repos may see limited gains. MAP should therefore measure before making a structural map part of default workflow behavior.
It also notes that a code graph helps exploration, not reasoning-heavy design. MAP's benchmark should separate localization quality from downstream implementation correctness.
Repo evidence
docs/ARCHITECTURE.md:25 documents per-subtask token accounting rolled into token_accounting.json with cost, cache-hit ratio, and advisory research ROI.
docs/USAGE.md:1734-1741 documents mapify research-eval score, which scores ResearchEvidence localization quality using file-level and line-overlap precision/recall/F1.
src/mapify_cli/templates_src/agents/research-agent.md.jinja:30-38 says the research artifact is the compressed context that enters Actor's context window, and lines 58-80 of the same file define the output shape. That makes ResearchEvidence a good benchmark boundary.
src/mapify_cli/templates_src/codex/agents/researcher.toml.jinja:57-83 bounds research output and search strategy, but no benchmark compares different discovery methods under the same expected locations.
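For reference, the file-level and line-overlap precision/recall/F1 mentioned above can be sketched as follows. This is a minimal illustration, not the code in research_eval.py: the `Location` record, the inclusive line-range convention, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical location record: a file path plus an inclusive line range."""
    path: str
    start: int
    end: int


def prf(true_positive: int, predicted: int, expected: int) -> dict:
    """Precision, recall, and F1 from raw counts; empty denominators score 0.0."""
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def file_level(predicted: list, expected: list) -> dict:
    """Score on the set of distinct file paths only."""
    pred_files = {loc.path for loc in predicted}
    exp_files = {loc.path for loc in expected}
    return prf(len(pred_files & exp_files), len(pred_files), len(exp_files))


def _lines(locations: list) -> set:
    """Expand locations into a set of (path, line_number) pairs."""
    return {
        (loc.path, n)
        for loc in locations
        for n in range(loc.start, loc.end + 1)
    }


def line_overlap(predicted: list, expected: list) -> dict:
    """Score on individual lines, so partially overlapping ranges get partial credit."""
    pred_lines = _lines(predicted)
    exp_lines = _lines(expected)
    return prf(len(pred_lines & exp_lines), len(pred_lines), len(exp_lines))


expected = [Location("a.py", 10, 19)]
predicted = [Location("a.py", 15, 24), Location("b.py", 1, 5)]
file_scores = file_level(predicted, expected)
line_scores = line_overlap(predicted, expected)
```

In this example the prediction finds the right file but also reports an unrelated one, so file-level recall is 1.0 while precision is 0.5. The line-overlap score is lower still, because only half of the expected range is covered.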
Existing issue search
Commands/searches used:
gh issue list --state all --limit 100 --search "CodeGraph OR \"knowledge graph\" OR \"call graph\" OR tree-sitter OR symbols OR \"repo insight\" OR \"repository map\" OR \"token reduction\" OR \"research ROI\"" returned no direct CodeGraph/structural-map benchmark matches.
gh issue list --state all --limit 100 --search "affected_files research-agent token accounting" returned no matches.
gh issue list --state all --limit 100 --search "tree-sitter" returned no matches.
Why this is not already covered
MAP can already score localization quality and report token accounting, but there is no benchmark fixture that runs the same discovery task through two strategies and checks both quality and mechanical exploration cost. Without that, a structural-map provider could be added but remain unproven, or worse, optimize tokens while degrading evidence quality.
Problem
It would be tempting to enable a code-map integration on the strength of external claims alone. MAP needs its own eval gate: does structural discovery reduce broad search/read/tool calls for MAP-style ResearchEvidence without lowering file/line precision or recall? The current research eval and token accounting pieces are adjacent, but they are not yet connected into this decision.
Proposed slice
Add a deterministic benchmark/eval path for structural discovery ROI.
Concrete first slice:
Extend mapify research-eval or add a sibling command that can compare two saved ResearchEvidence runs for the same fixture/task: baseline glob_grep vs structural-map/provider-backed discovery.
Track at least: files scanned/read, broad search count when available, returned location count, precision/recall/F1 against expected locations, estimated tokens (or recorded token usage if a transcript or token log is available), and wall-clock time as an advisory metric only.
Add fixture tasks representing MAP-relevant discovery: import chain impact, caller/reference lookup, route/entrypoint lookup if fixture supports it, and ambiguous symbol names.
Define pass criteria that forbid token-only wins: the structural-map arm must meet a localization quality floor and must not return stale or nonexistent paths.
Surface the result in docs as an implementation gate for any future default/auto use of structural discovery.
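The pass criteria in this slice could look roughly like the sketch below. Everything here is hypothetical: the `ArmResult` fields, the `gate` function, and the default floor of 0.8 are placeholders chosen for illustration, not existing MAP names or agreed thresholds. The point is only that quality is decided first and cost deltas are reported alongside it, never folded into the verdict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArmResult:
    """Hypothetical summary of one discovery arm on one fixture task."""
    name: str
    file_f1: float
    line_f1: float
    files_read: int
    broad_searches: int
    stale_paths: int = 0
    tokens: Optional[int] = None  # only set when a token log is available


def gate(baseline: ArmResult, candidate: ArmResult,
         quality_floor: float = 0.8, max_regression: float = 0.0) -> dict:
    """Decide pass/fail on quality alone; report cost deltas separately."""
    reasons = []
    if candidate.stale_paths > 0:
        reasons.append("stale or nonexistent paths returned")
    for metric in ("file_f1", "line_f1"):
        cand = getattr(candidate, metric)
        base = getattr(baseline, metric)
        if cand < quality_floor:
            reasons.append(f"{metric} below quality floor")
        if cand < base - max_regression:
            reasons.append(f"{metric} regressed against baseline")
    # Negative deltas mean the candidate explored less than the baseline.
    cost_delta = {
        "files_read": candidate.files_read - baseline.files_read,
        "broad_searches": candidate.broad_searches - baseline.broad_searches,
    }
    if baseline.tokens is not None and candidate.tokens is not None:
        cost_delta["tokens"] = candidate.tokens - baseline.tokens
    return {
        "quality_pass": not reasons,
        "reasons": reasons,
        "cost_delta": cost_delta,
    }


baseline = ArmResult("glob_grep", 0.90, 0.85, files_read=40, broad_searches=12)
cheap_but_worse = ArmResult("structural_map", 0.70, 0.60, files_read=5, broad_searches=1)
verdict = gate(baseline, cheap_but_worse)
```

Here the structural-map arm reads 35 fewer files but fails the gate, because both F1 scores fall below the floor and below the baseline. A cost reduction is only meaningful when `quality_pass` is true.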
Acceptance criteria
A test fixture can compare glob_grep ResearchEvidence vs structural-map ResearchEvidence for the same expected locations.
The scorer reports quality metrics and exploration-cost metrics separately; token/tool-call reduction cannot mask lower precision/recall.
Stale/missing-path outputs fail or are clearly flagged.
The benchmark can run without provider credentials and without external network access.
Docs explain when the result should block enabling structural-map-first behavior.
The implementation reuses existing research_eval.py where practical instead of introducing an unrelated metric stack.
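The stale/missing-path criterion can be checked offline against a fixture directory, with no credentials or network access. The helper below is a sketch under assumptions: `find_stale_paths` is a hypothetical name, and it assumes ResearchEvidence paths are reported relative to the fixture root.

```python
from pathlib import Path


def find_stale_paths(reported_paths: list, fixture_root: str) -> list:
    """Return reported paths that do not resolve to a file inside the fixture.

    Paths that escape the fixture root (absolute paths or '..' traversal)
    are treated as stale as well, since they cannot be valid evidence.
    """
    root = Path(fixture_root).resolve()
    stale = []
    for raw in reported_paths:
        candidate = (root / raw).resolve()
        inside = root in candidate.parents
        if not inside or not candidate.is_file():
            stale.append(raw)
    return stale
```

A scorer could either fail the run outright when this list is non-empty or attach the list to the report as a flagged field; which of the two to use is a design decision left open by the criteria above.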
Guardrails
Do not claim production cost savings without a recorded MAP benchmark.
Do not use shadow-mode rollout; run explicit evals and gate behavior directly.
Do not make the benchmark depend on live Claude/Codex credentials for its core pass/fail path.
Do not optimize for token reduction alone; file/line evidence quality is non-negotiable.
Do not paste long source text from the article into docs or issues; summarize and link.