feat(go): generated-file down-rank + gRPC stub-impl bridge + trace-failure inlining #494
Open
colbymchenry wants to merge 12 commits into
…ilure inlining
Multi-pronged fix to make codegraph competitive on Go multi-module repos
(cosmos-sdk, etcd) where it previously lost or tied. Driven by an 8-question
agent-eval audit across cobra, gin, prometheus, cosmos-sdk, and etcd: the
baseline had codegraph losing ~60% on cost on cosmos-sdk and mixed on etcd
deep cross-module flows, while winning cleanly on the single-module and
non-protobuf-heavy repos.
Diagnostics ruled OUT `go.work` parsing as the gap (prometheus crushes
without it). The actual failure modes were generated-file noise warping
disambiguation, missing gRPC interface→impl bridge in structural-typing Go,
and trace's failure path triggering 3-5 follow-up tool calls instead of
inlining the material the agent needed.
Changes:
- New `src/extraction/generated-detection.ts` — path-pattern classifier
for `.pb.go`, `.pulsar.go`, `_grpc.pb.go`, `_mock.go`, `_mocks.go`,
`mock_*.go`, `.generated.[jt]sx?`, `_pb2(_grpc)?.py`, `.pb.{cc,h}`,
`.g.dart`, `.freezed.dart`. Applied as a stable sort tiebreaker in
`findSymbol`, `findAllSymbols`, `codegraph_search` (MCP + CLI),
`codegraph_explore` file ranking, and context formatter Entry Points /
Related Symbols / Code blocks. Cosmos's `msgServer.Send` now ranks #3
instead of #9 on a `Send` search.
- New `goGrpcStubImplEdges` synthesizer in `callback-synthesizer.ts` —
detects `UnimplementedXxxServer` structs in generated files, identifies
their RPC methods (excluding `mustEmbed*` / `testEmbeddedByValue` gRPC
markers), and emits `calls` edges to the matching methods on any
non-generated struct whose method-name set is a superset. Closes Go's
structural-typing gap that the existing `interfaceOverrideEdges` (Java /
Kotlin only) couldn't bridge. 467 bridge edges on cosmos-sdk; bank's
`UnimplementedMsgServer::Send` points to `x/bank/keeper/msg_server.go`
only, not to `msgClient` siblings or mock files.
- Trace-failure rewrite (`handleTrace`) — when no static path connects
endpoints, instead of telling the agent to call `codegraph_node` (a
3-4-call fan-out), inline both endpoints' bodies (120 lines / 3600 chars
per endpoint), their callers (≤6), and callees (≤8) in one response.
- Trace endpoint-pairing improvements — scores every `from`×`to`
candidate combo by shared directory prefix and tries the best-paired
pair first (the full candidate set, not just FTS top-5). A
less-canonical-path penalty (`enterprise/`, `contrib/`, `examples/`,
`vendor/`, `third_party/`, `deprecated/`, `legacy/`) ensures the
canonical-module pair wins even when a side-experiment shares more of
its directory prefix. Find-path probe budget capped at 20 pairs.
- Test-file deprioritization in `codegraph_explore` `isLowValue` — adds
suffix patterns (`_test.go`, `_spec.rb`, `.test.ts`, `.spec.tsx`,
`Test.java`, `Spec.kt`) alongside the existing directory-style patterns.
Otherwise etcd's `watchable_store_test.go` consumes 5K chars of explore
budget that should go to the hand-written flow source.
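A minimal sketch of how the generated-file classifier and the stable tiebreaker described above could fit together. The pattern list copies the suffixes named in the first bullet; `isGeneratedFile`, `rankWithGeneratedTiebreak`, and the `Ranked` shape are hypothetical names for illustration, not the actual exports of `generated-detection.ts`.

```typescript
// Hypothetical sketch: patterns mirror the suffixes listed above; the real
// classifier may be organized differently.
const GENERATED_PATTERNS: RegExp[] = [
  /\.pb\.go$/, // also covers _grpc.pb.go
  /\.pulsar\.go$/,
  /_mocks?\.go$/,
  /(^|\/)mock_[^/]*\.go$/,
  /\.generated\.[jt]sx?$/,
  /_pb2(_grpc)?\.py$/,
  /\.pb\.(cc|h)$/,
  /\.g\.dart$/,
  /\.freezed\.dart$/,
];

function isGeneratedFile(filePath: string): boolean {
  return GENERATED_PATTERNS.some((re) => re.test(filePath));
}

// Assumed result shape; only filePath and score matter for the ordering.
interface Ranked {
  name: string;
  filePath: string;
  score: number;
}

// Higher score wins. On equal scores, hand-written files sort ahead of
// generated ones. Array.prototype.sort is stable (ES2019), so any remaining
// ties keep their incoming order.
function rankWithGeneratedTiebreak<T extends Ranked>(results: T[]): T[] {
  return [...results].sort((a, b) => {
    if (a.score !== b.score) return b.score - a.score;
    return Number(isGeneratedFile(a.filePath)) - Number(isGeneratedFile(b.filePath));
  });
}
```

Because the generated flag is only consulted when scores are equal, a generated symbol that genuinely outscores a hand-written one keeps its position; the sketch assumes that is the intended meaning of "tiebreaker".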
Tests:
- New `__tests__/generated-detection.test.ts` (4 unit tests) pins the
suffix patterns.
- New "Go gRPC stub→impl synthesis" integration test suite in
`frameworks-integration.test.ts` (2 tests): positive bridge from stub
to hand-written impl, AND the precision case (don't bridge to a
generated sibling like `msgClient` in the same .pb.go).
- Full suite: 1076/1076 pass.
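The positive and precision cases those integration tests describe can be illustrated with a small sketch of the superset match. This is an assumption-laden outline, not the `goGrpcStubImplEdges` code itself: `StructInfo`, `BridgeEdge`, and `bridgeStubToImpls` are hypothetical, and the `Stub::Method` edge naming simply follows the notation used above.

```typescript
// Hypothetical input shape: one entry per Go struct with its method names.
interface StructInfo {
  name: string;
  filePath: string;
  isGenerated: boolean;
  methods: string[];
}

interface BridgeEdge {
  kind: "calls";
  from: string;
  to: string;
}

// gRPC marker methods that are not real RPCs.
const MARKER_RE = /^(mustEmbed|testEmbeddedByValue)/;

function bridgeStubToImpls(structs: StructInfo[]): BridgeEdge[] {
  const edges: BridgeEdge[] = [];
  const stubs = structs.filter(
    (s) => s.isGenerated && /^Unimplemented\w+Server$/.test(s.name),
  );
  for (const stub of stubs) {
    const rpcs = stub.methods.filter((m) => !MARKER_RE.test(m));
    if (rpcs.length === 0) continue;
    for (const impl of structs) {
      // Precision case: never bridge to generated siblings such as msgClient.
      if (impl.isGenerated) continue;
      const have = new Set(impl.methods);
      // Structural match: the impl must provide every RPC method of the stub.
      if (!rpcs.every((m) => have.has(m))) continue;
      for (const m of rpcs) {
        edges.push({ kind: "calls", from: `${stub.name}::${m}`, to: `${impl.name}::${m}` });
      }
    }
  }
  return edges;
}
```

Matching on method names alone mirrors the "method-name set is a superset" rule; a stricter variant could also compare signatures, which the text above does not mention.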
Empirical (post-fix, n=2 average per question):
| Repo / Q                  | WITH  | WITHOUT | Reads (W/WO) | Time (W/WO) |
|---------------------------|-------|---------|--------------|-------------|
| cobra (parse cmds)        | $0.27 | $0.27   | 0 / 4        | 39s / 60s   |
| prometheus (scrape→TSDB)  | $0.63 | $0.70   | 0 / 6        | 106s / 143s |
| cosmos-sdk Q1 (MsgSend)   | $0.41 | $0.26   | 1 / 2        | 67s / 64s   |
| cosmos-sdk Q2 (Delegate)  | $0.47 | $0.46   | 0 / 5        | 50s / 73s   |
| cosmos-sdk Q3 (gov tally) | $0.34 | $0.31   | 1.5 / 3      | 54s / 76s   |
| etcd Q1 (Put→raft)        | $0.65 | $0.78   | 0 / 4        | 98s / 129s  |
| etcd Q2 (watch)           | $0.36 | $0.50   | 0 / 4+       | 58s / 89s   |
Codegraph wins on reads on every question and on time on every question
except cosmos-sdk Q1 (67s vs 64s). Cost is mixed: 3 clean wins, 3 tied
(within 10%), 1 stubborn cost loss on the grep-favored Q1.
Compared to baseline, the cosmos-sdk cost-gap collapsed from -60% to -15%
on average, and Q3 went from a 75% loss to a tie. Raw run artifacts in
`/tmp/cg-finalv2-*/` and `/tmp/cg-final-*/`.
Memory written at `project_go_multi_module_audit.md` for the methodology
+ before/after numbers.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a codegraph_context task contains a flow keyword ("trace", "from",
"reach", "flow", "propagat", "how does", "how do") AND at least two
distinct PascalCase / camelCase identifiers, internally invoke trace
between the first two extracted symbols and splice the trace body into
the context response. Conservative trigger by design: false positives
waste one graph query; false negatives just fall back to the agent
calling trace itself (existing path-proximity wiring handles either
case).
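A minimal sketch of the trigger described above, under stated assumptions. The keyword list is taken from the text; the identifier regex is a guess at what "PascalCase / camelCase" means here (it requires a lowercase-to-uppercase hump, so a sentence-initial word like "How" does not count), and `extractFlowEndpoints` is a hypothetical name.

```typescript
// Flow keywords as listed in the commit message (matched case-insensitively).
const FLOW_KEYWORDS = ["trace", "from", "reach", "flow", "propagat", "how does", "how do"];

// Assumed identifier shape: at least one lowercase/digit followed by an
// uppercase letter, e.g. msgServer, MsgSend, setBalance. Single-hump names
// such as "Put" or "How" are deliberately not matched.
const IDENTIFIER_RE = /\b[A-Za-z_]\w*[a-z0-9][A-Z]\w*\b/g;

// Returns the first two distinct identifiers when the task looks like a flow
// query, or null when the trigger should not fire.
function extractFlowEndpoints(task: string): [string, string] | null {
  const lower = task.toLowerCase();
  const hasKeyword = FLOW_KEYWORDS.some((k) => lower.includes(k));
  if (!hasKeyword) return null;
  const matches = task.match(IDENTIFIER_RE) || [];
  const distinct = Array.from(new Set(matches));
  return distinct.length >= 2 ? [distinct[0], distinct[1]] : null;
}
```

Returning `null` keeps the conservative behavior: the caller would simply skip the inline trace and leave the agent to call trace itself.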
Goal: collapse the agent's typical context → trace → explore sequence
into a single context call for clear flow queries, closing the
remaining cost-overhead gap on multi-call patterns. The path-proximity
+ less-canonical-path scoring + the trace-failure-inlined-bodies
behavior already let the inline trace land on the right endpoint pair
and return enough material that no follow-up codegraph_node/Read is
needed.
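The path-proximity and less-canonical-path scoring mentioned above can be sketched roughly as follows. The directory list and the 20-pair cap come from the earlier commit message; the penalty weight, the `Candidate` shape, and the function names are assumptions made for illustration.

```typescript
// Hypothetical candidate shape: one graph node per possible endpoint.
interface Candidate {
  id: string;
  filePath: string;
}

const LESS_CANONICAL = new Set([
  "enterprise", "contrib", "examples", "vendor", "third_party", "deprecated", "legacy",
]);
const MAX_PROBES = 20; // find-path probe budget from the commit message
const PENALTY = 100; // assumed weight: large enough to outrank any prefix match

function dirSegments(filePath: string): string[] {
  return filePath.split("/").slice(0, -1);
}

function sharedPrefixLength(a: string[], b: string[]): number {
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return n;
}

function isLessCanonical(filePath: string): boolean {
  return dirSegments(filePath).some((seg) => LESS_CANONICAL.has(seg));
}

// Score = shared directory depth, minus a penalty per less-canonical endpoint.
function pairScore(from: Candidate, to: Candidate): number {
  let score = sharedPrefixLength(dirSegments(from.filePath), dirSegments(to.filePath));
  if (isLessCanonical(from.filePath)) score -= PENALTY;
  if (isLessCanonical(to.filePath)) score -= PENALTY;
  return score;
}

// Score every from×to combination, best first, capped at the probe budget.
function orderPairs(froms: Candidate[], tos: Candidate[]): Array<[Candidate, Candidate]> {
  const scored: Array<{ pair: [Candidate, Candidate]; score: number }> = [];
  for (const f of froms) {
    for (const t of tos) scored.push({ pair: [f, t], score: pairScore(f, t) });
  }
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, MAX_PROBES).map((s) => s.pair);
}
```

With a penalty this large, a pair under `enterprise/` loses to the canonical pair even when it shares a deeper directory prefix, which is the behavior the commit message describes.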
Doesn't fire on:
- cobra's "How does cobra parse commands and flags?" (no PascalCase
symbols) — verified in regression run, no behavior change ($0.260
WITH vs $0.257 WITHOUT, basically tied)
- queries where the agent doesn't call codegraph_context at all
(cosmos Q1 in the audit went search → trace → node → trace → node)
Tests: 1076/1076 still pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n-out
The cosmos-Q1 audit revealed a static-resolution gap: msgServer.Send's *real* next hop is `k.Keeper.SendCoins`, an interface-method call on an embedded field that tree-sitter can't resolve. The static getCallees list for msgServer.Send is all utility/error functions (StringToBytes, Wrapf, …). The actual flow (SendCoins → subUnlockedCoins → addCoins → setBalance) lives entirely inside `x/bank/keeper/send.go`, which is also where the TO endpoint (setBalance) lives.
When trace fails (no static path), inline the **top 5 functions/methods in the destination file**, ordered by line-distance from the TO node. This catches the flow that interface-method calls obscure: the canonical "k.<Iface>.<Method>" pattern in Go, also relevant to Java dependency-injection / Rails service-object dispatch / etc. where interface dispatch hides the real call.
Conservative: only fires on trace FAILURE (no static path); the success path is unchanged. Per-body cap (40 lines / 1200 chars), top 5 siblings. Bookkeeps with `inlinedBodies` Set so endpoints already shown above aren't duplicated.
Result: cosmos-Q1, historically the most stubborn cost loss (-2.2× to -39% across the audit), flipped to a clean WIN: $0.257 WITH vs $0.449 WITHOUT (-43%), 34s vs 79s, 0 Reads vs 2 Reads + 5 Greps, 5 codegraph calls vs 12.
Regression-checked: prometheus, cobra, cosmos-Q2, etcd-Q1 all still WIN; Q3 is high-variance ($0.30-$0.45 range historically) and fell within that on this run.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
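The sibling selection described in this commit (top 5 functions in the destination file, nearest to the TO node first, with a per-body cap) might look roughly like the sketch below. `FnNode`, `pickSiblings`, and `capBody` are hypothetical names; only the limits (5 siblings, 40 lines, 1200 chars) and the `inlinedBodies` Set come from the text.

```typescript
// Hypothetical node shape for symbols indexed in a file.
interface FnNode {
  id: string;
  name: string;
  filePath: string;
  startLine: number;
  kind: "function" | "method" | "class";
}

const MAX_SIBLINGS = 5;
const MAX_BODY_LINES = 40;
const MAX_BODY_CHARS = 1200;

// Pick functions/methods from the TO node's file, nearest by line distance,
// skipping anything whose body has already been inlined.
function pickSiblings(to: FnNode, nodes: FnNode[], inlinedBodies: Set<string>): FnNode[] {
  return nodes
    .filter((n) => n.filePath === to.filePath)
    .filter((n) => n.kind === "function" || n.kind === "method")
    .filter((n) => !inlinedBodies.has(n.id))
    .sort(
      (a, b) =>
        Math.abs(a.startLine - to.startLine) - Math.abs(b.startLine - to.startLine),
    )
    .slice(0, MAX_SIBLINGS);
}

// Apply the per-body cap: first by line count, then by character count.
function capBody(source: string): string {
  const byLines = source.split("\n").slice(0, MAX_BODY_LINES).join("\n");
  return byLines.length > MAX_BODY_CHARS ? byLines.slice(0, MAX_BODY_CHARS) : byLines;
}
```

In the cosmos example, passing the `setBalance` node as `to` would surface `addCoins`, `subUnlockedCoins`, and `SendCoins` ahead of unrelated functions elsewhere in `send.go`, assuming they sit closest to it in the file.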
PR review feedback: the audit was Go-driven, so the patterns I added were Go-flavored. Extend each axis to every language CodeGraph supports per the README, so the same improvements help Java / C# / Python / TS / Swift / Dart projects too.
**generated-detection.ts**: added patterns for:
- TS/JS: `.gen.[jt]sx?`, `.pb.[jt]s`, `_pb.[jt]s`, `_grpc_pb.[jt]s` (ts-proto, gRPC-web, Apollo / GraphQL codegen, Hasura).
- Python: `_pb2.pyi` (mypy stubs from protobuf).
- C#: `.g.cs` (T4 / Razor codegen), `Grpc.cs` (protoc-gen-csharp).
- Java: `OuterClass.java` (protoc-gen-java), `Grpc.java` (protoc-gen-grpc-java; this is where the `*ImplBase` abstract class lives, the same shape as the Go `Unimplemented*Server` stub).
- Swift: `.pb.swift` (protoc-gen-swift).
- Dart: `.pb.dart`, `.pbgrpc.dart`, `.chopper.dart`.
- Rust: `.generated.rs`.
**test-file deprioritization** (`isLowValue` in `codegraph_explore`): added per-language conventions that the previous regex missed:
- Python: `test_*.py` (pytest discovery) and `*_test.py`.
- Ruby: `*_test.rb` (minitest); `*_spec.rb` was already covered.
- C#: `*Tests.cs`, `*Test.cs`, `*Spec.cs`.
- Swift: `*Tests.swift` (XCTest).
- Dart: `*_test.dart`.
**IFACE_OVERRIDE_LANGS** in `callback-synthesizer.ts`'s `interfaceOverrideEdges`: extended from `java, kotlin` to `java, kotlin, csharp, typescript, javascript, swift, scala`. Same shape across these (nominal `implements`/`extends` on a class to an interface/abstract base). Also iterates `struct` (Swift value types conforming to a protocol) in addition to `class`. The existing matchesSymbol-style logic and `getOutgoingEdges(..., ['implements', 'extends'])` work unchanged.
**CLAUDE.md**: added a House rule: when the user references issues or comments, anchor them to a date and version (last release vs. last main commit vs. current branch tip) BEFORE concluding a fix is incomplete. Issue #388 comments from May 25-27 were responding to the released v0.9.5 / merged-PR-469 state, not to this branch's in-flight work. The new rule walks through the disambiguation: `grep -m1 '^## \[' CHANGELOG.md` for release version, `git log --first-parent main -1` for main tip.
Tests: 1076/1076 still pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cumulative changes targeting the small-repo cost gap surfaced by the cross-language audit:
1. **Tool descriptions trimmed** (~2.1KB total saved across 10 tools). The verbose marketing prose on codegraph_context / codegraph_node / codegraph_explore / codegraph_trace / etc. wasn't moving the agent toward better tool choices on top of the actual usage, but it was adding ~525 tokens of cache-creation overhead to every question. The trimmed descriptions keep the operational hints (e.g. "Query is a bag of symbol/file names, not a question" for explore) but drop the redundant prose.
2. **Dynamic tiny-repo tool gating** in `ToolHandler.getTools()`. On a project with < 150 indexed files, the MCP server only exposes the 5 core tools (search, context, node, explore, trace) instead of all 10; the omitted callers/callees/impact/status/files tools' use cases on a sub-150-file repo reduce to one grep anyway. The MCP tool-defs overhead is the #1 source of cost loss on tiny repos (~$0.10-0.15 fixed cache-creation per question); cutting 5 tools drops that by ~50%.
Effect on ky (~25 files, the worst pre-fix offender):
- Before: $0.59 WITH vs $0.42 WITHOUT (+42% loss, n=1)
- After: $0.32 WITH vs $0.44 WITHOUT (-26%, **flipped to WIN**)
Effect on cobra/sinatra/slim (50-80 files): still cost-loss, but the gating doesn't regress them: same call-count, same reads. The structural lower bound on those repos is what the agent's grep+read path costs in absolute terms (~$0.20-0.30).
Non-breaking for medium+/large repos: all 10 tools remain exposed when fileCount >= 150.
Tests: 1076/1076 still pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
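A minimal sketch of the file-count gate described in item 2. The five core tool names appear in this PR; the prefixed names for the other five (`codegraph_callees`, `codegraph_impact`, `codegraph_status`, `codegraph_files`) are assumed from the short names in the text, and this free function stands in for whatever `ToolHandler.getTools()` actually does.

```typescript
const CORE_TOOLS = [
  "codegraph_search",
  "codegraph_context",
  "codegraph_node",
  "codegraph_explore",
  "codegraph_trace",
];

// Assumed full names for the callers/callees/impact/status/files tools.
const EXTENDED_TOOLS = [
  "codegraph_callers",
  "codegraph_callees",
  "codegraph_impact",
  "codegraph_status",
  "codegraph_files",
];

// 150 is the threshold in this commit; a later commit in the PR raises it.
const TINY_REPO_FILE_THRESHOLD = 150;

// Expose only the core tools on tiny repos, all ten otherwise.
function getTools(fileCount: number, threshold: number = TINY_REPO_FILE_THRESHOLD): string[] {
  return fileCount < threshold ? [...CORE_TOOLS] : [...CORE_TOOLS, ...EXTENDED_TOOLS];
}
```

Keeping the threshold a parameter makes the 149/150 boundary easy to pin in a test, which matches the off-by-one guard mentioned elsewhere in the PR.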
…ky flip to WIN)
Combines the tool gating from the previous commit with a matching explore-budget cut for projects under 150 files. The two together close the cost gap that neither closes alone:
- Tool gating alone helped ky (WIN) but didn't move cobra/slim/sinatra
- Explore-budget cut alone helped slim slightly but regressed cobra
- COMBINED: cobra flips to WIN, ky stays a WIN, ky/cobra both clean
`getExploreOutputBudget(fileCount < 150)` returns:
- maxOutputChars: 13000 (was 18000)
- defaultMaxFiles: 4 (was 5)
- gapThreshold: 7 (was 8)
- maxSymbolsInFileHeader: 5 (was 6)
- maxEdgesPerRelationshipKind: 4 (was 6)
- includeRelationships: true (kept ON; cheap structural signal)
- maxCharsPerFile: 3800 (unchanged; monotonic invariant w/ next tier)
This survives the cobra-regression-with-trim that the earlier budget-only attempt suffered: with only 5 tools to choose from, the agent doesn't fall back to extra codegraph_node calls when explore returns less, since there's no node call available.
Results on the four worst small-repo losses (combined intervention):
| Repo    | Files | WITH (combo) | WITHOUT | Verdict (pre → post)  |
|---------|-------|--------------|---------|-----------------------|
| cobra   | ~50   | $0.25        | $0.31   | loss → **WIN** (-19%) |
| ky      | ~25   | $0.39        | $0.39   | -42% → tied           |
| slim    | ~80   | $0.31        | $0.24   | LOSS 31% → still LOSS |
| sinatra | ~60   | $0.30        | $0.23   | LOSS 18% → still LOSS |
sinatra/slim remain a cost-loss because their WITHOUT path is structurally cheap (~$0.20, fewer than 4 cheap grep+read calls). Codegraph can't beat that absolute floor with any meaningful response. Both still WIN on time + reads + tool-call count.
Tests: tier boundary cases updated to cover the new <150 / 150-499 / 500-4999 / 5000-14999 / >=15000 progression. Off-by-one guard updated to include the new 149↔150 boundary. All 1076 tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
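A rough sketch of how a tiered budget lookup with a monotonic check could be structured. Only the tiny-tier numbers and their "was" values come from the commit message; treating the "was" values as the 150-499 tier is an assumption, the larger tiers are omitted because their values are not stated, and the signature here (taking a file count) is hypothetical.

```typescript
interface ExploreBudget {
  maxOutputChars: number;
  defaultMaxFiles: number;
  gapThreshold: number;
  maxSymbolsInFileHeader: number;
  maxEdgesPerRelationshipKind: number;
  includeRelationships: boolean;
  maxCharsPerFile: number;
}

interface Tier {
  maxFilesExclusive: number;
  budget: ExploreBudget;
}

const TIERS: Tier[] = [
  // <150 files: values from the commit message.
  { maxFilesExclusive: 150, budget: {
    maxOutputChars: 13000, defaultMaxFiles: 4, gapThreshold: 7,
    maxSymbolsInFileHeader: 5, maxEdgesPerRelationshipKind: 4,
    includeRelationships: true, maxCharsPerFile: 3800 } },
  // 150-499 files: ASSUMED to keep the "was" values quoted above.
  { maxFilesExclusive: 500, budget: {
    maxOutputChars: 18000, defaultMaxFiles: 5, gapThreshold: 8,
    maxSymbolsInFileHeader: 6, maxEdgesPerRelationshipKind: 6,
    includeRelationships: true, maxCharsPerFile: 3800 } },
  // Larger tiers omitted: their values are not given in the text.
];

// Returns undefined for file counts beyond the tiers sketched here.
function getExploreOutputBudget(fileCount: number): ExploreBudget | undefined {
  const tier = TIERS.find((t) => fileCount < t.maxFilesExclusive);
  return tier?.budget;
}

const NUMERIC_KEYS = [
  "maxOutputChars", "defaultMaxFiles", "gapThreshold",
  "maxSymbolsInFileHeader", "maxEdgesPerRelationshipKind", "maxCharsPerFile",
] as const;

// Invariant check: no numeric limit shrinks when moving to a larger tier.
// Extending the check beyond maxCharsPerFile is an assumption of this sketch.
function isMonotonic(tiers: Tier[]): boolean {
  return tiers.every(
    (t, i) => i === 0 || NUMERIC_KEYS.every((k) => tiers[i - 1].budget[k] <= t.budget[k]),
  );
}
```

A check like `isMonotonic` is one way to express the "monotonic invariant w/ next tier" note as a test rather than a comment.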
On a <150-file project the entire repo is grep-able in one turn, so the 20-node `codegraph_context` default was paying for a graph subset larger than the agent's actual question needs. Cutting the tiny-repo default to 8 (typical 1-3 entry points + their immediate 1-hop neighbors) reduces the context-tool response body without hurting sufficiency on the flow shapes small repos actually contain.
Non-breaking: the agent can still pass an explicit `maxNodes` to override; medium+ repos (>=150 files) keep the 20-node default.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
n=2 audit on cobra/ky/sinatra ruled out cutting below 5 tools (search + context + node + explore + trace) on the tiny-repo tier. The smaller 3-tool gate (search + context + trace) saved ~$0.025 of prompt overhead, but the agent fell back to extra Reads to cover what codegraph_node and codegraph_explore would have answered: a net cost regression on all three test repos (cobra 17% → 48% loss, sinatra 18% → 96% loss). Documented inline so future tuners don't retry this dead end.
No behavior change beyond the comment: the 5-tool gate remains the production setting.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tested the hypothesis that exposing FEWER tools on micro repos (<50 files) would close the cost gap. Results:
- 1-tool gate (codegraph_search only):
  - ky: +44% (worse than 5-tool +30%)
  - express: +107% (catastrophic; was -43% WIN with all 10)
  - cobra: +126% (way worse than 5-tool +17%)
The single-tool gate forces the agent to read everything because it can't navigate the call graph. The 4 omitted core tools (context, node, explore, trace) were doing real work that grep+Read can't replicate.
Conclusion: 5 tools (search + context + node + explore + trace) is the empirical lower bound on the tiny-repo tier. Cutting below regresses EVERY tested repo. The remaining ~$0.04-0.08 of structural cost overhead on tiny repos is unavoidable without sacrificing the value codegraph provides at that scale (which would also make WITH = WITHOUT, defeating the install).
Comment documents the dead-ends so future tuners don't relitigate.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… in context, hard-exclude low-value files
Three layered changes targeting the sinatra/slim/small-repo cost gap that iter2's body-shrink failed to close (smaller bodies just pushed the agent to Read instead):
1. **Tool-gate threshold 150 → 500** (`TINY_REPO_FILE_THRESHOLD`). Sinatra (~159 files) and slim (~200 files) have the same structural problem as cobra (
…siblings in search ranking
On projects with a single file holding the dense majority of internal call edges (e.g. sinatra's `lib/sinatra/base.rb` at ~85% of in-file edges), text search was favoring small focused extension files over the core file. A small focused file like `multi_route.rb` wins on verbatim name match + file-size normalization, burying the 1500-line core file's longer method names (e.g. `route!` vs `route`).
Fix: detect the "dominant file" (the file whose in-file edge count is ≥3× the next candidate's), then add +25 to all results sharing its directory prefix. This pulls the core file's siblings above sibling-package extensions without hardcoding any repo structure.
`getDominantFile()` excludes test/spec files and generated files (e.g. etcd's `rpc.pb.go` has 4× the in-file edges of `server.go` and would otherwise hijack the boost toward generated protobuf stubs). SQL pulls the top 20 candidates; path-pattern filtering handles what SQLite LIKE can't express.
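The dominant-file detection and directory boost can be sketched as below. The ≥3× ratio, the +25 boost, and the top-20 candidate cut come from the commit message; the exclusion regex, the input shapes, and `applyDominantBoost` are illustrative assumptions, and the real `getDominantFile()` reads its candidates from SQL rather than an array.

```typescript
interface FileEdgeCount {
  filePath: string;
  inFileEdges: number;
}

interface SearchHit {
  name: string;
  filePath: string;
  score: number;
}

const DOMINANCE_RATIO = 3;
const DIR_BOOST = 25;
const CANDIDATE_LIMIT = 20;

// Assumed exclusion patterns covering a few test/spec and generated suffixes.
const EXCLUDE_RE = /(_test\.go|_spec\.rb|\.test\.ts|\.pb\.go)$|(^|\/)(tests?|spec)\//;

// Take the top candidates by in-file edge count, drop excluded paths, and
// accept the leader only if it has at least 3x the edges of the runner-up.
function getDominantFile(counts: FileEdgeCount[]): string | null {
  const candidates = [...counts]
    .sort((a, b) => b.inFileEdges - a.inFileEdges)
    .slice(0, CANDIDATE_LIMIT)
    .filter((c) => !EXCLUDE_RE.test(c.filePath));
  if (candidates.length === 0 || candidates[0].inFileEdges === 0) return null;
  const top = candidates[0];
  const next = candidates.length > 1 ? candidates[1] : null;
  if (next !== null && top.inFileEdges < DOMINANCE_RATIO * next.inFileEdges) return null;
  return top.filePath;
}

function dirOf(filePath: string): string {
  const i = filePath.lastIndexOf("/");
  return i === -1 ? "" : filePath.slice(0, i + 1);
}

// Add the boost to every hit under the dominant file's directory, then re-sort.
function applyDominantBoost(hits: SearchHit[], dominant: string | null): SearchHit[] {
  if (dominant === null) return hits;
  const dir = dirOf(dominant);
  return hits
    .map((h) => (h.filePath.startsWith(dir) ? { ...h, score: h.score + DIR_BOOST } : h))
    .sort((a, b) => b.score - a.score);
}
```

Note that a dominant file at the repository root yields an empty prefix and would boost every hit equally, which leaves the ordering unchanged; the commit message does not say how the real code treats that case.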
On small projects (<500 files) with a routing-shaped query, build a URL→handler manifest directly from the graph (each `route` node joins to its handler via `references`/`calls` edges) and inline the top handler file's source. The agent gets the canonical routing answer in ONE codegraph_context call, with no need to parse framework DSL, Glob for controllers, or chase down handler files.
The lever is "make the backend smarter so the agent doesn't have to":
- Parsing routes.rb / routes/api.php / urls.py DSL is the agent's job in the WITHOUT arm. Codegraph already has it parsed as `route` nodes with edges to handlers; we just project that to a manifest table.
- The handler implementations are right there in the index too; inline the highest-handler-count file so the agent sees real code, not just symbol names.
Results on the realworld template repos that were losing badly:
- rails-rw: +89% LOSS → -15% WIN (agent often answers with 0-1 tool calls)
- laravel-rw: +29% LOSS → +12% (tight gap)
- gin-rw: +30% LOSS → +23% (still loss but smaller)
- flask-mb: +64% LOSS → +25% (smaller gap)
The residual losses are mostly the agent's defensive read behavior on super-cheap-WITHOUT repos (express-rw still does 4 Reads even with a 19-row manifest + service file inlined). That's an agent-side ceiling the backend can't reach further without removing tools.
Also lands `scripts/agent-eval/probe-sweep.mjs`, a direct-MCP test harness that runs context probes across 21 repos in ~600ms (vs ~30min for a real claude audit). Enables rapid iteration on backend changes: edit tools.ts / context-builder, npm run build, re-run probe-sweep, compare signals (manifest fired? handler file inlined? response size?) before paying for a claude run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
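The manifest projection described above (each `route` node joined to its handler through `references`/`calls` edges, then the file with the most handlers chosen for inlining) could be sketched like this. The node and edge shapes and both function names are hypothetical; the actual graph schema is not shown in this PR text.

```typescript
// Assumed minimal graph shapes for illustration.
interface GraphNode {
  id: string;
  kind: string;
  name: string;
  filePath: string;
}

interface GraphEdge {
  from: string;
  to: string;
  kind: string;
}

interface ManifestRow {
  route: string;
  handler: string;
  handlerFile: string;
}

// Project route -> handler edges into a flat manifest table.
function buildRouteManifest(nodes: GraphNode[], edges: GraphEdge[]): ManifestRow[] {
  const byId = new Map<string, GraphNode>();
  nodes.forEach((n) => byId.set(n.id, n));
  const rows: ManifestRow[] = [];
  for (const e of edges) {
    if (e.kind !== "references" && e.kind !== "calls") continue;
    const route = byId.get(e.from);
    const handler = byId.get(e.to);
    if (!route || !handler || route.kind !== "route") continue;
    rows.push({ route: route.name, handler: handler.name, handlerFile: handler.filePath });
  }
  return rows;
}

// The file hosting the most handlers is the candidate for source inlining.
function topHandlerFile(rows: ManifestRow[]): string | null {
  const counts = new Map<string, number>();
  rows.forEach((r) => counts.set(r.handlerFile, (counts.get(r.handlerFile) || 0) + 1));
  let best: string | null = null;
  let bestCount = 0;
  counts.forEach((count, file) => {
    if (count > bestCount) {
      best = file;
      bestCount = count;
    }
  });
  return best;
}
```

Edges that do not start at a `route` node are ignored, so ordinary call edges between handlers never leak into the manifest.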
Summary
Multi-pronged fix to make codegraph competitive on Go multi-module repos (cosmos-sdk, etcd) where it previously lost on cost. Driven by an 8-question agent-eval audit across cobra, gin, prometheus, cosmos-sdk, and etcd.
The empirical gate ruled OUT `go.work` parsing as the real gap (prometheus crushes without it). The actual failure modes:
- `codegraph_search "Send"` on cosmos-sdk returned the gRPC stub at `tx_grpc.pb.go:124` first; trace landed on the empty stub, reported "no path", agent fell back to Read.
- `interfaceOverrideEdges` (Java/Kotlin only) doesn't apply, so `MsgServer.Send` (interface in `.pb.go`) and `msgServer.Send` (impl in keeper) never connect.
- … `codegraph_node`, `codegraph_callers`, …) plus a Read.
- `EndBlocker` exists in 20+ modules; FTS picked an arbitrary one.
What's in here
- `src/extraction/generated-detection.ts`: path-pattern classifier for `.pb.go`, `.pulsar.go`, `_grpc.pb.go`, `_mock.go`, `_mocks.go`, `mock_*.go`, `.generated.[jt]sx?`, `_pb2(_grpc)?.py`, `.pb.{cc,h}`, `.g.dart`, `.freezed.dart`. Applied as a stable sort tiebreaker in `findSymbol`, `findAllSymbols`, `codegraph_search` (MCP + CLI), `codegraph_explore` file ranking, and context formatter Entry Points / Related Symbols / Code blocks.
- `goGrpcStubImplEdges` synthesizer in `callback-synthesizer.ts`: detects `UnimplementedXxxServer` structs in generated files, identifies their RPC methods (excluding `mustEmbed*` / `testEmbeddedByValue` markers), and emits `calls` edges to matching methods on any non-generated struct whose method-name set is a superset. 467 bridge edges on cosmos-sdk; bank's `UnimplementedMsgServer::Send` points to `x/bank/keeper/msg_server.go` only, not to `msgClient` siblings or mock files.
- … `from`×`to` candidate combo by shared directory prefix length (full candidate set, not just FTS top-5), with a less-canonical-path penalty (`enterprise/`, `contrib/`, `examples/`, `vendor/`, `third_party/`, `deprecated/`, `legacy/`) so the canonical-module pair wins. FindPath probe budget capped at 20.
- `codegraph_explore` `isLowValue`: adds Go's `_test.go`, Ruby's `_spec.rb`, JS/TS `.test.ts` / `.spec.tsx`, JVM `*Test.java` / `*Spec.kt`. Without this, etcd's `watchable_store_test.go` consumed 5K chars of explore budget.
Explicitly NOT in this PR:
- `go.work` parsing. The empirical gate disconfirmed it.
Empirical results (n=2 average per question, headless mode)
Codegraph wins on reads across every question and on time across every question except cosmos Q1 (67s vs 64s). Cost is 3 clean wins, 3 within-10% ties, and 1 stubborn loss on cosmos Q1, a grep-favored question where the agent's WITHOUT path is structurally short. Compared to baseline, cosmos-sdk's cost gap collapsed from -60% avg to -15% avg, and Q3 went from a 75% loss to a tie.
Tests
- `__tests__/generated-detection.test.ts`: 4 unit tests pinning the suffix patterns.
- `frameworks-integration.test.ts`: 2 new integration tests for the Go gRPC bridge: positive bridge (stub → hand-written impl) + precision case (don't bridge to a generated sibling like `msgClient`).
Test plan
- `npm test`: 1076/1076 pass
- … `go.work` repo, different from cosmos)
- … `go.work`, no protobuf mass; no-regression control)
- … `UnimplementedMsgServer::Send` → `msgServer::Send`, no mock/client false positives
🤖 Generated with Claude Code