Status (2026-04-26): Tier 1a (consumer-side gRPC IDL Routes + HANDLES binding) is open as #293 against this issue. Tier 1b (producer-side typed-client GRPC_CALLS emission) and Tiers 2–4 will follow as separate PRs sequenced per §5 below.
Protocol-Aware Cross-Repo Intelligence
Status: Design proposal for upstream codebase-memory-mcp
Audience: cbm maintainer + reviewers
Scope: Extends pass_cross_repo from literal-string matching to protocol-aware matching across the four cross-language patterns that account for >95% of inter-service communication in modern codebases.
Compatibility: Strictly additive. No breaking changes to existing tools, edges, or APIs. Builds on PR #281 (rich get_architecture fields).
TL;DR
cbm already has the scaffolding for cross-repo intelligence (pass_cross_repo.c, CROSS_HTTP_CALLS / CROSS_ASYNC_CALLS / CROSS_GRPC_CALLS edge types, named-route matching). The current implementation only fires when a call site has a literal URL or topic string as its first argument. That covers idiomatic Python/Node code well, but misses the dominant pattern in modern strongly-typed stacks (Java/Spring, .NET, Kotlin, Go-with-codegen): typed clients and message handlers where the routing identifier is a generic type parameter, an interface ancestor, an attribute, or a config-resolved name — never a literal string at the call site.
This proposal adds four protocol-aware extraction tiers, each language-generic, behind a YAML-driven service-pattern registry so adding new frameworks is a config edit rather than a C patch. Tier 1 (gRPC .proto matching) is proposed first as a small working PR (~560 LOC including tests; see §4.5) to validate the architecture; Tiers 2–4 follow as separate PRs.
Cross-language framework coverage matrix at the end. Acceptance gating: each tier ships independently, success measured by precision/recall against multi-language test fixtures.
1. Background
1.1 What works in cbm today
After PR #281 lands, get_architecture(aspects=["all"]) returns rich structural data (entry_points, routes, hotspots, layers, boundaries, languages). The per-repo extraction pipeline detects:
- Library identifiers in resolved qualified names (`service_patterns.c:631` — 252 patterns across HTTP, async, gRPC, config, route-registration kinds, covering Python/Node/Go/Java/Rust/PHP/Ruby/C# basics)
- Literal URL / topic strings at call sites (`pass_calls.c:emit_http_async_edge`)
- Route registration via `app.get("/x", ...)` and attribute-routed framework styles (`pass_route_nodes.c`)
- Cross-repo matching when both sides have a literal-route identifier (`pass_cross_repo.c:cbm_cross_repo_match`)
The cross-repo-intelligence mode in index_repository matches __route__<METHOD>__<path> keys across project DBs and emits CROSS_HTTP_CALLS / CROSS_ASYNC_CALLS / CROSS_GRPC_CALLS edges.
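The key-based match can be sketched in a few lines. Illustrative Python only: the `__route__<METHOD>__<path>` key format comes from the text above, while the function names and dict shapes are hypothetical stand-ins for cbm's project-DB records.

```python
# Sketch: cross-repo matching over __route__<METHOD>__<path> keys.
# Key format is taken from the proposal; the data shapes are hypothetical.

def route_key(method, path):
    """Build the canonical route key used for cross-project matching."""
    return f"__route__{method.upper()}__{path}"

def match_cross_http(producer_calls, consumer_routes):
    """Return (caller_qn, handler_qn, key) for every key both sides share.

    producer_calls:  route key -> qualified name of the calling function
    consumer_routes: route key -> qualified name of the handler
    """
    matches = []
    for key, caller in producer_calls.items():
        handler = consumer_routes.get(key)
        if handler is not None:
            matches.append((caller, handler, key))
    return matches

# service-a calls GET /voucher; service-b registers a handler for it
calls = {route_key("GET", "/voucher"): "service_a.client.get_voucher"}
routes = {route_key("GET", "/voucher"): "service_b.api.voucher_handler"}
assert match_cross_http(calls, routes) == [
    ("service_a.client.get_voucher", "service_b.api.voucher_handler",
     "__route__GET__/voucher")
]
```

The emitted tuples correspond to the CROSS_HTTP_CALLS edges; the real pass also emits the reverse direction via `emit_cross_route_bidirectional`.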
1.2 What doesn't work
`emit_http_async_edge` (`pass_calls.c`, line ~232):

```c
bool is_url = (url_or_topic && url_or_topic[0] != '\0' &&
               (url_or_topic[0] == '/' || strstr(url_or_topic, "://") != NULL));
bool is_topic = (url_or_topic && ... && svc == CBM_SVC_ASYNC && ...);
if (!is_url && !is_topic) {
    /* fall back to plain CALLS edge */
    return;
}
```
If the first string argument is not a literal URL or topic, the edge falls through to CALLS — generic, unrouted, untaggable for cross-repo matching. Idiomatic code in major modern frameworks rarely passes a literal URL or topic at the call site:
```
// .NET / MassTransit — no topic string, message type is the identifier
await _publishEndpoint.Publish(new VoucherRedeemed(voucher.Id), ct);

// .NET / generated gRPC client — no URL string, service.method is the identifier
var resp = await _promoCodeClient.GetVoucherAsync(req, cancellationToken: ct);

// Java / Spring Cloud Stream — no topic string, message type is the identifier
streamBridge.send("output", new OrderShipped(order));

// Java / Feign — interface annotation IS the route, no literal at call site
return feignClient.getVoucher(id);

// Kotlin / Retrofit — same shape as Feign
return retrofitApi.getVoucher(id)

// Go with gRPC codegen — generated client method, no string at call site
resp, err := promoClient.GetVoucher(ctx, req)

// Python / FastAPI typed httpx client (oapi-codegen-derived) — same shape
resp = await client.get_voucher(id=id)
```
In each case, the producer-side identifier (message type FQN, gRPC service.method, Feign interface annotation) is statically present and resolvable — but not as a literal string argument. It lives in: a generic type parameter, the constructor type of an argument, a class-level attribute, or a method-level attribute.
The consumer side has the same identifier visible in a different syntactic position: an IConsumer<T> declaration, a *Base implementation, a @StreamListener<T> annotation, an attribute-routed controller method.
The matching problem is solvable. The producer/consumer identifier exists statically on both sides. cbm's current extractor just doesn't extract it.
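As a sketch of why this is tractable, the two most common non-literal producer positions (generic type parameter, constructor argument type) can be recovered with simple pattern matching. Illustrative Python regexes only — cbm's real extractor would work on the tree-sitter AST, not raw text, and the `Publish` method name is just one example shape:

```python
import re

# Sketch: recover the message-type identifier from a producer call site where
# it is a generic type parameter or a constructor argument, never a literal
# string. Regexes are illustrative, not cbm's actual matcher.

GENERIC_ARG = re.compile(r"\.Publish<([\w.]+)>\s*\(")        # Publish<T>(...)
CTOR_ARG = re.compile(r"\.Publish\s*\(\s*new\s+([\w.]+)\s*\(")  # Publish(new T(...))

def extract_message_type(call_text):
    for pattern in (GENERIC_ARG, CTOR_ARG):
        m = pattern.search(call_text)
        if m:
            return m.group(1)
    return None

assert extract_message_type(
    "await _publishEndpoint.Publish<VoucherRedeemed>(msg, ct);") == "VoucherRedeemed"
assert extract_message_type(
    "await _publishEndpoint.Publish(new VoucherRedeemed(voucher.Id), ct);") == "VoucherRedeemed"
assert extract_message_type('logger.Info("done");') is None
```

Either way the result is the same stable identifier, which is what makes producer/consumer matching by FQN possible.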
1.3 Why this matters now
Cross-repo intelligence is one of the most-asked-for capabilities in code-graph tools. Industry tooling that does parts of this:
| Tool | Approach | Limitation |
|---|---|---|
| Backstage (Spotify) | Service catalog from OpenAPI / AsyncAPI / .proto files | Manual catalog maintenance, declarative not derived |
| Sourcegraph | Cross-repo references via SCIP indexes | Per-symbol, doesn't protocol-match (only name-matches) |
| Apollo Studio | Federated GraphQL via `@key` directives | GraphQL only |
| AsyncAPI tooling | Typed async-message matching | AsyncAPI-spec only, requires explicit AsyncAPI files |
| GitHub CodeQL | Cross-repo dataflow for security | Security-focused, heavyweight |
| stack-graphs (GitHub) | Universal name resolution graph | Within-repo only |
cbm is uniquely positioned: single binary, AST + LSP-grade extraction, sub-second incremental indexing, no external service dependencies. The cross-repo capability matters because it's the missing 20% of value that turns "smart code search" into "service-architecture truth source."
2. Cross-language pattern audit
The producer→consumer routing problem decomposes into four tiers. Each tier is generic across major language ecosystems. Concrete framework instances per tier:
Tier 1 — IDL-driven typed stubs (gRPC, GraphQL, OpenAPI, AsyncAPI)
The stable identifier lives in an IDL file shared between producer and consumer repos. Both sides reference generated types derived from the same IDL.
| Ecosystem | Producer pattern | Consumer pattern | Stable identifier |
|---|---|---|---|
| gRPC (Go, Java, Python, Rust, TS, C#, Kotlin, Swift) | `*Client` from .proto codegen | `*Base` / `*Servicer` impl from .proto codegen | `service.method` from .proto |
| GraphQL Federation (any GraphQL stack) | typed query/mutation client | resolver bound to type with `@key` directive | type + key from .graphql |
| OpenAPI (NSwag/openapi-generator/oapi-codegen) | generated typed client per language | controller/handler matching path+method | path + method from openapi.yaml |
| AsyncAPI | generated publisher | generated subscriber | channel + message from asyncapi.yaml |
Detection: parse the IDL file, extract canonical IDs as routes; on producer side find references to generated client types; on consumer side find generated base-class implementations.
Genericity: 100%. gRPC alone covers 8+ languages. .proto/.graphql/.openapi files are language-agnostic by design.
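For concreteness, the canonical-ID derivation Tier 1 performs is pure string construction over IDL facts. A minimal sketch, assuming the `__idl_route__<protocol>__<service>/<method>` QN shape described in §3.2 (the helper name is hypothetical):

```python
# Sketch: canonical, language-agnostic route QN for an IDL-derived route.
# The QN shape follows the proposal's §3.2 example; nothing else is assumed.

def idl_route_qn(protocol, service_fqn, method):
    """Same QN regardless of which language produced or consumes the stub."""
    return f"__idl_route__{protocol}__{service_fqn}/{method}"

qn = idl_route_qn("grpc", "promocode.PromoCodeManagerGrpcService", "GetVoucher")
assert qn == "__idl_route__grpc__promocode.PromoCodeManagerGrpcService/GetVoucher"
```

Because every consuming language derives the same QN from the same .proto, the cross-repo match reduces to exact key equality.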
Tier 2 — Typed message pub/sub (interface-ancestor + generic-type)
The stable identifier is a message type's fully-qualified name. Producer has Publish<T> / Send<T> / equivalent on a known interface; consumer implements IConsumer<T> / @MessageHandler / equivalent.
| Language | Frameworks | Producer shape | Consumer shape |
|---|---|---|---|
| C# | MassTransit, NServiceBus, Wolverine, Brighter, Rebus | `IPublishEndpoint.Publish<T>` / `ISendEndpoint.Send<T>` / `IBus.Publish<T>` | `IConsumer<T>` / `IHandleMessages<T>` |
| Java/Kotlin | Spring Cloud Stream, Axon, Eventuate | `streamBridge.send(...)` / `@CommandHandler` | `@StreamListener<T>` / `@EventHandler` |
| Node/TS | NestJS microservices, Moleculer, EventBus libs | `@MessagePattern<T>` emit | `@EventPattern<T>` handler |
| Python | Faust, Celery (typed), aio-pika typed wrappers | `@app.agent` send | typed handler funcs |
| Go | Watermill, NATS-typed, Wire | typed publish via marshalers | typed subscriber registration |
| Rust | Lapin + serde, async-nats with typed deserialization | typed publish | typed subscribe |
Detection: pattern-match the producer interface (e.g., IPublishEndpoint, streamBridge) with its Publish<T> / Send<T> method, extract T from the generic param or the constructor argument's type. On consumer side, find classes implementing IConsumer<T> / IHandleMessages<T> / classes with @StreamListener<T> on a method, extract T. Match by FQN.
Genericity: highly cross-language. ~6 framework families, identical abstract pattern.
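A minimal sketch of the Tier 2 match, assuming producer-side message FQNs have already been extracted (illustrative Python; real matching would be AST-based, and the `IConsumer<T>` shape is the C#/MassTransit instance of the generic pattern):

```python
import re

# Sketch: Tier 2 matching — producer-side message FQNs matched against
# consumer declarations of the IConsumer<T> family. Regex-based for
# illustration only.

CONSUMER_DECL = re.compile(r"IConsumer<([\w.]+)>")

def consumed_fqns(class_decl):
    """Message-type FQNs a consumer class declares it handles."""
    return set(CONSUMER_DECL.findall(class_decl))

def match_by_message_fqn(producer_fqns, consumer_decls):
    """Return sorted (message_fqn, consumer_class) pairs for every
    producer-side message type that some consumer class handles."""
    out = []
    for cls, decl in consumer_decls.items():
        for fqn in consumed_fqns(decl) & producer_fqns:
            out.append((fqn, cls))
    return sorted(out)

producers = {"Contracts.VoucherRedeemed", "Contracts.OrderShipped"}
consumers = {
    "RedemptionConsumer": "class RedemptionConsumer : IConsumer<Contracts.VoucherRedeemed>",
    "AuditConsumer": "class AuditConsumer : IConsumer<Contracts.AuditEvent>",
}
assert match_by_message_fqn(producers, consumers) == [
    ("Contracts.VoucherRedeemed", "RedemptionConsumer")
]
```

The same intersection works for `IHandleMessages<T>`, `@StreamListener`, etc. — only the consumer-side selector changes per registry entry.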
Tier 3 — Attribute / decorator-driven HTTP routes
Producer is a typed HTTP client whose interface methods carry route attributes; consumer is a controller/handler with matching route attributes. Both attribute values are literal strings — the easiest tier if extracted from the attribute, not the call site.
| Language | Producer | Consumer |
|---|---|---|
| C# | Refit, RestEase (`[Get("/x")]` on interface) | ASP.NET Core (`[HttpGet("/x")]` controller) |
| Java | Feign (`@RequestLine("GET /x")`), Retrofit (`@GET("/x")`) | Spring (`@GetMapping("/x")`), JAX-RS (`@GET @Path("/x")`) |
| Kotlin | Retrofit | Spring, Ktor route DSL |
| TypeScript | tsoa, NestJS HttpService with openapi-derived clients | NestJS (`@Get("/x")`), Hono, Express decorators |
| Python | httpx-codegen, aiohttp wrappers from openapi | FastAPI (`@app.get("/x")`), Litestar |
| Go | huma generated, oapi-codegen clients | huma, chi, gin, echo route registration |
| Rust | utoipa generated | actix-web, axum, rocket route attributes |
Detection: extract HTTP method + path from class-level / method-level attributes on both interfaces (producer) and concrete classes (consumer). Match.
Genericity: most universal — decorator-driven HTTP routing is the modern default in every serious web ecosystem.
Tier 4 — Config-resolved service discovery
The producer's call site has only a relative path or named-client reference; the actual base URL lives in a config file (appsettings.json, application.yaml, env vars, Kubernetes Service DNS, service-registry config). Consumer side uses Tier 3 attribute-driven detection.
| Ecosystem | Producer | Config source |
|---|---|---|
| C# | `IHttpClientFactory.CreateClient("name")` | appsettings*.json, `services.AddHttpClient(...)` |
| Spring | `@FeignClient(name="x", url="${promo.url}")` | application.yaml, env |
| Spring Cloud / Eureka / Consul | service registry lookups | registry config |
| Kubernetes | Service DNS (`http://promocode-service:80/x`) | Service / Ingress YAML |
| Node | env-driven base URLs in axios/fetch wrappers | .env, Helm values |
| Go | viper-loaded named services | YAML / env |
Detection: scan config files for named-service → base-URL mappings; trace CreateClient("name") / @FeignClient("name") to resolved URL; combine with the variable URL path within the calling method to reconstruct the full route.
Genericity: universal microservice pattern.
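The named-client resolution step can be sketched as a dictionary lookup plus URL joining. A minimal illustration — the config shape mimics appsettings.json-style named clients and is entirely hypothetical:

```python
from urllib.parse import urljoin

# Sketch: Tier 4 — resolve a named HTTP client to its configured base URL and
# combine it with the relative path used at the call site. The "HttpClients"/
# "BaseUrl" config shape is a made-up stand-in for real config scanning.

def resolve_route(config, client_name, relative_path):
    base = config.get("HttpClients", {}).get(client_name, {}).get("BaseUrl")
    if base is None:
        return None  # unresolved: fall back to a plain CALLS edge
    return urljoin(base.rstrip("/") + "/", relative_path.lstrip("/"))

config = {"HttpClients": {"promo": {"BaseUrl": "http://promocode-service:80"}}}
assert resolve_route(config, "promo", "/api/voucher") == \
    "http://promocode-service:80/api/voucher"
assert resolve_route(config, "unknown", "/api/voucher") is None
```

The reconstructed full URL then feeds the ordinary Tier 3 route matching on the consumer side; an unresolved name simply degrades to today's behavior.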
Tier 5 — Reflection / runtime-resolved DI (out of scope)
_serviceProvider.GetService(Type.GetType(configString))?.Invoke(...) is genuinely impossible to resolve statically. This tier is named for completeness but explicitly out of scope. Estimated <5% of cross-service calls in practice.
3. Proposed architecture
3.1 Plugin-based service-pattern registry
internal/cbm/service_patterns.c currently hardcodes 252 patterns in a C array. Adding a new framework requires a C patch + recompile. Proposal: externalize the pattern table to a YAML / JSON registry loaded at startup.
Format example (registry-format-1.yaml — actual schema TBD with maintainer):
```yaml
patterns:
  # Tier 2 — typed-message pub/sub
  - id: masstransit-publish
    languages: [csharp]
    kind: ASYNC_CALLS
    producer:
      match:
        type_implements: IPublishEndpoint
        method_pattern: "Publish<T>(...)"
      extract_id_from: generic_type_arg_or_first_arg_type
      id_kind: message_fqn
      broker: rabbitmq
    consumer:
      match:
        class_implements: "IConsumer<T>"
      extract_id_from: generic_type_arg
      id_kind: message_fqn

  - id: spring-cloud-stream-handler
    languages: [java, kotlin]
    kind: ASYNC_CALLS
    producer:
      match:
        method_calls: "streamBridge.send"
        first_arg_kind: string_literal
      extract_id_from: first_arg
      id_kind: channel_name
    consumer:
      match:
        method_annotation: "@StreamListener"
      extract_id_from: annotation_value
      id_kind: channel_name

  - id: refit-client
    languages: [csharp]
    kind: HTTP_CALLS
    producer:
      match:
        interface_method_attribute: "[Get|Post|Put|Delete|Patch]"
      extract_id_from: attribute_arg
      id_kind: http_route
    consumer:
      match:
        method_attribute: "[HttpGet|HttpPost|HttpPut|HttpDelete|HttpPatch]"
      extract_id_from: attribute_arg
      id_kind: http_route
```
Benefits:
- Adding Wolverine, Watermill, or any new framework is one YAML entry, not a code patch + release cycle
- Maintainer review surface drops dramatically (review YAML, not C)
- Community contributions become low-risk (a YAML PR can't crash the binary)
- Multi-language patterns compose naturally (one ID matches both Java and Kotlin via `languages: [java, kotlin]`)
Existing 252 patterns in service_patterns.c can be migrated to YAML in a separate cleanup PR (no behavior change, pure refactor) — out of scope for this proposal but a natural follow-on.
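To make the "low-risk YAML PR" claim concrete, here is a hedged sketch of the validation a registry loader might perform before accepting an entry. Field names follow the example schema above; the final schema is TBD with the maintainer, and the kind set is illustrative:

```python
# Sketch: minimal validation of a registry pattern entry (entries shown as
# already-parsed dicts, so no YAML dependency). Field names mirror the
# example schema above; all of this is provisional.

REQUIRED = {"id", "languages", "kind", "producer", "consumer"}
KNOWN_KINDS = {"HTTP_CALLS", "ASYNC_CALLS", "GRPC_CALLS"}  # illustrative

def validate_pattern(entry):
    """Return a list of problems; an empty list means the entry is accepted."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - entry.keys())]
    if entry.get("kind") not in KNOWN_KINDS:
        problems.append(f"unknown kind: {entry.get('kind')}")
    for side in ("producer", "consumer"):
        spec = entry.get(side)
        if isinstance(spec, dict) and "match" not in spec:
            problems.append(f"{side} has no match block")
    return problems

good = {"id": "x", "languages": ["go"], "kind": "ASYNC_CALLS",
        "producer": {"match": {}}, "consumer": {"match": {}}}
assert validate_pattern(good) == []
assert "unknown kind: CALLS" in validate_pattern({**good, "kind": "CALLS"})
```

A bad entry is rejected at load time with a readable diagnostic, never a crash — which is what keeps community YAML contributions low-risk.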
3.2 Pipeline integration
Three changes to the existing pipeline:

1. New pass: `pass_idl_scan` — runs once per repo before `pass_definitions`. Scans for IDL files (.proto, .graphql, openapi.yaml, asyncapi.yaml) and emits canonical Route nodes derived from them. Each Route gets a stable QN like `__idl_route__grpc__promocode.PromoCodeManagerGrpcService/GetVoucher` regardless of which language consumes it.
2. Extend `pass_calls.c` `emit_classified_edge` — when matching against the new YAML-driven patterns, support extracting identifiers from:
   - Generic type parameters (`Publish<T>`)
   - Constructor argument types (`Publish(new T(...))`)
   - Interface-method attributes (`[Get("/x")]`)
   - Class-level attributes (`@FeignClient(name="x")`)
   - Combined with config-resolved values (Tier 4)
3. Auto-trigger cross-repo pass for workspace siblings — when a repo is part of a workspace (e.g., cross-repo-intelligence mode is invoked once with `target_projects: ["*"]`), persist the workspace membership in the artifact, and on subsequent re-indexes auto-fire cross-repo matching against the same sibling set.
3.3 Cross-repo extension
The existing cbm_cross_repo_match already supports topic-based matching. Two extensions:
1. Add `match_by_message_fqn` — phase D (after HTTP / Async / Channel matching). For each ASYNC_CALLS edge with a `message_fqn` property, find consumer-side `IConsumer<message_fqn>` registrations in target DBs and emit CROSS_ASYNC_CALLS edges.
2. Add `match_by_grpc_method` — phase E. For each gRPC client call with a `service.method` identifier, find consumer-side `*Base` overrides of the same `service.method` and emit CROSS_GRPC_CALLS edges. Reuses the existing CROSS_GRPC_CALLS edge type and emission helper at `pass_cross_repo.c:657`.
Both extensions reuse the existing route-matching scaffolding (emit_cross_route_bidirectional). Pure additive code paths.
4. Tier 1 detailed spec — gRPC .proto matching
Proposed as the first PR to validate the architecture. Smallest scope, highest universality (8+ languages), zero framework variance (.proto syntax is standardized by Google).
4.1 Producer-side extraction
Detect references to generated gRPC client types. The detection signal is the type name pattern, not call-site strings:
- C#: classes/interfaces ending in `Client` derived from `Grpc.Core.ClientBase<T>` (generated by Grpc.Tools)
- Go: structs with `*grpc.ClientConn` field + methods matching .proto service methods
- Python: classes from `*_pb2_grpc.py` ending in `Stub`
- Java/Kotlin: classes ending in `*Grpc.*Stub` (generated by protoc-gen-grpc-java)
- TypeScript: classes from `*_pb_grpc.d.ts` with the right shape
- Rust: tonic-generated `*Client` structs
For each method call on a generated client type:
- Resolve the client type to its underlying `service.method` pair (recoverable from .proto — see §4.3)
- Emit a CALLS edge with new properties: `{rpc_kind: "grpc", service: "promocode.PromoCodeManagerGrpcService", method: "GetVoucher"}`
4.2 Consumer-side extraction
Detect classes implementing the generated gRPC server-base type:
- C#: `: PromoCodeManagerGrpcServiceBase`
- Go: structs with method receivers matching the unimplemented server interface
- Python: classes inheriting `*Servicer`
- Java: `extends *ImplBase`
- Rust: `impl *Server for ...`
For each override of a service method, emit a Route node with QN __idl_route__grpc__<service>/<method> and a HANDLES edge from the implementing class.
4.3 IDL parsing
pass_idl_scan reads .proto files (anywhere in the repo by default; configurable) and builds the canonical service.method → package.Service.method mapping. Tree-sitter has a maintained tree-sitter-proto grammar. ~150 LOC for the parser + AST walk + node emission.
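The pass_idl_scan idea in miniature: pull `service`/`rpc` pairs out of a .proto source and build the canonical IDs. This regex sketch is illustrative only — the real pass would walk the tree-sitter-proto AST, which also handles nested braces and options that a regex does not:

```python
import re

# Sketch: extract canonical package.Service/Method IDs from a .proto file.
# Regex-based for brevity; the real pass uses the tree-sitter grammar.

PACKAGE = re.compile(r"^\s*package\s+([\w.]+)\s*;", re.M)
SERVICE = re.compile(r"service\s+(\w+)\s*\{([^}]*)\}", re.S)  # no nested braces
RPC = re.compile(r"rpc\s+(\w+)\s*\(")

def scan_proto(text):
    """Return canonical package.Service/Method IDs found in a .proto file."""
    pkg = PACKAGE.search(text)
    prefix = pkg.group(1) + "." if pkg else ""
    ids = []
    for svc, body in SERVICE.findall(text):
        ids.extend(f"{prefix}{svc}/{rpc}" for rpc in RPC.findall(body))
    return ids

proto = """
syntax = "proto3";
package promocode;

service PromoCodeManagerGrpcService {
  rpc GetVoucher (GetVoucherRequest) returns (VoucherReply);
}
"""
assert scan_proto(proto) == ["promocode.PromoCodeManagerGrpcService/GetVoucher"]
```

Each returned ID becomes the `service.method` half of an `__idl_route__grpc__...` Route node QN.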
4.4 Cross-repo matching
Add match_grpc_calls as Phase E in cbm_cross_repo_match:
```c
/* For each producer-side CALLS edge with rpc_kind=grpc:
 * 1. Look up service+method in target project's IDL-derived Routes
 * 2. If found, find the HANDLES edge from a *Base class
 * 3. Emit CROSS_GRPC_CALLS bidirectional edge
 */
```
Reuses emit_cross_route_bidirectional and the existing CROSS_GRPC_CALLS edge type. ~100 LOC.
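The shape of Phase E can be sketched as a lookup over the sibling project's IDL-derived routes. Illustrative Python with simplified stand-ins for cbm's edge and node records (the dict/tuple shapes are hypothetical):

```python
# Sketch: Phase E — match producer-side gRPC call edges against consumer-side
# IDL-derived routes in a sibling project DB. Data shapes are simplified
# stand-ins for cbm's edge/node records.

def match_grpc_calls(producer_edges, target_routes):
    """producer_edges: [{"caller": qn, "service": s, "method": m}, ...]
    target_routes: "service/method" -> handler class QN (via its HANDLES edge)
    Returns (caller, handler) pairs to emit as CROSS_GRPC_CALLS."""
    out = []
    for edge in producer_edges:
        key = f"{edge['service']}/{edge['method']}"
        handler = target_routes.get(key)
        if handler:
            out.append((edge["caller"], handler))
    return out

edges = [{"caller": "svc_a.PromoClient.GetVoucher",
          "service": "promocode.PromoCodeManagerGrpcService",
          "method": "GetVoucher"}]
routes = {"promocode.PromoCodeManagerGrpcService/GetVoucher":
          "svc_b.PromoCodeManagerService"}
assert match_grpc_calls(edges, routes) == [
    ("svc_a.PromoClient.GetVoucher", "svc_b.PromoCodeManagerService")
]
```

Because .proto service/method names are globally qualified by package, this lookup has no naming ambiguity, which is why the precision target below is 100%.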
4.5 Estimated diff size
| Component | Files | LOC |
|---|---|---|
| pass_idl_scan.c (new) | 1 | ~150 |
| service_patterns.c (gRPC client/server type recognizers) | 1 | ~50 |
| pass_calls.c extension for typed-client RPC properties | 1 | ~60 |
| pass_cross_repo.c Phase E | 1 | ~100 |
| Tests + fixtures (multi-language) | several | ~200 |
| Total | | ~560 LOC |
Larger than my earlier 300-LOC estimate — that didn't include tests. Production-grade with tests is ~560.
4.6 Test fixtures (multi-language)
Three tiny service repos plus a shared contracts directory, shipped under testdata/cross-repo/grpc/:

- `service-a-csharp/` — minimal .NET project consuming service-b's gRPC client
- `service-b-go/` — minimal Go gRPC server implementing the .proto from contracts/
- `service-c-python/` — minimal Python consumer of service-b's gRPC service
- `contracts/` — single .proto file shared by all three
Test asserts: after indexing all four (contracts/ first, then services), cbm_cross_repo_match emits CROSS_GRPC_CALLS edges from a-csharp and c-python to b-go's handler classes, with correct service+method properties.
4.7 Success criteria
| Metric | Target |
|---|---|
| Precision on test fixtures | 100% (deterministic — gRPC has no naming ambiguity) |
| Recall on test fixtures | 100% — all known cross-service calls detected |
| Index-time overhead (repos with .proto files) | <5% additional time |
| Index-time overhead (repos without .proto files) | 0% (pass_idl_scan is a no-op) |
| Memory overhead | proportional to .proto count, ~1KB per service definition |
| Backwards compatibility | All existing tests pass; existing CROSS_* edges unchanged |
5. Roadmap — Tiers 2–4
Each tier is a separate PR after Tier 1 lands. Sequence chosen by descending universality and ascending implementation complexity.
5.1 Tier 2 — typed message pub/sub (after Tier 1)
Scope: introduce the YAML-driven service-pattern registry; ship initial registry covering MassTransit (C#), Spring Cloud Stream (Java/Kotlin), and NestJS (TS) as proof of multi-language genericity. Add pass_message_synthesis that emits ASYNC_CALLS edges keyed by message_fqn instead of requiring a topic literal. Extend pass_cross_repo Phase D to match by message_fqn.
Estimated LOC: ~800 (registry loader, YAML schema, three framework definitions, new pass, cross-repo extension, tests).
Risk: brittleness on framework-version drift (MassTransit v8 vs v7 have slightly different interface shapes). Mitigation: registry entries can be version-tagged; pattern matching tolerates shape variance.
5.2 Tier 3 — attribute-driven routes (after Tier 2)
Scope: extend pass_route_nodes.c to extract routes from interface-method attributes (Refit / Retrofit / Feign) on the producer side. Match against existing controller-side attribute extraction. Most attribute-driven controller patterns are already detected by cbm — this tier closes the producer-side gap.
Estimated LOC: ~400.
Risk: low. Attribute syntax is declarative and stable across framework versions.
5.3 Tier 4 — config-resolved service discovery (after Tier 3)
Scope: extend pass_envscan to also parse appsettings.json, application.yaml, helm values, kustomize overlays. Build named-client → base-URL maps. Add light intra-method dataflow to resolve path = $"/api/{x}" patterns. Combine with named-client resolution to reconstruct full URLs.
Estimated LOC: ~1200. Largest tier — config parsing across multiple ecosystems is genuinely complex.
Risk: medium-high. Variable resolution can produce false positives; mitigation is confidence scoring on the emitted edges (high confidence when literal, lower when resolved through 2+ hops).
5.4 Combined coverage estimate
After Tiers 1–3 land (Tier 4 is bonus), realistic recall on cross-service edges in modern strongly-typed codebases:
| Code style | Estimated recall |
|---|---|
| Go + gRPC + literal HTTP URLs | 95%+ (Tier 1 alone covers most) |
| Java/Spring + Feign + Cloud Stream | 90%+ (Tiers 1+2+3) |
| .NET / CQRS + MediatR + MassTransit + gRPC | 90%+ (Tiers 1+2; HttpClient gap = Tier 4) |
| TypeScript / NestJS + microservices | 85%+ (Tiers 1+2+3) |
| Python / FastAPI + Celery + httpx-codegen | 85%+ |
| Plain Python/Node with literal URLs (today's recall) | unchanged, still works |
6. Risks and mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| Tree-sitter pattern brittleness across language versions | Medium | YAML registry allows per-version patterns; tests cover N-1 and N versions of each framework |
| YAML registry becomes a maintenance burden | Medium | Limit official registry to top 10 frameworks per language; community contributions land via PR review with required test fixtures |
| False-positive cross-repo edges from name collisions | Low | Confidence scoring on each edge; collisions reported in `cbm_cross_repo_result_t.collisions[]` |
| Increased index time | Low | New passes are conditional (no .proto files = no IDL pass); benchmarks on every PR |
| Variable URL resolution (Tier 4) produces wrong routes | Medium | Confidence scoring; only emit cross-repo edge if resolved confidence > 0.7; consumer-side validation catches bad matches |
| Reflection / runtime-resolved DI is impossible | High but acknowledged | Explicitly out of scope (Tier 5); document as known limitation |
| Maintainer-burden objection | Medium | Plugin registry shifts most additions to YAML; core C surface area kept minimal |
| Patch size scares reviewers | High for big-bang, Low for tier-by-tier | Submit Tier 1 first as standalone PR; subsequent tiers reference Tier 1's architecture |
7. Open questions for the maintainer
- Pattern-registry format preference: YAML, TOML, JSON, or compiled-in C tables with a build-time generator? YAML is most readable but adds a YAML parser to runtime; TOML or JSON minimize parser surface.
- IDL file discovery: walk-the-repo by default, or require explicit idl_paths config? Walk-the-repo has zero-config UX but may pick up vendored proto files in node_modules or vendor/. Suggest default-walk + standard exclusion list.
- Cross-repo auto-trigger model: store workspace membership in the per-repo artifact, or in a separate workspace-level artifact? Per-repo is simpler but duplicates state; workspace-level is cleaner but adds a new artifact kind.
- Confidence scoring: should cross-repo edges carry a `confidence` property explicitly, or rely on the existing properties JSON blob? A first-class confidence field makes downstream consumers' job easier.
- Existing pattern table migration: should the 252 patterns in service_patterns.c migrate to the YAML registry as part of this work, or stay in C with the registry only handling new patterns? Recommendation: keep C patterns as-is for v1, migrate in a separate cleanup PR after the YAML schema is proven stable.
- Tier 4 dataflow scope: how aggressive should intra-method variable resolution be? Single-assignment + string-concat is safe; following data through helper methods gets harder. Suggest single-method scope for v1.
- Test-fixture monorepo strategy: ship the multi-language fixtures in the cbm repo, or reference an external cbm-test-fixtures repo to keep the main repo small? The fixtures total ~5MB across 3-4 languages — manageable in-tree.
8. Why this is worth merging upstream
cbm's competitive position vs. Sourcegraph / Backstage / Apollo:
- Sourcegraph does cross-repo references, but per-symbol, not protocol-aware. cbm + Tiers 1–3 would be the only AST-based tool emitting structured `CROSS_GRPC_CALLS` / `CROSS_ASYNC_CALLS` edges keyed by protocol identifiers.
- Backstage builds a service graph from declarative IDL files but requires manual catalog upkeep. cbm + this proposal derives the service graph automatically from the same IDL files plus the consuming code.
- Apollo Studio does federated GraphQL via `@key` matching. cbm + this proposal generalizes the same idea to gRPC, OpenAPI, AsyncAPI, and typed-message ecosystems.
Position: cbm becomes the only single-binary, AST+LSP-grade tool that derives a complete service interaction graph automatically from source. That's a defensible product position.
The capability is asked for in every code-graph tool's roadmap (often as "service mesh visualization" or "API surface discovery"). cbm has the structural advantage to ship it first.
9. Appendix — example YAML registry entries
Full registry entries for the ten frameworks Tier 2 should ship with:
```yaml
patterns:
  # ── C# / .NET ──────────────────────────────────────────────────
  - id: masstransit-publish
    languages: [csharp]
    kind: ASYNC_CALLS
    producer:
      match: { type_implements: "IPublishEndpoint", method: "Publish" }
      extract_id_from: generic_arg_or_first_arg_type
      id_kind: message_fqn
      broker: rabbitmq
    consumer:
      match: { class_implements: "IConsumer<T>" }
      extract_id_from: generic_type_arg
      id_kind: message_fqn

  - id: masstransit-send
    languages: [csharp]
    kind: ASYNC_CALLS
    producer:
      match: { type_implements: "ISendEndpoint", method: "Send" }
      extract_id_from: generic_arg_or_first_arg_type
      id_kind: message_fqn
      broker: rabbitmq
    consumer: { same_as: masstransit-publish.consumer }

  - id: nservicebus-publish
    languages: [csharp]
    kind: ASYNC_CALLS
    producer:
      match: { type_implements: "IMessageSession", method: "Publish" }
      extract_id_from: first_arg_type
      id_kind: message_fqn
    consumer:
      match: { class_implements: "IHandleMessages<T>" }
      extract_id_from: generic_type_arg
      id_kind: message_fqn

  # ── Java / Kotlin / Spring ─────────────────────────────────────
  - id: spring-cloud-stream-bridge
    languages: [java, kotlin]
    kind: ASYNC_CALLS
    producer:
      match: { type_implements: "StreamBridge", method: "send" }
      extract_id_from: first_arg
      id_kind: channel_name
    consumer:
      match: { method_annotation: "@StreamListener" }
      extract_id_from: annotation_value
      id_kind: channel_name

  - id: axon-command
    languages: [java, kotlin]
    kind: ASYNC_CALLS
    producer:
      match: { type_implements: "CommandGateway", method: "send" }
      extract_id_from: first_arg_type
      id_kind: message_fqn
    consumer:
      match: { method_annotation: "@CommandHandler" }
      extract_id_from: parameter_type
      id_kind: message_fqn

  # ── Node / TypeScript ──────────────────────────────────────────
  - id: nestjs-message-pattern
    languages: [typescript]
    kind: ASYNC_CALLS
    producer:
      match: { method_annotation: "@MessagePattern" }
      extract_id_from: annotation_value
      id_kind: message_pattern
    consumer:
      match: { method_annotation: "@MessagePattern" }
      extract_id_from: annotation_value
      id_kind: message_pattern

  - id: nestjs-event-pattern
    languages: [typescript]
    kind: ASYNC_CALLS
    producer:
      match: { method: "emit", type_implements: "ClientProxy" }
      extract_id_from: first_arg
      id_kind: event_pattern
    consumer:
      match: { method_annotation: "@EventPattern" }
      extract_id_from: annotation_value
      id_kind: event_pattern

  # ── Python ──────────────────────────────────────────────────────
  - id: faust-agent
    languages: [python]
    kind: ASYNC_CALLS
    producer:
      match: { method_call: "topic.send" }
      extract_id_from: receiver_var_topic_name
      id_kind: kafka_topic
    consumer:
      match: { decorator: "@app.agent" }
      extract_id_from: decorator_arg
      id_kind: kafka_topic

  # ── Go ──────────────────────────────────────────────────────────
  - id: watermill-publish
    languages: [go]
    kind: ASYNC_CALLS
    producer:
      match: { type_implements: "message.Publisher", method: "Publish" }
      extract_id_from: first_arg
      id_kind: topic_name
    consumer:
      match: { type_implements: "message.Subscriber", method: "Subscribe" }
      extract_id_from: first_arg
      id_kind: topic_name

  # ── Rust ───────────────────────────────────────────────────────
  - id: async-nats-publish
    languages: [rust]
    kind: ASYNC_CALLS
    producer:
      match: { method: "publish", type_implements: "Client" }
      extract_id_from: first_arg
      id_kind: nats_subject
    consumer:
      match: { method: "subscribe", type_implements: "Client" }
      extract_id_from: first_arg
      id_kind: nats_subject
```
Schema notes:
- `match` block defines the AST-pattern selector (interface implementation, attribute presence, method-call shape)
- `extract_id_from` names a strategy from a fixed enum (`generic_type_arg`, `first_arg`, `first_arg_type`, `annotation_value`, `attribute_arg`, `receiver_var_topic_name`, etc.)
- `id_kind` declares the namespace of the extracted identifier (so `kafka_topic` from one framework matches `kafka_topic` from another, but never matches `message_fqn`)
- `broker` is optional metadata that flows into the emitted edge
Protocol-Aware Cross-Repo Intelligence
Status: Design proposal for upstream
codebase-memory-mcpAudience: cbm maintainer + reviewers
Scope: Extends
pass_cross_repofrom literal-string matching to protocol-aware matching across the four cross-language patterns that account for >95% of inter-service communication in modern codebases.Compatibility: Strictly additive. No breaking changes to existing tools, edges, or APIs. Builds on PR #281 (rich
get_architecturefields).TL;DR
cbm already has the scaffolding for cross-repo intelligence (
pass_cross_repo.c,CROSS_HTTP_CALLS/CROSS_ASYNC_CALLS/CROSS_GRPC_CALLSedge types, named-route matching). The current implementation only fires when a call site has a literal URL or topic string as its first argument. That covers idiomatic Python/Node code well, but misses the dominant pattern in modern strongly-typed stacks (Java/Spring, .NET, Kotlin, Go-with-codegen): typed clients and message handlers where the routing identifier is a generic type parameter, an interface ancestor, an attribute, or a config-resolved name — never a literal string at the call site.This proposal adds four protocol-aware extraction tiers, each language-generic, behind a YAML-driven service-pattern registry so adding new frameworks is a config edit rather than a C patch. Tier 1 (gRPC
.protomatching) is proposed first as a ~300-LOC working PR to validate the architecture; Tiers 2–4 follow as separate PRs.Cross-language framework coverage matrix at the end. Acceptance gating: each tier ships independently, success measured by precision/recall against multi-language test fixtures.
1. Background
1.1 What works in cbm today
After PR #281 lands,
get_architecture(aspects=["all"])returns rich structural data (entry_points, routes, hotspots, layers, boundaries, languages). The per-repo extraction pipeline detects:service_patterns.c:631— 252 patterns across HTTP, async, gRPC, config, route-registration kinds, covering Python/Node/Go/Java/Rust/PHP/Ruby/C# basics)pass_calls.c:emit_http_async_edge)app.get("/x", ...)and attribute-routed framework styles (pass_route_nodes.c)pass_cross_repo.c:cbm_cross_repo_match)The
cross-repo-intelligencemode inindex_repositorymatches__route__<METHOD>__<path>keys across project DBs and emits CROSS_HTTP_CALLS / CROSS_ASYNC_CALLS / CROSS_GRPC_CALLS edges.1.2 What doesn't work
`emit_http_async_edge` (`pass_calls.c`, line ~232): if the first string argument is not a literal URL or topic, the edge falls through to `CALLS` — generic, unrouted, untaggable for cross-repo matching. Idiomatic code in major modern frameworks rarely passes a literal URL or topic at the call site; in each of these frameworks, the producer-side identifier (message type FQN, gRPC service.method, Feign interface annotation) is statically present and resolvable — but not as a literal string argument. It lives in: a generic type parameter, the constructor type of an argument, a class-level attribute, or a method-level attribute.
The consumer side has the same identifier visible in a different syntactic position: an `IConsumer<T>` declaration, a `*Base` implementation, a `@StreamListener<T>` annotation, an attribute-routed controller method.

The matching problem is solvable. The producer/consumer identifier exists statically on both sides. cbm's current extractor just doesn't extract it.
1.3 Why this matters now
Cross-repo intelligence is one of the most-asked-for capabilities in code-graph tools. Industry tooling that does parts of this:
- Apollo Federation: GraphQL subgraph composition via `@key` directives

cbm is uniquely positioned: single binary, AST + LSP-grade extraction, sub-second incremental indexing, no external service dependencies. The cross-repo capability matters because it's the missing 20% of value that turns "smart code search" into "service-architecture truth source."
2. Cross-language pattern audit
The producer→consumer routing problem decomposes into four tiers. Each tier is generic across major language ecosystems. Concrete framework instances per tier:
Tier 1 — IDL-driven typed stubs (gRPC, GraphQL, OpenAPI, AsyncAPI)
The stable identifier lives in an IDL file shared between producer and consumer repos. Both sides reference generated types derived from the same IDL.
- `*Client` from .proto codegen
- `*Base` / `*Servicer` impl from .proto codegen
- `service.method` from .proto
- `@key` directive

Detection: parse the IDL file, extract canonical IDs as routes; on producer side find references to generated client types; on consumer side find generated base-class implementations.
Genericity: 100%. gRPC alone covers 8+ languages. .proto/.graphql/.openapi files are language-agnostic by design.
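To make the detection step concrete: the proposal uses the `tree-sitter-proto` grammar (§4.3), but the mapping from .proto service declarations to canonical route IDs can be sketched with a regex scan. The `__idl_route__grpc__<service>/<method>` QN shape follows §3.2; the function names here are illustrative, not cbm's API:

```python
import re

# Illustrative sketch only: the real pass would use tree-sitter-proto, not
# regexes (this version ignores nested blocks, comments, and streams).
PACKAGE_RE = re.compile(r"package\s+([\w.]+)\s*;")
SERVICE_RE = re.compile(r"service\s+(\w+)\s*\{([^}]*)\}", re.S)
RPC_RE = re.compile(r"rpc\s+(\w+)\s*\(")

def idl_routes(proto_text):
    """Extract canonical gRPC route QNs from a .proto source string."""
    pkg = PACKAGE_RE.search(proto_text)
    prefix = pkg.group(1) + "." if pkg else ""
    routes = []
    for svc, body in SERVICE_RE.findall(proto_text):
        for method in RPC_RE.findall(body):
            routes.append(f"__idl_route__grpc__{prefix}{svc}/{method}")
    return routes

proto = """
syntax = "proto3";
package promocode;
service PromoCodeManagerGrpcService {
  rpc GetVoucher (VoucherRequest) returns (VoucherReply);
}
"""
print(idl_routes(proto))
# → ['__idl_route__grpc__promocode.PromoCodeManagerGrpcService/GetVoucher']
```

Because the QN is derived from the IDL rather than from any one language's codegen output, producer and consumer repos in different languages land on the same key.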
Tier 2 — Typed message pub/sub (interface-ancestor + generic-type)
The stable identifier is a message type's fully-qualified name. Producer has
`Publish<T>` / `Send<T>` / equivalent on a known interface; consumer implements `IConsumer<T>` / `@MessageHandler` / equivalent.

- `IPublishEndpoint.Publish<T>` / `ISendEndpoint.Send<T>` / `IBus.Publish<T>` → `IConsumer<T>` / `IHandleMessages<T>`
- `streamBridge.send(...)` / `@CommandHandler` → `@StreamListener<T>` / `@EventHandler`
- `@MessagePattern<T>` emit → `@EventPattern<T>` handler
- `@app.agent` / `send`

Detection: pattern-match the producer interface (e.g., `IPublishEndpoint`, `streamBridge`) with its `Publish<T>` / `Send<T>` method, extract `T` from the generic param or the constructor argument's type. On the consumer side, find classes implementing `IConsumer<T>` / `IHandleMessages<T>`, or classes with `@StreamListener<T>` on a method, and extract `T`. Match by FQN.

Genericity: highly cross-language. ~6 framework families, identical abstract pattern.
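The matching step itself reduces to a join on the extracted FQN. A minimal sketch; the data shapes (`message_fqn` key, dict-of-lists registrations) and example names are illustrative, not cbm's internal structures:

```python
def match_by_message_fqn(producer_edges, consumer_registrations):
    """Join producer-side ASYNC_CALLS edges to consumer-side registrations
    by message FQN, yielding CROSS_ASYNC_CALLS-style (caller, handler, fqn)."""
    matches = []
    for edge in producer_edges:
        for handler in consumer_registrations.get(edge["message_fqn"], []):
            matches.append((edge["caller"], handler, edge["message_fqn"]))
    return matches

# Producer repo: Publish<OrderCreated>, FQN extracted from the generic type arg.
producers = [{"caller": "Billing.SubmitOrder", "message_fqn": "Contracts.OrderCreated"}]
# Consumer repo: class OrderCreatedConsumer : IConsumer<OrderCreated>.
consumers = {"Contracts.OrderCreated": ["Shipping.OrderCreatedConsumer"]}
print(match_by_message_fqn(producers, consumers))
# → [('Billing.SubmitOrder', 'Shipping.OrderCreatedConsumer', 'Contracts.OrderCreated')]
```

The `match_by_message_fqn` name follows the phase-D extension proposed in §3.3; everything else here is a sketch under those assumptions.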
Tier 3 — Attribute / decorator-driven HTTP routes
Producer is a typed HTTP client whose interface methods carry route attributes; consumer is a controller/handler with matching route attributes. Both attribute values are literal strings — the easiest tier if extracted from the attribute, not the call site.
[Get("/x")]on interface)[HttpGet("/x")]controller)@RequestLine("GET /x")), Retrofit (@GET("/x"))@GetMapping("/x")), JAX-RS (@GET @Path("/x"))@Get("/x")), Hono, Express decorators@app.get("/x")), LitestarDetection: extract HTTP method + path from class-level / method-level attributes on both interfaces (producer) and concrete classes (consumer). Match.
Genericity: most universal — decorator-driven HTTP routing is the modern default in every serious web ecosystem.
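Since both sides reduce to literal (method, path) pairs, matching is a key intersection. A sketch using the existing `__route__<METHOD>__<path>` key format from §1.1; attribute parsing is elided and all example names are illustrative:

```python
def route_key(http_method, path):
    """Normalize an attribute-derived route into the __route__ key format."""
    return f"__route__{http_method.upper()}__{path}"

# Producer side: Refit-style interface attribute [Get("/x")].
producer_routes = {route_key("GET", "/x"): "IPromoApi.GetX"}
# Consumer side: ASP.NET-style controller attribute [HttpGet("/x")].
consumer_routes = {route_key("get", "/x"): "PromoController.GetX"}

# Matching across repos is an intersection of the two route tables.
shared = producer_routes.keys() & consumer_routes.keys()
edges = [(producer_routes[k], consumer_routes[k]) for k in sorted(shared)]
print(edges)
# → [('IPromoApi.GetX', 'PromoController.GetX')]
```

Normalizing the method casing in one place is what lets `[Get("/x")]` and `[HttpGet("/x")]` from different repos land on the same key.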
Tier 4 — Config-resolved service discovery
The producer's call site has only a relative path or named-client reference; the actual base URL lives in a config file (
`appsettings.json`, `application.yaml`, env vars, Kubernetes Service DNS, service-registry config). Consumer side uses Tier 3 attribute-driven detection.

- `IHttpClientFactory.CreateClient("name")`: resolved via `appsettings*.json`, `services.AddHttpClient(...)`
- `@FeignClient(name="x", url="${promo.url}")`: resolved via `application.yaml`, env
- Kubernetes Service DNS (`http://promocode-service:80/x`): resolved via `.env`, Helm values

Detection: scan config files for named-service → base-URL mappings; trace `CreateClient("name")` / `@FeignClient("name")` to the resolved URL; combine with the variable URL path within the calling method to reconstruct the full route.

Genericity: universal microservice pattern.
Tier 5 — Reflection / runtime-resolved DI (out of scope)
`_serviceProvider.GetService(Type.GetType(configString))?.Invoke(...)` is genuinely impossible to resolve statically. This tier is named for completeness but explicitly out of scope. Estimated <5% of cross-service calls in practice.

3. Proposed architecture
3.1 Plugin-based service-pattern registry
`internal/cbm/service_patterns.c` currently hardcodes 252 patterns in a C array. Adding a new framework requires a C patch + recompile. Proposal: externalize the pattern table to a YAML / JSON registry loaded at startup. Format example (`registry-format-1.yaml` — actual schema TBD with maintainer):
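One possible entry shape, borrowing the field names from the §9 schema notes (`match`, `extract_id_from`, `id_kind`, `broker`); the `name`, `kind`, and `languages` keys and all values are illustrative, not a settled schema:

```yaml
# registry-format-1.yaml (illustrative strawman only)
- name: masstransit-publish          # hypothetical entry id
  languages: [csharp]
  kind: async_producer
  match:
    method_call: "IPublishEndpoint.Publish"
  extract_id_from: generic_type_arg  # strategy enum, see §9 schema notes
  id_kind: message_fqn
  broker: rabbitmq                   # optional metadata on the emitted edge
```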
Benefits:
- entries can be scoped per language (`languages: [java, kotlin]`)
Existing 252 patterns in `service_patterns.c` can be migrated to YAML in a separate cleanup PR (no behavior change, pure refactor) — out of scope for this proposal but a natural follow-on.

3.2 Pipeline integration
Three changes to the existing pipeline:
New pass:
`pass_idl_scan` — runs once per repo before `pass_definitions`. Scans for IDL files (.proto, .graphql, openapi.yaml, asyncapi.yaml) and emits canonical Route nodes derived from them. Each Route gets a stable QN like `__idl_route__grpc__promocode.PromoCodeManagerGrpcService/GetVoucher` regardless of which language consumes it.

Extend `pass_calls.c` `emit_classified_edge` — when matching against the new YAML-driven patterns, support extracting identifiers from:

- generic type parameters (`Publish<T>`)
- constructor argument types (`Publish(new T(...))`)
- method-level attributes (`[Get("/x")]`)
- class-level annotations (`@FeignClient(name="x")`)

Auto-trigger cross-repo pass for workspace siblings — when a repo is part of a workspace (e.g., `cross-repo-intelligence` mode is invoked once with `target_projects: ["*"]`), persist the workspace membership in the artifact, and on subsequent re-indexes auto-fire cross-repo matching against the same sibling set.

3.3 Cross-repo extension
The existing
`cbm_cross_repo_match` already supports topic-based matching. Two extensions:

Add `match_by_message_fqn` — phase D (after HTTP / Async / Channel matching). For each ASYNC_CALLS edge with a `message_fqn` property, find consumer-side `IConsumer<message_fqn>` registrations in target DBs and emit CROSS_ASYNC_CALLS edges.

Add `match_by_grpc_method` — phase E. For each gRPC client call with a `service.method` identifier, find consumer-side `*Base` overrides of the same `service.method` and emit CROSS_GRPC_CALLS edges. Reuses the existing CROSS_GRPC_CALLS edge type and emission helper at `pass_cross_repo.c:657`.

Both extensions reuse the existing route-matching scaffolding (`emit_cross_route_bidirectional`). Pure additive code paths.

4. Tier 1 detailed spec — gRPC
`.proto` matching

Proposed as the first PR to validate the architecture. Smallest scope, highest universality (8+ languages), zero framework variance (.proto syntax is standardized by Google).
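Before the per-side details, the end-to-end flow can be sketched: producer-side GRPC_CALLS edges carrying service/method properties (§4.1) joined to consumer-side `__idl_route__grpc__` routes (§4.2) yield CROSS_GRPC_CALLS. The `match_grpc_calls` name follows §4.4; the dict shapes are illustrative, not the C implementation:

```python
def match_grpc_calls(producer_edges, consumer_routes):
    """Phase E sketch: join producer GRPC_CALLS edges to consumer-side
    __idl_route__grpc__ routes by service+method."""
    cross = []
    for edge in producer_edges:
        qn = f"__idl_route__grpc__{edge['service']}/{edge['method']}"
        for handler in consumer_routes.get(qn, []):
            cross.append({"type": "CROSS_GRPC_CALLS",
                          "from": edge["caller"], "to": handler,
                          "service": edge["service"], "method": edge["method"]})
    return cross

# Producer repo (C#): a call on a generated *Client type.
producers = [{"caller": "VoucherClientWrapper.Get",
              "service": "promocode.PromoCodeManagerGrpcService",
              "method": "GetVoucher"}]
# Consumer repo (Go): HANDLES edges onto the IDL-derived route node.
consumers = {"__idl_route__grpc__promocode.PromoCodeManagerGrpcService/GetVoucher":
             ["PromoCodeGrpcHandler.GetVoucher"]}
print(match_grpc_calls(producers, consumers)[0]["to"])
# → PromoCodeGrpcHandler.GetVoucher
```

Because both sides key off the IDL-derived QN rather than language-specific codegen names, the join is language-agnostic by construction.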
4.1 Producer-side extraction
Detect references to generated gRPC client types. The detection signal is the type name pattern, not call-site strings:
- C#: `*Client` derived from `Grpc.Core.ClientBase<T>` (generated by `Grpc.Tools`)
- Go: structs with a `*grpc.ClientConn` field + methods matching .proto service methods
- Python: classes in `*_pb2_grpc.py` ending in `Stub`
- Java: `*Grpc.*Stub` (generated by `protoc-gen-grpc-java`)
- TypeScript: generated `*_pb_grpc.d.ts` types with the right shape
- Rust: generated `*Client` structs

For each method call on a generated client type:

- extract the `service.method` pair (recoverable from .proto — see §4.3)
- emit a GRPC_CALLS edge with properties `{rpc_kind: "grpc", service: "promocode.PromoCodeManagerGrpcService", method: "GetVoucher"}`

4.2 Consumer-side extraction
Detect classes implementing the generated gRPC server-base type:
- C#: `: PromoCodeManagerGrpcServiceBase`
- Python: `*Servicer`
- Java: `extends *ImplBase`
- Rust: `impl *Server for ...`

For each override of a service method, emit a Route node with QN `__idl_route__grpc__<service>/<method>` and a HANDLES edge from the implementing class.

4.3 IDL parsing
`pass_idl_scan` reads `.proto` files (anywhere in the repo by default; configurable) and builds the canonical `service.method` → `package.Service.method` mapping. Tree-sitter has a maintained `tree-sitter-proto` grammar. ~150 LOC for the parser + AST walk + node emission.

4.4 Cross-repo matching
Add `match_grpc_calls` as Phase E in `cbm_cross_repo_match`. Reuses `emit_cross_route_bidirectional` and the existing CROSS_GRPC_CALLS edge type. ~100 LOC.

4.5 Estimated diff size
- `pass_idl_scan.c` (new)
- `service_patterns.c` (gRPC client/server type recognizers)
- `pass_calls.c` extension for typed-client RPC properties
- `pass_cross_repo.c` Phase E

Larger than my earlier 300-LOC estimate — that didn't include tests. Production-grade with tests is ~560.
4.6 Test fixtures (multi-language)
Three tiny repos shipped under
`testdata/cross-repo/grpc/`:

- `service-a-csharp/` — minimal .NET project consuming `service-b`'s gRPC client
- `service-b-go/` — minimal Go gRPC server implementing the .proto from `contracts/`
- `service-c-python/` — minimal Python consumer of `service-b`'s gRPC service
- `contracts/` — single .proto file shared by all three

Test asserts: after indexing all four (`contracts/` first, then services), `cbm_cross_repo_match` emits CROSS_GRPC_CALLS edges from a-csharp and c-python to b-go's handler classes, with correct service+method properties.

4.7 Success criteria
5. Roadmap — Tiers 2–4
Each tier is a separate PR after Tier 1 lands. Sequence chosen by descending universality and ascending implementation complexity.
5.1 Tier 2 — typed message pub/sub (after Tier 1)
Scope: introduce the YAML-driven service-pattern registry; ship initial registry covering MassTransit (C#), Spring Cloud Stream (Java/Kotlin), and NestJS (TS) as proof of multi-language genericity. Add
`pass_message_synthesis` that emits ASYNC_CALLS edges keyed by `message_fqn` instead of requiring a topic literal. Extend `pass_cross_repo` Phase D to match by `message_fqn`.

Estimated LOC: ~800 (registry loader, YAML schema, three framework definitions, new pass, cross-repo extension, tests).
Risk: brittleness on framework-version drift (MassTransit v8 vs v7 have slightly different interface shapes). Mitigation: registry entries can be version-tagged; pattern matching tolerates shape variance.
5.2 Tier 3 — attribute-driven routes (after Tier 2)
Scope: extend
`pass_route_nodes.c` to extract routes from interface-method attributes (Refit / Retrofit / Feign) on the producer side. Match against existing controller-side attribute extraction. Most attribute-driven controller patterns are already detected by cbm — this tier closes the producer-side gap.

Estimated LOC: ~400.
Risk: low. Attribute syntax is declarative and stable across framework versions.
5.3 Tier 4 — config-resolved service discovery (after Tier 3)
Scope: extend
`pass_envscan` to also parse appsettings.json, application.yaml, Helm values, kustomize overlays. Build named-client → base-URL maps. Add light intra-method dataflow to resolve `path = $"/api/{x}"` patterns. Combine with named-client resolution to reconstruct full URLs.

Estimated LOC: ~1200. Largest tier — config parsing across multiple ecosystems is genuinely complex.
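The "light intra-method dataflow" can stay deliberately small. A sketch of single-assignment interpolation resolution, the safe scope also suggested in §7; the pre-parsed `expr` form and all names are illustrative (a real implementation would derive the parts from the AST), and the unresolved flag maps naturally onto per-edge confidence:

```python
def resolve_path(assignments, expr):
    """Resolve an interpolated path like $"/api/{x}" against single-assignment
    variable values. expr is a pre-parsed list of ('lit'|'var', text) parts."""
    out, resolved = [], True
    for kind, text in expr:
        if kind == "lit":
            out.append(text)
        elif text in assignments:
            out.append(assignments[text])
        else:
            out.append("{" + text + "}")   # unknown var: keep hole, mark unresolved
            resolved = False
    return "".join(out), resolved

# path = $"/api/{resource}/{id}" where resource = "vouchers" and id is unknown
expr = [("lit", "/api/"), ("var", "resource"), ("lit", "/"), ("var", "id")]
print(resolve_path({"resource": "vouchers"}, expr))
# → ('/api/vouchers/{id}', False)
```

Partially resolved paths can still match route templates; fully literal resolutions would carry the higher confidence score.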
Risk: medium-high. Variable resolution can produce false positives; mitigation is confidence scoring on the emitted edges (high confidence when literal, lower when resolved through 2+ hops).
5.4 Combined coverage estimate
After Tiers 1–3 land (Tier 4 is bonus), realistic recall on cross-service edges in modern strongly-typed codebases:
6. Risks and mitigations
- collisions reported via `cbm_cross_repo_result_t.collisions[]`

7. Open questions for the maintainer
Pattern-registry format preference: YAML, TOML, JSON, or compiled-in C tables with a build-time generator? YAML is most readable but adds a YAML parser to runtime; TOML or JSON minimize parser surface.
Where should IDL files be discovered: walk-the-repo by default, or require explicit
`idl_paths` config? Walk-the-repo gives zero-config UX but may pick up vendored proto files in `node_modules` or `vendor/`. Suggest default-walk + standard exclusion list.

Cross-repo auto-trigger model: store workspace membership in the per-repo artifact, or in a separate workspace-level artifact? Per-repo is simpler but duplicates state; workspace-level is cleaner but adds a new artifact kind.
Confidence scoring: should cross-repo edges carry a
`confidence` property explicitly, or rely on the existing `properties` JSON blob? A first-class confidence field makes downstream consumers' job easier.

Existing pattern table migration: should the 252 patterns in `service_patterns.c` migrate to the YAML registry as part of this work, or stay in C with the registry only handling new patterns? Recommendation: keep C patterns as-is for v1, migrate in a separate cleanup PR after the YAML schema is proven stable.

Tier 4 dataflow scope: how aggressive should intra-method variable resolution be? Single-assignment + string-concat is safe; following data through helper methods gets harder. Suggest single-method scope for v1.
Test-fixture monorepo strategy: ship the multi-language fixtures in the cbm repo, or reference an external
`cbm-test-fixtures` repo to keep the main repo small? The fixtures total ~5MB across 3-4 languages — manageable in-tree.

8. Why this is worth merging upstream
cbm's competitive position vs. Sourcegraph / Backstage / Apollo:
- CROSS_GRPC_CALLS / CROSS_ASYNC_CALLS edges keyed by protocol identifiers.
- Apollo solves this for GraphQL only, via `@key` matching. cbm + this proposal generalizes the same idea to gRPC, OpenAPI, AsyncAPI, and typed-message ecosystems.

Position: cbm becomes the only single-binary, AST+LSP-grade tool that derives a complete service interaction graph automatically from source. That's a defensible product position.
The capability is asked for in every code-graph tool's roadmap (often as "service mesh visualization" or "API surface discovery"). cbm has the structural advantage to ship it first.
9. Appendix — example YAML registry entries
Full registry entries for the ten frameworks Tier 2 should ship with:
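As a sketch of the intended shape, two example entries for the MassTransit producer/consumer pair, using the field names from the schema notes below; all values are assumptions pending the format discussion in §7:

```yaml
# Illustrative subset: a MassTransit producer/consumer pair.
- name: masstransit-publish
  languages: [csharp]
  role: producer
  match:
    method_call: "IPublishEndpoint.Publish"    # also ISendEndpoint.Send / IBus.Publish
  extract_id_from: generic_type_arg
  id_kind: message_fqn
  broker: rabbitmq

- name: masstransit-consumer
  languages: [csharp]
  role: consumer
  match:
    implements_interface: "IConsumer"
  extract_id_from: generic_type_arg
  id_kind: message_fqn
  broker: rabbitmq
```

Because both entries declare `id_kind: message_fqn`, the cross-repo pass can match them without knowing anything MassTransit-specific.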
Schema notes:
- `match` block defines the AST-pattern selector (interface implementation, attribute presence, method-call shape)
- `extract_id_from` names a strategy from a fixed enum (`generic_type_arg`, `first_arg`, `first_arg_type`, `annotation_value`, `attribute_arg`, `receiver_var_topic_name`, etc.)
- `id_kind` declares the namespace of the extracted identifier (so `kafka_topic` from one framework matches `kafka_topic` from another, but never matches `message_fqn`)
- `broker` is optional metadata that flows into the emitted edge