Status (2026-04-26): Tier 1a (consumer-side gRPC IDL Routes + HANDLES binding) is open as #293 against this issue. Tier 1b (producer-side typed-client GRPC_CALLS emission) and Tiers 2–4 will follow as separate PRs sequenced per §5 below.
Protocol-Aware Cross-Repo Intelligence
Status: Design proposal for upstream codebase-memory-mcp
Audience: cbm maintainer + reviewers
Scope: Extends pass_cross_repo from literal-string matching to protocol-aware matching across the four cross-language patterns that account for >95% of inter-service communication in modern codebases.
Compatibility: Strictly additive. No breaking changes to existing tools, edges, or APIs. Builds on PR #281 (rich get_architecture fields).
TL;DR
cbm already has the scaffolding for cross-repo intelligence (pass_cross_repo.c, CROSS_HTTP_CALLS / CROSS_ASYNC_CALLS / CROSS_GRPC_CALLS edge types, named-route matching). The current implementation only fires when a call site has a literal URL or topic string as its first argument. That covers idiomatic Python/Node code well, but misses the dominant pattern in modern strongly-typed stacks (Java/Spring, .NET, Kotlin, Go-with-codegen): typed clients and message handlers where the routing identifier is a generic type parameter, an interface ancestor, an attribute, or a config-resolved name — never a literal string at the call site.
This proposal adds four protocol-aware extraction tiers, each language-generic, behind a YAML-driven service-pattern registry so adding new frameworks is a config edit rather than a C patch. Tier 1 (gRPC .proto matching) is proposed first as a small working PR (~560 LOC including tests; see §4.5) to validate the architecture; Tiers 2–4 follow as separate PRs.
Cross-language framework coverage matrix at the end. Acceptance gating: each tier ships independently, success measured by precision/recall against multi-language test fixtures.
1. Background
1.1 What works in cbm today
After PR #281 lands, get_architecture(aspects=["all"]) returns rich structural data (entry_points, routes, hotspots, layers, boundaries, languages). The per-repo extraction pipeline detects:
- Library identifiers in resolved qualified names (`service_patterns.c:631` — 252 patterns across HTTP, async, gRPC, config, route-registration kinds, covering Python/Node/Go/Java/Rust/PHP/Ruby/C# basics)
- Literal URL / topic strings at call sites (`pass_calls.c:emit_http_async_edge`)
- Route registration via `app.get("/x", ...)` and attribute-routed framework styles (`pass_route_nodes.c`)
- Cross-repo matching when both sides have a literal-route identifier (`pass_cross_repo.c:cbm_cross_repo_match`)
The cross-repo-intelligence mode in index_repository matches __route__<METHOD>__<path> keys across project DBs and emits CROSS_HTTP_CALLS / CROSS_ASYNC_CALLS / CROSS_GRPC_CALLS edges.
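The key-based match can be sketched in a few lines. Illustrative Python only: the `__route__<METHOD>__<path>` key format comes from the text above, while the function names and dict shapes are hypothetical stand-ins for cbm's project-DB records.

```python
# Sketch: cross-repo matching over __route__<METHOD>__<path> keys.
# Key format is taken from the proposal; the data shapes are hypothetical.

def route_key(method, path):
    """Build the canonical route key used for cross-project matching."""
    return f"__route__{method.upper()}__{path}"

def match_cross_http(producer_calls, consumer_routes):
    """Return (caller_qn, handler_qn, key) for every key both sides share.

    producer_calls:  route key -> qualified name of the calling function
    consumer_routes: route key -> qualified name of the handler
    """
    matches = []
    for key, caller in producer_calls.items():
        handler = consumer_routes.get(key)
        if handler is not None:
            matches.append((caller, handler, key))
    return matches

# service-a calls GET /voucher; service-b registers a handler for it
calls = {route_key("GET", "/voucher"): "service_a.client.get_voucher"}
routes = {route_key("GET", "/voucher"): "service_b.api.voucher_handler"}
assert match_cross_http(calls, routes) == [
    ("service_a.client.get_voucher", "service_b.api.voucher_handler",
     "__route__GET__/voucher")
]
```

The emitted tuples correspond to the CROSS_HTTP_CALLS edges; the real pass also emits the reverse direction via `emit_cross_route_bidirectional`.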
1.2 What doesn't work
`emit_http_async_edge` (`pass_calls.c`, line ~232):

```c
bool is_url = (url_or_topic && url_or_topic[0] != '\0' &&
               (url_or_topic[0] == '/' || strstr(url_or_topic, "://") != NULL));
bool is_topic = (url_or_topic && ... && svc == CBM_SVC_ASYNC && ...);
if (!is_url && !is_topic) {
    /* fall back to plain CALLS edge */
    return;
}
```
If the first string argument is not a literal URL or topic, the edge falls through to CALLS — generic, unrouted, untaggable for cross-repo matching. Idiomatic code in major modern frameworks rarely passes a literal URL or topic at the call site:
```
// .NET / MassTransit — no topic string, message type is the identifier
await _publishEndpoint.Publish(new VoucherRedeemed(voucher.Id), ct);

// .NET / generated gRPC client — no URL string, service.method is the identifier
var resp = await _promoCodeClient.GetVoucherAsync(req, cancellationToken: ct);

// Java / Spring Cloud Stream — no topic string, message type is the identifier
streamBridge.send("output", new OrderShipped(order));

// Java / Feign — interface annotation IS the route, no literal at call site
return feignClient.getVoucher(id);

// Kotlin / Retrofit — same shape as Feign
return retrofitApi.getVoucher(id)

// Go with gRPC codegen — generated client method, no string at call site
resp, err := promoClient.GetVoucher(ctx, req)

// Python / FastAPI typed httpx client (oapi-codegen-derived) — same shape
resp = await client.get_voucher(id=id)
```
In each case, the producer-side identifier (message type FQN, gRPC service.method, Feign interface annotation) is statically present and resolvable — but not as a literal string argument. It lives in: a generic type parameter, the constructor type of an argument, a class-level attribute, or a method-level attribute.
The consumer side has the same identifier visible in a different syntactic position: an IConsumer<T> declaration, a *Base implementation, a @StreamListener<T> annotation, an attribute-routed controller method.
The matching problem is solvable. The producer/consumer identifier exists statically on both sides. cbm's current extractor just doesn't extract it.
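As a sketch of why this is tractable, the two most common non-literal producer positions (generic type parameter, constructor argument type) can be recovered with simple pattern matching. Illustrative Python regexes only — cbm's real extractor would work on the tree-sitter AST, not raw text, and the `Publish` method name is just one example shape:

```python
import re

# Sketch: recover the message-type identifier from a producer call site where
# it is a generic type parameter or a constructor argument, never a literal
# string. Regexes are illustrative, not cbm's actual matcher.

GENERIC_ARG = re.compile(r"\.Publish<([\w.]+)>\s*\(")        # Publish<T>(...)
CTOR_ARG = re.compile(r"\.Publish\s*\(\s*new\s+([\w.]+)\s*\(")  # Publish(new T(...))

def extract_message_type(call_text):
    for pattern in (GENERIC_ARG, CTOR_ARG):
        m = pattern.search(call_text)
        if m:
            return m.group(1)
    return None

assert extract_message_type(
    "await _publishEndpoint.Publish<VoucherRedeemed>(msg, ct);") == "VoucherRedeemed"
assert extract_message_type(
    "await _publishEndpoint.Publish(new VoucherRedeemed(voucher.Id), ct);") == "VoucherRedeemed"
assert extract_message_type('logger.Info("done");') is None
```

Either way the result is the same stable identifier, which is what makes producer/consumer matching by FQN possible.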
1.3 Why this matters now
Cross-repo intelligence is one of the most-asked-for capabilities in code-graph tools. Industry tooling that does parts of this:
| Tool | Approach | Limitation |
|---|---|---|
| Backstage (Spotify) | Service catalog from OpenAPI / AsyncAPI / .proto files | Manual catalog maintenance, declarative not derived |
| Sourcegraph | Cross-repo references via SCIP indexes | Per-symbol, doesn't protocol-match (only name-matches) |
| Apollo Studio | Federated GraphQL via `@key` directives | GraphQL only |
| AsyncAPI tooling | Typed async-message matching | AsyncAPI-spec only, requires explicit AsyncAPI files |
| GitHub CodeQL | Cross-repo dataflow for security | Security-focused, heavyweight |
| stack-graphs (GitHub) | Universal name resolution graph | Within-repo only |
cbm is uniquely positioned: single binary, AST + LSP-grade extraction, sub-second incremental indexing, no external service dependencies. The cross-repo capability matters because it's the missing 20% of value that turns "smart code search" into "service-architecture truth source."
2. Cross-language pattern audit
The producer→consumer routing problem decomposes into four tiers. Each tier is generic across major language ecosystems. Concrete framework instances per tier:
Tier 1 — IDL-driven typed stubs (gRPC, GraphQL, OpenAPI, AsyncAPI)
The stable identifier lives in an IDL file shared between producer and consumer repos. Both sides reference generated types derived from the same IDL.
| Ecosystem | Producer pattern | Consumer pattern | Stable identifier |
|---|---|---|---|
| gRPC (Go, Java, Python, Rust, TS, C#, Kotlin, Swift) | `*Client` from .proto codegen | `*Base` / `*Servicer` impl from .proto codegen | `service.method` from .proto |
| GraphQL Federation (any GraphQL stack) | typed query/mutation client | resolver bound to type with `@key` directive | type + key from .graphql |
| OpenAPI (NSwag/openapi-generator/oapi-codegen) | generated typed client per language | controller/handler matching path+method | path + method from openapi.yaml |
| AsyncAPI | generated publisher | generated subscriber | channel + message from asyncapi.yaml |
Detection: parse the IDL file, extract canonical IDs as routes; on producer side find references to generated client types; on consumer side find generated base-class implementations.
Genericity: 100%. gRPC alone covers 8+ languages. .proto/.graphql/.openapi files are language-agnostic by design.
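For concreteness, the canonical-ID derivation Tier 1 performs is pure string construction over IDL facts. A minimal sketch, assuming the `__idl_route__<protocol>__<service>/<method>` QN shape described in §3.2 (the helper name is hypothetical):

```python
# Sketch: canonical, language-agnostic route QN for an IDL-derived route.
# The QN shape follows the proposal's §3.2 example; nothing else is assumed.

def idl_route_qn(protocol, service_fqn, method):
    """Same QN regardless of which language produced or consumes the stub."""
    return f"__idl_route__{protocol}__{service_fqn}/{method}"

qn = idl_route_qn("grpc", "promocode.PromoCodeManagerGrpcService", "GetVoucher")
assert qn == "__idl_route__grpc__promocode.PromoCodeManagerGrpcService/GetVoucher"
```

Because every consuming language derives the same QN from the same .proto, the cross-repo match reduces to exact key equality.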
Tier 2 — Typed message pub/sub (interface-ancestor + generic-type)
The stable identifier is a message type's fully-qualified name. Producer has Publish<T> / Send<T> / equivalent on a known interface; consumer implements IConsumer<T> / @MessageHandler / equivalent.
| Language | Frameworks | Producer shape | Consumer shape |
|---|---|---|---|
| C# | MassTransit, NServiceBus, Wolverine, Brighter, Rebus | `IPublishEndpoint.Publish<T>` / `ISendEndpoint.Send<T>` / `IBus.Publish<T>` | `IConsumer<T>` / `IHandleMessages<T>` |
| Java/Kotlin | Spring Cloud Stream, Axon, Eventuate | `streamBridge.send(...)` / `@CommandHandler` | `@StreamListener<T>` / `@EventHandler` |
| Node/TS | NestJS microservices, Moleculer, EventBus libs | `@MessagePattern<T>` emit | `@EventPattern<T>` handler |
| Python | Faust, Celery (typed), aio-pika typed wrappers | `@app.agent` send | typed handler funcs |
| Go | Watermill, NATS-typed, Wire | typed publish via marshalers | typed subscriber registration |
| Rust | Lapin + serde, async-nats with typed deserialization | typed publish | typed subscribe |
Detection: pattern-match the producer interface (e.g., IPublishEndpoint, streamBridge) with its Publish<T> / Send<T> method, extract T from the generic param or the constructor argument's type. On consumer side, find classes implementing IConsumer<T> / IHandleMessages<T> / classes with @StreamListener<T> on a method, extract T. Match by FQN.
Genericity: highly cross-language. ~6 framework families, identical abstract pattern.
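A minimal sketch of the Tier 2 match, assuming producer-side message FQNs have already been extracted (illustrative Python; real matching would be AST-based, and the `IConsumer<T>` shape is the C#/MassTransit instance of the generic pattern):

```python
import re

# Sketch: Tier 2 matching — producer-side message FQNs matched against
# consumer declarations of the IConsumer<T> family. Regex-based for
# illustration only.

CONSUMER_DECL = re.compile(r"IConsumer<([\w.]+)>")

def consumed_fqns(class_decl):
    """Message-type FQNs a consumer class declares it handles."""
    return set(CONSUMER_DECL.findall(class_decl))

def match_by_message_fqn(producer_fqns, consumer_decls):
    """Return sorted (message_fqn, consumer_class) pairs for every
    producer-side message type that some consumer class handles."""
    out = []
    for cls, decl in consumer_decls.items():
        for fqn in consumed_fqns(decl) & producer_fqns:
            out.append((fqn, cls))
    return sorted(out)

producers = {"Contracts.VoucherRedeemed", "Contracts.OrderShipped"}
consumers = {
    "RedemptionConsumer": "class RedemptionConsumer : IConsumer<Contracts.VoucherRedeemed>",
    "AuditConsumer": "class AuditConsumer : IConsumer<Contracts.AuditEvent>",
}
assert match_by_message_fqn(producers, consumers) == [
    ("Contracts.VoucherRedeemed", "RedemptionConsumer")
]
```

The same intersection works for `IHandleMessages<T>`, `@StreamListener`, etc. — only the consumer-side selector changes per registry entry.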
Tier 3 — Attribute / decorator-driven HTTP routes
Producer is a typed HTTP client whose interface methods carry route attributes; consumer is a controller/handler with matching route attributes. Both attribute values are literal strings — the easiest tier if extracted from the attribute, not the call site.
| Language | Producer | Consumer |
|---|---|---|
| C# | Refit, RestEase (`[Get("/x")]` on interface) | ASP.NET Core (`[HttpGet("/x")]` controller) |
| Java | Feign (`@RequestLine("GET /x")`), Retrofit (`@GET("/x")`) | Spring (`@GetMapping("/x")`), JAX-RS (`@GET @Path("/x")`) |
| Kotlin | Retrofit | Spring, Ktor route DSL |
| TypeScript | tsoa, NestJS HttpService with openapi-derived clients | NestJS (`@Get("/x")`), Hono, Express decorators |
| Python | httpx-codegen, aiohttp wrappers from openapi | FastAPI (`@app.get("/x")`), Litestar |
| Go | huma generated, oapi-codegen clients | huma, chi, gin, echo route registration |
| Rust | utoipa generated | actix-web, axum, rocket route attributes |
Detection: extract HTTP method + path from class-level / method-level attributes on both interfaces (producer) and concrete classes (consumer). Match.
Genericity: most universal — decorator-driven HTTP routing is the modern default in every serious web ecosystem.
Tier 4 — Config-resolved service discovery
The producer's call site has only a relative path or named-client reference; the actual base URL lives in a config file (appsettings.json, application.yaml, env vars, Kubernetes Service DNS, service-registry config). Consumer side uses Tier 3 attribute-driven detection.
| Ecosystem | Producer | Config source |
|---|---|---|
| C# | `IHttpClientFactory.CreateClient("name")` | appsettings*.json, `services.AddHttpClient(...)` |
| Spring | `@FeignClient(name="x", url="${promo.url}")` | application.yaml, env |
| Spring Cloud / Eureka / Consul | service registry lookups | registry config |
| Kubernetes | Service DNS (`http://promocode-service:80/x`) | Service / Ingress YAML |
| Node | env-driven base URLs in axios/fetch wrappers | .env, Helm values |
| Go | viper-loaded named services | YAML / env |
Detection: scan config files for named-service → base-URL mappings; trace CreateClient("name") / @FeignClient("name") to resolved URL; combine with the variable URL path within the calling method to reconstruct the full route.
Genericity: universal microservice pattern.
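The named-client resolution step can be sketched as a dictionary lookup plus URL joining. A minimal illustration — the config shape mimics appsettings.json-style named clients and is entirely hypothetical:

```python
from urllib.parse import urljoin

# Sketch: Tier 4 — resolve a named HTTP client to its configured base URL and
# combine it with the relative path used at the call site. The "HttpClients"/
# "BaseUrl" config shape is a made-up stand-in for real config scanning.

def resolve_route(config, client_name, relative_path):
    base = config.get("HttpClients", {}).get(client_name, {}).get("BaseUrl")
    if base is None:
        return None  # unresolved: fall back to a plain CALLS edge
    return urljoin(base.rstrip("/") + "/", relative_path.lstrip("/"))

config = {"HttpClients": {"promo": {"BaseUrl": "http://promocode-service:80"}}}
assert resolve_route(config, "promo", "/api/voucher") == \
    "http://promocode-service:80/api/voucher"
assert resolve_route(config, "unknown", "/api/voucher") is None
```

The reconstructed full URL then feeds the ordinary Tier 3 route matching on the consumer side; an unresolved name simply degrades to today's behavior.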
Tier 5 — Reflection / runtime-resolved DI (out of scope)
_serviceProvider.GetService(Type.GetType(configString))?.Invoke(...) is genuinely impossible to resolve statically. This tier is named for completeness but explicitly out of scope. Estimated <5% of cross-service calls in practice.
3. Proposed architecture
3.1 Plugin-based service-pattern registry
internal/cbm/service_patterns.c currently hardcodes 252 patterns in a C array. Adding a new framework requires a C patch + recompile. Proposal: externalize the pattern table to a YAML / JSON registry loaded at startup.
Format example (registry-format-1.yaml — actual schema TBD with maintainer):
```yaml
patterns:
  # Tier 2 — typed-message pub/sub
  - id: masstransit-publish
    languages: [csharp]
    kind: ASYNC_CALLS
    producer:
      match:
        type_implements: IPublishEndpoint
        method_pattern: "Publish<T>(...)"
      extract_id_from: generic_type_arg_or_first_arg_type
      id_kind: message_fqn
      broker: rabbitmq
    consumer:
      match:
        class_implements: "IConsumer<T>"
      extract_id_from: generic_type_arg
      id_kind: message_fqn

  - id: spring-cloud-stream-handler
    languages: [java, kotlin]
    kind: ASYNC_CALLS
    producer:
      match:
        method_calls: "streamBridge.send"
        first_arg_kind: string_literal
      extract_id_from: first_arg
      id_kind: channel_name
    consumer:
      match:
        method_annotation: "@StreamListener"
      extract_id_from: annotation_value
      id_kind: channel_name

  - id: refit-client
    languages: [csharp]
    kind: HTTP_CALLS
    producer:
      match:
        interface_method_attribute: "[Get|Post|Put|Delete|Patch]"
      extract_id_from: attribute_arg
      id_kind: http_route
    consumer:
      match:
        method_attribute: "[HttpGet|HttpPost|HttpPut|HttpDelete|HttpPatch]"
      extract_id_from: attribute_arg
      id_kind: http_route
```
Benefits:
- Adding Wolverine, Watermill, or any new framework is one YAML entry, not a code patch + release cycle
- Maintainer review surface drops dramatically (review YAML, not C)
- Community contributions become low-risk (a YAML PR can't crash the binary)
- Multi-language patterns compose naturally (one ID matches both Java and Kotlin via `languages: [java, kotlin]`)
Existing 252 patterns in service_patterns.c can be migrated to YAML in a separate cleanup PR (no behavior change, pure refactor) — out of scope for this proposal but a natural follow-on.
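To make the "low-risk YAML PR" claim concrete, here is a hedged sketch of the validation a registry loader might perform before accepting an entry. Field names follow the example schema above; the final schema is TBD with the maintainer, and the kind set is illustrative:

```python
# Sketch: minimal validation of a registry pattern entry (entries shown as
# already-parsed dicts, so no YAML dependency). Field names mirror the
# example schema above; all of this is provisional.

REQUIRED = {"id", "languages", "kind", "producer", "consumer"}
KNOWN_KINDS = {"HTTP_CALLS", "ASYNC_CALLS", "GRPC_CALLS"}  # illustrative

def validate_pattern(entry):
    """Return a list of problems; an empty list means the entry is accepted."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - entry.keys())]
    if entry.get("kind") not in KNOWN_KINDS:
        problems.append(f"unknown kind: {entry.get('kind')}")
    for side in ("producer", "consumer"):
        spec = entry.get(side)
        if isinstance(spec, dict) and "match" not in spec:
            problems.append(f"{side} has no match block")
    return problems

good = {"id": "x", "languages": ["go"], "kind": "ASYNC_CALLS",
        "producer": {"match": {}}, "consumer": {"match": {}}}
assert validate_pattern(good) == []
assert "unknown kind: CALLS" in validate_pattern({**good, "kind": "CALLS"})
```

A bad entry is rejected at load time with a readable diagnostic, never a crash — which is what keeps community YAML contributions low-risk.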
3.2 Pipeline integration
Three changes to the existing pipeline:

1. New pass: `pass_idl_scan` — runs once per repo before `pass_definitions`. Scans for IDL files (.proto, .graphql, openapi.yaml, asyncapi.yaml) and emits canonical Route nodes derived from them. Each Route gets a stable QN like `__idl_route__grpc__promocode.PromoCodeManagerGrpcService/GetVoucher` regardless of which language consumes it.
2. Extend `pass_calls.c` `emit_classified_edge` — when matching against the new YAML-driven patterns, support extracting identifiers from:
   - Generic type parameters (`Publish<T>`)
   - Constructor argument types (`Publish(new T(...))`)
   - Interface-method attributes (`[Get("/x")]`)
   - Class-level attributes (`@FeignClient(name="x")`)
   - Combined with config-resolved values (Tier 4)
3. Auto-trigger cross-repo pass for workspace siblings — when a repo is part of a workspace (e.g., cross-repo-intelligence mode is invoked once with `target_projects: ["*"]`), persist the workspace membership in the artifact, and on subsequent re-indexes auto-fire cross-repo matching against the same sibling set.
3.3 Cross-repo extension
The existing cbm_cross_repo_match already supports topic-based matching. Two extensions:
1. Add `match_by_message_fqn` — phase D (after HTTP / Async / Channel matching). For each ASYNC_CALLS edge with a `message_fqn` property, find consumer-side `IConsumer<message_fqn>` registrations in target DBs and emit CROSS_ASYNC_CALLS edges.
2. Add `match_by_grpc_method` — phase E. For each gRPC client call with a `service.method` identifier, find consumer-side `*Base` overrides of the same `service.method` and emit CROSS_GRPC_CALLS edges. Reuses the existing CROSS_GRPC_CALLS edge type and emission helper at `pass_cross_repo.c:657`.
Both extensions reuse the existing route-matching scaffolding (emit_cross_route_bidirectional). Pure additive code paths.
4. Tier 1 detailed spec — gRPC .proto matching
Proposed as the first PR to validate the architecture. Smallest scope, highest universality (8+ languages), zero framework variance (.proto syntax is standardized by Google).
4.1 Producer-side extraction
Detect references to generated gRPC client types. The detection signal is the type name pattern, not call-site strings:
- C#: classes/interfaces ending in `Client` derived from `Grpc.Core.ClientBase<T>` (generated by Grpc.Tools)
- Go: structs with `*grpc.ClientConn` field + methods matching .proto service methods
- Python: classes from `*_pb2_grpc.py` ending in `Stub`
- Java/Kotlin: classes ending in `*Grpc.*Stub` (generated by protoc-gen-grpc-java)
- TypeScript: classes from `*_pb_grpc.d.ts` with the right shape
- Rust: tonic-generated `*Client` structs
For each method call on a generated client type:
- Resolve the client type to its underlying `service.method` pair (recoverable from .proto — see §4.3)
- Emit a CALLS edge with new properties: `{rpc_kind: "grpc", service: "promocode.PromoCodeManagerGrpcService", method: "GetVoucher"}`
4.2 Consumer-side extraction
Detect classes implementing the generated gRPC server-base type:
- C#: `: PromoCodeManagerGrpcServiceBase`
- Go: structs with method receivers matching the unimplemented server interface
- Python: classes inheriting `*Servicer`
- Java: `extends *ImplBase`
- Rust: `impl *Server for ...`
For each override of a service method, emit a Route node with QN __idl_route__grpc__<service>/<method> and a HANDLES edge from the implementing class.
4.3 IDL parsing
pass_idl_scan reads .proto files (anywhere in the repo by default; configurable) and builds the canonical service.method → package.Service.method mapping. Tree-sitter has a maintained tree-sitter-proto grammar. ~150 LOC for the parser + AST walk + node emission.
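The pass_idl_scan idea in miniature: pull `service`/`rpc` pairs out of a .proto source and build the canonical IDs. This regex sketch is illustrative only — the real pass would walk the tree-sitter-proto AST, which also handles nested braces and options that a regex does not:

```python
import re

# Sketch: extract canonical package.Service/Method IDs from a .proto file.
# Regex-based for brevity; the real pass uses the tree-sitter grammar.

PACKAGE = re.compile(r"^\s*package\s+([\w.]+)\s*;", re.M)
SERVICE = re.compile(r"service\s+(\w+)\s*\{([^}]*)\}", re.S)  # no nested braces
RPC = re.compile(r"rpc\s+(\w+)\s*\(")

def scan_proto(text):
    """Return canonical package.Service/Method IDs found in a .proto file."""
    pkg = PACKAGE.search(text)
    prefix = pkg.group(1) + "." if pkg else ""
    ids = []
    for svc, body in SERVICE.findall(text):
        ids.extend(f"{prefix}{svc}/{rpc}" for rpc in RPC.findall(body))
    return ids

proto = """
syntax = "proto3";
package promocode;

service PromoCodeManagerGrpcService {
  rpc GetVoucher (GetVoucherRequest) returns (VoucherReply);
}
"""
assert scan_proto(proto) == ["promocode.PromoCodeManagerGrpcService/GetVoucher"]
```

Each returned ID becomes the `service.method` half of an `__idl_route__grpc__...` Route node QN.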
4.4 Cross-repo matching
Add match_grpc_calls as Phase E in cbm_cross_repo_match:
```c
/* For each producer-side CALLS edge with rpc_kind=grpc:
 * 1. Look up service+method in target project's IDL-derived Routes
 * 2. If found, find the HANDLES edge from a *Base class
 * 3. Emit CROSS_GRPC_CALLS bidirectional edge
 */
```
Reuses emit_cross_route_bidirectional and the existing CROSS_GRPC_CALLS edge type. ~100 LOC.
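The shape of Phase E can be sketched as a lookup over the sibling project's IDL-derived routes. Illustrative Python with simplified stand-ins for cbm's edge and node records (the dict/tuple shapes are hypothetical):

```python
# Sketch: Phase E — match producer-side gRPC call edges against consumer-side
# IDL-derived routes in a sibling project DB. Data shapes are simplified
# stand-ins for cbm's edge/node records.

def match_grpc_calls(producer_edges, target_routes):
    """producer_edges: [{"caller": qn, "service": s, "method": m}, ...]
    target_routes: "service/method" -> handler class QN (via its HANDLES edge)
    Returns (caller, handler) pairs to emit as CROSS_GRPC_CALLS."""
    out = []
    for edge in producer_edges:
        key = f"{edge['service']}/{edge['method']}"
        handler = target_routes.get(key)
        if handler:
            out.append((edge["caller"], handler))
    return out

edges = [{"caller": "svc_a.PromoClient.GetVoucher",
          "service": "promocode.PromoCodeManagerGrpcService",
          "method": "GetVoucher"}]
routes = {"promocode.PromoCodeManagerGrpcService/GetVoucher":
          "svc_b.PromoCodeManagerService"}
assert match_grpc_calls(edges, routes) == [
    ("svc_a.PromoClient.GetVoucher", "svc_b.PromoCodeManagerService")
]
```

Because .proto service/method names are globally qualified by package, this lookup has no naming ambiguity, which is why the precision target below is 100%.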
4.5 Estimated diff size
| Component | Files | LOC |
|---|---|---|
| pass_idl_scan.c (new) | 1 | ~150 |
| service_patterns.c (gRPC client/server type recognizers) | 1 | ~50 |
| pass_calls.c extension for typed-client RPC properties | 1 | ~60 |
| pass_cross_repo.c Phase E | 1 | ~100 |
| Tests + fixtures (multi-language) | several | ~200 |
| Total | | ~560 LOC |
Larger than my earlier 300-LOC estimate — that didn't include tests. Production-grade with tests is ~560.
4.6 Test fixtures (multi-language)
Three tiny service repos plus a shared contracts directory, shipped under testdata/cross-repo/grpc/:

- `service-a-csharp/` — minimal .NET project consuming service-b's gRPC client
- `service-b-go/` — minimal Go gRPC server implementing the .proto from contracts/
- `service-c-python/` — minimal Python consumer of service-b's gRPC service
- `contracts/` — single .proto file shared by all three
Test asserts: after indexing all four (contracts/ first, then services), cbm_cross_repo_match emits CROSS_GRPC_CALLS edges from a-csharp and c-python to b-go's handler classes, with correct service+method properties.
4.7 Success criteria
| Metric | Target |
|---|---|
| Precision on test fixtures | 100% (deterministic — gRPC has no naming ambiguity) |
| Recall on test fixtures | 100% — all known cross-service calls detected |
| Index-time overhead (repos with .proto files) | <5% additional time |
| Index-time overhead (repos without .proto files) | 0% (pass_idl_scan is a no-op) |
| Memory overhead | proportional to .proto count, ~1KB per service definition |
| Backwards compatibility | All existing tests pass; existing CROSS_* edges unchanged |
5. Roadmap — Tiers 2–4
Each tier is a separate PR after Tier 1 lands. Sequence chosen by descending universality and ascending implementation complexity.
5.1 Tier 2 — typed message pub/sub (after Tier 1)
Scope: introduce the YAML-driven service-pattern registry; ship initial registry covering MassTransit (C#), Spring Cloud Stream (Java/Kotlin), and NestJS (TS) as proof of multi-language genericity. Add pass_message_synthesis that emits ASYNC_CALLS edges keyed by message_fqn instead of requiring a topic literal. Extend pass_cross_repo Phase D to match by message_fqn.
Estimated LOC: ~800 (registry loader, YAML schema, three framework definitions, new pass, cross-repo extension, tests).
Risk: brittleness on framework-version drift (MassTransit v8 vs v7 have slightly different interface shapes). Mitigation: registry entries can be version-tagged; pattern matching tolerates shape variance.
5.2 Tier 3 — attribute-driven routes (after Tier 2)
Scope: extend pass_route_nodes.c to extract routes from interface-method attributes (Refit / Retrofit / Feign) on the producer side. Match against existing controller-side attribute extraction. Most attribute-driven controller patterns are already detected by cbm — this tier closes the producer-side gap.
Estimated LOC: ~400.
Risk: low. Attribute syntax is declarative and stable across framework versions.
5.3 Tier 4 — config-resolved service discovery (after Tier 3)
Scope: extend pass_envscan to also parse appsettings.json, application.yaml, helm values, kustomize overlays. Build named-client → base-URL maps. Add light intra-method dataflow to resolve path = $"/api/{x}" patterns. Combine with named-client resolution to reconstruct full URLs.
Estimated LOC: ~1200. Largest tier — config parsing across multiple ecosystems is genuinely complex.
Risk: medium-high. Variable resolution can produce false positives; mitigation is confidence scoring on the emitted edges (high confidence when literal, lower when resolved through 2+ hops).
5.4 Combined coverage estimate
After Tiers 1–3 land (Tier 4 is bonus), realistic recall on cross-service edges in modern strongly-typed codebases:
| Code style | Estimated recall |
|---|---|
| Go + gRPC + literal HTTP URLs | 95%+ (Tier 1 alone covers most) |
| Java/Spring + Feign + Cloud Stream | 90%+ (Tiers 1+2+3) |
| .NET / CQRS + MediatR + MassTransit + gRPC | 90%+ (Tiers 1+2; HttpClient gap = Tier 4) |
| TypeScript / NestJS + microservices | 85%+ (Tiers 1+2+3) |
| Python / FastAPI + Celery + httpx-codegen | 85%+ |
| Plain Python/Node with literal URLs (today's recall) | unchanged, still works |
6. Risks and mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| Tree-sitter pattern brittleness across language versions | Medium | YAML registry allows per-version patterns; tests cover N-1 and N versions of each framework |
| YAML registry becomes a maintenance burden | Medium | Limit official registry to top 10 frameworks per language; community contributions land via PR review with required test fixtures |
| False-positive cross-repo edges from name collisions | Low | Confidence scoring on each edge; collisions reported in `cbm_cross_repo_result_t.collisions[]` |
| Increased index time | Low | New passes are conditional (no .proto files = no IDL pass); benchmarks on every PR |
| Variable URL resolution (Tier 4) produces wrong routes | Medium | Confidence scoring; only emit cross-repo edge if resolved confidence > 0.7; consumer-side validation catches bad matches |
| Reflection / runtime-resolved DI is impossible | High but acknowledged | Explicitly out of scope (Tier 5); document as known limitation |
| Maintainer-burden objection | Medium | Plugin registry shifts most additions to YAML; core C surface area kept minimal |
| Patch size scares reviewers | High for big-bang, Low for tier-by-tier | Submit Tier 1 first as standalone PR; subsequent tiers reference Tier 1's architecture |
7. Open questions for the maintainer
- Pattern-registry format preference: YAML, TOML, JSON, or compiled-in C tables with a build-time generator? YAML is most readable but adds a YAML parser to runtime; TOML or JSON minimize parser surface.
- IDL file discovery: walk-the-repo by default, or require explicit idl_paths config? Walk-the-repo has zero-config UX but may pick up vendored proto files in node_modules or vendor/. Suggest default-walk + standard exclusion list.
- Cross-repo auto-trigger model: store workspace membership in the per-repo artifact, or in a separate workspace-level artifact? Per-repo is simpler but duplicates state; workspace-level is cleaner but adds a new artifact kind.
- Confidence scoring: should cross-repo edges carry a `confidence` property explicitly, or rely on the existing properties JSON blob? A first-class confidence field makes downstream consumers' job easier.
- Existing pattern table migration: should the 252 patterns in service_patterns.c migrate to the YAML registry as part of this work, or stay in C with the registry only handling new patterns? Recommendation: keep C patterns as-is for v1, migrate in a separate cleanup PR after the YAML schema is proven stable.
- Tier 4 dataflow scope: how aggressive should intra-method variable resolution be? Single-assignment + string-concat is safe; following data through helper methods gets harder. Suggest single-method scope for v1.
- Test-fixture monorepo strategy: ship the multi-language fixtures in the cbm repo, or reference an external cbm-test-fixtures repo to keep the main repo small? The fixtures total ~5MB across 3-4 languages — manageable in-tree.
8. Why this is worth merging upstream
cbm's competitive position vs. Sourcegraph / Backstage / Apollo:
- Sourcegraph does cross-repo references, but per-symbol, not protocol-aware. cbm + Tiers 1–3 would be the only AST-based tool emitting structured `CROSS_GRPC_CALLS` / `CROSS_ASYNC_CALLS` edges keyed by protocol identifiers.
- Backstage builds a service graph from declarative IDL files but requires manual catalog upkeep. cbm + this proposal derives the service graph automatically from the same IDL files plus the consuming code.
- Apollo Studio does federated GraphQL via `@key` matching. cbm + this proposal generalizes the same idea to gRPC, OpenAPI, AsyncAPI, and typed-message ecosystems.
Position: cbm becomes the only single-binary, AST+LSP-grade tool that derives a complete service interaction graph automatically from source. That's a defensible product position.
The capability is asked for in every code-graph tool's roadmap (often as "service mesh visualization" or "API surface discovery"). cbm has the structural advantage to ship it first.
9. Appendix — example YAML registry entries
Full registry entries for the ten frameworks Tier 2 should ship with:
```yaml
patterns:
  # ── C# / .NET ──────────────────────────────────────────────────
  - id: masstransit-publish
    languages: [csharp]
    kind: ASYNC_CALLS
    producer:
      match: { type_implements: "IPublishEndpoint", method: "Publish" }
      extract_id_from: generic_arg_or_first_arg_type
      id_kind: message_fqn
      broker: rabbitmq
    consumer:
      match: { class_implements: "IConsumer<T>" }
      extract_id_from: generic_type_arg
      id_kind: message_fqn

  - id: masstransit-send
    languages: [csharp]
    kind: ASYNC_CALLS
    producer:
      match: { type_implements: "ISendEndpoint", method: "Send" }
      extract_id_from: generic_arg_or_first_arg_type
      id_kind: message_fqn
      broker: rabbitmq
    consumer: { same_as: masstransit-publish.consumer }

  - id: nservicebus-publish
    languages: [csharp]
    kind: ASYNC_CALLS
    producer:
      match: { type_implements: "IMessageSession", method: "Publish" }
      extract_id_from: first_arg_type
      id_kind: message_fqn
    consumer:
      match: { class_implements: "IHandleMessages<T>" }
      extract_id_from: generic_type_arg
      id_kind: message_fqn

  # ── Java / Kotlin / Spring ─────────────────────────────────────
  - id: spring-cloud-stream-bridge
    languages: [java, kotlin]
    kind: ASYNC_CALLS
    producer:
      match: { type_implements: "StreamBridge", method: "send" }
      extract_id_from: first_arg
      id_kind: channel_name
    consumer:
      match: { method_annotation: "@StreamListener" }
      extract_id_from: annotation_value
      id_kind: channel_name

  - id: axon-command
    languages: [java, kotlin]
    kind: ASYNC_CALLS
    producer:
      match: { type_implements: "CommandGateway", method: "send" }
      extract_id_from: first_arg_type
      id_kind: message_fqn
    consumer:
      match: { method_annotation: "@CommandHandler" }
      extract_id_from: parameter_type
      id_kind: message_fqn

  # ── Node / TypeScript ──────────────────────────────────────────
  - id: nestjs-message-pattern
    languages: [typescript]
    kind: ASYNC_CALLS
    producer:
      match: { method_annotation: "@MessagePattern" }
      extract_id_from: annotation_value
      id_kind: message_pattern
    consumer:
      match: { method_annotation: "@MessagePattern" }
      extract_id_from: annotation_value
      id_kind: message_pattern

  - id: nestjs-event-pattern
    languages: [typescript]
    kind: ASYNC_CALLS
    producer:
      match: { method: "emit", type_implements: "ClientProxy" }
      extract_id_from: first_arg
      id_kind: event_pattern
    consumer:
      match: { method_annotation: "@EventPattern" }
      extract_id_from: annotation_value
      id_kind: event_pattern

  # ── Python ──────────────────────────────────────────────────────
  - id: faust-agent
    languages: [python]
    kind: ASYNC_CALLS
    producer:
      match: { method_call: "topic.send" }
      extract_id_from: receiver_var_topic_name
      id_kind: kafka_topic
    consumer:
      match: { decorator: "@app.agent" }
      extract_id_from: decorator_arg
      id_kind: kafka_topic

  # ── Go ──────────────────────────────────────────────────────────
  - id: watermill-publish
    languages: [go]
    kind: ASYNC_CALLS
    producer:
      match: { type_implements: "message.Publisher", method: "Publish" }
      extract_id_from: first_arg
      id_kind: topic_name
    consumer:
      match: { type_implements: "message.Subscriber", method: "Subscribe" }
      extract_id_from: first_arg
      id_kind: topic_name

  # ── Rust ───────────────────────────────────────────────────────
  - id: async-nats-publish
    languages: [rust]
    kind: ASYNC_CALLS
    producer:
      match: { method: "publish", type_implements: "Client" }
      extract_id_from: first_arg
      id_kind: nats_subject
    consumer:
      match: { method: "subscribe", type_implements: "Client" }
      extract_id_from: first_arg
      id_kind: nats_subject
```
Schema notes:
- `match` block defines the AST-pattern selector (interface implementation, attribute presence, method-call shape)
- `extract_id_from` names a strategy from a fixed enum (`generic_type_arg`, `first_arg`, `first_arg_type`, `annotation_value`, `attribute_arg`, `receiver_var_topic_name`, etc.)
- `id_kind` declares the namespace of the extracted identifier (so `kafka_topic` from one framework matches `kafka_topic` from another, but never matches `message_fqn`)
- `broker` is optional metadata that flows into the emitted edge
Protocol-Aware Cross-Repo Intelligence
Status: Design proposal for upstream
codebase-memory-mcpAudience: cbm maintainer + reviewers
Scope: Extends
pass_cross_repofrom literal-string matching to protocol-aware matching across the four cross-language patterns that account for >95% of inter-service communication in modern codebases.Compatibility: Strictly additive. No breaking changes to existing tools, edges, or APIs. Builds on PR #281 (rich
get_architecturefields).TL;DR
cbm already has the scaffolding for cross-repo intelligence (
pass_cross_repo.c,CROSS_HTTP_CALLS/CROSS_ASYNC_CALLS/CROSS_GRPC_CALLSedge types, named-route matching). The current implementation only fires when a call site has a literal URL or topic string as its first argument. That covers idiomatic Python/Node code well, but misses the dominant pattern in modern strongly-typed stacks (Java/Spring, .NET, Kotlin, Go-with-codegen): typed clients and message handlers where the routing identifier is a generic type parameter, an interface ancestor, an attribute, or a config-resolved name — never a literal string at the call site.This proposal adds four protocol-aware extraction tiers, each language-generic, behind a YAML-driven service-pattern registry so adding new frameworks is a config edit rather than a C patch. Tier 1 (gRPC
.protomatching) is proposed first as a ~300-LOC working PR to validate the architecture; Tiers 2–4 follow as separate PRs.Cross-language framework coverage matrix at the end. Acceptance gating: each tier ships independently, success measured by precision/recall against multi-language test fixtures.
1. Background
1.1 What works in cbm today
After PR #281 lands,
get_architecture(aspects=["all"])returns rich structural data (entry_points, routes, hotspots, layers, boundaries, languages). The per-repo extraction pipeline detects:service_patterns.c:631— 252 patterns across HTTP, async, gRPC, config, route-registration kinds, covering Python/Node/Go/Java/Rust/PHP/Ruby/C# basics)pass_calls.c:emit_http_async_edge)app.get("/x", ...)and attribute-routed framework styles (pass_route_nodes.c)pass_cross_repo.c:cbm_cross_repo_match)The
cross-repo-intelligencemode inindex_repositorymatches__route__<METHOD>__<path>keys across project DBs and emits CROSS_HTTP_CALLS / CROSS_ASYNC_CALLS / CROSS_GRPC_CALLS edges.1.2 What doesn't work
`emit_http_async_edge` (`pass_calls.c`, line ~232): if the first string argument is not a literal URL or topic, the edge falls through to `CALLS` — generic, unrouted, untaggable for cross-repo matching. Idiomatic code in major modern frameworks rarely passes a literal URL or topic at the call site; in each of these frameworks, the producer-side identifier (message type FQN, gRPC service.method, Feign interface annotation) is statically present and resolvable — but not as a literal string argument. It lives in: a generic type parameter, the constructor type of an argument, a class-level attribute, or a method-level attribute.
The consumer side has the same identifier visible in a different syntactic position: an `IConsumer<T>` declaration, a `*Base` implementation, a `@StreamListener<T>` annotation, an attribute-routed controller method.

The matching problem is solvable. The producer/consumer identifier exists statically on both sides. cbm's current extractor just doesn't extract it.
1.3 Why this matters now
Cross-repo intelligence is one of the most-asked-for capabilities in code-graph tools. Industry tooling that does parts of this:
- Apollo Federation: GraphQL subgraph composition via `@key` directives

cbm is uniquely positioned: single binary, AST + LSP-grade extraction, sub-second incremental indexing, no external service dependencies. The cross-repo capability matters because it's the missing 20% of value that turns "smart code search" into "service-architecture truth source."
2. Cross-language pattern audit
The producer→consumer routing problem decomposes into four tiers. Each tier is generic across major language ecosystems. Concrete framework instances per tier:
Tier 1 — IDL-driven typed stubs (gRPC, GraphQL, OpenAPI, AsyncAPI)
The stable identifier lives in an IDL file shared between producer and consumer repos. Both sides reference generated types derived from the same IDL.
- `*Client` from .proto codegen
- `*Base` / `*Servicer` impl from .proto codegen
- `service.method` from .proto
- `@key` directive

Detection: parse the IDL file, extract canonical IDs as routes; on producer side find references to generated client types; on consumer side find generated base-class implementations.
Genericity: 100%. gRPC alone covers 8+ languages. .proto/.graphql/.openapi files are language-agnostic by design.
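To make the detection step concrete: the proposal uses the `tree-sitter-proto` grammar (§4.3), but the mapping from .proto service declarations to canonical route IDs can be sketched with a regex scan. The `__idl_route__grpc__<service>/<method>` QN shape follows §3.2; the function names here are illustrative, not cbm's API:

```python
import re

# Illustrative sketch only: the real pass would use tree-sitter-proto, not
# regexes (this version ignores nested blocks, comments, and streams).
PACKAGE_RE = re.compile(r"package\s+([\w.]+)\s*;")
SERVICE_RE = re.compile(r"service\s+(\w+)\s*\{([^}]*)\}", re.S)
RPC_RE = re.compile(r"rpc\s+(\w+)\s*\(")

def idl_routes(proto_text):
    """Extract canonical gRPC route QNs from a .proto source string."""
    pkg = PACKAGE_RE.search(proto_text)
    prefix = pkg.group(1) + "." if pkg else ""
    routes = []
    for svc, body in SERVICE_RE.findall(proto_text):
        for method in RPC_RE.findall(body):
            routes.append(f"__idl_route__grpc__{prefix}{svc}/{method}")
    return routes

proto = """
syntax = "proto3";
package promocode;
service PromoCodeManagerGrpcService {
  rpc GetVoucher (VoucherRequest) returns (VoucherReply);
}
"""
print(idl_routes(proto))
# → ['__idl_route__grpc__promocode.PromoCodeManagerGrpcService/GetVoucher']
```

Because the QN is derived from the IDL rather than from any one language's codegen output, producer and consumer repos in different languages land on the same key.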
Tier 2 — Typed message pub/sub (interface-ancestor + generic-type)
The stable identifier is a message type's fully-qualified name. Producer has
`Publish<T>` / `Send<T>` / equivalent on a known interface; consumer implements `IConsumer<T>` / `@MessageHandler` / equivalent.

- `IPublishEndpoint.Publish<T>` / `ISendEndpoint.Send<T>` / `IBus.Publish<T>` → `IConsumer<T>` / `IHandleMessages<T>`
- `streamBridge.send(...)` / `@CommandHandler` → `@StreamListener<T>` / `@EventHandler`
- `@MessagePattern<T>` emit → `@EventPattern<T>` handler
- `@app.agent` / `send`

Detection: pattern-match the producer interface (e.g., `IPublishEndpoint`, `streamBridge`) with its `Publish<T>` / `Send<T>` method, extract `T` from the generic param or the constructor argument's type. On the consumer side, find classes implementing `IConsumer<T>` / `IHandleMessages<T>`, or classes with `@StreamListener<T>` on a method, and extract `T`. Match by FQN.

Genericity: highly cross-language. ~6 framework families, identical abstract pattern.
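The matching step itself reduces to a join on the extracted FQN. A minimal sketch; the data shapes (`message_fqn` key, dict-of-lists registrations) and example names are illustrative, not cbm's internal structures:

```python
def match_by_message_fqn(producer_edges, consumer_registrations):
    """Join producer-side ASYNC_CALLS edges to consumer-side registrations
    by message FQN, yielding CROSS_ASYNC_CALLS-style (caller, handler, fqn)."""
    matches = []
    for edge in producer_edges:
        for handler in consumer_registrations.get(edge["message_fqn"], []):
            matches.append((edge["caller"], handler, edge["message_fqn"]))
    return matches

# Producer repo: Publish<OrderCreated>, FQN extracted from the generic type arg.
producers = [{"caller": "Billing.SubmitOrder", "message_fqn": "Contracts.OrderCreated"}]
# Consumer repo: class OrderCreatedConsumer : IConsumer<OrderCreated>.
consumers = {"Contracts.OrderCreated": ["Shipping.OrderCreatedConsumer"]}
print(match_by_message_fqn(producers, consumers))
# → [('Billing.SubmitOrder', 'Shipping.OrderCreatedConsumer', 'Contracts.OrderCreated')]
```

The `match_by_message_fqn` name follows the phase-D extension proposed in §3.3; everything else here is a sketch under those assumptions.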
Tier 3 — Attribute / decorator-driven HTTP routes
Producer is a typed HTTP client whose interface methods carry route attributes; consumer is a controller/handler with matching route attributes. Both attribute values are literal strings — the easiest tier if extracted from the attribute, not the call site.
[Get("/x")]on interface)[HttpGet("/x")]controller)@RequestLine("GET /x")), Retrofit (@GET("/x"))@GetMapping("/x")), JAX-RS (@GET @Path("/x"))@Get("/x")), Hono, Express decorators@app.get("/x")), LitestarDetection: extract HTTP method + path from class-level / method-level attributes on both interfaces (producer) and concrete classes (consumer). Match.
Genericity: most universal — decorator-driven HTTP routing is the modern default in every serious web ecosystem.
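Since both sides reduce to literal (method, path) pairs, matching is a key intersection. A sketch using the existing `__route__<METHOD>__<path>` key format from §1.1; attribute parsing is elided and all example names are illustrative:

```python
def route_key(http_method, path):
    """Normalize an attribute-derived route into the __route__ key format."""
    return f"__route__{http_method.upper()}__{path}"

# Producer side: Refit-style interface attribute [Get("/x")].
producer_routes = {route_key("GET", "/x"): "IPromoApi.GetX"}
# Consumer side: ASP.NET-style controller attribute [HttpGet("/x")].
consumer_routes = {route_key("get", "/x"): "PromoController.GetX"}

# Matching across repos is an intersection of the two route tables.
shared = producer_routes.keys() & consumer_routes.keys()
edges = [(producer_routes[k], consumer_routes[k]) for k in sorted(shared)]
print(edges)
# → [('IPromoApi.GetX', 'PromoController.GetX')]
```

Normalizing the method casing in one place is what lets `[Get("/x")]` and `[HttpGet("/x")]` from different repos land on the same key.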
Tier 4 — Config-resolved service discovery
The producer's call site has only a relative path or named-client reference; the actual base URL lives in a config file (
`appsettings.json`, `application.yaml`, env vars, Kubernetes Service DNS, service-registry config). Consumer side uses Tier 3 attribute-driven detection.

- `IHttpClientFactory.CreateClient("name")`: resolved via `appsettings*.json`, `services.AddHttpClient(...)`
- `@FeignClient(name="x", url="${promo.url}")`: resolved via `application.yaml`, env
- Kubernetes Service DNS (`http://promocode-service:80/x`): resolved via `.env`, Helm values

Detection: scan config files for named-service → base-URL mappings; trace `CreateClient("name")` / `@FeignClient("name")` to the resolved URL; combine with the variable URL path within the calling method to reconstruct the full route.

Genericity: universal microservice pattern.
Tier 5 — Reflection / runtime-resolved DI (out of scope)
`_serviceProvider.GetService(Type.GetType(configString))?.Invoke(...)` is genuinely impossible to resolve statically. This tier is named for completeness but explicitly out of scope. Estimated <5% of cross-service calls in practice.

3. Proposed architecture
3.1 Plugin-based service-pattern registry
`internal/cbm/service_patterns.c` currently hardcodes 252 patterns in a C array. Adding a new framework requires a C patch + recompile. Proposal: externalize the pattern table to a YAML / JSON registry loaded at startup. Format example (`registry-format-1.yaml` — actual schema TBD with maintainer):
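One possible entry shape, borrowing the field names from the §9 schema notes (`match`, `extract_id_from`, `id_kind`, `broker`); the `name`, `kind`, and `languages` keys and all values are illustrative, not a settled schema:

```yaml
# registry-format-1.yaml (illustrative strawman only)
- name: masstransit-publish          # hypothetical entry id
  languages: [csharp]
  kind: async_producer
  match:
    method_call: "IPublishEndpoint.Publish"
  extract_id_from: generic_type_arg  # strategy enum, see §9 schema notes
  id_kind: message_fqn
  broker: rabbitmq                   # optional metadata on the emitted edge
```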
Benefits:
- entries can be scoped per language (`languages: [java, kotlin]`)
Existing 252 patterns in `service_patterns.c` can be migrated to YAML in a separate cleanup PR (no behavior change, pure refactor) — out of scope for this proposal but a natural follow-on.

3.2 Pipeline integration
Three changes to the existing pipeline:
New pass:
`pass_idl_scan` — runs once per repo before `pass_definitions`. Scans for IDL files (.proto, .graphql, openapi.yaml, asyncapi.yaml) and emits canonical Route nodes derived from them. Each Route gets a stable QN like `__idl_route__grpc__promocode.PromoCodeManagerGrpcService/GetVoucher` regardless of which language consumes it.

Extend `pass_calls.c` `emit_classified_edge` — when matching against the new YAML-driven patterns, support extracting identifiers from:

- generic type parameters (`Publish<T>`)
- constructor argument types (`Publish(new T(...))`)
- method-level attributes (`[Get("/x")]`)
- class-level annotations (`@FeignClient(name="x")`)

Auto-trigger cross-repo pass for workspace siblings — when a repo is part of a workspace (e.g., `cross-repo-intelligence` mode is invoked once with `target_projects: ["*"]`), persist the workspace membership in the artifact, and on subsequent re-indexes auto-fire cross-repo matching against the same sibling set.

3.3 Cross-repo extension
The existing
`cbm_cross_repo_match` already supports topic-based matching. Two extensions:

Add `match_by_message_fqn` — phase D (after HTTP / Async / Channel matching). For each ASYNC_CALLS edge with a `message_fqn` property, find consumer-side `IConsumer<message_fqn>` registrations in target DBs and emit CROSS_ASYNC_CALLS edges.

Add `match_by_grpc_method` — phase E. For each gRPC client call with a `service.method` identifier, find consumer-side `*Base` overrides of the same `service.method` and emit CROSS_GRPC_CALLS edges. Reuses the existing CROSS_GRPC_CALLS edge type and emission helper at `pass_cross_repo.c:657`.

Both extensions reuse the existing route-matching scaffolding (`emit_cross_route_bidirectional`). Pure additive code paths.

4. Tier 1 detailed spec — gRPC
`.proto` matching

Proposed as the first PR to validate the architecture. Smallest scope, highest universality (8+ languages), zero framework variance (.proto syntax is standardized by Google).
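Before the per-side details, the end-to-end flow can be sketched: producer-side GRPC_CALLS edges carrying service/method properties (§4.1) joined to consumer-side `__idl_route__grpc__` routes (§4.2) yield CROSS_GRPC_CALLS. The `match_grpc_calls` name follows §4.4; the dict shapes are illustrative, not the C implementation:

```python
def match_grpc_calls(producer_edges, consumer_routes):
    """Phase E sketch: join producer GRPC_CALLS edges to consumer-side
    __idl_route__grpc__ routes by service+method."""
    cross = []
    for edge in producer_edges:
        qn = f"__idl_route__grpc__{edge['service']}/{edge['method']}"
        for handler in consumer_routes.get(qn, []):
            cross.append({"type": "CROSS_GRPC_CALLS",
                          "from": edge["caller"], "to": handler,
                          "service": edge["service"], "method": edge["method"]})
    return cross

# Producer repo (C#): a call on a generated *Client type.
producers = [{"caller": "VoucherClientWrapper.Get",
              "service": "promocode.PromoCodeManagerGrpcService",
              "method": "GetVoucher"}]
# Consumer repo (Go): HANDLES edges onto the IDL-derived route node.
consumers = {"__idl_route__grpc__promocode.PromoCodeManagerGrpcService/GetVoucher":
             ["PromoCodeGrpcHandler.GetVoucher"]}
print(match_grpc_calls(producers, consumers)[0]["to"])
# → PromoCodeGrpcHandler.GetVoucher
```

Because both sides key off the IDL-derived QN rather than language-specific codegen names, the join is language-agnostic by construction.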
4.1 Producer-side extraction
Detect references to generated gRPC client types. The detection signal is the type name pattern, not call-site strings:
- C#: `*Client` derived from `Grpc.Core.ClientBase<T>` (generated by `Grpc.Tools`)
- Go: structs with a `*grpc.ClientConn` field + methods matching .proto service methods
- Python: classes in `*_pb2_grpc.py` ending in `Stub`
- Java: `*Grpc.*Stub` (generated by `protoc-gen-grpc-java`)
- TypeScript: generated `*_pb_grpc.d.ts` types with the right shape
- Rust: generated `*Client` structs

For each method call on a generated client type:

- extract the `service.method` pair (recoverable from .proto — see §4.3)
- emit a GRPC_CALLS edge with properties `{rpc_kind: "grpc", service: "promocode.PromoCodeManagerGrpcService", method: "GetVoucher"}`

4.2 Consumer-side extraction
Detect classes implementing the generated gRPC server-base type:
- C#: `: PromoCodeManagerGrpcServiceBase`
- Python: `*Servicer`
- Java: `extends *ImplBase`
- Rust: `impl *Server for ...`

For each override of a service method, emit a Route node with QN `__idl_route__grpc__<service>/<method>` and a HANDLES edge from the implementing class.

4.3 IDL parsing
`pass_idl_scan` reads `.proto` files (anywhere in the repo by default; configurable) and builds the canonical `service.method` → `package.Service.method` mapping. Tree-sitter has a maintained `tree-sitter-proto` grammar. ~150 LOC for the parser + AST walk + node emission.

4.4 Cross-repo matching
Add `match_grpc_calls` as Phase E in `cbm_cross_repo_match`. Reuses `emit_cross_route_bidirectional` and the existing CROSS_GRPC_CALLS edge type. ~100 LOC.

4.5 Estimated diff size
- `pass_idl_scan.c` (new)
- `service_patterns.c` (gRPC client/server type recognizers)
- `pass_calls.c` extension for typed-client RPC properties
- `pass_cross_repo.c` Phase E

Larger than my earlier 300-LOC estimate — that didn't include tests. Production-grade with tests is ~560.
4.6 Test fixtures (multi-language)
Three tiny repos shipped under
`testdata/cross-repo/grpc/`:

- `service-a-csharp/` — minimal .NET project consuming `service-b`'s gRPC client
- `service-b-go/` — minimal Go gRPC server implementing the .proto from `contracts/`
- `service-c-python/` — minimal Python consumer of `service-b`'s gRPC service
- `contracts/` — single .proto file shared by all three

Test asserts: after indexing all four (`contracts/` first, then services), `cbm_cross_repo_match` emits CROSS_GRPC_CALLS edges from a-csharp and c-python to b-go's handler classes, with correct service+method properties.

4.7 Success criteria
5. Roadmap — Tiers 2–4
Each tier is a separate PR after Tier 1 lands. Sequence chosen by descending universality and ascending implementation complexity.
5.1 Tier 2 — typed message pub/sub (after Tier 1)
Scope: introduce the YAML-driven service-pattern registry; ship initial registry covering MassTransit (C#), Spring Cloud Stream (Java/Kotlin), and NestJS (TS) as proof of multi-language genericity. Add
`pass_message_synthesis` that emits ASYNC_CALLS edges keyed by `message_fqn` instead of requiring a topic literal. Extend `pass_cross_repo` Phase D to match by `message_fqn`.

Estimated LOC: ~800 (registry loader, YAML schema, three framework definitions, new pass, cross-repo extension, tests).
Risk: brittleness on framework-version drift (MassTransit v8 vs v7 have slightly different interface shapes). Mitigation: registry entries can be version-tagged; pattern matching tolerates shape variance.
5.2 Tier 3 — attribute-driven routes (after Tier 2)
Scope: extend
`pass_route_nodes.c` to extract routes from interface-method attributes (Refit / Retrofit / Feign) on the producer side. Match against existing controller-side attribute extraction. Most attribute-driven controller patterns are already detected by cbm — this tier closes the producer-side gap.

Estimated LOC: ~400.
Risk: low. Attribute syntax is declarative and stable across framework versions.
5.3 Tier 4 — config-resolved service discovery (after Tier 3)
Scope: extend
`pass_envscan` to also parse appsettings.json, application.yaml, Helm values, kustomize overlays. Build named-client → base-URL maps. Add light intra-method dataflow to resolve `path = $"/api/{x}"` patterns. Combine with named-client resolution to reconstruct full URLs.

Estimated LOC: ~1200. Largest tier — config parsing across multiple ecosystems is genuinely complex.
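The "light intra-method dataflow" can stay deliberately small. A sketch of single-assignment interpolation resolution, the safe scope also suggested in §7; the pre-parsed `expr` form and all names are illustrative (a real implementation would derive the parts from the AST), and the unresolved flag maps naturally onto per-edge confidence:

```python
def resolve_path(assignments, expr):
    """Resolve an interpolated path like $"/api/{x}" against single-assignment
    variable values. expr is a pre-parsed list of ('lit'|'var', text) parts."""
    out, resolved = [], True
    for kind, text in expr:
        if kind == "lit":
            out.append(text)
        elif text in assignments:
            out.append(assignments[text])
        else:
            out.append("{" + text + "}")   # unknown var: keep hole, mark unresolved
            resolved = False
    return "".join(out), resolved

# path = $"/api/{resource}/{id}" where resource = "vouchers" and id is unknown
expr = [("lit", "/api/"), ("var", "resource"), ("lit", "/"), ("var", "id")]
print(resolve_path({"resource": "vouchers"}, expr))
# → ('/api/vouchers/{id}', False)
```

Partially resolved paths can still match route templates; fully literal resolutions would carry the higher confidence score.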
Risk: medium-high. Variable resolution can produce false positives; mitigation is confidence scoring on the emitted edges (high confidence when literal, lower when resolved through 2+ hops).
5.4 Combined coverage estimate
After Tiers 1–3 land (Tier 4 is bonus), realistic recall on cross-service edges in modern strongly-typed codebases:
6. Risks and mitigations
- collisions reported via `cbm_cross_repo_result_t.collisions[]`

7. Open questions for the maintainer
Pattern-registry format preference: YAML, TOML, JSON, or compiled-in C tables with a build-time generator? YAML is most readable but adds a YAML parser to runtime; TOML or JSON minimize parser surface.
Where should IDL files be discovered: walk-the-repo by default, or require explicit
`idl_paths` config? Walk-the-repo gives zero-config UX but may pick up vendored proto files in `node_modules` or `vendor/`. Suggest default-walk + standard exclusion list.

Cross-repo auto-trigger model: store workspace membership in the per-repo artifact, or in a separate workspace-level artifact? Per-repo is simpler but duplicates state; workspace-level is cleaner but adds a new artifact kind.
Confidence scoring: should cross-repo edges carry a
`confidence` property explicitly, or rely on the existing `properties` JSON blob? A first-class confidence field makes downstream consumers' job easier.

Existing pattern table migration: should the 252 patterns in `service_patterns.c` migrate to the YAML registry as part of this work, or stay in C with the registry only handling new patterns? Recommendation: keep C patterns as-is for v1, migrate in a separate cleanup PR after the YAML schema is proven stable.

Tier 4 dataflow scope: how aggressive should intra-method variable resolution be? Single-assignment + string-concat is safe; following data through helper methods gets harder. Suggest single-method scope for v1.
Test-fixture monorepo strategy: ship the multi-language fixtures in the cbm repo, or reference an external
`cbm-test-fixtures` repo to keep the main repo small? The fixtures total ~5MB across 3-4 languages — manageable in-tree.

8. Why this is worth merging upstream
cbm's competitive position vs. Sourcegraph / Backstage / Apollo:
- CROSS_GRPC_CALLS / CROSS_ASYNC_CALLS edges keyed by protocol identifiers.
- Apollo solves this for GraphQL only, via `@key` matching. cbm + this proposal generalizes the same idea to gRPC, OpenAPI, AsyncAPI, and typed-message ecosystems.

Position: cbm becomes the only single-binary, AST+LSP-grade tool that derives a complete service interaction graph automatically from source. That's a defensible product position.
The capability is asked for in every code-graph tool's roadmap (often as "service mesh visualization" or "API surface discovery"). cbm has the structural advantage to ship it first.
9. Appendix — example YAML registry entries
Full registry entries for the ten frameworks Tier 2 should ship with:
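As a sketch of the intended shape, two example entries for the MassTransit producer/consumer pair, using the field names from the schema notes below; all values are assumptions pending the format discussion in §7:

```yaml
# Illustrative subset: a MassTransit producer/consumer pair.
- name: masstransit-publish
  languages: [csharp]
  role: producer
  match:
    method_call: "IPublishEndpoint.Publish"    # also ISendEndpoint.Send / IBus.Publish
  extract_id_from: generic_type_arg
  id_kind: message_fqn
  broker: rabbitmq

- name: masstransit-consumer
  languages: [csharp]
  role: consumer
  match:
    implements_interface: "IConsumer"
  extract_id_from: generic_type_arg
  id_kind: message_fqn
  broker: rabbitmq
```

Because both entries declare `id_kind: message_fqn`, the cross-repo pass can match them without knowing anything MassTransit-specific.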
Schema notes:
- `match` block defines the AST-pattern selector (interface implementation, attribute presence, method-call shape)
- `extract_id_from` names a strategy from a fixed enum (`generic_type_arg`, `first_arg`, `first_arg_type`, `annotation_value`, `attribute_arg`, `receiver_var_topic_name`, etc.)
- `id_kind` declares the namespace of the extracted identifier (so `kafka_topic` from one framework matches `kafka_topic` from another, but never matches `message_fqn`)
- `broker` is optional metadata that flows into the emitted edge