Skip to content

feat(multicluster): URL-path routing across N ClickHouse clusters (#132)#134

Open
BorisTyshkevich wants to merge 12 commits into
mainfrom
feature-multicluster
Open

feat(multicluster): URL-path routing across N ClickHouse clusters (#132)#134
BorisTyshkevich wants to merge 12 commits into
mainfrom
feature-multicluster

Conversation

@BorisTyshkevich
Copy link
Copy Markdown
Collaborator

@BorisTyshkevich BorisTyshkevich commented May 27, 2026

closes #132

One MCP process now serves N clusters via /mcp/{cluster}, with only the ClickHouseConfig.Host template differing per cluster ({cluster} placeholder substituted at request time). All other CH settings (port, TLS, mode, regexes, limits) are shared.

Architecture (per the v1 readiness decisions):

  • Cache key is sha256(bearer) + "\x00" + cluster — token-bound, not claim-bound, so a forged claim set cannot reuse a victim's catalog.
  • Catalog cache with positive + negative entries; negatives memoise CH/ sidecar auth-class rejections only (401/403, CH codes 497/516/519).
  • 70% fill threshold triggers random eviction of denied entries; positives are never displaced.
  • Discovery concurrency is hard-capped at 16 (no config knob in v1); saturation falls back to static-only tools with a warn log + counter.
  • Shared OAuth audience across clusters; per-cluster audience is v2.
  • Per-cluster PRM routes registered for claude.ai / ChatGPT auto-discovery.
  • OpenAPI in MC mode is rejected at config load.
  • multicluster.* is restart-only; reload logs a warn on any field delta.

BorisTyshkevich and others added 12 commits May 27, 2026 11:52
One MCP process now serves N clusters via /mcp/{cluster}, with only the
ClickHouseConfig.Host template differing per cluster ({cluster} placeholder
substituted at request time). All other CH settings (port, TLS, mode,
regexes, limits) are shared.

Architecture (per the v1 readiness decisions):
- Cache key is sha256(bearer) + "\x00" + cluster — token-bound, not
  claim-bound, so a forged claim set cannot reuse a victim's catalog.
- Catalog cache with positive + negative entries; negatives memoise CH/
  sidecar auth-class rejections only (401/403, CH codes 497/516/519).
- 70% fill threshold triggers random eviction of denied entries; positives
  are never displaced.
- Discovery concurrency is hard-capped at 16 (no config knob in v1);
  saturation falls back to static-only tools with a warn log + counter.
- Shared OAuth audience across clusters; per-cluster audience is v2.
- Per-cluster PRM routes registered for claude.ai / ChatGPT auto-discovery.
- OpenAPI in MC mode is rejected at config load.
- multicluster.* is restart-only; reload logs a warn on any field delta.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rving host

The per-cluster 401 challenge built the resource_metadata URL from
resourceBaseURL (i.e. public_resource_url). When public_resource_url is a
shared/foreign audience hosted elsewhere — the intended shape for reusing one
cluster's CH-side ch-jwt-verify sidecar audience across an MC deployment — the
challenge advertised a metadata URL on a host that doesn't serve it, so
claude.ai's RFC 9728 discovery dead-ended with mcp_registration_failed before
any login screen.

The RFC 9728 metadata document is served by THIS process at
/mcp/{cluster}/.well-known/oauth-protected-resource on the request host; the
`resource` field inside it is the logical resource identifier (which may be a
foreign audience). Decouple the two: build the resource_metadata fetch URL from
the serving host, and mirror the public https scheme onto it (schemeAndHost
under-reports http behind the TLS-terminating edge proxy because MC mode forbids
OpenAPI, which is otel's usual https knob). When public_resource_url equals the
serving host (the canonical single-audience MC deployment) the URL is
byte-identical to before.

Found via #134 e2e against multi-mcp.demo.altinity.cloud serving /mcp/otel
against otel's existing sidecar.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Basic)

## What changed

Added a new top-level `oauth.broker: true` flag that replaces the previous
`mode: forward` / `mode: gating + broker_upstream: true` dual-config pattern.
The old fields still work (with a deprecation log warning) so existing helm
values continue to function without changes.

### New config shape (canonical going forward)

```yaml
oauth:
  broker: true                          # new — replaces mode + broker_upstream
  issuer: https://mcp.example.com
  signing_secret: "..."
  audience: https://mcp.example.com
  client_id: "..."                       # upstream IdP client
  client_secret: "..."
  auth_url: https://accounts.google.com/o/oauth2/v2/auth
  token_url: https://oauth2.googleapis.com/token
  # mode: and broker_upstream: no longer needed
```

### CH auth auto-detection (broker:true only)

On the **first** authenticated request to a given `host:port` endpoint,
altinity-mcp probes ClickHouse to discover whether it accepts:

- **Bearer** — `Authorization: Bearer <token>` forwarded directly (Antalya's
  `token_processors` validates the JWT).
- **Basic** — `Authorization: Basic base64(email:JWT)` where email is
  unverified-decoded from the JWT payload; the colocated `ch-jwt-verify`
  sidecar validates signature/iss/aud/exp/scope and the email-vs-user match.

The probe works by attempting `NewClient` (which pings CH internally) with
Bearer first. If CH returns an auth error (HTTP 401/403 or CH exception
codes 497/516/519), it falls back to Basic. Success caches the method in
`ClickHouseJWEServer.chOAuthMethodCache` (`sync.Map` keyed by `host:port`).
Subsequent requests skip the probe entirely. The cache is naturally cleared
on config reload because reload creates a fresh server instance.

### Legacy mode paths (unchanged behaviour)

- `mode: forward` → Bearer, seeded deterministically into cache, no probe.
- `mode: gating` (no broker) → Basic, seeded deterministically, no probe.
- `mode: gating + broker_upstream: true` → Basic, same as above.

### Key files

| File | Change |
|---|---|
| `go-mcp-oauth-sdk/oauth/config.go` | Added `Broker bool` field + deprecation notes on `Mode`/`BrokerUpstream` |
| `go.mod` | SDK bumped to `v0.1.1-0.next` + local `replace` directive |
| `cmd/altinity-mcp/oauth_server.go` | `oauthBrokerMode()` now returns true for `cfg.Broker` |
| `cmd/altinity-mcp/main.go` | Deprecation warning + `broker:true` validation block + gating guard skip |
| `pkg/server/server.go` | `chOAuthMethodCache sync.Map` field on `ClickHouseJWEServer` |
| `pkg/server/server_client.go` | Rewrote OAuth path: mode-switch → `newClientWithOAuth` + probe logic |
| `cmd/altinity-mcp/main_test.go` | 7 new `broker:true` validation test cases |
| `CLAUDE.md` | Updated OAuth section to document new canonical config |

## E2E test instructions for agent

Run the `test-mcp-connector` skill against **both** the `otel` and `antalya`
deployments after bumping their image tags to a build from this branch.

### otel deployment (currently mode:gating + broker_upstream:true → migrated)

Expected new config in `$MCP_DEPLOY_DIR/otel/mcp-values.yaml`:
```yaml
oauth:
  broker: true
  # remove: mode: gating
  # remove: broker_upstream: true
```

Expected behaviour:
1. On first request, Bearer is probed → CH rejects (ch-jwt-verify sidecar
   needs Basic) → Basic is tried → succeeds → method cached as Basic.
2. `whoami` tool returns authenticated user's email.
3. `execute_query` (e.g. `SELECT 1`) returns a result without error.
4. Startup logs: deprecation warning should NOT appear (using broker:true).

### antalya deployment (currently mode:forward → migrated)

Expected new config in `$MCP_DEPLOY_DIR/antalya/mcp-values.yaml`:
```yaml
oauth:
  broker: true
  # remove: mode: forward
```

Expected behaviour:
1. On first request, Bearer is probed → CH accepts (Antalya token_processors) →
   method cached as Bearer.
2. `whoami` returns authenticated user's email.
3. `execute_query` returns results without error.

### Regression test: old config still works (do not change helm values)

After image bump but **without** changing helm values:
- `otel` with `mode: gating + broker_upstream: true` → startup log contains
  deprecation warning; `whoami` and `execute_query` still work.
- `antalya` with `mode: forward` → startup log contains deprecation warning;
  `whoami` and `execute_query` still work.

### Pass criteria (per deployment)

- `whoami` returns a non-empty email matching the authenticated Google account.
- `execute_query` with `SELECT 1` returns `[[1]]` or equivalent without error.
- No panic or unhandled error in pod logs (`kubectl logs`).
- For migrated configs: no deprecation warning in startup logs.
- For old configs: deprecation warning present but no functional regression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ode framing

## What changed

Removed the "two modes" (gating vs forward) conceptual model from all
user-facing documentation and replaced it with the canonical `broker: true`
flag that was implemented in the previous commit.

### docs/oauth_authorization.md (full rewrite, ~930→~580 lines)

- Lead with `broker: true` as the single config knob.
- Replaced "Forward mode" and "Gating mode" sections with "Bearer
  (token_processors)" and "Basic (ch-jwt-verify sidecar)" as CH-side
  implementation details — both now auto-detected at runtime.
- Merged the "Choosing a mode" decision tree; operators no longer need to
  choose.
- Updated all provider-specific config examples (Keycloak, Azure AD,
  Google, Cognito) to use `broker: true` instead of `mode: forward`.
- Added a "Deprecated fields" table at the bottom documenting the old
  `mode:` + `broker_upstream:` fields and their `broker: true` equivalent.
- Trimmed or removed sections that were mode-specific and no longer apply.

### docs/oauth_troubleshooting.md (updated)

- Removed "forward mode" framing throughout.
- Updated "Static ClickHouse credentials" section to reference `broker: true`.
- Updated HTTP 403 Bearer-not-supported entry to explain auto-fallback.

### README.md

- Features line: "Forward Bearer tokens" → "Broker OAuth flows; auto-detect
  Bearer vs Basic".
- Env var example: `MCP_OAUTH_MODE=gating` → `MCP_OAUTH_BROKER=true`.

### cmd/altinity-mcp/main.go

- Condensed the long comment block over the deprecated gating-mode
  validation branch.
- Shortened the `broker_upstream=true` log.Warn message to avoid repeating
  the old experimental deployment notes.

### cmd/altinity-mcp/oauth_server.go

- Removed "(forward and gating+broker_upstream alike)" from the
  `brokerModeIDTokenRefreshThresholdSeconds` comment.

### pkg/server/server_client.go

- Renamed "oauth gating:" log prefix to "oauth:" — the function is used
  for both CH auth paths and the old prefix would mislead operators.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… access token

In broker mode the bearer returned to the client (and forwarded to ClickHouse)
was always the upstream id_token, whose aud is the client_id. A CH-side
ch-jwt-verify sidecar validates aud=<resource> (e.g. https://otel-mcp…/), so the
id_token can never satisfy it — broker-against-Auth0 with a sidecar was unusable
(claude.ai reached login but every query would 516, and the foreign-audience
shape also tripped resource-origin checks).

Auth0 (and other RFC-8707-via-`audience` IdPs) mint a JWT *access* token with
aud=<API> only when the /authorize request carries an `audience` param. So:
  - brokerUpstreamAudience() returns oauth.audience when broker mode is active
    and the issuer is NOT Google; "" otherwise.
  - /authorize attaches audience=<that> to the upstream redirect.
  - /token forwards the access_token (aud=<API>, carries email via the API's
    Action) instead of the id_token when an audience was requested.

Google is explicitly excluded (IsGoogleIssuer): it ignores `audience`, never
issues arbitrary-aud JWT access tokens, and its CH token_processors path
validates the id_token directly — so antalya/github keep the id_token+Bearer
flow unchanged. Unit test covers Auth0/Google × broker/forward/gating.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r fix)

Pulls the SDK fix that emits Authorization: Bearer for broker mode (not just
forward mode), so broker:true CH-auth auto-detection can actually succeed on
the Bearer/token_processor path (Google IdP). Without it, broker deployments
silently fell back to Basic and Google-IdP queries failed with CH code 516.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…only switch

- Bump go-mcp-oauth-sdk to bdefa85.
- Remove OAuth mode / broker_upstream from config, runtime validation, tests,
  docs, and the Helm multicluster example. Broker behaviour is now solely
  server.oauth.broker: true|false.
- ClickHouse auth is auto-detected (Bearer first, Basic on auth rejection) in
  all cases — no mode-seeded shortcuts.
- Delete the obsolete docs/proposal_gated_user_mapping.md.
- Keep removed-key detection for server.oauth.mode / server.oauth.broker_upstream
  so existing values get a migration warning at load.

Reviewed, full test suite green, builds clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…HEME 'BASIC'

ClickHouse rejects `IDENTIFIED WITH http_authenticator SERVER '...'` with
SYNTAX_ERROR; the correct grammar is `IDENTIFIED WITH http SERVER '...' SCHEME
'BASIC'`. Fix the wrong form in CLAUDE.md (sidecar user provisioning) and the
mcp-multicluster.yaml example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… limit

The chart default set clickhouse.limit: 1000, which logs a deprecation warning
at startup (limit is a deprecated alias for the PR #125 server-side cap). Switch
the default to max_result_rows: 1000.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ant file

docs/tool_annotations.md duplicated the "MCP safety hints" topic already in
tools.md. Fold its still-useful bits into tools.md § MCP safety hints — the
per-hint definitions, openWorldHint (was missing from the table; default false
since MCP only touches ClickHouse, overridable via COMMENT annotations), the
"only meaningful when readOnlyHint=false" relationship, and the MCP
ToolAnnotations schema reference — then delete tool_annotations.md (unreferenced).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
broker:true/false is already covered above; the trailing removed-fields note
duplicated it. Config linter (RemovedConfigKeys) is the source of truth for
rejected legacy keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Proposal: Multi-cluster routing via URL path

1 participant