oauth: CH-side JWT verifier sidecar; drop cluster_secret + ClaimsToHeaders #128
Merged
Refactor gating mode so MCP is no longer a trust anchor for ClickHouse-side
identity. MCP becomes a pure forwarder: it unverified-decodes the JWT's email
claim, rewrites the inbound bearer to `Authorization: Basic base64(email:JWT)`,
and forwards to ClickHouse over HTTP. ClickHouse's <http_authentication>
delegates to a new colocated sidecar (cmd/ch-jwt-verify/) that validates the
JWT signature against the upstream JWKS, enforces RFC 8707 aud/exp/scope, and
applies identity policy (verified-email, domain allow-lists, user-vs-email
match). Forward mode (Antalya token_processors) is unchanged.
Net effect: a compromised MCP pod can no longer impersonate users to CH.
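As a rough illustration of the gating-mode wire format described above, the following sketch builds the `Authorization: Basic base64(email:JWT)` value on the MCP side and splits it again the way a verifier would. The function names `basicFromEmailJWT` and `parseBasic` are hypothetical and are not taken from this PR; the real code paths live in pkg/server/server_client.go and the sidecar.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// basicFromEmailJWT builds the header value "Basic base64(email:JWT)".
// Hypothetical helper; the email is taken from the JWT without verification.
func basicFromEmailJWT(email, jwt string) string {
	raw := email + ":" + jwt
	return "Basic " + base64.StdEncoding.EncodeToString([]byte(raw))
}

// parseBasic is the receiving side: it decodes the header and splits on the
// first colon, so the user part is the email and the password part is the JWT.
func parseBasic(header string) (string, string, error) {
	const prefix = "Basic "
	if !strings.HasPrefix(header, prefix) {
		return "", "", errors.New("not a Basic authorization header")
	}
	raw, err := base64.StdEncoding.DecodeString(strings.TrimPrefix(header, prefix))
	if err != nil {
		return "", "", fmt.Errorf("decode basic credentials: %w", err)
	}
	user, token, ok := strings.Cut(string(raw), ":")
	if !ok {
		return "", "", errors.New("missing ':' separator")
	}
	return user, token, nil
}

func main() {
	header := basicFromEmailJWT("alice@example.com", "aaa.bbb.ccc")
	user, token, err := parseBasic(header)
	fmt.Println(header)
	fmt.Println(user, token, err)
}
```

Because the username half is unverified, nothing in this exchange is trusted until the verifier has checked the JWT itself and compared the user against the verified claim.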
MCP-side removals (~1751 LOC):
- cluster_secret + cluster_name impersonation path
- ClaimsToHeaders claim->header mapping
- ClickHouseHeaderName custom-header option
- MCP-side identity policy (AllowedEmailDomains/AllowedHostedDomains/AllowUnverifiedEmail)
- Per-request MCP-side JWT validation (sidecar is the gate)
- validateClusterSecretConfig startup check
- pkg/oauth/identity.go, pkg/clickhouse/cluster_secret_test.go,
pkg/server/oauth_gating_embedded_test.go, docs/oauth_next_refactor.md
Added (~489 LOC + new packages):
- cmd/ch-jwt-verify/ sidecar: main + verify + settings + config + tests.
/verify endpoint parses Basic, validates JWT (shared pkg/oauth/jwt.go +
jwks.go), enforces identity policy, maps scopes -> CH session settings.
Cache keyed by SHA256(JWT) with positive/negative TTLs.
- pkg/server/server_client.go wire-format switch: forward->Bearer,
gating->Basic email:JWT, forces HTTP protocol under gating-mode OAuth.
- pkg/oauth/policy.go retains EmailDomain/ContainsDomain/HasRequiredScopes
for sidecar reuse.
- helm/ch-jwt-verify/ chart: ConfigMap (sidecar YAML config) +
ConfigMap (CH http_authentication_servers XML drop-in) + reusable
container fragment template. Deploys as colocated container in the CH
StatefulSet pod (loopback trust model).
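The "maps scopes -> CH session settings" step in the `/verify` item above could look roughly like the sketch below. The rule table, the scope names, and the field layout of the response struct are assumptions made for illustration; only the `{"settings": …}` response envelope comes from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// verifyResponse mirrors the documented {"settings": ...} envelope.
// The field type is an assumption for this sketch.
type verifyResponse struct {
	Settings map[string]string `json:"settings,omitempty"`
}

// settingsFromScopes merges the settings of every rule whose scope appears in
// the space-separated scope claim. Later scopes in the claim win on conflict.
func settingsFromScopes(scopeClaim string, rules map[string]map[string]string) map[string]string {
	out := map[string]string{}
	for _, scope := range strings.Fields(scopeClaim) {
		for name, value := range rules[scope] {
			out[name] = value
		}
	}
	return out
}

func main() {
	// Hypothetical rule table; real deployments would load this from config.
	rules := map[string]map[string]string{
		"ch:readonly": {"readonly": "1"},
		"ch:limited":  {"max_execution_time": "30"},
	}
	resp := verifyResponse{Settings: settingsFromScopes("openid ch:readonly ch:limited", rules)}
	body, err := json.Marshal(resp)
	fmt.Println(string(body), err)
}
```

When no rule matches, the map is empty and `omitempty` drops the `settings` key, which matches the "optional" wording of the contract.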
Tests:
- All previous OAuth tests adapted to the new contract (no per-request
MCP validation; gating mode requires a JWT with an email claim).
- New cmd/ch-jwt-verify/verify_test.go covers signature, aud/exp/nbf,
email-verified, scope, lowercase-equal user-vs-email, scope->settings,
negative cache, missing-auth.
Docs:
- docs/oauth_authorization.md rewritten gating-mode section, trust-
boundary argument, identity-policy section, mode-comparison table.
- CLAUDE.md OAuth-modes section updated.
- docs/oauth_next_refactor.md deleted (obsoleted).
Deployment migration (per cluster):
- Drop cluster_secret/cluster_name from helm values, set protocol: http
- Deploy ch-jwt-verify sidecar in CH pod (helm/ch-jwt-verify/)
- Drop config.d http_authentication_servers XML (rendered by chart)
- Re-CREATE OAuth users: IDENTIFIED WITH http_authenticator SERVER 'ch_jwt_verify'
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror the altinity-mcp image build for the sidecar binary. Same shape as
scripts/build-mcp-image.sh: cross-compile a static Go binary per arch, run a
legacy `docker build` with DOCKER_BUILDKIT=0 (the sandbox docker proxy blocks
the privileged container that buildkit needs), then use `docker manifest` for
the multi-arch tag.
Usage:
ARCHES=arm64 scripts/build-ch-jwt-verify-image.sh sidecar
-> ghcr.io/altinity/ch-jwt-verify:sidecar-<sha>-arm64
The Dockerfile is intentionally minimal (alpine + ca-certificates + the static
binary at /bin/ch-jwt-verify; ENTRYPOINT runs the binary directly). There is no
entrypoint script: the binary takes --config and handles signals natively.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New `docs/ch-jwt-verify.md` is the canonical reference for the
ClickHouse-side JWT verifier sidecar: rationale (trust-boundary
argument), wire contract (Basic-auth headers, JSON settings response),
full config schema with every knob explained, deployment topology
(colocated vs standalone trade-offs), the 127.0.0.1-vs-0.0.0.0 binding
gotcha, caching behavior, Helm chart usage, troubleshooting catalogue.
`docs/oauth_authorization.md` rewritten to:
- Drop the change-log narrative ("Updated 2026-05-15", "post-#109",
"feature/sidecar refactor").
- Strip private identifiers (specific cluster names, internal
hostnames, Auth0 tenant/resource/client IDs, deploy-dir paths,
specific email domains).
- Lead with the current, sidecar-based architecture instead of
documenting the now-deleted cluster_secret path.
- Tighten the CIMD/forward and gating sections; link out to
ch-jwt-verify.md for the sidecar spec.
- Keep all generic provider runbooks (Keycloak, Azure AD, Google,
AWS Cognito), the proxy/nginx guidance, and the
ClickHouse token_processors examples.
The Auth0-tenant setup checklist (resource-server creation, third-party
client grants, post-login Action) is operational knowledge that belongs
in private operator docs, not in the public spec. Removed accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. listen.tcp 127.0.0.1:9999 -> 0.0.0.0:9999. kubelet probes target the
pod IP (eth0) not loopback inside the netns; a 127.0.0.1 bind silently
fails the readiness probe and CrashLoops the container.
2. SQL example in README and configmap.yaml comment now says
`IDENTIFIED WITH http SERVER '...' SCHEME 'BASIC'` instead of the
non-existent `http_authenticator` grammar token. CH rejects
http_authenticator with SYNTAX_ERROR.
3. _helpers.tpl readiness probe was unconditional, but the chart
supports either tcp or unix listen modes. Probe now guarded by
`if .Values.listen.tcp` so unix-socket deployments don't render an
invalid probe targeting port 0. Also added a livenessProbe to match
the readiness behavior.
Also documents the clickhouse-operator quirk in README: the operator
auto-injects volumeMount{name: default, mountPath: /var/lib/clickhouse}
on every container in the podTemplate, but the actual data PVC is named
after the volumeClaimTemplate (e.g. default-1-1), so the injected mount
references a missing volume and the pod fails admission. Workaround:
declare an emptyDir volume named "default" in the podTemplate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
POST-only /verify: ClickHouse 24.x+ POSTs to http_authentication servers;
allowing GET created a divergent code path with no upstream consumer and
risked credentials in proxy URLs / access logs. Now 405 + Allow: POST on any
other method.
Bounded cache (cacheMaxEntries = 10000):
- Insertion-time eviction in storeCache: drops expired entries first (cheap
and correct under typical churn), then drops the closest-to-expiry entry if
still at cap. O(n) walk under the mutex, trivial at cache-cap scale.
- Optional background reaper (StartReaper) walks every 5 min and prunes
expired entries. Wired into main via signal-derived context so it exits on
shutdown.
Carry Email through cache hits so log-friendliness is consistent across the
cache miss / hit boundary. Previously cache hits dropped Email.
Documented:
- RequireEmailVerified semantics: intentionally only fires when email claim is
present (sub-based deployments don't need email at all).
- Single sync.Mutex is fine for a per-CH-pod sidecar serving loopback traffic;
contention is bounded.
Tests:
- TestVerifierRejectsNonPOST (GET/PUT/DELETE/PATCH all 405)
- TestCacheCapEvicts (storeCache respects cacheCap)
- TestPruneExpired (background reaper drops past-TTL entries)
- TestCacheHitPreservesEmail (cache hit returns same Email as miss)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator-ergonomics: when a values file carries a key this codebase no longer
honors (cluster_secret, cluster_name, claims_to_headers,
clickhouse_header_name, allowed_email_domains, allowed_hosted_domains,
allow_unverified_email), YAML unmarshal silently drops it and the operator
sees no warning that their override is now a no-op. Re-parse the raw bytes
into a generic map after the typed unmarshal and emit a WARN line per detected
key, naming the replacement.
emailFromUnverifiedJWT unit tests in pkg/server/server_client_test.go
(table-driven via a fakeJWT helper):
- standard email claim
- namespaced */email fallback
- standard takes precedence over namespaced
- empty/whitespace email falls back to namespaced
- no email claim at all
- non-string email
- two-segment / five-segment (JWE) / empty rejected
- malformed base64 / malformed JSON payload
- padded base64URL fallback (some IdPs emit padding)
- leading/trailing whitespace trimmed
Drop the explicit `(*OAuthClaims)(nil)` set on the request context;
ClaimsFromContext returns nil for both unset and typed-nil values, so the
assignment was redundant. Comment in oauth_server.go documents the
no-nil-deref reasoning.
Comment in oauth_regression_test.go points security reviewers at the
cmd/ch-jwt-verify/ tests for the moved JWT-validation coverage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
helm/ch-jwt-verify/values.yaml: SQL example used the wrong CH grammar token
(http_authenticator) that's documented elsewhere as rejected. Replaced with
the correct `IDENTIFIED WITH http SERVER '...' SCHEME 'BASIC'` shape. Caught by
external review; the README and configmap template were already fixed in
4457913 but values.yaml was missed.
cmd/ch-jwt-verify/verify.go:
- Cache key uses the full SHA256 digest (64 hex chars) instead of an 8-byte
prefix. The prefix was vulnerable to birthday-bound collisions that could DoS
a legit user via the negative cache (no access-grant risk, since signatures
are still re-verified on miss, but a real reliability bug at long-lived-pod
scale). Memory cost at 10k entries: ~640 KiB extra, immaterial.
- Negative-cache entry stores the original error (not err.Error()) so
errors.Is(err, oauth.ErrEmailNotVerified) still works after a cache hit.
Sentinels keep identity; metrics / status-code distinguishing paths added
later don't need to round-trip through string parsing.
- Comment on the json.Encode `_ =` documenting that verifyResponse is simple
types so Encode can't realistically fail.
pkg/oauth/forward.go: dropped the unused *Claims parameter from
BuildClickHouseHeaders. The gating-mode wire format no longer derives headers
from claims, and the unused-but-shape-stable signature was a small code smell.
Updated all call sites.
pkg/config/config.go: deprecated-key warnings now flow through the caller's
structured logger instead of stderr.
- LoadConfigFromFile populates Config.RemovedKeyWarnings []string (yaml/json
`-` tagged so it round-trips as zero).
- Renamed the package-level slice to RemovedConfigKeys (exported) and the
helper to removedKeyWarnings to match Go naming conventions.
- cmd/altinity-mcp/main.go iterates over the warnings on both initial load and
reloadConfig, emitting them via log.Warn().
docs/oauth_authorization.md: dropped the "10 min – 1 h typical" TTL range that
conflicted with actual IdP defaults (Auth0 24h, Google 1h, Okta ~1h). The TTL
is wholly the IdP's choice; the doc now just says so.
Tests added:
- TestNegativeCachePreservesErrorIdentity (sentinel survives cache hit)
- TestRemovedKeyWarnings (helper detects all 7 removed keys + handles
empty/unparseable/clean inputs)
- TestLoadConfigFromFile_PopulatesRemovedKeyWarnings (e2e via the public
LoadConfigFromFile entry point)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Multi-replica sidecar hardening, ahead of scaling otel CH to 3 replicas.
pkg/oauth/jwks.go:
- oauth.jwks_cache_ttl was previously a no-op config field; now plumbed
through to the underlying JWKS fetcher. Round-2 added the validation
but left the wire-up incomplete.
- JWKS and metadata TTLs get ±10 % uniform jitter at scheduling time
to spread refresh attempts across replicas. Without it three pods
boot together and re-fetch in lockstep — a thundering herd against
the IdP that shows up as periodic latency cliffs on /verify.
- Transient errors (network blip, 5xx, post-rotation kid miss) are
tagged with oauth.ErrTransient and bypass the negative cache
upstream. The previous behaviour pinned a legitimate token as
forbidden for negative_ttl on one replica's bad luck. See
docs/ch-jwt-verify.md "Multi-replica behavior".
pkg/oauth/{config,errors,jwt}.go, verifier_test.go: supporting plumbing
+ tests for the three above.
cmd/ch-jwt-verify/main.go: new /readyz endpoint distinct from /healthz.
/healthz stays unconditional (liveness — flapping IdP must not restart
the container). /readyz reports failure when the most-recent JWKS fetch
errored with no later success; cold start (no attempts yet) is treated
as boot-grace OK so the kubelet doesn't keep the pod NotReady forever
waiting on the first /verify.
cmd/ch-jwt-verify/{verify,verify_test}.go: JWKSHealth() pass-through
from the inner oauth.Verifier so /readyz can read the triple.
helm/ch-jwt-verify/templates/configmap.yaml + values.yaml: rendered
<ch_jwt_verify> XML now carries <connection_timeout_ms>,
<receive_timeout_ms>, <send_timeout_ms> (defaults 1000/3000/1000 —
3s receive sized to cover a cold JWKS round-trip on a freshly scheduled
replica; CH does not retry, so one shot per query).
helm/ch-jwt-verify/templates/_helpers.tpl: probe helpers now point
readinessProbe.httpGet.path at /readyz; livenessProbe stays on /healthz.
docs/ch-jwt-verify.md: documents multi-replica behavior, the two probes,
and the transient-vs-permanent error split.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The POST-only restriction in c7fd7bd cited the CH 24.x+ docs claim that
<http_authentication_servers> are invoked via POST, but the live Antalya 26.1
build (otel cluster, 26.1.11.20001) actually GETs the verify URL. Returning 405
there breaks delegation entirely: CH silently treats the auth server as
unhealthy and falls back to local password validation, which reports
WRONG_PASSWORD without ever forwarding to the sidecar. No /verify log line
appears for any auth attempt. This was confirmed against three gating users
(Boris, azvonov, a throwaway test user) on a freshly-rolled 3-replica cluster;
rolling the sidecar back to 49ecb42 (which accepts GET+POST) restored
end-to-end auth immediately.
The original credential-in-URL concern that motivated POST-only does not
apply: this handler reads credentials only from the Authorization header
(forwarded by CH per <forward_headers>), discards the request body and query
string, and the listener binds 127.0.0.1 only.
TestVerifierRejectsNonPOST renamed to TestVerifierRejectsUnsupportedMethods
(drops GET from the rejected set; Allow header now "GET, POST"). New
TestVerifierAcceptsGET pins the 401 vs 405 behaviour so a future POST-only
attempt fails CI rather than silently breaking gating mode again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A minimal RS256 JWT signer + JWKS server with operational verbs the HA
test plan needs. NOT FOR PRODUCTION. Bind to an in-cluster Service and
point the sidecar's oauth.issuer / jwks_url / audience at it; mint
arbitrary tokens and chaos-engineer the JWKS endpoint without touching
Auth0.
Endpoints:
GET /.well-known/openid-configuration
GET /.well-known/jwks.json
POST /sign?email=…&exp=…&kid=…&aud=…&iat_off=…&nbf_off=…&email_verified=…
POST /rotate?kid=…
POST /retire?kid=… (drop from JWKS but keep private key)
POST /jwks/break?on=true
POST /jwks/slow?ms=N
GET /healthz
Used 2026-05-22 to validate the round-3 sidecar against:
- T1 JIT JWKS refresh on unknown kid (PASS)
- T3 transient JWKS failure → no negative-cache poisoning (PASS)
- T5 sidecar process kill HA gap (PASS, sub-200ms)
- T8 clock-skew tolerance (PASS)
- Audience byte-equal check (one real finding:
trailing-slash insensitivity — token aud
`https://otel-mcp.test` accepted when configured is
`https://otel-mcp.test/`. RFC 8707 deviation, separate fix.)
- Concurrency (50 workers × 30s, 100% 200)
- Churn (20k unique tokens, cache cap=10k, mem stable at 16 MB RSS)
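The audience finding in the list above concerns strict byte-equal comparison. As a small illustration, a byte-equal check treats `https://otel-mcp.test` and `https://otel-mcp.test/` as different audiences; `audienceAllowed` is a hypothetical helper, not the SDK's function.

```go
package main

import "fmt"

// audienceAllowed reports whether any audience in the token is byte-equal to
// the configured value. Go's == on strings compares bytes, so no trailing-slash
// or case normalisation takes place.
func audienceAllowed(tokenAud []string, configured string) bool {
	for _, aud := range tokenAud {
		if aud == configured {
			return true
		}
	}
	return false
}

func main() {
	configured := "https://otel-mcp.test/"
	fmt.Println(audienceAllowed([]string{"https://otel-mcp.test/"}, configured))
	fmt.Println(audienceAllowed([]string{"https://otel-mcp.test"}, configured))
}
```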
Built and pushed as
ghcr.io/altinity/synthetic-idp:synthetic-idp-fbdd04c-arm64 (arm64-only by
default since the otel demo cluster has only arm64 nodes; override with
ARCHES=).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the in-tree oauth, jwe_auth, and oauth/broker packages with the shared github.com/altinity/go-mcp-oauth-sdk v0.1.0 module. Runtime behavior unchanged — StrictJWTOnly defaults to false, matching mcp's existing soft-pass semantics for opaque tokens.
Drop cmd/ch-jwt-verify, helm/ch-jwt-verify, Dockerfile.ch-jwt-verify, the
build script, and docs/ch-jwt-verify.md. The sidecar now lives in
github.com/altinity/altinity-oauth-helper. Cross-references in
docs/oauth_authorization.md point at the new repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three comments still referenced paths that no longer exist:
- pkg/config/config.go deprecated-key replacement pointed at
docs/ch-jwt-verify.md
- pkg/server/server_auth_oauth.go had a TODO referencing the dropped
pkg/oauth/broker package
- cmd/altinity-mcp/oauth_regression_test.go pointed at the moved
cmd/ch-jwt-verify/verify_test.go
All three now point at github.com/altinity/altinity-oauth-helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 21 helpers in oauth_server.go (encodeOAuthJWE, decodeOAuthJWE, PKCE,
canonicalResourceURL, scope sanitisers, RFC 6749 error parsers, Google issuer
detection, etc.) are byte-for-byte duplicates of go-mcp-oauth-sdk/broker. Drop
the local definitions, call broker.* at the use sites, retarget the unit tests
onto the SDK exports. oauth_server.go: -354 lines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The synthetic-idp binary's sole consumer was ch-jwt-verify stress tests, and
that sidecar now lives in altinity-oauth-helper. Move the test IdP alongside
its consumer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure whitespace + struct-tag alignment from `gofmt -w cmd/ pkg/ internal/`. No
behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflict in go.mod: take urfave/cli v3.9.0 from main; x/crypto and x/net
promoted to direct deps in first require block; remove now-redundant x/net
from indirect block and standalone x/crypto indirect stanza.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary

Shrink MCP's trust radius: MCP no longer impersonates users to ClickHouse via the shared `cluster_secret`. Instead, each query carries `Authorization: Basic base64(email:JWT)` and ClickHouse's `<http_authentication>` delegates to a new colocated `ch-jwt-verify` sidecar that validates the JWT against the upstream IdP's JWKS. A compromised MCP pod can no longer forge identity to CH.

E2E-verified in production: gating-mode MCP behind Auth0 with the sidecar colocated in the ClickHouse pod, returning `currentUser()` as the OAuth user's verified email.

What changes

Gating mode wire format. Forward mode keeps `Authorization: Bearer <jwt>` (Antalya `token_processors` re-validates). Gating mode now sends `Authorization: Basic base64(email:JWT)`:
- MCP unverified-decodes the `email` claim (or the namespaced `*/email` fallback) and sets `Auth.Username = email`, `Auth.Password = JWT`. The CH go-driver assembles the Basic header.
- Forces `Protocol = HTTP` under OAuth-enabled mode (TCP native has no `<http_authentication>` equivalent).

New binary: `cmd/ch-jwt-verify/` (sidecar).
- `POST /verify` with `Authorization: Basic …` → 200 + optional `{"settings": …}` on success, any non-200 on failure.
- Shared `pkg/oauth/jwt.go` + `jwks.go` for JWKS fetch/cache and signature verification.
- Enforces `iss`, `aud` byte-equality (RFC 8707), `exp`/`nbf`/`iat` with clock skew, required scopes, identity policy (verified-email, domain allow-listing), and user-vs-email match (configurable `lowercase_equal`/`exact`).
- Maps scopes to CH session settings via the `settings_from_scope:` config.
- Cache keyed by `SHA256(JWT)` with positive/negative TTLs.

Removed from MCP (~1751 LOC):
- `cluster_secret` + `cluster_name` (`pkg/clickhouse/client.go`, `pkg/config/config.go`, `cmd/altinity-mcp/main.go`)
- `ClaimsToHeaders` claim → header mapping + `ClickHouseHeaderName` custom-header option (`pkg/oauth/forward.go`, `pkg/oauth/config.go`)
- MCP-side identity policy: `AllowedEmailDomains`, `AllowedHostedDomains`, `AllowUnverifiedEmail` (moved to sidecar config)
- Per-request MCP-side JWT validation (`pkg/server/server_client.go`, `cmd/altinity-mcp/oauth_server.go`); the sidecar is the gate
- `validateClusterSecretConfig` startup check
- `pkg/oauth/identity.go`, `pkg/clickhouse/cluster_secret_test.go`, `pkg/server/oauth_gating_embedded_test.go`, `docs/oauth_next_refactor.md`

Added (~489 LOC + new packages):
- `cmd/ch-jwt-verify/`: main + verify + settings + config + tests
- `pkg/oauth/policy.go`: `EmailDomain` / `ContainsDomain` / `HasRequiredScopes` (extracted for sidecar reuse)
- `helm/ch-jwt-verify/`: ConfigMap chart (sidecar YAML config + CH `<http_authentication_servers>` XML drop-in + reusable container fragment template). Deploys colocated in the CH StatefulSet pod (loopback trust model).
- `Dockerfile.ch-jwt-verify` + `scripts/build-ch-jwt-verify-image.sh`
- `docs/ch-jwt-verify.md`: full reference doc for the sidecar
- `docs/oauth_authorization.md` rewritten to be generic and current

Startup-validation tightening (`cmd/altinity-mcp/main.go`):

Per-cluster migration plan
- Deploy the sidecar and add `<http_authentication_servers>` to CH `config.d/`.
- `CREATE USER OR REPLACE <email> ON CLUSTER ... IDENTIFIED WITH http SERVER 'ch_jwt_verify' SCHEME 'BASIC' DEFAULT ROLE ...` (the grammar token is `http`, not `http_authenticator`; CH rejects the latter with SYNTAX_ERROR).
- Drop `cluster_secret` / `cluster_name` / `CLICKHOUSE_CLUSTER_SECRET` from helm values, switch to `protocol: http` / `port: 8123`, bump image tag.

Test plan
- `go test ./...`: all packages pass (cmd/altinity-mcp, cmd/ch-jwt-verify, pkg/oauth, pkg/server, pkg/clickhouse, pkg/config)
- `go vet ./...`: clean
- E2E gating-mode run returns `currentUser()` as the OAuth user's verified email
- `helm/ch-jwt-verify/` lints cleanly (`helm lint`)

Caveats / follow-ups (not in this PR)
- The clickhouse-operator auto-injects `volumeMount{name: default, mountPath: /var/lib/clickhouse}` on every container; the actual data PVC volume is named with the volumeClaimTemplate name (e.g. `default-1-1`), so the injected mount fails validation. Workaround: declare an `emptyDir` volume named `default` in the podTemplate.
- `kubectl patch chi/<name>` edits are wiped by any operator-side reconcile push. A proper CHIT that the operator preserves on reconcile is the long-term fix.
- OAuth users must be re-`CREATE`d with the new `IDENTIFIED WITH http` clause before they can reach the upgraded MCP. Until that's done they cannot query through the new path.

Image artifacts
- `ghcr.io/altinity/altinity-mcp:sidecar-<sha>-arm64` (public)
- `ghcr.io/altinity/ch-jwt-verify:sidecar-<sha>-arm64` (private by default; flip to public via GitHub UI, or use an `ImagePullSecret` in the deploying namespace)

🤖 Generated with Claude Code