feat(auth): add Kanopy CorpSecure auth mode for operator UI #8
cbullinger wants to merge 6 commits into
Adds OPERATOR_AUTH_MODE=kanopy as an alternative to the existing GitHub PAT auth. When kanopy mode is active, the CorpSecure proxy handles Okta authentication and forwards a signed JWT on X-Kanopy-Internal-Authorization; the app verifies it against the JWKS endpoint and maps Okta group membership to operator/writer roles. GitHub PAT auth is unchanged.

Also adds the Drone pipeline and Kanopy Helm values files to deploy the copier as a Kanopy app alongside the existing Cloud Run deployment (phased migration).

New env vars (kanopy mode):
- OPERATOR_AUTH_MODE=kanopy
- OPERATOR_AUTH_KANOPY_GROUP: Okta group for RoleOperator
- OPERATOR_AUTH_KANOPY_JWKS_URL: override JWKS endpoint (staging vs prod)

New files:
- services/operator_auth_kanopy.go: JWKS cache + JWT validation + dev bypass
- .drone.yml: publish to ECR, deploy staging/prod
- kanopy/values.yaml: base Helm values
- kanopy/values.staging.yaml: staging overlay
- kanopy/values.prod.yaml: prod overlay

Related: DOCSP-54727
Pull request overview
Adds a new operator UI authentication option (OPERATOR_AUTH_MODE=kanopy) for deployments behind Kanopy’s CorpSecure proxy, while preserving the existing GitHub PAT-based flow as the default. This includes backend JWT validation + role mapping, frontend UI behavior tweaks for non-PAT auth, and deployment configuration to run the service on Kanopy (Helm) via Drone.
Changes:
- Introduces Kanopy CorpSecure JWT verification (JWKS fetch/cache) and switches operator UI auth behavior based on OPERATOR_AUTH_MODE.
- Updates the operator UI frontend to hide PAT inputs and treat Kanopy auth as “already authenticated”.
- Adds Drone pipeline + Kanopy Helm values for staging/prod deploys.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| services/web/operator/index.html | Hide PAT section and treat Kanopy mode as authenticated based on /operator/api/status. |
| services/operator_ui.go | Route wrapper dispatches between GitHub PAT auth and Kanopy JWT auth; adjusts repo-permission behavior in kanopy mode. |
| services/operator_suggest_rule.go | Switches LLM suggestion rate limiting from “per PAT” to “per user” (PAT hash or login hash). |
| services/operator_repo_filter.go | Bypasses per-repo filtering when GitHub permission cache is unavailable (kanopy mode). |
| services/operator_auth_kanopy.go | Implements JWKS cache + JWT verification + dev bypass helper for Kanopy mode. |
| kanopy/values.yaml | Base Helm values for Kanopy web-app chart deployment (service, probes, mesh, SA). |
| kanopy/values.staging.yaml | Staging overlay values including Kanopy auth env vars + staging JWKS override + secrets mapping. |
| kanopy/values.prod.yaml | Prod overlay values including Kanopy auth env vars + replica count + secrets mapping. |
| configs/environment.go | Adds config/env vars and validation for OPERATOR_AUTH_MODE and Kanopy-specific settings. |
| .drone.yml | Adds Drone pipeline to publish image to ECR and deploy to Kanopy staging/prod via Helm. |
- Handle io.ReadAll error in fetchAndParseJWKS rather than silently proceeding with a potentially partial body
- Move DEV_BYPASS_AUTH Kubernetes tripwire from per-request panic to package init() with log.Fatal so misconfigured pods are refused at startup, not on the first request
- Add tag event to Drone publish step so the image exists in ECR before deploy-prod runs on v* tag pipelines
High: reject service-mesh principals (scp=mesh-internal / spiffe:// sub / empty email) before role assignment. With mesh.enabled=true, CorpSecure injects X-Kanopy-Internal-Authorization on all pod-to-pod requests; without this check any mesh workload could read audit events and delivery logs as RoleWriter.

Medium: add operator_auth_kanopy_test.go with 14 tests covering role mapping, mesh-token rejection, alg-confusion, expired tokens, unknown kid, and JWKS server failure. Injectable kanopyJWKSCache (field on operatorUI, not a package global) makes tests hermetic via httptest.

Medium: drop random-key fallback when kid is not found. With multiple keys present during rotation, map iteration is non-deterministic and would spuriously reject valid tokens. Fail cleanly with a clear error instead.

Low: replace write-lock-during-HTTP-fetch with singleflight so a slow JWKS endpoint degrades individual request latency rather than serialising all auth. Write lock now covers only the in-memory cache update.

Low: fix trailing whitespace in kanopy/values.prod.yaml.
cbullinger left a comment
Adversarial review — 1 must-fix, 2 design-decision callouts, 1 UX nit.
Must-fix (P0)
- log.Fatal + import "log" in init() violates two AGENT.md conventions ("return error, never log.Fatal"; "all logging via log/slog"). See inline comment on operator_auth_kanopy.go:272.
Design decisions requiring explicit sign-off (P2)
- || f.cache == nil in repoFilter.bypass() simultaneously removes all four writer-scoping checks. Notably, kanopy writers now see config-change audit events that are explicitly operator-only in GitHub mode, and the full workflow source→target topology. See inline comment on operator_repo_filter.go:36.
- These are intentional tradeoffs (no PAT to call GitHub's permission API), but each surface needs a conscious reviewer accept: the PR description only calls out row visibility, not the config-change event escalation.
UX (P3)
- handleRepoPermission returning allowed: true for all repos in kanopy mode causes writer-role users to see enabled replay buttons that 403 on click. See inline comment on operator_ui.go:318.
Copilot comment assessment (all 3 resolved ✓)
- io.ReadAll error ignored → readErr is checked after the HTTP status gate. ✓
- devBypassUser crash on first request → correctly moved to init() for startup rejection. ✓ (The log.Fatal inside that fix is the P0 above.)
- publish step missing tag trigger → fixed; tag added to when.event. ✓
Verdict: needs-work — P0 must be addressed before merge; P2 items need a reviewer to explicitly accept each access surface in a comment.
        if inCluster {
            log.Fatal("[kanopy auth] DEV_BYPASS_AUTH=1 is set inside a Kubernetes pod — refusing to start. Unset it from the deployment config.")
        }
    }
P0 — AGENT.md convention violation: log.Fatal + import "log"
Two rules from AGENT.md are broken here:
- "Return error, never log.Fatal."
- "All logging via log/slog. Never log.*."
init() was reached for because it can't return an error. The idiomatic fix for this project is to move the check into RegisterOperatorRoutes() (or a dedicated validateKanopyConfig() called from there), return a startup error up to app.go, and use slog.Error(...) before returning — letting the caller handle exit. This also removes the import "log" from the package entirely.
    // see all rows — there is no PAT available to call the GitHub permission API.
    func (f *repoFilter) bypass() bool {
    -    return f == nil || f.user == nil || f.user.Role == RoleOperator
    +    return f == nil || f.user == nil || f.user.Role == RoleOperator || f.cache == nil
P2 — || f.cache == nil silently disables all four writer-scoping checks at once
bypass() == true short-circuits every downstream check. The effects are broader than the comment here implies — needs explicit reviewer sign-off on each:
1. Audit rows: all rows visible to writers, regardless of repo membership.
2. Config-change audit events: allowAuditEvent explicitly designates events with neither source_repo nor target_repo as operator-only: "writers don't get to see admin actions" (line 73). bypass() short-circuits that gate, so kanopy writers see operator-only config-change events.
3. Webhook traces: traces without a Repo field are normally dropped for writers (they can carry sensitive Detail content); that drop is bypassed.
4. Workflow topology (handleWorkflows, line 794): the comment there calls this "the largest single repo-topology leak in the operator UI." The topology filter is now bypassed for all kanopy writers.
None of this is exploitable from outside (CorpSecure guarantees a valid MongoDB employee), but points 2 and 4 are material escalations above what writer-role grants in GitHub mode. Reviewers should consciously accept all four.
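To make the config-change escalation concrete, here is a simplified model of how the bypass interacts with the operator-only gate. Everything except the `bypass()` expression is an assumption: the `permCache` shape, the `allowAuditEvent` signature, and the per-repo check are stand-ins, not the PR's implementation.

```go
package main

import "fmt"

type Role string

const (
	RoleOperator Role = "operator"
	RoleWriter   Role = "writer"
)

type user struct{ Role Role }

// permCache stands in for the GitHub permission cache (hypothetical shape).
type permCache struct{ allowed map[string]bool }

type repoFilter struct {
	user  *user
	cache *permCache // nil in kanopy mode
}

// bypass matches the expression in the diff under review.
func (f *repoFilter) bypass() bool {
	return f == nil || f.user == nil || f.user.Role == RoleOperator || f.cache == nil
}

// allowAuditEvent is a simplified model: events with neither repo are
// operator-only, other events need access to at least one of the repos.
func (f *repoFilter) allowAuditEvent(sourceRepo, targetRepo string) bool {
	if f.bypass() {
		return true
	}
	if sourceRepo == "" && targetRepo == "" {
		return false // config-change event: operator-only
	}
	return f.cache.allowed[sourceRepo] || f.cache.allowed[targetRepo]
}

func main() {
	githubWriter := &repoFilter{
		user:  &user{Role: RoleWriter},
		cache: &permCache{allowed: map[string]bool{"org/a": true}},
	}
	kanopyWriter := &repoFilter{user: &user{Role: RoleWriter}} // cache == nil

	fmt.Println(githubWriter.allowAuditEvent("", "")) // false
	fmt.Println(kanopyWriter.allowAuditEvent("", "")) // true
}
```

In this model the same writer role yields different visibility purely because the cache is nil, which is the behavior reviewers are being asked to accept or reject.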
    // In kanopy mode there is no GitHub PAT to call the permissions API with.
    // Writers see all repos (CorpSecure already guarantees a valid MongoDB employee).
    if o.authMode == "kanopy" {
P3 — Writers get allowed: true for all repos, causing misleading replay-button state
/api/repo-permission is called by the frontend to decide which replay buttons to enable. In kanopy mode, writers receive allowed: true for every repo — so all replay buttons light up. Writers can't actually replay (wrapOperatorOnly correctly returns 403), but they see affordances that don't work.
Consider scoping the shortcut to operators only:
    if o.authMode == "kanopy" {
        user := operatorUserFromCtx(r)
        allowed := user != nil && user.Role == RoleOperator
        for _, repo := range repos { ... result[repo] = repoPerm{Allowed: allowed} }
        ...
    }

Or surface auth_mode + role from /api/me and let the frontend suppress replay buttons for non-operators, skipping the permissions call entirely in kanopy mode.
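The operator-only scoping from the first option can be written out as a small standalone function. This is a sketch under assumptions: `kanopyRepoPerms` and the `operatorUser` type are hypothetical names, and the handler plumbing (context lookup, JSON response) is left out.

```go
package main

import "fmt"

type Role string

const (
	RoleOperator Role = "operator"
	RoleWriter   Role = "writer"
)

// operatorUser is a hypothetical stand-in for the authenticated user type.
type operatorUser struct {
	Login string
	Role  Role
}

// repoPerm mirrors the per-repo permission payload returned to the frontend.
type repoPerm struct {
	Allowed bool `json:"allowed"`
}

// kanopyRepoPerms marks every repo as allowed only for operators, so
// writers do not get replay buttons that would return 403 on click.
func kanopyRepoPerms(u *operatorUser, repos []string) map[string]repoPerm {
	allowed := u != nil && u.Role == RoleOperator
	result := make(map[string]repoPerm, len(repos))
	for _, repo := range repos {
		result[repo] = repoPerm{Allowed: allowed}
	}
	return result
}

func main() {
	repos := []string{"org/a", "org/b"}
	fmt.Println(kanopyRepoPerms(&operatorUser{Login: "op", Role: RoleOperator}, repos))
	fmt.Println(kanopyRepoPerms(&operatorUser{Login: "wr", Role: RoleWriter}, repos))
}
```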
Summary
- Adds OPERATOR_AUTH_MODE=kanopy as an alternative to GitHub PAT auth for the operator UI. When active, Kanopy's CorpSecure proxy handles Okta authentication and forwards a signed JWT on X-Kanopy-Internal-Authorization; the app verifies it against the CorpSecure JWKS endpoint and maps Okta group membership to operator/writer roles.
- GitHub PAT auth (OPERATOR_AUTH_MODE=github, the default) is fully unchanged.

New env vars (kanopy mode)
- OPERATOR_AUTH_MODE=kanopy
- OPERATOR_AUTH_KANOPY_GROUP: Okta group for RoleOperator (e.g. 10gen-github-copier-operators)
- OPERATOR_AUTH_KANOPY_JWKS_URL: override JWKS endpoint (staging vs prod); see values.staging.yaml

New files
- services/operator_auth_kanopy.go: JWKS cache + JWT validation + dev bypass (DEV_BYPASS_AUTH=1)
- .drone.yml: publish to ECR, deploy staging/prod (v* tag)
- kanopy/values.yaml: base Helm values
- kanopy/values.staging.yaml: staging overlay with envSecrets references
- kanopy/values.prod.yaml: prod overlay with envSecrets references

Key design decisions
- JWKS key parsing uses crypto/rsa + math/big; JWT verification uses the existing golang-jwt/jwt/v5.
- repoFilter.bypass() returns true when ghCache == nil (kanopy mode), so writers see all rows; there is no GitHub PAT available to call the repo permissions API.
- DEV_BYPASS_AUTH=1 grants a synthetic operator identity in local dev; a Kubernetes tripwire (KUBERNETES_SERVICE_HOST) refuses to boot the kanopy auth module if the bypass is set inside a real pod.

Before deploying to Kanopy
- Set GITHUB_APP_ID and INSTALLATION_ID in kanopy/values.staging.yaml and values.prod.yaml.
- Create the 10gen-github-copier-operators Okta group and add initial members.

Test plan
- OPERATOR_AUTH_MODE=github (default): existing GitHub PAT auth works unchanged (login, role assignment, per-repo filtering, replay permission check).
- OPERATOR_AUTH_MODE=kanopy with DEV_BYPASS_AUTH=1: UI loads without PAT section, user appears in topbar as operator, all tabs load.
- OPERATOR_AUTH_MODE=kanopy missing OPERATOR_AUTH_KANOPY_GROUP: startup validation rejects with a clear error.
- OPERATOR_AUTH_MODE=invalid: startup validation rejects with a clear error.
- Staging: github-copier.docs.staging.corp.mongodb.com/operator/ is protected by CorpSecure, operator group members get full access, other employees get writer access.

Related: DOCSP-54727