
feat(auth): add Kanopy CorpSecure auth mode for operator UI #8

Open: cbullinger wants to merge 6 commits into main from feat/kanopy-auth

@cbullinger cbullinger commented Jun 9, 2026

Collaborator

Summary

  • Adds OPERATOR_AUTH_MODE=kanopy as an alternative to GitHub PAT auth for the operator UI. When active, Kanopy's CorpSecure proxy handles Okta authentication and forwards a signed JWT on X-Kanopy-Internal-Authorization; the app verifies it against the CorpSecure JWKS endpoint and maps Okta group membership to operator/writer roles.
  • Adds the Drone CI/CD pipeline and Kanopy Helm values to deploy the copier as a Kanopy app alongside the existing Cloud Run deployment (phased migration — both run in parallel until Cloud Run is decommissioned).
  • GitHub PAT auth (OPERATOR_AUTH_MODE=github, the default) is fully unchanged.

New env vars (kanopy mode)

| Var | Description |
| --- | --- |
| OPERATOR_AUTH_MODE=kanopy | Enable Kanopy auth |
| OPERATOR_AUTH_KANOPY_GROUP | Okta group whose members get RoleOperator (e.g. 10gen-github-copier-operators) |
| OPERATOR_AUTH_KANOPY_JWKS_URL | Override JWKS endpoint; set to the staging URL in values.staging.yaml |

New files

| File | Purpose |
| --- | --- |
| services/operator_auth_kanopy.go | JWKS cache (TTL + failure backoff), JWT verification, dev bypass (DEV_BYPASS_AUTH=1) |
| .drone.yml | Drone pipeline: publish Docker image to ECR, deploy staging on push to main, deploy prod on v* tag |
| kanopy/values.yaml | Base Helm values (image, port, probes, mesh, service account) |
| kanopy/values.staging.yaml | Staging overlay: hostname, env vars, envSecrets references |
| kanopy/values.prod.yaml | Prod overlay: hostname, 2 replicas, env vars, envSecrets references |

Key design decisions

  • JWKS cache: single global instance per process, 10-min TTL, 30-sec failure backoff (serves stale keys rather than hammering a downed endpoint), double-checked locking with write-lock held during fetch.
  • No new Go dependencies: RSA JWK parsing is implemented directly using crypto/rsa + math/big; JWT verification uses the existing golang-jwt/jwt/v5.
  • Per-repo scoping in kanopy mode: repoFilter.bypass() returns true when ghCache == nil (kanopy mode), so writers see all rows — there is no GitHub PAT available to call the repo permissions API.
  • Rate limiting: the LLM suggest-rule limiter falls back to hashed login in kanopy mode (no PAT available).
  • Dev bypass: DEV_BYPASS_AUTH=1 grants a synthetic operator identity in local dev; a Kubernetes tripwire (KUBERNETES_SERVICE_HOST) refuses to boot the kanopy auth module if the bypass is set inside a real pod.
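
For reviewers unfamiliar with the "no new Go dependencies" decision above, here is a minimal sketch of RSA JWK parsing with only crypto/rsa, math/big, and encoding/base64. The type and function names (jwk, rsaPublicKeyFromJWK) are illustrative assumptions, not the identifiers used in services/operator_auth_kanopy.go, and the toy modulus in main is not a usable key.

```go
package main

import (
	"crypto/rsa"
	"encoding/base64"
	"errors"
	"fmt"
	"math/big"
)

// jwk holds the RSA fields of a JSON Web Key (RFC 7517 / RFC 7518).
// Hypothetical type for illustration only.
type jwk struct {
	Kty string `json:"kty"`
	Kid string `json:"kid"`
	N   string `json:"n"` // modulus, base64url without padding
	E   string `json:"e"` // public exponent, base64url without padding
}

// rsaPublicKeyFromJWK converts an RSA JWK into an *rsa.PublicKey.
func rsaPublicKeyFromJWK(k jwk) (*rsa.PublicKey, error) {
	if k.Kty != "RSA" {
		return nil, fmt.Errorf("unsupported key type %q", k.Kty)
	}
	nBytes, err := base64.RawURLEncoding.DecodeString(k.N)
	if err != nil {
		return nil, fmt.Errorf("decode modulus: %w", err)
	}
	eBytes, err := base64.RawURLEncoding.DecodeString(k.E)
	if err != nil {
		return nil, fmt.Errorf("decode exponent: %w", err)
	}
	if len(nBytes) == 0 || len(eBytes) == 0 {
		return nil, errors.New("empty modulus or exponent")
	}
	// Both values are big-endian unsigned integers.
	e := new(big.Int).SetBytes(eBytes)
	if !e.IsInt64() || e.Int64() < 2 || e.Int64() > 1<<31-1 {
		return nil, errors.New("exponent out of range")
	}
	return &rsa.PublicKey{N: new(big.Int).SetBytes(nBytes), E: int(e.Int64())}, nil
}

func main() {
	// "AQAB" is the base64url encoding of 65537; the modulus is a toy value.
	key, err := rsaPublicKeyFromJWK(jwk{Kty: "RSA", Kid: "demo", N: "AQAB", E: "AQAB"})
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println("exponent:", key.E)
}
```

The resulting key can be handed to a golang-jwt/jwt/v5 key function once the token's kid has been matched against the cache.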

Before deploying to Kanopy

  1. Fill in GITHUB_APP_ID and INSTALLATION_ID in kanopy/values.staging.yaml and values.prod.yaml.
  2. Provision secrets on each cluster before first deploy:
    helm ksec set github-copier-secrets \
      GITHUB_APP_PRIVATE_KEY_B64="$(base64 < your-app.private-key.pem)" \
      WEBHOOK_SECRET="..." \
      MONGO_URI="mongodb+srv://..." \
      ANTHROPIC_API_KEY="sk-ant-..."
    
  3. Create the 10gen-github-copier-operators Okta group and add initial members.

Test plan

  • OPERATOR_AUTH_MODE=github (default): existing GitHub PAT auth works unchanged — login, role assignment, per-repo filtering, replay permission check.
  • OPERATOR_AUTH_MODE=kanopy with DEV_BYPASS_AUTH=1: UI loads without PAT section, user appears in topbar as operator, all tabs load.
  • OPERATOR_AUTH_MODE=kanopy missing OPERATOR_AUTH_KANOPY_GROUP: startup validation rejects with a clear error.
  • OPERATOR_AUTH_MODE=invalid: startup validation rejects with a clear error.
  • Staging Kanopy deploy: github-copier.docs.staging.corp.mongodb.com/operator/ is protected by CorpSecure, operator group members get full access, other employees get writer access.
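
The two startup-validation cases in the test plan could be exercised against something like the sketch below. validateOperatorAuth is a hypothetical helper; the PR's real validation lives in configs/environment.go and may differ. Treating an empty mode as github is an assumption based on github being described as the default.

```go
package main

import (
	"errors"
	"fmt"
)

// validateOperatorAuth checks the auth mode and its kanopy-specific settings.
// Hypothetical helper for illustration only.
func validateOperatorAuth(mode, kanopyGroup string) error {
	switch mode {
	case "", "github":
		// Assumption: an unset mode falls back to GitHub PAT auth.
		return nil
	case "kanopy":
		if kanopyGroup == "" {
			return errors.New("OPERATOR_AUTH_MODE=kanopy requires OPERATOR_AUTH_KANOPY_GROUP")
		}
		return nil
	default:
		return fmt.Errorf("invalid OPERATOR_AUTH_MODE %q (want \"github\" or \"kanopy\")", mode)
	}
}

func main() {
	cases := []struct{ mode, group string }{
		{"github", ""},
		{"kanopy", ""},
		{"kanopy", "10gen-github-copier-operators"},
		{"invalid", ""},
	}
	for _, tc := range cases {
		fmt.Printf("mode=%q group=%q -> %v\n", tc.mode, tc.group, validateOperatorAuth(tc.mode, tc.group))
	}
}
```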

Related: DOCSP-54727

Adds OPERATOR_AUTH_MODE=kanopy as an alternative to the existing GitHub
PAT auth. When kanopy mode is active, the CorpSecure proxy handles Okta
authentication and forwards a signed JWT on X-Kanopy-Internal-Authorization;
the app verifies it against the JWKS endpoint and maps Okta group membership
to operator/writer roles. GitHub PAT auth is unchanged.

Also adds the Drone pipeline and Kanopy Helm values files to deploy the
copier as a Kanopy app alongside the existing Cloud Run deployment (phased
migration).

New env vars (kanopy mode):
  OPERATOR_AUTH_MODE=kanopy
  OPERATOR_AUTH_KANOPY_GROUP   — Okta group for RoleOperator
  OPERATOR_AUTH_KANOPY_JWKS_URL — override JWKS endpoint (staging vs prod)

New files:
  services/operator_auth_kanopy.go  — JWKS cache + JWT validation + dev bypass
  .drone.yml                         — publish to ECR, deploy staging/prod
  kanopy/values.yaml                 — base Helm values
  kanopy/values.staging.yaml         — staging overlay
  kanopy/values.prod.yaml            — prod overlay

Related: DOCSP-54727

Copilot AI left a comment


Pull request overview

Adds a new operator UI authentication option (OPERATOR_AUTH_MODE=kanopy) for deployments behind Kanopy’s CorpSecure proxy, while preserving the existing GitHub PAT-based flow as the default. This includes backend JWT validation + role mapping, frontend UI behavior tweaks for non-PAT auth, and deployment configuration to run the service on Kanopy (Helm) via Drone.

Changes:

  • Introduces Kanopy CorpSecure JWT verification (JWKS fetch/cache) and switches operator UI auth behavior based on OPERATOR_AUTH_MODE.
  • Updates the operator UI frontend to hide PAT inputs and treat Kanopy auth as “already authenticated”.
  • Adds Drone pipeline + Kanopy Helm values for staging/prod deploys.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| services/web/operator/index.html | Hide PAT section and treat Kanopy mode as authenticated based on /operator/api/status. |
| services/operator_ui.go | Route wrapper dispatches between GitHub PAT auth and Kanopy JWT auth; adjusts repo-permission behavior in kanopy mode. |
| services/operator_suggest_rule.go | Switches LLM suggestion rate limiting from “per PAT” to “per user” (PAT hash or login hash). |
| services/operator_repo_filter.go | Bypasses per-repo filtering when GitHub permission cache is unavailable (kanopy mode). |
| services/operator_auth_kanopy.go | Implements JWKS cache + JWT verification + dev bypass helper for Kanopy mode. |
| kanopy/values.yaml | Base Helm values for Kanopy web-app chart deployment (service, probes, mesh, SA). |
| kanopy/values.staging.yaml | Staging overlay values including Kanopy auth env vars + staging JWKS override + secrets mapping. |
| kanopy/values.prod.yaml | Prod overlay values including Kanopy auth env vars + replica count + secrets mapping. |
| configs/environment.go | Adds config/env vars and validation for OPERATOR_AUTH_MODE and Kanopy-specific settings. |
| .drone.yml | Adds Drone pipeline to publish image to ECR and deploy to Kanopy staging/prod via Helm. |


Comment thread services/operator_auth_kanopy.go
Comment thread services/operator_auth_kanopy.go Outdated
Comment thread .drone.yml
cbullinger and others added 5 commits June 9, 2026 18:27
- Handle io.ReadAll error in fetchAndParseJWKS rather than silently
  proceeding with a potentially partial body
- Move DEV_BYPASS_AUTH Kubernetes tripwire from per-request panic to
  package init() with log.Fatal so misconfigured pods are refused at
  startup, not on the first request
- Add tag event to Drone publish step so the image exists in ECR before
  deploy-prod runs on v* tag pipelines
High: reject service-mesh principals (scp=mesh-internal / spiffe:// sub /
empty email) before role assignment. With mesh.enabled=true, CorpSecure
injects X-Kanopy-Internal-Authorization on all pod-to-pod requests;
without this check any mesh workload could read audit events and delivery
logs as RoleWriter.

Medium: add operator_auth_kanopy_test.go — 14 tests covering role mapping,
mesh-token rejection, alg-confusion, expired tokens, unknown kid, and JWKS
server failure. Injectable kanopyJWKSCache (field on operatorUI, not a
package global) makes tests hermetic via httptest.

Medium: drop random-key fallback when kid is not found — with multiple keys
present during rotation, map iteration is non-deterministic and would
spuriously reject valid tokens. Fail cleanly with a clear error instead.

Low: replace write-lock-during-HTTP-fetch with singleflight so a slow JWKS
endpoint degrades individual request latency rather than serialising all
auth. Write lock now covers only the in-memory cache update.

Low: fix trailing whitespace in kanopy/values.prod.yaml.
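
The mesh-principal rejection described in the "High" item above can be sketched as a single predicate over the verified claims. The struct shape and the assumption that scp arrives as a single string are illustrative; the real CorpSecure token layout is not shown in this PR page and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// kanopyClaims is a hypothetical subset of the verified JWT claims.
type kanopyClaims struct {
	Subject string // "sub"
	Scope   string // "scp" (assumed to be a single string here)
	Email   string // "email"
}

// isMeshPrincipal reports whether the token belongs to a service-mesh
// workload rather than a human user, using the three signals named in the
// commit message: scp=mesh-internal, a spiffe:// subject, or an empty email.
func isMeshPrincipal(c kanopyClaims) bool {
	return c.Scope == "mesh-internal" ||
		strings.HasPrefix(c.Subject, "spiffe://") ||
		strings.TrimSpace(c.Email) == ""
}

func main() {
	workload := kanopyClaims{Subject: "spiffe://cluster.local/ns/demo/sa/app", Scope: "mesh-internal"}
	human := kanopyClaims{Subject: "jdoe", Email: "jdoe@example.com"}
	fmt.Println("workload rejected:", isMeshPrincipal(workload))
	fmt.Println("human rejected:", isMeshPrincipal(human))
}
```

Running the check before role assignment means a mesh token never reaches the RoleWriter fallback.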

@cbullinger cbullinger left a comment

Collaborator Author


Adversarial review — 1 must-fix, 2 design-decision callouts, 1 UX nit.

Must-fix (P0)

  • log.Fatal + import "log" in init() violates two AGENT.md conventions ("return error, never log.Fatal"; "all logging via log/slog"). See inline comment on operator_auth_kanopy.go:272.

Design decisions requiring explicit sign-off (P2)

  • || f.cache == nil in repoFilter.bypass() simultaneously removes all four writer-scoping checks. Notably, kanopy writers now see config-change audit events that are explicitly operator-only in GitHub mode, and the full workflow source→target topology. See inline comment on operator_repo_filter.go:36.
  • These are intentional tradeoffs (no PAT to call GitHub's permission API), but each surface needs a conscious reviewer accept — the PR description only calls out row visibility, not the config-change event escalation.

UX (P3)

  • handleRepoPermission returning allowed: true for all repos in kanopy mode causes writer-role users to see enabled replay buttons that 403 on click. See inline comment on operator_ui.go:318.

Copilot comment assessment (all 3 resolved ✓)

  1. io.ReadAll error ignored → readErr is checked after the HTTP status gate. ✓
  2. devBypassUser crash on first request → correctly moved to init() for startup rejection. ✓ (The log.Fatal inside that fix is the P0 above.)
  3. publish step missing tag trigger → fixed; tag added to when.event. ✓

Verdict: needs-work — P0 must be addressed before merge; P2 items need a reviewer to explicitly accept each access surface in a comment.

if inCluster {
log.Fatal("[kanopy auth] DEV_BYPASS_AUTH=1 is set inside a Kubernetes pod — refusing to start. Unset it from the deployment config.")
}
}

Collaborator Author


P0 — AGENT.md convention violation: log.Fatal + import "log"

Two rules from AGENT.md are broken here:

  1. "Return error, never log.Fatal."
  2. "All logging via log/slog. Never log.*."

init() was reached for because it can't return an error. The idiomatic fix for this project is to move the check into RegisterOperatorRoutes() (or a dedicated validateKanopyConfig() called from there), return a startup error up to app.go, and use slog.Error(...) before returning — letting the caller handle exit. This also removes the import "log" from the package entirely.
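
A rough sketch of that shape, assuming a validateKanopyConfig helper that takes an injectable getenv function so it can be tested without touching the process environment. The helper's signature and the error variable are assumptions; how the error is threaded through RegisterOperatorRoutes() and app.go is not shown here.

```go
package main

import (
	"errors"
	"log/slog"
	"os"
)

// errDevBypassInCluster is a hypothetical sentinel error for the tripwire.
var errDevBypassInCluster = errors.New("DEV_BYPASS_AUTH=1 is set inside a Kubernetes pod")

// validateKanopyConfig returns an error instead of calling log.Fatal, so the
// caller decides how to exit. getenv is injected for testability.
func validateKanopyConfig(getenv func(string) string) error {
	inCluster := getenv("KUBERNETES_SERVICE_HOST") != ""
	if getenv("DEV_BYPASS_AUTH") == "1" && inCluster {
		slog.Error("kanopy auth: refusing to start; unset DEV_BYPASS_AUTH in the deployment config",
			"error", errDevBypassInCluster.Error())
		return errDevBypassInCluster
	}
	return nil
}

func main() {
	// The caller (app.go in the real project) owns the exit path.
	if err := validateKanopyConfig(os.Getenv); err != nil {
		os.Exit(1)
	}
}
```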

// see all rows — there is no PAT available to call the GitHub permission API.
func (f *repoFilter) bypass() bool {
-	return f == nil || f.user == nil || f.user.Role == RoleOperator
+	return f == nil || f.user == nil || f.user.Role == RoleOperator || f.cache == nil

Collaborator Author


P2 — || f.cache == nil silently disables all four writer-scoping checks at once

bypass() == true short-circuits every downstream check. The effects are broader than the comment here implies — needs explicit reviewer sign-off on each:

  1. Audit rows — all rows visible to writers, regardless of repo membership.
  2. Config-change audit events: allowAuditEvent explicitly designates events with neither source_repo nor target_repo as operator-only: "writers don't get to see admin actions" (line 73). bypass() short-circuits that gate, so kanopy writers see operator-only config-change events.
  3. Webhook traces — traces without a Repo field are normally dropped for writers (they can carry sensitive Detail content); that drop is bypassed.
  4. Workflow topology (handleWorkflows, line 794) — the comment there calls this "the largest single repo-topology leak in the operator UI." The topology filter is now bypassed for all kanopy writers.

None of this is exploitable from outside (CorpSecure guarantees a valid MongoDB employee), but points 2 and 4 are material escalations above what writer-role grants in GitHub mode. Reviewers should consciously accept all four.
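
To make point 2 concrete, here is a stripped-down model of the short-circuit. Only bypass() mirrors the diff; the surrounding types and the simplified allowAuditEvent body are assumptions for illustration, and the real per-repo permission check is replaced by a placeholder.

```go
package main

import "fmt"

type Role int

const (
	RoleWriter Role = iota
	RoleOperator
)

// Hypothetical stand-ins for the real types in services/.
type operatorUser struct{ Role Role }
type permCache struct{}

type repoFilter struct {
	user  *operatorUser
	cache *permCache // nil in kanopy mode: no PAT, so no GitHub permission cache
}

// bypass mirrors the expression in the diff above.
func (f *repoFilter) bypass() bool {
	return f == nil || f.user == nil || f.user.Role == RoleOperator || f.cache == nil
}

// allowAuditEvent is a simplified model: events with no repo are operator-only
// unless bypass() short-circuits the gate first.
func (f *repoFilter) allowAuditEvent(sourceRepo, targetRepo string) bool {
	if f.bypass() {
		return true
	}
	if sourceRepo == "" && targetRepo == "" {
		return false // config-change event: writers don't see admin actions
	}
	return true // placeholder for the real per-repo permission lookup
}

func main() {
	githubWriter := &repoFilter{user: &operatorUser{Role: RoleWriter}, cache: &permCache{}}
	kanopyWriter := &repoFilter{user: &operatorUser{Role: RoleWriter}, cache: nil}
	fmt.Println("github writer sees config-change event:", githubWriter.allowAuditEvent("", ""))
	fmt.Println("kanopy writer sees config-change event:", kanopyWriter.allowAuditEvent("", ""))
}
```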

Comment thread services/operator_ui.go

// In kanopy mode there is no GitHub PAT to call the permissions API with.
// Writers see all repos (CorpSecure already guarantees a valid MongoDB employee).
if o.authMode == "kanopy" {

Collaborator Author


P3 — Writers get allowed: true for all repos, causing misleading replay-button state

/api/repo-permission is called by the frontend to decide which replay buttons to enable. In kanopy mode, writers receive allowed: true for every repo — so all replay buttons light up. Writers can't actually replay (wrapOperatorOnly correctly returns 403), but they see affordances that don't work.

Consider scoping the shortcut to operators only:

if o.authMode == "kanopy" {
    user := operatorUserFromCtx(r)
    allowed := user != nil && user.Role == RoleOperator
    for _, repo := range repos { ... result[repo] = repoPerm{Allowed: allowed} }
    ...
}

Or surface auth_mode + role from /api/me and let the frontend suppress replay buttons for non-operators, skipping the permissions call entirely in kanopy mode.
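
For the first option, the permission computation could be pulled into a small pure function that is easy to unit-test. This is only a sketch: kanopyRepoPerms, operatorUser, and the repoPerm field layout are assumed names and shapes, and the handler wiring is left out.

```go
package main

import "fmt"

type Role int

const (
	RoleWriter Role = iota
	RoleOperator
)

// Hypothetical stand-ins for the real types in services/operator_ui.go.
type operatorUser struct {
	Login string
	Role  Role
}

type repoPerm struct {
	Allowed bool `json:"allowed"`
}

// kanopyRepoPerms marks every repo as allowed only for operators, so writers
// never see replay buttons that would return 403 on click.
func kanopyRepoPerms(user *operatorUser, repos []string) map[string]repoPerm {
	allowed := user != nil && user.Role == RoleOperator
	result := make(map[string]repoPerm, len(repos))
	for _, repo := range repos {
		result[repo] = repoPerm{Allowed: allowed}
	}
	return result
}

func main() {
	repos := []string{"example-org/source", "example-org/target"}
	fmt.Println("writer:", kanopyRepoPerms(&operatorUser{Login: "w", Role: RoleWriter}, repos))
	fmt.Println("operator:", kanopyRepoPerms(&operatorUser{Login: "o", Role: RoleOperator}, repos))
}
```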
