Skip to content

Implement: Managed (external) Postgres mode for the Helm chart (#173)#305

Merged
ealt merged 10 commits into
mainfrom
impl/issue-173-managed-postgres
Jun 10, 2026
Merged

Implement: Managed (external) Postgres mode for the Helm chart (#173)#305
ealt merged 10 commits into
mainfrom
impl/issue-173-managed-postgres

Conversation

@ealt

@ealt ealt commented Jun 10, 2026

Copy link
Copy Markdown
Owner

Summary

Phase 13c. Adds a postgres.mode={embedded,external} switch to the 13a Helm chart so an operator can run the task-store-server against a managed Postgres they provisioned out-of-band (RDS / Cloud SQL / AlloyDB / Supabase / Neon / Crunchy Bridge / …) instead of the in-cluster StatefulSet.

  • embedded stays the default and renders byte-for-byte identical to 13a (verified by diffing helm template) — a 13a→13c upgrade with an untouched values.yaml triggers no rollout.
  • external accepts a URL-form DSN inline (postgres.external.connectionString) or from a Secret (postgres.external.existingSecret / whole-chart secrets.existingSecret). values.schema.json fails closed when neither is set (no fictional default host), rejects weak tls.mode, requires a CA bundle for verify-ca/verify-full, requires the tlsAlreadyEncodedInSecret ack whenever the chart can't append the TLS suffix itself, rejects tls.enabled for embedded (stock Postgres has no server-side TLS), and rejects external + lease mode (unsupported in v0).
  • Opt-in postgres.tls appends ?sslmode=…[&sslrootcert=…] and mounts the provider CA bundle. setup-experiment-helm.sh skips the embedded-Postgres readiness wait in external mode (otherwise mode-agnostic). PostgresStore / _postgres_schema.py / the wire protocol are untouched — managed Postgres is a deployment-substrate change only (no conformance impact).

Plan ↔ chart reconciliation. The 13c plan predates the 13a merge (#298) and assumed a chart shape 13a didn't ship. This PR deviates where 13a's reality is simpler — no postgres.embedded.* rename/shim (AGENTS.md greenfield rule), no envFromvalueFrom switch / "upgrade trap" (13a already composes EDEN_STORE_URL into the Secret + consumes it via envFrom), and three derived DSNs (store/readonly/control-plane) rather than one. Full reconciliation in the CHANGELOG entry.

What this does NOT cover

  • Separate readonly-role + control-plane-database provisioning in external mode — v0 points all three DSN keys at the one operator DSN; EDEN_READONLY_PASSWORD is blanked so no DCL runs against the managed instance → #299. This also makes external + lease mode unsupported (lease mode itself deferred behind #281); the schema rejects the combination.
  • Connection pooler + multi-replica task-store-server + read-replica routing — future amendments, load-bearing only past the single-replica threshold 13c sits at → #300.
  • Standalone schema migration tool (Alembic/sqitch)ensure_schema-on-first-connect stays canonical for v0's single schema version; forward constraint (recognize schema_version as v1-applied) documented in helm.md §4.5 → #301.
  • helm-smoke-managed-postgres branch-protection requirement — not yet required (same posture newly-added smoke jobs took), tracked with the 13a helm jobs under #286.
  • Compose deployment — unchanged (keeps its in-stack Postgres for local dev).

Fresh-operator walkthrough

Changed operator-facing surfaces: chart values (postgres.mode/external/tls), secret.yaml + task-store-server env wiring, setup-experiment-helm.sh, and the new migration runbook.

  • Walkthrough performed against the values/template surface via the helm template/helm lint matrix (embedded byte-identical; external inline/existingSecret/whole-chart-secret; TLS verify-full CA mount; all schema fail-closed cases A1–A3/B1–B3 + tls+embedded + secrets+tls-ack + external+lease). Each runbook YAML/helm snippet was lint-checked against the schema (incl. the corrected existingSecret+TLS examples and the rollback reset command).
  • Notes: passed cleanly. kind is not installed locally, so the two kind-based smokes (ci-smoke.sh, ci-smoke-managed-postgres.sh) run in CI (helm-smoke, helm-smoke-managed-postgres); both scripts pass bash -n.

Test plan

  • scripts/git-hooks/pre-push (rename-discipline, complexity-gate, ruff, pyright, markdownlint) — clean.
  • helm lint + helm template matrix across both modes + the schema fail-closed cases (see helm-lint job; replicated locally).
  • Embedded-mode helm template byte-for-byte identical to main.
  • Codex review (synchronous, foreground) — 8 rounds, all findings fixed, final pass clean. Record committed under docs/plans/review/eden-phase-13c-managed-postgres/impl/.
  • CI helm-lint + helm-smoke + helm-smoke-managed-postgres (kind-based smokes run in CI only).

Related issues

Closes #173. Deferrals: #299, #300, #301 (filed this PR); references #281, #286.

🤖 Generated with Claude Code

ealt and others added 10 commits June 9, 2026 22:38
Add postgres.mode={embedded,external} to the 13a chart. embedded stays the
default and renders byte-for-byte identical (no upgrade rollout). external
points the task-store-server at an operator-provisioned managed Postgres
(RDS/Cloud SQL/AlloyDB/Supabase/Neon/…) via a URL-form DSN supplied inline or
from an existing Secret; values.schema.json fails closed when neither is set
and rejects weak TLS modes / missing CA bundles. Opt-in postgres.tls appends
the sslmode/sslrootcert suffix and mounts the CA bundle. setup-experiment-helm.sh
skips the embedded-Postgres readiness wait in external mode. New helm-lint
external matrix + helm-smoke-managed-postgres CI job (sibling Postgres on the
kind network). Runbook + helm.md cover migration, TLS, providers, pooler
threshold, and the schema-version forward constraint.

Deviates from the pre-13a plan where 13a's reality is simpler (no embedded.*
rename/shim; no envFrom->valueFrom switch; three DSNs not one) — see CHANGELOG
"Plan <-> chart reconciliation".

Closes #173

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Runbook external.existingSecret+TLS examples (Path A + Path B) were rejected
  by values.schema.json (missing tlsAlreadyEncodedInSecret + un-encoded DSN);
  encode TLS in the DSN, set the ack flag, note the inline-connectionString
  alternative. Verified the corrected combo passes helm lint.
- Link the pooler/multi-replica/read-replica (#300) and migration-tool (#301)
  deferrals in the CHANGELOG per the AGENTS.md deferral-tracking rule.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nectionString

When both postgres.external.connectionString AND external.existingSecret were
set, secret.yaml still wrote the inline DSN into the chart-managed Secret's
three DSN keys while the task-store-server sourced EDEN_STORE_URL from the
existingSecret — two conflicting sources, leaking a stale inline password into
the readonly/control-plane keys. Gate the inline write on existingSecret being
absent so existingSecret takes precedence (the values.yaml comment already
promised this). Verified: with both set, the chart Secret omits all DSN keys
and the inline DSN appears nowhere in the render.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1. Schema: reject mode=external + whole-chart secrets.existingSecret + inline
   connectionString (the chart Secret doesn't render in that case, so the inline
   DSN was silently dropped). Move the "require external.*" gate to a top-level
   clause that fires only when secrets.existingSecret is empty, and require
   external.connectionString be empty when secrets.existingSecret is set — so the
   whole-chart Secret's EDEN_STORE_URL is the authoritative DSN with no foot-gun.
2. task-store-server: blank EDEN_READONLY_PASSWORD in external mode so an
   operator's whole-chart existingSecret carrying it does not trigger
   ensure_readonly_role's CREATE/ALTER ROLE/GRANT against the managed instance
   (fails for accounts without role-management privilege; contradicts the runbook).
3. Runbook Path B: do the mode swap via `helm upgrade --reuse-values` with the
   app-tier replicas pinned to 0 (not a setup-script re-run, which reconciles
   them back to 1), so no new writes hit the restored DB before verification;
   bring the app tier up with a second upgrade only after the event count matches.

Verified: schema matrix (A1-A3/B1-B3 + tls), readonly blanked in render, embedded
still byte-identical. New CI fail-closed assertion for the B2 case.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1. Schema: reject postgres.tls.enabled=true unless mode=external — the embedded
   StatefulSet runs stock Postgres with no server-side TLS, so appending sslmode
   to its DSN would make the task-store-server fail to connect.
2. Schema: require tlsAlreadyEncodedInSecret when mode=external + tls.enabled +
   a whole-chart secrets.existingSecret (the chart-managed Secret doesn't render
   there, so the chart can't append the suffix — same fail-closed posture as the
   postgres.external.existingSecret guard).
3. ci-smoke-managed-postgres.sh: track CREATED_PG_CONTAINER and refuse to reuse
   a pre-existing container of the same name, so an early exit never deletes a
   container this run did not create (mirrors the kind-cluster guard).

Updated values.yaml/helm.md TLS docs (external-only; both existingSecret paths
need the ack). New CI fail-closed assertions. Verified: matrix + embedded
byte-identical still hold.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eview record

- Schema: reject orchestrator.leaseMode.enabled=true when postgres.mode=external.
  external + lease is unsupported in v0 (deferred behind #281/#299); without the
  guard the control-plane Deployment renders and crash-loops on a missing
  EDEN_CONTROL_PLANE_STORE_URL. New CI fail-closed assertion covers it.
- Commit the impl-stage codex-review record under
  docs/plans/review/eden-phase-13c-managed-postgres/impl/ (AGENTS.md requirement
  for a plan-backed chunk marked shipped).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…LS guard

The round-4 schema guard (TLS rejected for mode=embedded) made the Path B
rollback command fail: with --reuse-values the release retains tls.enabled=true,
so a bare --set postgres.mode=embedded trips the guard. Give the rollback an
explicit command that resets tls.enabled=false + clears the external.* overrides
while restoring replicas (verified it passes helm lint). Update the impl review
record to capture all six review rounds.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ook drain race

- [P1] helm-lint external+existingSecret step ended on a negative `grep -q ... &&
  { exit 1; }`; when the grep correctly finds nothing it exits 1 and fails the
  step on the GOOD path. Convert negative checks to `if grep -q; then exit 1; fi`
  so the last command returns 0 when the chart is correct. Verified under bash -e.
- [P2] Runbook Path B drain: `kubectl scale` returns before pods exit, so an
  immediate reclaim/dump races terminating workers. Add `kubectl wait
  --for=delete pod` after each scale-down and a pre-snapshot pods-at-zero check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mode

- Troubleshooting: task-store-server CrashLoopBackOff guidance now branches on
  mode — external has no eden-postgres-0 pod; check the managed instance/DSN/TLS
  and the task-store-server logs.
- §9: managed external Postgres is no longer "not covered" (13c ships it); point
  to §4.4 + the runbook. Roadmap ship date -> 2026-06-10.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ckend conflicts)

Tree is identical to the validated rebase commit a931967 — values.schema.json
allOf clauses merged with #304's blob.backend clauses, CHANGELOG keeps both
[Unreleased] entries.
@ealt ealt enabled auto-merge (squash) June 10, 2026 17:14
@ealt ealt merged commit fc92bf0 into main Jun 10, 2026
24 checks passed
ealt added a commit that referenced this pull request Jun 11, 2026
…311)

* Add helm-upgrade-smoke CI job (Phase 13a §6.3 deferral, #284)

Installs the chart at the merge-base with origin/main, drives the fixture
to a first variant.integrated, helm-upgrades in place to the worktree
chart (--reset-then-reuse-values, the amended docs §7 procedure), and
asserts: no immutable-field errors, every workload rolls to readiness,
event counts never regress (PVC/state survival), the full helm-smoke
end-state is still reached, and helm test passes post-upgrade.

Shared kind/event harness extracted from ci-smoke.sh into a sourced
ci-smoke-lib.sh (bash-3.2-clean); ci-smoke.sh behavior unchanged.

First catch, fix bundled: upgrading a pre-13d release left API-defaulted
rollingUpdate values on the task-store-server Deployment while 13d's
template sets strategy Recreate — the API server rejects the patch. The
template now renders an explicit 'rollingUpdate: null' so the upgrade
deletes the field. Validated locally in kind with helm 3.16.2 (CI's pin)
via EDEN_UPGRADE_BASELINE_REF=07482b6 (13a chart -> worktree chart).

Closes #284.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Address codex round 0: doubled-total progress gate + baseline fallbacks

P2: the 3-variant fixture often completes during setup's own rollout
waits, making post-upgrade progress assertions vacuous. The smoke now
derives a doubled-total (6-variant) config, waits for the 6-variant
end-state, and asserts strict post-upgrade progress keyed on
variant.integrated (the last, slowest-to-saturate chain stage —
task.completed saturates while integrations are still in flight).

P3: when merge-base(HEAD, origin/main) is HEAD itself (main push or
local run on main), fall back to HEAD~1 so the run still crosses a
real chart diff; never overrides an explicit EDEN_UPGRADE_BASELINE_REF.
P3: fetch origin/main with an explicit refspec, tolerated only when
origin/main already resolves; guard that the baseline contains the
chart at all.

Validated: two further full kind runs on helm 3.16.2 with the default
merge-base baseline; the last exercised the strict-progress branch
(4/6 integrated pre-upgrade -> 6/6 post).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Address codex round 1: baseline strict progress on the post-rollout snapshot

Comparing INTEGRATED_FINAL against the PRE-upgrade snapshot let
integrations absorbed into the upgrade window masquerade as
post-upgrade progress (run-3's trace showed exactly that: 4/6 pre,
6/6 already at the post-rollout snapshot). The gate now baselines on
INTEGRATED_POST — taken after helm upgrade --wait and every rollout
settled — so the progress proven belongs to the upgraded-and-ready
stack; the everything-integrated-during-the-window case logs a NOTE
instead of flaking.

Validated: full kind run on helm 3.16.2, default merge-base baseline,
with the strict branch exercised (4/6 at the post-rollout snapshot ->
6/6 final).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Post-rebase docs: #310 deferral + post-rollout-baseline wording

Rebased onto fc92bf0 (#305 managed Postgres): kept both appended CI
jobs in ci.yml; #305's ci-smoke-managed-postgres.sh landed against the
pre-lib harness shape, ported to ci-smoke-lib.sh in follow-up #310.
Re-validated the full upgrade smoke locally on the rebased tree
(baseline fc92bf0 -> worktree, helm 3.16.2): PASSED.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Roadmap: flip #284 helm-upgrade-smoke to shipped (PR #311)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Eric Alt <13019253+ealt@users.noreply.github.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phase 13c — Managed Postgres migration

1 participant