Implement: Managed (external) Postgres mode for the Helm chart (#173)#305
Merged
Conversation
Add postgres.mode={embedded,external} to the 13a chart. embedded stays the
default and renders byte-for-byte identical (no upgrade rollout). external
points the task-store-server at an operator-provisioned managed Postgres
(RDS/Cloud SQL/AlloyDB/Supabase/Neon/…) via a URL-form DSN supplied inline or
from an existing Secret; values.schema.json fails closed when neither is set
and rejects weak TLS modes / missing CA bundles. Opt-in postgres.tls appends
the sslmode/sslrootcert suffix and mounts the CA bundle. setup-experiment-helm.sh
skips the embedded-Postgres readiness wait in external mode. New helm-lint
external matrix + helm-smoke-managed-postgres CI job (sibling Postgres on the
kind network). Runbook + helm.md cover migration, TLS, providers, pooler
threshold, and the schema-version forward constraint.
Deviates from the pre-13a plan where 13a's reality is simpler (no embedded.*
rename/shim; no envFrom->valueFrom switch; three DSNs not one) — see CHANGELOG
"Plan <-> chart reconciliation".
Closes #173
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Runbook external.existingSecret+TLS examples (Path A + Path B) were rejected by values.schema.json (missing tlsAlreadyEncodedInSecret + un-encoded DSN); encode TLS in the DSN, set the ack flag, note the inline-connectionString alternative. Verified the corrected combo passes helm lint. - Link the pooler/multi-replica/read-replica (#300) and migration-tool (#301) deferrals in the CHANGELOG per the AGENTS.md deferral-tracking rule. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nectionString When both postgres.external.connectionString AND external.existingSecret were set, secret.yaml still wrote the inline DSN into the chart-managed Secret's three DSN keys while the task-store-server sourced EDEN_STORE_URL from the existingSecret — two conflicting sources, leaking a stale inline password into the readonly/control-plane keys. Gate the inline write on existingSecret being absent so existingSecret takes precedence (the values.yaml comment already promised this). Verified: with both set, the chart Secret omits all DSN keys and the inline DSN appears nowhere in the render. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1. Schema: reject mode=external + whole-chart secrets.existingSecret + inline connectionString (the chart Secret doesn't render in that case, so the inline DSN was silently dropped). Move the "require external.*" gate to a top-level clause that fires only when secrets.existingSecret is empty, and require external.connectionString be empty when secrets.existingSecret is set — so the whole-chart Secret's EDEN_STORE_URL is the authoritative DSN with no foot-gun. 2. task-store-server: blank EDEN_READONLY_PASSWORD in external mode so an operator's whole-chart existingSecret carrying it does not trigger ensure_readonly_role's CREATE/ALTER ROLE/GRANT against the managed instance (fails for accounts without role-management privilege; contradicts the runbook). 3. Runbook Path B: do the mode swap via `helm upgrade --reuse-values` with the app-tier replicas pinned to 0 (not a setup-script re-run, which reconciles them back to 1), so no new writes hit the restored DB before verification; bring the app tier up with a second upgrade only after the event count matches. Verified: schema matrix (A1-A3/B1-B3 + tls), readonly blanked in render, embedded still byte-identical. New CI fail-closed assertion for the B2 case. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1. Schema: reject postgres.tls.enabled=true unless mode=external — the embedded StatefulSet runs stock Postgres with no server-side TLS, so appending sslmode to its DSN would make the task-store-server fail to connect. 2. Schema: require tlsAlreadyEncodedInSecret when mode=external + tls.enabled + a whole-chart secrets.existingSecret (the chart-managed Secret doesn't render there, so the chart can't append the suffix — same fail-closed posture as the postgres.external.existingSecret guard). 3. ci-smoke-managed-postgres.sh: track CREATED_PG_CONTAINER and refuse to reuse a pre-existing container of the same name, so an early exit never deletes a container this run did not create (mirrors the kind-cluster guard). Updated values.yaml/helm.md TLS docs (external-only; both existingSecret paths need the ack). New CI fail-closed assertions. Verified: matrix + embedded byte-identical still hold. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eview record - Schema: reject orchestrator.leaseMode.enabled=true when postgres.mode=external. external + lease is unsupported in v0 (deferred behind #281/#299); without the guard the control-plane Deployment renders and crash-loops on a missing EDEN_CONTROL_PLANE_STORE_URL. New CI fail-closed assertion covers it. - Commit the impl-stage codex-review record under docs/plans/review/eden-phase-13c-managed-postgres/impl/ (AGENTS.md requirement for a plan-backed chunk marked shipped). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…LS guard The round-4 schema guard (TLS rejected for mode=embedded) made the Path B rollback command fail: with --reuse-values the release retains tls.enabled=true, so a bare --set postgres.mode=embedded trips the guard. Give the rollback an explicit command that resets tls.enabled=false + clears the external.* overrides while restoring replicas (verified it passes helm lint). Update the impl review record to capture all six review rounds. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ook drain race
- [P1] helm-lint external+existingSecret step ended on a negative `grep -q ... &&
{ exit 1; }`; when the grep correctly finds nothing it exits 1 and fails the
step on the GOOD path. Convert negative checks to `if grep -q; then exit 1; fi`
so the last command returns 0 when the chart is correct. Verified under bash -e.
- [P2] Runbook Path B drain: `kubectl scale` returns before pods exit, so an
immediate reclaim/dump races terminating workers. Add `kubectl wait
--for=delete pod` after each scale-down and a pre-snapshot pods-at-zero check.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mode - Troubleshooting: task-store-server CrashLoopBackOff guidance now branches on mode — external has no eden-postgres-0 pod; check the managed instance/DSN/TLS and the task-store-server logs. - §9: managed external Postgres is no longer "not covered" (13c ships it); point to §4.4 + the runbook. Roadmap ship date -> 2026-06-10. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This was referenced Jun 10, 2026
ealt
added a commit
that referenced
this pull request
Jun 11, 2026
…311) * Add helm-upgrade-smoke CI job (Phase 13a §6.3 deferral, #284) Installs the chart at the merge-base with origin/main, drives the fixture to a first variant.integrated, helm-upgrades in place to the worktree chart (--reset-then-reuse-values, the amended docs §7 procedure), and asserts: no immutable-field errors, every workload rolls to readiness, event counts never regress (PVC/state survival), the full helm-smoke end-state is still reached, and helm test passes post-upgrade. Shared kind/event harness extracted from ci-smoke.sh into a sourced ci-smoke-lib.sh (bash-3.2-clean); ci-smoke.sh behavior unchanged. First catch, fix bundled: upgrading a pre-13d release left API-defaulted rollingUpdate values on the task-store-server Deployment while 13d's template sets strategy Recreate — the API server rejects the patch. The template now renders an explicit 'rollingUpdate: null' so the upgrade deletes the field. Validated locally in kind with helm 3.16.2 (CI's pin) via EDEN_UPGRADE_BASELINE_REF=07482b6 (13a chart -> worktree chart). Closes #284. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Address codex round 0: doubled-total progress gate + baseline fallbacks P2: the 3-variant fixture often completes during setup's own rollout waits, making post-upgrade progress assertions vacuous. The smoke now derives a doubled-total (6-variant) config, waits for the 6-variant end-state, and asserts strict post-upgrade progress keyed on variant.integrated (the last, slowest-to-saturate chain stage — task.completed saturates while integrations are still in flight). P3: when merge-base(HEAD, origin/main) is HEAD itself (main push or local run on main), fall back to HEAD~1 so the run still crosses a real chart diff; never overrides an explicit EDEN_UPGRADE_BASELINE_REF. P3: fetch origin/main with an explicit refspec, tolerated only when origin/main already resolves; guard that the baseline contains the chart at all. Validated: two further full kind runs on helm 3.16.2 with the default merge-base baseline; the last exercised the strict-progress branch (4/6 integrated pre-upgrade -> 6/6 post). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Address codex round 1: baseline strict progress on the post-rollout snapshot Comparing INTEGRATED_FINAL against the PRE-upgrade snapshot let integrations absorbed into the upgrade window masquerade as post-upgrade progress (run-3's trace showed exactly that: 4/6 pre, 6/6 already at the post-rollout snapshot). The gate now baselines on INTEGRATED_POST — taken after helm upgrade --wait and every rollout settled — so the progress proven belongs to the upgraded-and-ready stack; the everything-integrated-during-the-window case logs a NOTE instead of flaking. Validated: full kind run on helm 3.16.2, default merge-base baseline, with the strict branch exercised (4/6 at the post-rollout snapshot -> 6/6 final). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Post-rebase docs: #310 deferral + post-rollout-baseline wording Rebased onto fc92bf0 (#305 managed Postgres): kept both appended CI jobs in ci.yml; #305's ci-smoke-managed-postgres.sh landed against the pre-lib harness shape, ported to ci-smoke-lib.sh in follow-up #310. Re-validated the full upgrade smoke locally on the rebased tree (baseline fc92bf0 -> worktree, helm 3.16.2): PASSED. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Roadmap: flip #284 helm-upgrade-smoke to shipped (PR #311) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Eric Alt <13019253+ealt@users.noreply.github.com> Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 13c. Adds a
postgres.mode={embedded,external}switch to the 13a Helm chart so an operator can run the task-store-server against a managed Postgres they provisioned out-of-band (RDS / Cloud SQL / AlloyDB / Supabase / Neon / Crunchy Bridge / …) instead of the in-clusterStatefulSet.embeddedstays the default and renders byte-for-byte identical to 13a (verified by diffinghelm template) — a 13a→13c upgrade with an untouchedvalues.yamltriggers no rollout.externalaccepts a URL-form DSN inline (postgres.external.connectionString) or from a Secret (postgres.external.existingSecret/ whole-chartsecrets.existingSecret).values.schema.jsonfails closed when neither is set (no fictional default host), rejects weaktls.mode, requires a CA bundle forverify-ca/verify-full, requires thetlsAlreadyEncodedInSecretack whenever the chart can't append the TLS suffix itself, rejectstls.enabledforembedded(stock Postgres has no server-side TLS), and rejectsexternal+ lease mode (unsupported in v0).postgres.tlsappends?sslmode=…[&sslrootcert=…]and mounts the provider CA bundle.setup-experiment-helm.shskips the embedded-Postgres readiness wait in external mode (otherwise mode-agnostic).PostgresStore/_postgres_schema.py/ the wire protocol are untouched — managed Postgres is a deployment-substrate change only (no conformance impact).Plan ↔ chart reconciliation. The 13c plan predates the 13a merge (#298) and assumed a chart shape 13a didn't ship. This PR deviates where 13a's reality is simpler — no
postgres.embedded.*rename/shim (AGENTS.md greenfield rule), noenvFrom→valueFromswitch / "upgrade trap" (13a already composesEDEN_STORE_URLinto the Secret + consumes it viaenvFrom), and three derived DSNs (store/readonly/control-plane) rather than one. Full reconciliation in the CHANGELOG entry.What this does NOT cover
EDEN_READONLY_PASSWORDis blanked so no DCL runs against the managed instance → #299. This also makes external + lease mode unsupported (lease mode itself deferred behind #281); the schema rejects the combination.ensure_schema-on-first-connect stays canonical for v0's single schema version; forward constraint (recognizeschema_versionas v1-applied) documented inhelm.md§4.5 → #301.helm-smoke-managed-postgresbranch-protection requirement — not yet required (same posture newly-added smoke jobs took), tracked with the 13a helm jobs under #286.Fresh-operator walkthrough
Changed operator-facing surfaces: chart values (
postgres.mode/external/tls),secret.yaml+task-store-serverenv wiring,setup-experiment-helm.sh, and the new migration runbook.helm template/helm lintmatrix (embedded byte-identical; external inline/existingSecret/whole-chart-secret; TLS verify-full CA mount; all schema fail-closed cases A1–A3/B1–B3 + tls+embedded + secrets+tls-ack + external+lease). Each runbook YAML/helmsnippet was lint-checked against the schema (incl. the correctedexistingSecret+TLS examples and the rollback reset command).kindis not installed locally, so the two kind-based smokes (ci-smoke.sh,ci-smoke-managed-postgres.sh) run in CI (helm-smoke,helm-smoke-managed-postgres); both scripts passbash -n.Test plan
scripts/git-hooks/pre-push(rename-discipline, complexity-gate, ruff, pyright, markdownlint) — clean.helm lint+helm templatematrix across both modes + the schema fail-closed cases (see helm-lint job; replicated locally).helm templatebyte-for-byte identical tomain.docs/plans/review/eden-phase-13c-managed-postgres/impl/.helm-lint+helm-smoke+helm-smoke-managed-postgres(kind-based smokes run in CI only).Related issues
Closes #173. Deferrals: #299, #300, #301 (filed this PR); references #281, #286.
🤖 Generated with Claude Code