Implement: helm-upgrade-smoke CI job (Phase 13a §6.3 deferral, #284)#311
Merged
Conversation
Installs the chart at the merge-base with origin/main, drives the fixture to a first variant.integrated, helm-upgrades in place to the worktree chart (--reset-then-reuse-values, the amended docs §7 procedure), and asserts: no immutable-field errors, every workload rolls to readiness, event counts never regress (PVC/state survival), the full helm-smoke end-state is still reached, and helm test passes post-upgrade. Shared kind/event harness extracted from ci-smoke.sh into a sourced ci-smoke-lib.sh (bash-3.2-clean); ci-smoke.sh behavior unchanged. First catch, fix bundled: upgrading a pre-13d release left API-defaulted rollingUpdate values on the task-store-server Deployment while 13d's template sets strategy Recreate — the API server rejects the patch. The template now renders an explicit 'rollingUpdate: null' so the upgrade deletes the field. Validated locally in kind with helm 3.16.2 (CI's pin) via EDEN_UPGRADE_BASELINE_REF=07482b6 (13a chart -> worktree chart). Closes #284. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
P2: the 3-variant fixture often completes during setup's own rollout waits, making post-upgrade progress assertions vacuous. The smoke now derives a doubled-total (6-variant) config, waits for the 6-variant end-state, and asserts strict post-upgrade progress keyed on variant.integrated (the last, slowest-to-saturate chain stage — task.completed saturates while integrations are still in flight). P3: when merge-base(HEAD, origin/main) is HEAD itself (main push or local run on main), fall back to HEAD~1 so the run still crosses a real chart diff; never overrides an explicit EDEN_UPGRADE_BASELINE_REF. P3: fetch origin/main with an explicit refspec, tolerated only when origin/main already resolves; guard that the baseline contains the chart at all. Validated: two further full kind runs on helm 3.16.2 with the default merge-base baseline; the last exercised the strict-progress branch (4/6 integrated pre-upgrade -> 6/6 post). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…napshot Comparing INTEGRATED_FINAL against the PRE-upgrade snapshot let integrations absorbed into the upgrade window masquerade as post-upgrade progress (run-3's trace showed exactly that: 4/6 pre, 6/6 already at the post-rollout snapshot). The gate now baselines on INTEGRATED_POST — taken after helm upgrade --wait and every rollout settled — so the progress proven belongs to the upgraded-and-ready stack; the everything-integrated-during-the-window case logs a NOTE instead of flaking. Validated: full kind run on helm 3.16.2, default merge-base baseline, with the strict branch exercised (4/6 at the post-rollout snapshot -> 6/6 final). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Rebased onto fc92bf0 (#305 managed Postgres): kept both appended CI jobs in ci.yml; #305's ci-smoke-managed-postgres.sh landed against the pre-lib harness shape, ported to ci-smoke-lib.sh in follow-up #310. Re-validated the full upgrade smoke locally on the rebased tree (baseline fc92bf0 -> worktree, helm 3.16.2): PASSED. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this does
Adds the
helm-upgrade-smokeCI job deferred from Phase 13a §6.3 (issue #284): proof thathelm upgradeworks in place against a live release carrying real experiment state. Closes #284.The job (
reference/helm/eden/ci-upgrade-smoke.sh):merge-base(HEAD, origin/main)(on a PR run: main's chart, setup script, and ci-values, extracted viagit archiveinto a layout-preserving temp tree). When the merge-base is HEAD itself (a main push, or a local run on main) the baseline falls back toHEAD~1, so the post-merge safety-net run still crosses a real chart diff. The baseline install runs the worktree-built image.variant.integratedand snapshots event counts. (The 3-variant fixture routinely completes during the setup script's own rollout waits — observed in every local run — which would make post-upgrade progress assertions vacuous; the doubled total keeps work in flight across the upgrade.)helm upgrade --reset-then-reuse-values --waits to the worktree chart — an immutable-field patch (StatefulSetvolumeClaimTemplates, ServiceclusterIP, …) fails right here.variant.integrated(the last, slowest-to-saturate stage of the task chain) whenever integrations remained in flight at the snapshot; the 6-variant end-state (≥6variant.integrated, ≥18task.completed, ≥6 ideationtask.completed); andhelm testpasses post-upgrade.Shared harness: the kind lifecycle, image build/load, event polling, and end-state assertions previously inlined in
ci-smoke.shmoved to a sourcedci-smoke-lib.sh(bash-3.2-clean) used by both smokes;ci-smoke.shbehavior is unchanged.First catch — chart fix bundled. The first local validation run (13a chart → this branch's chart in kind) surfaced a live upgrade-in-place break on main: 13d switched the task-store-server Deployment to
strategy: Recreate, but a release installed before that switch carries API-server-defaultedrollingUpdatevalues on the live object, and an upgrade that only patchestypeleaves them there — the API server rejects the Deployment (rollingUpdate: Forbidden: may not be specified when strategy type is 'Recreate'). The template now renders an explicitrollingUpdate: nullso the upgrade patch deletes the field. CI's merge-base baseline can't see this class of break (both sides already haveRecreate); only a cross-chart run against an older live release can — which is exactly what this job is for going forward.Docs §7 amendment (load-bearing).
docs/deployment/helm.md§7 documented plain--reuse-valuesas the upgrade procedure. That flag discards the new chart'svalues.yaml— any chart revision adding a template-referenced value (13d'sblob.*did exactly this) renders empty and breaks the upgrade for operators following the docs. §7 now prescribes--reset-then-reuse-values(helm ≥ 3.14; CI pins 3.16.2), which re-applies the release's user-supplied values on top of the new chart's defaults. The smoke runs the documented command, so the procedure stays validated on every chart-touching PR.Which phase this advances
Phase 13 hardening — closes the 13a §6.3 deferral (#284). CI-improvement chunk; planless per AGENTS.md (no plan was authored — the roadmap one-liner points at this PR).
Spec ↔ impl implications
None. No spec, wire, or schema change; the chart template fix and docs amendment are reference-deployment-substrate only.
Codex review
Impl-stage records at
docs/plans/review/issue-284-helm-upgrade-smoke/impl/20260610T102219/. Round 0 (codex-cli 0.130.0, read-only): fix-then-ship, 0×P1 + 1×P2 (vacuous post-upgrade progress assertion → doubled-total config + integrated-keyed strict-progress assert) + 2×P3 (push-to-main degenerated to a same-chart upgrade →HEAD~1fallback;git fetch origin mainrefspec hygiene). Round 1: both P3s verified resolved; one residual P2 on the P2-fix itself — strict progress was baselined on the pre-upgrade snapshot, so integrations absorbed into the upgrade window could masquerade as post-upgrade progress (run 3's own trace showed the shape: 4/6 pre, 6/6 already at the post-rollout snapshot). The gate now baselines on the post-rollout snapshot. All findings addressed.Validation
Four full local kind runs of
ci-upgrade-smoke.sh, all with helm 3.16.2 (the CI-pinned version):EDEN_UPGRADE_BASELINE_REF=07482b6(13a chart → this branch): caught therollingUpdate: Forbiddenbreak above, then PASSED end-to-end after the template fix — a real cross-chart upgrade crossing the 13dblob.*+Recreatechanges.variant.integrated), exposing thetask.completed-saturation flaw in the first strict-progress gate.Plus: the new CI job runs on this PR itself (the workflow edit sets
run_all=true), exercising the merge-base → PR-chart path on a fresh runner.helm lint,helm template, full markdownlint,ruff,pyright,uv run pytest -q(2334 passed; one ideator-subprocess flake under parallel kind load passed in isolation — no Python touched), shellcheck on all three scripts, rename-discipline, and complexity gates: clean.What this does NOT cover
Porting
ci-smoke-managed-postgres.shto the shared lib: Implement: Managed (external) Postgres mode for the Helm chart (#173) #305 (13c) landed its smoke in parallel against the pre-libci-smoke.shshape, so it carries an inlined copy of the harness. Deferred to ci-smoke-managed-postgres.sh should source ci-smoke-lib.sh #310 rather than touching the just-merged 13c smoke without a local validation run in this PR.Lease-mode upgrade behavior (the 13a §6.3 "lease holders unchanged within
leaseDurationSeconds" assertion): lease mode is opt-in, deferred, and unvalidated behind Reconcile #128 opaque-id minting with #147's lease-mode orchestrator (multi-experiment smoke gap) #281; the smoke runs the single-experiment default path. An upgrade smoke for lease mode belongs to Reconcile #128 opaque-id minting with #147's lease-mode orchestrator (multi-experiment smoke gap) #281's validation scope.Published-chart-version baselines: the chart has no published releases (it ships in-tree), so the issue's "install at 0.1.0" is realized as "install at the merge-base with main". If/when chart releases are published, the baseline can switch to the latest published version.
Lockstep chart+CLI changes: the baseline install runs the worktree image under the baseline chart, so a PR changing both in lockstep can legitimately fail the baseline bring-up. Escape hatch documented in the script header and job comment: pin
EDEN_UPGRADE_BASELINE_REFin the same PR with the reason called out.The job is not branch-protection-required initially (same posture
helm-smoketook in 13a); bump to required after ~2 weeks clean on main.Fresh-operator walkthrough
Operator-facing surface changed: the documented upgrade procedure (
docs/deployment/helm.md§7). A fresh operator with a live Helm release who pulls a newer chart now runshelm upgrade eden reference/helm/eden -n <ns> --reset-then-reuse-values, and §7 explains when plain--reuse-valuesis still fine (same-chart values tweaks, e.g. the §4.3 registry example) and why it breaks across chart revisions. The exact §7 command is what the CI smoke executes against a live release on every chart-touching PR.🤖 Generated with Claude Code