
Implement: helm-upgrade-smoke CI job (Phase 13a §6.3 deferral, #284) #311

Merged: ealt merged 5 commits into main from impl/issue-284-helm-upgrade-smoke on Jun 11, 2026

ealt commented Jun 10, 2026

What this does

Adds the helm-upgrade-smoke CI job deferred from Phase 13a §6.3 (issue #284): proof that helm upgrade works in place against a live release carrying real experiment state. Closes #284.

The job (reference/helm/eden/ci-upgrade-smoke.sh):

  1. Installs the chart at the upgrade baseline: merge-base(HEAD, origin/main) (on a PR run, that is main's chart, setup script, and ci-values, extracted via git archive into a layout-preserving temp tree). When the merge-base is HEAD itself (a main push, or a local run on main) the baseline falls back to HEAD~1, so the post-merge safety-net run still crosses a real chart diff. The baseline install runs the worktree-built image.
  2. Drives a doubled-total (6-variant) derivative of the fixture config to a first variant.integrated and snapshots event counts. (The 3-variant fixture routinely completes during the setup script's own rollout waits — observed in every local run — which would make post-upgrade progress assertions vacuous; the doubled total keeps work in flight across the upgrade.)
  3. Runs helm upgrade --reset-then-reuse-values --wait against the worktree chart; an immutable-field patch (StatefulSet volumeClaimTemplates, Service clusterIP, …) fails right here.
  4. Asserts: every StatefulSet/Deployment rolls to readiness; event counts never regress across the upgrade (the event log is append-only, so a regression means Postgres state was lost — the PVC-reclaim-mistake catch from the issue); strict post-upgrade progress keyed on variant.integrated (the last, slowest-to-saturate stage of the task chain) whenever integrations remained in flight at the snapshot; the 6-variant end-state (≥6 variant.integrated, ≥18 task.completed, ≥6 ideation task.completed); and helm test passes post-upgrade.
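
As a rough illustration of steps 1 and 4, the baseline choice and the append-only check could look like the sketch below. The function names and argument order are hypothetical and not taken from ci-upgrade-smoke.sh; only the decision rules (explicit EDEN_UPGRADE_BASELINE_REF wins, merge-base equal to HEAD falls back to HEAD~1, counts must never shrink) come from the description above.

```shell
# Hypothetical helper: choose the upgrade baseline from already-resolved refs.
# $1 = explicit override (may be empty), $2 = HEAD, $3 = merge-base, $4 = HEAD~1
pick_baseline_ref() {
  if [ -n "$1" ]; then
    printf '%s\n' "$1"   # an explicit EDEN_UPGRADE_BASELINE_REF is never overridden
  elif [ "$3" = "$2" ]; then
    printf '%s\n' "$4"   # merge-base is HEAD itself: fall back to HEAD~1
  else
    printf '%s\n' "$3"   # normal PR run: main's side of the diff
  fi
}

# Hypothetical helper: fail if an append-only event count shrank.
# $1 = event type, $2 = count before the upgrade, $3 = count after it
assert_no_regression() {
  if [ "$3" -lt "$2" ]; then
    echo "FAIL: $1 regressed across the upgrade ($2 -> $3)" >&2
    return 1
  fi
  echo "ok: $1 $2 -> $3"
}

# Possible wiring (assumes a git checkout with origin/main fetched):
#   pick_baseline_ref "${EDEN_UPGRADE_BASELINE_REF:-}" \
#     "$(git rev-parse HEAD)" "$(git merge-base HEAD origin/main)" \
#     "$(git rev-parse HEAD~1)"
```

Keeping the decision separate from the git calls is a choice made for this sketch so the rule can be exercised without a repository; the real script may well inline it.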

Shared harness: the kind lifecycle, image build/load, event polling, and end-state assertions previously inlined in ci-smoke.sh moved to a sourced ci-smoke-lib.sh (bash-3.2-clean) used by both smokes; ci-smoke.sh behavior is unchanged.

First catch — chart fix bundled. The first local validation run (13a chart → this branch's chart in kind) surfaced a live upgrade-in-place break on main: 13d switched the task-store-server Deployment to strategy: Recreate, but a release installed before that switch carries API-server-defaulted rollingUpdate values on the live object, and an upgrade that only patches type leaves them there — the API server rejects the Deployment (rollingUpdate: Forbidden: may not be specified when strategy type is 'Recreate'). The template now renders an explicit rollingUpdate: null so the upgrade patch deletes the field. CI's merge-base baseline can't see this class of break (both sides already have Recreate); only a cross-chart run against an older live release can — which is exactly what this job is for going forward.
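
One way to keep this class of break from silently returning is a coarse lint over rendered manifests. The sketch below is an assumption, not part of the PR: the function name is hypothetical, and it checks the whole input stream rather than each Deployment individually.

```shell
# Hypothetical lint: reads rendered manifest text on stdin and fails when a
# Recreate strategy appears without an explicit 'rollingUpdate: null'.
# Coarse by design: it inspects the whole stream, not individual objects.
check_recreate_strategy() {
  manifest=$(cat)
  if printf '%s\n' "$manifest" | grep -Eq '^[[:space:]]*type:[[:space:]]*Recreate[[:space:]]*$'; then
    if ! printf '%s\n' "$manifest" | grep -Eq '^[[:space:]]*rollingUpdate:[[:space:]]*null[[:space:]]*$'; then
      echo "FAIL: strategy Recreate rendered without rollingUpdate: null" >&2
      return 1
    fi
  fi
  echo "ok: strategy block is upgrade-safe"
}

# Example input shaped like the fixed template's strategy block.
check_recreate_strategy <<'EOF'
spec:
  strategy:
    type: Recreate
    rollingUpdate: null
EOF
```

In practice the input would presumably be `helm template` output for the chart. A static lint like this only complements the live upgrade run; it does not replace it.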

Docs §7 amendment (load-bearing). docs/deployment/helm.md §7 documented plain --reuse-values as the upgrade procedure. That flag discards the new chart's values.yaml — any chart revision adding a template-referenced value (13d's blob.* did exactly this) renders empty and breaks the upgrade for operators following the docs. §7 now prescribes --reset-then-reuse-values (helm ≥ 3.14; CI pins 3.16.2), which re-applies the release's user-supplied values on top of the new chart's defaults. The smoke runs the documented command, so the procedure stays validated on every chart-touching PR.
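
Because --reset-then-reuse-values needs helm 3.14 or newer, an operator script might guard on the client version before running the §7 command. The sketch below is illustrative only: the function name is hypothetical, and treating any major version above 3 as supported is an assumption.

```shell
# Hypothetical guard: succeeds when the given helm version is >= 3.14,
# the first release with --reset-then-reuse-values.
# $1 = version string such as "3.16.2" or "v3.16.2"
helm_supports_reset_then_reuse() {
  ver=${1#v}            # drop an optional leading "v"
  major=${ver%%.*}      # text before the first dot
  rest=${ver#*.}
  minor=${rest%%.*}     # text between the first and second dots
  if [ "$major" -gt 3 ]; then
    return 0            # assumption: later majors keep the flag
  fi
  [ "$major" -eq 3 ] && [ "$minor" -ge 14 ]
}

# Possible wiring (assumes helm is on PATH; --short prints e.g. v3.16.2+g...):
#   helm_supports_reset_then_reuse "$(helm version --short)" ||
#     { echo "helm >= 3.14 required" >&2; exit 1; }
```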

Which phase this advances

Phase 13 hardening — closes the 13a §6.3 deferral (#284). CI-improvement chunk; planless per AGENTS.md (no plan was authored — the roadmap one-liner points at this PR).

Spec ↔ impl implications

None. No spec, wire, or schema change; the chart template fix and docs amendment are reference-deployment-substrate only.

Codex review

Impl-stage records at docs/plans/review/issue-284-helm-upgrade-smoke/impl/20260610T102219/. Round 0 (codex-cli 0.130.0, read-only): fix-then-ship, 0×P1 + 1×P2 (vacuous post-upgrade progress assertion → doubled-total config + integrated-keyed strict-progress assert) + 2×P3 (push-to-main degenerated to a same-chart upgrade → HEAD~1 fallback; git fetch origin main refspec hygiene). Round 1: both P3s verified resolved; one residual P2 on the P2-fix itself — strict progress was baselined on the pre-upgrade snapshot, so integrations absorbed into the upgrade window could masquerade as post-upgrade progress (run 3's own trace showed the shape: 4/6 pre, 6/6 already at the post-rollout snapshot). The gate now baselines on the post-rollout snapshot. All findings addressed.
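
The round-1 fix can be pictured as the gate below. It is a minimal sketch under stated assumptions: the function name, argument order, and message wording are hypothetical. The logic follows the description: baseline on the post-rollout snapshot, require strictly more variant.integrated events afterwards, and log a NOTE instead of failing when everything had already integrated by that snapshot.

```shell
# Hypothetical strict-progress gate.
# $1 = variant.integrated count at the post-rollout snapshot
# $2 = variant.integrated count at the end of the run
# $3 = expected total (6 for the doubled fixture)
assert_post_upgrade_progress() {
  if [ "$1" -ge "$3" ]; then
    # Nothing was left in flight, so strict progress cannot be observed.
    echo "NOTE: all $3 integrations landed before the post-rollout snapshot"
    return 0
  fi
  if [ "$2" -le "$1" ]; then
    echo "FAIL: no variant.integrated progress after the upgrade ($1 -> $2)" >&2
    return 1
  fi
  echo "ok: variant.integrated $1 -> $2 after the upgrade"
}

# The shape reported for the final validation run: 4/6 at the snapshot, 6/6 final.
assert_post_upgrade_progress 4 6 6
```

Baselining on the pre-upgrade snapshot instead would accept a run where the count moved only during the upgrade window, which is the flaw round 1 flagged.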

Validation

Four full local kind runs of ci-upgrade-smoke.sh, all with helm 3.16.2 (the CI-pinned version):

  1. EDEN_UPGRADE_BASELINE_REF=07482b6 (13a chart → this branch): caught the rollingUpdate: Forbidden break above, then PASSED end-to-end after the template fix — a real cross-chart upgrade crossing the 13d blob.* + Recreate changes.
  2. Default (merge-base) baseline, post-codex-round-0 fixes: PASSED; integrations demonstrably continued across the upgrade (3 → 6 variant.integrated), exposing the task.completed-saturation flaw in the first strict-progress gate.
  3. Default baseline with the integrated-keyed gate: PASSED with the strict-progress assertion exercised (4/6 → 6/6) — and its trace exposed the round-1 baseline issue.
  4. Default baseline on the final code: PASSED with the post-rollout-baselined strict-progress branch exercised (4/6 integrated at the post-rollout snapshot → 6/6 final). Two additional attempts of this run died to local-infra flakes (kind API server drop; kubeadm OOM under a concurrently running Compose stack) before the clean pass — both failed pre-install, neither implicates the script.

Plus: the new CI job runs on this PR itself (the workflow edit sets run_all=true), exercising the merge-base → PR-chart path on a fresh runner. helm lint, helm template, full markdownlint, ruff, pyright, uv run pytest -q (2334 passed; one ideator-subprocess flake under parallel kind load passed in isolation — no Python touched), shellcheck on all three scripts, rename-discipline, and complexity gates: clean.

What this does NOT cover

Fresh-operator walkthrough

Operator-facing surface changed: the documented upgrade procedure (docs/deployment/helm.md §7). A fresh operator with a live Helm release who pulls a newer chart now runs helm upgrade eden reference/helm/eden -n <ns> --reset-then-reuse-values, and §7 explains when plain --reuse-values is still fine (same-chart values tweaks, e.g. the §4.3 registry example) and why it breaks across chart revisions. The exact §7 command is what the CI smoke executes against a live release on every chart-touching PR.

🤖 Generated with Claude Code

ealt and others added 5 commits June 10, 2026 10:43
Installs the chart at the merge-base with origin/main, drives the fixture
to a first variant.integrated, helm-upgrades in place to the worktree
chart (--reset-then-reuse-values, the amended docs §7 procedure), and
asserts: no immutable-field errors, every workload rolls to readiness,
event counts never regress (PVC/state survival), the full helm-smoke
end-state is still reached, and helm test passes post-upgrade.

Shared kind/event harness extracted from ci-smoke.sh into a sourced
ci-smoke-lib.sh (bash-3.2-clean); ci-smoke.sh behavior unchanged.

First catch, fix bundled: upgrading a pre-13d release left API-defaulted
rollingUpdate values on the task-store-server Deployment while 13d's
template sets strategy Recreate — the API server rejects the patch. The
template now renders an explicit 'rollingUpdate: null' so the upgrade
deletes the field. Validated locally in kind with helm 3.16.2 (CI's pin)
via EDEN_UPGRADE_BASELINE_REF=07482b6 (13a chart -> worktree chart).

Closes #284.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
P2: the 3-variant fixture often completes during setup's own rollout
waits, making post-upgrade progress assertions vacuous. The smoke now
derives a doubled-total (6-variant) config, waits for the 6-variant
end-state, and asserts strict post-upgrade progress keyed on
variant.integrated (the last, slowest-to-saturate chain stage —
task.completed saturates while integrations are still in flight).

P3: when merge-base(HEAD, origin/main) is HEAD itself (main push or
local run on main), fall back to HEAD~1 so the run still crosses a
real chart diff; never overrides an explicit EDEN_UPGRADE_BASELINE_REF.
P3: fetch origin/main with an explicit refspec, tolerated only when
origin/main already resolves; guard that the baseline contains the
chart at all.

Validated: two further full kind runs on helm 3.16.2 with the default
merge-base baseline; the last exercised the strict-progress branch
(4/6 integrated pre-upgrade -> 6/6 post).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…napshot

Comparing INTEGRATED_FINAL against the PRE-upgrade snapshot let
integrations absorbed into the upgrade window masquerade as
post-upgrade progress (run-3's trace showed exactly that: 4/6 pre,
6/6 already at the post-rollout snapshot). The gate now baselines on
INTEGRATED_POST — taken after helm upgrade --wait and every rollout
settled — so the progress proven belongs to the upgraded-and-ready
stack; the everything-integrated-during-the-window case logs a NOTE
instead of flaking.

Validated: full kind run on helm 3.16.2, default merge-base baseline,
with the strict branch exercised (4/6 at the post-rollout snapshot ->
6/6 final).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Rebased onto fc92bf0 (#305 managed Postgres): kept both appended CI
jobs in ci.yml; #305's ci-smoke-managed-postgres.sh landed against the
pre-lib harness shape, ported to ci-smoke-lib.sh in follow-up #310.
Re-validated the full upgrade smoke locally on the rebased tree
(baseline fc92bf0 -> worktree, helm 3.16.2): PASSED.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ealt ealt merged commit 6876c65 into main Jun 11, 2026
49 of 50 checks passed
@ealt ealt deleted the impl/issue-284-helm-upgrade-smoke branch June 11, 2026 22:04