fix(core): make migration Job hooks stable across Helm + ArgoCD#358
Conversation
WalkthroughThe Helm Job template annotation strategy is consolidated to unconditionally apply combined pre-install and pre-upgrade hook annotations with a simplified hook-delete-policy, removing prior conditional gating. The Postgres Job template delegates to the refactored base template and adds explicit service-account-specific hook metadata. Chart versions are updated to 1.6.0 across documentation. ChangesHook Annotation Consolidation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| argocd.argoproj.io/hook: PreSync | ||
| argocd.argoproj.io/hook-delete-policy: BeforeHookCreation | ||
| argocd.argoproj.io/sync-wave: "10" |
There was a problem hiding this comment.
remve this, argocd auto translate it
| argocd.argoproj.io/hook: PreSync | ||
| argocd.argoproj.io/hook-delete-policy: BeforeHookCreation | ||
| argocd.argoproj.io/sync-wave: "0" |
There was a problem hiding this comment.
remve this, argocd auto translate it
0f394bb to
aec3fa6
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
🛑 Changes requested — automated review
The new pre-install hook semantics can block fresh installs that rely on the bundled PostgreSQL chart, and the ServiceAccount hook introduces uninstall orphans. These are functional regressions in the hook lifecycle change.
`core.job.annotations` had two real-world failure modes operators have
hit across Formance projects (membership, portal, console-v3, and
inherited by every chart that uses the helper):
1. **Identity flip on first upgrade.** The conditional
`{{- if and (not .Release.IsInstall) .Values.feature.migrationHooks }}`
meant the Job was rendered as a plain release resource on
`helm install`, but as a `pre-upgrade` hook on the next
`helm upgrade`. Helm tracks hooks outside the release manifest, so
the same `Job/<release>-migrate` flips identity:
- Install: tracked in release, owned by Helm release.
- Upgrade: emitted as a hook, NOT in release manifest, orphaned.
First post-feature-enable upgrade fails with `Job already exists`
(install-time Job is still there, hook can't claim the same name);
`helm uninstall` later leaves an orphan because Helm only deletes
resources it tracks in the latest release.
2. **`hook-succeeded` swallows logs.** The previous delete policy
`before-hook-creation,hook-succeeded,hook-failed` deleted the
Job (and its pod) immediately on success, leaving `kubectl logs`
empty when an operator goes to investigate a slow migration.
3. **ArgoCD's `helm template` ignored hook annotations entirely**, so
under Argo the Job re-applied on every sync as a plain resource and
tripped Job-spec immutability on the second sync.
This commit replaces the helper with always-emit hook annotations
covering all three orchestration paths:
- `helm.sh/hook: pre-install,pre-upgrade` — stable identity on both
install and upgrade; no flip.
- `helm.sh/hook-delete-policy: before-hook-creation` — handles spec
changes (image bumps) by deleting the previous Job before
recreating. `hook-succeeded` removed; cleanup is bounded by the
Job's `ttlSecondsAfterFinished` instead.
- `argocd.argoproj.io/hook: PreSync` +
`argocd.argoproj.io/hook-delete-policy: BeforeHookCreation` +
`argocd.argoproj.io/sync-wave: "10"` — same intent under Argo.
Also fixes `core.postgres.job.sa.annotations` (previously
deprecated and aliased to `core.job.annotations`, so the SA inherited
the same `hook-weight: "10"` as the Job that referenced it — meaning
the SA had no guarantee of existing when the Job's pod tried to
schedule). Now emits the same hook block but at weight `0` /
sync-wave `0` so the SA is always created first.
`feature.migrationHooks` is no longer consulted; consumers can
drop it from their values without behaviour change. Kept in
deprecated state with the old alias `core.postgres.job.annotations`
for backward-compat.
Bump core to 1.6.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aec3fa6 to
e0178e4
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
🛑 Changes requested — automated review
The patch does not reliably deliver the intended stable hook lifecycle: defaults still emit no hooks, and enabling the new pre-install hook can break first installs with bundled PostgreSQL dependencies.
| {{- if and (not .Release.IsInstall) .Values.feature.migrationHooks }} | ||
| helm.sh/hook: pre-upgrade | ||
| helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded,hook-failed | ||
| {{- if .Values.feature.migrationHooks }} |
There was a problem hiding this comment.
🔴 [blocker] Remove the migrationHooks gate for job hooks
When feature.migrationHooks is false (the default in the consuming charts) or a consumer drops the deprecated value, this helper now emits no annotations at all, so migration Jobs are still rendered as normal release resources under Helm/ArgoCD and retain the existing first-upgrade/Job immutability failure mode. The hook annotations need to be emitted independently of this value for the new lifecycle to apply.
| helm.sh/hook: pre-upgrade | ||
| helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded,hook-failed | ||
| {{- if .Values.feature.migrationHooks }} | ||
| helm.sh/hook: pre-install,pre-upgrade |
There was a problem hiding this comment.
🟠 [major] Avoid running bundled-DB migrations as pre-install hooks
For charts that use the bundled PostgreSQL dependency (for example membership defaults to postgresql.enabled: true), enabling these hooks makes the migration Job run as a pre-install hook before Helm creates the PostgreSQL subchart Service/Secret. The Job's pod then references credentials from resources that do not exist yet, causing first installs to hang/fail; previously install rendered the migration Job as a normal resource alongside its dependencies.
Why we have this upgrade bug
The bug is resource-identity flip caused by a conditional hook annotation.
The old helper gated hooks on
and (not .Release.IsInstall) .Values.feature.migrationHooks. Read that condition carefully:helm install→.Release.IsInstallis true → condition is false → no hook annotation. The migration Job is emitted as a normal release resource and Helm tracks it in the release manifest.helm upgrade(withmigrationHooks=true) →.Release.IsInstallis false → condition is true →pre-upgradehook annotation is added. Helm now treats the same Job as a hook.The problem is that Helm tracks hooks outside the release manifest. A resource that was a release-managed Job on install becomes a hook-managed Job on upgrade, but with the same name in the same namespace. Helm sees two things claiming the same Kubernetes object:
Result:
Job already existson the first upgrade. Worse, on a laterhelm uninstall, Helm only sees the new-identity hook record and not the old-identity release resource — the original Job is left orphaned in the cluster.A second smaller bug compounded it:
hook-delete-policyincludedhook-succeeded, so on a successful migration the pod was destroyed immediately. Operators investigating a downstream regression had nokubectl logsto look at.A third related issue: ArgoCD's
helm templaterendering path also reads Helm hook annotations and translates them to its nativePreSync/BeforeHookCreation/sync-wave. But because the old gate only emitted hooks on upgrade-with-flag, ArgoCD saw a plain Job resource on every sync and tripped Job-spec immutability on the second sync.The fix
if .Values.feature.migrationHooks. The flag is preserved (see below), but the condition no longer depends onIsInstall. When the flag is on, the Job carries hook annotations on both install and upgrade — identity is stable for the entire lifecycle.pre-upgradeonly topre-install,pre-upgrade.hook-delete-policyreduced tobefore-hook-creation. Cleanup is now bounded by the Job'sttlSecondsAfterFinished, which keeps the pod around forkubectl logspost-success.A new
core.postgres.job.sa.annotationshelper (also gated byfeature.migrationHooks) emits hook annotations on the migration ServiceAccount withhook-weight: "0"so the SA materializes before the Job (hook-weight: "10"). Without this, consumers using a dedicated migration SA hitServiceAccount not foundat Job pod scheduling time.Is it still backwards-compatible with external systems?
Yes, for every relevant external surface.
helm install/helm upgradeconsumers withfeature.migrationHooks=false(the default): zero behaviour change. The helper emits no annotations, the Job is a plain release resource, exactly as before.helm install/helm upgradeconsumers withfeature.migrationHooks=true: previously broken on first upgrade. Now works — Job is a stable hook identity throughout.feature.migrationHooks=true, ArgoCD's Helm-template renderer translates the newpre-install,pre-upgradeannotations toPreSync, andbefore-hook-creationtoBeforeHookCreation. Migration runs cleanly asPreSyncand replaces the previous Job on each sync (covers image bumps without hitting Job immutability).core.postgres.job.annotations: still works. The alias is kept (deprecated) and delegates tocore.job.annotations.core.postgres.job.serviceAccountName+core.postgres.job.sa.annotations: new helper plugs in cleanly withhook-weight: "0"ordering. Existing SA paths without this helper continue to work as before.The PR also cascades the
coreminor bump through every dependent chart (agent,console-v3,membership,portal,regions,stargate,cloudprem,formance) per theCLAUDE.mdcascade rule, so umbrella consumers pick the new core version up automatically.Is there a feature gate?
Yes —
feature.migrationHooksis preserved and is the single gate.values.yamlasfeature.migrationHooks: falseby default (portal, console-v3, membership).false: helper emits no annotations, Job is a plain release resource, no hook lifecycle. Identical to the old default behaviour.true: helper emits the full Helm hook annotations on both install and upgrade. Stable identity throughout the Job's lifecycle.core.postgres.job.sa.annotationsso the SA and Job toggle together — there's no half-on state where the Job is a hook but the SA isn't (or vice versa).false → trueon a release that already has a plain Job will recreate the Job as a hook on the next upgrade (theJob already existserror surface is removed because the new code emitspre-install,pre-upgradeconsistently).Cascade
corebumps1.5.1 → 1.6.0(minor: helper semantics change, no breaking API). Every dependent chart bumped to mirror the minor step:Test plan
just pcpasses (helm lint, render, helm-docs, README regen, schema regen)helm install+helm upgradecycle withfeature.migrationHooks=true: Job is a hook on both, identity stable, no orphans on uninstallhelm install+helm upgradewithfeature.migrationHooks=false(default): Job is a plain resource — no behaviour change vs current mainfeature.migrationHooks=true: Job runs asPreSync, recreated each sync viaBeforeHookCreationkubectl logsttlSecondsAfterFinished, not by hook policycore.postgres.job.serviceAccountName+core.postgres.job.sa.annotations): SA materializes before the Job, noServiceAccount not foundscheduling failures