fix(core): make migration Job hooks stable across Helm + ArgoCD #358

Merged: Dav-14 merged 1 commit into main from fix/job-hooks-and-migration-sa on Jun 22, 2026

Dav-14 commented May 6, 2026

Why we have this upgrade bug

The bug is a resource-identity flip caused by a conditional hook annotation.

The old helper gated hooks on the condition and (not .Release.IsInstall) .Values.feature.migrationHooks. Read that condition carefully:

  • On helm install, .Release.IsInstall is true → condition is false → no hook annotation. The migration Job is emitted as a normal release resource and Helm tracks it in the release manifest.
  • On the next helm upgrade (with migrationHooks=true) → .Release.IsInstall is false → condition is true → pre-upgrade hook annotation is added. Helm now treats the same Job as a hook.
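
For reference, the old gate looked roughly like the sketch below. The condition and the two annotation lines are the ones this PR removes; the surrounding define block and its whitespace trimming are assumptions made for illustration, not a copy of charts/core/templates/_job.tpl.

```yaml
{{- /* Sketch of the OLD helper. Only the if-condition and the two
       annotation lines come from the PR diff; the rest is assumed. */ -}}
{{- define "core.job.annotations" -}}
{{- if and (not .Release.IsInstall) .Values.feature.migrationHooks }}
helm.sh/hook: pre-upgrade
helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded,hook-failed
{{- end }}
{{- end }}
```

On install the condition is false, so the helper renders nothing and the Job is a plain release resource. On upgrade the same helper renders hook annotations, which is the identity flip described here.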

The problem is that Helm tracks hooks outside the release manifest. A resource that was a release-managed Job on install becomes a hook-managed Job on upgrade, but with the same name in the same namespace. Helm sees two things claiming the same Kubernetes object:

  1. The release manifest still references the old Job (as a release resource).
  2. The hook lifecycle tries to create a new Job with the same name.

Result: Job already exists on the first upgrade. Worse, on a later helm uninstall, Helm only sees the new-identity hook record and not the old-identity release resource — the original Job is left orphaned in the cluster.

A second, smaller bug compounded it: hook-delete-policy included hook-succeeded, so on a successful migration Helm deleted the Job, and with it the pod, immediately. Operators investigating a downstream regression had no pod left to run kubectl logs against.

A third related issue: ArgoCD's helm template rendering path also reads Helm hook annotations and translates them to its native PreSync / BeforeHookCreation / sync-wave. But because the old gate only emitted hooks on upgrade-with-flag, ArgoCD saw a plain Job resource on every sync and tripped Job-spec immutability on the second sync.

The fix

  • Gate simplified to if .Values.feature.migrationHooks. The flag is preserved (see below), but the condition no longer depends on IsInstall. When the flag is on, the Job carries hook annotations on both install and upgrade — identity is stable for the entire lifecycle.
  • Hooks broadened from pre-upgrade only to pre-install,pre-upgrade.
  • hook-delete-policy reduced to before-hook-creation. Cleanup is now bounded by the Job's ttlSecondsAfterFinished, which keeps the pod around for kubectl logs post-success.
  • ArgoCD annotations are intentionally NOT emitted — ArgoCD auto-translates the Helm hook annotations.
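
A minimal sketch of the simplified gate and how a Job might consume it. The annotation values and the hook-weight of "10" come from this description; the Job name, image, ttlSecondsAfterFinished value, and the `with include` wiring are hypothetical and only illustrate the pattern.

```yaml
{{- /* Sketch of the NEW helper: same annotations on install and upgrade. */ -}}
{{- define "core.job.annotations" -}}
{{- if .Values.feature.migrationHooks -}}
helm.sh/hook: pre-install,pre-upgrade
helm.sh/hook-delete-policy: before-hook-creation
helm.sh/hook-weight: "10"
{{- end -}}
{{- end -}}
apiVersion: batch/v1
kind: Job
metadata:
  # Hypothetical name, used here only for illustration.
  name: {{ .Release.Name }}-migrate
  # "with" skips the annotations key entirely when the helper renders nothing.
  {{- with include "core.job.annotations" . }}
  annotations:
    {{- . | nindent 4 }}
  {{- end }}
spec:
  # Assumed value. Cleanup after completion is left to the TTL controller,
  # so the pod stays available for kubectl logs until the TTL expires.
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          # Placeholder image, not taken from the chart.
          image: example.invalid/migrate:latest
```

With the flag off, the `with` guard omits the annotations key, so the rendered Job matches a plain release resource.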

A new core.postgres.job.sa.annotations helper (also gated by feature.migrationHooks) emits hook annotations on the migration ServiceAccount with hook-weight: "0" so the SA materializes before the Job (hook-weight: "10"). Without this, consumers using a dedicated migration SA hit ServiceAccount not found at Job pod scheduling time.
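
The ordering can be sketched as follows. The helper name, the gate, and the weights ("0" for the ServiceAccount, "10" for the Job) are from this PR; the delete policy on the SA and the ServiceAccount name are assumptions. Helm runs hooks for the same event in ascending weight order, so the SA is created before the Job.

```yaml
{{- /* Sketch of the SA helper; the delete-policy line is assumed to
       mirror the Job helper. */ -}}
{{- define "core.postgres.job.sa.annotations" -}}
{{- if .Values.feature.migrationHooks -}}
helm.sh/hook: pre-install,pre-upgrade
helm.sh/hook-delete-policy: before-hook-creation
helm.sh/hook-weight: "0"
{{- end -}}
{{- end -}}
apiVersion: v1
kind: ServiceAccount
metadata:
  # Hypothetical name; the Job's serviceAccountName would have to match it.
  name: {{ .Release.Name }}-migrate
  {{- with include "core.postgres.job.sa.annotations" . }}
  annotations:
    {{- . | nindent 4 }}
  {{- end }}
```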

Is it still backwards-compatible with external systems?

Yes, for every relevant external surface.

  • Plain helm install / helm upgrade consumers with feature.migrationHooks=false (the default): zero behaviour change. The helper emits no annotations, the Job is a plain release resource, exactly as before.
  • Plain helm install / helm upgrade consumers with feature.migrationHooks=true: previously broken on first upgrade. Now works — Job is a stable hook identity throughout.
  • ArgoCD consumers: no values change. With feature.migrationHooks=true, ArgoCD's Helm-template renderer translates the new pre-install,pre-upgrade annotations to PreSync, and before-hook-creation to BeforeHookCreation. Migration runs cleanly as PreSync and replaces the previous Job on each sync (covers image bumps without hitting Job immutability).
  • Consumers using the legacy alias core.postgres.job.annotations: still works. The alias is kept (deprecated) and delegates to core.job.annotations.
  • Consumers using a dedicated migration SA via core.postgres.job.serviceAccountName + core.postgres.job.sa.annotations: new helper plugs in cleanly with hook-weight: "0" ordering. Existing SA paths without this helper continue to work as before.
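
As an illustration of the ArgoCD point, the fragment below shows the Helm annotations the chart emits, with comments naming the ArgoCD equivalents that an earlier revision of this PR set explicitly. It is a sketch of the mapping as described in this PR, not output captured from ArgoCD.

```yaml
# Rendered Job metadata with feature.migrationHooks=true (Helm annotations only).
metadata:
  annotations:
    # Treated by ArgoCD like argocd.argoproj.io/hook: PreSync
    helm.sh/hook: pre-install,pre-upgrade
    # Treated like argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    helm.sh/hook-delete-policy: before-hook-creation
    # Treated like argocd.argoproj.io/sync-wave: "10"
    helm.sh/hook-weight: "10"
```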

The PR also cascades the core minor bump through every dependent chart (agent, console-v3, membership, portal, regions, stargate, cloudprem, formance) per the CLAUDE.md cascade rule, so umbrella consumers pick the new core version up automatically.

Is there a feature gate?

Yes — feature.migrationHooks is preserved and is the single gate.

  • Defined per consumer chart in values.yaml as feature.migrationHooks: false by default (portal, console-v3, membership).
  • When false: helper emits no annotations, Job is a plain release resource, no hook lifecycle. Identical to the old default behaviour.
  • When true: helper emits the full Helm hook annotations on both install and upgrade. Stable identity throughout the Job's lifecycle.
  • The flag also gates core.postgres.job.sa.annotations so the SA and Job toggle together — there's no half-on state where the Job is a hook but the SA isn't (or vice versa).
  • Consumers can flip the flag without coordinating a chart-version change; flipping false → true on a release that already has a plain Job will recreate the Job as a hook on the next upgrade (the Job already exists error surface is removed because the new code emits pre-install,pre-upgrade consistently).
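
A sketch of the two states of the gate as a consumer would set them. The key path feature.migrationHooks and its default are from this PR; the override file name in the comment is hypothetical.

```yaml
# values.yaml in a consumer chart (portal, console-v3, membership): default
feature:
  migrationHooks: false   # Job and SA are plain release resources
---
# Hypothetical override file, e.g. values-hooks.yaml, passed with -f
feature:
  migrationHooks: true    # Job and SA both carry pre-install,pre-upgrade hooks
```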

Cascade

core bumps 1.5.1 → 1.6.0 (minor: helper semantics change, no breaking API). Every dependent chart bumped to mirror the minor step:

| Chart | Old | New |
| --- | --- | --- |
| core | 1.5.1 | 1.6.0 |
| agent | 2.14.0 | 2.15.0 |
| console-v3 | 3.6.2 | 3.7.0 |
| membership | 3.5.0 | 3.6.0 |
| portal | 3.6.2 | 3.7.0 |
| regions | 3.10.7 | 3.11.0 |
| stargate | 0.10.1 | 0.11.0 |
| cloudprem | 4.9.2 | 4.10.0 |
| formance | 1.13.4 | 1.14.0 |

Test plan

  • just pc passes (helm lint, render, helm-docs, README regen, schema regen)
  • helm install + helm upgrade cycle with feature.migrationHooks=true: Job is a hook on both, identity stable, no orphans on uninstall
  • helm install + helm upgrade with feature.migrationHooks=false (default): Job is a plain resource — no behaviour change vs current main
  • ArgoCD sync with feature.migrationHooks=true: Job runs as PreSync, recreated each sync via BeforeHookCreation
  • Failed migration: pod stays around for kubectl logs
  • Successful migration: Job cleaned up by ttlSecondsAfterFinished, not by hook policy
  • Dedicated migration SA path (consumer sets core.postgres.job.serviceAccountName + core.postgres.job.sa.annotations): SA materializes before the Job, no ServiceAccount not found scheduling failures

Dav-14 requested a review from a team as a code owner May 6, 2026 15:12

coderabbitai Bot commented May 6, 2026

Walkthrough

The Helm Job template annotation strategy is consolidated to unconditionally apply combined pre-install and pre-upgrade hook annotations with a simplified hook-delete-policy, removing prior conditional gating. The Postgres Job template delegates to the refactored base template and adds explicit service-account-specific hook metadata. Chart versions are updated to 1.6.0 across documentation.

Changes

Hook Annotation Consolidation

| Layer / File(s) | Summary |
| --- | --- |
| Base Job Template & Documentation: charts/core/templates/_job.tpl | Job annotation docstring rewritten to describe always-on dual-hook strategy; conditional pre-upgrade gating and feature flag checks removed; core.job.annotations now unconditionally emits helm.sh/hook: pre-install,pre-upgrade and helm.sh/hook-delete-policy: before-hook-creation. |
| Postgres Template Updates: charts/core/templates/_postgres.tpl | core.postgres.job.annotations redefined with deprecation note delegating to refactored core.job.annotations; new core.postgres.job.sa.annotations added with explicit Helm hook metadata (pre-install, pre-upgrade, hook-weight) for service-account migration ordering. |
| Version Documentation: README.md, charts/core/README.md | Chart version bumped from 1.5.1 to 1.6.0 in root README table and core chart README badge. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

docs

Suggested reviewers

  • sylr

A rabbit hops through Helm charts with glee, 🐰
No more conditions to gate what should be,
Hooks run on install and upgrade as one,
Migrations sync smooth—the refactoring's done!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix(core): make migration Job hooks stable across Helm + ArgoCD' directly matches the main objective of the PR, which is to fix stability issues with migration Job hooks across multiple deployment contexts. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description clearly explains the bug (resource-identity flip caused by conditional hook annotations), the fix (unconditional hook emission on both install and upgrade), backwards compatibility guarantees, and test plan. |

github-actions Bot added the bug and release labels May 6, 2026
Comment thread charts/core/templates/_job.tpl Outdated
Comment on lines +38 to +40
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
argocd.argoproj.io/sync-wave: "10"

Dav-14 (author) commented:

Remove this; ArgoCD auto-translates it.

Comment thread charts/core/templates/_postgres.tpl Outdated
Comment on lines +18 to +20
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
argocd.argoproj.io/sync-wave: "0"

Dav-14 (author) commented:

Remove this; ArgoCD auto-translates it.

Dav-14 force-pushed the fix/job-hooks-and-migration-sa branch from 0f394bb to aec3fa6 June 22, 2026 07:27
Dav-14 enabled auto-merge (squash) June 22, 2026 07:28
github-actions Bot added the docs label Jun 22, 2026

NumaryBot left a comment

🛑 Changes requested — automated review

The new pre-install hook semantics can block fresh installs that rely on the bundled PostgreSQL chart, and the ServiceAccount hook introduces uninstall orphans. These are functional regressions in the hook lifecycle change.

`core.job.annotations` had two real-world failure modes operators have
hit across Formance projects (membership, portal, console-v3, and
inherited by every chart that uses the helper):

1. **Identity flip on first upgrade.** The conditional
   `{{- if and (not .Release.IsInstall) .Values.feature.migrationHooks }}`
   meant the Job was rendered as a plain release resource on
   `helm install`, but as a `pre-upgrade` hook on the next
   `helm upgrade`. Helm tracks hooks outside the release manifest, so
   the same `Job/<release>-migrate` flips identity:
     - Install: tracked in release, owned by Helm release.
     - Upgrade: emitted as a hook, NOT in release manifest, orphaned.
   First post-feature-enable upgrade fails with `Job already exists`
   (install-time Job is still there, hook can't claim the same name);
   `helm uninstall` later leaves an orphan because Helm only deletes
   resources it tracks in the latest release.

2. **`hook-succeeded` swallows logs.** The previous delete policy
   `before-hook-creation,hook-succeeded,hook-failed` deleted the
   Job (and its pod) immediately on success, leaving `kubectl logs`
   empty when an operator goes to investigate a slow migration.

3. **ArgoCD's `helm template` ignored hook annotations entirely**, so
   under Argo the Job re-applied on every sync as a plain resource and
   tripped Job-spec immutability on the second sync.

This commit replaces the helper with always-emit hook annotations
covering all three orchestration paths:

- `helm.sh/hook: pre-install,pre-upgrade` — stable identity on both
  install and upgrade; no flip.
- `helm.sh/hook-delete-policy: before-hook-creation` — handles spec
  changes (image bumps) by deleting the previous Job before
  recreating. `hook-succeeded` removed; cleanup is bounded by the
  Job's `ttlSecondsAfterFinished` instead.
- `argocd.argoproj.io/hook: PreSync` +
  `argocd.argoproj.io/hook-delete-policy: BeforeHookCreation` +
  `argocd.argoproj.io/sync-wave: "10"` — same intent under Argo.

Also fixes `core.postgres.job.sa.annotations` (previously
deprecated and aliased to `core.job.annotations`, so the SA inherited
the same `hook-weight: "10"` as the Job that referenced it — meaning
the SA had no guarantee of existing when the Job's pod tried to
schedule). Now emits the same hook block but at weight `0` /
sync-wave `0` so the SA is always created first.

`feature.migrationHooks` is no longer consulted; consumers can
drop it from their values without behaviour change. Kept in
deprecated state with the old alias `core.postgres.job.annotations`
for backward-compat.

Bump core to 1.6.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Dav-14 force-pushed the fix/job-hooks-and-migration-sa branch from aec3fa6 to e0178e4 June 22, 2026 07:33

NumaryBot left a comment

🛑 Changes requested — automated review

The patch does not reliably deliver the intended stable hook lifecycle: defaults still emit no hooks, and enabling the new pre-install hook can break first installs with bundled PostgreSQL dependencies.

-{{- if and (not .Release.IsInstall) .Values.feature.migrationHooks }}
-helm.sh/hook: pre-upgrade
-helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded,hook-failed
+{{- if .Values.feature.migrationHooks }}

🔴 [blocker] Remove the migrationHooks gate for job hooks

When feature.migrationHooks is false (the default in the consuming charts) or a consumer drops the deprecated value, this helper now emits no annotations at all, so migration Jobs are still rendered as normal release resources under Helm/ArgoCD and retain the existing first-upgrade/Job immutability failure mode. The hook annotations need to be emitted independently of this value for the new lifecycle to apply.

-helm.sh/hook: pre-upgrade
-helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded,hook-failed
+{{- if .Values.feature.migrationHooks }}
+helm.sh/hook: pre-install,pre-upgrade

🟠 [major] Avoid running bundled-DB migrations as pre-install hooks

For charts that use the bundled PostgreSQL dependency (for example membership defaults to postgresql.enabled: true), enabling these hooks makes the migration Job run as a pre-install hook before Helm creates the PostgreSQL subchart Service/Secret. The Job's pod then references credentials from resources that do not exist yet, causing first installs to hang/fail; previously install rendered the migration Job as a normal resource alongside its dependencies.

Dav-14 merged commit e594a3d into main Jun 22, 2026
1 check passed
Dav-14 deleted the fix/job-hooks-and-migration-sa branch June 22, 2026 07:41