fix: keep failed-to-prune resources in status.Inventory (#1664) #1665

Merged
matheuscscp merged 2 commits into fluxcd:main from gecube:fix/1664-keep-failed-prune-in-inventory
Jun 12, 2026


@gecube gecube commented Jun 11, 2026

Contributor

Summary

Fixes #1664. When status.Inventory is advanced to the new build output before the prune step succeeds, and prune subsequently fails because the apiserver rejects the DELETE (admission webhook denial, in-use resource, etc.), the stale resources are silently lost from Flux's tracking. They:

  • are no longer in status.Inventory,
  • still carry the kustomize.toolkit.fluxcd.io/name=… label,
  • are not surfaced as prunable by the next reconcile's old-vs-new diff (the "old" snapshot has already advanced past them),
  • cannot be re-pruned by flux reconcile — only manual kubectl delete removes them.

Result: long-lived orphans labeled by the Kustomization but invisible to its GC.
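
To make the failure mode concrete, here is a minimal sketch of an old-vs-new inventory diff. It is not the controller's code: `staleIDs` and the plain string IDs are hypothetical stand-ins, and the only assumption is that prune candidates are "in the old inventory, not in the new one".

```go
package main

import "fmt"

// staleIDs returns the IDs present in the old inventory but absent from the
// new one, i.e. the prune candidates. Hypothetical helper, not Flux code.
func staleIDs(oldInv, newInv []string) []string {
	keep := make(map[string]struct{}, len(newInv))
	for _, id := range newInv {
		keep[id] = struct{}{}
	}
	var stale []string
	for _, id := range oldInv {
		if _, ok := keep[id]; !ok {
			stale = append(stale, id)
		}
	}
	return stale
}

func main() {
	applied := []string{"app-1a", "app-1b", "app-1c"}
	desired := []string{"app-flat"}

	// First reconcile: the diff finds the three per-AZ objects, but the
	// apiserver rejects their DELETE.
	fmt.Println(staleIDs(applied, desired))

	// Pre-fix: the stored inventory has already advanced to `desired`, so
	// the next reconcile diffs desired against desired and finds nothing.
	fmt.Println(staleIDs(desired, desired))
}
```

The second call is the bug in miniature: once the "old" side no longer contains the failed objects, no later diff can surface them.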

Behaviour change

prune() now also returns the slice of survivor objects whose DELETE wasn't confirmed (no DeletedAction or SkippedAction ChangeSetEntry). The reconcile path merges those survivors back into status.Inventory immediately before recording PruneFailedReason and returning the error. The next reconcile then sees them in the diff again and retries the prune.

Treating SkippedAction entries as settled means resources explicitly opted out (kustomize.toolkit.fluxcd.io/prune: disabled) are not re-tracked — keeping them would create an infinite retry loop for opt-outs.
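
For reviewers who want the survivor rule in isolation, a rough sketch follows. The `Action` and `Entry` types are simplified stand-ins I made up for illustration (the real code works on ChangeSetEntry values), and `survivors` is a hypothetical name, so treat this as a description of the rule rather than the patch itself.

```go
package main

import "fmt"

// Action mirrors the action names discussed in this PR. The type is a
// stand-in for illustration, not the real one from the SSA package.
type Action string

const (
	DeletedAction Action = "deleted"
	SkippedAction Action = "skipped"
	UnknownAction Action = "unknown"
)

// Entry is a hypothetical, simplified change set entry keyed by object ID.
type Entry struct {
	ID     string
	Action Action
}

// survivors returns the stale IDs whose deletion is not settled, meaning
// they have no DeletedAction or SkippedAction entry in the change set.
func survivors(stale []string, entries []Entry) []string {
	settled := make(map[string]struct{}, len(entries))
	for _, e := range entries {
		if e.Action == DeletedAction || e.Action == SkippedAction {
			settled[e.ID] = struct{}{}
		}
	}
	var out []string
	for _, id := range stale {
		if _, ok := settled[id]; !ok {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	stale := []string{"deleted-cm", "opted-out-cm", "unknown-cm", "denied-cm"}
	entries := []Entry{
		{ID: "deleted-cm", Action: DeletedAction},
		{ID: "opted-out-cm", Action: SkippedAction},
		{ID: "unknown-cm", Action: UnknownAction},
		// "denied-cm" has no entry at all: its DELETE was rejected.
	}
	fmt.Println(survivors(stale, entries))
}
```

Only the unknown-action and missing-entry objects come back as survivors; the opted-out object does not, which is what avoids the retry loop described above.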

Paths NOT affected:

  • spec.prune: false, where prune() short-circuits before any of the new logic runs.
  • The finalizer path (Kustomization deletion) uses deleteObjects directly, not prune().
  • Successful prunes — survivors is empty, no merge occurs.

Reproducer

I built a minimal kind-based reproducer at https://github.com/gecube/flux-kc-1664-repro. The relevant variant — repro-restart-matrix.sh scenario webhook-deny-delete — reproduces the exact production state from #1664 (3/3 consecutive runs): per-AZ ConfigMaps survive past the consolidation commit, deletionTimestamp not set, status.Inventory already advanced past them. With this patch applied to a locally-built kustomize-controller, the same reproducer should leave the orphans in inventory and retry prune on the next reconcile (I have not yet rebuilt-and-loaded a custom controller image into kind to verify end-to-end; happy to do so if useful before merge).

Tests

  • internal/inventory: new subtest merge_re-adds_objects_whose_prune_failed plus a nil-safety subtest.
  • internal/controller: new file kustomization_prune_survivors_test.go with unit-level coverage of pruneSurvivors across the DeletedAction / SkippedAction / UnknownAction / missing-entry cases.
  • Existing tests pass locally:
    • go test ./internal/inventory/...
    • go test ./internal/controller/... -run "Prune|Inventory|DeletionPolicy"
  • go vet ./... clean, gofmt -l clean.
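
As a rough picture of what the merge subtests exercise, here is a sketch of an additive, idempotent, nil-safe merge. `Inventory` and `Entry` are simplified, hypothetical types and `Merge` here is my own illustration, not the signature of the helper in internal/inventory.

```go
package main

import "fmt"

// Entry and Inventory are simplified stand-ins for the status inventory
// types; only the fields needed for the illustration are modelled.
type Entry struct {
	ID      string
	Version string
}

type Inventory struct {
	Entries []Entry
}

// Merge appends the entries from extra that inv does not track yet.
// It is a no-op on a nil inventory and safe to call repeatedly.
func Merge(inv *Inventory, extra []Entry) {
	if inv == nil {
		return
	}
	seen := make(map[string]struct{}, len(inv.Entries))
	for _, e := range inv.Entries {
		seen[e.ID] = struct{}{}
	}
	for _, e := range extra {
		if _, ok := seen[e.ID]; ok {
			continue
		}
		seen[e.ID] = struct{}{}
		inv.Entries = append(inv.Entries, e)
	}
}

func main() {
	inv := &Inventory{Entries: []Entry{{ID: "kc-1664_app-flat__ConfigMap", Version: "v1"}}}
	failed := []Entry{
		{ID: "kc-1664_app-1a__ConfigMap", Version: "v1"},
		{ID: "kc-1664_app-flat__ConfigMap", Version: "v1"}, // already tracked
	}
	Merge(inv, failed)
	Merge(inv, failed) // second call changes nothing
	fmt.Println(len(inv.Entries))

	Merge(nil, failed) // nil inventory: returns without panicking
}
```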

Notes for maintainers

Marking this as a draft because:

  1. I haven't yet end-to-end-tested with a custom-built controller image against the kind reproducer (only unit-tested the new functions). Happy to do so before merge.
  2. There may be an alternative design you'd prefer — e.g. tracking the failed-prune subset separately in status (under staleResources or similar) rather than merging back into inventory; or surfacing an additional OrphanedResourcesDetected condition; or a different stance on SkippedAction re-tracking semantics. I picked the smallest change that restores GC convergence.
  3. The SkippedAction case is the subtle one — could you confirm my reading that an explicit prune: disabled opt-out should never end up in survivors? If you'd rather treat skipped-as-survivor, the change is a one-line swap.

Reviews welcome. Test plan after rebuild verification:

  • unit tests for pruneSurvivors and inventory.Merge
  • existing prune / inventory / deletion-policy tests pass locally
  • full local repro with rebuilt image (will run if useful)
  • CI green

gecube commented Jun 11, 2026

Contributor Author

E2E verification against the original repro

I rebuilt the controller with this PR (make docker-build IMG=local/kustomize-controller:fix-1664), loaded it into the kind cluster from https://github.com/gecube/flux-kc-1664-repro, and re-ran the webhook-deny-delete scenario that previously reproduced the orphan state 3/3 times on v1.8.8. Two checks:

Check 1 — orphans stay tracked while the webhook blocks DELETE

After stage C with a ValidatingAdmissionPolicy denying DELETE on app-1a/1b/1c:

$ kubectl -n flux-system get kustomization kc-1664 \
    -o jsonpath='{range .status.inventory.entries[*]}{.id}{"\n"}{end}'
kc-1664_app-flat__ConfigMap
kc-1664_app-1c__ConfigMap
kc-1664_app-1b__ConfigMap
kc-1664_app-1a__ConfigMap

The three resources that failed to prune stayed in status.Inventory, which is exactly what this PR promises. (On vanilla v1.8.8 the inventory would contain only app-flat and the per-AZ ConfigMaps would be permanently orphaned.)
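
For anyone reading the ids above, a small sketch of how they break down. It assumes the `<namespace>_<name>_<group>_<kind>` layout visible in the output (the empty third field being the core API group) and assumes none of the fields contain an underscore; `objRef` and `parseID` are hypothetical names, not controller code.

```go
package main

import (
	"fmt"
	"strings"
)

// objRef holds the four fields assumed to make up an inventory entry id.
type objRef struct {
	Namespace string
	Name      string
	Group     string
	Kind      string
}

// parseID splits an id of the assumed form namespace_name_group_kind.
// It rejects anything that does not have exactly four fields.
func parseID(id string) (objRef, error) {
	parts := strings.Split(id, "_")
	if len(parts) != 4 {
		return objRef{}, fmt.Errorf("unexpected inventory id %q", id)
	}
	return objRef{
		Namespace: parts[0],
		Name:      parts[1],
		Group:     parts[2],
		Kind:      parts[3],
	}, nil
}

func main() {
	ref, err := parseID("kc-1664_app-1a__ConfigMap")
	if err != nil {
		panic(err)
	}
	fmt.Printf("namespace=%s name=%s group=%q kind=%s\n",
		ref.Namespace, ref.Name, ref.Group, ref.Kind)
}
```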

Status surface is also actionable now:

Ready=False  reason=PruneFailed
message: delete failed, errors: ConfigMap/kc-1664/app-1c delete failed:
  configmaps "app-1c" is forbidden: ValidatingAdmissionPolicy
  'kc-1664-block-delete' with binding 'kc-1664-block-delete-binding'
  denied request: kc-1664-block-delete: refuse DELETE for app-1a/1b/1c; …

The operator now sees exactly which admission policy is blocking which DELETE.

Check 2 — Flux recovers automatically when the obstruction is removed

$ kubectl delete validatingadmissionpolicybinding kc-1664-block-delete-binding
$ kubectl delete validatingadmissionpolicy kc-1664-block-delete
$ flux reconcile kustomization kc-1664 -n flux-system
✔ applied revision main@sha1:350b6b0a…
$ kubectl -n kc-1664 get cm
NAME       DATA   AGE
app-flat   1      4m
$ kubectl -n flux-system get kustomization kc-1664 \
    -o jsonpath='{range .status.inventory.entries[*]}{.id}{"\n"}{end}'
kc-1664_app-flat__ConfigMap
$ kubectl -n flux-system get kustomization kc-1664 \
    -o jsonpath='Ready={.status.conditions[?(@.type=="Ready")].status}'
Ready=True

So the fix doesn't just expose the orphans — it lets the controller converge on its own once the cause is fixed. No manual kubectl delete step is required.

I haven't touched the --with-finalizer variant (where deletionTimestamp was getting set + a controller-owned finalizer blocked completion), since that path was already visible to operators in current Flux releases (Ready=False flips, the resource is Terminating). The bug surface this PR fixes is the silent one — admission rejections that never even reach the deletion phase.

Ready for review whenever you have a moment.

@gecube gecube marked this pull request as ready for review June 11, 2026 19:44
When the kustomize-controller advances status.Inventory to the new build
output before the prune step succeeds, and prune subsequently fails because
the apiserver rejects the DELETE (admission webhook denial, validation error,
in-use resource, etc.), the stale resources are silently lost from Flux's
tracking. status.Inventory no longer references them; the next reconcile's
old-vs-new inventory diff yields nothing for them; flux reconcile is a no-op;
and the resources remain on the cluster forever — still labeled with this
Kustomization but invisible to its GC.

This is reproducible with a minimal kind/Flux setup using a
ValidatingAdmissionPolicy that denies DELETE on the resources to be pruned
(see fluxcd#1664 for the reproducer and the original
production trigger, which was Karpenter's EC2NodeClass validation webhook
firing while a sibling NodeClaim still referenced the class being pruned).

The fix:

- prune() now also returns the slice of survivor objects whose DELETE was
  not confirmed by the apiserver (i.e. no DeletedAction / SkippedAction
  entry in the ChangeSet).
- The reconcile path merges those survivors back into status.Inventory
  before recording the PruneFailedReason and returning the error. The next
  reconcile then sees them again in old-vs-new and retries the prune.
- A new inventory.Merge helper performs the additive, idempotent insert.
- Behaviour with no failure, with spec.Prune=false, or with resources
  explicitly opted out via kustomize.toolkit.fluxcd.io/prune=disabled is
  unchanged: SkippedAction entries are treated as settled, so opt-outs do
  not produce an infinite retry loop.

Unit tests for pruneSurvivors and inventory.Merge cover the deleted/skipped/
unknown/missing action cases and the nil-safety paths.

Fixes fluxcd#1664

Signed-off-by: gecube <gb12335@gmail.com>
Signed-off-by: Matheus Pimenta <matheuscscp@gmail.com>
@stefanprodan stefanprodan added the area/server-side-apply label Jun 12, 2026

@matheuscscp matheuscscp left a comment

Member

LGTM! 🚀

@gecube Nice one!

@matheuscscp matheuscscp added the bug label Jun 12, 2026
@matheuscscp matheuscscp merged commit e4d3f6d into fluxcd:main Jun 12, 2026
8 checks passed
@gecube gecube deleted the fix/1664-keep-failed-prune-in-inventory branch June 12, 2026 14:00
Development

Successfully merging this pull request may close these issues.

Resources not pruned after a rapid series of commits where intermediate kustomize build output is broken

3 participants