fix: keep failed-to-prune resources in status.Inventory (#1664) by gecube · Pull Request #1665 · fluxcd/kustomize-controller

gecube · 2026-06-11T19:36:34Z

Summary

Fixes #1664. When status.Inventory is advanced to the new build output before the prune step succeeds, and prune subsequently fails because the apiserver rejects the DELETE (admission webhook denial, in-use resource, etc.), the stale resources are silently lost from Flux's tracking. They:

are no longer in status.Inventory,
still carry the kustomize.toolkit.fluxcd.io/name=… label,
are not surfaced as prunable by the next reconcile's old-vs-new diff (the "old" snapshot has already advanced past them),
cannot be re-pruned by flux reconcile — only manual kubectl delete removes them.

Result: long-lived orphans labeled by the Kustomization but invisible to its GC.
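
To make the failure mode concrete, here is a minimal sketch of an old-vs-new inventory diff. The function name `staleIDs` and the use of plain string IDs are assumptions for illustration only, not the controller's actual code. It shows that once the tracked inventory has advanced to the new build output, the diff has nothing left to report for the resources whose DELETE was denied.

```go
package main

import "fmt"

// staleIDs returns the IDs present in oldInv but absent from newInv,
// i.e. the prune candidates. Hypothetical helper, not the controller's API.
func staleIDs(oldInv, newInv []string) []string {
	keep := make(map[string]struct{}, len(newInv))
	for _, id := range newInv {
		keep[id] = struct{}{}
	}
	var stale []string
	for _, id := range oldInv {
		if _, ok := keep[id]; !ok {
			stale = append(stale, id)
		}
	}
	return stale
}

func main() {
	tracked := []string{
		"kc-1664_app-1a__ConfigMap",
		"kc-1664_app-1b__ConfigMap",
		"kc-1664_app-1c__ConfigMap",
	}
	built := []string{"kc-1664_app-flat__ConfigMap"}

	// First reconcile: the diff finds the per-AZ ConfigMaps, but DELETE is denied.
	fmt.Println("prune candidates:", len(staleIDs(tracked, built)))

	// The inventory was advanced to the build output despite the failed prune.
	tracked = built

	// Next reconcile: old and new are identical, so nothing is offered for pruning.
	fmt.Println("prune candidates:", len(staleIDs(tracked, built)))
}
```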

Behaviour change

prune() now also returns the slice of survivor objects whose DELETE wasn't confirmed (no DeletedAction or SkippedAction ChangeSetEntry). The reconcile path merges those survivors back into status.Inventory immediately before recording PruneFailedReason and returning the error. The next reconcile then sees them in the diff again and retries the prune.

Treating SkippedAction entries as settled means resources explicitly opted out (kustomize.toolkit.fluxcd.io/prune: disabled) are not re-tracked — keeping them would create an infinite retry loop for opt-outs.
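
The survivor rule described above can be sketched as follows. The `Action` and `ChangeSetEntry` types here are simplified local stand-ins, and the signature of `pruneSurvivors` is an assumption; the real function in this PR may take different arguments. The sketch only illustrates the rule that an object is settled when its entry is DeletedAction or SkippedAction, and a survivor otherwise (including when it has no entry at all).

```go
package main

import "fmt"

// Action and ChangeSetEntry are simplified stand-ins for illustration.
type Action string

const (
	DeletedAction Action = "deleted"
	SkippedAction Action = "skipped"
	UnknownAction Action = "unknown"
)

type ChangeSetEntry struct {
	Subject string
	Action  Action
}

// pruneSurvivors returns the stale subjects whose deletion was not settled:
// anything without a DeletedAction or SkippedAction entry. Assumed signature.
func pruneSurvivors(stale []string, entries []ChangeSetEntry) []string {
	settled := make(map[string]struct{}, len(entries))
	for _, e := range entries {
		if e.Action == DeletedAction || e.Action == SkippedAction {
			settled[e.Subject] = struct{}{}
		}
	}
	var survivors []string
	for _, s := range stale {
		if _, ok := settled[s]; !ok {
			survivors = append(survivors, s)
		}
	}
	return survivors
}

func main() {
	stale := []string{"cm/a", "cm/b", "cm/c", "cm/d"}
	entries := []ChangeSetEntry{
		{Subject: "cm/a", Action: DeletedAction}, // deleted: settled
		{Subject: "cm/b", Action: SkippedAction}, // opted out: settled
		{Subject: "cm/c", Action: UnknownAction}, // not confirmed: survivor
		// cm/d has no entry at all: survivor
	}
	fmt.Println(pruneSurvivors(stale, entries)) // prints [cm/c cm/d]
}
```

Swapping to the skipped-as-survivor stance would amount to dropping the `SkippedAction` half of the condition in this sketch.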

Paths NOT affected:

spec.prune: false — prune() short-circuits before any of the new logic runs.
The finalizer path (Kustomization deletion) uses deleteObjects directly, not prune().
Successful prunes — survivors is empty, no merge occurs.

Reproducer

I built a minimal kind-based reproducer at https://github.com/gecube/flux-kc-1664-repro. The relevant variant — repro-restart-matrix.sh scenario webhook-deny-delete — reproduces the exact production state from #1664 (3/3 consecutive runs): per-AZ ConfigMaps survive past the consolidation commit, deletionTimestamp not set, status.Inventory already advanced past them. With this patch applied to a locally-built kustomize-controller, the same reproducer should leave the orphans in inventory and retry prune on the next reconcile (I have not yet rebuilt-and-loaded a custom controller image into kind to verify end-to-end; happy to do so if useful before merge).

Tests

internal/inventory: new subtest merge_re-adds_objects_whose_prune_failed plus a nil-safety subtest.
internal/controller: new file kustomization_prune_survivors_test.go with unit-level coverage of pruneSurvivors across the DeletedAction / SkippedAction / UnknownAction / missing-entry cases.
Existing tests pass locally:
- go test ./internal/inventory/...
- go test ./internal/controller/... -run "Prune|Inventory|DeletionPolicy"
go vet ./... clean, gofmt -l clean.
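
For readers who want a picture of what the inventory subtests exercise, below is a rough sketch of an additive, idempotent, nil-safe merge. The `Entry` and `Inventory` types and the `Merge` signature are assumptions made for this example; the helper added by the PR may be shaped differently.

```go
package main

import "fmt"

// Entry and Inventory are simplified stand-ins for illustration.
type Entry struct {
	ID      string
	Version string
}

type Inventory struct {
	Entries []Entry
}

// Merge appends entries from extra whose ID is not already tracked.
// It is a no-op on a nil inventory and safe to call repeatedly.
func Merge(inv *Inventory, extra []Entry) {
	if inv == nil {
		return
	}
	seen := make(map[string]struct{}, len(inv.Entries))
	for _, e := range inv.Entries {
		seen[e.ID] = struct{}{}
	}
	for _, e := range extra {
		if _, ok := seen[e.ID]; ok {
			continue // already tracked, keep the existing entry
		}
		seen[e.ID] = struct{}{}
		inv.Entries = append(inv.Entries, e)
	}
}

func main() {
	inv := &Inventory{Entries: []Entry{{ID: "kc-1664_app-flat__ConfigMap", Version: "v1"}}}
	survivors := []Entry{
		{ID: "kc-1664_app-1a__ConfigMap", Version: "v1"},
		{ID: "kc-1664_app-1b__ConfigMap", Version: "v1"},
		{ID: "kc-1664_app-1c__ConfigMap", Version: "v1"},
	}

	Merge(inv, survivors)
	Merge(inv, survivors) // second call adds nothing
	fmt.Println(len(inv.Entries))

	Merge(nil, survivors) // nil inventory: no panic
}
```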

Notes for maintainers

Marking this draft because:

I haven't yet end-to-end-tested with a custom-built controller image against the kind reproducer (only unit-tested the new functions). Happy to do so before merge.
There may be an alternative design you'd prefer — e.g. tracking the failed-prune subset separately in status (under staleResources or similar) rather than merging back into inventory; or surfacing an additional OrphanedResourcesDetected condition; or a different stance on SkippedAction re-tracking semantics. I picked the smallest change that restores GC convergence.
The SkippedAction case is the subtle one — could you confirm my reading that an explicit prune: disabled opt-out should never end up in survivors? If you'd rather treat skipped-as-survivor, the change is a one-line swap.

Reviews welcome. Test plan after rebuild verification:

unit tests for pruneSurvivors and inventory.Merge
existing prune / inventory / deletion-policy tests pass locally
full local repro with rebuilt image (will run if useful)
CI green

gecube · 2026-06-11T19:44:49Z

E2E verification against the original repro

I rebuilt the controller with this PR (make docker-build IMG=local/kustomize-controller:fix-1664), loaded it into the kind cluster from https://github.com/gecube/flux-kc-1664-repro, and re-ran the webhook-deny-delete scenario that previously reproduced the orphan state 3/3 times on v1.8.8. Two checks:

Check 1 — orphans stay tracked while the webhook blocks DELETE

After stage C with a ValidatingAdmissionPolicy denying DELETE on app-1a/1b/1c:

$ kubectl -n flux-system get kustomization kc-1664 \
    -o jsonpath='{range .status.inventory.entries[*]}{.id}{"\n"}{end}'
kc-1664_app-flat__ConfigMap
kc-1664_app-1c__ConfigMap
kc-1664_app-1b__ConfigMap
kc-1664_app-1a__ConfigMap

The three resources that failed to prune stayed in status.Inventory, which is exactly what this PR promises. (On vanilla v1.8.8 the inventory would contain only app-flat and the per-AZ ConfigMaps would be permanently orphaned.)

Status surface is also actionable now:

Ready=False  reason=PruneFailed
message: delete failed, errors: ConfigMap/kc-1664/app-1c delete failed:
  configmaps "app-1c" is forbidden: ValidatingAdmissionPolicy
  'kc-1664-block-delete' with binding 'kc-1664-block-delete-binding'
  denied request: kc-1664-block-delete: refuse DELETE for app-1a/1b/1c; …

The operator now sees exactly which webhook is blocking which DELETE.

Check 2 — Flux recovers automatically when the obstruction is removed

$ kubectl delete validatingadmissionpolicybinding kc-1664-block-delete-binding
$ kubectl delete validatingadmissionpolicy kc-1664-block-delete
$ flux reconcile kustomization kc-1664 -n flux-system
✔ applied revision main@sha1:350b6b0a…
$ kubectl -n kc-1664 get cm
NAME       DATA   AGE
app-flat   1      4m
$ kubectl -n flux-system get kustomization kc-1664 \
    -o jsonpath='{range .status.inventory.entries[*]}{.id}{"\n"}{end}'
kc-1664_app-flat__ConfigMap
$ kubectl -n flux-system get kustomization kc-1664 \
    -o jsonpath='Ready={.status.conditions[?(@.type=="Ready")].status}'
Ready=True

So the fix doesn't just expose the orphans — it lets the controller converge on its own once the cause is fixed. No manual kubectl delete step is required.

I haven't touched the --with-finalizer variant (where deletionTimestamp was being set and a controller-owned finalizer blocked completion), since that path is already visible to operators in current Flux releases: Ready flips to False and the resource shows as Terminating. The bug surface this PR fixes is the silent one: admission rejections that never even reach the deletion phase.

Ready for review whenever you have a moment.

When the kustomize-controller advances status.Inventory to the new build output before the prune step succeeds, and prune subsequently fails because the apiserver rejects the DELETE (admission webhook denial, validation error, in-use resource, etc.), the stale resources are silently lost from Flux's tracking. status.Inventory no longer references them; the next reconcile's old-vs-new inventory diff yields nothing for them; flux reconcile is a no-op; and the resources remain on the cluster forever — still labeled with this Kustomization but invisible to its GC.

This is reproducible with a minimal kind/Flux setup using a ValidatingAdmissionPolicy that denies DELETE on the resources to be pruned (see fluxcd#1664 for the reproducer and the original production trigger, which was Karpenter's EC2NodeClass validation webhook firing while a sibling NodeClaim still referenced the class being pruned).

The fix:

- prune() now also returns the slice of survivor objects whose DELETE was not confirmed by the apiserver (i.e. no DeletedAction / SkippedAction entry in the ChangeSet).
- The reconcile path merges those survivors back into status.Inventory before recording the PruneFailedReason and returning the error. The next reconcile then sees them again in old-vs-new and retries the prune.
- A new inventory.Merge helper performs the additive, idempotent insert.
- Behaviour with no failure, with spec.Prune=false, or with resources explicitly opted out via kustomize.toolkit.fluxcd.io/prune=disabled is unchanged: SkippedAction entries are treated as settled, so opt-outs do not produce an infinite retry loop.

Unit tests for pruneSurvivors and inventory.Merge cover the deleted/skipped/unknown/missing action cases and the nil-safety paths.

Fixes fluxcd#1664

Signed-off-by: gecube <gb12335@gmail.com>

Signed-off-by: Matheus Pimenta <matheuscscp@gmail.com>

matheuscscp

LGTM! 🚀

@gecube Nice one!

gecube marked this pull request as ready for review June 11, 2026 19:44

gecube force-pushed the fix/1664-keep-failed-prune-in-inventory branch from d85dcdb to ab4ec07 Compare June 11, 2026 20:06

gecube mentioned this pull request Jun 11, 2026

Resources not pruned after a rapid series of commits where intermediate kustomize build output is broken #1664

Closed

Add tests for failed deletion recovery

40ede32

Signed-off-by: Matheus Pimenta <matheuscscp@gmail.com>

stefanprodan added the area/server-side-apply SSA related issues and pull requests label Jun 12, 2026

matheuscscp approved these changes Jun 12, 2026

matheuscscp added the bug Something isn't working label Jun 12, 2026

matheuscscp merged commit e4d3f6d into fluxcd:main Jun 12, 2026
8 checks passed

gecube deleted the fix/1664-keep-failed-prune-in-inventory branch June 12, 2026 14:00

matheuscscp merged 2 commits into fluxcd:main from gecube:fix/1664-keep-failed-prune-in-inventory