fix: keep failed-to-prune resources in status.Inventory (#1664)#1665
Conversation
E2E verification against the original reproI rebuilt the controller with this PR ( Check 1 — orphans stay tracked while the webhook blocks DELETE After stage C with a The three resources that failed to prune stayed in Status surface is also actionable now: The operator now sees exactly which webhook is blocking which DELETE. Check 2 — Flux recovers automatically when the obstruction is removed So the fix doesn't just expose the orphans — it lets the controller converge on its own once the cause is fixed. No manual I haven't touched the Ready for review whenever you have a moment. |
When the kustomize-controller advances status.Inventory to the new build output before the prune step succeeds, and prune subsequently fails because the apiserver rejects the DELETE (admission webhook denial, validation error, in-use resource, etc.), the stale resources are silently lost from Flux's tracking. status.Inventory no longer references them; the next reconcile's old-vs-new inventory diff yields nothing for them; flux reconcile is a no-op; and the resources remain on the cluster forever — still labeled with this Kustomization but invisible to its GC. This is reproducible with a minimal kind/Flux setup using a ValidatingAdmissionPolicy that denies DELETE on the resources to be pruned (see fluxcd#1664 for the reproducer and the original production trigger, which was Karpenter's EC2NodeClass validation webhook firing while a sibling NodeClaim still referenced the class being pruned). The fix: - prune() now also returns the slice of survivor objects whose DELETE was not confirmed by the apiserver (i.e. no DeletedAction / SkippedAction entry in the ChangeSet). - The reconcile path merges those survivors back into status.Inventory before recording the PruneFailedReason and returning the error. The next reconcile then sees them again in old-vs-new and retries the prune. - A new inventory.Merge helper performs the additive, idempotent insert. - Behaviour with no failure, with spec.Prune=false, or with resources explicitly opted out via kustomize.toolkit.fluxcd.io/prune=disabled is unchanged: SkippedAction entries are treated as settled, so opt-outs do not produce an infinite retry loop. Unit tests for pruneSurvivors and inventory.Merge cover the deleted/skipped/ unknown/missing action cases and the nil-safety paths. Fixes fluxcd#1664 Signed-off-by: gecube <gb12335@gmail.com>
d85dcdb to
ab4ec07
Compare
Signed-off-by: Matheus Pimenta <matheuscscp@gmail.com>
matheuscscp
left a comment
There was a problem hiding this comment.
LGTM! 🚀
@gecube Nice one!
Summary
Fixes #1664. When
status.Inventoryis advanced to the new build output before the prune step succeeds, and prune subsequently fails because the apiserver rejects the DELETE (admission webhook denial, in-use resource, etc.), the stale resources are silently lost from Flux's tracking. They:status.Inventory,kustomize.toolkit.fluxcd.io/name=…label,flux reconcile— only manualkubectl deleteremoves them.Result: long-lived orphans labeled by the
Kustomizationbut invisible to its GC.Behaviour change
prune()now also returns the slice of survivor objects whose DELETE wasn't confirmed (noDeletedActionorSkippedActionChangeSetEntry). The reconcile path merges those survivors back intostatus.Inventoryimmediately before recordingPruneFailedReasonand returning the error. The next reconcile then sees them in the diff again and retries the prune.Treating
SkippedActionentries as settled means resources explicitly opted out (kustomize.toolkit.fluxcd.io/prune: disabled) are not re-tracked — keeping them would create an infinite retry loop for opt-outs.Paths NOT affected:
spec.prune: false—prune()short-circuits before any of the new logic runs.deleteObjectsdirectly, notprune().survivorsis empty, no merge occurs.Reproducer
I built a minimal kind-based reproducer at https://github.com/gecube/flux-kc-1664-repro. The relevant variant —
repro-restart-matrix.shscenariowebhook-deny-delete— reproduces the exact production state from #1664 (3/3 consecutive runs): per-AZ ConfigMaps survive past the consolidation commit,deletionTimestampnot set,status.Inventoryalready advanced past them. With this patch applied to a locally-builtkustomize-controller, the same reproducer should leave the orphans in inventory and retry prune on the next reconcile (I have not yet rebuilt-and-loaded a custom controller image into kind to verify end-to-end; happy to do so if useful before merge).Tests
internal/inventory: new subtestmerge_re-adds_objects_whose_prune_failedplus a nil-safety subtest.internal/controller: new filekustomization_prune_survivors_test.gowith unit-level coverage ofpruneSurvivorsacross theDeletedAction/SkippedAction/UnknownAction/ missing-entry cases.go test ./internal/inventory/...go test ./internal/controller/... -run "Prune|Inventory|DeletionPolicy"go vet ./...clean,gofmt -lclean.Notes for maintainers
Marking this draft because:
status(understaleResourcesor similar) rather than merging back into inventory; or surfacing an additionalOrphanedResourcesDetectedcondition; or a different stance onSkippedActionre-tracking semantics. I picked the smallest change that restores GC convergence.SkippedActioncase is the subtle one — could you confirm my reading that an explicitprune: disabledopt-out should never end up insurvivors? If you'd rather treat skipped-as-survivor, the change is a one-line swap.Reviews welcome. Test plan after rebuild verification:
pruneSurvivorsandinventory.Merge