fix(connection): use MergeFrom patch for finalizer writes to avoid 409 conflicts#414
Merged
Merged
Conversation
…9 conflicts
Sub-resource reconcilers added and removed the resource finalizer with a
full-object Update. The object is loaded once from the informer cache, so
the Update carries a stale resourceVersion as an optimistic-lock
precondition. When the server copy moves ahead — concurrent writers such
as GitOps controllers, or informer-cache lag — the API server rejects the
write with a 409 Conflict ("the object has been modified") and the
finalizer is never removed. The object is then stuck Terminating while the
connection requeues hot, and it only clears once the cache happens to catch
up.
Replace each finalizer Update with a client.MergeFrom patch, guarded by the
boolean return of Add/RemoveFinalizer. The patch sends only the finalizer
diff and carries no resourceVersion precondition, so it always applies to
the live object regardless of cache lag or concurrent writers; the guard
avoids an API call when the finalizer set is unchanged.
Applied across all PulsarConnection sub-reconcilers (tenant, namespace,
topic, source, sink, function, package, permission, geo-replication,
ns-isolation-policy) and the PulsarConnection reconciler itself.
Add a regression test that removes a PulsarSource finalizer from a
stale-resourceVersion copy and asserts there is no conflict, that the write
uses Patch rather than a full-object Update, and that the object is
garbage-collected promptly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
|
@freeznet:Thanks for your contribution. For this PR, do we need to update docs? |
…finalizer writes The previous change persisted finalizers (and the connection's admin-URL default) with a client.MergeFrom patch. A JSON merge patch replaces list fields atomically, so removing the operator finalizer from a stale cached object emits a patch that clears the whole finalizers list. That clobbers a finalizer added concurrently by another controller (e.g. the Kubernetes foregroundDeletion finalizer, or a GitOps finalizer) and deletes the object prematurely while that finalizer is still pending. The connection's adminServiceURL default had the same hazard: a MergeFrom patch carries no resourceVersion precondition, so it could overwrite a concurrent spec edit. Read the live object under optimistic-lock retry instead. New helpers ensureFinalizer / removeFinalizer (finalizer_helpers.go) Get the latest object, mutate only the operator finalizer, and Update with retry.RetryOnConflict, so concurrently-added finalizers are preserved and a stale cache no longer wedges deletion. The PulsarConnection reconciler applies the same Get-mutate-Update-retry to its adminServiceURL default and finalizer. Tests: - PulsarSource deletion preserves a concurrently-added foreign finalizer and does not garbage-collect the object while that finalizer is still pending. - PulsarSource deletion still removes the operator finalizer and the object is collected promptly even when the cached resourceVersion is stale. - PulsarConnection reconcile does not overwrite a concurrent adminServiceURL edit when defaulting from the secure URL. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The finalizer helpers and the connection reconcile loop re-read the object under retry.RetryOnConflict before writing, but used the manager's default client, whose reads go through the informer cache. Under cache lag that Get can return a stale resourceVersion, the Update 409s, and the retry re-reads the same stale cache — so deletion could still be delayed until the cache converges, missing the "remove the finalizer on the first reconcile, regardless of informer-cache lag" goal. Thread mgr.GetAPIReader() (an uncached, direct-from-apiserver reader) into the PulsarConnectionReconciler and use it for the finalizer Get; writes still go through the normal client. A ContainsFinalizer guard skips the uncached read entirely on the steady-state path (finalizer already present), so direct reads happen only on the first add and on deletion — both rare. This does not change the clobber fix: a stale-resourceVersion Update is fail-safe (409, never a lost update), so concurrent finalizers were already preserved; the uncached read additionally guarantees deletion is not wedged by a lagging cache. Tests: a finalizer removal driven by a persistently stale reader conflicts (the stuck symptom), while the same removal with a live reader succeeds; and the add-path guard issues no API call when the finalizer is already present. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
jiangpengcheng
approved these changes
Jun 29, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When a Pulsar sub-resource (e.g. a
PulsarSource) is deleted, the operator deletes the backing Pulsar resource and then removes the Kubernetes finalizer. The finalizer-removal write could fail with a409 Conflict("the object has been modified"), leaving the object stuckTerminatingand the owningPulsarConnectionrequeuing in a hot loop. It only cleared once the informer cache happened to catch up. This can also block GitOps convergence and hold up teardown of the connection, since a stuck child blocksPulsarConnectiondeletion.Root cause
Sub-resource reconcilers add/remove the finalizer with a full-object
Update:The object is loaded once from the controller-runtime informer cache (the cached
ListinObserve), so theUpdatecarries the cachedresourceVersionas an optimistic-lock precondition. When the server copy is newer — informer-cache lag, or a concurrent writer such as a GitOps controller re-syncing the same object — the API server rejects the write with409 Conflict, and the finalizer is never removed. An unrelated error on a sibling resource of the same connection makes the connection requeue faster than the cache settles, so the conflict repeats instead of self-correcting.Fix
Replace each finalizer
Updatewith aclient.MergeFrompatch, guarded by the boolean return ofAdd/RemoveFinalizer:A
MergeFrompatch sends only the finalizer diff and carries noresourceVersionprecondition, so it always applies to the live object regardless of cache lag or concurrent writers. The boolean guard avoids an API call when the finalizer set is unchanged.Scope
The same
Add/RemoveFinalizer+Updatepattern existed in everyPulsarConnectionsub-reconciler, so the change is applied consistently across all of them and thePulsarConnectionreconciler itself:reconcile_tenant.go,reconcile_namespace.go,reconcile_topic.go,reconcile_source.go,reconcile_sink.go,reconcile_function.go,reconcile_package.go,reconcile_permission.go,reconcile_geo_replication.go,reconcile_nsisolationpolicy.go,reconciler.go.Status().Updatecalls and the single non-finalizer objectUpdate(the permission state annotation) are intentionally left unchanged.Testing
pkg/connectionunit tests pass.reconcile_source_test.go) reconciles a deletingPulsarSourcefrom a stale-resourceVersioncopy (after an out-of-band server write) and asserts: no conflict error, the finalizer is persisted viaPatchrather than a full-objectUpdate, and the object is garbage-collected promptly.Updatefrom a stale copy) returns a409 Conflict, so the regression test is non-vacuous — it fails against the oldUpdate-based code and passes with the patch-based fix.🤖 Generated with Claude Code