Skip to content

fix(connection): use MergeFrom patch for finalizer writes to avoid 409 conflicts#414

Merged
freeznet merged 3 commits into
mainfrom
fix/finalizer-removal-mergefrom-patch
Jun 29, 2026
Merged

fix(connection): use MergeFrom patch for finalizer writes to avoid 409 conflicts#414
freeznet merged 3 commits into
mainfrom
fix/finalizer-removal-mergefrom-patch

Conversation

@freeznet

Copy link
Copy Markdown
Member

Problem

When a Pulsar sub-resource (e.g. a PulsarSource) is deleted, the operator deletes the backing Pulsar resource and then removes the Kubernetes finalizer. The finalizer-removal write could fail with a 409 Conflict ("the object has been modified"), leaving the object stuck Terminating and the owning PulsarConnection requeuing in a hot loop. It only cleared once the informer cache happened to catch up. This can also block GitOps convergence and hold up teardown of the connection, since a stuck child blocks PulsarConnection deletion.

Root cause

Sub-resource reconcilers add/remove the finalizer with a full-object Update:

controllerutil.RemoveFinalizer(obj, resourcev1alpha1.FinalizerName)
if err := r.conn.client.Update(ctx, obj); err != nil { ... }

The object is loaded once from the controller-runtime informer cache (the cached List in Observe), so the Update carries the cached resourceVersion as an optimistic-lock precondition. When the server copy is newer — informer-cache lag, or a concurrent writer such as a GitOps controller re-syncing the same object — the API server rejects the write with 409 Conflict, and the finalizer is never removed. An unrelated error on a sibling resource of the same connection makes the connection requeue faster than the cache settles, so the conflict repeats instead of self-correcting.

Fix

Replace each finalizer Update with a client.MergeFrom patch, guarded by the boolean return of Add/RemoveFinalizer:

patch := client.MergeFrom(obj.DeepCopy())
if controllerutil.RemoveFinalizer(obj, resourcev1alpha1.FinalizerName) {
    if err := r.conn.client.Patch(ctx, obj, patch); err != nil { ... }
}

A MergeFrom patch sends only the finalizer diff and carries no resourceVersion precondition, so it always applies to the live object regardless of cache lag or concurrent writers. The boolean guard avoids an API call when the finalizer set is unchanged.

Scope

The same Add/RemoveFinalizer + Update pattern existed in every PulsarConnection sub-reconciler, so the change is applied consistently across all of them and the PulsarConnection reconciler itself:

reconcile_tenant.go, reconcile_namespace.go, reconcile_topic.go, reconcile_source.go, reconcile_sink.go, reconcile_function.go, reconcile_package.go, reconcile_permission.go, reconcile_geo_replication.go, reconcile_nsisolationpolicy.go, reconciler.go.

Status().Update calls and the single non-finalizer object Update (the permission state annotation) are intentionally left unchanged.

Testing

  • Existing pkg/connection unit tests pass.
  • New regression test (reconcile_source_test.go) reconciles a deleting PulsarSource from a stale-resourceVersion copy (after an out-of-band server write) and asserts: no conflict error, the finalizer is persisted via Patch rather than a full-object Update, and the object is garbage-collected promptly.
  • A companion test confirms the pre-fix path (a full Update from a stale copy) returns a 409 Conflict, so the regression test is non-vacuous — it fails against the old Update-based code and passes with the patch-based fix.

🤖 Generated with Claude Code

…9 conflicts

Sub-resource reconcilers added and removed the resource finalizer with a
full-object Update. The object is loaded once from the informer cache, so
the Update carries a stale resourceVersion as an optimistic-lock
precondition. When the server copy moves ahead — concurrent writers such
as GitOps controllers, or informer-cache lag — the API server rejects the
write with a 409 Conflict ("the object has been modified") and the
finalizer is never removed. The object is then stuck Terminating while the
connection requeues hot, and it only clears once the cache happens to catch
up.

Replace each finalizer Update with a client.MergeFrom patch, guarded by the
boolean return of Add/RemoveFinalizer. The patch sends only the finalizer
diff and carries no resourceVersion precondition, so it always applies to
the live object regardless of cache lag or concurrent writers; the guard
avoids an API call when the finalizer set is unchanged.

Applied across all PulsarConnection sub-reconcilers (tenant, namespace,
topic, source, sink, function, package, permission, geo-replication,
ns-isolation-policy) and the PulsarConnection reconciler itself.

Add a regression test that removes a PulsarSource finalizer from a
stale-resourceVersion copy and asserts there is no conflict, that the write
uses Patch rather than a full-object Update, and that the object is
garbage-collected promptly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 25, 2026 15:57
@freeznet freeznet requested review from a team as code owners June 25, 2026 15:57

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@github-actions

Copy link
Copy Markdown
Contributor

@freeznet:Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

@github-actions github-actions Bot added the doc-info-missing This pr needs to mark a document option in description label Jun 25, 2026
freeznet and others added 2 commits June 26, 2026 10:51
…finalizer writes

The previous change persisted finalizers (and the connection's admin-URL
default) with a client.MergeFrom patch. A JSON merge patch replaces list
fields atomically, so removing the operator finalizer from a stale cached
object emits a patch that clears the whole finalizers list. That clobbers a
finalizer added concurrently by another controller (e.g. the Kubernetes
foregroundDeletion finalizer, or a GitOps finalizer) and deletes the object
prematurely while that finalizer is still pending. The connection's
adminServiceURL default had the same hazard: a MergeFrom patch carries no
resourceVersion precondition, so it could overwrite a concurrent spec edit.

Read the live object under optimistic-lock retry instead. New helpers
ensureFinalizer / removeFinalizer (finalizer_helpers.go) Get the latest
object, mutate only the operator finalizer, and Update with
retry.RetryOnConflict, so concurrently-added finalizers are preserved and a
stale cache no longer wedges deletion. The PulsarConnection reconciler applies
the same Get-mutate-Update-retry to its adminServiceURL default and finalizer.

Tests:
- PulsarSource deletion preserves a concurrently-added foreign finalizer and
  does not garbage-collect the object while that finalizer is still pending.
- PulsarSource deletion still removes the operator finalizer and the object is
  collected promptly even when the cached resourceVersion is stale.
- PulsarConnection reconcile does not overwrite a concurrent adminServiceURL
  edit when defaulting from the secure URL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The finalizer helpers and the connection reconcile loop re-read the object
under retry.RetryOnConflict before writing, but used the manager's default
client, whose reads go through the informer cache. Under cache lag that Get
can return a stale resourceVersion, the Update 409s, and the retry re-reads
the same stale cache — so deletion could still be delayed until the cache
converges, missing the "remove the finalizer on the first reconcile,
regardless of informer-cache lag" goal.

Thread mgr.GetAPIReader() (an uncached, direct-from-apiserver reader) into
the PulsarConnectionReconciler and use it for the finalizer Get; writes still
go through the normal client. A ContainsFinalizer guard skips the uncached
read entirely on the steady-state path (finalizer already present), so direct
reads happen only on the first add and on deletion — both rare.

This does not change the clobber fix: a stale-resourceVersion Update is
fail-safe (409, never a lost update), so concurrent finalizers were already
preserved; the uncached read additionally guarantees deletion is not wedged
by a lagging cache.

Tests: a finalizer removal driven by a persistently stale reader conflicts
(the stuck symptom), while the same removal with a live reader succeeds; and
the add-path guard issues no API call when the finalizer is already present.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@freeznet freeznet merged commit 7012b18 into main Jun 29, 2026
5 checks passed
@freeznet freeznet deleted the fix/finalizer-removal-mergefrom-patch branch June 29, 2026 08:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-info-missing This pr needs to mark a document option in description

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants