
[fix][broker] Fix geo-replication backlog stuck after backlog quota rejection #26002

Open
void-ptr974 wants to merge 1 commit into apache:master from void-ptr974:fix/replicator-publish-failure-permit-clean

void-ptr974 commented Jun 11, 2026

Contributor

Motivation

When a persistent replicator publish fails because the remote producer is rejected, for example by backlog quota, PersistentReplicator rewinds the cursor so the failed entry can be read and retried later.

However, the failed send was not marked as completed in the corresponding in-flight task. As a result, the in-flight task could keep consuming a read permit even though the cursor had already been rewound. After the producer reconnects, the replicator might not read the failed entry again, leaving the replication backlog stuck.
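The permit leak described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `PermitLeakSketch`, its `InFlightTask`, `availablePermits`, and the read limit of 1 are hypothetical names and values chosen for illustration, and they do not mirror the actual fields or methods of PersistentReplicator.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these names are hypothetical and do not mirror Pulsar's classes.
class PermitLeakSketch {
    /** One batch of entries that was read and handed to the remote producer. */
    static final class InFlightTask {
        final int entries;   // entries read in this batch
        int completed;       // sends that have been accounted for
        InFlightTask(int entries) { this.entries = entries; }
        int pending() { return entries - completed; }
    }

    /** Permits left for new reads: the limit minus every still-pending send. */
    static int availablePermits(int limit, List<InFlightTask> tasks) {
        int pending = 0;
        for (InFlightTask task : tasks) {
            pending += task.pending();
        }
        return Math.max(0, limit - pending);
    }

    public static void main(String[] args) {
        int limit = 1; // assumed read limit, chosen to make the effect obvious
        List<InFlightTask> tasks = new ArrayList<>();
        InFlightTask task = new InFlightTask(1);
        tasks.add(task);

        // The send fails and the cursor is rewound, but the task is never told.
        System.out.println("permits with leaked send: " + availablePermits(limit, tasks)); // 0

        // Accounting for the failed send frees the permit again.
        task.completed++;
        System.out.println("permits after completion: " + availablePermits(limit, tasks)); // 1
    }
}
```

In this model, a rewound cursor alone does not help: as long as the failed send is still counted as pending, no permit is available to read the entry again.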

This can cause geo-replication to stop making progress after a transient remote publish failure. It also caused ReplicatorTest.testResumptionAfterBacklogRelaxed to fail intermittently because the backlog remained at 1 instead of returning to 0.

Reference CI failure from #25995:

Pulsar CI attempt 1: https://github.com/apache/pulsar/actions/runs/27323273803/attempts/1
Failed job: CI - Unit - Brokers - Broker Group 5
Job URL: https://github.com/apache/pulsar/actions/runs/27323273803/job/80719328022
ReplicatorTest.testResumptionAfterBacklogRelaxed([producer_exception])
org.awaitility.core.ConditionTimeoutException: Assertion condition expected [0] but found [1] within 40 seconds.
at org.apache.pulsar.broker.service.ReplicatorTest.testResumptionAfterBacklogRelaxed(ReplicatorTest.java:1097)
Caused by: java.lang.AssertionError: expected [0] but found [1]
at org.apache.pulsar.broker.service.ReplicatorTest.lambda$testResumptionAfterBacklogRelaxed$19(ReplicatorTest.java:1099)

The same test report shows the failure happened after the remote backlog quota rejected the replicator producer:

Backlog quota type exceeded for topic. Applying policy {backlogQuotaType=destination_storage, policy=producer_exception, topic=.../producer_exception}
Producer has exceeded backlog quota on topic. Disconnecting producer {producerName=pulsar.repl.r1-->r2, topic=.../producer_exception}
ProducerBlockedQuotaExceededException: The backlog quota of the topic .../producer_exception that the producer pulsar.repl.r1-->r2 produces to is exceeded

Modifications

In PersistentReplicator.ProducerSendCallback, mark the failed publish as completed in the current InFlightTask after rewinding the cursor.

The cursor rewind makes the failed entry readable again, so the in-flight task should release its permit. This allows the replicator to resume reading entries after the remote producer becomes available again.
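The ordering described above (rewind first, then release the permit) might look roughly like the following. This is a simplified sketch, not the code in this PR: `Cursor`, `InFlightTask`, `SendCallback`, `markOneCompleted`, and `holdsPermit` are hypothetical stand-ins, and the real callback handles more cases than shown here.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these types are Pulsar's real classes or signatures.
class FailedSendCallbackSketch {
    /** Minimal cursor stand-in that only records whether a rewind happened. */
    static final class Cursor {
        boolean rewound;
        void rewind() { rewound = true; }
    }

    /** Tracks how many sends of a read batch are still unaccounted for. */
    static final class InFlightTask {
        private final AtomicInteger pending;
        InFlightTask(int entries) { pending = new AtomicInteger(entries); }
        void markOneCompleted() { pending.updateAndGet(p -> Math.max(0, p - 1)); }
        boolean holdsPermit() { return pending.get() > 0; }
    }

    /** Send callback: both success and failure end by accounting for the send. */
    static final class SendCallback {
        private final Cursor cursor;
        private final InFlightTask task;

        SendCallback(Cursor cursor, InFlightTask task) {
            this.cursor = cursor;
            this.task = task;
        }

        void sendComplete(Exception failure) {
            if (failure != null) {
                // Make the failed entry readable again first.
                cursor.rewind();
            }
            // Then account for the send so the task stops holding a permit.
            task.markOneCompleted();
        }
    }

    public static void main(String[] args) {
        Cursor cursor = new Cursor();
        InFlightTask task = new InFlightTask(1);
        SendCallback callback = new SendCallback(cursor, task);

        // Simulate a rejected publish, e.g. the remote backlog quota was exceeded.
        callback.sendComplete(new RuntimeException("backlog quota exceeded"));

        System.out.println("rewound=" + cursor.rewound + ", holdsPermit=" + task.holdsPermit());
    }
}
```

The same two conditions, a rewound cursor and a released permit, are what a unit test for the failed publish path would assert.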

Added a unit test covering the failed publish path to verify that:

  • the in-flight task is completed when publish fails with ProducerBlockedQuotaExceededException
  • the read permit is released after the failed send is handled

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

  • Added PersistentReplicatorInflightTaskTest.testFailedPublishCompletesInFlightTask
  • Verified locally with:
    • ./gradlew :pulsar-broker:test --tests org.apache.pulsar.broker.service.persistent.PersistentReplicatorInflightTaskTest.testFailedPublishCompletesInFlightTask
    • ./gradlew :pulsar-broker:test --tests org.apache.pulsar.broker.service.ReplicatorTest.testResumptionAfterBacklogRelaxed

Does this pull request potentially affect one of the following parts:

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

@lhotari lhotari requested a review from poorbarcode June 11, 2026 18:17
