fix(opal-server): freeze publishes during broadcaster backbone gaps to keep the fleet consistent#933
Open
Zivxx wants to merge 6 commits into
Open
fix(opal-server): freeze publishes during broadcaster backbone gaps to keep the fleet consistent#933Zivxx wants to merge 6 commits into
Zivxx wants to merge 6 commits into
Conversation
…is down During a broadcaster-backbone outage, a publish reaching one worker was still delivered to that worker's own clients (local in-process notifier) while peers' clients stayed stale — a transient fleet split lasting the whole outage (observed 3/9 PDPs diverging in a 3-server/9-PDP staging test). Gate the whole publish on live backbone connectivity: add is_backbone_connected() to ReconnectingBroadcaster (True only while actively subscribed — unlike is_reader_healthy(), which deliberately stays healthy across a transient reconnect) and a FreezablePubSubEndpoint that skips publish entirely while disconnected. Clients stay connected but frozen on the last consistent state; the write still lands in its datasource and the existing reconnect resync makes every client refetch it, so the fleet converges together. Skipping the whole publish also keeps the replay buffer out of the recovery path (no partial replayed state racing the resync refetch). Configurable via OPAL_BROADCAST_FREEZE_ON_DISCONNECT (default true). Scope: designed for source-based updates (entry carries a URL clients refetch). Inline updates issued during an outage are dropped rather than deferred — accepted, inline is legacy. Validated on a 3-server/9-PDP staging stack against a real RDS stop/start: during-outage divergence 3/9 -> 0/9, zero client disconnects and zero worker restarts across ~20-minute outages, and a source-based write made during the gap converged 9/9 via the resync refetch alone (its notification provably frozen, nothing replayed). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…exemptions, alias bypass Review findings on the freeze-on-disconnect gate, all fixed here: - Gate on a real backbone GAP (had a session, lost it, reader retrying) via is_in_backbone_gap(), not on mere "not subscribed". Freezing while the reader was never started (healthy idle worker: the global listening context is only pinned when statistics are enabled, and upstream cancels the reader when the last listener leaves) or never yet connected (boot) silently dropped publishes fleet-wide with no resync ever firing to reconcile them. - Exempt internal coordination topics from the freeze: "__"-prefixed channels (statistics protocol, broadcaster keepalive) and the git-webhook trigger topic. These target server-side subscribers, not clients — dropping them broke statistics state (ghost clients, workers that never sync) and lost repo-pull triggers that no resync re-issues (webhooks are the only pull trigger with the default POLICY_REPO_POLLING_INTERVAL=0). Exempt topics keep the pre-freeze deliver-locally + buffer-for-replay behavior. - Re-bind the library's class-level `notify = publish` alias, which bound the BASE publish and let endpoint.notify(...) bypass the gate entirely. - Refuse BROADCAST_FREEZE_ON_DISCONNECT with BROADCAST_RESYNC_ON_RECONNECT disabled (warn + disable freeze): the resync is the freeze's only recovery path, so that combination silently lost every update published during gaps. - Rate-limit the freeze log: WARNING once per gap episode, DEBUG afterwards, and a suppressed-count summary on recovery (a long outage previously emitted an unbounded WARNING per frozen stats keepalive). - Document recovery scope honestly (configured-source refetch; one-off URL / inline updates are dropped by a freeze) and the residual windows that keep pre-freeze behavior (outbound-failure while subscribed, connect flap, RPC client-originated publishes). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
✅ Deploy Preview for opal-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
CI fixes: - pre-commit / docformatter: new docstrings were not wrapped per the pinned docformatter 1.7.5; re-wrapped in place (no code changes), verified idempotent alongside black/isort and the test suite. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR introduces a “fleet-consistency freeze” for OPAL Server publishes during broadcaster backbone gaps so that writes received by a single worker during an outage are not delivered locally (and thus can’t split the fleet) until recovery via reconnect resync.
Changes:
- Add backbone subscription/gap tracking to
ReconnectingBroadcasterand a newFreezablePubSubEndpointthat suppressespublish()during gaps (with exemptions). - Wire the new endpoint into server pubsub initialization with a safety guardrail when resync is disabled.
- Add config flag
BROADCAST_FREEZE_ON_DISCONNECT(defaulttrue) and new unit tests covering gap lifecycle + freeze behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| packages/opal-server/opal_server/tests/reconnecting_broadcaster_test.py | Adds tests for is_backbone_connected / is_in_backbone_gap lifecycle and freeze gating behavior. |
| packages/opal-server/opal_server/pubsub.py | Switches server endpoint to FreezablePubSubEndpoint and applies freeze + exemptions + resync guardrail. |
| packages/opal-server/opal_server/pubsub_resilience.py | Adds backbone connectivity state, gap detection API, and implements FreezablePubSubEndpoint. |
| packages/opal-server/opal_server/config.py | Introduces BROADCAST_FREEZE_ON_DISCONNECT configuration option and documentation text. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+667
to
+673
| def _is_exempt(self, topics) -> bool: | ||
| if isinstance(topics, str): | ||
| topics = [topics] | ||
| return all( | ||
| topic.startswith("__") or topic in self._freeze_exempt_topics | ||
| for topic in topics | ||
| ) |
Comment on lines
+944
to
+954
| def _fabricate_gap(broadcaster): | ||
| """Put a broadcaster into the mid-gap state (had a session, lost it, reader | ||
| pending) without driving a real backbone. | ||
|
|
||
| Returns the dummy reader task — cancel it in the test's cleanup. | ||
| """ | ||
| dummy = asyncio.get_event_loop().create_task(asyncio.sleep(60)) | ||
| broadcaster._subscription_task = dummy | ||
| broadcaster._had_backbone_connection = True | ||
| broadcaster._backbone_connected = False | ||
| return dummy |
Comment on lines
+993
to
+995
| finally: | ||
| dummy.cancel() | ||
| # no broadcaster (single worker) -> never |
Comment on lines
+1026
to
+1027
| finally: | ||
| dummy.cancel() |
… over An exempt (internal-topic) publish arriving mid-gap reaches the delivery path and was resetting the freeze-episode counter and logging "Backbone recovered" while the backbone was still down. Emit the summary only when the gap has actually ended; exempt deliveries mid-gap no longer end the episode. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…antics CI fixes: - E2E Tests / App Tests: the cross-instance consistency test asserted the pre-freeze behavior (a one-off update published during a backbone outage reaches clients via buffer+replay). With BROADCAST_FREEZE_ON_DISCONNECT (default on) that publish is frozen so the fleet never splits. The test now asserts the new semantics: the publish is frozen at the gate, exempt internal topics still ride buffer+replay, recovery resyncs clients, the frozen update reached NO client (new check_clients_not_logged helper), and a post-recovery re-publish converges both clients together and logs the freeze-episode summary. Also raise the gitea startup wait budget (120s -> 300s): slow bind-mount environments need minutes to bring the web listener up; fast environments exit the poll early and pay nothing. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The problem
Follow-up to #915. That PR made a fleet of OPAL servers survive a broadcast-backbone outage (reconnect with backoff, buffer outbound broadcasts, resync clients on recovery). Multi-server validation of it surfaced a consistency gap during the outage:
A server-side
publishfans out two independent ways — local in-process delivery to that worker's own clients, and the broadcaster to peer workers. Only the outbound path is buffered when the backbone is down; local delivery still fires. So an update that reaches one server mid-outage is applied to its clients while every other server's clients hold old data — a fleet split that lasts the whole outage. In a 3-server / 9-client test against a real managed-Postgres backbone stop, a data update published during the outage left the fleet at 3/9 new vs 6/9 old for the entire ~13-minute gap. For an authorization system, that means the same request can be allowed by one PDP and denied by another until the backbone returns.The fix
FreezablePubSubEndpointgates client-facing publishes on live backbone connectivity: while theReconnectingBroadcasteris mid-gap (an established backbone session was lost and is being re-acquired — seeis_in_backbone_gap()),publishis skipped entirely. Clients stay connected but frozen on the last consistent state; the write still lands in its datasource; and the existing reconnect resync (from #915) makes every worker's clients refetch on recovery — so the whole fleet moves together instead of diverging. Skipping the whole publish also keeps the replay buffer out of the recovery path for these updates: convergence is purely the resync refetch.Configurable via
OPAL_BROADCAST_FREEZE_ON_DISCONNECT(defaulttrue; setfalsefor the previous behavior).Scope & semantics (please read if you rely on runtime data updates)
OPAL_DATA_CONFIG_SOURCES/ scope config) and the policy bundle. These are reconciled fleet-wide on recovery.datapayloads, or fetch URLs not part of the configured sources. Consistency is chosen over freshness for these; if you rely on them, set the flag tofalse.__-prefixed channels (statistics protocol, broadcaster keepalive) and the git-webhook trigger topic. These target server-side subscribers, not clients; freezing them would break statistics state and silently drop repo-pull triggers that no resync re-issues.OPAL_BROADCAST_RESYNC_ON_RECONNECT=falseis refused with a warning (freeze disabled) rather than silently losing updates.Hardening (second commit)
An adversarial review of the first commit caught real issues, all addressed:
is_in_backbone_gap()(had a session, lost it, retrying), not mere "not subscribed". Freezing while the reader was never started (idle worker — the reader only runs while a listening context is held) or never yet connected (boot) would silently drop publishes on a healthy backbone with nothing to ever reconcile them; those states delegate to the pre-freeze behavior. Gap state is scoped to the reader task's own lifetime, mirroring the resync's semantics.notify = publishalias re-bound — the upstream endpoint aliasesnotify = publishat class level, which binds the base publish; without re-binding,endpoint.notify(...)bypassed the gate entirely.Validation
notifyalias pin.🤖 Generated with Claude Code