fix(opal-server): freeze publishes during broadcaster backbone gaps to keep the fleet consistent by Zivxx · Pull Request #933 · permitio/opal

Zivxx · 2026-07-02T15:34:59Z

The problem

Follow-up to #915. That PR made a fleet of OPAL servers survive a broadcast-backbone outage (reconnect with backoff, buffer outbound broadcasts, resync clients on recovery). Multi-server validation of it surfaced a consistency gap during the outage:

A server-side publish fans out two independent ways — local in-process delivery to that worker's own clients, and the broadcaster to peer workers. Only the outbound path is buffered when the backbone is down; local delivery still fires. So an update that reaches one server mid-outage is applied to its clients while every other server's clients hold old data — a fleet split that lasts the whole outage. In a 3-server / 9-client test against a real managed-Postgres backbone stop, a data update published during the outage left the fleet at 3/9 new vs 6/9 old for the entire ~13-minute gap. For an authorization system, that means the same request can be allowed by one PDP and denied by another until the backbone returns.

The fix

FreezablePubSubEndpoint gates client-facing publishes on live backbone connectivity: while the ReconnectingBroadcaster is mid-gap (an established backbone session was lost and is being re-acquired — see is_in_backbone_gap()), publish is skipped entirely. Clients stay connected but frozen on the last consistent state; the write still lands in its datasource; and the existing reconnect resync (from #915) makes every worker's clients refetch on recovery — so the whole fleet moves together instead of diverging. Skipping the whole publish also keeps the replay buffer out of the recovery path for these updates: convergence is purely the resync refetch.
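
The gating rule described above can be pictured with a small toy model. This is a minimal sketch and not the PR's code: `ToyBroadcaster`, `ToyEndpoint`, and their fields are hypothetical stand-ins, and only the name `is_in_backbone_gap()` and the meaning of the freeze flag come from the description. Buffering and the reconnect resync are deliberately left out.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ToyBroadcaster:
    """Hypothetical stand-in that tracks only backbone gap state."""

    had_session: bool = False  # a backbone session was established at least once
    connected: bool = False    # the reader is subscribed right now

    def is_in_backbone_gap(self) -> bool:
        # A gap means: had a session, lost it, not yet re-acquired.
        return self.had_session and not self.connected


@dataclass
class ToyEndpoint:
    """Hypothetical endpoint whose publish is gated on the gap state."""

    broadcaster: ToyBroadcaster
    freeze_on_disconnect: bool = True
    delivered: list = field(default_factory=list)

    async def publish(self, topics, data=None) -> bool:
        if self.freeze_on_disconnect and self.broadcaster.is_in_backbone_gap():
            # Skipped entirely: no local delivery and nothing buffered.
            return False
        self.delivered.append((tuple(topics), data))
        return True


async def demo():
    bc = ToyBroadcaster()
    ep = ToyEndpoint(bc)
    results = [await ep.publish(["policy_data"], "v1")]          # never connected
    bc.had_session = bc.connected = True
    results.append(await ep.publish(["policy_data"], "v2"))      # healthy
    bc.connected = False
    results.append(await ep.publish(["policy_data"], "v3"))      # mid-gap: frozen
    bc.connected = True
    results.append(await ep.publish(["policy_data"], "v4"))      # recovered
    return results, [data for _, data in ep.delivered]
```

In this toy, only the publish made mid-gap is skipped. In the real system, the description says clients pick up the missed state through the reconnect resync rather than through this endpoint.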

Configurable via OPAL_BROADCAST_FREEZE_ON_DISCONNECT (default true; set false for the previous behavior).

Scope & semantics (please read if you rely on runtime data updates)

Covered: updates whose data clients re-fetch on reconnect — the configured data sources (OPAL_DATA_CONFIG_SOURCES / scope config) and the policy bundle. These are reconciled fleet-wide on recovery.
Dropped, not deferred: one-off updates published during a gap that fall outside that set — inline data payloads, or fetch URLs not part of the configured sources. Consistency is chosen over freshness for these; if you rely on them, set the flag to false.
Exempt (keep the old deliver-locally + buffer-for-replay behavior): internal coordination topics — __-prefixed channels (statistics protocol, broadcaster keepalive) and the git-webhook trigger topic. These target server-side subscribers, not clients; freezing them would break statistics state and silently drop repo-pull triggers that no resync re-issues.
Guardrail: the resync is the freeze's only recovery path, so enabling the freeze with OPAL_BROADCAST_RESYNC_ON_RECONNECT=false is refused with a warning (freeze disabled) rather than silently losing updates.
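
The guardrail in the last item can be read as a small decision function. The sketch below assumes a hypothetical helper named `effective_freeze`; the actual wiring in `pubsub.py` may look different, and the warning text here is illustrative only.

```python
import logging

logger = logging.getLogger("freeze_guardrail_sketch")


def effective_freeze(freeze_on_disconnect: bool, resync_on_reconnect: bool) -> bool:
    """Return whether the freeze should actually be enabled (hypothetical helper)."""
    if freeze_on_disconnect and not resync_on_reconnect:
        # The resync is the freeze's only recovery path, so refuse the combination
        # instead of silently losing every update published during a gap.
        logger.warning(
            "Freeze requested but resync on reconnect is disabled; freeze disabled"
        )
        return False
    return freeze_on_disconnect
```

With both flags true the freeze stays on; with the resync disabled the function warns and falls back to the previous behavior.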

Hardening (second commit)

An adversarial review of the first commit caught real issues, all addressed:

Gap-only gating — freeze keys off is_in_backbone_gap() (had a session, lost it, retrying), not mere "not subscribed". Freezing while the reader was never started (idle worker — the reader only runs while a listening context is held) or never yet connected (boot) would silently drop publishes on a healthy backbone with nothing to ever reconcile them; those states delegate to the pre-freeze behavior. Gap state is scoped to the reader task's own lifetime, mirroring the resync's semantics.
notify = publish alias re-bound — the upstream endpoint aliases notify = publish at class level, which binds the base publish; without re-binding, endpoint.notify(...) bypassed the gate entirely.
Log hygiene — one WARNING per freeze episode (subsequent hits at DEBUG) plus a suppressed-count summary on recovery, instead of a WARNING per frozen keepalive for the whole outage.
Known limitations (documented in the class docstring; each degrades to the pre-freeze behavior, never worse): an outbound broadcast can fail while the reader subscription is alive (delivered locally + buffered, as before); the gate reopens on re-subscribe before the session proves sustained (a rare connect-flap can slip a publish through); client-originated RPC publishes bypass the endpoint override.
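
The alias bypass listed above is plain Python class semantics and can be reproduced without OPAL. The classes below are hypothetical; they only show why a class-level `notify = publish` keeps pointing at the base function until the subclass re-binds it.

```python
class Base:
    def publish(self, topic):
        return f"base publish: {topic}"

    # Class-level alias: this stores the Base.publish function object itself,
    # so later overrides of publish in subclasses do not affect it.
    notify = publish


class GatedWithoutRebind(Base):
    def publish(self, topic):
        return f"gated publish: {topic}"
    # notify is inherited and still refers to Base.publish.


class GatedWithRebind(Base):
    def publish(self, topic):
        return f"gated publish: {topic}"

    # Re-bind the alias so notify goes through the overriding publish.
    notify = publish
```

Calling `GatedWithoutRebind().notify("t")` runs the base implementation, while `GatedWithRebind().notify("t")` runs the override.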

Validation

Unit: 49 tests green, including a fault-injectable in-memory backbone driving the full gap lifecycle, the freeze matrix (never-started / never-connected / gap / reconnected / exempt topics / disabled / no broadcaster / stock broadcaster), suppression + episode-counter behavior, reader-restart state hygiene, and the notify alias pin.
Multi-server staging (3 servers × 9 clients, shared Postgres backbone, real stop/start and reboot outages):
- Without the fix: update during the outage → 3/9 clients diverge for the entire gap.
- With the fix: same choreography → 0/9 leak (write-target server's own clients included), held across a ~20-minute outage; all 9 clients stayed connected, 0 worker restarts.
- Source-based end-to-end: a value written to a datasource during the gap, with its update notification provably frozen at the gate (and nothing buffered), converged 9/9 after recovery via the resync refetch alone.
- The final run also confirmed the exemptions live (internal topics buffered-for-replay during the gap, as before this PR) and the episode logging (single WARNING + recovery summary).

🤖 Generated with Claude Code

…is down

During a broadcaster-backbone outage, a publish reaching one worker was still delivered to that worker's own clients (local in-process notifier) while peers' clients stayed stale: a transient fleet split lasting the whole outage (observed 3/9 PDPs diverging in a 3-server/9-PDP staging test).

Gate the whole publish on live backbone connectivity: add is_backbone_connected() to ReconnectingBroadcaster (True only while actively subscribed, unlike is_reader_healthy(), which deliberately stays healthy across a transient reconnect) and a FreezablePubSubEndpoint that skips publish entirely while disconnected. Clients stay connected but frozen on the last consistent state; the write still lands in its datasource and the existing reconnect resync makes every client refetch it, so the fleet converges together. Skipping the whole publish also keeps the replay buffer out of the recovery path (no partial replayed state racing the resync refetch).

Configurable via OPAL_BROADCAST_FREEZE_ON_DISCONNECT (default true).

Scope: designed for source-based updates (entry carries a URL clients refetch). Inline updates issued during an outage are dropped rather than deferred. This is accepted because inline is legacy.

Validated on a 3-server/9-PDP staging stack against a real RDS stop/start: during-outage divergence 3/9 -> 0/9, zero client disconnects and zero worker restarts across ~20-minute outages, and a source-based write made during the gap converged 9/9 via the resync refetch alone (its notification provably frozen, nothing replayed).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…exemptions, alias bypass

Review findings on the freeze-on-disconnect gate, all fixed here:

- Gate on a real backbone GAP (had a session, lost it, reader retrying) via is_in_backbone_gap(), not on mere "not subscribed". Freezing while the reader was never started (healthy idle worker: the global listening context is only pinned when statistics are enabled, and upstream cancels the reader when the last listener leaves) or never yet connected (boot) silently dropped publishes fleet-wide with no resync ever firing to reconcile them.
- Exempt internal coordination topics from the freeze: "__"-prefixed channels (statistics protocol, broadcaster keepalive) and the git-webhook trigger topic. These target server-side subscribers, not clients; dropping them broke statistics state (ghost clients, workers that never sync) and lost repo-pull triggers that no resync re-issues (webhooks are the only pull trigger with the default POLICY_REPO_POLLING_INTERVAL=0). Exempt topics keep the pre-freeze deliver-locally + buffer-for-replay behavior.
- Re-bind the library's class-level `notify = publish` alias, which bound the BASE publish and let endpoint.notify(...) bypass the gate entirely.
- Refuse BROADCAST_FREEZE_ON_DISCONNECT with BROADCAST_RESYNC_ON_RECONNECT disabled (warn + disable freeze): the resync is the freeze's only recovery path, so that combination silently lost every update published during gaps.
- Rate-limit the freeze log: WARNING once per gap episode, DEBUG afterwards, and a suppressed-count summary on recovery (a long outage previously emitted an unbounded WARNING per frozen stats keepalive).
- Document recovery scope honestly (configured-source refetch; one-off URL / inline updates are dropped by a freeze) and the residual windows that keep pre-freeze behavior (outbound-failure while subscribed, connect flap, RPC client-originated publishes).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

netlify · 2026-07-02T15:35:05Z

✅ Deploy Preview for opal-docs ready!

Name	Link
🔨 Latest commit	`9060551`
🔍 Latest deploy log	https://app.netlify.com/projects/opal-docs/deploys/6a4a34e6f0994b000710683a
😎 Deploy Preview	https://deploy-preview-933--opal-docs.netlify.app

CI fixes:

- pre-commit / docformatter: new docstrings were not wrapped per the pinned docformatter 1.7.5; re-wrapped in place (no code changes), verified idempotent alongside black/isort and the test suite.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot

Pull request overview

This PR introduces a “fleet-consistency freeze” for OPAL Server publishes during broadcaster backbone gaps so that writes received by a single worker during an outage are not delivered locally (and thus can’t split the fleet) until recovery via reconnect resync.

Changes:

Add backbone subscription/gap tracking to ReconnectingBroadcaster and a new FreezablePubSubEndpoint that suppresses publish() during gaps (with exemptions).
Wire the new endpoint into server pubsub initialization with a safety guardrail when resync is disabled.
Add config flag BROADCAST_FREEZE_ON_DISCONNECT (default true) and new unit tests covering gap lifecycle + freeze behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
packages/opal-server/opal_server/tests/reconnecting_broadcaster_test.py	Adds tests for `is_backbone_connected` / `is_in_backbone_gap` lifecycle and freeze gating behavior.
packages/opal-server/opal_server/pubsub.py	Switches server endpoint to `FreezablePubSubEndpoint` and applies freeze + exemptions + resync guardrail.
packages/opal-server/opal_server/pubsub_resilience.py	Adds backbone connectivity state, gap detection API, and implements `FreezablePubSubEndpoint`.
packages/opal-server/opal_server/config.py	Introduces `BROADCAST_FREEZE_ON_DISCONNECT` configuration option and documentation text.


+    def _is_exempt(self, topics) -> bool:
+        if isinstance(topics, str):
+            topics = [topics]
+        return all(
+            topic.startswith("__") or topic in self._freeze_exempt_topics
+            for topic in topics
+        )
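
The exemption check in this hunk can be exercised on its own. The sketch below restates it as a free function, assuming a hypothetical exempt set containing a made-up topic name `"webhook"` (the real git-webhook topic name is not shown in the hunk). It also makes one edge case visible: `all()` over an empty iterable is true, so an empty topic list counts as exempt under this logic.

```python
from typing import FrozenSet, Iterable, Union


def is_exempt(topics: Union[str, Iterable[str]], exempt_topics: FrozenSet[str]) -> bool:
    """Standalone restatement of the hunk's logic (hypothetical free function)."""
    if isinstance(topics, str):
        # A bare string is treated as a single topic, not as an iterable of chars.
        topics = [topics]
    # Exempt only if every topic is internal ("__" prefix) or explicitly listed.
    return all(
        topic.startswith("__") or topic in exempt_topics for topic in topics
    )


EXEMPT = frozenset({"webhook"})  # assumption: placeholder topic name for illustration
```

A publish that mixes an internal topic with a client-facing one is not exempt, because every topic in the list has to qualify.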


+def _fabricate_gap(broadcaster):
+    """Put a broadcaster into the mid-gap state (had a session, lost it, reader
+    pending) without driving a real backbone.
+
+    Returns the dummy reader task — cancel it in the test's cleanup.
+    """
+    dummy = asyncio.get_event_loop().create_task(asyncio.sleep(60))
+    broadcaster._subscription_task = dummy
+    broadcaster._had_backbone_connection = True
+    broadcaster._backbone_connected = False
+    return dummy
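
One possible variant of this helper, sketched under the assumption that it is always called from inside a running event loop (for example from an async test), uses `asyncio.get_running_loop()`, which raises `RuntimeError` when no loop is running instead of depending on an implicit current loop. The `SimpleNamespace` broadcaster is a hypothetical stub; the three attribute names are taken from the hunk above.

```python
import asyncio
from types import SimpleNamespace


def fabricate_gap(broadcaster):
    """Put a broadcaster stub into the mid-gap state; must run inside a loop."""
    dummy = asyncio.get_running_loop().create_task(asyncio.sleep(60))
    broadcaster._subscription_task = dummy
    broadcaster._had_backbone_connection = True
    broadcaster._backbone_connected = False
    return dummy


async def demo() -> bool:
    # Hypothetical stub standing in for the real broadcaster object.
    bc = SimpleNamespace(
        _subscription_task=None,
        _had_backbone_connection=False,
        _backbone_connected=False,
    )
    dummy = fabricate_gap(bc)
    try:
        in_gap = bc._had_backbone_connection and not bc._backbone_connected
    finally:
        dummy.cancel()
        # Let the cancellation finish so no pending task outlives the test.
        await asyncio.gather(dummy, return_exceptions=True)
    return in_gap and dummy.cancelled()
```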


+    finally:
+        dummy.cancel()
+    # no broadcaster (single worker) -> never


+    finally:
+        dummy.cancel()


… over

An exempt (internal-topic) publish arriving mid-gap reaches the delivery path and was resetting the freeze-episode counter and logging "Backbone recovered" while the backbone was still down. Emit the summary only when the gap has actually ended; exempt deliveries mid-gap no longer end the episode.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
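
The episode logging described in this commit can be sketched as a tiny state holder. Everything below is an assumption made for illustration: the class name `FreezeEpisodeLog`, its method names, the INFO level of the summary, and the choice to count only the hits after the first WARNING are not confirmed by the PR text.

```python
import logging


class FreezeEpisodeLog:
    """Hypothetical helper: one WARNING per gap episode, DEBUG afterwards."""

    def __init__(self, logger: logging.Logger):
        self._logger = logger
        self._active = False   # True while a freeze episode is open
        self._suppressed = 0   # frozen publishes logged at DEBUG in this episode

    def on_frozen_publish(self, topics) -> None:
        if not self._active:
            self._active = True
            self._logger.warning("Backbone gap: freezing publish to %s", topics)
        else:
            self._suppressed += 1
            self._logger.debug("Backbone gap: freezing publish to %s", topics)

    def on_delivery(self, in_gap: bool) -> None:
        # An exempt delivery mid-gap must not close the episode.
        if self._active and not in_gap:
            self._logger.info(
                "Backbone recovered; %d further frozen publishes were suppressed",
                self._suppressed,
            )
            self._active = False
            self._suppressed = 0


def demo():
    records = []

    class ListHandler(logging.Handler):
        def emit(self, record):
            records.append(record.levelname)

    logger = logging.getLogger("freeze_episode_sketch")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers = [ListHandler()]

    log = FreezeEpisodeLog(logger)
    log.on_frozen_publish("policy_data")  # first hit of the episode
    log.on_frozen_publish("policy_data")  # later hit
    log.on_delivery(in_gap=True)          # exempt delivery while still in the gap
    log.on_frozen_publish("policy_data")  # episode is still open
    log.on_delivery(in_gap=False)         # gap ended: summary emitted once
    return records
```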

…antics

CI fixes:

- E2E Tests / App Tests: the cross-instance consistency test asserted the pre-freeze behavior (a one-off update published during a backbone outage reaches clients via buffer+replay). With BROADCAST_FREEZE_ON_DISCONNECT (default on) that publish is frozen so the fleet never splits. The test now asserts the new semantics: the publish is frozen at the gate, exempt internal topics still ride buffer+replay, recovery resyncs clients, the frozen update reached NO client (new check_clients_not_logged helper), and a post-recovery re-publish converges both clients together and logs the freeze-episode summary.

Also raise the gitea startup wait budget (120s -> 300s): slow bind-mount environments need minutes to bring the web listener up; fast environments exit the poll early and pay nothing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Zivxx and others added 2 commits July 2, 2026 17:19

zeevmoney requested review from Copilot and zeevmoney July 2, 2026 15:51

Copilot started reviewing on behalf of zeevmoney July 2, 2026 15:51

Copilot AI reviewed Jul 2, 2026

Zivxx and others added 3 commits July 5, 2026 13:29

Merge branch 'master' into fix/opal-broadcaster-freeze-on-disconnect

9060551

Zivxx wants to merge 6 commits into master from fix/opal-broadcaster-freeze-on-disconnect
