Skip to content

fix(opal-server): freeze publishes during broadcaster backbone gaps to keep the fleet consistent#933

Open
Zivxx wants to merge 6 commits into
masterfrom
fix/opal-broadcaster-freeze-on-disconnect
Open

fix(opal-server): freeze publishes during broadcaster backbone gaps to keep the fleet consistent#933
Zivxx wants to merge 6 commits into
masterfrom
fix/opal-broadcaster-freeze-on-disconnect

Conversation

@Zivxx

@Zivxx Zivxx commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

The problem

Follow-up to #915. That PR made a fleet of OPAL servers survive a broadcast-backbone outage (reconnect with backoff, buffer outbound broadcasts, resync clients on recovery). Multi-server validation of it surfaced a consistency gap during the outage:

A server-side publish fans out two independent ways — local in-process delivery to that worker's own clients, and the broadcaster to peer workers. Only the outbound path is buffered when the backbone is down; local delivery still fires. So an update that reaches one server mid-outage is applied to its clients while every other server's clients hold old data — a fleet split that lasts the whole outage. In a 3-server / 9-client test against a real managed-Postgres backbone stop, a data update published during the outage left the fleet at 3/9 new vs 6/9 old for the entire ~13-minute gap. For an authorization system, that means the same request can be allowed by one PDP and denied by another until the backbone returns.

The fix

FreezablePubSubEndpoint gates client-facing publishes on live backbone connectivity: while the ReconnectingBroadcaster is mid-gap (an established backbone session was lost and is being re-acquired — see is_in_backbone_gap()), publish is skipped entirely. Clients stay connected but frozen on the last consistent state; the write still lands in its datasource; and the existing reconnect resync (from #915) makes every worker's clients refetch on recovery — so the whole fleet moves together instead of diverging. Skipping the whole publish also keeps the replay buffer out of the recovery path for these updates: convergence is purely the resync refetch.

Configurable via OPAL_BROADCAST_FREEZE_ON_DISCONNECT (default true; set false for the previous behavior).

Scope & semantics (please read if you rely on runtime data updates)

  • Covered: updates whose data clients re-fetch on reconnect — the configured data sources (OPAL_DATA_CONFIG_SOURCES / scope config) and the policy bundle. These are reconciled fleet-wide on recovery.
  • Dropped, not deferred: one-off updates published during a gap that fall outside that set — inline data payloads, or fetch URLs not part of the configured sources. Consistency is chosen over freshness for these; if you rely on them, set the flag to false.
  • Exempt (keep the old deliver-locally + buffer-for-replay behavior): internal coordination topics — __-prefixed channels (statistics protocol, broadcaster keepalive) and the git-webhook trigger topic. These target server-side subscribers, not clients; freezing them would break statistics state and silently drop repo-pull triggers that no resync re-issues.
  • Guardrail: the resync is the freeze's only recovery path, so enabling the freeze with OPAL_BROADCAST_RESYNC_ON_RECONNECT=false is refused with a warning (freeze disabled) rather than silently losing updates.

Hardening (second commit)

An adversarial review of the first commit caught real issues, all addressed:

  • Gap-only gating — freeze keys off is_in_backbone_gap() (had a session, lost it, retrying), not mere "not subscribed". Freezing while the reader was never started (idle worker — the reader only runs while a listening context is held) or never yet connected (boot) would silently drop publishes on a healthy backbone with nothing to ever reconcile them; those states delegate to the pre-freeze behavior. Gap state is scoped to the reader task's own lifetime, mirroring the resync's semantics.
  • notify = publish alias re-bound — the upstream endpoint aliases notify = publish at class level, which binds the base publish; without re-binding, endpoint.notify(...) bypassed the gate entirely.
  • Log hygiene — one WARNING per freeze episode (subsequent hits at DEBUG) plus a suppressed-count summary on recovery, instead of a WARNING per frozen keepalive for the whole outage.
  • Known limitations (documented in the class docstring; each degrades to the pre-freeze behavior, never worse): an outbound broadcast can fail while the reader subscription is alive (delivered locally + buffered, as before); the gate reopens on re-subscribe before the session proves sustained (a rare connect-flap can slip a publish through); client-originated RPC publishes bypass the endpoint override.

Validation

  • Unit: 49 tests green, including a fault-injectable in-memory backbone driving the full gap lifecycle, the freeze matrix (never-started / never-connected / gap / reconnected / exempt topics / disabled / no broadcaster / stock broadcaster), suppression + episode-counter behavior, reader-restart state hygiene, and the notify alias pin.
  • Multi-server staging (3 servers × 9 clients, shared Postgres backbone, real stop/start and reboot outages):
    • Without the fix: update during the outage → 3/9 clients diverge for the entire gap.
    • With the fix: same choreography → 0/9 leak (write-target server's own clients included), held across a ~20-minute outage; all 9 clients stayed connected, 0 worker restarts.
    • Source-based end-to-end: a value written to a datasource during the gap, with its update notification provably frozen at the gate (and nothing buffered), converged 9/9 after recovery via the resync refetch alone.
    • The final run also confirmed the exemptions live (internal topics buffered-for-replay during the gap, as before this PR) and the episode logging (single WARNING + recovery summary).

🤖 Generated with Claude Code

Zivxx and others added 2 commits July 2, 2026 17:19
…is down

During a broadcaster-backbone outage, a publish reaching one worker was still
delivered to that worker's own clients (local in-process notifier) while peers'
clients stayed stale — a transient fleet split lasting the whole outage
(observed 3/9 PDPs diverging in a 3-server/9-PDP staging test).

Gate the whole publish on live backbone connectivity: add
is_backbone_connected() to ReconnectingBroadcaster (True only while actively
subscribed — unlike is_reader_healthy(), which deliberately stays healthy
across a transient reconnect) and a FreezablePubSubEndpoint that skips publish
entirely while disconnected. Clients stay connected but frozen on the last
consistent state; the write still lands in its datasource and the existing
reconnect resync makes every client refetch it, so the fleet converges
together. Skipping the whole publish also keeps the replay buffer out of the
recovery path (no partial replayed state racing the resync refetch).

Configurable via OPAL_BROADCAST_FREEZE_ON_DISCONNECT (default true).

Scope: designed for source-based updates (entry carries a URL clients
refetch). Inline updates issued during an outage are dropped rather than
deferred — accepted, inline is legacy.

Validated on a 3-server/9-PDP staging stack against a real RDS stop/start:
during-outage divergence 3/9 -> 0/9, zero client disconnects and zero worker
restarts across ~20-minute outages, and a source-based write made during the
gap converged 9/9 via the resync refetch alone (its notification provably
frozen, nothing replayed).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…exemptions, alias bypass

Review findings on the freeze-on-disconnect gate, all fixed here:

- Gate on a real backbone GAP (had a session, lost it, reader retrying) via
  is_in_backbone_gap(), not on mere "not subscribed". Freezing while the reader
  was never started (healthy idle worker: the global listening context is only
  pinned when statistics are enabled, and upstream cancels the reader when the
  last listener leaves) or never yet connected (boot) silently dropped
  publishes fleet-wide with no resync ever firing to reconcile them.
- Exempt internal coordination topics from the freeze: "__"-prefixed channels
  (statistics protocol, broadcaster keepalive) and the git-webhook trigger
  topic. These target server-side subscribers, not clients — dropping them
  broke statistics state (ghost clients, workers that never sync) and lost
  repo-pull triggers that no resync re-issues (webhooks are the only pull
  trigger with the default POLICY_REPO_POLLING_INTERVAL=0). Exempt topics keep
  the pre-freeze deliver-locally + buffer-for-replay behavior.
- Re-bind the library's class-level `notify = publish` alias, which bound the
  BASE publish and let endpoint.notify(...) bypass the gate entirely.
- Refuse BROADCAST_FREEZE_ON_DISCONNECT with BROADCAST_RESYNC_ON_RECONNECT
  disabled (warn + disable freeze): the resync is the freeze's only recovery
  path, so that combination silently lost every update published during gaps.
- Rate-limit the freeze log: WARNING once per gap episode, DEBUG afterwards,
  and a suppressed-count summary on recovery (a long outage previously emitted
  an unbounded WARNING per frozen stats keepalive).
- Document recovery scope honestly (configured-source refetch; one-off URL /
  inline updates are dropped by a freeze) and the residual windows that keep
  pre-freeze behavior (outbound-failure while subscribed, connect flap, RPC
  client-originated publishes).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@netlify

netlify Bot commented Jul 2, 2026

Copy link
Copy Markdown

Deploy Preview for opal-docs ready!

Name Link
🔨 Latest commit 9060551
🔍 Latest deploy log https://app.netlify.com/projects/opal-docs/deploys/6a4a34e6f0994b000710683a
😎 Deploy Preview https://deploy-preview-933--opal-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

CI fixes:
- pre-commit / docformatter: new docstrings were not wrapped per the pinned
  docformatter 1.7.5; re-wrapped in place (no code changes), verified
  idempotent alongside black/isort and the test suite.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a “fleet-consistency freeze” for OPAL Server publishes during broadcaster backbone gaps so that writes received by a single worker during an outage are not delivered locally (and thus can’t split the fleet) until recovery via reconnect resync.

Changes:

  • Add backbone subscription/gap tracking to ReconnectingBroadcaster and a new FreezablePubSubEndpoint that suppresses publish() during gaps (with exemptions).
  • Wire the new endpoint into server pubsub initialization with a safety guardrail when resync is disabled.
  • Add config flag BROADCAST_FREEZE_ON_DISCONNECT (default true) and new unit tests covering gap lifecycle + freeze behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
packages/opal-server/opal_server/tests/reconnecting_broadcaster_test.py Adds tests for is_backbone_connected / is_in_backbone_gap lifecycle and freeze gating behavior.
packages/opal-server/opal_server/pubsub.py Switches server endpoint to FreezablePubSubEndpoint and applies freeze + exemptions + resync guardrail.
packages/opal-server/opal_server/pubsub_resilience.py Adds backbone connectivity state, gap detection API, and implements FreezablePubSubEndpoint.
packages/opal-server/opal_server/config.py Introduces BROADCAST_FREEZE_ON_DISCONNECT configuration option and documentation text.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +667 to +673
def _is_exempt(self, topics) -> bool:
if isinstance(topics, str):
topics = [topics]
return all(
topic.startswith("__") or topic in self._freeze_exempt_topics
for topic in topics
)
Comment on lines +944 to +954
def _fabricate_gap(broadcaster):
"""Put a broadcaster into the mid-gap state (had a session, lost it, reader
pending) without driving a real backbone.

Returns the dummy reader task — cancel it in the test's cleanup.
"""
dummy = asyncio.get_event_loop().create_task(asyncio.sleep(60))
broadcaster._subscription_task = dummy
broadcaster._had_backbone_connection = True
broadcaster._backbone_connected = False
return dummy
Comment on lines +993 to +995
finally:
dummy.cancel()
# no broadcaster (single worker) -> never
Comment on lines +1026 to +1027
finally:
dummy.cancel()
Zivxx and others added 3 commits July 5, 2026 13:29
… over

An exempt (internal-topic) publish arriving mid-gap reaches the delivery path
and was resetting the freeze-episode counter and logging "Backbone recovered"
while the backbone was still down. Emit the summary only when the gap has
actually ended; exempt deliveries mid-gap no longer end the episode.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…antics

CI fixes:
- E2E Tests / App Tests: the cross-instance consistency test asserted the
  pre-freeze behavior (a one-off update published during a backbone outage
  reaches clients via buffer+replay). With BROADCAST_FREEZE_ON_DISCONNECT
  (default on) that publish is frozen so the fleet never splits. The test now
  asserts the new semantics: the publish is frozen at the gate, exempt
  internal topics still ride buffer+replay, recovery resyncs clients, the
  frozen update reached NO client (new check_clients_not_logged helper), and
  a post-recovery re-publish converges both clients together and logs the
  freeze-episode summary.

Also raise the gitea startup wait budget (120s -> 300s): slow bind-mount
environments need minutes to bring the web listener up; fast environments
exit the poll early and pay nothing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants