Skip to content

recovery_pause_on_logical_slot_conflict: pause instead of invalidate (3-patch series)#40

Open
NikolayS wants to merge 3 commits into
masterfrom
feature/recovery-pause-logical-slot
Open

recovery_pause_on_logical_slot_conflict: pause instead of invalidate (3-patch series)#40
NikolayS wants to merge 3 commits into
masterfrom
feature/recovery-pause-logical-slot

Conversation

@NikolayS

Copy link
Copy Markdown
Owner

Summary

Logical replication slots on standbys are invalidated whenever WAL replay applies a catalog vacuum record whose snapshotConflictHorizon exceeds the slot's catalog_xmin. For archive-only standbys (no streaming link to primary), this is unavoidable and typically happens ~2×autovacuum_naptime after slot creation, breaking continuous CDC pipelines.

This series introduces recovery_pause_on_logical_slot_conflict: pause replay instead of invalidating, giving an operator (or automation) time to drain the slot. The slot survives; replay resumes.

Patch structure

Patch 1 — xlogrecovery: make ConfirmRecoveryPaused and CheckForStandbyTrigger extern

MaybePauseOnLogicalSlotConflict runs inside ResolveRecoveryConflictWithSnapshot, not in the main recovery loop. Its wait loop needs to (a) transition RECOVERY_PAUSE_REQUESTED → RECOVERY_PAUSED and (b) break on pg_promote(). Both functions were static. No behaviour change — visibility only.

Patch 2 — Add recovery_pause_on_logical_slot_conflict GUC

New PGC_SIGHUP boolean GUC (default off). When on, hooks into ResolveRecoveryConflictWithSnapshot() via MaybePauseOnLogicalSlotConflict(). Reuses the existing SetRecoveryPause / recoveryNotPausedCV machinery — no new shared memory.

Crash safety: after advancing catalog_xmin in memory, CheckPointReplicationSlots(false) flushes dirty slots to disk immediately. This upholds the write-before-memory-update invariant from LogicalConfirmReceivedLocation (logical.c): the on-disk state must lead the in-memory state so vacuum cannot reclaim catalog tuples the slot still needs.

Includes TAP test with 10 assertions covering slot survival, pause/resume cycle, baseline invalidation (GUC off), and pg_promote() under pause.

Patch 3 — Auto-resume recovery once the logical slot conflict is resolved

Lifts the slot-still-blocking predicate into AnySlotStillBlocksConflict() and polls it every 1 s inside the wait loop. When no slot blocks, flips pause state to NOT_PAUSED and exits — no operator intervention needed. Manual pg_wal_replay_resume() still works as an escape hatch. pg_promote() still breaks the loop.

Design principles (from design review 2026-05-27 with Andrey Borodin)

  • Correctness over performance: the CheckPointReplicationSlots(false) call on the resume path is one fsync on a rare code path. The alternative (defer to next restartpoint) would violate the write-ordering invariant that the rest of the slot machinery upholds.
  • No new shared memory: reuses existing pause/resume infrastructure.
  • Each patch independently useful: Patch 1 is clean refactoring; Patch 2 gives manual-drain workflow; Patch 3 adds automation.

Target

July 2026 PostgreSQL commitfest (PG20 development branch).

Test plan

  • make -C src/test/recovery check TESTS=050_recovery_pause_on_slot_conflict passes
  • GUC visible in pg_settings with correct description
  • Slot survives catalog prune with GUC on; invalidates with GUC off (existing behaviour unchanged)
  • pg_promote() completes in < 10 s while standby is paused
  • Auto-resume fires within ~15 s after slot is drained (bgwriter xl_running_xacts cadence)

🤖 Generated with Claude Code

@NikolayS

Copy link
Copy Markdown
Owner Author

samorev: concurrency/API design review

Static review focused on locking, condition variables, startup process interactions, promotion/manual resume behavior, slot persistence invariants, and externally visible semantics.

Findings:

  1. Critical — src/backend/storage/ipc/standby.c:721-760 — crash-safety invariant is still violated

    • Evidence: code sets s->effective_catalog_xmin = new_xmin / s->effective_xmin = new_xmin before CheckPointReplicationSlots(false).
    • Issue: this exposes the advanced xmin to vacuum before the slot state is durable. Crash in that window can restart with old on-disk xmin after vacuum already removed tuples the slot may need.
    • Suggested fix: follow LogicalConfirmReceivedLocation() ordering: update durable slot data/save first, then publish effective xmin.
  2. High — src/backend/storage/ipc/standby.c:669 — auto-resume can clear a user-requested recovery pause

    • Evidence: auto-resume calls SetRecoveryPause(false) using the global recovery pause state.
    • Issue: if an operator calls pg_wal_replay_pause() while this conflict pause is active, the auto-resume path can unpause recovery anyway. This changes externally visible pause semantics.
    • Suggested fix: track ownership/reason for this pause separately, or only auto-clear pauses known to be created by this feature and not superseded by a manual pause request.
  3. High — src/backend/storage/ipc/standby.c:686-730 — catalog-conflict path advances xmin, not just catalog_xmin

    • Evidence: advance predicate includes s->data.xmin, then sets s->data.xmin = new_xmin and s->effective_xmin = new_xmin.
    • Issue: the feature is justified by catalog conflicts, but xmin protects non-catalog tuple visibility for logical decoding. Advancing it based only on a catalog prune conflict horizon may drop protection the slot still needs.
    • Suggested fix: don’t mutate xmin here unless there is a separate proof it is safe; constrain the conflict handling to catalog_xmin, or adjust the invalidation path so logical catalog conflicts don’t require unsafe xmin advancement.

No tests run; review was static only.

@NikolayS

NikolayS commented Jun 3, 2026

Copy link
Copy Markdown
Owner Author

Bug (patch 3, auto-resume): pre-existing user pause gets clobbered.
In MaybePauseOnLogicalSlotConflict() (standby.c) the auto-resume path calls SetRecoveryPause(false) unconditionally. If an operator ran pg_wal_replay_pause() before the conflict fired, auto-resume silently clears their pause and recovery resumes against intent — single global pause state, no record of who paused.
Fix (~10 lines): capture user_paused = (GetRecoveryPauseState() != RECOVERY_NOT_PAUSED) before SetRecoveryPause(true); on auto-resume only clear when !user_paused. Add a subtest (currently uncovered: test exercises pause+auto-resume, control, promote-escape, manual-resume — but not user-pause-before-conflict).

@NikolayS

NikolayS commented Jun 3, 2026

Copy link
Copy Markdown
Owner Author

Test not registered in meson build.
t/050_recovery_pause_on_slot_conflict.pl is missing from the tests = [...] list in src/test/recovery/meson.build → meson/CI never runs it (verified: 0 matches; build is otherwise clean). As committed the 329-line test provides no protection. Add the entry. Minor: filename 050 collides with existing 050_redo_segment_missing.pl; renumber to 054 while at it.

@NikolayS

NikolayS commented Jun 3, 2026

Copy link
Copy Markdown
Owner Author

Code duplication (patch 2/3) — consider consolidating:

  1. Slot scan: AnySlotStillBlocksConflict() iterates all slots under ReplicationSlotControlLock doing catalog_xmin-vs-horizon checks that InvalidateObsoleteReplicationSlots()/DetermineSlotInvalidationCause() already do — two back-to-back scans under the same lock in the conflict path. (Comment already notes the PrecedesOrEquals mirroring, so drift risk is real.)
  2. Wait loop: the CV-sleep + ConfirmRecoveryPaused() loop largely duplicates recoveryPausesHere().
    Not blocking, but a reviewer will ask to share logic / reduce the two-scan cost.

@NikolayS

NikolayS commented Jun 3, 2026

Copy link
Copy Markdown
Owner Author

Review session takeaways (action items)

Before/at submission:

  1. Register the TAP test in src/test/recovery/meson.build (currently never runs) + renumber 050→054.
  2. Fix user-pause-clobber bug (patch 3): capture user_paused before SetRecoveryPause(true); auto-clear only if !user_paused; add a subtest.
  3. Get one clean pass on Linux/cfbot — real 'does it work' proof (local macOS env broken: SIP breaks all recovery TAP tests; community CI down).

Polish:
4. Commit-1 comment forward-references MaybePauseOnLogicalSlotConflict (defined in commit 2) → make generic.
5. Document why dboid + snapshotConflictHorizon params are needed.

Decided: keep current 3-commit structure (readability > merging into InvalidateObsoleteReplicationSlots; lock-count/dup deferred). Build + rebase-onto-master both clean.

@NikolayS

NikolayS commented Jun 3, 2026

Copy link
Copy Markdown
Owner Author

Test verdict: core behavior VALIDATED; the timeout was test flakiness (not a patch bug)

Re-ran 054 with a 3× timeout on a clean-ish env. Tests 1–7 passed: GUC-on logical slot survived the catalog prune (22 pauses observed, 3092 events decoded); GUC-off control slot invalidated as expected. So the pause/auto-resume/drain mechanism works.

The 1000s timeout was a test bug, not patch code. Phase-1's archive-readiness poll (lines 82–91) keys off $phase1_seg (segment 3), but the second pg_log_standby_snapshot() anchor lands in a later segment (seg 7) that stays an unarchived open segment — so pg_create_logical_replication_slot blocks in DecodingContextFindStartpoint waiting for SNAPBUILD_CONSISTENT on WAL that never arrives. Phase-2's WAL switch that would ship it is gated behind the hung call → deadlock by construction on fast machines. Same class recurs at line 305 (promote standby).

New action item #6: make the phase-1 poll wait for the segment holding the snapshot anchor to close+archive before creating slots (both sites). This is the real flakiness fix; CI would hit it intermittently.

Nik Samokhvalov and others added 3 commits June 3, 2026 11:20
…xtern

MaybePauseOnLogicalSlotConflict (introduced in the next commit) runs
inside ResolveRecoveryConflictWithSnapshot, which is called from the
WAL apply path rather than the main recovery loop.  Its wait loop must
do the same two things recoveryPausesHere() does:

  1. Transition RECOVERY_PAUSE_REQUESTED -> RECOVERY_PAUSED so that
     pg_wal_replay_resume() can release the pause.
  2. Check for a promote signal so that pg_promote() does not stall
     while the startup process is sleeping inside the slot-conflict wait.

Both are currently static.  Remove the static qualifier and add extern
declarations to xlogrecovery.h so standby.c can call them.

No behaviour change — only visibility changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a new GUC, recovery_pause_on_logical_slot_conflict (PGC_SIGHUP,
default off). When enabled, WAL replay on a standby pauses instead of
invalidating an active logical replication slot whose catalog_xmin
would be overtaken by a Heap2/PRUNE_ON_ACCESS record's
snapshotConflictHorizon. An operator can then drain the slot via
pg_logical_slot_get_changes and call pg_wal_replay_resume() to
continue. On resume, the slot's catalog_xmin is advanced past the
conflict horizon so the subsequent InvalidateObsoleteReplicationSlots
call becomes a no-op; replay continues to the next conflict and the
cycle repeats.

This makes logical decoding from an archive-only standby (no streaming
replication link to the primary) viable for continuous CDC. Without
this GUC, slots on such standbys are invalidated the first time replay
applies a catalog vacuum record whose horizon exceeds the slot's
catalog_xmin — typically ~2 * autovacuum_naptime after slot creation.

Hooks into ResolveRecoveryConflictWithSnapshot(), the single choke
point in the replay path for RS_INVAL_HORIZON conflicts, via a new
MaybePauseOnLogicalSlotConflict() function in standby.c. Reuses the
existing SetRecoveryPause / recoveryNotPausedCV machinery — no new
shared-memory state. Hot path when GUC off is one boolean early-return.

Edge cases handled:
- Slots still inside DecodingContextFindStartpoint
  (effective_catalog_xmin not yet valid) are skipped. Pausing for them
  would deadlock: snapbuild needs WAL to advance, pause holds it back.
  Invalidating an in-progress slot is harmless — the caller retries.
- Pause-check uses TransactionIdPrecedesOrEquals to match the
  semantics of DetermineSlotInvalidationCause. Without that, a slot
  whose catalog_xmin was just advanced to horizon+1 by a previous
  pause cycle would fail to re-pause on a subsequent record with
  horizon == catalog_xmin, yet would still be invalidated.
- CheckForStandbyTrigger() is called in the wait loop so pg_promote()
  does not stall while paused. Mirrors the existing recoveryPausesHere
  escape loop.
- Synced slots (data.synced == true) are skipped in both the
  pause-check and advance scans. Writing to their fields from the
  startup process would race with the slot-sync worker.

Crash safety: after advancing catalog_xmin in memory, dirty slots are
flushed to disk immediately via CheckPointReplicationSlots(false) before
returning. This upholds the write-before-memory-update invariant
established by LogicalConfirmReceivedLocation (logical.c): the on-disk
state must reflect any advance before the in-memory value becomes
visible, so that vacuum cannot reclaim catalog tuples the slot still
needs. Deferring to the next restartpoint would leave a crash window.

Includes a TAP test (050_recovery_pause_on_slot_conflict.pl) covering:
- GUC registration
- Slot survival through catalog PRUNE_ON_ACCESS records (GUC on)
- Baseline slot invalidation (GUC off, unchanged upstream behaviour)
- pg_promote() succeeds in under 10 s while the standby is paused
  (guards the CheckForStandbyTrigger() escape path)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous behavior under recovery_pause_on_logical_slot_conflict
required the operator to both drain (or drop / advance) the slot AND
call pg_wal_replay_resume() to continue — two steps, even though the
first step is the one that matters semantically. That split also meant
the feature couldn't underpin a continuous-CDC service without
external orchestration to issue the resume.

Lift the scan predicate ("does any slot in `dboid` still block this
conflict?") out of the initial check into a helper
AnySlotStillBlocksConflict(). Call it again every 1s inside the
existing wait loop. When it returns false, flip the pause state to
NOT_PAUSED and let the loop exit; the existing post-wait advance then
bumps catalog_xmin past the horizon on drained slots so the
fall-through InvalidateObsoleteReplicationSlots() is a no-op.

"No longer blocking" covers every unblock path, not just drain:

  * drained past the pause LSN (confirmed_flush >= captured
    conflict_lsn) — the main case
  * slot dropped (pg_drop_replication_slot) — removed from the scan
  * slot advanced (pg_replication_slot_advance) — catalog_xmin moves
    past the horizon
  * slot invalidated for another reason (e.g. RS_INVAL_WAL_REMOVED
    from max_slot_wal_keep_size, applied by the checkpointer, which
    runs even while the startup process is asleep in our wait loop)
    — data.invalidated != RS_INVAL_NONE, scan skips it

Manual pg_wal_replay_resume() still works as the "give up on this
slot and let it invalidate" escape hatch, and CheckForStandbyTrigger
still breaks the loop for pg_promote().

Capture conflict_lsn once at pause time and reuse it for both the
in-wait predicate and the post-wait advance, replacing the redundant
second GetXLogReplayRecPtr() call.

GUC long_desc, postgresql.conf.sample comment, and the xlogrecovery.c
variable-decl comment updated to describe auto-resume.
@NikolayS NikolayS force-pushed the feature/recovery-pause-logical-slot branch from 3c543dc to 44eceb3 Compare June 3, 2026 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants