recovery_pause_on_logical_slot_conflict: pause instead of invalidate (3-patch series) #40
NikolayS wants to merge 3 commits into
samorev: concurrency/API design review
Static review focused on locking, condition variables, startup process interactions, promotion/manual resume behavior, slot persistence invariants, and externally visible semantics. Findings:
No tests run; review was static only.
Bug (patch 3, auto-resume): pre-existing user pause gets clobbered.
Test not registered in meson build.
Code duplication (patch 2/3), consider consolidating:
Review session takeaways (action items)
Before/at submission:
Polish:
Decided: keep current 3-commit structure (readability > merging into InvalidateObsoleteReplicationSlots; lock-count/dup deferred). Build + rebase-onto-master both clean.
Test verdict: core behavior VALIDATED; the timeout was test flakiness (not a patch bug)
Re-ran 054 with a 3× timeout on a clean-ish env. Tests 1–7 passed: GUC-on logical slot survived the catalog prune (22 pauses observed, 3092 events decoded); GUC-off control slot invalidated as expected. So the pause/auto-resume/drain mechanism works. The 1000s timeout was a test bug, not patch code. Phase-1's archive-readiness poll (lines 82–91) keys off
New action item #6: make the phase-1 poll wait for the segment holding the snapshot anchor to close+archive before creating slots (both sites). This is the real flakiness fix; CI would hit it intermittently.
…xtern
MaybePauseOnLogicalSlotConflict (introduced in the next commit) runs
inside ResolveRecoveryConflictWithSnapshot, which is called from the
WAL apply path rather than the main recovery loop. Its wait loop must
do the same two things recoveryPausesHere() does:
1. Transition RECOVERY_PAUSE_REQUESTED -> RECOVERY_PAUSED so that
pg_wal_replay_resume() can release the pause.
2. Check for a promote signal so that pg_promote() does not stall
while the startup process is sleeping inside the slot-conflict wait.
ConfirmRecoveryPaused() and CheckForStandbyTrigger() are both currently
static. Remove the static qualifier and add extern declarations to
xlogrecovery.h so standby.c can call them.
No behaviour change — only visibility changes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
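The two duties the commit message assigns to the wait loop can be illustrated with a small standalone model. This is only a sketch: the type and function names below (ToyPauseState, toy_pause_wait_step, and so on) are hypothetical stand-ins, not PostgreSQL symbols, and the real loop sleeps on a condition variable between iterations instead of being stepped by a caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the recovery pause states named above. */
typedef enum
{
    TOY_NOT_PAUSED,
    TOY_PAUSE_REQUESTED,
    TOY_PAUSED
} ToyPauseState;

/* What one pass through the wait loop decides. */
typedef enum
{
    WAIT_KEEP_SLEEPING,     /* still paused, sleep and check again */
    WAIT_RESUMED,           /* pause was released, replay may continue */
    WAIT_PROMOTED           /* promotion requested, leave the wait */
} ToyWaitOutcome;

/*
 * One iteration of a simplified pause wait loop.
 *
 * promote_requested models the result of a promote-signal check;
 * the REQUESTED -> PAUSED transition models confirming the pause so
 * that a later resume request has something to release.
 */
ToyWaitOutcome
toy_pause_wait_step(ToyPauseState *state, bool promote_requested)
{
    assert(state != NULL);

    if (*state == TOY_NOT_PAUSED)
        return WAIT_RESUMED;

    if (promote_requested)
        return WAIT_PROMOTED;

    if (*state == TOY_PAUSE_REQUESTED)
        *state = TOY_PAUSED;

    return WAIT_KEEP_SLEEPING;
}
```

In this model a caller would invoke toy_pause_wait_step() once per wakeup and stop looping on any outcome other than WAIT_KEEP_SLEEPING. Skipping the promote check is what would leave a promotion request stuck behind the sleep.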
Add a new GUC, recovery_pause_on_logical_slot_conflict (PGC_SIGHUP,
default off). When enabled, WAL replay on a standby pauses instead of
invalidating an active logical replication slot whose catalog_xmin
would be overtaken by a Heap2/PRUNE_ON_ACCESS record's
snapshotConflictHorizon. An operator can then drain the slot via
pg_logical_slot_get_changes and call pg_wal_replay_resume() to
continue. On resume, the slot's catalog_xmin is advanced past the
conflict horizon so the subsequent InvalidateObsoleteReplicationSlots
call becomes a no-op; replay continues to the next conflict and the
cycle repeats.

This makes logical decoding from an archive-only standby (no streaming
replication link to the primary) viable for continuous CDC. Without
this GUC, slots on such standbys are invalidated the first time replay
applies a catalog vacuum record whose horizon exceeds the slot's
catalog_xmin, typically ~2 * autovacuum_naptime after slot creation.

Hooks into ResolveRecoveryConflictWithSnapshot(), the single choke
point in the replay path for RS_INVAL_HORIZON conflicts, via a new
MaybePauseOnLogicalSlotConflict() function in standby.c. Reuses the
existing SetRecoveryPause / recoveryNotPausedCV machinery; no new
shared-memory state. Hot path when GUC off is one boolean
early-return.

Edge cases handled:

- Slots still inside DecodingContextFindStartpoint
  (effective_catalog_xmin not yet valid) are skipped. Pausing for them
  would deadlock: snapbuild needs WAL to advance, pause holds it back.
  Invalidating an in-progress slot is harmless; the caller retries.
- Pause-check uses TransactionIdPrecedesOrEquals to match the
  semantics of DetermineSlotInvalidationCause. Without that, a slot
  whose catalog_xmin was just advanced to horizon+1 by a previous
  pause cycle would fail to re-pause on a subsequent record with
  horizon == catalog_xmin, yet would still be invalidated.
- CheckForStandbyTrigger() is called in the wait loop so pg_promote()
  does not stall while paused. Mirrors the existing recoveryPausesHere
  escape loop.
- Synced slots (data.synced == true) are skipped in both the
  pause-check and advance scans. Writing to their fields from the
  startup process would race with the slot-sync worker.

Crash safety: after advancing catalog_xmin in memory, dirty slots are
flushed to disk immediately via CheckPointReplicationSlots(false)
before returning. This upholds the write-before-memory-update
invariant established by LogicalConfirmReceivedLocation (logical.c):
the on-disk state must reflect any advance before the in-memory value
becomes visible, so that vacuum cannot reclaim catalog tuples the slot
still needs. Deferring to the next restartpoint would leave a crash
window.

Includes a TAP test (050_recovery_pause_on_slot_conflict.pl) covering:

- GUC registration
- Slot survival through catalog PRUNE_ON_ACCESS records (GUC on)
- Baseline slot invalidation (GUC off, unchanged upstream behaviour)
- pg_promote() succeeds in under 10 s while the standby is paused
  (guards the CheckForStandbyTrigger() escape path)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
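The TransactionIdPrecedesOrEquals edge case above is easiest to see with concrete numbers. The sketch below is a standalone model, not code from the patch: every toy_ name is hypothetical, and the comparison helpers only imitate the modulo-2^32 ordering PostgreSQL applies to normal transaction IDs (those at or above 3). It contrasts an inclusive pause check with a strict one when horizon == catalog_xmin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;    /* stand-in for a 32-bit transaction ID */

/* Strict circular ordering; this sketch only handles normal XIDs (>= 3). */
bool
toy_xid_precedes(ToyXid a, ToyXid b)
{
    assert(a >= 3 && b >= 3);
    return (int32_t) (a - b) < 0;
}

/* Inclusive circular ordering, same restriction. */
bool
toy_xid_precedes_or_equals(ToyXid a, ToyXid b)
{
    assert(a >= 3 && b >= 3);
    return (int32_t) (a - b) <= 0;
}

/* Invalidation rule as the commit message describes it: xmin <= horizon. */
bool
toy_would_invalidate(ToyXid catalog_xmin, ToyXid horizon)
{
    return toy_xid_precedes_or_equals(catalog_xmin, horizon);
}

/* Pause check that matches the invalidation rule. */
bool
toy_should_pause(ToyXid catalog_xmin, ToyXid horizon)
{
    return toy_xid_precedes_or_equals(catalog_xmin, horizon);
}

/* Mismatched variant: misses the horizon == catalog_xmin case. */
bool
toy_should_pause_strict(ToyXid catalog_xmin, ToyXid horizon)
{
    return toy_xid_precedes(catalog_xmin, horizon);
}
```

With catalog_xmin = 100 and horizon = 100, toy_should_pause_strict() reports no conflict while toy_would_invalidate() reports one, which is the "fails to re-pause, yet still invalidated" gap the inclusive comparison closes.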
The previous behavior under recovery_pause_on_logical_slot_conflict
required the operator to both drain (or drop / advance) the slot AND
call pg_wal_replay_resume() to continue — two steps, even though the
first step is the one that matters semantically. That split also meant
the feature couldn't underpin a continuous-CDC service without
external orchestration to issue the resume.
Lift the scan predicate ("does any slot in `dboid` still block this
conflict?") out of the initial check into a helper
AnySlotStillBlocksConflict(). Call it again every 1s inside the
existing wait loop. When it returns false, flip the pause state to
NOT_PAUSED and let the loop exit; the existing post-wait advance then
bumps catalog_xmin past the horizon on drained slots so the
fall-through InvalidateObsoleteReplicationSlots() is a no-op.
"No longer blocking" covers every unblock path, not just drain:
* drained past the pause LSN (confirmed_flush >= captured
conflict_lsn) — the main case
* slot dropped (pg_drop_replication_slot) — removed from the scan
* slot advanced (pg_replication_slot_advance) — catalog_xmin moves
past the horizon
* slot invalidated for another reason (e.g. RS_INVAL_WAL_REMOVED
from max_slot_wal_keep_size, applied by the checkpointer, which
runs even while the startup process is asleep in our wait loop)
— data.invalidated != RS_INVAL_NONE, scan skips it
Manual pg_wal_replay_resume() still works as the "give up on this
slot and let it invalidate" escape hatch, and CheckForStandbyTrigger
still breaks the loop for pg_promote().
Capture conflict_lsn once at pause time and reuse it for both the
in-wait predicate and the post-wait advance, replacing the redundant
second GetXLogReplayRecPtr() call.
GUC long_desc, postgresql.conf.sample comment, and the xlogrecovery.c
variable-decl comment updated to describe auto-resume.
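The "no longer blocking" paths listed in the auto-resume commit message can be read as one scan predicate. The following standalone model is an interpretation of that list, not the patch's AnySlotStillBlocksConflict(): ToySlot, its fields, and any_slot_still_blocks() are hypothetical, a dropped slot is modelled as in_use == false, and a catalog_xmin of 0 models "not yet valid".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;    /* stand-in for a transaction ID */
typedef uint64_t ToyLsn;    /* stand-in for a WAL position */

/* Hypothetical, simplified view of one replication slot. */
typedef struct
{
    bool        in_use;             /* false once the slot is dropped */
    bool        invalidated;        /* invalidated for any reason */
    bool        synced;             /* owned by the slot-sync worker */
    uint32_t    database;           /* database OID of the slot */
    ToyXid      catalog_xmin;       /* 0 models "not yet valid" */
    ToyLsn      confirmed_flush;    /* how far the consumer has drained */
} ToySlot;

/* Circular "a <= b" for normal XIDs, as in the earlier commit's check. */
static bool
toy_xid_le(ToyXid a, ToyXid b)
{
    return (int32_t) (a - b) <= 0;
}

/* True while at least one slot in dboid still blocks the conflict. */
bool
any_slot_still_blocks(const ToySlot *slots, size_t nslots,
                      uint32_t dboid, ToyXid horizon, ToyLsn conflict_lsn)
{
    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++)
    {
        const ToySlot *s = &slots[i];

        if (!s->in_use || s->invalidated || s->synced)
            continue;           /* dropped, invalidated, or synced */
        if (s->database != dboid)
            continue;           /* conflict is for another database */
        if (s->catalog_xmin == 0)
            continue;           /* no valid catalog_xmin yet */
        if (!toy_xid_le(s->catalog_xmin, horizon))
            continue;           /* advanced past the horizon */
        if (s->confirmed_flush >= conflict_lsn)
            continue;           /* drained past the pause LSN */

        return true;
    }
    return false;
}
```

Polling a predicate of this shape once per wakeup lets the wait loop end on whichever unblock path happens first, without the loop needing to know which one it was.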
3c543dc to
44eceb3
Summary
Logical replication slots on standbys are invalidated whenever WAL replay applies a catalog vacuum record whose snapshotConflictHorizon exceeds the slot's catalog_xmin. For archive-only standbys (no streaming link to primary), this is unavoidable and typically happens ~2× autovacuum_naptime after slot creation, breaking continuous CDC pipelines.
This series introduces recovery_pause_on_logical_slot_conflict: pause replay instead of invalidating, giving an operator (or automation) time to drain the slot. The slot survives; replay resumes.
Patch structure
Patch 1: xlogrecovery: make ConfirmRecoveryPaused and CheckForStandbyTrigger extern
MaybePauseOnLogicalSlotConflict runs inside ResolveRecoveryConflictWithSnapshot, not in the main recovery loop. Its wait loop needs to (a) transition RECOVERY_PAUSE_REQUESTED → RECOVERY_PAUSED and (b) break on pg_promote(). Both functions were static. No behaviour change; visibility only.
Patch 2: Add recovery_pause_on_logical_slot_conflict GUC
New PGC_SIGHUP boolean GUC (default off). When on, hooks into ResolveRecoveryConflictWithSnapshot() via MaybePauseOnLogicalSlotConflict(). Reuses the existing SetRecoveryPause/recoveryNotPausedCV machinery; no new shared memory.
Crash safety: after advancing catalog_xmin in memory, CheckPointReplicationSlots(false) flushes dirty slots to disk immediately. This upholds the write-before-memory-update invariant from LogicalConfirmReceivedLocation (logical.c): the on-disk state must lead the in-memory state so vacuum cannot reclaim catalog tuples the slot still needs.
Includes TAP test with 10 assertions covering slot survival, pause/resume cycle, baseline invalidation (GUC off), and pg_promote() under pause.
Patch 3: Auto-resume recovery once the logical slot conflict is resolved
Lifts the slot-still-blocking predicate into AnySlotStillBlocksConflict() and polls it every 1 s inside the wait loop. When no slot blocks, flips pause state to NOT_PAUSED and exits; no operator intervention needed. Manual pg_wal_replay_resume() still works as an escape hatch. pg_promote() still breaks the loop.
Design principles (from design review 2026-05-27 with Andrey Borodin)
CheckPointReplicationSlots(false) call on the resume path is one fsync on a rare code path. The alternative (defer to next restartpoint) would violate the write-ordering invariant that the rest of the slot machinery upholds.
Target
July 2026 PostgreSQL commitfest (PG20 development branch).
Test plan
make -C src/test/recovery check TESTS=050_recovery_pause_on_slot_conflict passes
pg_settings with correct description
pg_promote() completes in < 10 s while standby is paused
xl_running_xacts cadence)
🤖 Generated with Claude Code