recovery_pause_on_logical_slot_conflict: pause instead of invalidate (3-patch series) by NikolayS · Pull Request #40 · NikolayS/postgres

NikolayS · 2026-05-27T17:58:47Z

Summary

Logical replication slots on standbys are invalidated whenever WAL replay applies a catalog vacuum record whose snapshotConflictHorizon exceeds the slot's catalog_xmin. For archive-only standbys (no streaming link to primary), this is unavoidable and typically happens ~2×autovacuum_naptime after slot creation, breaking continuous CDC pipelines.

This series introduces recovery_pause_on_logical_slot_conflict: pause replay instead of invalidating, giving an operator (or automation) time to drain the slot. The slot survives; replay resumes.

Patch structure

Patch 1 — xlogrecovery: make ConfirmRecoveryPaused and CheckForStandbyTrigger extern

MaybePauseOnLogicalSlotConflict runs inside ResolveRecoveryConflictWithSnapshot, not in the main recovery loop. Its wait loop needs to (a) transition RECOVERY_PAUSE_REQUESTED → RECOVERY_PAUSED and (b) break on pg_promote(). Both functions were static. No behaviour change — visibility only.

Patch 2 — Add recovery_pause_on_logical_slot_conflict GUC

New PGC_SIGHUP boolean GUC (default off). When on, hooks into ResolveRecoveryConflictWithSnapshot() via MaybePauseOnLogicalSlotConflict(). Reuses the existing SetRecoveryPause / recoveryNotPausedCV machinery — no new shared memory.

Crash safety: after advancing catalog_xmin in memory, CheckPointReplicationSlots(false) flushes dirty slots to disk immediately. This upholds the write-before-memory-update invariant from LogicalConfirmReceivedLocation (logical.c): the on-disk state must lead the in-memory state so vacuum cannot reclaim catalog tuples the slot still needs.

Includes TAP test with 10 assertions covering slot survival, pause/resume cycle, baseline invalidation (GUC off), and pg_promote() under pause.

Patch 3 — Auto-resume recovery once the logical slot conflict is resolved

Lifts the slot-still-blocking predicate into AnySlotStillBlocksConflict() and polls it every 1 s inside the wait loop. When no slot blocks, flips pause state to NOT_PAUSED and exits — no operator intervention needed. Manual pg_wal_replay_resume() still works as an escape hatch. pg_promote() still breaks the loop.

Design principles (from design review 2026-05-27 with Andrey Borodin)

Correctness over performance: the CheckPointReplicationSlots(false) call on the resume path is one fsync on a rare code path. The alternative (defer to next restartpoint) would violate the write-ordering invariant that the rest of the slot machinery upholds.
No new shared memory: reuses existing pause/resume infrastructure.
Each patch independently useful: Patch 1 is clean refactoring; Patch 2 gives manual-drain workflow; Patch 3 adds automation.

Target

July 2026 PostgreSQL commitfest (PG20 development branch).

Test plan

make -C src/test/recovery check TESTS=050_recovery_pause_on_slot_conflict passes
GUC visible in pg_settings with correct description
Slot survives catalog prune with GUC on; invalidates with GUC off (existing behaviour unchanged)
pg_promote() completes in < 10 s while standby is paused
Auto-resume fires within ~15 s after slot is drained (bgwriter xl_running_xacts cadence)

🤖 Generated with Claude Code

NikolayS · 2026-05-28T12:44:31Z

samorev: concurrency/API design review

Static review focused on locking, condition variables, startup process interactions, promotion/manual resume behavior, slot persistence invariants, and externally visible semantics.

Findings:

Critical — src/backend/storage/ipc/standby.c:721-760 — crash-safety invariant is still violated
- Evidence: code sets s->effective_catalog_xmin = new_xmin / s->effective_xmin = new_xmin before CheckPointReplicationSlots(false).
- Issue: this exposes the advanced xmin to vacuum before the slot state is durable. Crash in that window can restart with old on-disk xmin after vacuum already removed tuples the slot may need.
- Suggested fix: follow LogicalConfirmReceivedLocation() ordering: update durable slot data/save first, then publish effective xmin.
High — src/backend/storage/ipc/standby.c:669 — auto-resume can clear a user-requested recovery pause
- Evidence: auto-resume calls SetRecoveryPause(false) using the global recovery pause state.
- Issue: if an operator calls pg_wal_replay_pause() while this conflict pause is active, the auto-resume path can unpause recovery anyway. This changes externally visible pause semantics.
- Suggested fix: track ownership/reason for this pause separately, or only auto-clear pauses known to be created by this feature and not superseded by a manual pause request.
High — src/backend/storage/ipc/standby.c:686-730 — catalog-conflict path advances xmin, not just catalog_xmin
- Evidence: advance predicate includes s->data.xmin, then sets s->data.xmin = new_xmin and s->effective_xmin = new_xmin.
- Issue: the feature is justified by catalog conflicts, but xmin protects non-catalog tuple visibility for logical decoding. Advancing it based only on a catalog prune conflict horizon may drop protection the slot still needs.
- Suggested fix: don’t mutate xmin here unless there is a separate proof it is safe; constrain the conflict handling to catalog_xmin, or adjust the invalidation path so logical catalog conflicts don’t require unsafe xmin advancement.

No tests run; review was static only.

NikolayS · 2026-06-03T17:37:37Z

Bug (patch 3, auto-resume): pre-existing user pause gets clobbered.
In MaybePauseOnLogicalSlotConflict() (standby.c) the auto-resume path calls SetRecoveryPause(false) unconditionally. If an operator ran pg_wal_replay_pause() before the conflict fired, auto-resume silently clears their pause and recovery resumes against intent — single global pause state, no record of who paused.
Fix (~10 lines): capture user_paused = (GetRecoveryPauseState() != RECOVERY_NOT_PAUSED) before SetRecoveryPause(true); on auto-resume only clear when !user_paused. Add a subtest (currently uncovered: test exercises pause+auto-resume, control, promote-escape, manual-resume — but not user-pause-before-conflict).

NikolayS · 2026-06-03T17:38:01Z

Test not registered in meson build.
t/050_recovery_pause_on_slot_conflict.pl is missing from the tests = [...] list in src/test/recovery/meson.build → meson/CI never runs it (verified: 0 matches; build is otherwise clean). As committed the 329-line test provides no protection. Add the entry. Minor: filename 050 collides with existing 050_redo_segment_missing.pl; renumber to 054 while at it.

NikolayS · 2026-06-03T17:41:38Z

Code duplication (patch 2/3) — consider consolidating:

Slot scan: AnySlotStillBlocksConflict() iterates all slots under ReplicationSlotControlLock doing catalog_xmin-vs-horizon checks that InvalidateObsoleteReplicationSlots()/DetermineSlotInvalidationCause() already do — two back-to-back scans under the same lock in the conflict path. (Comment already notes the PrecedesOrEquals mirroring, so drift risk is real.)
Wait loop: the CV-sleep + ConfirmRecoveryPaused() loop largely duplicates recoveryPausesHere().
Not blocking, but a reviewer will ask to share logic / reduce the two-scan cost.

NikolayS · 2026-06-03T17:49:08Z

Review session takeaways (action items)

Before/at submission:

Register the TAP test in src/test/recovery/meson.build (currently never runs) + renumber 050→054.
Fix user-pause-clobber bug (patch 3): capture user_paused before SetRecoveryPause(true); auto-clear only if !user_paused; add a subtest.
Get one clean pass on Linux/cfbot — real 'does it work' proof (local macOS env broken: SIP breaks all recovery TAP tests; community CI down).

Polish:
4. Commit-1 comment forward-references MaybePauseOnLogicalSlotConflict (defined in commit 2) → make generic.
5. Document why dboid + snapshotConflictHorizon params are needed.

Decided: keep current 3-commit structure (readability > merging into InvalidateObsoleteReplicationSlots; lock-count/dup deferred). Build + rebase-onto-master both clean.

NikolayS · 2026-06-03T17:52:56Z

Test verdict: core behavior VALIDATED; the timeout was test flakiness (not a patch bug)

Re-ran 054 with a 3× timeout on a clean-ish env. Tests 1–7 passed: GUC-on logical slot survived the catalog prune (22 pauses observed, 3092 events decoded); GUC-off control slot invalidated as expected. So the pause/auto-resume/drain mechanism works.

The 1000s timeout was a test bug, not patch code. Phase-1's archive-readiness poll (lines 82–91) keys off $phase1_seg (segment 3), but the second pg_log_standby_snapshot() anchor lands in a later segment (seg 7) that stays an unarchived open segment — so pg_create_logical_replication_slot blocks in DecodingContextFindStartpoint waiting for SNAPBUILD_CONSISTENT on WAL that never arrives. Phase-2's WAL switch that would ship it is gated behind the hung call → deadlock by construction on fast machines. Same class recurs at line 305 (promote standby).

New action item #6: make the phase-1 poll wait for the segment holding the snapshot anchor to close+archive before creating slots (both sites). This is the real flakiness fix; CI would hit it intermittently.

…xtern MaybePauseOnLogicalSlotConflict (introduced in the next commit) runs inside ResolveRecoveryConflictWithSnapshot, which is called from the WAL apply path rather than the main recovery loop. Its wait loop must do the same two things recoveryPausesHere() does: 1. Transition RECOVERY_PAUSE_REQUESTED -> RECOVERY_PAUSED so that pg_wal_replay_resume() can release the pause. 2. Check for a promote signal so that pg_promote() does not stall while the startup process is sleeping inside the slot-conflict wait. Both are currently static. Remove the static qualifier and add extern declarations to xlogrecovery.h so standby.c can call them. No behaviour change — only visibility changes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add a new GUC, recovery_pause_on_logical_slot_conflict (PGC_SIGHUP, default off). When enabled, WAL replay on a standby pauses instead of invalidating an active logical replication slot whose catalog_xmin would be overtaken by a Heap2/PRUNE_ON_ACCESS record's snapshotConflictHorizon. An operator can then drain the slot via pg_logical_slot_get_changes and call pg_wal_replay_resume() to continue. On resume, the slot's catalog_xmin is advanced past the conflict horizon so the subsequent InvalidateObsoleteReplicationSlots call becomes a no-op; replay continues to the next conflict and the cycle repeats. This makes logical decoding from an archive-only standby (no streaming replication link to the primary) viable for continuous CDC. Without this GUC, slots on such standbys are invalidated the first time replay applies a catalog vacuum record whose horizon exceeds the slot's catalog_xmin — typically ~2 * autovacuum_naptime after slot creation. Hooks into ResolveRecoveryConflictWithSnapshot(), the single choke point in the replay path for RS_INVAL_HORIZON conflicts, via a new MaybePauseOnLogicalSlotConflict() function in standby.c. Reuses the existing SetRecoveryPause / recoveryNotPausedCV machinery — no new shared-memory state. Hot path when GUC off is one boolean early-return. Edge cases handled: - Slots still inside DecodingContextFindStartpoint (effective_catalog_xmin not yet valid) are skipped. Pausing for them would deadlock: snapbuild needs WAL to advance, pause holds it back. Invalidating an in-progress slot is harmless — the caller retries. - Pause-check uses TransactionIdPrecedesOrEquals to match the semantics of DetermineSlotInvalidationCause. Without that, a slot whose catalog_xmin was just advanced to horizon+1 by a previous pause cycle would fail to re-pause on a subsequent record with horizon == catalog_xmin, yet would still be invalidated. - CheckForStandbyTrigger() is called in the wait loop so pg_promote() does not stall while paused. Mirrors the existing recoveryPausesHere escape loop. - Synced slots (data.synced == true) are skipped in both the pause-check and advance scans. Writing to their fields from the startup process would race with the slot-sync worker. Crash safety: after advancing catalog_xmin in memory, dirty slots are flushed to disk immediately via CheckPointReplicationSlots(false) before returning. This upholds the write-before-memory-update invariant established by LogicalConfirmReceivedLocation (logical.c): the on-disk state must reflect any advance before the in-memory value becomes visible, so that vacuum cannot reclaim catalog tuples the slot still needs. Deferring to the next restartpoint would leave a crash window. Includes a TAP test (050_recovery_pause_on_slot_conflict.pl) covering: - GUC registration - Slot survival through catalog PRUNE_ON_ACCESS records (GUC on) - Baseline slot invalidation (GUC off, unchanged upstream behaviour) - pg_promote() succeeds in under 10 s while the standby is paused (guards the CheckForStandbyTrigger() escape path) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The previous behavior under recovery_pause_on_logical_slot_conflict required the operator to both drain (or drop / advance) the slot AND call pg_wal_replay_resume() to continue — two steps, even though the first step is the one that matters semantically. That split also meant the feature couldn't underpin a continuous-CDC service without external orchestration to issue the resume. Lift the scan predicate ("does any slot in `dboid` still block this conflict?") out of the initial check into a helper AnySlotStillBlocksConflict(). Call it again every 1s inside the existing wait loop. When it returns false, flip the pause state to NOT_PAUSED and let the loop exit; the existing post-wait advance then bumps catalog_xmin past the horizon on drained slots so the fall-through InvalidateObsoleteReplicationSlots() is a no-op. "No longer blocking" covers every unblock path, not just drain: * drained past the pause LSN (confirmed_flush >= captured conflict_lsn) — the main case * slot dropped (pg_drop_replication_slot) — removed from the scan * slot advanced (pg_replication_slot_advance) — catalog_xmin moves past the horizon * slot invalidated for another reason (e.g. RS_INVAL_WAL_REMOVED from max_slot_wal_keep_size, applied by the checkpointer, which runs even while the startup process is asleep in our wait loop) — data.invalidated != RS_INVAL_NONE, scan skips it Manual pg_wal_replay_resume() still works as the "give up on this slot and let it invalidate" escape hatch, and CheckForStandbyTrigger still breaks the loop for pg_promote(). Capture conflict_lsn once at pause time and reuse it for both the in-wait predicate and the post-wait advance, replacing the redundant second GetXLogReplayRecPtr() call. GUC long_desc, postgresql.conf.sample comment, and the xlogrecovery.c variable-decl comment updated to describe auto-resume.

NikolayS mentioned this pull request Jun 3, 2026

Test design: end-to-end PoC for archive-fed logical decoding (recovery_pause_on_logical_slot_conflict) #43

Open

23 tasks

Nik Samokhvalov and others added 3 commits June 3, 2026 11:20

NikolayS force-pushed the feature/recovery-pause-logical-slot branch from 3c543dc to 44eceb3 Compare June 3, 2026 18:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

recovery_pause_on_logical_slot_conflict: pause instead of invalidate (3-patch series)#40

recovery_pause_on_logical_slot_conflict: pause instead of invalidate (3-patch series)#40
NikolayS wants to merge 3 commits into
masterfrom
feature/recovery-pause-logical-slot

NikolayS commented May 27, 2026

Uh oh!

NikolayS commented May 28, 2026

Uh oh!

NikolayS commented Jun 3, 2026

Uh oh!

NikolayS commented Jun 3, 2026

Uh oh!

NikolayS commented Jun 3, 2026

Uh oh!

NikolayS commented Jun 3, 2026

Uh oh!

NikolayS commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

NikolayS commented May 27, 2026

Summary

Patch structure

Design principles (from design review 2026-05-27 with Andrey Borodin)

Target

Test plan

Uh oh!

NikolayS commented May 28, 2026

samorev: concurrency/API design review

Uh oh!

NikolayS commented Jun 3, 2026

Uh oh!

NikolayS commented Jun 3, 2026

Uh oh!

NikolayS commented Jun 3, 2026

Uh oh!

NikolayS commented Jun 3, 2026

Review session takeaways (action items)

Uh oh!

NikolayS commented Jun 3, 2026

Test verdict: core behavior VALIDATED; the timeout was test flakiness (not a patch bug)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants