p2 of recovery_pause_on_logical_slot_conflict (auto-resume) #30

Open
NikolayS wants to merge 5 commits into rfc-v1-recovery-pause-on-slot-conflict from claude/document-test-steps-Kh1ty

No description provided.

Add a new GUC, recovery_pause_on_logical_slot_conflict (PGC_SIGHUP,
default off). When enabled, WAL replay on a standby pauses instead of
invalidating an active logical replication slot whose catalog_xmin
would be overtaken by a Heap2/PRUNE_ON_ACCESS record's
snapshotConflictHorizon. An operator can then drain the slot via
pg_logical_slot_get_changes and call pg_wal_replay_resume() to
continue. On resume, the patch advances the drained slot's
catalog_xmin past the conflict horizon so the subsequent
InvalidateObsoleteReplicationSlots call becomes a no-op; replay
continues to the next conflict and the cycle repeats.

This makes logical decoding from an archive-only standby (no streaming
replication link to the primary) viable for continuous CDC. Without
this GUC, slots on such standbys are invalidated the first time replay
applies a catalog vacuum record whose horizon exceeds the slot's
catalog_xmin — typically ~2 * autovacuum_naptime after slot creation.

Hooks into ResolveRecoveryConflictWithSnapshot(), the single choke
point in the replay path for RS_INVAL_HORIZON conflicts, via a new
MaybePauseOnLogicalSlotConflict() function. Reuses the existing
SetRecoveryPause / recoveryNotPausedCV machinery — no new
shared-memory state. Hot path when GUC off is one boolean
early-return.
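
The cost claim above can be pictured with a small model. This is a minimal sketch, not the patch's code: `pause_guc_enabled`, `scan_for_blocking_slot`, `maybe_pause_on_conflict`, and the `slot_scans` counter are hypothetical names introduced for illustration, and the slot scan is reduced to a caller-supplied answer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from the patch. */
static bool pause_guc_enabled = false;  /* models the GUC value */
static int  slot_scans = 0;             /* counts modeled slot-array scans */

/* Models the slot scan; the caller supplies the scan's answer. */
static bool
scan_for_blocking_slot(bool answer)
{
    assert(pause_guc_enabled);  /* only reachable with the GUC on */
    slot_scans++;
    return answer;
}

/* Returns true when replay should pause instead of invalidating. */
static bool
maybe_pause_on_conflict(bool blocking_slot_exists)
{
    if (!pause_guc_enabled)
        return false;           /* hot path: one boolean test, no scan */

    return scan_for_blocking_slot(blocking_slot_exists);
}
```

With the GUC off the function returns before any slot is examined, so in this model the existing invalidation path runs exactly as it does without the patch.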

Edge cases handled:
- Slots still inside DecodingContextFindStartpoint
  (effective_catalog_xmin not yet valid) are skipped. Pausing for them
  would deadlock: snapbuild needs WAL to advance, pause holds it back.
  Invalidating an in-progress slot is harmless — the caller retries.
- Pause-check uses TransactionIdPrecedesOrEquals to match the
  semantics of DetermineSlotInvalidationCause. Without that, a slot
  whose catalog_xmin was just advanced to horizon+1 by a previous
  pause cycle would fail to re-pause on a subsequent record whose
  horizon equals that new catalog_xmin, yet would still be
  invalidated by the fall-through.
- CheckForStandbyTrigger() is called in the wait loop so pg_promote()
  does not stall while paused. Mirrors the existing recoveryPausesHere
  escape loop.
- Synced slots (data.synced == true, i.e. managed by the slot-sync
  worker per sync_replication_slots) are skipped in both the
  pause-check and advance scans. Writing to their fields from the
  startup process would race with the slot-sync worker, and ALTER /
  DROP_REPLICATION_SLOT on a synced slot errors out — so the
  operator-facing "drain or drop" recipe does not apply.
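
The TransactionIdPrecedesOrEquals edge case in the list above can be checked with a short model. This is a sketch under stated assumptions: the comparison helpers imitate PostgreSQL's modulo-2^32 XID comparison for normal XIDs only (special XIDs are ignored), and `would_invalidate`, `pause_check_strict`, and `pause_check_matching` are hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Modulo-2^32 comparison for normal XIDs, in the style of
 * TransactionIdPrecedes; special XIDs are not handled here. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

static bool
xid_precedes_or_equals(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) <= 0;
}

/* Models the invalidation test: a slot conflicts when its catalog_xmin
 * precedes or equals the record's conflict horizon. */
static bool
would_invalidate(TransactionId catalog_xmin, TransactionId horizon)
{
    return xid_precedes_or_equals(catalog_xmin, horizon);
}

/* A pause check using the strict comparison (the mismatch). */
static bool
pause_check_strict(TransactionId catalog_xmin, TransactionId horizon)
{
    return xid_precedes(catalog_xmin, horizon);
}

/* A pause check that matches the invalidation test. */
static bool
pause_check_matching(TransactionId catalog_xmin, TransactionId horizon)
{
    assert(would_invalidate(catalog_xmin, horizon) ==
           xid_precedes_or_equals(catalog_xmin, horizon));
    return xid_precedes_or_equals(catalog_xmin, horizon);
}
```

Take a slot whose catalog_xmin was advanced to 1001 after a pause at horizon 1000. For a later record with horizon 1001, the strict check does not pause while the invalidation test still fires; the matching check pauses in exactly the cases where invalidation would otherwise happen.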

ConfirmRecoveryPaused() and CheckForStandbyTrigger() are made extern
for use by MaybePauseOnLogicalSlotConflict's wait loop — the pause is
entered from inside ResolveRecoveryConflictWithSnapshot rather than
the main replay loop, so we need to transition
RECOVERY_PAUSE_REQUESTED -> RECOVERY_PAUSED ourselves and consume
PROMOTE_SIGNAL_FILE ourselves.

Known limitation: the advance marks slots dirty but does not force an
immediate SaveSlotToPath. If the standby crashes between resume and
the next restartpoint, the advance is lost — on restart replay
re-encounters the same conflict record, re-pauses, and the operator
re-drains (idempotent). A future iteration could tighten this.

10 assertions, ~30 wallclock seconds.

Two-phase flow: Phase 1 sets up an archive-only standby from a clean
basebackup + pg_log_standby_snapshot and creates logical slots on
TWO standbys while the archive contains no catalog-prune records.
One standby has the GUC on, the other off. Phase 2 then runs catalog-
churning workload on the primary (transient tables + VACUUM on
pg_class, pg_attribute, pg_type, pg_depend, pg_statistic) and waits
for those segments to archive.

When the standbys replay through those segments, the GUC-on one
pauses; a Perl orchestrator drains the slot with
pg_logical_slot_get_changes and calls pg_wal_replay_resume. The
GUC-off baseline standby lets its slot invalidate — the upstream
default behavior, unchanged.

A third standby is created after the Phase 2 segments are archived (so
its replay pauses quickly on the first conflict record). The test then calls
pg_promote(wait=>true, wait_seconds=>30) on the paused standby and
asserts that promote returns true in under 10 seconds. Guards the
CheckForStandbyTrigger() escape path — without that, pg_promote
stalls for the full wait_seconds and returns false.

Assertions:
  ok 1 - GUC is registered
  ok 2 - slot created cleanly in Phase 1 (GUC on, state: reserved)
  ok 3 - baseline slot created cleanly in Phase 1 (GUC off, reserved)
  ok 4 - slot survived catalog prune with GUC on (reserved)
  ok 5 - at least one pause event was handled
  ok 6 - at least 2000 decoded events
  ok 7 - baseline (GUC off): slot invalidates as expected (lost)
  ok 8 - promote-test standby reached paused state before promotion
  ok 9 - pg_promote returned true while standby was paused by GUC
  ok 10 - pg_promote completed in under 10s
@NikolayS NikolayS changed the title p2 of recovery_pause_on_logical_slot_conflict p2 of recovery_pause_on_logical_slot_conflict (auto-resume) Apr 22, 2026
claude added 2 commits April 22, 2026 18:17
Extract five named subs so the top-level script reads as a sequence of
phases rather than one long procedure. No behavior change: all 10
assertions are preserved verbatim, as are the load-bearing comments
(two-phase rationale, double pg_switch_wal rationale, GUC-off baseline
rationale, pg_promote escape-path rationale).

Helpers extracted:
  * setup_primary_with_clean_archive
  * create_archive_standby
  * run_catalog_churn
  * drain_and_resume_loop
  * wait_for_replay_paused

The previous behavior under recovery_pause_on_logical_slot_conflict
required the operator to both drain (or drop / advance) the slot AND
call pg_wal_replay_resume() to continue — two steps, even though the
first step is the one that matters semantically. That split also meant
the feature couldn't underpin a continuous-CDC service without
external orchestration to issue the resume.

Lift the scan predicate ("does any slot in `dboid` still block this
conflict?") out of the initial check into a helper
AnySlotStillBlocksConflict(). Call it again every 1s inside the
existing wait loop. When it returns false, flip the pause state to
NOT_PAUSED and let the loop exit; the existing post-wait advance then
bumps catalog_xmin past the horizon on drained slots so the
fall-through InvalidateObsoleteReplicationSlots() is a no-op.

"No longer blocking" covers every unblock path, not just drain:

  * drained past the pause LSN (confirmed_flush >= captured
    conflict_lsn) — the main case
  * slot dropped (pg_drop_replication_slot) — removed from the scan
  * slot advanced (pg_replication_slot_advance) — catalog_xmin moves
    past the horizon
  * slot invalidated for another reason (e.g. RS_INVAL_WAL_REMOVED
    from max_slot_wal_keep_size, applied by the checkpointer, which
    runs even while the startup process is asleep in our wait loop)
    — data.invalidated != RS_INVAL_NONE, scan skips it
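
The unblock paths above can be expressed as one predicate over a simplified slot. This is a minimal sketch, assuming a reduced `SlotModel` struct invented for illustration (it is not PostgreSQL's ReplicationSlot) and a hypothetical `any_slot_still_blocks` function; the synced-slot and in-progress-startpoint skips described earlier are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;

/* Simplified model of a replication slot; not PostgreSQL's ReplicationSlot. */
typedef struct SlotModel
{
    bool          in_use;           /* false once the slot is dropped */
    bool          invalidated;      /* models data.invalidated != RS_INVAL_NONE */
    uint32_t      database;         /* OID of the slot's database */
    TransactionId catalog_xmin;
    XLogRecPtr    confirmed_flush;
} SlotModel;

static bool
xid_precedes_or_equals(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) <= 0;
}

/* Returns true while at least one slot in dboid still blocks the conflict. */
static bool
any_slot_still_blocks(const SlotModel *slots, size_t nslots, uint32_t dboid,
                      TransactionId horizon, XLogRecPtr conflict_lsn)
{
    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++)
    {
        const SlotModel *s = &slots[i];

        if (!s->in_use || s->invalidated || s->database != dboid)
            continue;           /* dropped, invalidated, or other database */
        if (!xid_precedes_or_equals(s->catalog_xmin, horizon))
            continue;           /* advanced past the horizon */
        if (s->confirmed_flush >= conflict_lsn)
            continue;           /* drained past the pause LSN */
        return true;
    }
    return false;
}
```

Each `continue` corresponds to one bullet in the list, which is why re-running the same scan inside the wait loop covers drop, advance, and unrelated invalidation without separate notification paths.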

Manual pg_wal_replay_resume() still works as the "give up on this
slot and let it invalidate" escape hatch, and CheckForStandbyTrigger
still breaks the loop for pg_promote().
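
The three ways out of the wait loop described so far (promotion, manual resume, auto-resume) can be modeled as a loop over per-iteration observations. This is a sketch only: `TickState`, `WaitOutcome`, and `run_wait_loop` are hypothetical names, each array element stands for one 1s wakeup, and the order in which the three conditions are tested is an assumption rather than something the commit message specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum WaitOutcome
{
    WAIT_TIMED_OUT,         /* model ran out of ticks while still paused */
    EXIT_PROMOTE,           /* promotion was requested */
    EXIT_MANUAL_RESUME,     /* operator called pg_wal_replay_resume() */
    EXIT_AUTO_RESUME        /* no slot blocks the conflict any more */
} WaitOutcome;

/* What the startup process observes on one wakeup (modeled). */
typedef struct TickState
{
    bool promote_requested;
    bool resume_requested;
    bool slot_still_blocks;
} TickState;

/* Walks the ticks in order and reports why the wait ended.
 * The check order below is an assumption made for this sketch. */
static WaitOutcome
run_wait_loop(const TickState *ticks, size_t nticks, size_t *ticks_waited)
{
    assert(ticks != NULL && ticks_waited != NULL);

    for (size_t i = 0; i < nticks; i++)
    {
        *ticks_waited = i;
        if (ticks[i].promote_requested)
            return EXIT_PROMOTE;
        if (ticks[i].resume_requested)
            return EXIT_MANUAL_RESUME;
        if (!ticks[i].slot_still_blocks)
            return EXIT_AUTO_RESUME;
    }
    *ticks_waited = nticks;
    return WAIT_TIMED_OUT;
}
```

In this model only EXIT_AUTO_RESUME corresponds to the case where the post-wait advance makes the fall-through invalidation a no-op; EXIT_MANUAL_RESUME with a slot that still blocks is the "give up and let it invalidate" path.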

Capture conflict_lsn once at pause time and reuse it for both the
in-wait predicate and the post-wait advance, replacing the redundant
second GetXLogReplayRecPtr() call.

GUC long_desc, postgresql.conf.sample comment, and the xlogrecovery.c
variable-decl comment updated to describe auto-resume.
@NikolayS NikolayS force-pushed the claude/document-test-steps-Kh1ty branch from 69fb1b6 to 39adedd Compare April 22, 2026 18:17
Combines PR 27 (pause-on-conflict + TAP test) and PR 30 (refactor +
auto-resume) into the story we would send to -hackers. Covers
motivation, mechanism, edge cases, known limitations, tests, files
touched, and open questions (GUC name, single-vs-mode flag,
persistence, scope). Draft only — not sent.
@NikolayS NikolayS force-pushed the rfc-v1-recovery-pause-on-slot-conflict branch from ffd897c to 0ce8b52 Compare May 27, 2026 17:18