Skip to content

Fix two bugs in archive_mode=shared on standby#83

Open
x4m wants to merge 54 commits into
MDB_17_STABLEfrom
fix_sa_17
Open

Fix two bugs in archive_mode=shared on standby#83
x4m wants to merge 54 commits into
MDB_17_STABLEfrom
fix_sa_17

Conversation

@x4m
Copy link
Copy Markdown
Contributor

@x4m x4m commented Apr 21, 2026

  1. Checkpoint on standby deletes WAL with .ready status. XLogArchiveCheckDone() treated archive_mode=shared like archive_mode=on during recovery, returning true unconditionally and allowing checkpoint to remove WAL segments that the primary had not yet archived. Fix: exclude shared mode from the early-return path, same as "always".

  2. Walsender never sends archival status reports after archiving is restored. WalSndArchivalReport() calls pgstat_fetch_stat_archiver() whose result is cached per-session (PGSTAT_FETCH_CONSISTENCY_CACHE by default). The walsender has no transaction boundaries that would clear the cache, so last_archived_wal remained "" forever, and strcmp() suppressed all reports. Fix: call pgstat_clear_snapshot() before fetching archiver stats.

Add TAP tests in 051_archive_shared_checkpoint.pl that reproduce both bugs, and extend 050_archive_shared.pl with checkpoint/restore scenarios.

nathan-bossart and others added 30 commits April 15, 2026 12:50
Various parts of pg_dump consult the --schema-only and --data-only
options to determine whether to run a section of code.  While this
is simple enough for two mutually-exclusive options, it will become
progressively more complicated as more options are added.  In
anticipation of that, this commit introduces new internal flags
called dumpSchema and dumpData, which are derivatives of
--schema-only and --data-only.  This commit also removes the
schemaOnly and dataOnly members from the dump/restore options
structs to prevent their use elsewhere.

Note that this change neither adds new user-facing command-line
options nor changes the existing --schema-only and --data-only
options.

Author: Corey Huinker
Reviewed-by: Jeff Davis
Discussion: https://postgr.es/m/CADkLM%3DcQgghMJOS8EcAVBwRO4s1dUVtxGZv5gLPfZkQ1nL1gzA%40mail.gmail.com
There's no reason to ensure that the files pg_upgrade generates
with pg_dump and pg_dumpall have been written safely to disk.  If
there is a crash during pg_upgrade, the upgrade must be restarted
from the beginning; dump files left behind by previous pg_upgrade
attempts cannot be reused.

Reviewed-by: Peter Eisentraut, Tom Lane, Michael Paquier, Daniel Gustafsson
Discussion: https://postgr.es/m/20240503171348.GA1731524%40nathanxps13
These functions are used to tweak statistics on any relation, provided
that the user has MAINTAIN privilege on the relation, or is the database
owner.

Bump catalog version.

Author: Corey Huinker
Discussion: https://postgr.es/m/CADkLM=eErgzn7ECDpwFcptJKOk9SxZEk5Pot4d94eVTZsvj3gw@mail.gmail.com
While the default value for relpages is 0, if a partitioned table with
at least one child has been analyzed, then the partititoned table will
have a relpages value of -1.

Author: Corey Huinker
Discussion: https://postgr.es/m/CADkLM=fajh1Lpcyr_XsMmq-9Z=SGk-u+_Zeac7Pt0RAN3uiVCg@mail.gmail.com
While we haven't observed any test instability, it seems like a good
idea to disable autovacuum during the stats import tests.

Author: Corey Huinker
Discussion: https://postgr.es/m/CADkLM=fajh1Lpcyr_XsMmq-9Z=SGk-u+_Zeac7Pt0RAN3uiVCg@mail.gmail.com
These functions will either raise an ERROR or run to normal
completion, so no return value is necessary.

Bump catalog version.

Author: Corey Huinker
Discussion: https://postgr.es/m/CADkLM=cBF8rnphuTyHFi3KYzB9ByDgx57HwK9Rz2yp7S+Om87w@mail.gmail.com
Enable manipulation of attribute statistics. Only superficial
validation is performed, so it's possible to add nonsense, and it's up
to the planner (or other users of statistics) to behave reasonably in
that case.

Bump catalog version.

Author: Corey Huinker
Discussion: https://postgr.es/m/CADkLM=eErgzn7ECDpwFcptJKOk9SxZEk5Pot4d94eVTZsvj3gw@mail.gmail.com
Previously, an invalid attribute name was caught, but the error
message was unhelpful.
Similar to the pg_set_*_stats() functions, except with a variadic
signature that's designed to be more future-proof. Additionally, most
problems are reported as WARNINGs rather than ERRORs, allowing most
stats to be restored even if some cannot.

These functions are intended to be called from pg_dump to avoid the
need to run ANALYZE after an upgrade.

Author: Corey Huinker
Discussion: https://postgr.es/m/CADkLM=eErgzn7ECDpwFcptJKOk9SxZEk5Pot4d94eVTZsvj3gw@mail.gmail.com
This matches the behavior of vac_update_relstats(), which is important
to avoid bloating pg_class.

Author: Corey Huinker
Discussion: https://postgr.es/m/CADkLM=fc3je+ufv3gsHqjjSSf+t8674RXpuXW62EL55MUEQd-g@mail.gmail.com
Add support to pg_dump for dumping stats, and use that during
pg_upgrade so that statistics are transferred during upgrade. In most
cases this removes the need for a costly re-analyze after upgrade.

Some statistics are not transferred, such as extended statistics or
statistics with a custom stakind.

Now pg_dump accepts the options --schema-only, --no-schema,
--data-only, --no-data, --statistics-only, and --no-statistics; which
allow all combinations of schema, data, and/or stats. The options are
named this way to preserve compatibility with the previous
--schema-only and --data-only options.

Statistics are in SECTION_DATA, unless the object itself is in
SECTION_POST_DATA.

The stats are represented as calls to pg_restore_relation_stats() and
pg_restore_attribute_stats().

Author: Corey Huinker, Jeff Davis
Reviewed-by: Jian He
Discussion: https://postgr.es/m/CADkLM=fzX7QX6r78fShWDjNN3Vcr4PVAnvXxQ4DiGy6V=0bCUA@mail.gmail.com
Discussion: https://postgr.es/m/CADkLM%3DcB0rF3p_FuWRTMSV0983ihTRpsH%2BOCpNyiqE7Wk0vUWA%40mail.gmail.com
Patch adds new pre-defined role `pg_signal_autovacuum` and tap tests.
Role, granted with `pg_signal_autovacuum` is able to cancel running
autovacuum worker. Note that to cancel all other type of queries
role need to be granted with `pg_singal_backend`.
Currently we allow to cancel awaiting of syncronous commit.
Some drivers cancel query after timeout. If application will retry
idempotent query, it will get confirmation of written data.
This can lead to split-brain in HA scenarios. To prevent it this
we add synchronous_commit_cancelation setting disalowing cancelation
of syncronous replication wait

Version for PostgreSQL 16
Introduces 3 functions:
extern bool mdb_admin_allow_bypass_owner_checks(Oid userId,  Oid ownerId);
extern void check_mdb_admin_is_member_of_role(Oid member, Oid role);
extern bool mdb_admin_is_member_of_role(Oid member, Oid role);

To check mdb admin belongship and role-to-role ownership transfer
correctness.

Our mdb_admin ACL model is the following:

* Any roles user or/and roles can be granted with mdb_admin
* mdb_admin member can tranfser ownershup of relations,
namespaces and functions to other roles, if target role in neither:
superuser, pg_read_server_files, pg_write_server_files nor
pg_execute_server_program.

* Allow mdb_admin to create LEAKPROOF functions
* mdb admin sets session replication role
* [MDB-16648 + MDB-17910]: Allow mdb admin to kill specific superuser queries

MDB-27288: allow mdb_admin to kill autovac + tests

mdb-27228: fix expected output

One more fix

Fix warns
MDB replication role regression tests

Patch allows user with mdb_replication role to use
pg_create_logical_replication_slot, pg_replication_slot_advance,
pg_drop_replication_slot functions to manage logical replication slots.
Also, users with mdb_admin (which is memner of pg_create_subscription)
can create subscribptions. Slot names starting with MDB.* are forbidden.

Add run as owner tap tests

More test cases in mdb_102

Fix test

Fix tests once again

Never check for superuser in walsender
This commit introduces new mdb internal role mdb_superuser.

Role is capaple of:

GRANT/REVOKE any set of priviledges to/from any object in database.
Has power of pg_database_owner in any database, including:
DROP any object in database (except system catalog and stuff)

Role is NOT capaple of:

Create database, role, extension or alter other roles with such
priviledges.

Transfer ownership to /pass has_priv of roles:

PG_READ_ALL_DATA
PG_WRITE_ALL_DATA
PG_EXECUTE_SERVER_PROGRAM
PG_READ_SERVER_FILES
PG_WRITE_SERVER_FILES

Fix configure.ac USE_MDBLOCALES option handling

Apply autoreconf stuff

Set missing ok parameter ito true while acquiring mdb_superuser oid

In regress tests, nobody creates mdb_superuser role, so missing ok is
fine

Fix spelling

Applied suggestion

Allow mdb_superuser to have power of pg_database_owner

Allow mdb_superuser to alter objects and grant ACl to
objects, owned by pg_database_owner. Also, when acl check,
allow mdb_superuser use pg_database_owner role power to pass check

regression test fixes

MDB-32132: fix grantor selection for mdb_superuser (#6)
… non-superuser

It is well known that some of PostgreSQL-related security issues (CVE)
was related to COPY FROM/TO PROGRAM exploits for priviledge escalation
or other unwanted behaviour or consequences. Thus, proper usage of
this feature needed. For now, simply forbit this.

Add mdb copy test
Do not use mdb_locales and mdb_newlocale when configured without them.

Added ifdef codepath build without mdb-locales feature

Add mdb locales patch, restore COPY from/to files, enable regress.

Squashed commit of the following:

commit 9f8ea4a5f42e0fd6077061bafbb428e88499896f
Author: reshke kirill <reshke@double.cloud>
Date:   Wed Feb 22 09:27:53 2023 +0000

    Add mdb locales patch, restore COPY from/to files, enable regress.

    This commit does several things.
    * Enables back COPY from/to FILE functionality, becuase it is used
    by pg_regress
    * Enables pg_regress tests in deb build itself.
    * Add mdb locales function and checks that package build with
    mdb-locales support

Add configure ac target to define USE_MDBLOCALES properly

Refactor optional setlocale, fix minor issues
Accept new startup param _pq_.service_auth_role for service auth under
any user, which is niether superuser nor have some dangerous system
role priv

Add tap-test for mdb service role auth 👍👌😉

Fix tests after rebase

contrib tests 💅️️💅️️💅️️ now works

MDB-23247: debug ouput for testing purposes lowered to DEBUG5 elog level

Skip caching routines for pre-startup logic (SCRAM -service auth role)
Update mdb-patched.md

Update mdb-pacthes.md
x4m and others added 24 commits April 15, 2026 12:50
We learned from the field incidents, that 128Kb is not enough.
This time we are going to try to increase this value.
Use fadvise in walsender

remove bogus progress reporting

Fix fadvise patch
There is no need to log the entire query, because it may be large and take lots of space on disk. Parameter max_log_size set the maximum length for logged query. Everything beyond that length is truncated. Value 0 disables the parameter.
Merge in MDB/postgres-dev from MDB-31374-bump-llvm-18-pg17_0 to MDB_17_0

Squashed commit of the following:

commit 0931fd9d64288653cd950cd19d941bd4d51640d7
Author: Andrey Lyarskiy <aslyarskiy@yandex-team.ru>
Date:   Tue Oct 29 12:59:04 2024 +0300

    bump llvm to 18
Add new parameter to VACUUM command, FORCE, meaning VACUUM should
terminate all backends that prevents the execution by holding conflicting lock
Also truncate query to be logged in simple query

Remove unused functions in buf_internals.h
Do not run mdb locales tests in OS

Fix CI && fast CI circuit (#24)

Use mirror apt repo (#35)

Refactor regress Dockerfile

Update Dockerfile: try

Update Dockerfile
Allow usage on schema for mdb_read_all_data (#23)
pg_signal_autovacuum

Until pg18 we have to work with our internal implementation of
pg_signal_autovacuum_worker. Since explicit GRANT will cause
problems during pg_upgrade, use internal implicit grant logic
In the case of a large PGSS_TEXT_FILE, the work time of the qtext_load_file
function will be quite long, and the query to the pg_stat_statements table
will not be cancellable, as there is no CHECK_FOR_INTERRUPT in the function.

Also, the amount of bytes read can reach 1 GB, which leads to a slow read
system call that does not allow cancellation of the query. Testing the speed
of sequential read using fio with different block sizes shows that there is
no significant difference between 16 MB blocks and 1 GB blocks.

Therefore, this patch changes the maximum read value from 1 GB to 16 MB and
adds INTERRUPTS_PENDING_CONDITION() check in the read loop of qtext_load_file to make it cancellable.
For now, only statement execution is cancellable (fail_on_interrupt is true only for calls from pg_stat_statements_internal)

Signed-off-by: rkhapov <r.khapov@ya.ru>
Reviewed-by: reshke <reshke@double.cloud>
Introduce a new archive_mode setting "shared" to prevent WAL history
loss during standby promotion in HA streaming replication setups.

In shared mode, the primary proactively sends archival status updates
to standbys via the replication protocol. The standby creates .ready
files for received WAL segments but defers marking them as .done until
the primary confirms archival. This prevents WAL from being recycled
before it's safely archived, addressing a critical gap in PITR continuity
during failover.

Key implementation details:

- Primary periodically sends last archived WAL segment via new
  PqReplMsg_ArchiveStatusReport ('a') message
- Standby marks all segments <= reported segment as .done using
  alphanumeric comparison on segment part (timeline-safe)
- Archiver skips during recovery in shared mode, activates on promotion
- Cascading replication: each standby coordinates with immediate upstream
- Startup check rejects archive_mode=on during recovery

This "push" design (primary sends status) is more efficient than "pull"
(standby queries per-segment), avoiding directory scans and stat() calls.
Based on Heikki Linnakangas's 2014 design and Greenplum's production
implementation, modernized for PostgreSQL 19.

Includes TAP tests covering basic synchronization, promotion,
cascading replication, and multiple standbys scenarios.
When standby receives archive status report, check if .ready files
belong to ancestor timelines before the switch point and mark them
as .done if already archived by primary.
When archive status reports arrive sequentially on the same timeline,
directly generate expected WAL filenames and mark them as archived
instead of scanning the entire archive_status directory.

This optimization reduces overhead in the common case where the primary
continuously archives segments. Directory scan is still used when:
- Timeline changes (to handle ancestor timelines)
- First report received
- Non-sequential reports

XLogArchiveForceDone() handles all cases internally (checking if .done
exists, if .ready exists, or creating .done if neither exists), so no
pre-check is needed.
1. Checkpoint on standby deletes WAL with .ready status.
   XLogArchiveCheckDone() treated archive_mode=shared like archive_mode=on
   during recovery, returning true unconditionally and allowing checkpoint
   to remove WAL segments that the primary had not yet archived.
   Fix: exclude shared mode from the early-return path, same as "always".

2. Walsender never sends archival status reports after archiving is restored.
   WalSndArchivalReport() calls pgstat_fetch_stat_archiver() whose result is
   cached per-session (PGSTAT_FETCH_CONSISTENCY_CACHE by default).  The
   walsender has no transaction boundaries that would clear the cache, so
   last_archived_wal remained "" forever, and strcmp() suppressed all reports.
   Fix: call pgstat_clear_snapshot() before fetching archiver stats.

Add TAP tests in 051_archive_shared_checkpoint.pl that reproduce both bugs,
and extend 050_archive_shared.pl with checkpoint/restore scenarios.
@reshke reshke force-pushed the MDB_17_STABLE branch 3 times, most recently from 0be7768 to d3dfbf2 Compare April 30, 2026 13:02
@reshke reshke force-pushed the MDB_17_STABLE branch 3 times, most recently from 7de4b1d to c15ed3e Compare May 11, 2026 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants