Skip to content

wait_event_timing: reviewable patch series (stats / DSA / trace)#2

Open
DmitryNFomin wants to merge 7 commits into
wet-series-basefrom
wet-series
Open

wait_event_timing: reviewable patch series (stats / DSA / trace)#2
DmitryNFomin wants to merge 7 commits into
wet-series-basefrom
wet-series

Conversation

@DmitryNFomin

@DmitryNFomin DmitryNFomin commented Jun 7, 2026

Copy link
Copy Markdown
Owner

Purpose

This PR holds the submission-ready 7-patch series for the wait_event_timing feature, staged here for Andrey's review before it goes to pgsql-hackers. It is the reworked form of the original 8k-line monolith (#1), restructured per Andrey Borodin's review: "send them all together, but break into chewable parts ... defer DSA refactoring ... so the committer can see value from a first commit without venturing into 8k lines."

Base: upstream master 89eafad297a (branch wet-series-base). Each commit builds and passes the full test suite on its own (bisectable); docs and tests travel with the code they cover; both build systems are updated in lockstep.

The 7 patches

# commit patch scope
0001 5041767 flag + GUC scaffold --enable-wait-event-timing (autoconf + meson) and the wait_event_capture GUC (off|stats); pure scaffolding, no behavior
0002 4bcad6d stats level per-(backend, event) count / total / max / 32-bucket log2 histogram in eagerly-allocated shared memory (deliberately no DSA); single load+branch inline gate, recording bodies out-of-line; pg_stat_wait_event_timing + histogram-buckets views; docs + tests
0003 966c6ac overflow + reset overflow-counter view; own-backend synchronous reset; cross-backend lock-free reset (atomic generation + latch); docs + tests
0004 53a714c CI enable the flag on one GitHub Actions task so instrumented and stub paths both stay covered
0005 f1ee415 lazy-DSA refactor eager shmem → lazily-created DSA (zero per-backend memory until first enable). Pure refactor: SQL surface unchanged, 0002/0003 tests pass as-is. The riskiest patch — lazy-attach guards (critical sections, LWLock wait queues, re-entrancy) and the proc_exit teardown gate — isolated by design
0006 6920aca trace level per-session DSA ring of individual waits; own-session + cross-backend readers; post-mortem ORPHANED-ring reads + reclaim; position-encoded identity seqlock with an injection-point TAP test proving stale-cycle rejection (255-vs-256 at a wedged wrapped ring); docs + tests
0007 983f0ec query attribution ExecStart/End + QueryStart/End markers (executor + protocol boundaries, incl. the pipelined-protocol flush, gated on trace); docs + tests

Groups for the cover letter: patches 1–4 = stats (independently committable), patch 5 = storage refactor, patches 6–7 = trace.

Reviewer-concern checklist (from the #1 review)

concern resolution
unlikely() on the inline gate Removed — A/B benchmark first: byte-identical codegen for LWLockAcquire/Release/WaitLatch on gcc 13 -O2, ±0.09 % TPS. Gates carry "no unlikely()" comments with the rationale
split with DSA deferred 0002 contains zero DSA; 0005 is a pure refactor touching no SQL surface; all trace machinery confined to 0006–0007
value without 8k lines First value-bearing patch (0002) is self-contained at ~1.8k lines incl. docs+tests; series totals 45 files, +5611/−10

Validation summary

Key facts (CCX43, 16 dedicated vCPU, gcc 13):

  • Correctness: full meson test (regression + isolation + recovery/subscription/pg_upgrade TAP): 366 Ok / 0 Fail timing build, 365 Ok / 0 Fail stub build (seqlock TAP test correctly skips in stub). Every intermediate patch stage rebuilt and re-verified (builds + regress; seqlock test at 0006). UBSan-clean.

  • Stress (cassert): 300-backend churn + parallel workers attaching in critical sections + concurrent TPC-B, capture forced cluster-wide at stats and at trace (incl. 20× trace↔off toggles, 102 orphan rings reclaimed) — zero asserts/panics.

  • Concurrency: two-session cross-backend reset serviced asynchronously; post-mortem orphan-ring read after owner exit + explicit reclaim.

  • Performance (pgbench, 5 interleaved runs/cell):

    workload WET/off vs baseline WET/stats vs off
    -S read-only (cached) −0.57 % −0.30 %
    TPC-B +0.18 % (8-run re-test) −0.51 %
    -S wait-saturated (32 MB SB) −0.42 % +0.29 %

    An initial TPC-B −3 % was proven a binary-layout artifact with a gate-compiled-out control (the gate-removed build measured slower than the gate-included one). Off-mode cost is below the measurement floor.

  • Mechanical integrity: the 7-patch set applies cleanly with git am onto 89eafad297a and reproduces a tree bit-identical to the fully-validated development tip. Adversarial reviews passed for each risky patch and for the whole series.

Fixes over the original #1 (surfaced by the split + reviews)

  1. pg_stat_reset_wait_event_timing documented DEFAULT NULL but lacked proargdefaults — no-arg call would error.
  2. The session-local trace SRF was PUBLIC-executable while its view was revoked (security).
  3. A SELECT on a never-enabled cluster created (and pinned) an unused DSA segment.
  4. The pipelined-protocol query_id flush perturbed pg_stat_activity even with the feature off — now gated on trace.
  5. The "bgwriter periodic sweep" claim in Add wait_event_timing: Oracle-style wait event instrumentation #1's description had no implementation — docs now describe the two real reclamation paths.
  6. The trace writer could self-deadlock under injection instrumentation (recursion through the injection machinery's own wait events) — found by the new seqlock test, fixed with a re-entrancy guard.

Status & deliverables

  • format-patch files + filled cover letter: patches-v1/ (local; nothing sent to pgsql-hackers yet).
  • Development history (10 commits, all review rounds): branch wet-series-10commit.
  • Commit messages carry no tooling trailers; AI-assistance disclosure is planned for the cover letter/thread.

…_capture GUC

Introduce the compile-time option --enable-wait-event-timing
(meson: -Dwait_event_timing=true) defining USE_WAIT_EVENT_TIMING, and
the runtime GUC wait_event_capture (PGC_SUSET, enum off|stats, default
off).  This commit is scaffolding only: it wires the flag through both
build systems and registers the GUC with check/assign hooks, but adds
no instrumentation yet -- later commits in this series add the
recording hot path, the SQL surface, and the trace level.

In builds compiled without --enable-wait-event-timing, the GUC's check
hook rejects any value other than off (downgrading to off with a
warning for non-interactive sources), so the variable exists uniformly
for tooling but cannot be enabled.
…vel)

Implement wait_event_capture = stats.  When enabled, every transition
through pgstat_report_wait_start()/pgstat_report_wait_end() records the
wait duration and accumulates per-(backend, event) statistics -- count,
total and maximum duration, and a 32-bucket log2 duration histogram -- in
shared memory.

Storage is a per-backend slot array in the main shared memory segment,
sized at postmaster start from wait_event_timing_max_tranches.  Because
each slot lives for the entire life of its backend, the hot path needs no
lazy attach and no teardown gating: pgstat_report_wait_start_timing()
records a timestamp and the current event, and
pgstat_report_wait_end_timing() computes the duration and accumulates.
Each backend writes only to its own slot, so no locking is required and
the SRF reader is lock-free.

The inline gate in pgstat_report_wait_start()/_end() is a single load of
wait_event_capture plus a branch, with the bodies kept out-of-line so the
many inlined call sites stay compact; while capture is off the hot path
adds only that branch.

Non-LWLock wait events map to a flat array via a class table generated
from wait_event_names.txt by generate-wait_event_types.pl.  LWLock events,
whose tranche ids are unbounded, use a per-backend open-addressing hash
capped by wait_event_timing_max_tranches (PGC_POSTMASTER, default 192).

SQL surface:
  - pg_stat_get_wait_event_timing(pid) and the pg_stat_wait_event_timing
    view, one row per backend per event with a non-zero count;
  - pg_wait_event_timing_histogram_buckets, naming the 32 histogram bins.

Builds compiled without --enable-wait-event-timing keep the GUC (its check
hook rejects any non-off value) and empty-result SQL stubs, so tooling
sees a uniform surface.

A later commit in the series exposes the per-backend overflow counters
(maintained here) and adds the reset functions.
Surface the per-backend truncation counters maintained by the recording
path, and add the ability to reset wait-event-timing statistics.

pg_stat_get_wait_event_timing_overflow(pid) and the
pg_stat_wait_event_timing_overflow view report, per backend,
lwlock_overflow_count (LWLock waits dropped because the per-backend
tranche hash was full), flat_overflow_count (events whose class index was
out of range), and reset_count.

Resets use a lock-free request/response so the hot path stays single
writer: each slot carries an atomic reset_generation, bumped by the
resetter; the owning backend compares it against a backend-local
last-seen value at its next wait_end and clears its own counters,
incrementing reset_count.  pg_stat_reset_wait_event_timing(pid) resets one
backend -- synchronously when it targets the caller's own session (any
user; pid defaults to NULL), or asynchronously for another backend
(requiring pg_signal_backend, matching pg_stat_reset_backend_stats).
pg_stat_reset_wait_event_timing_all() resets every backend and is
superuser-only.

Builds without --enable-wait-event-timing keep empty-result/feature-not-
supported stubs for the new functions.
Add --enable-wait-event-timing to the "Linux - Autoconf" GitHub Actions
task so the wait-event-timing build path -- including the expected output
src/test/regress/expected/wait_event_timing.out -- is exercised on every
push.  That task already runs check-world under the undefined/alignment
sanitizers with a small segment size, giving the timing code meaningful
coverage.  Every other CI task keeps building without the flag, so the
stub path and its alternate output wait_event_timing_1.out remain covered
as well.
Convert the per-backend wait-event-timing slot array from eager
main-segment shared memory to a lazily-allocated DSA region.  Only a
small control struct (a DSA handle plus an LWLock) now lives in fixed
shared memory; the large array -- ~30 KB per backend at the default
wait_event_timing_max_tranches -- is allocated the first time any backend
in the cluster sets wait_event_capture to a non-off value.  A build that
compiles the feature in but never enables it therefore pays no
per-backend memory, and a SELECT against the views on a cluster that
never enabled capture does not even create the DSA.

This is a pure refactor: the SQL surface and observable behavior are
unchanged, and the existing regression tests pass without modification.

Backends attach to the array on their first wait event under capture, in
pgstat_wait_event_timing_lazy_attach().  Because that runs from the
wait-event hot path, it carries the guards that make DSA work safe there:

  - skip while CritSectionCount > 0 (dsa_attach -> MemoryContextAlloc
    asserts inside a critical section);
  - skip while MyProc->lwWaiting != LW_WS_NOT_WAITING (a nested
    LWLockQueueSelf on the control lock would PANIC);
  - an in_attach re-entrancy guard, because dsa_create / dsa_allocate /
    the control-lock acquisition can themselves emit LWLock wait events
    that re-enter the hot path;
  - a before_shmem_exit gate (wait_event_timing_writes_disabled) so the
    hot path stops touching DSA once proc_exit begins tearing the
    mappings down -- shmem_exit runs all before_shmem_exit callbacks
    before dsm_backend_shutdown, so the gate is up before any unmap.

A new LWLock tranche, WaitEventTimingDSA, names the control lock.
Add the third capture level, wait_event_capture = trace.  On top of the
STATS aggregates, every completed wait is pushed into a per-session ring
buffer in DSA -- one record per wait -- allocated lazily on first use so
only sessions that enable trace pay the (default 4 MB) per-ring cost.

The ring is exposed two ways:

  - pg_get_backend_wait_event_trace() / the pg_backend_wait_event_trace
    view read the calling backend's own ring;
  - pg_get_wait_event_trace(procnumber) reads any backend's ring
    cross-process, including rings left behind by exited backends.

A backend's ring is not freed when it exits: the slot transitions to
ORPHANED and the ring stays in DSA so cross-backend consumers can read
the dying backend's final waits (important for short-lived parallel
workers).  An orphan is reclaimed when a new backend reuses the same
procNumber (clear-on-init) or by pg_stat_clear_orphaned_wait_event_rings()
(execution revoked from PUBLIC; delegable with GRANT).  Slot transitions
(FREE/OWNED/ORPHANED) are serialised by the WaitEventTraceControl lock;
the ring size is fixed cluster-wide at server start by
wait_event_trace_ring_size (power of two, default 4 MB).

The single-writer hot path writes each record under a seqlock (odd seq
while writing, even when complete).  Cross-backend readers use a
POSITION-ENCODED IDENTITY seqlock -- a record at ring index i is valid
only if its seq equals the writer's complete value for that exact position
-- which rejects stale previous-cycle reads that a parity-only seqlock
would accept under cross-process visibility lag.  A TAP test drives an
injection point between the writer's write_pos advance and its seq stamp
to prove exactly that: with the ring wrapped and the writer wedged
mid-record, a cross-backend read returns ring_size - 1 records, skipping
the in-flight slot whose stale prior-cycle record a parity-only check
would have emitted.

The trace ring writer carries the same teardown discipline as the stats
path: a wait_event_trace_writes_disabled gate (raised around slot
transitions and at proc_exit) plus in_attach/in_release re-entrancy guards
keep the DSA-internal LWLock waits that those operations emit from
recursing into a ring that is being freed or orphaned.  A further
re-entrancy guard on the record writer itself keeps wait events emitted
mid-record-write (reachable only via the injection point; the write is
plain stores in production) from recursing into the writer.

Privileges: both trace SRFs and the session-local view are REVOKE'd from
PUBLIC and GRANT'ed to pg_read_all_stats (reading a session's trace
exposes its wait sequence).

Query-attribution markers and their executor/protocol hooks are added in
the next commit; this commit records wait events only.
At the trace level, interleave query-boundary markers with the wait
events in the per-session ring so a reader can tell which query each wait
belongs to.  Two marker families are emitted:

  - ExecStart/ExecEnd bracket every executor run
    (ExecutorStart/ExecutorEnd), the primary attribution signal --
    every executable statement, including those in parallel workers and
    pipelined extended-protocol messages, is bracketed;
  - QueryStart/QueryEnd fire at top-level query_id transitions
    (pgstat_report_query_id) and at the transition to idle
    (send_ready_for_query), providing the inter-statement boundaries the
    executor markers cannot -- e.g. the ClientRead wait between
    statements.

In the pipelined extended protocol a Parse/Bind/Execute can arrive while
the previous query's id is still set and the session is still RUNNING
(no Sync->idle in between), so the prior id is flushed with force=true at
those message boundaries to fire its QUERY_END before the new query
starts.  That flush is gated on wait_event_capture = trace, so when trace
is off it is a no-op and pg_stat_activity.query_id behaves exactly as
before.

Marker emission requires a non-zero query_id (compute_query_id) and
track_activities; a WARNING is logged when trace is enabled without them.
The markers themselves are no-ops unless capture is at trace, so the only
cost when the feature is off is the inline gate already present.
@DmitryNFomin

Copy link
Copy Markdown
Owner Author

Series reshaped into the 7-patch deliverable (for review before pgsql-hackers submission).

The 10 development commits were redistributed into 7 submission-grade patches — the cleanup pass and the seqlock-test commit folded into their logical parents, and the trace docs/tests moved into the patches that introduce the code they cover:

0001  add --enable-wait-event-timing flag and wait_event_capture GUC
0002  record per-backend wait event statistics (stats level)      [no DSA]
0003  expose overflow counters and add reset functions
0004  ci: build one task with --enable-wait-event-timing
0005  allocate the per-backend array lazily in DSA                [pure refactor]
0006  add trace level with a per-session ring buffer              [+ seqlock TAP test]
0007  add query-attribution markers to the trace ring

Verification: every intermediate stage was rebuilt and re-verified on the box (timing + stub builds, regress+isolation; the seqlock TAP test at 0006), and the final tree is bit-identical to the previously validated tip (86593a6, full suite 366/0 + 365/0) — so all prior validation transfers. The whole set applies cleanly with git am onto upstream 89eafad297a and reproduces the tree exactly. Commit messages carry no tooling trailers.

The old 10-commit history is preserved at branch wet-series-10commit. A draft cover letter + format-patch files are prepared locally (patches-v1/); nothing has been sent to pgsql-hackers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant