wait_event_timing: reviewable patch series (stats / DSA / trace) by DmitryNFomin · Pull Request #2 · DmitryNFomin/postgres

DmitryNFomin · 2026-06-07T20:35:53Z

Purpose

This PR holds the submission-ready 7-patch series for the wait_event_timing feature, staged here for Andrey's review before it goes to pgsql-hackers. It is the reworked form of the original 8k-line monolith (#1), restructured per Andrey Borodin's review: "send them all together, but break into chewable parts ... defer DSA refactoring ... so the committer can see value from a first commit without venturing into 8k lines."

Base: upstream master 89eafad297a (branch wet-series-base). Each commit builds and passes the full test suite on its own (bisectable); docs and tests travel with the code they cover; both build systems are updated in lockstep.

The 7 patches

#	commit	patch	scope
0001	`5041767`	flag + GUC scaffold	`--enable-wait-event-timing` (autoconf + meson) and the `wait_event_capture` GUC (off\|stats); pure scaffolding, no behavior
0002	`4bcad6d`	stats level	per-(backend, event) count / total / max / 32-bucket log2 histogram in eagerly-allocated shared memory (deliberately no DSA); single load+branch inline gate, recording bodies out-of-line; `pg_stat_wait_event_timing` + histogram-buckets views; docs + tests
0003	`966c6ac`	overflow + reset	overflow-counter view; own-backend synchronous reset; cross-backend lock-free reset (atomic generation + latch); docs + tests
0004	`53a714c`	CI	enable the flag on one GitHub Actions task so instrumented and stub paths both stay covered
0005	`f1ee415`	lazy-DSA refactor	eager shmem → lazily-created DSA (zero per-backend memory until first enable). Pure refactor: SQL surface unchanged, 0002/0003 tests pass as-is. The riskiest patch — lazy-attach guards (critical sections, LWLock wait queues, re-entrancy) and the proc_exit teardown gate — isolated by design
0006	`6920aca`	trace level	per-session DSA ring of individual waits; own-session + cross-backend readers; post-mortem ORPHANED-ring reads + reclaim; position-encoded identity seqlock with an injection-point TAP test proving stale-cycle rejection (255-vs-256 at a wedged wrapped ring); docs + tests
0007	`983f0ec`	query attribution	ExecStart/End + QueryStart/End markers (executor + protocol boundaries, incl. the pipelined-protocol flush, gated on trace); docs + tests

Groups for the cover letter: patches 1–4 = stats (independently committable), patch 5 = storage refactor, patches 6–7 = trace.

Reviewer-concern checklist (from the #1 review)

concern	resolution
`unlikely()` on the inline gate	Removed — A/B benchmark first: byte-identical codegen for LWLockAcquire/Release/WaitLatch on gcc 13 `-O2`, ±0.09 % TPS. Gates carry "no unlikely()" comments with the rationale
split with DSA deferred	0002 contains zero DSA; 0005 is a pure refactor touching no SQL surface; all trace machinery confined to 0006–0007
value without 8k lines	First value-bearing patch (0002) is self-contained at ~1.8k lines incl. docs+tests; series totals 45 files, +5611/−10

Validation summary

Key facts (CCX43, 16 dedicated vCPU, gcc 13):

Correctness: full meson test (regression + isolation + recovery/subscription/pg_upgrade TAP): 366 Ok / 0 Fail timing build, 365 Ok / 0 Fail stub build (seqlock TAP test correctly skips in stub). Every intermediate patch stage rebuilt and re-verified (builds + regress; seqlock test at 0006). UBSan-clean.
Stress (cassert): 300-backend churn + parallel workers attaching in critical sections + concurrent TPC-B, capture forced cluster-wide at stats and at trace (incl. 20× trace↔off toggles, 102 orphan rings reclaimed) — zero asserts/panics.
Concurrency: two-session cross-backend reset serviced asynchronously; post-mortem orphan-ring read after owner exit + explicit reclaim.
Performance (pgbench, 5 interleaved runs/cell):

workload WET/off vs baseline WET/stats vs off

-S read-only (cached) −0.57 % −0.30 %

TPC-B +0.18 % (8-run re-test) −0.51 %

-S wait-saturated (32 MB SB) −0.42 % +0.29 %

An initial TPC-B −3 % was proven a binary-layout artifact with a gate-compiled-out control (the gate-removed build measured slower than the gate-included one). Off-mode cost is below the measurement floor.
Mechanical integrity: the 7-patch set applies cleanly with git am onto 89eafad297a and reproduces a tree bit-identical to the fully-validated development tip. Adversarial reviews passed for each risky patch and for the whole series.

Fixes over the original #1 (surfaced by the split + reviews)

pg_stat_reset_wait_event_timing documented DEFAULT NULL but lacked proargdefaults — no-arg call would error.
The session-local trace SRF was PUBLIC-executable while its view was revoked (security).
A SELECT on a never-enabled cluster created (and pinned) an unused DSA segment.
The pipelined-protocol query_id flush perturbed pg_stat_activity even with the feature off — now gated on trace.
The "bgwriter periodic sweep" claim in Add wait_event_timing: Oracle-style wait event instrumentation #1's description had no implementation — docs now describe the two real reclamation paths.
The trace writer could self-deadlock under injection instrumentation (recursion through the injection machinery's own wait events) — found by the new seqlock test, fixed with a re-entrancy guard.

Status & deliverables

format-patch files + filled cover letter: patches-v1/ (local; nothing sent to pgsql-hackers yet).
Development history (10 commits, all review rounds): branch wet-series-10commit.
Commit messages carry no tooling trailers; AI-assistance disclosure is planned for the cover letter/thread.

…_capture GUC Introduce the compile-time option --enable-wait-event-timing (meson: -Dwait_event_timing=true) defining USE_WAIT_EVENT_TIMING, and the runtime GUC wait_event_capture (PGC_SUSET, enum off|stats, default off). This commit is scaffolding only: it wires the flag through both build systems and registers the GUC with check/assign hooks, but adds no instrumentation yet -- later commits in this series add the recording hot path, the SQL surface, and the trace level. In builds compiled without --enable-wait-event-timing, the GUC's check hook rejects any value other than off (downgrading to off with a warning for non-interactive sources), so the variable exists uniformly for tooling but cannot be enabled.

…vel) Implement wait_event_capture = stats. When enabled, every transition through pgstat_report_wait_start()/pgstat_report_wait_end() records the wait duration and accumulates per-(backend, event) statistics -- count, total and maximum duration, and a 32-bucket log2 duration histogram -- in shared memory. Storage is a per-backend slot array in the main shared memory segment, sized at postmaster start from wait_event_timing_max_tranches. Because each slot lives for the entire life of its backend, the hot path needs no lazy attach and no teardown gating: pgstat_report_wait_start_timing() records a timestamp and the current event, and pgstat_report_wait_end_timing() computes the duration and accumulates. Each backend writes only to its own slot, so no locking is required and the SRF reader is lock-free. The inline gate in pgstat_report_wait_start()/_end() is a single load of wait_event_capture plus a branch, with the bodies kept out-of-line so the many inlined call sites stay compact; while capture is off the hot path adds only that branch. Non-LWLock wait events map to a flat array via a class table generated from wait_event_names.txt by generate-wait_event_types.pl. LWLock events, whose tranche ids are unbounded, use a per-backend open-addressing hash capped by wait_event_timing_max_tranches (PGC_POSTMASTER, default 192). SQL surface: - pg_stat_get_wait_event_timing(pid) and the pg_stat_wait_event_timing view, one row per backend per event with a non-zero count; - pg_wait_event_timing_histogram_buckets, naming the 32 histogram bins. Builds compiled without --enable-wait-event-timing keep the GUC (its check hook rejects any non-off value) and empty-result SQL stubs, so tooling sees a uniform surface. A later commit in the series exposes the per-backend overflow counters (maintained here) and adds the reset functions.

Surface the per-backend truncation counters maintained by the recording path, and add the ability to reset wait-event-timing statistics. pg_stat_get_wait_event_timing_overflow(pid) and the pg_stat_wait_event_timing_overflow view report, per backend, lwlock_overflow_count (LWLock waits dropped because the per-backend tranche hash was full), flat_overflow_count (events whose class index was out of range), and reset_count. Resets use a lock-free request/response so the hot path stays single writer: each slot carries an atomic reset_generation, bumped by the resetter; the owning backend compares it against a backend-local last-seen value at its next wait_end and clears its own counters, incrementing reset_count. pg_stat_reset_wait_event_timing(pid) resets one backend -- synchronously when it targets the caller's own session (any user; pid defaults to NULL), or asynchronously for another backend (requiring pg_signal_backend, matching pg_stat_reset_backend_stats). pg_stat_reset_wait_event_timing_all() resets every backend and is superuser-only. Builds without --enable-wait-event-timing keep empty-result/feature-not- supported stubs for the new functions.

Add --enable-wait-event-timing to the "Linux - Autoconf" GitHub Actions task so the wait-event-timing build path -- including the expected output src/test/regress/expected/wait_event_timing.out -- is exercised on every push. That task already runs check-world under the undefined/alignment sanitizers with a small segment size, giving the timing code meaningful coverage. Every other CI task keeps building without the flag, so the stub path and its alternate output wait_event_timing_1.out remain covered as well.

Convert the per-backend wait-event-timing slot array from eager main-segment shared memory to a lazily-allocated DSA region. Only a small control struct (a DSA handle plus an LWLock) now lives in fixed shared memory; the large array -- ~30 KB per backend at the default wait_event_timing_max_tranches -- is allocated the first time any backend in the cluster sets wait_event_capture to a non-off value. A build that compiles the feature in but never enables it therefore pays no per-backend memory, and a SELECT against the views on a cluster that never enabled capture does not even create the DSA. This is a pure refactor: the SQL surface and observable behavior are unchanged, and the existing regression tests pass without modification. Backends attach to the array on their first wait event under capture, in pgstat_wait_event_timing_lazy_attach(). Because that runs from the wait-event hot path, it carries the guards that make DSA work safe there: - skip while CritSectionCount > 0 (dsa_attach -> MemoryContextAlloc asserts inside a critical section); - skip while MyProc->lwWaiting != LW_WS_NOT_WAITING (a nested LWLockQueueSelf on the control lock would PANIC); - an in_attach re-entrancy guard, because dsa_create / dsa_allocate / the control-lock acquisition can themselves emit LWLock wait events that re-enter the hot path; - a before_shmem_exit gate (wait_event_timing_writes_disabled) so the hot path stops touching DSA once proc_exit begins tearing the mappings down -- shmem_exit runs all before_shmem_exit callbacks before dsm_backend_shutdown, so the gate is up before any unmap. A new LWLock tranche, WaitEventTimingDSA, names the control lock.

Add the third capture level, wait_event_capture = trace. On top of the STATS aggregates, every completed wait is pushed into a per-session ring buffer in DSA -- one record per wait -- allocated lazily on first use so only sessions that enable trace pay the (default 4 MB) per-ring cost. The ring is exposed two ways: - pg_get_backend_wait_event_trace() / the pg_backend_wait_event_trace view read the calling backend's own ring; - pg_get_wait_event_trace(procnumber) reads any backend's ring cross-process, including rings left behind by exited backends. A backend's ring is not freed when it exits: the slot transitions to ORPHANED and the ring stays in DSA so cross-backend consumers can read the dying backend's final waits (important for short-lived parallel workers). An orphan is reclaimed when a new backend reuses the same procNumber (clear-on-init) or by pg_stat_clear_orphaned_wait_event_rings() (execution revoked from PUBLIC; delegable with GRANT). Slot transitions (FREE/OWNED/ORPHANED) are serialised by the WaitEventTraceControl lock; the ring size is fixed cluster-wide at server start by wait_event_trace_ring_size (power of two, default 4 MB). The single-writer hot path writes each record under a seqlock (odd seq while writing, even when complete). Cross-backend readers use a POSITION-ENCODED IDENTITY seqlock -- a record at ring index i is valid only if its seq equals the writer's complete value for that exact position -- which rejects stale previous-cycle reads that a parity-only seqlock would accept under cross-process visibility lag. A TAP test drives an injection point between the writer's write_pos advance and its seq stamp to prove exactly that: with the ring wrapped and the writer wedged mid-record, a cross-backend read returns ring_size - 1 records, skipping the in-flight slot whose stale prior-cycle record a parity-only check would have emitted. The trace ring writer carries the same teardown discipline as the stats path: a wait_event_trace_writes_disabled gate (raised around slot transitions and at proc_exit) plus in_attach/in_release re-entrancy guards keep the DSA-internal LWLock waits that those operations emit from recursing into a ring that is being freed or orphaned. A further re-entrancy guard on the record writer itself keeps wait events emitted mid-record-write (reachable only via the injection point; the write is plain stores in production) from recursing into the writer. Privileges: both trace SRFs and the session-local view are REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats (reading a session's trace exposes its wait sequence). Query-attribution markers and their executor/protocol hooks are added in the next commit; this commit records wait events only.

At the trace level, interleave query-boundary markers with the wait events in the per-session ring so a reader can tell which query each wait belongs to. Two marker families are emitted: - ExecStart/ExecEnd bracket every executor run (ExecutorStart/ExecutorEnd), the primary attribution signal -- every executable statement, including those in parallel workers and pipelined extended-protocol messages, is bracketed; - QueryStart/QueryEnd fire at top-level query_id transitions (pgstat_report_query_id) and at the transition to idle (send_ready_for_query), providing the inter-statement boundaries the executor markers cannot -- e.g. the ClientRead wait between statements. In the pipelined extended protocol a Parse/Bind/Execute can arrive while the previous query's id is still set and the session is still RUNNING (no Sync->idle in between), so the prior id is flushed with force=true at those message boundaries to fire its QUERY_END before the new query starts. That flush is gated on wait_event_capture = trace, so when trace is off it is a no-op and pg_stat_activity.query_id behaves exactly as before. Marker emission requires a non-zero query_id (compute_query_id) and track_activities; a WARNING is logged when trace is enabled without them. The markers themselves are no-ops unless capture is at trace, so the only cost when the feature is off is the inline gate already present.

DmitryNFomin · 2026-06-11T15:16:04Z

Series reshaped into the 7-patch deliverable (for review before pgsql-hackers submission).

The 10 development commits were redistributed into 7 submission-grade patches — the cleanup pass and the seqlock-test commit folded into their logical parents, and the trace docs/tests moved into the patches that introduce the code they cover:

0001  add --enable-wait-event-timing flag and wait_event_capture GUC
0002  record per-backend wait event statistics (stats level)      [no DSA]
0003  expose overflow counters and add reset functions
0004  ci: build one task with --enable-wait-event-timing
0005  allocate the per-backend array lazily in DSA                [pure refactor]
0006  add trace level with a per-session ring buffer              [+ seqlock TAP test]
0007  add query-attribution markers to the trace ring

Verification: every intermediate stage was rebuilt and re-verified on the box (timing + stub builds, regress+isolation; the seqlock TAP test at 0006), and the final tree is bit-identical to the previously validated tip (86593a6, full suite 366/0 + 365/0) — so all prior validation transfers. The whole set applies cleanly with git am onto upstream 89eafad297a and reproduces the tree exactly. Commit messages carry no tooling trailers.

The old 10-commit history is preserved at branch wet-series-10commit. A draft cover letter + format-patch files are prepared locally (patches-v1/); nothing has been sent to pgsql-hackers.

DmitryNFomin force-pushed the wet-series branch from 79775b0 to 74bd22c Compare June 9, 2026 20:00

DmitryNFomin added 7 commits June 11, 2026 14:34

DmitryNFomin force-pushed the wet-series branch from 86593a6 to 983f0ec Compare June 11, 2026 15:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wait_event_timing: reviewable patch series (stats / DSA / trace)#2

wait_event_timing: reviewable patch series (stats / DSA / trace)#2
DmitryNFomin wants to merge 7 commits into
wet-series-basefrom
wet-series

DmitryNFomin commented Jun 7, 2026 •

edited

Loading

Uh oh!

DmitryNFomin commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

workload	WET/off vs baseline	WET/stats vs off
`-S` read-only (cached)	−0.57 %	−0.30 %
TPC-B	+0.18 % (8-run re-test)	−0.51 %
`-S` wait-saturated (32 MB SB)	−0.42 %	+0.29 %

Conversation

DmitryNFomin commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

The 7 patches

Reviewer-concern checklist (from the #1 review)

Validation summary

Fixes over the original #1 (surfaced by the split + reviews)

Status & deliverables

Uh oh!

DmitryNFomin commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

DmitryNFomin commented Jun 7, 2026 •

edited

Loading