wait_event_timing: reviewable patch series (stats / DSA / trace)#2
wait_event_timing: reviewable patch series (stats / DSA / trace)#2DmitryNFomin wants to merge 7 commits into
Conversation
…_capture GUC Introduce the compile-time option --enable-wait-event-timing (meson: -Dwait_event_timing=true) defining USE_WAIT_EVENT_TIMING, and the runtime GUC wait_event_capture (PGC_SUSET, enum off|stats, default off). This commit is scaffolding only: it wires the flag through both build systems and registers the GUC with check/assign hooks, but adds no instrumentation yet -- later commits in this series add the recording hot path, the SQL surface, and the trace level. In builds compiled without --enable-wait-event-timing, the GUC's check hook rejects any value other than off (downgrading to off with a warning for non-interactive sources), so the variable exists uniformly for tooling but cannot be enabled.
…vel)
Implement wait_event_capture = stats. When enabled, every transition
through pgstat_report_wait_start()/pgstat_report_wait_end() records the
wait duration and accumulates per-(backend, event) statistics -- count,
total and maximum duration, and a 32-bucket log2 duration histogram -- in
shared memory.
Storage is a per-backend slot array in the main shared memory segment,
sized at postmaster start from wait_event_timing_max_tranches. Because
each slot lives for the entire life of its backend, the hot path needs no
lazy attach and no teardown gating: pgstat_report_wait_start_timing()
records a timestamp and the current event, and
pgstat_report_wait_end_timing() computes the duration and accumulates.
Each backend writes only to its own slot, so no locking is required and
the SRF reader is lock-free.
The inline gate in pgstat_report_wait_start()/_end() is a single load of
wait_event_capture plus a branch, with the bodies kept out-of-line so the
many inlined call sites stay compact; while capture is off the hot path
adds only that branch.
Non-LWLock wait events map to a flat array via a class table generated
from wait_event_names.txt by generate-wait_event_types.pl. LWLock events,
whose tranche ids are unbounded, use a per-backend open-addressing hash
capped by wait_event_timing_max_tranches (PGC_POSTMASTER, default 192).
SQL surface:
- pg_stat_get_wait_event_timing(pid) and the pg_stat_wait_event_timing
view, one row per backend per event with a non-zero count;
- pg_wait_event_timing_histogram_buckets, naming the 32 histogram bins.
Builds compiled without --enable-wait-event-timing keep the GUC (its check
hook rejects any non-off value) and empty-result SQL stubs, so tooling
sees a uniform surface.
A later commit in the series exposes the per-backend overflow counters
(maintained here) and adds the reset functions.
Surface the per-backend truncation counters maintained by the recording path, and add the ability to reset wait-event-timing statistics. pg_stat_get_wait_event_timing_overflow(pid) and the pg_stat_wait_event_timing_overflow view report, per backend, lwlock_overflow_count (LWLock waits dropped because the per-backend tranche hash was full), flat_overflow_count (events whose class index was out of range), and reset_count. Resets use a lock-free request/response so the hot path stays single writer: each slot carries an atomic reset_generation, bumped by the resetter; the owning backend compares it against a backend-local last-seen value at its next wait_end and clears its own counters, incrementing reset_count. pg_stat_reset_wait_event_timing(pid) resets one backend -- synchronously when it targets the caller's own session (any user; pid defaults to NULL), or asynchronously for another backend (requiring pg_signal_backend, matching pg_stat_reset_backend_stats). pg_stat_reset_wait_event_timing_all() resets every backend and is superuser-only. Builds without --enable-wait-event-timing keep empty-result/feature-not- supported stubs for the new functions.
Add --enable-wait-event-timing to the "Linux - Autoconf" GitHub Actions task so the wait-event-timing build path -- including the expected output src/test/regress/expected/wait_event_timing.out -- is exercised on every push. That task already runs check-world under the undefined/alignment sanitizers with a small segment size, giving the timing code meaningful coverage. Every other CI task keeps building without the flag, so the stub path and its alternate output wait_event_timing_1.out remain covered as well.
Convert the per-backend wait-event-timing slot array from eager
main-segment shared memory to a lazily-allocated DSA region. Only a
small control struct (a DSA handle plus an LWLock) now lives in fixed
shared memory; the large array -- ~30 KB per backend at the default
wait_event_timing_max_tranches -- is allocated the first time any backend
in the cluster sets wait_event_capture to a non-off value. A build that
compiles the feature in but never enables it therefore pays no
per-backend memory, and a SELECT against the views on a cluster that
never enabled capture does not even create the DSA.
This is a pure refactor: the SQL surface and observable behavior are
unchanged, and the existing regression tests pass without modification.
Backends attach to the array on their first wait event under capture, in
pgstat_wait_event_timing_lazy_attach(). Because that runs from the
wait-event hot path, it carries the guards that make DSA work safe there:
- skip while CritSectionCount > 0 (dsa_attach -> MemoryContextAlloc
asserts inside a critical section);
- skip while MyProc->lwWaiting != LW_WS_NOT_WAITING (a nested
LWLockQueueSelf on the control lock would PANIC);
- an in_attach re-entrancy guard, because dsa_create / dsa_allocate /
the control-lock acquisition can themselves emit LWLock wait events
that re-enter the hot path;
- a before_shmem_exit gate (wait_event_timing_writes_disabled) so the
hot path stops touching DSA once proc_exit begins tearing the
mappings down -- shmem_exit runs all before_shmem_exit callbacks
before dsm_backend_shutdown, so the gate is up before any unmap.
A new LWLock tranche, WaitEventTimingDSA, names the control lock.
Add the third capture level, wait_event_capture = trace. On top of the
STATS aggregates, every completed wait is pushed into a per-session ring
buffer in DSA -- one record per wait -- allocated lazily on first use so
only sessions that enable trace pay the (default 4 MB) per-ring cost.
The ring is exposed two ways:
- pg_get_backend_wait_event_trace() / the pg_backend_wait_event_trace
view read the calling backend's own ring;
- pg_get_wait_event_trace(procnumber) reads any backend's ring
cross-process, including rings left behind by exited backends.
A backend's ring is not freed when it exits: the slot transitions to
ORPHANED and the ring stays in DSA so cross-backend consumers can read
the dying backend's final waits (important for short-lived parallel
workers). An orphan is reclaimed when a new backend reuses the same
procNumber (clear-on-init) or by pg_stat_clear_orphaned_wait_event_rings()
(execution revoked from PUBLIC; delegable with GRANT). Slot transitions
(FREE/OWNED/ORPHANED) are serialised by the WaitEventTraceControl lock;
the ring size is fixed cluster-wide at server start by
wait_event_trace_ring_size (power of two, default 4 MB).
The single-writer hot path writes each record under a seqlock (odd seq
while writing, even when complete). Cross-backend readers use a
POSITION-ENCODED IDENTITY seqlock -- a record at ring index i is valid
only if its seq equals the writer's complete value for that exact position
-- which rejects stale previous-cycle reads that a parity-only seqlock
would accept under cross-process visibility lag. A TAP test drives an
injection point between the writer's write_pos advance and its seq stamp
to prove exactly that: with the ring wrapped and the writer wedged
mid-record, a cross-backend read returns ring_size - 1 records, skipping
the in-flight slot whose stale prior-cycle record a parity-only check
would have emitted.
The trace ring writer carries the same teardown discipline as the stats
path: a wait_event_trace_writes_disabled gate (raised around slot
transitions and at proc_exit) plus in_attach/in_release re-entrancy guards
keep the DSA-internal LWLock waits that those operations emit from
recursing into a ring that is being freed or orphaned. A further
re-entrancy guard on the record writer itself keeps wait events emitted
mid-record-write (reachable only via the injection point; the write is
plain stores in production) from recursing into the writer.
Privileges: both trace SRFs and the session-local view are REVOKE'd from
PUBLIC and GRANT'ed to pg_read_all_stats (reading a session's trace
exposes its wait sequence).
Query-attribution markers and their executor/protocol hooks are added in
the next commit; this commit records wait events only.
At the trace level, interleave query-boundary markers with the wait
events in the per-session ring so a reader can tell which query each wait
belongs to. Two marker families are emitted:
- ExecStart/ExecEnd bracket every executor run
(ExecutorStart/ExecutorEnd), the primary attribution signal --
every executable statement, including those in parallel workers and
pipelined extended-protocol messages, is bracketed;
- QueryStart/QueryEnd fire at top-level query_id transitions
(pgstat_report_query_id) and at the transition to idle
(send_ready_for_query), providing the inter-statement boundaries the
executor markers cannot -- e.g. the ClientRead wait between
statements.
In the pipelined extended protocol a Parse/Bind/Execute can arrive while
the previous query's id is still set and the session is still RUNNING
(no Sync->idle in between), so the prior id is flushed with force=true at
those message boundaries to fire its QUERY_END before the new query
starts. That flush is gated on wait_event_capture = trace, so when trace
is off it is a no-op and pg_stat_activity.query_id behaves exactly as
before.
Marker emission requires a non-zero query_id (compute_query_id) and
track_activities; a WARNING is logged when trace is enabled without them.
The markers themselves are no-ops unless capture is at trace, so the only
cost when the feature is off is the inline gate already present.
|
Series reshaped into the 7-patch deliverable (for review before pgsql-hackers submission). The 10 development commits were redistributed into 7 submission-grade patches — the cleanup pass and the seqlock-test commit folded into their logical parents, and the trace docs/tests moved into the patches that introduce the code they cover: Verification: every intermediate stage was rebuilt and re-verified on the box (timing + stub builds, regress+isolation; the seqlock TAP test at 0006), and the final tree is bit-identical to the previously validated tip ( The old 10-commit history is preserved at branch |
Purpose
This PR holds the submission-ready 7-patch series for the wait_event_timing feature, staged here for Andrey's review before it goes to pgsql-hackers. It is the reworked form of the original 8k-line monolith (#1), restructured per Andrey Borodin's review: "send them all together, but break into chewable parts ... defer DSA refactoring ... so the committer can see value from a first commit without venturing into 8k lines."
Base: upstream master
89eafad297a(branchwet-series-base). Each commit builds and passes the full test suite on its own (bisectable); docs and tests travel with the code they cover; both build systems are updated in lockstep.The 7 patches
5041767--enable-wait-event-timing(autoconf + meson) and thewait_event_captureGUC (off|stats); pure scaffolding, no behavior4bcad6dpg_stat_wait_event_timing+ histogram-buckets views; docs + tests966c6ac53a714cf1ee4156920aca983f0ecGroups for the cover letter: patches 1–4 = stats (independently committable), patch 5 = storage refactor, patches 6–7 = trace.
Reviewer-concern checklist (from the #1 review)
unlikely()on the inline gate-O2, ±0.09 % TPS. Gates carry "no unlikely()" comments with the rationaleValidation summary
Key facts (CCX43, 16 dedicated vCPU, gcc 13):
Correctness: full
meson test(regression + isolation + recovery/subscription/pg_upgrade TAP): 366 Ok / 0 Fail timing build, 365 Ok / 0 Fail stub build (seqlock TAP test correctly skips in stub). Every intermediate patch stage rebuilt and re-verified (builds + regress; seqlock test at 0006). UBSan-clean.Stress (cassert): 300-backend churn + parallel workers attaching in critical sections + concurrent TPC-B, capture forced cluster-wide at stats and at trace (incl. 20× trace↔off toggles, 102 orphan rings reclaimed) — zero asserts/panics.
Concurrency: two-session cross-backend reset serviced asynchronously; post-mortem orphan-ring read after owner exit + explicit reclaim.
Performance (pgbench, 5 interleaved runs/cell):
-Sread-only (cached)-Swait-saturated (32 MB SB)An initial TPC-B −3 % was proven a binary-layout artifact with a gate-compiled-out control (the gate-removed build measured slower than the gate-included one). Off-mode cost is below the measurement floor.
Mechanical integrity: the 7-patch set applies cleanly with
git amonto89eafad297aand reproduces a tree bit-identical to the fully-validated development tip. Adversarial reviews passed for each risky patch and for the whole series.Fixes over the original #1 (surfaced by the split + reviews)
pg_stat_reset_wait_event_timingdocumentedDEFAULT NULLbut lackedproargdefaults— no-arg call would error.SELECTon a never-enabled cluster created (and pinned) an unused DSA segment.query_idflush perturbedpg_stat_activityeven with the feature off — now gated on trace.Status & deliverables
format-patchfiles + filled cover letter:patches-v1/(local; nothing sent to pgsql-hackers yet).wet-series-10commit.