Declare GPU stream ordering in one edge table by nclack · Pull Request #143 · acquire-project/chucky

nclack · 2026-06-12T01:23:35Z

Step 1 of the GPU orchestration migration described in
docs/gpu-orchestration.md (included as the first commit — read it first;
it is the spec for this and the four steps that follow).

Why

#140 and #141 were both missing-ordering bugs that shipped silent
corruption. The rules existed only as cuEventRecord/cuStreamWaitEvent
pairs spread across six files plus undeclared host call-order assumptions;
several events were recorded and never waited (dead edges,
indistinguishable from load-bearing ones). #141's dependency was not even
expressible as an event. This PR makes every cross-stream and host→stream
ordering rule a named, declared, asserted, and metered entry in one table.

What

src/gpu/ordering.{h,c} — 13-edge declared table (the full table with
producer/consumer streams, guarded resource, kind, and backing is in the
design doc's appendix). Kinds: EVENT, GEN_COUNTER (Fix tail-state upload race on the page-aligned GPU path #142's tail-generation
gate, relocated behind this API with semantics preserved — including
graceful degradation and teardown release), HOST_RULE (drain-before-rekick
and oldest-first delivery, previously written nowhere, now debug-asserted).
Timing-only events are excluded from the table — metrics can no longer
masquerade as ordering.
Call-site migration — every ordering record/wait/publish in src/gpu/
and src/multiarray/stream.gpu.c goes through edge_record / edge_wait
/ edge_publish / edge_release_all. Behavior-neutral: same operations,
same order, same streams.
Debug asserts (zero release overhead): consumer-stream-matches-
declaration; end-of-run dead-edge accounting. Mutation-tested on an L40: a
deliberately removed wait → dead-edge warning at destroy; a wait on an
undeclared stream → assert abort.
Per-edge stall metrics for the host-poll edges: the bench's opaque
KickSync 45.7 ms now decomposes (ChunkIndex 41.7 + D2HDone 4.0), and
the previously unmetered staging busy-wait is visible (StagingFree 1.2 ms).
Dead-edge deletion (separate commit, −79/+16): pool_state.ready[2],
aggregate_slot.ready (only waiter was the abandoned Phase 3 macro-aggregation: scaffolding + cap=16 for codec=none #139 cap-stacking),
and a never-polled passthrough-path record — each verified dead against
current main before removal.

Test harnesses fake producers in one line via gpu_ordering_bind /
edge_record / edge_publish — the property the migration plan requires of
every stage boundary.

Evidence (L40, sm_89)

ctest -E "(s3)": 46/46; test_cross_validate ×8 reps (covers the
determinism, zstd round-trip, and page-aligned tail-carry race tests);
test_compress_agg ×3; zero new build warnings (RelWithDebInfo and Debug).
Mutation checks above (temporary patches, never committed).
Bench sanity run shows the new stall rows; no throughput change expected
or observed (release path adds no synchronization).

Commit guide

Add GPU orchestration design doc — the migration spec (steps 1–5, end
state, escape hatch, non-goals).
Add gpu ordering edge table — ordering.{h,c}: enum, descriptor table,
runtime state, helpers.
Route ordering through edge table — the call-site migration
(behavior-neutral; suite green at this commit).
Add per-edge stall metrics — host-poll blocked-time attribution,
surfaced in the bench report.
Delete verified-dead ordering edges — the three dead edges.
Record final edge table in design doc — the as-implemented appendix.

codecov · 2026-06-12T01:29:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

nclack · 2026-06-12T01:51:05Z

Reviewed against upstream/main with an independent wait-site audit plus an L40 validation run (sm_89, RelWithDebInfo + Debug).

Verdict: ship-with-nits.

Behavior-neutrality

Audited every ordering primitive on main (13 cuStreamWaitEvent/cuEventQuery-poll/cuStreamWaitValue64 sites across the six GPU files plus multiarray): each maps 1:1 onto a table edge with the same backing event, recorded/waited on the same stream, same per-fc/per-slot index, same call order inside compress_agg_kick, d2h_deliver_kick/sync_and_deliver, drain_kick_and_swap, and both ingest dispatch paths. Alias resolution (owner_of) is applied on the wait/query side only; records assert alias_of < 0, so STAGING_FREE→STAGING_H2D_DONE, POOL_CONSUMED→AGG_DONE, D2H_DONE→SLOT_DRAINED hit the same physical events as main's t_h2d_end, t_aggregate_end[fc], ready[fc], while waits/metrics stay attributed per named edge (gpu_ordering_destroy reads owner->records but edge[e].waits). Seeding moved from per-stage init records to gpu_ordering_init on the compute stream — same point in create, before the trailing cuStreamSynchronize. make_compress_input's lod_active reproduces main's (enable_multiscale && t_end) truth table via the bound edge event.

#142 gate relocation

GEQ threshold math identical: gpu_edge_wait_gen early-returns before kick_seq++ when the counter is absent (matches arm_tail_gate's !d_tail_seq return), waits at pre-increment kick_seq, and the enable conjunction (total_batch_chunks>0 && total_shards>0 && page_size>0, gated on tail_gate_supported) is unchanged.
Publish exactly-once on success and failure drains: finish_drain calls gpu_edge_publish on both exits of sync_and_deliver, same h_tail_seq_flag guard as main (F1 preserved).
Both release-before-blocking-sync sites survive: tile_stream_gpu_destroy (after auto-flush, before the four stream syncs) and compress_agg_destroy (before cuCtxSynchronize), with gpu_edge_release_all writing kick_seq exactly as compress_agg_release_tail_gate did, signed-GEQ comment included (F2 preserved).
stream.flush.c:138-140 fallback condition keys on gpu_ordering_gate_supported with the same alloc/map/probe failure semantics (gate_init degrades to supported=0 and returns 0 on NOT_SUPPORTED); SYNC_MEMOPS attributes untouched; test_compress_agg's ca_ctx_publish_tail is equivalent to main's manual increment. Multiarray never initializes the gate on main or here, so wait_gen/publish stay no-ops there.

Dead-edge deletions

Independently confirmed on main: pool_state.ready[2] (records in ingest/partial-append/seed only), aggregate_slot.ready (create/record/destroy only — the cap-stacking waiter died with #139), and the passthrough-path h_chunk_index_ready record (only poll is in the compressed drain branch) have zero wait/query consumers anywhere, multiarray included. Deletions are correctly isolated in their own commit.

Release overhead

All assert/counter machinery is #ifndef NDEBUG; release adds one cross-TU call per record/wait around the driver call, and the host poll loops are byte-for-byte the old 50 µs cuEventQuery loops plus a single clock read on first not-ready. The DEINITIALIZED-at-teardown swallow survives in gpu_edge_host_wait with the same log-and-break semantics.

Findings (minor)

src/gpu/ordering.c:287-290 — gpu_edge_wait on an unbound edge returns 0 (success) in release builds. Every current call site guards (lod_active, gpu_ordering_event != NULL), but this turns a future unguarded misuse into a silently dropped ordering edge — the exact failure class the table exists to prevent. Returning 1 (or logging) in release would fail loudly instead.
src/gpu/ordering.c:341-343 — gpu_edge_host_wait accumulates a 0 ms sample even when the event was already complete, so the StagingFree/ChunkIndex/D2HDone rows count polls, not stalls (avg dilutes, best pins at 0). If intentional ("fraction of polls that block" is readable from it), fine; worth a comment either way.
src/types.stream.h:49-51 — the kick_sync_stall comment still references ready[fc]/h_chunk_index_ready by their deleted field names.
Debug dead-edge warnings fire by design for POOL_CONSUMED (sync/multiarray) and CHUNK_INDEX_READY (passthrough) — doc acknowledges; per-config expected-edge sets would quiet this in later steps.

Doc appendix table matches the enum/DESC exactly (producers, aliases, seeded flags, instancing); HOST_RULE asserts are present at both kick sites (flush_kick_batch, drain_kick_and_swap) and in drain_fc. Commit messages and comment style conform.

Validation (L40, CUDA 13.1 build / 13.0 driver)

RelWithDebInfo: ctest -E "(s3)" 46/46; test_cross_validate 8/8 reps.
Debug: full suite 46/46, no ordering asserts, no host-rule or stream-declaration failures. Direct Debug runs confirm the dead-edge accounting behaves as documented: agg_done waited in the real pipeline, its alias pool_consumed flagged idle only on never-reused final batches / harness configs.
bench_stream_256cube_single --codec zstd: new stall rows render (ChunkIndex ~42 ms avg tracks KickSync, closing the unmetered poll gap). Throughput PR vs main, alternating same-node runs, 64-frame x5: PR 4.53/4.33/4.52/4.34/4.57 vs main 4.63/4.59/4.40/4.45/4.61 GiB/s — means 4.46 vs 4.54 (-1.7%), inside the ±2.7% run-to-run spread with overlapping ranges. No regression beyond noise.
Zero new build warnings (only the pre-existing bench_util/-Wmaybe-uninitialized and test_zarr_fs_sink/-Wformat-truncation ones).

Two small, behavior-neutral cleanups flagged during the orchestration-migration reviews (#143/#148/#149/#151). - **Remove dead `flush.helpers.h`.** The header had no remaining users after the scheduler work; deleted the file, its two `#include`s, and the CMakeLists entry. Verified no symbol it declared is referenced anywhere. - **Drop a redundant `POOL_FILLED` record.** On the drain-after-kick (multiarray sync) schedule, `schedule_accumulate_epoch` recorded the pool-filled edge and then `schedule_flush_accumulated` recorded it again on the same stream. Skip the first on that path; the flush-accumulated release is the load-bearing one before its kick. The pipelined and drain-before-kick paths are unchanged. Validation (L40, sm_89, RelWithDebInfo): `ctest -E "(s3)"` 48/48; zero new warnings. --------- Co-authored-by: Nathan Clack <nclack@biohub.org>

Nathan Clack added 6 commits June 12, 2026 01:18

Add GPU orchestration design doc

cf0b518

Add gpu ordering edge table

ac79ed1

Route ordering through edge table

3d19eee

Add per-edge stall metrics

423e1e6

Delete verified-dead ordering edges

987a3ab

Record final edge table in design doc

41697bc

Nathan Clack added 2 commits June 12, 2026 01:58

Harden edge wait and stall sampling

d6f329c

Fix stale event names in comment

57da349

nclack merged commit 5ff0654 into acquire-project:main Jun 12, 2026
8 checks passed

nclack deleted the gpu-ordering-table branch June 12, 2026 02:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Declare GPU stream ordering in one edge table#143

Declare GPU stream ordering in one edge table#143
nclack merged 8 commits into
acquire-project:mainfrom
nclack:gpu-ordering-table

nclack commented Jun 12, 2026

Uh oh!

codecov Bot commented Jun 12, 2026

Uh oh!

nclack commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nclack commented Jun 12, 2026

Why

What

Evidence (L40, sm_89)

Commit guide

Uh oh!

codecov Bot commented Jun 12, 2026

Codecov Report

Uh oh!

nclack commented Jun 12, 2026

Behavior-neutrality

#142 gate relocation

Dead-edge deletions

Release overhead

Findings (minor)

Validation (L40, CUDA 13.1 build / 13.0 driver)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant