Declare GPU stream ordering in one edge table#143
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 🚀 New features to boost your workflow:
|
|
Reviewed against upstream/main with an independent wait-site audit plus an L40 validation run (sm_89, RelWithDebInfo + Debug). Verdict: ship-with-nits. Behavior-neutralityAudited every ordering primitive on main (13 #142 gate relocation
Dead-edge deletionsIndependently confirmed on main: Release overheadAll assert/counter machinery is Findings (minor)
Doc appendix table matches the enum/DESC exactly (producers, aliases, seeded flags, instancing); HOST_RULE asserts are present at both kick sites ( Validation (L40, CUDA 13.1 build / 13.0 driver)
|
Two small, behavior-neutral cleanups flagged during the orchestration-migration reviews (#143/#148/#149/#151). - **Remove dead `flush.helpers.h`.** The header had no remaining users after the scheduler work; deleted the file, its two `#include`s, and the CMakeLists entry. Verified no symbol it declared is referenced anywhere. - **Drop a redundant `POOL_FILLED` record.** On the drain-after-kick (multiarray sync) schedule, `schedule_accumulate_epoch` recorded the pool-filled edge and then `schedule_flush_accumulated` recorded it again on the same stream. Skip the first on that path; the flush-accumulated release is the load-bearing one before its kick. The pipelined and drain-before-kick paths are unchanged. Validation (L40, sm_89, RelWithDebInfo): `ctest -E "(s3)"` 48/48; zero new warnings. --------- Co-authored-by: Nathan Clack <nclack@biohub.org>
Step 1 of the GPU orchestration migration described in
docs/gpu-orchestration.md(included as the first commit — read it first;it is the spec for this and the four steps that follow).
Why
#140 and #141 were both missing-ordering bugs that shipped silent
corruption. The rules existed only as
cuEventRecord/cuStreamWaitEventpairs spread across six files plus undeclared host call-order assumptions;
several events were recorded and never waited (dead edges,
indistinguishable from load-bearing ones). #141's dependency was not even
expressible as an event. This PR makes every cross-stream and host→stream
ordering rule a named, declared, asserted, and metered entry in one table.
What
src/gpu/ordering.{h,c}— 13-edge declared table (the full table withproducer/consumer streams, guarded resource, kind, and backing is in the
design doc's appendix). Kinds: EVENT, GEN_COUNTER (Fix tail-state upload race on the page-aligned GPU path #142's tail-generation
gate, relocated behind this API with semantics preserved — including
graceful degradation and teardown release), HOST_RULE (drain-before-rekick
and oldest-first delivery, previously written nowhere, now debug-asserted).
Timing-only events are excluded from the table — metrics can no longer
masquerade as ordering.
src/gpu/and
src/multiarray/stream.gpu.cgoes throughedge_record/edge_wait/
edge_publish/edge_release_all. Behavior-neutral: same operations,same order, same streams.
declaration; end-of-run dead-edge accounting. Mutation-tested on an L40: a
deliberately removed wait → dead-edge warning at destroy; a wait on an
undeclared stream → assert abort.
KickSync 45.7 msnow decomposes (ChunkIndex 41.7 + D2HDone 4.0), andthe previously unmetered staging busy-wait is visible (
StagingFree 1.2 ms).pool_state.ready[2],aggregate_slot.ready(only waiter was the abandoned Phase 3 macro-aggregation: scaffolding + cap=16 for codec=none #139 cap-stacking),and a never-polled passthrough-path record — each verified dead against
current main before removal.
Test harnesses fake producers in one line via
gpu_ordering_bind/edge_record/edge_publish— the property the migration plan requires ofevery stage boundary.
Evidence (L40, sm_89)
ctest -E "(s3)": 46/46;test_cross_validate×8 reps (covers thedeterminism, zstd round-trip, and page-aligned tail-carry race tests);
test_compress_agg×3; zero new build warnings (RelWithDebInfo and Debug).or observed (release path adds no synchronization).
Commit guide
Add GPU orchestration design doc— the migration spec (steps 1–5, endstate, escape hatch, non-goals).
Add gpu ordering edge table—ordering.{h,c}: enum, descriptor table,runtime state, helpers.
Route ordering through edge table— the call-site migration(behavior-neutral; suite green at this commit).
Add per-edge stall metrics— host-poll blocked-time attribution,surfaced in the bench report.
Delete verified-dead ordering edges— the three dead edges.Record final edge table in design doc— the as-implemented appendix.