Skip to content

Declare GPU stream ordering in one edge table#143

Merged
nclack merged 8 commits into
acquire-project:mainfrom
nclack:gpu-ordering-table
Jun 12, 2026
Merged

Declare GPU stream ordering in one edge table#143
nclack merged 8 commits into
acquire-project:mainfrom
nclack:gpu-ordering-table

Conversation

@nclack

@nclack nclack commented Jun 12, 2026

Copy link
Copy Markdown
Member

Step 1 of the GPU orchestration migration described in
docs/gpu-orchestration.md (included as the first commit — read it first;
it is the spec for this and the four steps that follow).

Why

#140 and #141 were both missing-ordering bugs that shipped silent
corruption. The rules existed only as cuEventRecord/cuStreamWaitEvent
pairs spread across six files plus undeclared host call-order assumptions;
several events were recorded and never waited (dead edges,
indistinguishable from load-bearing ones). #141's dependency was not even
expressible as an event. This PR makes every cross-stream and host→stream
ordering rule a named, declared, asserted, and metered entry in one table.

What

  • src/gpu/ordering.{h,c} — 13-edge declared table (the full table with
    producer/consumer streams, guarded resource, kind, and backing is in the
    design doc's appendix). Kinds: EVENT, GEN_COUNTER (Fix tail-state upload race on the page-aligned GPU path #142's tail-generation
    gate, relocated behind this API with semantics preserved — including
    graceful degradation and teardown release), HOST_RULE (drain-before-rekick
    and oldest-first delivery, previously written nowhere, now debug-asserted).
    Timing-only events are excluded from the table — metrics can no longer
    masquerade as ordering.
  • Call-site migration — every ordering record/wait/publish in src/gpu/
    and src/multiarray/stream.gpu.c goes through edge_record / edge_wait
    / edge_publish / edge_release_all. Behavior-neutral: same operations,
    same order, same streams.
  • Debug asserts (zero release overhead): consumer-stream-matches-
    declaration; end-of-run dead-edge accounting. Mutation-tested on an L40: a
    deliberately removed wait → dead-edge warning at destroy; a wait on an
    undeclared stream → assert abort.
  • Per-edge stall metrics for the host-poll edges: the bench's opaque
    KickSync 45.7 ms now decomposes (ChunkIndex 41.7 + D2HDone 4.0), and
    the previously unmetered staging busy-wait is visible (StagingFree 1.2 ms).
  • Dead-edge deletion (separate commit, −79/+16): pool_state.ready[2],
    aggregate_slot.ready (only waiter was the abandoned Phase 3 macro-aggregation: scaffolding + cap=16 for codec=none #139 cap-stacking),
    and a never-polled passthrough-path record — each verified dead against
    current main before removal.

Test harnesses fake producers in one line via gpu_ordering_bind /
edge_record / edge_publish — the property the migration plan requires of
every stage boundary.

Evidence (L40, sm_89)

  • ctest -E "(s3)": 46/46; test_cross_validate ×8 reps (covers the
    determinism, zstd round-trip, and page-aligned tail-carry race tests);
    test_compress_agg ×3; zero new build warnings (RelWithDebInfo and Debug).
  • Mutation checks above (temporary patches, never committed).
  • Bench sanity run shows the new stall rows; no throughput change expected
    or observed (release path adds no synchronization).

Commit guide

  • Add GPU orchestration design doc — the migration spec (steps 1–5, end
    state, escape hatch, non-goals).
  • Add gpu ordering edge tableordering.{h,c}: enum, descriptor table,
    runtime state, helpers.
  • Route ordering through edge table — the call-site migration
    (behavior-neutral; suite green at this commit).
  • Add per-edge stall metrics — host-poll blocked-time attribution,
    surfaced in the bench report.
  • Delete verified-dead ordering edges — the three dead edges.
  • Record final edge table in design doc — the as-implemented appendix.

@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nclack

nclack commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Reviewed against upstream/main with an independent wait-site audit plus an L40 validation run (sm_89, RelWithDebInfo + Debug).

Verdict: ship-with-nits.

Behavior-neutrality

Audited every ordering primitive on main (13 cuStreamWaitEvent/cuEventQuery-poll/cuStreamWaitValue64 sites across the six GPU files plus multiarray): each maps 1:1 onto a table edge with the same backing event, recorded/waited on the same stream, same per-fc/per-slot index, same call order inside compress_agg_kick, d2h_deliver_kick/sync_and_deliver, drain_kick_and_swap, and both ingest dispatch paths. Alias resolution (owner_of) is applied on the wait/query side only; records assert alias_of < 0, so STAGING_FREESTAGING_H2D_DONE, POOL_CONSUMEDAGG_DONE, D2H_DONESLOT_DRAINED hit the same physical events as main's t_h2d_end, t_aggregate_end[fc], ready[fc], while waits/metrics stay attributed per named edge (gpu_ordering_destroy reads owner->records but edge[e].waits). Seeding moved from per-stage init records to gpu_ordering_init on the compute stream — same point in create, before the trailing cuStreamSynchronize. make_compress_input's lod_active reproduces main's (enable_multiscale && t_end) truth table via the bound edge event.

#142 gate relocation

  • GEQ threshold math identical: gpu_edge_wait_gen early-returns before kick_seq++ when the counter is absent (matches arm_tail_gate's !d_tail_seq return), waits at pre-increment kick_seq, and the enable conjunction (total_batch_chunks>0 && total_shards>0 && page_size>0, gated on tail_gate_supported) is unchanged.
  • Publish exactly-once on success and failure drains: finish_drain calls gpu_edge_publish on both exits of sync_and_deliver, same h_tail_seq_flag guard as main (F1 preserved).
  • Both release-before-blocking-sync sites survive: tile_stream_gpu_destroy (after auto-flush, before the four stream syncs) and compress_agg_destroy (before cuCtxSynchronize), with gpu_edge_release_all writing kick_seq exactly as compress_agg_release_tail_gate did, signed-GEQ comment included (F2 preserved).
  • stream.flush.c:138-140 fallback condition keys on gpu_ordering_gate_supported with the same alloc/map/probe failure semantics (gate_init degrades to supported=0 and returns 0 on NOT_SUPPORTED); SYNC_MEMOPS attributes untouched; test_compress_agg's ca_ctx_publish_tail is equivalent to main's manual increment. Multiarray never initializes the gate on main or here, so wait_gen/publish stay no-ops there.

Dead-edge deletions

Independently confirmed on main: pool_state.ready[2] (records in ingest/partial-append/seed only), aggregate_slot.ready (create/record/destroy only — the cap-stacking waiter died with #139), and the passthrough-path h_chunk_index_ready record (only poll is in the compressed drain branch) have zero wait/query consumers anywhere, multiarray included. Deletions are correctly isolated in their own commit.

Release overhead

All assert/counter machinery is #ifndef NDEBUG; release adds one cross-TU call per record/wait around the driver call, and the host poll loops are byte-for-byte the old 50 µs cuEventQuery loops plus a single clock read on first not-ready. The DEINITIALIZED-at-teardown swallow survives in gpu_edge_host_wait with the same log-and-break semantics.

Findings (minor)

  1. src/gpu/ordering.c:287-290gpu_edge_wait on an unbound edge returns 0 (success) in release builds. Every current call site guards (lod_active, gpu_ordering_event != NULL), but this turns a future unguarded misuse into a silently dropped ordering edge — the exact failure class the table exists to prevent. Returning 1 (or logging) in release would fail loudly instead.
  2. src/gpu/ordering.c:341-343gpu_edge_host_wait accumulates a 0 ms sample even when the event was already complete, so the StagingFree/ChunkIndex/D2HDone rows count polls, not stalls (avg dilutes, best pins at 0). If intentional ("fraction of polls that block" is readable from it), fine; worth a comment either way.
  3. src/types.stream.h:49-51 — the kick_sync_stall comment still references ready[fc]/h_chunk_index_ready by their deleted field names.
  4. Debug dead-edge warnings fire by design for POOL_CONSUMED (sync/multiarray) and CHUNK_INDEX_READY (passthrough) — doc acknowledges; per-config expected-edge sets would quiet this in later steps.

Doc appendix table matches the enum/DESC exactly (producers, aliases, seeded flags, instancing); HOST_RULE asserts are present at both kick sites (flush_kick_batch, drain_kick_and_swap) and in drain_fc. Commit messages and comment style conform.

Validation (L40, CUDA 13.1 build / 13.0 driver)

  • RelWithDebInfo: ctest -E "(s3)" 46/46; test_cross_validate 8/8 reps.
  • Debug: full suite 46/46, no ordering asserts, no host-rule or stream-declaration failures. Direct Debug runs confirm the dead-edge accounting behaves as documented: agg_done waited in the real pipeline, its alias pool_consumed flagged idle only on never-reused final batches / harness configs.
  • bench_stream_256cube_single --codec zstd: new stall rows render (ChunkIndex ~42 ms avg tracks KickSync, closing the unmetered poll gap). Throughput PR vs main, alternating same-node runs, 64-frame x5: PR 4.53/4.33/4.52/4.34/4.57 vs main 4.63/4.59/4.40/4.45/4.61 GiB/s — means 4.46 vs 4.54 (-1.7%), inside the ±2.7% run-to-run spread with overlapping ranges. No regression beyond noise.
  • Zero new build warnings (only the pre-existing bench_util/-Wmaybe-uninitialized and test_zarr_fs_sink/-Wformat-truncation ones).

@nclack nclack merged commit 5ff0654 into acquire-project:main Jun 12, 2026
8 checks passed
@nclack nclack deleted the gpu-ordering-table branch June 12, 2026 02:21
nclack added a commit that referenced this pull request Jun 17, 2026
Two small, behavior-neutral cleanups flagged during the
orchestration-migration reviews (#143/#148/#149/#151).

- **Remove dead `flush.helpers.h`.** The header had no remaining users
after the scheduler work; deleted the file, its two `#include`s, and the
CMakeLists entry. Verified no symbol it declared is referenced anywhere.
- **Drop a redundant `POOL_FILLED` record.** On the drain-after-kick
(multiarray sync) schedule, `schedule_accumulate_epoch` recorded the
pool-filled edge and then `schedule_flush_accumulated` recorded it again
on the same stream. Skip the first on that path; the flush-accumulated
release is the load-bearing one before its kick. The pipelined and
drain-before-kick paths are unchanged.

Validation (L40, sm_89, RelWithDebInfo): `ctest -E "(s3)"` 48/48; zero
new warnings.

---------

Co-authored-by: Nathan Clack <nclack@biohub.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant