Extract GPU orchestration into a scheduler #149
Merged
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Reviewed against upstream/main (d963364) as a behavior-neutrality audit: reconstructed the enqueue sequence per stream for every schedule shape (PIPELINED / DRAIN_BEFORE_KICK / DRAIN_AFTER_KICK × passthrough/compressed × lod_active on/off × partial-batch flush) from the old
Verdict: ship (one doc nit, one description nit below; nothing blocking).
What checks out
Findings
Independent validation (L40, sm_89, CUDA 13.1)
This was referenced Jun 13, 2026
nclack added a commit that referenced this pull request Jun 17, 2026
Two small, behavior-neutral cleanups flagged during the orchestration-migration reviews (#143/#148/#149/#151).

- **Remove dead `flush.helpers.h`.** The header had no remaining users after the scheduler work; deleted the file, its two `#include`s, and the CMakeLists entry. Verified no symbol it declared is referenced anywhere.
- **Drop a redundant `POOL_FILLED` record.** On the drain-after-kick (multiarray sync) schedule, `schedule_accumulate_epoch` recorded the pool-filled edge and then `schedule_flush_accumulated` recorded it again on the same stream. Skip the first on that path; the flush-accumulated release is the load-bearing one before its kick. The pipelined and drain-before-kick paths are unchanged.

Validation (L40, sm_89, RelWithDebInfo): `ctest -E "(s3)"` 48/48; zero new warnings.

---------

Co-authored-by: Nathan Clack <nclack@biohub.org>
Step 4 of the migration in `docs/gpu-orchestration.md` (extract the scheduler), following #143 (edge table), #144 (engine unification), and
#148 (resource pools). One module now owns GPU orchestration: stream
creation, pipeline depth, which stages run for the active configuration,
and degraded schedules. Stages are payload only.
What
`src/gpu/schedule.{h,c}` (170 + ~620 lines):
`gpu_streams`), including the drain stream with its deadlock constraint documented at the
declaration (drain-time copies must not share the d2h stream — by drain
time it can hold the next kick's AGG_DONE wait, parked behind the tail
gate until this very drain publishes).
`schedule_select()`:
- `SCHEDULE_PIPELINED`: depth 2; drain a slot only when about to refill it.
- `SCHEDULE_DRAIN_BEFORE_KICK`: depth 1; page-aligned tail state without a working gate cannot be ordered device-side.
- `SCHEDULE_DRAIN_AFTER_KICK`: depth 1; multiarray, whose immediate drains host-order the tail uploads.
The other dependency-shape selections live with it:
`lod_active` (second producer edge into the chunk pool) selected at bind; passthrough-vs-compressed
rides each batch's handoff; the per-batch tail-gate arm
predicate is inside the kick schedule.
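
The selection rules above can be sketched as a small pure function. This is an illustration only: the three schedule names and their depths come from this description, while the `schedule_inputs` struct, its field names, the `sketch_` helpers, and the precedence of the multiarray check over the tail-gate check are all assumptions, not the real `schedule_select()` signature.

```c
#include <assert.h>
#include <stdbool.h>

/* Schedule names are from the PR description. */
enum schedule_kind {
  SCHEDULE_PIPELINED,
  SCHEDULE_DRAIN_BEFORE_KICK,
  SCHEDULE_DRAIN_AFTER_KICK,
};

/* Hypothetical inputs; the real selection inputs are not shown in the PR. */
struct schedule_inputs {
  bool multiarray;        /* immediate drains host-order the tail uploads */
  bool page_aligned_tail; /* page-aligned tail state is carried between batches */
  bool tail_gate_works;   /* a device-side tail gate is available */
};

/* Assumed precedence: multiarray first, then the ungated page-aligned tail. */
static enum schedule_kind sketch_schedule_select(struct schedule_inputs in) {
  if (in.multiarray)
    return SCHEDULE_DRAIN_AFTER_KICK;
  if (in.page_aligned_tail && !in.tail_gate_works)
    return SCHEDULE_DRAIN_BEFORE_KICK;
  return SCHEDULE_PIPELINED;
}

/* Pipeline depth per schedule, as listed above: 2 when pipelined, else 1. */
static int sketch_schedule_depth(enum schedule_kind k) {
  switch (k) {
    case SCHEDULE_PIPELINED:
      return 2;
    case SCHEDULE_DRAIN_BEFORE_KICK:
    case SCHEDULE_DRAIN_AFTER_KICK:
      return 1;
  }
  assert(!"unknown schedule kind");
  return 1;
}
```

Keeping selection as a function of configuration only is one way to make the degraded schedules easy to enumerate in tests.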
`flush_pipeline` + `batch_state` + `pools.current` `sync_flush` collapse into `gpu_scheduler` (3-field `schedule_slot` ×2, fill index, depth).
`stream.flush.{c,h}` is deleted (474 lines); its LOD payload moved to
`lod_emit_partial_append` in the lod stage.
`schedule_compress_agg_kick` / `schedule_d2h_kick` / `schedule_d2h_drain` place every cross-stage acquire, release, tail-gate arm, and publish around the stages' payload
phases (compress_agg: prepare/compress/aggregate/fill_handoff; d2h:
kick/drain_copy/drain_sink/update_metadata) — in the exact original
stream order.
`flush.d2h_deliver.c` now contains zero pool/edge calls.
Four `gpu_pool_at` host-ordered sites became schedule-owned; the remaining ones are ingest-internal (step 5's territory) or documented
host-ordered wiring.
double buffer, not cross-stage shape, and step 5 rebuilds that path.
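
As a rough host-only model of the `gpu_scheduler` shape described above (two slots, a fill index, a depth), the sketch below shows "drain a slot only when about to refill it". The slot fields, the `sketch_` names, and the drain log are hypothetical. No CUDA streams are involved, and the drain-after-kick variant is not modeled. With depth 1 the same code degenerates to draining the single slot before every kick.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_LOG 16

/* Hypothetical 3-field slot; the real schedule_slot fields are not shown. */
struct sketch_slot {
  bool pending;   /* kicked but not yet drained */
  unsigned batch; /* id of the batch occupying the slot */
  unsigned bytes; /* payload size, unused in this sketch */
};

struct sketch_scheduler {
  struct sketch_slot slot[2];
  int fill;                     /* index of the slot the next kick fills */
  int depth;                    /* 2 when pipelined, 1 when degraded */
  unsigned log[SKETCH_MAX_LOG]; /* batches in the order they were drained */
  int nlog;
};

static void sketch_init(struct sketch_scheduler *s, int depth) {
  assert(depth == 1 || depth == 2);
  *s = (struct sketch_scheduler){ .depth = depth };
}

/* Drain one slot if it holds an undrained batch; otherwise do nothing. */
static void sketch_drain_slot(struct sketch_scheduler *s, int i) {
  struct sketch_slot *slot = &s->slot[i];
  if (!slot->pending)
    return;
  assert(s->nlog < SKETCH_MAX_LOG);
  s->log[s->nlog++] = slot->batch;
  slot->pending = false;
}

/* Kick a batch: drain the fill slot's previous occupant, refill, advance. */
static void sketch_kick(struct sketch_scheduler *s, unsigned batch) {
  sketch_drain_slot(s, s->fill);
  s->slot[s->fill] = (struct sketch_slot){ .pending = true, .batch = batch };
  s->fill = (s->fill + 1) % s->depth;
}

/* Final flush: drain whatever is still in flight, oldest slot first. */
static void sketch_drain_pending(struct sketch_scheduler *s) {
  for (int k = 0; k < s->depth; ++k)
    sketch_drain_slot(s, (s->fill + k) % s->depth);
}

static int sketch_in_flight(const struct sketch_scheduler *s) {
  return (int)s->slot[0].pending + (int)s->slot[1].pending;
}
```

With depth 2, two kicks leave two batches in flight and the third kick drains the first batch, which is the overlap the pipelined schedule is after.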
PR #139's output_slot ledger was evaluated and not adopted: its
substance is cap-stacking bookkeeping (cursors, plan/commit split, a
5-state lifecycle this pipeline never distinguishes); at one batch per
slot it collapses to what
`schedule_slot` expresses in 3 fields. Also deleted as verified-dead:
`flush_slot_gpu.batch_epoch_count` (write-only) and
`flush_pipeline.current` (never read).

Three knowingly-accepted deltas, none adding/removing/reordering
primitives on any happy-path stream: all five streams are created together
at init (drain was previously created inside d2h init); the tail
produce-acquire (a pure payload fetch — its consumed edge is a host rule)
moved a few host statements earlier in the drain; and on the
`lod_actual_bytes` error path the drain now records SLOT_DRAINED where the old code skipped it; this is strictly safer, since the next acquire can no longer
park on a never-recorded edge.
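
The third delta can be illustrated with a toy edge that counts records. This is a host-only sketch under assumed names (`sketch_edge`, `sketch_drain`, the `record_on_error` switch), not the real pool or edge API. It only shows why an acquire that waits on SLOT_DRAINED can complete after a failed drain when the edge is recorded on the error path, and cannot when the record is skipped.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy edge: counts how many times it has been recorded. */
struct sketch_edge {
  unsigned recorded;
};

static void sketch_record(struct sketch_edge *e) { e->recorded++; }

/* A real acquire would make a stream wait; this only reports whether a wait
 * for the `need`-th record could ever complete. */
static bool sketch_wait_completes(const struct sketch_edge *e, unsigned need) {
  return e->recorded >= need;
}

/* Hypothetical drain. `record_on_error` false models the old behavior
 * (skip the record on failure); true models the behavior after this PR. */
static int sketch_drain(struct sketch_edge *slot_drained, bool payload_fails,
                        bool record_on_error) {
  if (payload_fails) {
    if (record_on_error)
      sketch_record(slot_drained);
    return -1;
  }
  sketch_record(slot_drained);
  return 0;
}
```

In this model the old error path leaves the count at zero, so the next acquire of the slot would wait forever; recording before returning the error removes that case.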
Validation (L40, sm_89)
`ctest -E "(s3)"`: 47/47 RelWithDebInfo and Debug at the tip; full Rel suite green at every commit; Debug clean of ordering asserts.
`test_cross_validate` ×8 (determinism, zstd round-trip, page-aligned tail-carry);
`test_multiarray_gpu` Rel+Debug; two_streams bench both codecs.
weaken what the pools carry): dropping the swap-time chunk-pool acquire → `gpu_zstd_determinism` fails 8/8; forcing the tail gate off → `gpu_page_aligned_tail_carry` fails 8/8. Restored: green.
256cube/medfmt × zstd/none (one noisy codec-none rep re-tested clean).
Commit guide
- `Add schedule module owning streams`: gpu_streams + depth enum + scheduler state; selection logic.
- `Move flush orchestration into scheduler`: stream.flush.{c,h} deleted; entry points become schedule_accumulate_epoch / schedule_flush_* / schedule_drain_pending.
- `Lift compress acquires into schedule`: kick acquires/tail-arm/releases out of flush.compress_agg.c; stage keeps payload phases.
- `Lift d2h acquires into schedule`: kick/drain acquires, SLOT_DRAINED releases, exactly-once tail publish out of flush.d2h_deliver.c.
- `Move partial-append emit into lod stage`: the LOD payload that lived in stream.flush.c.
- `Trim comments to why-only`: comment pass; also documents `gpu_pool_release_all`'s whole-table scope (review nit from Carry buffer ordering in resource pools #148).