Extract GPU orchestration into a scheduler #149

Merged
nclack merged 6 commits into acquire-project:main from nclack:gpu-scheduler on Jun 12, 2026
@nclack nclack commented Jun 12, 2026

Member

Step 4 of the migration in docs/gpu-orchestration.md (extract the
scheduler), following #143 (edge table), #144 (engine unification), and
#148 (resource pools). One module now owns GPU orchestration: stream
creation, pipeline depth, which stages run for the active configuration,
and degraded schedules. Stages are payload only.

What

src/gpu/schedule.{h,c} (170 + ~620 lines):

  • All five CUDA streams created in one place (gpu_streams), including
    the drain stream with its deadlock constraint documented at the
    declaration (drain-time copies must not share the d2h stream — by drain
    time it can hold the next kick's AGG_DONE wait, parked behind the tail
    gate until this very drain publishes).
  • Pipeline depth as an enum, selected once per array by
    schedule_select():
    • SCHEDULE_PIPELINED — depth 2; drain a slot only when about to refill
      it.
    • SCHEDULE_DRAIN_BEFORE_KICK — depth 1; page-aligned tail state without
      a working gate cannot be ordered device-side.
    • SCHEDULE_DRAIN_AFTER_KICK — depth 1; multiarray, whose immediate
      drains host-order the tail uploads.
    The other dependency-shape selections live with it: lod_active (second
    producer edge into the chunk pool) is selected at bind;
    passthrough-vs-compressed rides each batch's handoff; the per-batch
    tail-gate arm predicate is inside the kick schedule.
  • Slot bookkeeping: flush_pipeline + batch_state + pools.current +
    sync_flush collapse into gpu_scheduler (3-field schedule_slot ×2,
    fill index, depth). stream.flush.{c,h} is deleted (474 lines); its
    LOD payload moved to lod_emit_partial_append in the lod stage.
  • Acquire/release lifted out of stages. schedule_compress_agg_kick /
    schedule_d2h_kick / schedule_d2h_drain place every cross-stage
    acquire, release, tail-gate arm, and publish around the stages' payload
    phases (compress_agg: prepare/compress/aggregate/fill_handoff; d2h:
    kick/drain_copy/drain_sink/update_metadata) — in the exact original
    stream order. flush.d2h_deliver.c now contains zero pool/edge calls.
    Four gpu_pool_at host-ordered sites became schedule-owned; the
    remaining ones are ingest-internal (step 5's territory) or documented
    host-ordered wiring.
  • Ingest staging is deliberately untouched: its acquires order its own
    double buffer, not cross-stage shape, and step 5 rebuilds that path.
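
To make the slot bookkeeping concrete, here is a minimal host-only sketch of the depth-1 vs depth-2 reuse rule ("drain a slot only when about to refill it"). It is not the code in schedule.c: every `sketch_*` name and the three slot fields are assumptions for illustration (the PR states only that a slot has 3 fields), and it models slot reuse order only, not stream or event ordering.

```c
#include <assert.h>
#include <string.h>

enum { SKETCH_MAX_DEPTH = 2, SKETCH_LOG_CAP = 16 };

/* Assumed 3-field slot; the PR gives the field count, not the fields. */
struct sketch_slot {
  int kicked;   /* 1 while a batch is in flight in this slot */
  int batch_id; /* which batch occupies the slot */
  int n_epochs; /* how many epochs that batch carried */
};

struct sketch_scheduler {
  struct sketch_slot slots[SKETCH_MAX_DEPTH];
  int fill;                      /* slot the next kick will use */
  int depth;                     /* 2 = pipelined, 1 = depth-1 schedules */
  int delivered[SKETCH_LOG_CAP]; /* batch ids in delivery order */
  int n_delivered;
};

static void sketch_init(struct sketch_scheduler* s, int depth)
{
  memset(s, 0, sizeof(*s));
  s->depth = depth;
}

/* Draining a slot that was never kicked is a no-op. */
static void sketch_drain_slot(struct sketch_scheduler* s, int i)
{
  struct sketch_slot* slot = &s->slots[i];
  if (!slot->kicked)
    return;
  assert(s->n_delivered < SKETCH_LOG_CAP);
  s->delivered[s->n_delivered++] = slot->batch_id;
  slot->kicked = 0;
}

/* Drain the fill slot only when about to refill it, then kick and advance. */
static void sketch_kick_batch(struct sketch_scheduler* s, int batch_id, int n_epochs)
{
  struct sketch_slot* slot = &s->slots[s->fill];
  sketch_drain_slot(s, s->fill);
  assert(!slot->kicked); /* drain-before-rekick host rule */
  slot->kicked = 1;
  slot->batch_id = batch_id;
  slot->n_epochs = n_epochs;
  s->fill = (s->fill + 1) % s->depth;
}

/* Final flush: deliver oldest first, starting at the slot due for refill. */
static void sketch_drain_pending(struct sketch_scheduler* s)
{
  for (int k = 0; k < s->depth; ++k)
    sketch_drain_slot(s, (s->fill + k) % s->depth);
}
```

With depth 2, the third kick is the first to deliver anything (batch 0), and the final drain delivers the remaining batches oldest first. With depth 1, every kick after the first delivers its predecessor.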

PR #139's output_slot ledger was evaluated and not adopted: its
substance is cap-stacking bookkeeping (cursors, plan/commit split, a
5-state lifecycle this pipeline never distinguishes); at one batch per
slot it collapses to what schedule_slot expresses in 3 fields. Also
deleted as verified-dead: flush_slot_gpu.batch_epoch_count (write-only)
and flush_pipeline.current (never read).

Three knowingly-accepted deltas, none adding/removing/reordering
primitives on any happy-path stream: all five streams are created together
at init (drain was previously created inside d2h init); the tail
produce-acquire (a pure payload fetch — its consumed edge is a host rule)
moved a few host statements earlier in the drain; and on the
lod_actual_bytes error path the drain now records SLOT_DRAINED where the
old code skipped it — strictly safer, since the next acquire can no longer
park on a never-recorded edge.

Validation (L40, sm_89)

  • ctest -E "(s3)": 47/47 RelWithDebInfo and Debug at the tip; full
    Rel suite green at every commit; Debug clean of ordering asserts.
  • test_cross_validate ×8 (determinism, zstd round-trip, page-aligned
    tail-carry); test_multiarray_gpu Rel+Debug; two_streams bench both
    codecs.
  • Mutation checks re-run on the final commit (the scheduler must not
    weaken what the pools carry): dropping the swap-time chunk-pool acquire →
    gpu_zstd_determinism fails 8/8; forcing the tail gate off →
    gpu_page_aligned_tail_carry fails 8/8. Restored: green.
  • Zero new build warnings. Perf vs branch base: means within ±1.5% on
    256cube/medfmt × zstd/none (one noisy codec-none rep re-tested clean).

Commit guide

  • Add schedule module owning streams — gpu_streams + depth enum +
    scheduler state; selection logic.
  • Move flush orchestration into scheduler — stream.flush.{c,h} deleted;
    entry points become schedule_accumulate_epoch / schedule_flush_* /
    schedule_drain_pending.
  • Lift compress acquires into schedule — kick acquires/tail-arm/releases
    out of flush.compress_agg.c; stage keeps payload phases.
  • Lift d2h acquires into schedule — kick/drain acquires, SLOT_DRAINED
    releases, exactly-once tail publish out of flush.d2h_deliver.c.
  • Move partial-append emit into lod stage — the LOD payload that lived in
    stream.flush.c.
  • Trim comments to why-only — comment pass; also documents
    gpu_pool_release_all's whole-table scope (review nit from #148, "Carry buffer ordering in resource pools").

codecov Bot commented Jun 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
see 1 file with indirect coverage changes

nclack commented Jun 12, 2026

Member Author

Reviewed against upstream/main (d963364) as a behavior-neutrality audit: reconstructed the enqueue sequence per stream for every schedule shape (PIPELINED / DRAIN_BEFORE_KICK / DRAIN_AFTER_KICK × passthrough/compressed × lod_active on/off × partial-batch flush) from the old stream.flush.c + stage files and compared against schedule.c, then re-ran the validation matrix independently on an L40.

Verdict: ship (one doc nit, one description nit below — nothing blocking).

What checks out

  • Op order is primitive-for-primitive identical on all five streams in every shape, including the no-op drain of the unkicked fill slot in DRAIN_BEFORE_KICK and the CHUNK_INDEX_READY/SLOT_DRAINED record positions in the d2h kick (compressed vs passthrough never co-occur, so folding the index release after the payload is order-neutral).
  • schedule_select() reproduces the old decisions at every call site: gate_ord is NULL exactly where sync_flush=1 was set (src/multiarray/stream.gpu.c:115) and non-NULL for single-array (src/gpu/stream.init.c:378-382); the old per-swap fallback predicate (page_size>0 && total_shards>0 && !gate_supported) reads only values immutable after array init, so one-shot selection is equivalent — including page-aligned-with-zero-shards, non-page-aligned-with-gate-unsupported, and unbounded streams. schedule_lod_active's fc=0-only event check is sound because stream.init.c:156-158 binds both fc instances together iff lim->any_multiscale, which is also exactly when lod_shared.d_linear exists (the old epoch-path condition).
  • Exactly-once tail publish: all eight exit paths of schedule_d2h_drain (has_error early-out, three host-acquire failures, drain-copy dispatch error, release failure, drain_sink error, success) funnel through the single gpu_pool_release_produce_gen at Done (src/gpu/schedule.c:272-283), with err correct on each; metadata update stays after the publish and only on success, as before. Destroy/auto-flush and schedule_quiesce_output are verbatim relocations.
  • Deletion completeness: HOST_RULE asserts survived (DELIVER_OLDEST_FIRST in drain_slot, schedule.c:368-371; DRAIN_BEFORE_REKICK in kick_batch, schedule.c:399-400, covering both old assert sites); LUT cache, flush metrics, io-fence wait, and the partial-append emit (now lod_emit_partial_append, identical pointer math via gpu_pool_at base) all map 1:1. batch_epoch_count (3 writes, 0 reads) and flush_pipeline.current (0 references) confirmed dead on main.
  • Both declared deltas verified safe: nothing observed the drain stream between old creation points (registration is debug-assert-only; first use is the first drain), and the tail produce-acquire is a pure payload fetch (tail.consumed == GPU_EDGE_COUNT, so gpu_pool_host_acquire_produce queues and blocks nothing, pool.c:75-85).
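
The one-shot selection argument above can be sketched as follows. This is an illustration under stated assumptions, not the signature of the real schedule_select(): the `sketch_inputs` struct, the function names, and the precedence of the multiarray check over the fallback predicate are hypothetical. Only the enum names, the depths, and the predicate `page_size>0 && total_shards>0 && !gate_supported` come from the PR and the review.

```c
#include <stdbool.h>
#include <stddef.h>

enum schedule_kind {
  SCHEDULE_PIPELINED,         /* depth 2 */
  SCHEDULE_DRAIN_BEFORE_KICK, /* depth 1 */
  SCHEDULE_DRAIN_AFTER_KICK,  /* depth 1 */
};

/* Hypothetical bundle of the values the review says are immutable after
 * array init, which is why selecting once is equivalent to per-swap checks. */
struct sketch_inputs {
  const void* gate_ord; /* NULL where the old code set sync_flush=1 (multiarray) */
  size_t page_size;     /* 0 means not page-aligned */
  size_t total_shards;  /* 0 for e.g. unbounded streams (assumption) */
  bool gate_supported;
};

static enum schedule_kind sketch_select(const struct sketch_inputs* in)
{
  /* Assumed precedence: multiarray is checked first. */
  if (in->gate_ord == NULL)
    return SCHEDULE_DRAIN_AFTER_KICK;
  /* Old per-swap fallback predicate, evaluated once here. */
  if (in->page_size > 0 && in->total_shards > 0 && !in->gate_supported)
    return SCHEDULE_DRAIN_BEFORE_KICK;
  return SCHEDULE_PIPELINED;
}

static int sketch_depth(enum schedule_kind k)
{
  return k == SCHEDULE_PIPELINED ? 2 : 1;
}
```

Under these assumptions, the edge cases named in the review (page-aligned with zero shards, non-page-aligned with the gate unsupported) both fall through to SCHEDULE_PIPELINED.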

Findings

  1. (minor — third behavior delta, error-path only) When lod_actual_bytes fails, the new drain records SLOT_DRAINED where the old one did not: old sync_and_deliver jumped to Error before its gpu_pool_release_consume; new d2h_deliver_drain_copy folds that failure into its return (src/gpu/flush.d2h_deliver.c:199) and schedule.c:237 releases unconditionally. Strictly safer (old left the slot-reuse edge stale for the next acquire), but the PR text says "two knowingly-accepted deltas" — worth a sentence there for the audit trail.
  2. (nit, next doc touch) docs/gpu-orchestration.md:3 still says "migration in progress (step 2)"; this PR completes step 4's exit criteria.
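
For the audit trail, the shape behind finding 1 and the exactly-once publish can be sketched with counters standing in for the real pool calls. Everything here is hypothetical scaffolding (`sketch_drain`, `fail_at`, `drain_trace`); it only mirrors the control flow described above: the drain-copy failure is folded into a return value, SLOT_DRAINED is recorded unconditionally after it, every exit funnels through one publish at Done, and metadata updates only on success.

```c
#include <stdbool.h>

/* Counters standing in for the real release/publish/update calls. */
struct drain_trace {
  int slot_drained_records; /* stand-in for the SLOT_DRAINED release */
  int tail_publishes;       /* stand-in for the single publish at Done */
  int metadata_updates;
};

enum fail_at { FAIL_NONE, FAIL_HOST_ACQUIRE, FAIL_DRAIN_COPY, FAIL_DRAIN_SINK };

static int sketch_drain(enum fail_at f, bool has_error, struct drain_trace* t)
{
  int err = 0;

  if (has_error) { /* early-out still publishes */
    err = 1;
    goto Done;
  }
  if (f == FAIL_HOST_ACQUIRE) {
    err = 1;
    goto Done;
  }

  /* Drain copy: a failure (e.g. the lod_actual_bytes path) is folded into
   * err, and the slot-reuse edge is recorded regardless. The old code
   * skipped this record on that error path. */
  err = (f == FAIL_DRAIN_COPY);
  t->slot_drained_records++;
  if (err)
    goto Done;

  if (f == FAIL_DRAIN_SINK) {
    err = 1;
    goto Done;
  }

Done:
  t->tail_publishes++; /* exactly once on every path */
  if (!err)
    t->metadata_updates++; /* after the publish, success only */
  return err;
}
```

In this model a drain-copy failure yields one SLOT_DRAINED record, one publish, and no metadata update, so the next acquire has an edge to wait on.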

Independent validation (L40, sm_89, CUDA 13.1)

  • Zero new build warnings (warning sets byte-identical to base, RelWithDebInfo + Debug).
  • ctest -E "(s3)": 47/47 RelWithDebInfo and 47/47 Debug.
  • test_cross_validate ×8 and test_multiarray_gpu ×4: all green. 256cube_two_streams PASS for zstd (2.87 GiB/s combined) and none (0.84 GiB/s).
  • Mutation checks (independent of the PR's own): dropping the swap-time chunk-pool produce-acquire (schedule.c:457-460) → gpu_zstd_determinism red 6/6 (round-trip also red, consistent with #140-class corruption, "Fix corrupted output from GPU compressed writes"); forcing the tail-gate arm's enable off (schedule.c:117-120) → gpu_page_aligned_tail_carry red 6/6. Both restore to green.
  • Perf, interleaved ×3 vs base, zstd: 256cube_single +1.2% (2.820→2.853 GiB/s median, within noise); medfmt_single +5.9% (2.154→2.281) — flagged by the >3% rule but in the favorable direction and within base's own 6% rep spread. No regression.
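
The percentages in the perf bullet follow from the quoted medians. A small sketch of that arithmetic, assuming the ">3% rule" flags any change whose magnitude exceeds 3% regardless of direction (the helper names are illustrative, not part of the bench tooling):

```c
/* Relative change of tip vs base, in percent. */
static double rel_change_pct(double base, double tip)
{
  return 100.0 * (tip - base) / base;
}

/* Assumed reading of the ">3% rule": flag on magnitude, either direction. */
static int perf_flagged(double pct, double threshold_pct)
{
  double mag = pct < 0.0 ? -pct : pct;
  return mag > threshold_pct;
}
```

For 256cube_single, 100 × (2.853 − 2.820) / 2.820 ≈ 1.2%, which is not flagged. For medfmt_single, 100 × (2.281 − 2.154) / 2.154 ≈ 5.9%, which is flagged even though it is a speedup.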

@nclack nclack merged commit a5f7f7a into acquire-project:main Jun 12, 2026
8 checks passed
@nclack nclack deleted the gpu-scheduler branch June 12, 2026 19:02
nclack added a commit that referenced this pull request Jun 17, 2026
Two small, behavior-neutral cleanups flagged during the
orchestration-migration reviews (#143/#148/#149/#151).

- **Remove dead `flush.helpers.h`.** The header had no remaining users
after the scheduler work; deleted the file, its two `#include`s, and the
CMakeLists entry. Verified no symbol it declared is referenced anywhere.
- **Drop a redundant `POOL_FILLED` record.** On the drain-after-kick
(multiarray sync) schedule, `schedule_accumulate_epoch` recorded the
pool-filled edge and then `schedule_flush_accumulated` recorded it again
on the same stream. Skip the first on that path; the flush-accumulated
release is the load-bearing one before its kick. The pipelined and
drain-before-kick paths are unchanged.

Validation (L40, sm_89, RelWithDebInfo): `ctest -E "(s3)"` 48/48; zero
new warnings.

---------

Co-authored-by: Nathan Clack <nclack@biohub.org>