Extract GPU orchestration into a scheduler by nclack · Pull Request #149 · acquire-project/chucky

nclack · 2026-06-12T15:57:43Z

Step 4 of the migration in docs/gpu-orchestration.md — extract the
scheduler — following #143 (edge table), #144 (engine unification), and
#148 (resource pools). One module now owns GPU orchestration: stream
creation, pipeline depth, which stages run for the active configuration,
and degraded schedules. Stages are payload only.

What

src/gpu/schedule.{h,c} (170 + ~620 lines):

All five CUDA streams created in one place (gpu_streams), including
the drain stream with its deadlock constraint documented at the
declaration (drain-time copies must not share the d2h stream — by drain
time it can hold the next kick's AGG_DONE wait, parked behind the tail
gate until this very drain publishes).
Pipeline depth as an enum, selected once per array by
schedule_select():
- SCHEDULE_PIPELINED — depth 2; drain a slot only when about to refill
  it.
- SCHEDULE_DRAIN_BEFORE_KICK — depth 1; page-aligned tail state without
  a working gate cannot be ordered device-side.
- SCHEDULE_DRAIN_AFTER_KICK — depth 1; multiarray, whose immediate
  drains host-order the tail uploads.
The other dependency-shape selections live with it: lod_active (second
producer edge into the chunk pool) selected at bind;
passthrough-vs-compressed rides each batch's handoff; the per-batch
tail-gate arm predicate is inside the kick schedule.
Slot bookkeeping: flush_pipeline + batch_state + pools.current +
sync_flush collapse into gpu_scheduler (3-field schedule_slot ×2,
fill index, depth). stream.flush.{c,h} is deleted (474 lines); its
LOD payload moved to lod_emit_partial_append in the lod stage.
Acquire/release lifted out of stages. schedule_compress_agg_kick /
schedule_d2h_kick / schedule_d2h_drain place every cross-stage
acquire, release, tail-gate arm, and publish around the stages' payload
phases (compress_agg: prepare/compress/aggregate/fill_handoff; d2h:
kick/drain_copy/drain_sink/update_metadata) — in the exact original
stream order. flush.d2h_deliver.c now contains zero pool/edge calls.
Four gpu_pool_at host-ordered sites became schedule-owned; the
remaining ones are ingest-internal (step 5's territory) or documented
host-ordered wiring.
Ingest staging is deliberately untouched: its acquires order its own
double buffer, not cross-stage shape, and step 5 rebuilds that path.
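
To make the one-shot depth selection concrete, here is a minimal sketch in C. It is not the code in schedule.c. The enum names come from the description above, and the fallback predicate (page_size>0 && total_shards>0 && !gate_supported) is the one quoted in the review of this PR. The input struct, its field names, the `sketch_` function names, and the precedence of the multiarray check over the fallback check are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Schedule shapes as named in the PR description. */
enum schedule_kind
{
  SCHEDULE_PIPELINED,         /* depth 2 */
  SCHEDULE_DRAIN_BEFORE_KICK, /* depth 1 */
  SCHEDULE_DRAIN_AFTER_KICK,  /* depth 1 */
};

/* Hypothetical inputs: every field here is immutable after array init,
 * which is why selecting once per array is enough. */
struct sketch_schedule_inputs
{
  bool multiarray;     /* assumed stand-in for the old sync_flush=1 case */
  size_t page_size;    /* > 0 means page-aligned output */
  size_t total_shards;
  bool gate_supported; /* is a working device-side tail gate available? */
};

/* Assumption: the multiarray check takes precedence over the fallback. */
static enum schedule_kind
sketch_schedule_select(const struct sketch_schedule_inputs* in)
{
  assert(in != NULL);
  if (in->multiarray)
    return SCHEDULE_DRAIN_AFTER_KICK;
  if (in->page_size > 0 && in->total_shards > 0 && !in->gate_supported)
    return SCHEDULE_DRAIN_BEFORE_KICK;
  return SCHEDULE_PIPELINED;
}

/* Pipeline depth implied by each shape. */
static int
sketch_schedule_depth(enum schedule_kind kind)
{
  return kind == SCHEDULE_PIPELINED ? 2 : 1;
}
```

Under these assumptions, a single-array stream with a working gate selects SCHEDULE_PIPELINED at depth 2, and only page-aligned output with shards and no working gate falls back to draining before each kick.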

PR #139's output_slot ledger was evaluated and not adopted: its
substance is cap-stacking bookkeeping (cursors, plan/commit split, a
5-state lifecycle this pipeline never distinguishes); at one batch per
slot it collapses to what schedule_slot expresses in 3 fields. Also
deleted as verified-dead: flush_slot_gpu.batch_epoch_count (write-only)
and flush_pipeline.current (never read).

Three knowingly-accepted deltas, none adding/removing/reordering
primitives on any happy-path stream: all five streams are created together
at init (drain was previously created inside d2h init); the tail
produce-acquire (a pure payload fetch — its consumed edge is a host rule)
moved a few host statements earlier in the drain; and on the
lod_actual_bytes error path the drain now records SLOT_DRAINED where the
old code skipped it — strictly safer, since the next acquire can no longer
park on a never-recorded edge.

Validation (L40, sm_89)

ctest -E "(s3)": 47/47 RelWithDebInfo and Debug at the tip; full
Rel suite green at every commit; Debug clean of ordering asserts.
test_cross_validate ×8 (determinism, zstd round-trip, page-aligned
tail-carry); test_multiarray_gpu Rel+Debug; two_streams bench both
codecs.
Mutation checks re-run on the final commit (the scheduler must not
weaken what the pools carry): dropping the swap-time chunk-pool acquire →
gpu_zstd_determinism fails 8/8; forcing the tail gate off →
gpu_page_aligned_tail_carry fails 8/8. Restored: green.
Zero new build warnings. Perf vs branch base: means within ±1.5% on
256cube/medfmt × zstd/none (one noisy codec-none rep re-tested clean).

Commit guide

Add schedule module owning streams — gpu_streams + depth enum +
scheduler state; selection logic.
Move flush orchestration into scheduler — stream.flush.{c,h} deleted;
entry points become schedule_accumulate_epoch / schedule_flush_* /
schedule_drain_pending.
Lift compress acquires into schedule — kick acquires/tail-arm/releases
out of flush.compress_agg.c; stage keeps payload phases.
Lift d2h acquires into schedule — kick/drain acquires, SLOT_DRAINED
releases, exactly-once tail publish out of flush.d2h_deliver.c.
Move partial-append emit into lod stage — the LOD payload that lived in
stream.flush.c.
Trim comments to why-only — comment pass; also documents
gpu_pool_release_all's whole-table scope (review nit from #148, "Carry buffer ordering in resource pools").

codecov · 2026-06-12T16:03:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
see 1 file with indirect coverage changes

nclack · 2026-06-12T16:38:16Z

Reviewed against upstream/main (d963364) as a behavior-neutrality audit: reconstructed the enqueue sequence per stream for every schedule shape (PIPELINED / DRAIN_BEFORE_KICK / DRAIN_AFTER_KICK × passthrough/compressed × lod_active on/off × partial-batch flush) from the old stream.flush.c + stage files and compared against schedule.c, then re-ran the validation matrix independently on an L40.

Verdict: ship (one doc nit, one description nit below — nothing blocking).

What checks out

Op order is primitive-for-primitive identical on all five streams in every shape, including the no-op drain of the unkicked fill slot in DRAIN_BEFORE_KICK and the CHUNK_INDEX_READY/SLOT_DRAINED record positions in the d2h kick (compressed vs passthrough never co-occur, so folding the index release after the payload is order-neutral).
schedule_select() reproduces the old decisions at every call site: gate_ord is NULL exactly where sync_flush=1 was set (src/multiarray/stream.gpu.c:115) and non-NULL for single-array (src/gpu/stream.init.c:378-382); the old per-swap fallback predicate (page_size>0 && total_shards>0 && !gate_supported) reads only values immutable after array init, so one-shot selection is equivalent — including page-aligned-with-zero-shards, non-page-aligned-with-gate-unsupported, and unbounded streams. schedule_lod_active's fc=0-only event check is sound because stream.init.c:156-158 binds both fc instances together iff lim->any_multiscale, which is also exactly when lod_shared.d_linear exists (the old epoch-path condition).
Exactly-once tail publish: all eight exit paths of schedule_d2h_drain (has_error early-out, three host-acquire failures, drain-copy dispatch error, release failure, drain_sink error, success) funnel through the single gpu_pool_release_produce_gen at Done (src/gpu/schedule.c:272-283), with err correct on each; metadata update stays after the publish and only on success, as before. Destroy/auto-flush and schedule_quiesce_output are verbatim relocations.
Deletion completeness: HOST_RULE asserts survived (DELIVER_OLDEST_FIRST in drain_slot, schedule.c:368-371; DRAIN_BEFORE_REKICK in kick_batch, schedule.c:399-400, covering both old assert sites); LUT cache, flush metrics, io-fence wait, and the partial-append emit (now lod_emit_partial_append, identical pointer math via gpu_pool_at base) all map 1:1. batch_epoch_count (3 writes, 0 reads) and flush_pipeline.current (0 references) confirmed dead on main.
Both declared deltas verified safe: nothing observed the drain stream between old creation points (registration is debug-assert-only; first use is the first drain), and the tail produce-acquire is a pure payload fetch (tail.consumed == GPU_EDGE_COUNT → gpu_pool_host_acquire_produce queues and blocks nothing, pool.c:75-85).
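
The exactly-once tail publish described above relies on a single-exit pattern: every failure jumps to one label, and the publish lives only there. The following is a simplified sketch of that pattern, not the code in schedule.c. The types, the fault enum, and all function names are hypothetical stand-ins, and the real drain has more exit paths than are modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the tail edge: it only counts publishes. */
struct sketch_tail
{
  int publish_count;
};

/* Hypothetical failure injection points along the drain. */
enum sketch_fault
{
  SKETCH_FAULT_NONE,
  SKETCH_FAULT_ACQUIRE,
  SKETCH_FAULT_COPY,
  SKETCH_FAULT_SINK,
};

static void
sketch_tail_publish(struct sketch_tail* tail)
{
  tail->publish_count++;
}

/* Returns 0 on success and 1 on error. The publish happens exactly once
 * on every path; the metadata update happens after it, on success only. */
static int
sketch_drain(struct sketch_tail* tail,
             bool has_error,
             enum sketch_fault fault,
             int* metadata_updates)
{
  int err = 0;
  assert(tail != NULL && metadata_updates != NULL);

  if (has_error) { err = 1; goto Done; }                     /* early-out */
  if (fault == SKETCH_FAULT_ACQUIRE) { err = 1; goto Done; } /* host acquire */
  if (fault == SKETCH_FAULT_COPY) { err = 1; goto Done; }    /* drain copy */
  if (fault == SKETCH_FAULT_SINK) { err = 1; goto Done; }    /* drain sink */

Done:
  sketch_tail_publish(tail); /* the only publish site */
  if (!err)
    (*metadata_updates)++;
  return err;
}
```

Because no path returns before the label, a later waiter can never park on a publish that was skipped by an early return, which is the property the audit checks for each exit path.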

Findings

(minor — third behavior delta, error-path only) When lod_actual_bytes fails, the new drain records SLOT_DRAINED where the old one did not: old sync_and_deliver jumped to Error before its gpu_pool_release_consume; new d2h_deliver_drain_copy folds that failure into its return (src/gpu/flush.d2h_deliver.c:199) and schedule.c:237 releases unconditionally. Strictly safer (old left the slot-reuse edge stale for the next acquire), but the PR text says "two knowingly-accepted deltas" — worth a sentence there for the audit trail.
(nit, next doc touch) docs/gpu-orchestration.md:3 still says "migration in progress (step 2)"; this PR completes step 4's exit criteria.

Independent validation (L40, sm_89, CUDA 13.1)

Zero new build warnings (warning sets byte-identical to base, RelWithDebInfo + Debug).
ctest -E "(s3)": 47/47 RelWithDebInfo and 47/47 Debug.
test_cross_validate ×8 and test_multiarray_gpu ×4: all green. 256cube_two_streams PASS for zstd (2.87 GiB/s combined) and none (0.84 GiB/s).
Mutation checks (independent of the PR's own): dropping the swap-time chunk-pool produce-acquire (schedule.c:457-460) → gpu_zstd_determinism red 6/6 (round-trip also red, consistent with the class of corruption fixed in #140, "Fix corrupted output from GPU compressed writes"); forcing the tail-gate arm's enable off (schedule.c:117-120) → gpu_page_aligned_tail_carry red 6/6. Both restore to green.
Perf, interleaved ×3 vs base, zstd: 256cube_single +1.2% (2.820→2.853 GiB/s median, within noise); medfmt_single +5.9% (2.154→2.281) — flagged by the >3% rule but in the favorable direction and within base's own 6% rep spread. No regression.
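
The percentages above follow directly from the quoted medians. The small C sketch below reproduces that arithmetic. The function names are hypothetical, and treating the ">3% rule" as a check on the magnitude of the relative change is an assumption about how that rule is applied.

```c
#include <assert.h>
#include <stdbool.h>

/* Relative change of candidate vs base, in percent. */
static double
sketch_percent_delta(double base, double candidate)
{
  assert(base > 0.0);
  return 100.0 * (candidate - base) / base;
}

/* Assumed form of the ">3% rule": flag when the magnitude of the change
 * exceeds the threshold, regardless of direction. */
static bool
sketch_flagged(double pct, double threshold_pct)
{
  double magnitude = pct < 0.0 ? -pct : pct;
  return magnitude > threshold_pct;
}
```

With the medians quoted above, sketch_percent_delta(2.820, 2.853) is about 1.17 (reported as +1.2%) and sketch_percent_delta(2.154, 2.281) is about 5.90 (reported as +5.9%), so only the medfmt_single result crosses a 3% threshold.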

Two small, behavior-neutral cleanups flagged during the orchestration-migration reviews (#143/#148/#149/#151).

- **Remove dead `flush.helpers.h`.** The header had no remaining users after the scheduler work; deleted the file, its two `#include`s, and the CMakeLists entry. Verified no symbol it declared is referenced anywhere.
- **Drop a redundant `POOL_FILLED` record.** On the drain-after-kick (multiarray sync) schedule, `schedule_accumulate_epoch` recorded the pool-filled edge and then `schedule_flush_accumulated` recorded it again on the same stream. Skip the first on that path; the flush-accumulated release is the load-bearing one before its kick. The pipelined and drain-before-kick paths are unchanged.

Validation (L40, sm_89, RelWithDebInfo): `ctest -E "(s3)"` 48/48; zero new warnings.

Co-authored-by: Nathan Clack <nclack@biohub.org>

Nathan Clack added 6 commits June 12, 2026 15:49

Add schedule module owning streams

603565d

Move flush orchestration into scheduler

39927c4

Lift compress acquires into schedule

d941c8a

Lift d2h acquires into schedule

8d1062c

Move partial-append emit into lod stage

8e37e58

Trim comments to why-only

a27fed2

nclack merged commit a5f7f7a into acquire-project:main Jun 12, 2026
8 checks passed

nclack deleted the gpu-scheduler branch June 12, 2026 19:02

This was referenced Jun 13, 2026

Move staging copies and sink delivery to worker threads #151

Merged

Remove dead header and a redundant edge record #157

Merged
