Skip to content

Move staging copies and sink delivery to worker threads#151

Merged
nclack merged 7 commits into
acquire-project:mainfrom
nclack:gpu-workers
Jun 13, 2026
Merged

Move staging copies and sink delivery to worker threads#151
nclack merged 7 commits into
acquire-project:mainfrom
nclack:gpu-workers

Conversation

@nclack

@nclack nclack commented Jun 13, 2026

Copy link
Copy Markdown
Member

Step 5 — the last step — of the migration in docs/gpu-orchestration.md:
move data movement off the producer thread. The June L40 baseline showed
wall time is almost entirely host-side, with the staging copy as the
single biggest lever. Steps 1–4 (#143, #144, #148, #149) made the ordering
explicit and machine-checked so this step could add host threads without
re-creating the bug class behind #140/#141/#145.

What

  • Staging copies run on a small helper pool (Parallelize staging copy in append): the producer's user→pinned copy is split across helpers when
    the payload is large enough to pay for the dispatch. The copy is
    bandwidth-bound, so a few helpers saturate it. If the pool can't start,
    copies fall back to running serially on the producer.
  • Sink delivery runs on a worker (Queue sink delivery on a worker):
    the pipelined schedule queues each kicked batch's delivery at kick time
    and joins it before refilling that slot, so the polls and the sink writes
    leave the producer thread. One job slot per pipeline slot, run strictly
    oldest-first (the tail counter's threshold and the declared
    deliver-oldest-first rule depend on that order). Ownership follows the
    same single-writer rule as the pools: the producer owns a job before
    enqueue and after join; the worker owns it in between. Depth-1 schedules
    and multiarray keep draining inline — the host ordering they exist for
    must not move to another thread.
  • Teardown: destroy runs every queued delivery to completion (each
    publishes its tail generation, failure or not) before the forced gate
    release — a late publish after the release would re-park the gate. A new
    test (Add mid-stream destroy abuse test) destroys a stream with a
    kicked, undelivered batch while sink IO is stalled, in both build
    configurations.

Deliberately out of scope, per the plan: sharing one orchestration between
CPU and GPU backends, and multiarray-as-composition. Both remain follow-ups.

Performance (L40, 5 runs per number, base = main before this PR)

--prefill is a new bench flag: fill the input buffer once instead of
regenerating synthetic data before every append. It separates the cost of
the writer (what this library does) from the cost of the benchmark's own
data generation, which the baseline measured at roughly half of wall time.

scenario uncompressed uncompressed, prefill zstd zstd, prefill
256cube_single 6.41 → 9.07 (+41%) 16.95 → 20.55 (+21%) 6.23 → 7.37 (+18%) 9.35 → 9.35
medfmt_single 5.43 → 7.03 (+29%) 16.03 → 19.81 (+24%) 4.01 → 3.30 (−18%, see below) 4.39 → 4.40
orca2_single 6.53 → 9.06 (+39%) 15.55 → 19.02 (+22%) 4.76 → 4.83 4.82 → 4.85
256cube_two_streams 6.59 → 9.13 (+39%) 3.87 → 4.21 (+9%)

The zstd columns sit at or near the compression ceiling (compress stage
time is unchanged), which is the expected next bottleneck.

The one regression — medfmt zstd without prefill — is contention with
the benchmark's own data generation, not a pipeline defect.
The metrics
show the copy itself got twice as fast per operation, but on the most
bandwidth-hungry scenario the benchmark's fill thread, the copy helpers,
and compression all compete for memory bandwidth and everyone slows. With
prefill — closer to a real producer, whose data already exists — the same
scenario is dead even (4.39 → 4.40). The review re-ran this comparison on
two other nodes and measured flat (4.31 → 4.35) — the regression is
environment-dependent contention, not inherent. Benchmark methodology is
tracked in #150.

Correctness evidence (L40, sm_89)

  • Full test suite: 48/48 in both Release and Debug at the tip (a new
    test joins the suite), 47/47 at the copy-only commit; zero new build
    warnings; Debug runs clean of ordering asserts.
  • test_cross_validate ×20 — twenty repetitions of the determinism,
    zstd round-trip, and page-aligned tail-carry tests under the new host
    concurrency (exactly the failure mode those tests exist to catch).
  • Both ordering mutation checks re-run: removing the chunk-pool wait →
    determinism test fails 8/8; disabling the tail counter → tail-carry test
    fails 8/8. Restored: green.
  • Mid-stream destroy abuse test passes in both configurations.
  • test_multiarray_gpu + two_streams bench, both codecs.

Commit guide

  • Parallelize staging copy in append — the copy helper pool; suite green
    alone at this commit.
  • Queue sink delivery on a worker — the delivery worker, oldest-first
    queue, teardown drain-and-join.
  • Add mid-stream destroy abuse test — destroy with queued work and
    stalled sink IO.
  • Trim comments; update doc status — comment pass; doc status now
    reflects steps 1–4 merged.
  • Make sink pending-byte counters atomic — review finding: queued_bytes
    is now written by the delivery worker while the producer reads it for
    backpressure; retired_bytes was already atomic.
  • Add prefill option to stream benches — the --prefill flag used by
    the table above, so those columns are reproducible from this branch.
  • Address worker review nits — metric doc updated for worker-side
    overlap; worker context-set failure now warns.

@codecov

codecov Bot commented Jun 13, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines Coverage Δ
src/zarr/shard_pool_fs.c 60.90% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nclack

nclack commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

Reviewed against docs/gpu-orchestration.md §5: ownership/ordering audit of the delivery worker and the staging-copy split, plus a fresh L40 run (suite in both configs, cross-validate stress x20, both ordering mutation checks, destroy abuse test, ASAN pass, perf A/B against main).

Verdict: ship with fixes. The worker design holds: I could not find a path that breaks the one-job-per-slot ownership rule, the join-before-refill discipline, or oldest-first delivery. Every field the queued drain touches (slot handoff, agg slot, shard cursors, tail buffers/counter, its metric fields) is single-writer with mutex handoff at enqueue/join; producer-side state during a drain is disjoint, including the tail-gate split (kick_seq producer-only, tail_seq/flag worker-only). Teardown's stop_join-before-release_all order is right and the documented hazard is real (a late publish would move the flag backwards past parked GEQ waits); run-out can't deadlock because the worker drains oldest-first, so each queued drain's gate threshold is already satisfied. Depth-1 and multiarray structurally cannot queue (enqueue is gated on SCHEDULE_PIPELINED; multiarray's NULL gate_ord pins DRAIN_AFTER_KICK at array-state init). The copy split is byte-exact (threadpool_for_n covers [0,n) exactly and blocks), and the staging generation can't flip mid-copy because the flip happens in dispatch, after the copy returns.

Findings, ranked:

  1. Fix before merge — new cross-thread race on the sink pending-byte counter. shard_pool_fs.queued_bytes is a plain uint64_t (src/zarr/shard_pool_fs.c:24) incremented on the write path (:105, :162), which this PR moves onto the delivery worker — while the producer reads it every epoch boundary for backpressure (src/gpu/stream.c:196,209 via pool_fs_pending_bytes, shard_pool_fs.c:331). retired_bytes is already _Atomic; queued_bytes needs the same (relaxed is fine). Harmless on today's targets but it is exactly the bug class this migration exists to kill, and it lands with the same change that creates it.
  2. Description vs branch — the prefill control isn't in the PR. The performance table's prefill columns and the medfmt-zstd exoneration rest on a --prefill bench flag that this branch doesn't contain (grep -r prefill bench/ is empty here and on main). The control's logic is sound — and my own runs back it up more directly: on two different L40 nodes the medfmt-zstd scenario was dead even without prefill (base 4.29-4.34 vs PR 4.34-4.37 GiB/s, 7 interleaved reps), i.e. the regression is condition-dependent contention, not intrinsic to the change. But none of that is reproducible from what's submitted; land the bench commit with this PR or ahead of it under Streaming benches spend ~half their time generating input; throughput conflates harness and library #150.
  3. Stale metric doc. src/types.stream.h:42-48 still says flush_stall "wraps the entire schedule_d2h_drain call" and warns about double-counting against kick_sync_stall/sink. On the pipelined path those now accrue on the worker (overlapped time, not producer stall) and flush_stall is just the join wait — the comment now misleads anyone reading bench output. The comment-trim commit updated schedule.c but missed this header.
  4. Nits. delivery_main ignores cuCtxSetCurrent's return (src/gpu/schedule.c:325) — a log line would make a bad context diagnosable instead of every drain erroring mysteriously. gpu_delivery_join loops forever if ever called for an empty slot (:413-415); an assert(job.queued || job.done) would make misuse loud. gpu_delivery_stop_join drops job results by design — worth one line in its header that sink-side errors persist via has_error.
  5. Test gap, non-blocking. test_destroy_midstream is genuinely sensitive (a missed join in destroy's flush, a use-after-free of queued refs, and it gates sink IO the same way Drain sink IO in CPU writer_flush #146's test did) — but neither subtest reaches stop_join with a job still queued, because the auto-flush joins everything first. The run-out-before-release_all line (src/gpu/stream.init.c:345) is exercised only when a flush fails with a queued job, which needs fault injection. Fine as a follow-up.

Validation (L40, sm_89, fresh builds at 7f45e1e):

  • ctest -E "(s3)": 48/48 RelWithDebInfo, 48/48 Debug; no new build warnings.
  • test_cross_validate x20 RelWithDebInfo + x5 Debug (host-rule asserts live): 0 failures.
  • test_destroy_midstream x8 Rel + x2 Debug: 0 failures. test_multiarray_gpu x4: 0 failures.
  • ASAN build (host code instrumented): destroy_midstream x3, cross_validate, test_batch — clean.
  • Mutation checks re-run: dropping the chunk-pool produce wait (the Fix corrupted output from GPU compressed writes #140 edge) -> cross_validate red 8/8 (4 subtests); disabling the tail-counter wait -> red 8/8 with page-aligned tail-carry failing each time; restored -> green 3/3. The gates still bite under the new threads.
  • Perf A/B vs main (prebuilt base, interleaved reps, --frames 300, GiB/s):
    • 256cube_single zstd: 5.87 -> 7.11 (+21%) [PR: +18%]
    • 256cube_single none: 6.11 -> 8.23 (+35%) [PR: +41%]
    • medfmt_single none: ~6.1 -> ~8.5 (+38%, two nodes) [PR: +29%]
    • medfmt_single zstd: 4.31 -> 4.35 (flat, two nodes) [PR: -18%; did not reproduce here — see finding 2]
    • 256cube_two_streams: zstd 4.24, none 8.87 (PR post-change: 4.21 / 9.13)

@nclack nclack merged commit 5af7ead into acquire-project:main Jun 13, 2026
8 checks passed
@nclack nclack deleted the gpu-workers branch June 13, 2026 03:24
nclack added a commit that referenced this pull request Jun 17, 2026
Two small, behavior-neutral cleanups flagged during the
orchestration-migration reviews (#143/#148/#149/#151).

- **Remove dead `flush.helpers.h`.** The header had no remaining users
after the scheduler work; deleted the file, its two `#include`s, and the
CMakeLists entry. Verified no symbol it declared is referenced anywhere.
- **Drop a redundant `POOL_FILLED` record.** On the drain-after-kick
(multiarray sync) schedule, `schedule_accumulate_epoch` recorded the
pool-filled edge and then `schedule_flush_accumulated` recorded it again
on the same stream. Skip the first on that path; the flush-accumulated
release is the load-bearing one before its kick. The pipelined and
drain-before-kick paths are unchanged.

Validation (L40, sm_89, RelWithDebInfo): `ctest -E "(s3)"` 48/48; zero
new warnings.

---------

Co-authored-by: Nathan Clack <nclack@biohub.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant