Move staging copies and sink delivery to worker threads #151
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Reviewed against

Verdict: ship with fixes. The worker design holds: I could not find a path that breaks the one-job-per-slot ownership rule, the join-before-refill discipline, or oldest-first delivery. Every field the queued drain touches (slot handoff, agg slot, shard cursors, tail buffers/counter, its metric fields) is single-writer with mutex handoff at enqueue/join; producer-side state during a drain is disjoint, including the tail-gate split (kick_seq producer-only, tail_seq/flag worker-only).

Teardown's stop_join-before-release_all order is right and the documented hazard is real (a late publish would move the flag backwards past parked GEQ waits); run-out can't deadlock because the worker drains oldest-first, so each queued drain's gate threshold is already satisfied. Depth-1 and multiarray structurally cannot queue (enqueue is gated on

Findings, ranked:
Validation (L40, sm_89, fresh builds at 7f45e1e):
Two small, behavior-neutral cleanups flagged during the orchestration-migration reviews (#143/#148/#149/#151).

- **Remove dead `flush.helpers.h`.** The header had no remaining users after the scheduler work; deleted the file, its two `#include`s, and the CMakeLists entry. Verified no symbol it declared is referenced anywhere.
- **Drop a redundant `POOL_FILLED` record.** On the drain-after-kick (multiarray sync) schedule, `schedule_accumulate_epoch` recorded the pool-filled edge and then `schedule_flush_accumulated` recorded it again on the same stream. Skip the first on that path; the flush-accumulated release is the load-bearing one before its kick. The pipelined and drain-before-kick paths are unchanged.

Validation (L40, sm_89, RelWithDebInfo): `ctest -E "(s3)"` 48/48; zero new warnings.

Co-authored-by: Nathan Clack <nclack@biohub.org>
Step 5, the last step, of the migration in
docs/gpu-orchestration.md: move data movement off the producer thread. The June L40 baseline showed
wall time is almost entirely host-side, with the staging copy as the
single biggest lever. Steps 1–4 (#143, #144, #148, #149) made the ordering
explicit and machine-checked so this step could add host threads without
re-creating the bug class behind #140/#141/#145.
What
Parallelize staging copy in append: the producer's user→pinned copy is split across helpers when the payload is large enough to pay for the dispatch. The copy is
bandwidth-bound, so a few helpers saturate it. If the pool can't start,
copies fall back to running serially on the producer.
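
A minimal sketch of the split described above, not the library's implementation. It assumes helpers can be modelled as short-lived `std::thread`s (the PR describes a helper pool), and the names `staged_copy` and `kMinParallelBytes` and the 1 MiB cutoff are hypothetical. It shows the two properties the text names: small payloads skip the dispatch, and a helper that fails to start leaves its bytes to the producer.

```cpp
#include <cstddef>
#include <cstring>
#include <system_error>
#include <thread>
#include <vector>

// Hypothetical cutoff: below this size the dispatch costs more than it saves.
constexpr std::size_t kMinParallelBytes = std::size_t{1} << 20;

// Copies n bytes from src to dst. Large payloads are split into equal chunks,
// one per helper thread, with the calling (producer) thread copying the rest.
// Returns how many helper threads ran; 0 means the copy was fully serial.
std::size_t staged_copy(void* dst, const void* src, std::size_t n,
                        std::size_t helpers) {
  auto* d = static_cast<unsigned char*>(dst);
  const auto* s = static_cast<const unsigned char*>(src);
  std::vector<std::thread> pool;
  std::size_t offset = 0;  // bytes already handed to helpers

  if (n >= kMinParallelBytes && helpers > 0) {
    const std::size_t parts = helpers + 1;  // helpers plus the producer
    const std::size_t chunk = (n + parts - 1) / parts;
    pool.reserve(helpers);
    try {
      while (pool.size() < helpers && offset + chunk < n) {
        pool.emplace_back([d, s, offset, chunk] {
          std::memcpy(d + offset, s + offset, chunk);
        });
        offset += chunk;  // only advanced once the helper has started
      }
    } catch (const std::system_error&) {
      // A helper could not start: the producer picks up everything not yet
      // dispatched, which degrades to a serial copy if none started.
    }
  }
  if (n > offset) std::memcpy(d + offset, s + offset, n - offset);
  for (auto& t : pool) t.join();  // copy is complete only after all joins
  return pool.size();
}
```

Because the copy is bandwidth-bound, a caller would pass a small `helpers` count; a persistent pool would avoid the per-call thread start that this sketch pays.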
Queue sink delivery on a worker: the pipelined schedule queues each kicked batch's delivery at kick time
and joins it before refilling that slot, so the polls and the sink writes
leave the producer thread. One job slot per pipeline slot, run strictly
oldest-first (the tail counter's threshold and the declared
deliver-oldest-first rule depend on that order). Ownership follows the
same single-writer rule as the pools: the producer owns a job before
enqueue and after join; the worker owns it in between. Depth-1 schedules
and multiarray keep draining inline — the host ordering they exist for
must not move to another thread.
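
The ownership and ordering rules above can be sketched as follows. This is an illustration under assumptions, not the PR's code: `DeliveryWorker`, `enqueue`, and `join` are hypothetical names, and a job is modelled as a plain `std::function`. The sketch keeps one job per slot, runs jobs strictly in enqueue order, requires a join before a slot is reused, and drains the queue before the worker thread exits.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class DeliveryWorker {
 public:
  explicit DeliveryWorker(std::size_t slots)
      : busy_(slots, false), thread_([this] { run(); }) {}

  // Teardown: let the worker drain every queued job, then join it.
  ~DeliveryWorker() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stop_ = true;
    }
    work_cv_.notify_one();
    thread_.join();
  }

  // Producer hands the slot's job to the worker. The slot must have been
  // joined first, so each slot holds at most one job.
  void enqueue(std::size_t slot, std::function<void()> job) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      assert(!busy_[slot] && "join the slot before refilling it");
      busy_[slot] = true;
      fifo_.push_back({slot, std::move(job)});
    }
    work_cv_.notify_one();
  }

  // Producer blocks until the slot's job (if any) has finished.
  void join(std::size_t slot) {
    std::unique_lock<std::mutex> lk(mu_);
    done_cv_.wait(lk, [&] { return !busy_[slot]; });
  }

 private:
  struct Item {
    std::size_t slot;
    std::function<void()> job;
  };

  void run() {
    std::unique_lock<std::mutex> lk(mu_);
    for (;;) {
      work_cv_.wait(lk, [&] { return stop_ || !fifo_.empty(); });
      if (fifo_.empty()) return;  // stop requested and nothing left to drain
      Item item = std::move(fifo_.front());  // oldest job first
      fifo_.pop_front();
      lk.unlock();
      item.job();  // the worker owns the job while it runs
      lk.lock();
      busy_[item.slot] = false;  // ownership returns to the producer
      done_cv_.notify_all();
    }
  }

  std::mutex mu_;
  std::condition_variable work_cv_, done_cv_;
  std::deque<Item> fifo_;
  std::vector<bool> busy_;
  bool stop_ = false;
  std::thread thread_;  // declared last so it starts after the other members
};
```

A producer with a two-slot pipeline would call `join(slot)` and then `enqueue(slot, job)` for each kicked batch; the mutex taken at both calls is the handoff point between the two owners.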
publishes its tail generation, failure or not) before the forced gate
release; a late publish after the release would re-park the gate. A new
test (Add mid-stream destroy abuse test) destroys a stream with a kicked,
undelivered batch while sink IO is stalled, in both build configurations.
Deliberately out of scope, per the plan: sharing one orchestration between
CPU and GPU backends, and multiarray-as-composition. Both remain follow-ups.
Performance (L40, 5 runs per number, base = main before this PR)
`--prefill` is a new bench flag: fill the input buffer once instead of regenerating synthetic data before every append. It separates the cost of
the writer (what this library does) from the cost of the benchmark's own
data generation, which the baseline measured at roughly half of wall time.
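
As a rough illustration of what the flag changes in a bench loop (the function `run_appends` and the `BenchCounts` struct are hypothetical, not the bench's actual code): with prefill the fill step runs once before the loop, without it the fill runs before every append.

```cpp
#include <cstddef>
#include <vector>

// Counts how often each step ran, so the two modes can be compared.
struct BenchCounts {
  std::size_t fills = 0;
  std::size_t appends = 0;
};

// Runs `iters` appends over `buf`. When `prefill` is true the buffer is
// filled once up front; otherwise it is regenerated before every append.
template <class Fill, class Append>
BenchCounts run_appends(std::vector<unsigned char>& buf, std::size_t iters,
                        bool prefill, Fill fill, Append append) {
  BenchCounts c;
  if (prefill) {
    fill(buf);
    ++c.fills;
  }
  for (std::size_t i = 0; i < iters; ++i) {
    if (!prefill) {
      fill(buf);  // data generation sits inside the timed loop
      ++c.fills;
    }
    append(buf);
    ++c.appends;
  }
  return c;
}
```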
The zstd columns sit at or near the compression ceiling (compress stage
time is unchanged), which is the expected next bottleneck.
The one regression — medfmt zstd without prefill — is contention with
the benchmark's own data generation, not a pipeline defect. The metrics
show the copy itself got twice as fast per operation, but on the most
bandwidth-hungry scenario the benchmark's fill thread, the copy helpers,
and compression all compete for memory bandwidth and everyone slows. With
prefill — closer to a real producer, whose data already exists — the same
scenario is dead even (4.39 → 4.40). The review re-ran this comparison on
two other nodes and measured flat (4.31 → 4.35) — the regression is
environment-dependent contention, not inherent. Benchmark methodology is
tracked in #150.
Correctness evidence (L40, sm_89)
test joins the suite), 47/47 at the copy-only commit; zero new build
warnings; Debug runs clean of ordering asserts.
test_cross_validate ×20: twenty repetitions of the determinism, zstd round-trip, and page-aligned tail-carry tests under the new host
concurrency (exactly the failure mode those tests exist to catch).
determinism test fails 8/8; disabling the tail counter → tail-carry test
fails 8/8. Restored: green.
test_multiarray_gpu + two_streams bench, both codecs.
Commit guide
- Parallelize staging copy in append: the copy helper pool; suite green alone at this commit.
- Queue sink delivery on a worker: the delivery worker, oldest-first queue, teardown drain-and-join.
- Add mid-stream destroy abuse test: destroy with queued work and stalled sink IO.
- Trim comments; update doc status: comment pass; doc status now reflects steps 1–4 merged.
- Make sink pending-byte counters atomic: a review finding. `queued_bytes` is now written by the delivery worker while the producer reads it for backpressure; `retired_bytes` was already atomic.
- Add prefill option to stream benches: the `--prefill` flag used by the table above, so those columns are reproducible from this branch.
- Address worker review nits: metric doc updated for worker-side overlap; worker context-set failure now warns.
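
The pending-byte counter change can be pictured with a small sketch. Only the field names `queued_bytes` and `retired_bytes` come from this PR; the struct, the helper functions, and the way the limit is applied are assumptions for illustration. One thread adds to each counter while the producer reads both to decide whether to apply backpressure.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical container for the two counters named in the PR.
struct SinkCounters {
  std::atomic<std::uint64_t> queued_bytes{0};   // bumped by the delivery worker
  std::atomic<std::uint64_t> retired_bytes{0};  // bumped when sink IO completes
};

inline void note_queued(SinkCounters& c, std::uint64_t n) {
  c.queued_bytes.fetch_add(n);
}

inline void note_retired(SinkCounters& c, std::uint64_t n) {
  c.retired_bytes.fetch_add(n);
}

// Producer-side read. With the default sequentially consistent ordering,
// loading retired_bytes first means the difference cannot underflow as long
// as bytes are always queued before they are retired.
inline std::uint64_t pending_bytes(const SinkCounters& c) {
  const std::uint64_t retired = c.retired_bytes.load();
  const std::uint64_t queued = c.queued_bytes.load();
  return queued - retired;
}

// Backpressure check: true when the producer should wait for the sink.
inline bool should_block(const SinkCounters& c, std::uint64_t limit) {
  return pending_bytes(c) >= limit;
}
```

A plain integer would be a data race here once the writer and the reader are different threads, which is the situation the review finding describes.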