Move staging copies and sink delivery to worker threads by nclack · Pull Request #151 · acquire-project/chucky

nclack · 2026-06-13T00:11:02Z

Step 5 — the last step — of the migration in docs/gpu-orchestration.md:
move data movement off the producer thread. The June L40 baseline showed
wall time is almost entirely host-side, with the staging copy as the
single biggest lever. Steps 1–4 (#143, #144, #148, #149) made the ordering
explicit and machine-checked so this step could add host threads without
re-creating the bug class behind #140/#141/#145.

What

Staging copies run on a small helper pool (Parallelize staging copy in append): the producer's user→pinned copy is split across helpers when
the payload is large enough to pay for the dispatch. The copy is
bandwidth-bound, so a few helpers saturate it. If the pool can't start,
copies fall back to running serially on the producer.
Sink delivery runs on a worker (Queue sink delivery on a worker):
the pipelined schedule queues each kicked batch's delivery at kick time
and joins it before refilling that slot, so the polls and the sink writes
leave the producer thread. One job slot per pipeline slot, run strictly
oldest-first (the tail counter's threshold and the declared
deliver-oldest-first rule depend on that order). Ownership follows the
same single-writer rule as the pools: the producer owns a job before
enqueue and after join; the worker owns it in between. Depth-1 schedules
and multiarray keep draining inline — the host ordering they exist for
must not move to another thread.
Teardown: destroy runs every queued delivery to completion (each
publishes its tail generation, failure or not) before the forced gate
release — a late publish after the release would re-park the gate. A new
test (Add mid-stream destroy abuse test) destroys a stream with a
kicked, undelivered batch while sink IO is stalled, in both build
configurations.

Deliberately out of scope, per the plan: sharing one orchestration between
CPU and GPU backends, and multiarray-as-composition. Both remain follow-ups.

Performance (L40, 5 runs per number, base = main before this PR)

--prefill is a new bench flag: fill the input buffer once instead of
regenerating synthetic data before every append. It separates the cost of
the writer (what this library does) from the cost of the benchmark's own
data generation, which the baseline measured at roughly half of wall time.

scenario	uncompressed	uncompressed, prefill	zstd	zstd, prefill
256cube_single	6.41 → 9.07 (+41%)	16.95 → 20.55 (+21%)	6.23 → 7.37 (+18%)	9.35 → 9.35
medfmt_single	5.43 → 7.03 (+29%)	16.03 → 19.81 (+24%)	4.01 → 3.30 (−18%, see below)	4.39 → 4.40
orca2_single	6.53 → 9.06 (+39%)	15.55 → 19.02 (+22%)	4.76 → 4.83	4.82 → 4.85
256cube_two_streams	6.59 → 9.13 (+39%)	—	3.87 → 4.21 (+9%)	—

The zstd columns sit at or near the compression ceiling (compress stage
time is unchanged), which is the expected next bottleneck.

The one regression — medfmt zstd without prefill — is contention with
the benchmark's own data generation, not a pipeline defect. The metrics
show the copy itself got twice as fast per operation, but on the most
bandwidth-hungry scenario the benchmark's fill thread, the copy helpers,
and compression all compete for memory bandwidth and everyone slows. With
prefill — closer to a real producer, whose data already exists — the same
scenario is dead even (4.39 → 4.40). The review re-ran this comparison on
two other nodes and measured flat (4.31 → 4.35) — the regression is
environment-dependent contention, not inherent. Benchmark methodology is
tracked in #150.

Correctness evidence (L40, sm_89)

Full test suite: 48/48 in both Release and Debug at the tip (a new
test joins the suite), 47/47 at the copy-only commit; zero new build
warnings; Debug runs clean of ordering asserts.
test_cross_validate ×20 — twenty repetitions of the determinism,
zstd round-trip, and page-aligned tail-carry tests under the new host
concurrency (exactly the failure mode those tests exist to catch).
Both ordering mutation checks re-run: removing the chunk-pool wait →
determinism test fails 8/8; disabling the tail counter → tail-carry test
fails 8/8. Restored: green.
Mid-stream destroy abuse test passes in both configurations.
test_multiarray_gpu + two_streams bench, both codecs.

Commit guide

Parallelize staging copy in append — the copy helper pool; suite green
alone at this commit.
Queue sink delivery on a worker — the delivery worker, oldest-first
queue, teardown drain-and-join.
Add mid-stream destroy abuse test — destroy with queued work and
stalled sink IO.
Trim comments; update doc status — comment pass; doc status now
reflects steps 1–4 merged.
Make sink pending-byte counters atomic — review finding: queued_bytes
is now written by the delivery worker while the producer reads it for
backpressure; retired_bytes was already atomic.
Add prefill option to stream benches — the --prefill flag used by
the table above, so those columns are reproducible from this branch.
Address worker review nits — metric doc updated for worker-side
overlap; worker context-set failure now warns.

codecov · 2026-06-13T00:16:34Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines	Coverage Δ
src/zarr/shard_pool_fs.c	`60.90% <100.00%> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

nclack · 2026-06-13T01:02:51Z

Reviewed against docs/gpu-orchestration.md §5: ownership/ordering audit of the delivery worker and the staging-copy split, plus a fresh L40 run (suite in both configs, cross-validate stress x20, both ordering mutation checks, destroy abuse test, ASAN pass, perf A/B against main).

Verdict: ship with fixes. The worker design holds: I could not find a path that breaks the one-job-per-slot ownership rule, the join-before-refill discipline, or oldest-first delivery. Every field the queued drain touches (slot handoff, agg slot, shard cursors, tail buffers/counter, its metric fields) is single-writer with mutex handoff at enqueue/join; producer-side state during a drain is disjoint, including the tail-gate split (kick_seq producer-only, tail_seq/flag worker-only). Teardown's stop_join-before-release_all order is right and the documented hazard is real (a late publish would move the flag backwards past parked GEQ waits); run-out can't deadlock because the worker drains oldest-first, so each queued drain's gate threshold is already satisfied. Depth-1 and multiarray structurally cannot queue (enqueue is gated on SCHEDULE_PIPELINED; multiarray's NULL gate_ord pins DRAIN_AFTER_KICK at array-state init). The copy split is byte-exact (threadpool_for_n covers [0,n) exactly and blocks), and the staging generation can't flip mid-copy because the flip happens in dispatch, after the copy returns.

Findings, ranked:

Fix before merge — new cross-thread race on the sink pending-byte counter. shard_pool_fs.queued_bytes is a plain uint64_t (src/zarr/shard_pool_fs.c:24) incremented on the write path (:105, :162), which this PR moves onto the delivery worker — while the producer reads it every epoch boundary for backpressure (src/gpu/stream.c:196,209 via pool_fs_pending_bytes, shard_pool_fs.c:331). retired_bytes is already _Atomic; queued_bytes needs the same (relaxed is fine). Harmless on today's targets but it is exactly the bug class this migration exists to kill, and it lands with the same change that creates it.
Description vs branch — the prefill control isn't in the PR. The performance table's prefill columns and the medfmt-zstd exoneration rest on a --prefill bench flag that this branch doesn't contain (grep -r prefill bench/ is empty here and on main). The control's logic is sound — and my own runs back it up more directly: on two different L40 nodes the medfmt-zstd scenario was dead even without prefill (base 4.29-4.34 vs PR 4.34-4.37 GiB/s, 7 interleaved reps), i.e. the regression is condition-dependent contention, not intrinsic to the change. But none of that is reproducible from what's submitted; land the bench commit with this PR or ahead of it under Streaming benches spend ~half their time generating input; throughput conflates harness and library #150.
Stale metric doc. src/types.stream.h:42-48 still says flush_stall "wraps the entire schedule_d2h_drain call" and warns about double-counting against kick_sync_stall/sink. On the pipelined path those now accrue on the worker (overlapped time, not producer stall) and flush_stall is just the join wait — the comment now misleads anyone reading bench output. The comment-trim commit updated schedule.c but missed this header.
Nits. delivery_main ignores cuCtxSetCurrent's return (src/gpu/schedule.c:325) — a log line would make a bad context diagnosable instead of every drain erroring mysteriously. gpu_delivery_join loops forever if ever called for an empty slot (:413-415); an assert(job.queued || job.done) would make misuse loud. gpu_delivery_stop_join drops job results by design — worth one line in its header that sink-side errors persist via has_error.
Test gap, non-blocking. test_destroy_midstream is genuinely sensitive (a missed join in destroy's flush, a use-after-free of queued refs, and it gates sink IO the same way Drain sink IO in CPU writer_flush #146's test did) — but neither subtest reaches stop_join with a job still queued, because the auto-flush joins everything first. The run-out-before-release_all line (src/gpu/stream.init.c:345) is exercised only when a flush fails with a queued job, which needs fault injection. Fine as a follow-up.

Validation (L40, sm_89, fresh builds at 7f45e1e):

ctest -E "(s3)": 48/48 RelWithDebInfo, 48/48 Debug; no new build warnings.
test_cross_validate x20 RelWithDebInfo + x5 Debug (host-rule asserts live): 0 failures.
test_destroy_midstream x8 Rel + x2 Debug: 0 failures. test_multiarray_gpu x4: 0 failures.
ASAN build (host code instrumented): destroy_midstream x3, cross_validate, test_batch — clean.
Mutation checks re-run: dropping the chunk-pool produce wait (the Fix corrupted output from GPU compressed writes #140 edge) -> cross_validate red 8/8 (4 subtests); disabling the tail-counter wait -> red 8/8 with page-aligned tail-carry failing each time; restored -> green 3/3. The gates still bite under the new threads.
Perf A/B vs main (prebuilt base, interleaved reps, --frames 300, GiB/s):
- 256cube_single zstd: 5.87 -> 7.11 (+21%) [PR: +18%]
- 256cube_single none: 6.11 -> 8.23 (+35%) [PR: +41%]
- medfmt_single none: ~6.1 -> ~8.5 (+38%, two nodes) [PR: +29%]
- medfmt_single zstd: 4.31 -> 4.35 (flat, two nodes) [PR: -18%; did not reproduce here — see finding 2]
- 256cube_two_streams: zstd 4.24, none 8.87 (PR post-change: 4.21 / 9.13)

Two small, behavior-neutral cleanups flagged during the orchestration-migration reviews (#143/#148/#149/#151). - **Remove dead `flush.helpers.h`.** The header had no remaining users after the scheduler work; deleted the file, its two `#include`s, and the CMakeLists entry. Verified no symbol it declared is referenced anywhere. - **Drop a redundant `POOL_FILLED` record.** On the drain-after-kick (multiarray sync) schedule, `schedule_accumulate_epoch` recorded the pool-filled edge and then `schedule_flush_accumulated` recorded it again on the same stream. Skip the first on that path; the flush-accumulated release is the load-bearing one before its kick. The pipelined and drain-before-kick paths are unchanged. Validation (L40, sm_89, RelWithDebInfo): `ctest -E "(s3)"` 48/48; zero new warnings. --------- Co-authored-by: Nathan Clack <nclack@biohub.org>

Nathan Clack added 4 commits June 13, 2026 00:07

Parallelize staging copy in append

c0c4980

Queue sink delivery on a worker

54e5f1c

Add mid-stream destroy abuse test

8ba1a67

Trim comments; update doc status

7f45e1e

Nathan Clack added 3 commits June 13, 2026 01:05

Make sink pending-byte counters atomic

3410645

Add prefill option to stream benches

c1eb0ce

Address worker review nits

885f573

nclack merged commit 5af7ead into acquire-project:main Jun 13, 2026
8 checks passed

nclack deleted the gpu-workers branch June 13, 2026 03:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move staging copies and sink delivery to worker threads#151

Move staging copies and sink delivery to worker threads#151
nclack merged 7 commits into
acquire-project:mainfrom
nclack:gpu-workers

nclack commented Jun 13, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 13, 2026 •

edited

Loading

Uh oh!

nclack commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nclack commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Performance (L40, 5 runs per number, base = main before this PR)

Correctness evidence (L40, sm_89)

Commit guide

Uh oh!

codecov Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

nclack commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nclack commented Jun 13, 2026 •

edited

Loading

codecov Bot commented Jun 13, 2026 •

edited

Loading