[Core] Fail-fast on streaming generator replay object-count mismatch #64394
dragongu wants to merge 1 commit into
Code Review
This pull request introduces a mechanism to detect and fail streaming generator tasks whose replays produce a different number of objects than their initial successful execution, preventing silent hangs or data loss. This is achieved by adding FailStreamingGeneratorReplayIfInconsistent in TaskManager and introducing a new error type STREAMING_GENERATOR_REPLAY_INCONSISTENT. The review feedback suggests moving the inconsistency check to the very beginning of CompletePendingTask to prevent side-effects from inconsistent replays being written to memory before the task is failed. Additionally, it recommends simplifying the method signature by checking the execution count internally and adding defensive checks against empty return objects.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 54b50cd.
A streaming generator task that is replayed (e.g. for lineage reconstruction) after its first successful attempt can produce a different number of objects if the generator output is non-deterministic. Downstream consumers were already created against the original object count, so a replay with fewer objects hangs them on indices that are never produced, and a replay with more objects silently drops the extras beyond the pinned EOF. Detect the mismatch at the start of CompletePendingTask, before any return object is written to the store, and fail the task with a new STREAMING_GENERATOR_REPLAY_INCONSISTENT error type so the failure propagates through lineage instead of silently hanging or dropping data. Adds C++ unit tests covering fewer/more/same object counts on replay. Signed-off-by: dragongu <andrewgu@vip.qq.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
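As a rough illustration of the ordering this commit describes (compare the replay's object count with the count pinned by the first successful attempt, and fail before anything is written), here is a minimal Python sketch. It is a toy model under stated assumptions, not Ray code: ToyTaskManager, ReplayInconsistentError, and complete_pending_task are hypothetical names, and a plain dict stands in for the object store.

```python
from typing import Dict, List, Optional


class ReplayInconsistentError(Exception):
    """Hypothetical stand-in for the STREAMING_GENERATOR_REPLAY_INCONSISTENT error."""


class ToyTaskManager:
    """Toy model of the owner-side bookkeeping; not Ray's TaskManager."""

    def __init__(self) -> None:
        self.store: Dict[int, str] = {}  # object index -> value
        self.expected_count: Optional[int] = None  # pinned by the first success

    def complete_pending_task(self, returned: List[str]) -> None:
        actual_count = len(returned)
        # Check first, so a mismatching replay leaves the store untouched.
        if self.expected_count is not None and actual_count != self.expected_count:
            raise ReplayInconsistentError(
                f"replay produced {actual_count} objects, "
                f"expected {self.expected_count}"
            )
        if self.expected_count is None:
            # The first successful attempt pins the object count.
            self.expected_count = actual_count
        for index, value in enumerate(returned):
            self.store[index] = value


if __name__ == "__main__":
    manager = ToyTaskManager()
    manager.complete_pending_task([f"block-{i}" for i in range(12)])
    try:
        # A replay that yields 11 objects instead of the original 12.
        manager.complete_pending_task([f"replay-{i}" for i in range(11)])
    except ReplayInconsistentError as exc:
        print(f"task failed: {exc}")
```

In this model an 11-object replay after a 12-object first attempt raises before the store is modified, which is the property the early check is meant to provide. A replay with the same count overwrites the objects as usual.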

Description
When a streaming generator task's replay emits a different number of objects than the original successful attempt, downstream consumers break: fewer objects leaves the pipeline silently hanging (the scenario we hit in production), and more objects silently drops data past the pinned end-of-stream index (theoretical). Nothing detects this mismatch today. This PR fails the task fast with a new STREAMING_GENERATOR_REPLAY_INCONSISTENT error so the failure propagates through lineage to downstream consumers instead. The underlying object-count non-determinism is not fixed here; #64393 removes one Ray-internal source, and a future opt-in rows_per_block parameter would let users with unstable UDFs make the count deterministic.

Symptom. We hit this on an elastic-resource pipeline (worker pods can be preempted at any time) running read_parquet → tokenize → predict → write_parquet. The job occasionally hung: one last predict task sat in PENDING_ARGS_AVAIL for 5h+ with no error logged, blocked on object 12 from an upstream tokenize task that was never produced.

Root cause. A worker preemption had triggered lineage reconstruction of that tokenize task, but the replay produced only 11 objects instead of the original 12, so object 12 was never created.

We added logging in production to confirm this is systematic, not a one-off. For the same task, two attempts received identical input (in_rows/in_bytes equal) and produced an identical row count, yet emitted a different number of objects (one per output block): out_rows is identical across attempts but out_blocks is not, so a replay genuinely produces a different object count.

A task's object count is non-deterministic: block boundaries drift across attempts (see Why the object count drifts below), so a replay can emit fewer or more objects than the original successful attempt. Downstream consumers were created against the original count:
ObjectRefStream::InsertToStream.

Why this goes undetected: the first successful execution pins the stream's end-of-stream index (EOF), MarkEndOfStream early-returns afterward so EOF can never move, and CompletePendingTask only re-checks replays that fail with an application error; a normally-completing replay with a different count falls through unhandled.

Why the object count drifts. A streaming generator yields one object per output block, and OutputBuffer.next() cuts a new block once accumulated rows reach target_num_rows, a value derived from size_bytes(). The object count therefore tracks the running size estimate rather than the input, so anything that changes that estimate or the data across attempts shifts the boundaries and changes the count:
size_bytes() estimate (e.g. PandasBlock.size_bytes sampling without a fixed seed; addressed in [Data] Make PandasBlock.size_bytes deterministic #64393);

Fix. This PR surfaces the mismatch loudly: detect it in CompletePendingTask and fail the task with a new STREAMING_GENERATOR_REPLAY_INCONSISTENT error type, so the failure propagates through lineage to downstream tasks instead of leaving the pipeline stuck or dropping data silently. The behavior change is strictly hang/silent-loss → explicit error; no caller relies on the old silent behavior.

Implementation notes.
FailStreamingGeneratorReplayIfInconsistent helper called early in CompletePendingTask: before any return object is written to the store (so downstream consumers can't observe the inconsistent objects before the failure propagates) and before SetTaskStatus(FINISHED) (FailPendingTask RAY_CHECKs IsPending()).
expected_count comes from spec.NumStreamingGeneratorReturns() (recorded on the first successful attempt); actual_count is reply.streaming_generator_return_ids_size().

Related issues
None.
Additional information
Tests. Adds C++ unit tests in task_manager_test.cc that drive CompletePendingTask with controlled object counts (the deterministic repro of the bug):
STREAMING_GENERATOR_REPLAY_INCONSISTENT;

End-to-end illustration. How the hang arises in practice: a streaming generator whose object count drifts across attempts, forced through lineage reconstruction by killing the producing node. The unit tests above are the reliable repro; this script is timing-dependent (reconstruction depends on when the owner detects the lost objects) and is here for intuition only.
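Separately from the Ray-based script mentioned above, the drift itself can be illustrated without a cluster. The following is a minimal sketch, assuming a block cutter whose per-block row budget is derived from a size estimate. cut_blocks, replay_is_consistent, and all the numbers are hypothetical; this is not Ray Data's OutputBuffer and it does not reproduce the production logs.

```python
from typing import List


def cut_blocks(num_rows: int, est_bytes_per_row: int, target_block_bytes: int) -> List[int]:
    """Split num_rows into blocks of at most target_num_rows rows each.

    Toy stand-in for a size-driven block cutter; not Ray Data's OutputBuffer.
    """
    # The row budget per block comes from the size estimate, not from the input.
    target_num_rows = max(1, target_block_bytes // est_bytes_per_row)
    blocks: List[int] = []
    remaining = num_rows
    while remaining > 0:
        take = min(target_num_rows, remaining)
        blocks.append(take)
        remaining -= take
    return blocks


def replay_is_consistent(expected_count: int, actual_count: int) -> bool:
    """Model of the count comparison: a replay must match the pinned count."""
    return expected_count == actual_count


# Same 1200 input rows; only the (assumed) per-row size estimate differs.
first_attempt = cut_blocks(1200, est_bytes_per_row=100, target_block_bytes=10_000)
replay = cut_blocks(1200, est_bytes_per_row=90, target_block_bytes=10_000)

print(sum(first_attempt), len(first_attempt))  # 1200 12
print(sum(replay), len(replay))                # 1200 11
print(replay_is_consistent(len(first_attempt), len(replay)))  # False
```

Both attempts emit the same 1200 rows, but a slightly lower size estimate raises the per-block row budget from 100 to 111, so the replay yields 11 blocks instead of 12. This is the shape of mismatch the new check is meant to turn into an explicit error.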