[DNR] proto spill directly to disk#64377
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements direct (proactive) spilling for task return objects, allowing them to be written to external storage immediately after creation instead of waiting for memory pressure. This is achieved by adding a _spill_immediately option to tasks, which is plumbed from Python through Cython and the core worker to the raylet's PinObjectIDs RPC. The raylet then triggers immediate spilling via the LocalObjectManager. Feedback on the changes identifies a bug in the updated spill-timing calculation within LocalObjectManager::SpillObjectsInternal, where concurrent, out-of-order spill operations can lead to an underestimation of the total spill time. A transition-based tracking approach using the active worker count is recommended to resolve this issue.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| // Account for spill throughput here (shared by both the reactive | ||
| // threshold-based path and the direct/proactive spill path) so that | ||
| // all spills are reflected in spill_time_total_s_ and surfaced in | ||
| // stats. Use max(start, last_finish) to avoid double-counting | ||
| // concurrent spill operations. | ||
| auto now = absl::GetCurrentTimeNanos(); | ||
| spill_time_total_s_ += | ||
| (now - std::max(spill_start_time, last_spill_finish_ns_)) / 1e9; | ||
| last_spill_finish_ns_ = now; | ||
| } |
There was a problem hiding this comment.
Underestimation of Spill Time with Concurrent Spills
There is a subtle but significant bug in how spill_time_total_s_ is calculated when multiple spill operations run concurrently and finish out of order.
The Bug:
When a shorter spill (Spill B) starts after a longer spill (Spill A) but finishes before Spill A, it updates last_spill_finish_ns_ to its finish time. When Spill A subsequently finishes, std::max(spill_start_time, last_spill_finish_ns_) evaluates to Spill B's finish time. As a result, the entire active spilling duration of Spill A before Spill B finished is completely discarded, leading to a severe underestimation of the total spill time.
Concrete Example:
- t=0: Spill A starts (
spill_start_time= 0). - t=5: Spill B starts (
spill_start_time= 5). - t=8: Spill B finishes.
spill_time_total_s_increases by(8 - std::max(5, 0)) = 3seconds.last_spill_finish_ns_becomes 8. - t=10: Spill A finishes.
spill_time_total_s_increases by(10 - std::max(0, 8)) = 2seconds.last_spill_finish_ns_becomes 10.
Total accounted spill time: 3 + 2 = 5 seconds.
Actual active spilling time: 10 seconds (from t=0 to t=10, spilling was continuously active).
The interval [0, 5] is completely lost.
Recommended Solution:
Instead of using the complex and error-prone std::max logic with last_spill_finish_ns_, we can leverage the existing num_active_workers_ counter to track the union of active spilling intervals:
- When
num_active_workers_transitions from0to1(i.e., the first spill starts), record the start time:last_spill_start_ns_ = absl::GetCurrentTimeNanos();. - When
num_active_workers_transitions from1to0(i.e., the last active spill finishes), add the elapsed time tospill_time_total_s_:auto now = absl::GetCurrentTimeNanos(); spill_time_total_s_ += (now - last_spill_start_ns_) / 1e9;
This transition-based approach is 100% accurate, handles any level of concurrency/overlap, and completely eliminates the out-of-order callback bug.
Adds an opt-in `@ray.remote(_spill_immediately=True)` / `.options(...)` task option that spills a task's plasma-backed return objects to external storage immediately after creation, bypassing the reactive object_spilling_threshold. The flag rides on TaskSpec, is read in CoreWorker::SealReturnObject off the current task spec, forwarded on the PinObjectIDs RPC, and the raylet's HandlePinObjectIDs calls LocalObjectManager::SpillObjects right after pinning. Gated by the enable_direct_spill config (default true). Spill-timing accounting is moved into SpillObjectsInternal so direct spills are reflected in spill stats. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
3c17127 to
95b3b26
Compare
Description
Related issues
Additional information