[Data] Add per-stage training-thread blocking attribution to iter_batches by OneSizeFitsQuorum · Pull Request #64183 · ray-project/ray

OneSizeFitsQuorum · 2026-06-17T15:50:12Z

Summary

Decomposes iter_total_blocked_s into per-stage contributions so users can see where the training thread is blocked. Closes #64132, part of RFC #63911.

Design

Each pipeline stage records an independent (start_s, end_s) window via TimeSpan, a lightweight value yielded by Timer.timer(). The training thread records its blocked window around next(batch_iter). Per-stage attribution = overlap of the two windows:

overlap = min(stage.end_s, blocked_end) - max(stage.start_s, blocked_start)

Invariant: sum(iter_blocked_*) approximates iter_total_blocked_s. It is a lower bound in the common case, but split fetch stages (production_wait + data_transfer) may overlap by design for multi-block batches — see _attribute_blocked_time docstring for the full rationale.

The fetch stage is split into two: production_wait spans the upstream wait (blocked on the data pipeline), and data_transfer spans the cross-node transfer via ray.get(). This lets users differentiate data production stall from cross-node transfer stall.

New Prometheus Metrics (9)

Metric	Description
`data_iter_blocked_production_wait_seconds`	Upstream wait (blocked on data pipeline)
`data_iter_blocked_data_transfer_seconds`	Cross-node transfer (ray.get)
`data_iter_blocked_batching_seconds`	Batch creation
`data_iter_blocked_format_seconds`	Format conversion
`data_iter_blocked_collate_seconds`	collate_fn
`data_iter_blocked_finalize_seconds`	finalize_fn
`data_iter_batches_total`	Batches delivered to training loop
`data_iter_rows_total`	Rows delivered (exit-side, respects drop_last)
`data_iter_total_seconds`	Total iterator wall-clock time

Per-rank Prometheus label (rank extracted from dataset tag's trailing split_N) is deferred to a follow-up PR — see TODO in stats.py.

Changes

interfaces.py — TimeSpan, BlockFetchTiming (per-block: production_wait + data_transfer), BlockFetchResult (block + Optional timing), BatchTimings (per-batch: all 6 stages), merge_fetch / _merge_span
iter_batches.py — _attribute_blocked_time() overlap computation with docstring documenting known limitations (split-fetch overlap, reorder buffer unattributed gap)
util.py — resolve_block_refs returns Iterator[BlockFetchResult]; production_wait captured as a bare TimeSpan (no accumulation, to avoid double-counting with prefetch_batches_locally.get_next_ref_bundle); data_transfer accumulated into iter_get_s
stats.py — 9 Prometheus Gauges, Timer.timer() yields a fresh TimeSpan per call (thread-safe), PipelineStage enum, get_blocked_timer(stage) via match, IterStatsSummary per-stage breakdown
iterator.py — Final metrics flush on iteration end

Example Output

Per-stage training-thread blocked time breakdown:
    * block fetch (ray.get): 1.74ms
    * batching: 428.36us
    * format: 4.1ms
Total batches consumed: 10
Total rows consumed: 100

Performance

~6 μs/batch overhead. At 10k batches/sec: <0.04% impact on a 10ms training step.

Tests

56 unit tests (iter_batches) + 7 unit tests (stats/summary) — all passing
Integration verified with Ray Train v2 DataParallelTrainer + TorchTrainer

References

Closes #64132 · Part of RFC #63911

cc @JasonLi1909

gemini-code-assist

Code Review

This pull request introduces detailed per-stage wall-clock timing and statistics tracking (including fetch, batching, formatting, collating, finalization, and order restoration) for Ray Data batch iteration, along with corresponding Prometheus metrics and unit tests. The review feedback suggests updating the type annotation of dataset_tag in _create_iteration_tags to Optional[str] to align with the underlying function and avoid static type checking errors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

JasonLi1909

Thank you for the PR @OneSizeFitsQuorum! The core idea of capturing the start and end of each pipeline stage to compute overlap with the training thread stall makes sense. That said, let's keep the scope of this PR to just the training thread attribution metrics. We can compartmentalize other metrics for a later PR, but at a glance some of them such as data_iter_prefetch_queue_depth are more of an implementation detail, which we should avoid. Also it would be great if you could simplify the PR description so it's shorter and easier to understand. Thanks!

…e observability to iter_batches Implements overlap-based latency attribution for Ray Data's iter_batches pipeline, addressing ray-project#64132 and RFC ray-project#63911. Each pipeline stage (fetch, batching, format, collate, finalize, restore_order) records an independent (start_s, end_s) time window. The training thread captures its own blocked window around next(). Attribution per stage is the overlap of the two windows, correctly handling prefetch > 1. New Prometheus metrics (14 total): - data_iter_blocked_{fetch,batching,format,collate,finalize,restore_order}_seconds - data_iter_batches_total, data_iter_rows_total - data_iter_total_seconds, data_iter_restore_order_buffer_peak - data_iter_shuffle_buffer_{rows,compactions_total,compaction_seconds} - data_iter_prefetch_queue_depth Also adds: - Per-stage breakdown rendering in IterStatsSummary.to_string() - Rank extraction from dataset tags for Prometheus labels - Final metrics flush on iterator completion Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

OneSizeFitsQuorum · 2026-06-20T04:34:03Z

Thanks for the review @JasonLi1909! I've made the following changes:

Reduced scope — removed the 5 implementation-detail metrics (data_iter_prefetch_queue_depth, data_iter_restore_order_buffer_peak, data_iter_shuffle_buffer_*). The PR now only contains the 9 core training-thread attribution metrics. The removed metrics are preserved on a separate branch for a future PR.
Simplified PR description — cut it down from ~300 lines to ~80 lines, focusing on design, the 9 metrics, and test results.

Also fixed a regex issue flagged by Cursor Bugbot: the rank extraction now uses the last split_N match (instead of the first) to avoid false matches when the user-defined dataset name contains split_<digits>.

Reverts batcher.py changes that were only needed for the shuffle buffer metrics which have been removed from this PR's scope per reviewer feedback. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

JasonLi1909

Hey @OneSizeFitsQuorum, thanks for the cleanup. I've left some more comments. Overall, make sure to include docstrings, comments, and that names of data classes/functions communicate their purpose. I'll take another look after you're done, thanks!

Per reviewer feedback, restore_order is an implementation detail rather than an actionable user-facing metric. Reverts restore_original_order() to the original simple for-loop and removes the data_iter_blocked_restore_order_seconds Prometheus metric along with all related fields, exports, and tests. The PR now exposes 8 core metrics (5 blocked stages + batches/rows/total). Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Per reviewer feedback, consolidates the dual timing mechanism: - StageTiming now supports context manager protocol (__enter__/ __exit__) to automatically capture start_s/end_s - Timer gains start_s/end_s fields populated by timer() - Pipeline functions (resolve_block_refs, _format_batch, _collate_batch, _finalize_batch) use nested context managers instead of redundant perf_counter() + _record_stage_window() - resolve_block_refs always returns BlockWithTiming, removing the record_timings parameter, Union types, and isinstance branching - Removed _record_stage_window helper (no longer needed) Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

The fetch timing window in resolve_block_refs now spans from when we start waiting for the upstream iterator (blocked on the data pipeline) through ray.get() completion. This captures cross-node transfer and upstream production delays, giving a more complete picture of what blocks the training thread. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Per reviewer feedback, adds clear docstrings to: - BatchTimings (per-batch pipeline-stage timing windows) - BlockWithTiming (resolved block with fetch timing) - BatchTimings.merge_fetch() (multi-block fetch window expansion) - BatchTimings.stages() (stage name/timing iterator) - _report_batch_timings() (overlap-based attribution algorithm) Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

_BatchingIterator can receive blocks from paths other than resolve_block_refs (e.g., doctest examples that pass raw pyarrow Tables). Restore the isinstance check to handle both BlockWithTiming and raw Block objects gracefully. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Per reviewer feedback, removed isinstance check and Union type from _BatchingIterator by ensuring all entry points wrap blocks in BlockWithTiming: - batch_blocks() now wraps raw blocks in BlockWithTiming with zero timing before passing to blocks_to_batches() - _BatchingIterator now assumes all blocks are BlockWithTiming - Removed Union import from util.py This provides a uniform type throughout the batching pipeline while maintaining backward compatibility for external callers of batch_blocks(). Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

- test_util.py: Updated test_resolve_block_refs to expect BlockWithTiming objects and test_blocks_to_batches to wrap raw blocks - block_batching.py: Changed generator expression to map() to avoid holding references to blocks, fixing test_chained_transforms_release_intermediates Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

JasonLi1909 · 2026-06-25T11:10:54Z

+    """
+
+    block: Block
+    timings: BatchTimings = field(default_factory=BatchTimings)


Question: The data model feels off here, an entire block will not populate all of the BatchTimings. The data models should cleanly reflect what is possible and intrinsic to the component, is there a better way to organize this?

JasonLi1909 · 2026-06-25T11:20:43Z

+    def _wrap_block(b):
+        return BlockWithTiming(block=b, timings=BatchTimings())
+
+    wrapped_blocks = map(_wrap_block, blocks)


Is there a way to avoid this? This is purely a shim for the typing. Somehow, we should make the timing optional into BatchIterator. If it's too complex and out of scope, leave a detailed TODO comment for the future.

JasonLi1909 · 2026-06-25T11:45:43Z

+            ("finalize", self.finalize),
+        )
+
+    def merge_fetch(self, other: "BatchTimings") -> None:


Two things here:

The naming here is confusing as this function is used to merge timings at the block level but the type is BatchTimings. It reads as if batches rather than blocks are being merged. See comment below about the data model of BlockWithTiming.

If we split merge into data wait + transfer, we will need to change this to merge the stages properly. This may result in some double counting between the wait and transfer timers in the case of batches that span multiple blocks, so we will need to be careful here. For instance:

Block 1: wait=[0,1], transfer=[1,2]
Block 2: wait=[2,3], transfer=[3,4]

Say a batch spans these two blocks, then when we merge these blocks we get
wait=[0,3], transfer=[1,4], resulting in double counting. We will need to avoid this somehow

Per reviewer feedback, consolidates the dual timing mechanism: - New TimeSpan dataclass (pure data: start_s, end_s, duration) - Timer.timer() now yields a thread-local TimeSpan (safe for concurrent use from thread pools) - Removed thread-unsafe start_s/end_s fields from Timer - New _timed() helper yields Optional[TimeSpan] (None when no stats) - Removed StageTiming class entirely - BatchTimings fields are now Optional[TimeSpan] (None = stage didn't execute, eliminating magic-number 0.0 checks) - Pipeline functions use single _timed() context instead of nested StageTiming + Timer.timer() context managers - _merge_span helper handles Optional[TimeSpan] merge logic Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Per reviewer feedback, splits the single fetch stage into two distinct stages to differentiate data production stall vs cross-node transfer stall: - production_wait: blocked on upstream data pipeline (next on ref bundle iterator) - data_transfer: cross-node transfer via ray.get() New Prometheus metrics: - data_iter_blocked_production_wait_seconds - data_iter_blocked_data_transfer_seconds Removed: data_iter_blocked_fetch_seconds (replaced by the two above) resolve_block_refs now times each sub-stage independently using _timed() with the appropriate Timer (iter_get_ref_bundles_s for production_wait, iter_get_s for data_transfer). Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Per reviewer feedback, refines the data model to accurately reflect what each component populates: - New BlockFetchTiming: only production_wait + data_transfer (what a single block actually has) - New BlockFetchResult: replaces BlockWithTiming, carries block + Optional[BlockFetchTiming] (None when no fetch timing) - Removed BlockWithTiming (data model was inaccurate — it carried full BatchTimings but only fetch was populated) - Moved num_rows from BatchTimings to BatchMetadata (it's not a timing, it's batch metadata) - batch_blocks() shim simplified: BlockFetchResult(block=b) with fetch=None instead of creating fake BatchTimings - merge_fetch now accepts BlockFetchTiming (not BatchTimings) Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Per reviewer feedback, replaces string-based stage identifiers and getattr with type-safe alternatives: - New PipelineStage(Enum) in stats.py (avoids circular import) - BatchTimings.stages() returns Tuple[PipelineStage, Optional[TimeSpan]] - DatasetStats.get_blocked_timer(stage) uses match statement (no getattr, no string formatting) - _StatsActor stores blocked gauges in Dict[PipelineStage, Gauge] instead of individual attributes - update_iteration_metrics iterates the dict instead of calling 6 individual .set() methods - Renamed _report_batch_timings → _attribute_blocked_time (more descriptive of what the method does) Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

OneSizeFitsQuorum · 2026-06-26T16:35:31Z

@JasonLi1909 I've addressed all your feedback across 4 commits (TimeSpan consolidation, fetch split, data model refactoring, PipelineStage enum). Cursor Bugbot then flagged 3 remaining issues that I'd like to discuss before fixing:

1. Nested timer double-counts iter_get_ref_bundles_s

When prefetch_batches > 0, resolve_block_refs wraps next(block_ref_iter) with _timed(iter_get_ref_bundles_s), but prefetch_batches_locally's get_next_ref_bundle() already times the same next(ref_bundles) call with the same Timer. The upstream wait gets accumulated twice.

Proposed fix: In resolve_block_refs, create the production_wait TimeSpan manually (without _timed()) so it doesn't accumulate to iter_get_ref_bundles_s. The cumulative stats remain handled by get_next_ref_bundle()'s existing timer. The per-batch TimeSpan is only used for overlap attribution, not accumulation. Does this separation make sense?

2. Split fetch spans overlap, violating sum ≤ total

After splitting fetch into production_wait + data_transfer, multi-block span merge creates overlapping windows (stages are interleaved across blocks). For example:
```
Block 1: prod [0,100], xfer [100,200]
Block 2: prod [200,300], xfer [300,400]
Merged: prod [0,300], xfer [100,400] ← overlap [100,300]
```

My view: In a pipelined system, multiple stages can be active simultaneously (data_transfer for block N + production_wait for block N+1). Attributing the same blocked interval to both stages is semantically correct — the training thread was indeed blocked while both stages were running. I'd relax the invariant to allow sum > total and document this behavior. I also considered using interval lists (List[TimeSpan]) for precise non-overlapping merge, but that would significantly increase complexity. Do you agree with the relaxed invariant, or do you prefer a different approach?

3. Reorder buffer skews blocked attribution

With preserve_order=True, stage timings are recorded when format/collate finish, but the batch may sit in restore_original_order's buffer before being yielded. The blocked window includes this reorder wait, but no stage TimeSpan covers it, so it becomes unattributed.

My view: Since you asked to remove restore_order timing, this unattributed gap is expected. If users see high unattributed ratios, they can investigate further. Is this acceptable, or would you prefer a different approach?

Per reviewer feedback, the per-streaming-split-worker `rank` label on iteration metrics is out of scope for this PR. Reverts the rank label addition so existing iteration gauges keep their master label set (`("dataset",)`), and leaves a TODO for a follow-up PR to add `rank` across all iteration metrics (including the new blocked-attribution gauges) at once. - Removes module-level `_create_iteration_tags` and the `_StatsActor._create_iteration_tags` method - Reverts `iter_tag_keys` from `("dataset", "rank")` to `("dataset",)` - `update_iteration_metrics` uses `self._create_tags(dataset_tag)` again - Drops the `re` import (only used by the removed function) - Removes `test_create_iteration_tags_extracts_rank` and drops `rank` from `expected_tags` in `test_update_iteration_metrics_exports_new_iter_metrics` Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

When prefetching is enabled, `iter_get_ref_bundles_s` was accumulated twice per block: once by `prefetch_batches_locally.get_next_ref_bundle` (which times `next(ref_bundles)`) and again by `resolve_block_refs` (which timed `next(block_ref_iter)` with the same Timer). The upstream wait was therefore double-counted in Prometheus `data_iter_get_ref_bundles_seconds`. `production_wait` is now captured as a bare `TimeSpan` (start/end `perf_counter` only) for overlap attribution, without calling `Timer.add()`. The cumulative `iter_get_ref_bundles_s` remains driven solely by `get_next_ref_bundle` when prefetch is on, and is not tracked when prefetch is off — matching master behavior. `data_transfer` (`iter_get_s`, the `ray.get()` call) is unaffected. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Adds `test_resolve_block_refs_does_not_accumulate_ref_bundles_timer` to guard against re-introducing the nested-timer double-counting bug fixed in the previous commit. The test asserts that: - `iter_get_ref_bundles_s` stays at 0 after `resolve_block_refs` processes a slow upstream iterator (i.e. the function does not accumulate into that Timer; `prefetch_batches_locally` owns it). - `production_wait` `TimeSpan` is still captured per block, so overlap-based blocked attribution continues to work. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

… TODO Addresses review feedback from JasonLi1909 and Cursor on design-level threads that don't require code changes, only clarification: - block_batching.py: expand the BlockFetchResult wrapping comment into a detailed TODO explaining why the typing shim exists (uniform iterator type across prefetch/non-prefetch paths) and how a future refactor could remove it. Per Jason's request to leave a TODO when the fix is out of scope. - iter_batches.py: relax the `sum(iter_blocked_*) ≤ iter_total_blocked_s` invariant in `_attribute_blocked_time`'s docstring to "approximates", and document two known cases that can push the sum above or below total by design: * Split fetch stages overlap (multi-block batches: production_wait for block N+1 concurrent with data_transfer for block N). * Reorder buffer wait is unattributed under preserve_order + prefetch (per-stage TimeSpan recorded at format/collate completion, before the batch leaves restore_original_order). These are documented as expected behavior, not bugs. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

…ed_timer Closes five test-coverage gaps identified during PR review: - TestTimerTimerSpan (test_stats.py): verifies Timer.timer() yields a fresh TimeSpan per call and accumulates its duration via add(); also covers _timed(None) skipping perf_counter and _timed(Timer) yielding a real span. - TestGetBlockedTimer (test_stats.py, parametrized over 6 stages): verifies DatasetStats.get_blocked_timer(stage) returns the correct Timer attribute via the match statement (previously only covered indirectly through update_iteration_metrics). - TestMergeFetch.data_transfer cases (test_iter_batches.py): extends existing production_wait-only tests to cover data_transfer merging (multiple blocks, overlapping windows, independence from production_wait, None src preservation). - test_resolve_block_refs_accumulates_data_transfer_timer (test_util.py): pairs with the existing ref_bundles non-accumulation test to verify the data_transfer (iter_get_s / ray.get) path IS accumulated and a per-block data_transfer TimeSpan is captured. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Fixes CI lint failures from unformatted code in the Cursor-2 fix and the new TestMergeFetch data_transfer cases. Black 22.10.0 (the version pinned in .pre-commit-config.yaml) collapses two patterns onto single lines: - util.py: `TimeSpan(...) if stats else None` ternary inside the prod_span assignment. - test_iter_batches.py: `BlockFetchTiming(data_transfer=TimeSpan(...))` constructor calls in the new merge tests. No behavior change. ruff and black --check both pass on all changed files. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

JasonLi1909 · 2026-06-27T08:26:55Z

+    BATCHING = "batching"
+    FORMAT = "format"
+    COLLATE = "collate"
+    FINALIZE = "finalize"


"IterationStage" is a better name for this. Also, document with a description for each stage. "Pipeline stages for blocked attribution" is vague- what pipeline stages? what blocked attribution?

doc added. BTW, I think either name works—they each emphasize a slightly different aspect, and both make sense in this context. Since you seem to prefer it, I've changed it to iteration stage.

…mments, remove gauge dict Addresses 5 new review threads from JasonLi1909 (2026-06-27) plus comment-hygiene cleanup: Renames (Jason #5, #6): - PipelineStage → IterationStage (stats.py, interfaces.py, tests) - _timed → _maybe_time (stats.py, util.py, tests) - IterationStage docstring expanded with per-stage descriptions and Prometheus label mapping note. Comment cleanup: - block_batching.py: typing-shim TODO compressed from 8 to 2 lines, "Out of scope for this PR" removed (Jason #4). - stats.py: rank-label TODO compressed, "this PR's scope" removed. - stats.py: Timer.timer() docstring "thread-local" → "fresh per call" (the span is a fresh instance, not TLS). - util.py: resolve_block_refs docstring Note compressed, "matching master behavior" removed; redundant inline comment replaced with a one-liner pointing at the docstring. - util.py: _pending_timings field gets a one-line comment. - iter_batches.py: _attribute_blocked_time docstring — the 15-line invariant/limitations section replaced with a brief TODO noting the design is open (interval lists as a future option). Structure simplification (Jason #7, #8): - _blocked_gauges: Dict[IterationStage, Gauge] replaced with 6 individual Gauge attributes (iter_blocked_*_s), matching the existing iteration gauge style. - for-loop in update_iteration_metrics replaced with 6 individual .set() calls. - get_blocked_timer() match statement retained (maps stage → Timer on DatasetStats, orthogonal to gauge storage). - Test fixture updated: FakeGauge setup uses individual attributes instead of a dict. All affected tests pass (43 in test_stats + test_util + test_iter_batches). black 22.10.0 and ruff both clean. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

…ocked_time Follow-up to the IterationStage rename (commit 4ac2b5c) — the enum was renamed but a handful of docstrings still said "pipeline stage". Also tightens _attribute_blocked_time per review feedback that it was still too long. - interfaces.py: 4 docstring references "pipeline stage" → "iteration stage" (BatchTimings class doc, stages() doc, BatchMetadata.timings attribute doc). - stats.py: get_blocked_timer docstring "pipeline stage" → "iteration stage". - iter_batches.py: _attribute_blocked_time docstring rewritten — drops the 3 prefetch-config bullets (obvious from the formula), drops the thread-safety paragraph (implementation detail), and folds the invariant + two limitations into a single TODO noting the design is open (interval lists as a future option). Also fixes "pipeline stage" → "stage" here. Net ~20 lines shorter. No behavior change. black 22.10.0 + ruff clean; 25 affected tests pass. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit b13112d. Configure here.}

Previously `iter_total_s.add(...)` and the final `_StatsManager.update_iteration_metrics` flush ran only after `yield from batch_iterator` completed normally. On early exit (e.g. `break` in the training loop), the `finally` block only shut down the executor, so `iter_total_s` stayed at zero and Prometheus never got a final flush despite partial iteration. Moves both calls into the `finally` block (before `_on_iteration_end` to keep total time excluding executor shutdown, matching master's placement). Normal completion is unaffected; early exit now records the wall-clock time up to the break and flushes metrics. Note: mid-run staleness (iter_total_s reads 0 during iteration because add() only runs at the end) is a separate pre-existing issue and remains a follow-up. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Six style fixes from self-review of test code: 1. Rename TestTimerTimerSpan → TestTimerSpan (the "Timer" duplication was awkward; the class tests Timer.timer() returning TimeSpan). 2. Remove stale `assert "restore order" not in text` from test_iter_stats_to_string_omits_zero_stages — the restore_order stage was removed in an earlier commit, so the assertion was vacuously true. 3. Merge the separate for-loop that set up blocked-gauge FakeGauges into the main gauge-setup loop in test_update_iteration_metrics_exports_new_iter_metrics, and drop the now-redundant "Blocked gauges are stored as individual attributes" comment (blocked gauges are just attributes now, like the others). 4. TestMergeFetch: the five original methods passed `BatchTimings()` as the `src` argument to `merge_fetch()`, whose signature takes `BlockFetchTiming`. It worked by duck-typing but was the wrong type. Switched to `BlockFetchTiming(production_wait=TimeSpan(...))` to match the signature and the style of the newer data_transfer tests. 5. test_overlap_invariant_sum_leq_total: docstring said "always holds" but the invariant was relaxed to "approximates" with a TODO noting cases that violate it. Tightened to "holds for non-overlapping stages" to reflect what the test actually exercises. 6. _make_span / _make_batch_with_timings: replaced the 0.0-as-sentinel pattern (start==0.0 and end==0.0 → None) with explicit Optional parameters defaulting to None. Mirrors the source-code cleanup Jason requested for the 0.0 sentinel. All affected tests pass (38). black 22.10.0 + ruff clean. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Review feedback noted the documentation read as agent-generated — excessive on trivial functions, missing on core ones, and heavy on Sphinx markup / em-dashes / "Note:" callouts that humans don't write. interfaces.py: - BlockFetchTiming: drop lifecycle sentence ("Produced by ... merged into ... by ..."). A 2-field dataclass doesn't need it. - BlockFetchResult: tighten the None-description. - BatchTimings: drop "overlap-based attribution" rationale and the :meth: cross-reference; keep just what the fields mean. - stages(): drop the redundant docstring restating the signature. - _merge_span: 5-line docstring → 1 line. The code is 7 lines. - BatchMetadata.timings: "Iteration-stage timing windows for this batch" → "Per-stage timing windows." stats.py: - IterationStage: drop the 3-line overlap-mechanism explanation (that's _attribute_blocked_time's job, not the enum's). - TimeSpan: drop "Created by Timer.timer and carried per-batch for overlap-based blocked attribution" — system-level rationale doesn't belong on a 2-field dataclass. - _maybe_time: 3-line docstring → 1 line; drop :class: markup and the em-dash aside. util.py: - resolve_block_refs: drop "Note:" callout prefix; drop :func: markup on prefetch_batches_locally.get_next_ref_bundle. - Inline comments: drop "(1)"/"(2)" numbering; compress the ray.get TODO to one line. iter_batches.py: - _iter_batches: add a short docstring (was missing). - _attribute_blocked_time: drop "recorded by background threads" implementation detail; drop the _iter_batches cross-ref; drop the "Interval lists would give precise..." suggestion from the TODO; plain-backtick Args instead of ``-wrapped. - get_next_batch_context / yield_batch_context: add short docstrings (were missing). Net: -22 docstring lines, no behavior change. black 22.10.0 + ruff clean; 38 affected tests pass. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

The inline comment "# production_wait: upstream wait (not accumulated; see docstring)" pointed back at the docstring above it — a circular reference that reads as filler. State the fact inline instead. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

…mings Same LLM-tell pattern as the "see docstring" pointer fixed in the previous commit — a "(see <method>)" cross-reference that reads as filler. The reader can find the consumer by grepping. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

The previous docstring cleanup pass missed this one — Timer.timer() still had ``:class:`TimeSpan`` and ``:meth:`add`` Sphinx markup while the rest of the PR's internal docstrings use plain backticks. Switched to plain backticks for consistency. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Naming (util.py): - prod_start_s / prod_span → production_wait_start / production_wait_span - xfer_span → data_transfer_span Variable names now match the BlockFetchTiming field names they're assigned to, so readers don't have to map abbreviations. Stale test names (test_iter_batches.py): - TestReportBatchTimingsEdgeCases → TestAttributeBlockedTimeEdgeCases (method was renamed _report_batch_timings → _attribute_blocked_time in an earlier commit; the test class name was missed). - _make_report_iterator → _make_test_iterator (same stale "report"). Test hygiene (test_util.py): - Move `from ray.data._internal.stats import DatasetStats` from inside two test functions to a top-level import. - Trim the 8-line regression-test docstring to 2 lines (the source docstring already has the rationale). - Drop the "Pairs with <other test>" cross-reference (brittle to renames, reads as filler). - Drop the weak `assert stats.iter_get_s.get() >= 0.0` (always true for non-negative Timer values; wasn't verifying anything). The structural checks (data_transfer span exists per block) remain. Test hygiene (test_stats.py): - test_each_call_returns_fresh_span: drop the unnecessary `spans` list; use s1/s2 directly. - test_maybe_time_skips_when_timer_none: docstring claimed "skips perf_counter entirely" but the test only checks `span is None`; dropped the unverified claim. All affected tests pass (39). black 22.10.0 + ruff clean. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

ruff's isort check (--select I) flagged the imports I added (IterationStage, TimeSpan, _maybe_time) as out of alphabetical order. pre-commit runs `ruff --select I --fix --exit-non-zero-on-fix`, which fixes and exits non-zero — failing microcheck CI. Reordered to satisfy ruff's case-insensitive isort. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

OneSizeFitsQuorum requested a review from a team as a code owner June 17, 2026 15:50

gemini-code-assist Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread python/ray/data/_internal/stats.py Outdated

OneSizeFitsQuorum force-pushed the data/iter-batches-observability branch 2 times, most recently from b02e44d to 850ff58 Compare June 17, 2026 15:59

ray-gardener Bot added data Ray Data-related issues community-contribution Contributed by the community labels Jun 17, 2026

OneSizeFitsQuorum force-pushed the data/iter-batches-observability branch 2 times, most recently from faaeb93 to 14ec378 Compare June 18, 2026 01:57

JasonLi1909 reviewed Jun 19, 2026

View reviewed changes

OneSizeFitsQuorum force-pushed the data/iter-batches-observability branch from 14ec378 to 69c8cbb Compare June 20, 2026 03:21

OneSizeFitsQuorum changed the title ~~[Data] Add per-stage training-thread blocking attribution and pipeline observability to iter_batches~~ [Data] Add per-stage training-thread blocking attribution to iter_batches Jun 20, 2026

cursor Bot reviewed Jun 20, 2026

View reviewed changes

Comment thread python/ray/data/_internal/stats.py Outdated

OneSizeFitsQuorum force-pushed the data/iter-batches-observability branch from 69c8cbb to 071ebd6 Compare June 20, 2026 03:26

[Data] Remove unused ShufflingBatcher compaction tracking

f3935ab

Reverts batcher.py changes that were only needed for the shuffle buffer metrics which have been removed from this PR's scope per reviewer feedback. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>