Skip to content

[Data] Add per-stage training-thread blocking attribution to iter_batches#64183

Open
OneSizeFitsQuorum wants to merge 32 commits into
ray-project:masterfrom
OneSizeFitsQuorum:data/iter-batches-observability
Open

[Data] Add per-stage training-thread blocking attribution to iter_batches#64183
OneSizeFitsQuorum wants to merge 32 commits into
ray-project:masterfrom
OneSizeFitsQuorum:data/iter-batches-observability

Conversation

@OneSizeFitsQuorum

@OneSizeFitsQuorum OneSizeFitsQuorum commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Decomposes iter_total_blocked_s into per-stage contributions so users can see where the training thread is blocked. Closes #64132, part of RFC #63911.

Design

Each pipeline stage records an independent (start_s, end_s) window via TimeSpan, a lightweight value yielded by Timer.timer(). The training thread records its blocked window around next(batch_iter). Per-stage attribution = overlap of the two windows:

overlap = min(stage.end_s, blocked_end) - max(stage.start_s, blocked_start)

Invariant: sum(iter_blocked_*) approximates iter_total_blocked_s. It is a lower bound in the common case, but split fetch stages (production_wait + data_transfer) may overlap by design for multi-block batches — see _attribute_blocked_time docstring for the full rationale.

The fetch stage is split into two: production_wait spans the upstream wait (blocked on the data pipeline), and data_transfer spans the cross-node transfer via ray.get(). This lets users differentiate data production stall from cross-node transfer stall.

New Prometheus Metrics (9)

Metric Description
data_iter_blocked_production_wait_seconds Upstream wait (blocked on data pipeline)
data_iter_blocked_data_transfer_seconds Cross-node transfer (ray.get)
data_iter_blocked_batching_seconds Batch creation
data_iter_blocked_format_seconds Format conversion
data_iter_blocked_collate_seconds collate_fn
data_iter_blocked_finalize_seconds finalize_fn
data_iter_batches_total Batches delivered to training loop
data_iter_rows_total Rows delivered (exit-side, respects drop_last)
data_iter_total_seconds Total iterator wall-clock time

Per-rank Prometheus label (rank extracted from dataset tag's trailing split_N) is deferred to a follow-up PR — see TODO in stats.py.

Changes

  • interfaces.pyTimeSpan, BlockFetchTiming (per-block: production_wait + data_transfer), BlockFetchResult (block + Optional timing), BatchTimings (per-batch: all 6 stages), merge_fetch / _merge_span
  • iter_batches.py_attribute_blocked_time() overlap computation with docstring documenting known limitations (split-fetch overlap, reorder buffer unattributed gap)
  • util.pyresolve_block_refs returns Iterator[BlockFetchResult]; production_wait captured as a bare TimeSpan (no accumulation, to avoid double-counting with prefetch_batches_locally.get_next_ref_bundle); data_transfer accumulated into iter_get_s
  • stats.py — 9 Prometheus Gauges, Timer.timer() yields a fresh TimeSpan per call (thread-safe), PipelineStage enum, get_blocked_timer(stage) via match, IterStatsSummary per-stage breakdown
  • iterator.py — Final metrics flush on iteration end

Example Output

Per-stage training-thread blocked time breakdown:
    * block fetch (ray.get): 1.74ms
    * batching: 428.36us
    * format: 4.1ms
Total batches consumed: 10
Total rows consumed: 100

Performance

~6 μs/batch overhead. At 10k batches/sec: <0.04% impact on a 10ms training step.

Tests

  • 56 unit tests (iter_batches) + 7 unit tests (stats/summary) — all passing
  • Integration verified with Ray Train v2 DataParallelTrainer + TorchTrainer

References

Closes #64132 · Part of RFC #63911

cc @JasonLi1909

@OneSizeFitsQuorum OneSizeFitsQuorum requested a review from a team as a code owner June 17, 2026 15:50

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces detailed per-stage wall-clock timing and statistics tracking (including fetch, batching, formatting, collating, finalization, and order restoration) for Ray Data batch iteration, along with corresponding Prometheus metrics and unit tests. The review feedback suggests updating the type annotation of dataset_tag in _create_iteration_tags to Optional[str] to align with the underlying function and avoid static type checking errors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread python/ray/data/_internal/stats.py Outdated
@OneSizeFitsQuorum OneSizeFitsQuorum force-pushed the data/iter-batches-observability branch 2 times, most recently from b02e44d to 850ff58 Compare June 17, 2026 15:59
@ray-gardener ray-gardener Bot added data Ray Data-related issues community-contribution Contributed by the community labels Jun 17, 2026
@OneSizeFitsQuorum OneSizeFitsQuorum force-pushed the data/iter-batches-observability branch 2 times, most recently from faaeb93 to 14ec378 Compare June 18, 2026 01:57

@JasonLi1909 JasonLi1909 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @OneSizeFitsQuorum! The core idea of capturing the start and end of each pipeline stage to compute overlap with the training thread stall makes sense. That said, let's keep the scope of this PR to just the training thread attribution metrics. We can compartmentalize other metrics for a later PR, but at a glance some of them such as data_iter_prefetch_queue_depth are more of an implementation detail, which we should avoid. Also it would be great if you could simplify the PR description so it's shorter and easier to understand. Thanks!

@OneSizeFitsQuorum OneSizeFitsQuorum force-pushed the data/iter-batches-observability branch from 14ec378 to 69c8cbb Compare June 20, 2026 03:21
@OneSizeFitsQuorum OneSizeFitsQuorum changed the title [Data] Add per-stage training-thread blocking attribution and pipeline observability to iter_batches [Data] Add per-stage training-thread blocking attribution to iter_batches Jun 20, 2026
Comment thread python/ray/data/_internal/stats.py Outdated
…e observability to iter_batches

Implements overlap-based latency attribution for Ray Data's iter_batches
pipeline, addressing ray-project#64132 and RFC ray-project#63911. Each pipeline stage (fetch,
batching, format, collate, finalize, restore_order) records an independent
(start_s, end_s) time window. The training thread captures its own blocked
window around next(). Attribution per stage is the overlap of the two
windows, correctly handling prefetch > 1.

New Prometheus metrics (14 total):
- data_iter_blocked_{fetch,batching,format,collate,finalize,restore_order}_seconds
- data_iter_batches_total, data_iter_rows_total
- data_iter_total_seconds, data_iter_restore_order_buffer_peak
- data_iter_shuffle_buffer_{rows,compactions_total,compaction_seconds}
- data_iter_prefetch_queue_depth

Also adds:
- Per-stage breakdown rendering in IterStatsSummary.to_string()
- Rank extraction from dataset tags for Prometheus labels
- Final metrics flush on iterator completion

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
@OneSizeFitsQuorum OneSizeFitsQuorum force-pushed the data/iter-batches-observability branch from 69c8cbb to 071ebd6 Compare June 20, 2026 03:26
@OneSizeFitsQuorum

Copy link
Copy Markdown
Contributor Author

Thanks for the review @JasonLi1909! I've made the following changes:

  1. Reduced scope — removed the 5 implementation-detail metrics (data_iter_prefetch_queue_depth, data_iter_restore_order_buffer_peak, data_iter_shuffle_buffer_*). The PR now only contains the 9 core training-thread attribution metrics. The removed metrics are preserved on a separate branch for a future PR.

  2. Simplified PR description — cut it down from ~300 lines to ~80 lines, focusing on design, the 9 metrics, and test results.

Also fixed a regex issue flagged by Cursor Bugbot: the rank extraction now uses the last split_N match (instead of the first) to avoid false matches when the user-defined dataset name contains split_<digits>.

Reverts batcher.py changes that were only needed for the shuffle
buffer metrics which have been removed from this PR's scope per
reviewer feedback.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Comment thread python/ray/data/_internal/block_batching/util.py Outdated
Comment thread python/ray/data/_internal/block_batching/iter_batches.py
Comment thread python/ray/data/_internal/block_batching/util.py Outdated
Comment thread python/ray/data/_internal/block_batching/iter_batches.py Outdated
Comment thread python/ray/data/_internal/block_batching/interfaces.py Outdated
Comment thread python/ray/data/_internal/block_batching/iter_batches.py Outdated

@JasonLi1909 JasonLi1909 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @OneSizeFitsQuorum, thanks for the cleanup. I've left some more comments. Overall, make sure to include docstrings, comments, and that names of data classes/functions communicate their purpose. I'll take another look after you're done, thanks!

Per reviewer feedback, restore_order is an implementation detail
rather than an actionable user-facing metric. Reverts
restore_original_order() to the original simple for-loop and
removes the data_iter_blocked_restore_order_seconds Prometheus
metric along with all related fields, exports, and tests.

The PR now exposes 8 core metrics (5 blocked stages + batches/rows/total).

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Per reviewer feedback, consolidates the dual timing mechanism:

- StageTiming now supports context manager protocol (__enter__/
  __exit__) to automatically capture start_s/end_s
- Timer gains start_s/end_s fields populated by timer()
- Pipeline functions (resolve_block_refs, _format_batch,
  _collate_batch, _finalize_batch) use nested context managers
  instead of redundant perf_counter() + _record_stage_window()
- resolve_block_refs always returns BlockWithTiming, removing the
  record_timings parameter, Union types, and isinstance branching
- Removed _record_stage_window helper (no longer needed)

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
The fetch timing window in resolve_block_refs now spans from when we
start waiting for the upstream iterator (blocked on the data pipeline)
through ray.get() completion. This captures cross-node transfer and
upstream production delays, giving a more complete picture of what
blocks the training thread.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Per reviewer feedback, adds clear docstrings to:
- BatchTimings (per-batch pipeline-stage timing windows)
- BlockWithTiming (resolved block with fetch timing)
- BatchTimings.merge_fetch() (multi-block fetch window expansion)
- BatchTimings.stages() (stage name/timing iterator)
- _report_batch_timings() (overlap-based attribution algorithm)

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
_BatchingIterator can receive blocks from paths other than
resolve_block_refs (e.g., doctest examples that pass raw pyarrow
Tables). Restore the isinstance check to handle both BlockWithTiming
and raw Block objects gracefully.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Comment thread python/ray/data/_internal/block_batching/interfaces.py Outdated
Comment thread python/ray/data/_internal/block_batching/iter_batches.py Outdated
Per reviewer feedback, removed isinstance check and Union type from
_BatchingIterator by ensuring all entry points wrap blocks in
BlockWithTiming:

- batch_blocks() now wraps raw blocks in BlockWithTiming with zero
  timing before passing to blocks_to_batches()
- _BatchingIterator now assumes all blocks are BlockWithTiming
- Removed Union import from util.py

This provides a uniform type throughout the batching pipeline while
maintaining backward compatibility for external callers of batch_blocks().

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
- test_util.py: Updated test_resolve_block_refs to expect BlockWithTiming
  objects and test_blocks_to_batches to wrap raw blocks
- block_batching.py: Changed generator expression to map() to avoid holding
  references to blocks, fixing test_chained_transforms_release_intermediates

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Comment thread python/ray/data/_internal/stats.py Outdated
Comment thread python/ray/data/_internal/block_batching/util.py Outdated
Comment thread python/ray/data/_internal/block_batching/interfaces.py Outdated
Comment thread python/ray/data/_internal/block_batching/interfaces.py Outdated
Comment thread python/ray/data/_internal/block_batching/interfaces.py Outdated
Comment thread python/ray/data/_internal/block_batching/iter_batches.py Outdated
Comment thread python/ray/data/_internal/block_batching/iter_batches.py
"""

block: Block
timings: BatchTimings = field(default_factory=BatchTimings)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: The data model feels off here, an entire block will not populate all of the BatchTimings. The data models should cleanly reflect what is possible and intrinsic to the component, is there a better way to organize this?

def _wrap_block(b):
return BlockWithTiming(block=b, timings=BatchTimings())

wrapped_blocks = map(_wrap_block, blocks)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to avoid this? This is purely a shim for the typing. Somehow, we should make the timing optional into BatchIterator. If it's too complex and out of scope, leave a detailed TODO comment for the future.

("finalize", self.finalize),
)

def merge_fetch(self, other: "BatchTimings") -> None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things here:

  1. The naming here is confusing as this function is used to merge timings at the block level but the type is BatchTimings. It reads as if batches rather than blocks are being merged. See comment below about the data model of BlockWithTiming.
  2. If we split merge into data wait + transfer, we will need to change this to merge the stages properly. This may result in some double counting between the wait and transfer timers in the case of batches that span multiple blocks, so we will need to be careful here. For instance:

Block 1: wait=[0,1], transfer=[1,2]
Block 2: wait=[2,3], transfer=[3,4]

Say a batch spans these two blocks, then when we merge these blocks we get
wait=[0,3], transfer=[1,4], resulting in double counting. We will need to avoid this somehow

Comment thread python/ray/data/_internal/stats.py Outdated
Per reviewer feedback, consolidates the dual timing mechanism:

- New TimeSpan dataclass (pure data: start_s, end_s, duration)
- Timer.timer() now yields a thread-local TimeSpan (safe for
  concurrent use from thread pools)
- Removed thread-unsafe start_s/end_s fields from Timer
- New _timed() helper yields Optional[TimeSpan] (None when no stats)
- Removed StageTiming class entirely
- BatchTimings fields are now Optional[TimeSpan] (None = stage
  didn't execute, eliminating magic-number 0.0 checks)
- Pipeline functions use single _timed() context instead of nested
  StageTiming + Timer.timer() context managers
- _merge_span helper handles Optional[TimeSpan] merge logic

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Comment thread python/ray/data/_internal/block_batching/iter_batches.py Outdated
Per reviewer feedback, splits the single fetch stage into two distinct
stages to differentiate data production stall vs cross-node transfer
stall:

- production_wait: blocked on upstream data pipeline (next on ref
  bundle iterator)
- data_transfer: cross-node transfer via ray.get()

New Prometheus metrics:
- data_iter_blocked_production_wait_seconds
- data_iter_blocked_data_transfer_seconds

Removed: data_iter_blocked_fetch_seconds (replaced by the two above)

resolve_block_refs now times each sub-stage independently using
_timed() with the appropriate Timer (iter_get_ref_bundles_s for
production_wait, iter_get_s for data_transfer).

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Comment thread python/ray/data/_internal/block_batching/util.py Outdated
Per reviewer feedback, refines the data model to accurately reflect
what each component populates:

- New BlockFetchTiming: only production_wait + data_transfer (what a
  single block actually has)
- New BlockFetchResult: replaces BlockWithTiming, carries block +
  Optional[BlockFetchTiming] (None when no fetch timing)
- Removed BlockWithTiming (data model was inaccurate — it carried
  full BatchTimings but only fetch was populated)
- Moved num_rows from BatchTimings to BatchMetadata (it's not a
  timing, it's batch metadata)
- batch_blocks() shim simplified: BlockFetchResult(block=b) with
  fetch=None instead of creating fake BatchTimings
- merge_fetch now accepts BlockFetchTiming (not BatchTimings)

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Comment thread python/ray/data/_internal/block_batching/interfaces.py
Per reviewer feedback, replaces string-based stage identifiers and
getattr with type-safe alternatives:

- New PipelineStage(Enum) in stats.py (avoids circular import)
- BatchTimings.stages() returns Tuple[PipelineStage, Optional[TimeSpan]]
- DatasetStats.get_blocked_timer(stage) uses match statement
  (no getattr, no string formatting)
- _StatsActor stores blocked gauges in Dict[PipelineStage, Gauge]
  instead of individual attributes
- update_iteration_metrics iterates the dict instead of calling
  6 individual .set() methods
- Renamed _report_batch_timings → _attribute_blocked_time
  (more descriptive of what the method does)

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
@OneSizeFitsQuorum

Copy link
Copy Markdown
Contributor Author

@JasonLi1909 I've addressed all your feedback across 4 commits (TimeSpan consolidation, fetch split, data model refactoring, PipelineStage enum). Cursor Bugbot then flagged 3 remaining issues that I'd like to discuss before fixing:

1. Nested timer double-counts iter_get_ref_bundles_s

When prefetch_batches > 0, resolve_block_refs wraps next(block_ref_iter) with _timed(iter_get_ref_bundles_s), but prefetch_batches_locally's get_next_ref_bundle() already times the same next(ref_bundles) call with the same Timer. The upstream wait gets accumulated twice.

Proposed fix: In resolve_block_refs, create the production_wait TimeSpan manually (without _timed()) so it doesn't accumulate to iter_get_ref_bundles_s. The cumulative stats remain handled by get_next_ref_bundle()'s existing timer. The per-batch TimeSpan is only used for overlap attribution, not accumulation. Does this separation make sense?

2. Split fetch spans overlap, violating sum ≤ total

After splitting fetch into production_wait + data_transfer, multi-block span merge creates overlapping windows (stages are interleaved across blocks). For example:
```
Block 1: prod [0,100], xfer [100,200]
Block 2: prod [200,300], xfer [300,400]
Merged: prod [0,300], xfer [100,400] ← overlap [100,300]
```

My view: In a pipelined system, multiple stages can be active simultaneously (data_transfer for block N + production_wait for block N+1). Attributing the same blocked interval to both stages is semantically correct — the training thread was indeed blocked while both stages were running. I'd relax the invariant to allow sum > total and document this behavior. I also considered using interval lists (List[TimeSpan]) for precise non-overlapping merge, but that would significantly increase complexity. Do you agree with the relaxed invariant, or do you prefer a different approach?

3. Reorder buffer skews blocked attribution

With preserve_order=True, stage timings are recorded when format/collate finish, but the batch may sit in restore_original_order's buffer before being yielded. The blocked window includes this reorder wait, but no stage TimeSpan covers it, so it becomes unattributed.

My view: Since you asked to remove restore_order timing, this unattributed gap is expected. If users see high unattributed ratios, they can investigate further. Is this acceptable, or would you prefer a different approach?

Per reviewer feedback, the per-streaming-split-worker `rank` label on
iteration metrics is out of scope for this PR. Reverts the rank label
addition so existing iteration gauges keep their master label set
(`("dataset",)`), and leaves a TODO for a follow-up PR to add `rank`
across all iteration metrics (including the new blocked-attribution
gauges) at once.

- Removes module-level `_create_iteration_tags` and the
  `_StatsActor._create_iteration_tags` method
- Reverts `iter_tag_keys` from `("dataset", "rank")` to `("dataset",)`
- `update_iteration_metrics` uses `self._create_tags(dataset_tag)` again
- Drops the `re` import (only used by the removed function)
- Removes `test_create_iteration_tags_extracts_rank` and drops `rank`
  from `expected_tags` in `test_update_iteration_metrics_exports_new_iter_metrics`

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
When prefetching is enabled, `iter_get_ref_bundles_s` was accumulated
twice per block: once by `prefetch_batches_locally.get_next_ref_bundle`
(which times `next(ref_bundles)`) and again by `resolve_block_refs`
(which timed `next(block_ref_iter)` with the same Timer). The upstream
wait was therefore double-counted in Prometheus
`data_iter_get_ref_bundles_seconds`.

`production_wait` is now captured as a bare `TimeSpan` (start/end
`perf_counter` only) for overlap attribution, without calling
`Timer.add()`. The cumulative `iter_get_ref_bundles_s` remains driven
solely by `get_next_ref_bundle` when prefetch is on, and is not tracked
when prefetch is off — matching master behavior. `data_transfer`
(`iter_get_s`, the `ray.get()` call) is unaffected.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Adds `test_resolve_block_refs_does_not_accumulate_ref_bundles_timer` to
guard against re-introducing the nested-timer double-counting bug fixed
in the previous commit. The test asserts that:

- `iter_get_ref_bundles_s` stays at 0 after `resolve_block_refs`
  processes a slow upstream iterator (i.e. the function does not
  accumulate into that Timer; `prefetch_batches_locally` owns it).
- `production_wait` `TimeSpan` is still captured per block, so
  overlap-based blocked attribution continues to work.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Comment thread python/ray/data/iterator.py
Comment thread python/ray/data/_internal/block_batching/iter_batches.py
… TODO

Addresses review feedback from JasonLi1909 and Cursor on design-level
threads that don't require code changes, only clarification:

- block_batching.py: expand the BlockFetchResult wrapping comment into a
  detailed TODO explaining why the typing shim exists (uniform iterator
  type across prefetch/non-prefetch paths) and how a future refactor
  could remove it. Per Jason's request to leave a TODO when the fix is
  out of scope.

- iter_batches.py: relax the `sum(iter_blocked_*) ≤ iter_total_blocked_s`
  invariant in `_attribute_blocked_time`'s docstring to "approximates",
  and document two known cases that can push the sum above or below
  total by design:
  * Split fetch stages overlap (multi-block batches: production_wait
    for block N+1 concurrent with data_transfer for block N).
  * Reorder buffer wait is unattributed under preserve_order + prefetch
    (per-stage TimeSpan recorded at format/collate completion, before
    the batch leaves restore_original_order).

These are documented as expected behavior, not bugs.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
…ed_timer

Closes five test-coverage gaps identified during PR review:

- TestTimerTimerSpan (test_stats.py): verifies Timer.timer() yields a
  fresh TimeSpan per call and accumulates its duration via add(); also
  covers _timed(None) skipping perf_counter and _timed(Timer) yielding
  a real span.
- TestGetBlockedTimer (test_stats.py, parametrized over 6 stages):
  verifies DatasetStats.get_blocked_timer(stage) returns the correct
  Timer attribute via the match statement (previously only covered
  indirectly through update_iteration_metrics).
- TestMergeFetch.data_transfer cases (test_iter_batches.py): extends
  existing production_wait-only tests to cover data_transfer merging
  (multiple blocks, overlapping windows, independence from
  production_wait, None src preservation).
- test_resolve_block_refs_accumulates_data_transfer_timer (test_util.py):
  pairs with the existing ref_bundles non-accumulation test to verify
  the data_transfer (iter_get_s / ray.get) path IS accumulated and a
  per-block data_transfer TimeSpan is captured.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Fixes CI lint failures from unformatted code in the Cursor-2 fix and
the new TestMergeFetch data_transfer cases. Black 22.10.0 (the version
pinned in .pre-commit-config.yaml) collapses two patterns onto single
lines:

- util.py: `TimeSpan(...) if stats else None` ternary inside the
  prod_span assignment.
- test_iter_batches.py: `BlockFetchTiming(data_transfer=TimeSpan(...))`
  constructor calls in the new merge tests.

No behavior change. ruff and black --check both pass on all changed
files.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Comment thread python/ray/data/_internal/block_batching/block_batching.py Outdated
Comment thread python/ray/data/_internal/stats.py Outdated
BATCHING = "batching"
FORMAT = "format"
COLLATE = "collate"
FINALIZE = "finalize"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"IterationStage" is a better name for this. Also, document with a description for each stage. "Pipeline stages for blocked attribution" is vague- what pipeline stages? what blocked attribution?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc added. BTW, I think either name works—they each emphasize a slightly different aspect, and both make sense in this context. Since you seem to prefer it, I've changed it to iteration stage.

Comment thread python/ray/data/_internal/stats.py Outdated
Comment thread python/ray/data/_internal/stats.py Outdated
Comment thread python/ray/data/_internal/stats.py Outdated
…mments, remove gauge dict

Addresses 5 new review threads from JasonLi1909 (2026-06-27) plus
comment-hygiene cleanup:

Renames (Jason #5, #6):
- PipelineStage → IterationStage (stats.py, interfaces.py, tests)
- _timed → _maybe_time (stats.py, util.py, tests)
- IterationStage docstring expanded with per-stage descriptions and
  Prometheus label mapping note.

Comment cleanup:
- block_batching.py: typing-shim TODO compressed from 8 to 2 lines,
  "Out of scope for this PR" removed (Jason #4).
- stats.py: rank-label TODO compressed, "this PR's scope" removed.
- stats.py: Timer.timer() docstring "thread-local" → "fresh per call"
  (the span is a fresh instance, not TLS).
- util.py: resolve_block_refs docstring Note compressed, "matching
  master behavior" removed; redundant inline comment replaced with a
  one-liner pointing at the docstring.
- util.py: _pending_timings field gets a one-line comment.
- iter_batches.py: _attribute_blocked_time docstring — the 15-line
  invariant/limitations section replaced with a brief TODO noting the
  design is open (interval lists as a future option).

Structure simplification (Jason #7, #8):
- _blocked_gauges: Dict[IterationStage, Gauge] replaced with 6
  individual Gauge attributes (iter_blocked_*_s), matching the existing
  iteration gauge style.
- for-loop in update_iteration_metrics replaced with 6 individual
  .set() calls.
- get_blocked_timer() match statement retained (maps stage → Timer on
  DatasetStats, orthogonal to gauge storage).
- Test fixture updated: FakeGauge setup uses individual attributes
  instead of a dict.

All affected tests pass (43 in test_stats + test_util + test_iter_batches).
black 22.10.0 and ruff both clean.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
…ocked_time

Follow-up to the IterationStage rename (commit 4ac2b5c) — the enum
was renamed but a handful of docstrings still said "pipeline stage".
Also tightens _attribute_blocked_time per review feedback that it was
still too long.

- interfaces.py: 4 docstring references "pipeline stage" → "iteration
  stage" (BatchTimings class doc, stages() doc, BatchMetadata.timings
  attribute doc).
- stats.py: get_blocked_timer docstring "pipeline stage" → "iteration
  stage".
- iter_batches.py: _attribute_blocked_time docstring rewritten —
  drops the 3 prefetch-config bullets (obvious from the formula),
  drops the thread-safety paragraph (implementation detail), and
  folds the invariant + two limitations into a single TODO noting
  the design is open (interval lists as a future option). Also
  fixes "pipeline stage" → "stage" here. Net ~20 lines shorter.

No behavior change. black 22.10.0 + ruff clean; 25 affected tests pass.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit b13112d. Configure here.

Comment thread python/ray/data/iterator.py
Previously `iter_total_s.add(...)` and the final
`_StatsManager.update_iteration_metrics` flush ran only after
`yield from batch_iterator` completed normally. On early exit
(e.g. `break` in the training loop), the `finally` block only shut
down the executor, so `iter_total_s` stayed at zero and Prometheus
never got a final flush despite partial iteration.

Moves both calls into the `finally` block (before
`_on_iteration_end` to keep total time excluding executor shutdown,
matching master's placement). Normal completion is unaffected; early
exit now records the wall-clock time up to the break and flushes
metrics.

Note: mid-run staleness (iter_total_s reads 0 during iteration because
add() only runs at the end) is a separate pre-existing issue and remains
a follow-up.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Six style fixes from self-review of test code:

1. Rename TestTimerTimerSpan → TestTimerSpan (the "Timer" duplication
   was awkward; the class tests Timer.timer() returning TimeSpan).

2. Remove stale `assert "restore order" not in text` from
   test_iter_stats_to_string_omits_zero_stages — the restore_order
   stage was removed in an earlier commit, so the assertion was
   vacuously true.

3. Merge the separate for-loop that set up blocked-gauge FakeGauges
   into the main gauge-setup loop in
   test_update_iteration_metrics_exports_new_iter_metrics, and drop
   the now-redundant "Blocked gauges are stored as individual
   attributes" comment (blocked gauges are just attributes now, like
   the others).

4. TestMergeFetch: the five original methods passed `BatchTimings()` as
   the `src` argument to `merge_fetch()`, whose signature takes
   `BlockFetchTiming`. It worked by duck-typing but was the wrong
   type. Switched to `BlockFetchTiming(production_wait=TimeSpan(...))`
   to match the signature and the style of the newer data_transfer
   tests.

5. test_overlap_invariant_sum_leq_total: docstring said "always holds"
   but the invariant was relaxed to "approximates" with a TODO noting
   cases that violate it. Tightened to "holds for non-overlapping
   stages" to reflect what the test actually exercises.

6. _make_span / _make_batch_with_timings: replaced the 0.0-as-sentinel
   pattern (start==0.0 and end==0.0 → None) with explicit Optional
   parameters defaulting to None. Mirrors the source-code cleanup
   Jason requested for the 0.0 sentinel.

All affected tests pass (38). black 22.10.0 + ruff clean.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Review feedback noted the documentation read as agent-generated —
excessive on trivial functions, missing on core ones, and heavy on
Sphinx markup / em-dashes / "Note:" callouts that humans don't write.

interfaces.py:
- BlockFetchTiming: drop lifecycle sentence ("Produced by ... merged
  into ... by ..."). A 2-field dataclass doesn't need it.
- BlockFetchResult: tighten the None-description.
- BatchTimings: drop "overlap-based attribution" rationale and the
  :meth: cross-reference; keep just what the fields mean.
- stages(): drop the redundant docstring restating the signature.
- _merge_span: 5-line docstring → 1 line. The code is 7 lines.
- BatchMetadata.timings: "Iteration-stage timing windows for this
  batch" → "Per-stage timing windows."

stats.py:
- IterationStage: drop the 3-line overlap-mechanism explanation
  (that's _attribute_blocked_time's job, not the enum's).
- TimeSpan: drop "Created by Timer.timer and carried per-batch for
  overlap-based blocked attribution" — system-level rationale doesn't
  belong on a 2-field dataclass.
- _maybe_time: 3-line docstring → 1 line; drop :class: markup and
  the em-dash aside.

util.py:
- resolve_block_refs: drop "Note:" callout prefix; drop :func:
  markup on prefetch_batches_locally.get_next_ref_bundle.
- Inline comments: drop "(1)"/"(2)" numbering; compress the
  ray.get TODO to one line.

iter_batches.py:
- _iter_batches: add a short docstring (was missing).
- _attribute_blocked_time: drop "recorded by background threads"
  implementation detail; drop the _iter_batches cross-ref; drop the
  "Interval lists would give precise..." suggestion from the TODO;
  plain-backtick Args instead of ``-wrapped.
- get_next_batch_context / yield_batch_context: add short docstrings
  (were missing).

Net: -22 docstring lines, no behavior change. black 22.10.0 + ruff
clean; 38 affected tests pass.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
The inline comment "# production_wait: upstream wait (not accumulated;
see docstring)" pointed back at the docstring above it — a circular
reference that reads as filler. State the fact inline instead.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
…mings

Same LLM-tell pattern as the "see docstring" pointer fixed in the
previous commit — a "(see <method>)" cross-reference that reads as
filler. The reader can find the consumer by grepping.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
The previous docstring cleanup pass missed this one — Timer.timer()
still had ``:class:`TimeSpan`` and ``:meth:`add`` Sphinx markup while
the rest of the PR's internal docstrings use plain backticks. Switched
to plain backticks for consistency.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Naming (util.py):
- prod_start_s / prod_span → production_wait_start / production_wait_span
- xfer_span → data_transfer_span
  Variable names now match the BlockFetchTiming field names they're
  assigned to, so readers don't have to map abbreviations.

Stale test names (test_iter_batches.py):
- TestReportBatchTimingsEdgeCases → TestAttributeBlockedTimeEdgeCases
  (method was renamed _report_batch_timings → _attribute_blocked_time
  in an earlier commit; the test class name was missed).
- _make_report_iterator → _make_test_iterator (same stale "report").

Test hygiene (test_util.py):
- Move `from ray.data._internal.stats import DatasetStats` from inside
  two test functions to a top-level import.
- Trim the 8-line regression-test docstring to 2 lines (the source
  docstring already has the rationale).
- Drop the "Pairs with <other test>" cross-reference (brittle to
  renames, reads as filler).
- Drop the weak `assert stats.iter_get_s.get() >= 0.0` (always true
  for non-negative Timer values; wasn't verifying anything). The
  structural checks (data_transfer span exists per block) remain.

Test hygiene (test_stats.py):
- test_each_call_returns_fresh_span: drop the unnecessary `spans`
  list; use s1/s2 directly.
- test_maybe_time_skips_when_timer_none: docstring claimed "skips
  perf_counter entirely" but the test only checks `span is None`;
  dropped the unverified claim.

All affected tests pass (39). black 22.10.0 + ruff clean.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
ruff's isort check (--select I) flagged the imports I added
(IterationStage, TimeSpan, _maybe_time) as out of alphabetical order.
pre-commit runs `ruff --select I --fix --exit-non-zero-on-fix`, which
fixes and exits non-zero — failing microcheck CI.

Reordered to satisfy ruff's case-insensitive isort.

Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC] Data Ingest Observability with Ray Data + Ray Train

2 participants