[Data] Add per-stage training-thread blocking attribution to iter_batches#64183
[Data] Add per-stage training-thread blocking attribution to iter_batches#64183OneSizeFitsQuorum wants to merge 32 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces detailed per-stage wall-clock timing and statistics tracking (including fetch, batching, formatting, collating, finalization, and order restoration) for Ray Data batch iteration, along with corresponding Prometheus metrics and unit tests. The review feedback suggests updating the type annotation of dataset_tag in _create_iteration_tags to Optional[str] to align with the underlying function and avoid static type checking errors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
b02e44d to
850ff58
Compare
faaeb93 to
14ec378
Compare
JasonLi1909
left a comment
There was a problem hiding this comment.
Thank you for the PR @OneSizeFitsQuorum! The core idea of capturing the start and end of each pipeline stage to compute overlap with the training thread stall makes sense. That said, let's keep the scope of this PR to just the training thread attribution metrics. We can compartmentalize other metrics for a later PR, but at a glance some of them such as data_iter_prefetch_queue_depth are more of an implementation detail, which we should avoid. Also it would be great if you could simplify the PR description so it's shorter and easier to understand. Thanks!
14ec378 to
69c8cbb
Compare
…e observability to iter_batches Implements overlap-based latency attribution for Ray Data's iter_batches pipeline, addressing ray-project#64132 and RFC ray-project#63911. Each pipeline stage (fetch, batching, format, collate, finalize, restore_order) records an independent (start_s, end_s) time window. The training thread captures its own blocked window around next(). Attribution per stage is the overlap of the two windows, correctly handling prefetch > 1. New Prometheus metrics (14 total): - data_iter_blocked_{fetch,batching,format,collate,finalize,restore_order}_seconds - data_iter_batches_total, data_iter_rows_total - data_iter_total_seconds, data_iter_restore_order_buffer_peak - data_iter_shuffle_buffer_{rows,compactions_total,compaction_seconds} - data_iter_prefetch_queue_depth Also adds: - Per-stage breakdown rendering in IterStatsSummary.to_string() - Rank extraction from dataset tags for Prometheus labels - Final metrics flush on iterator completion Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
69c8cbb to
071ebd6
Compare
|
Thanks for the review @JasonLi1909! I've made the following changes:
Also fixed a regex issue flagged by Cursor Bugbot: the rank extraction now uses the last |
Reverts batcher.py changes that were only needed for the shuffle buffer metrics which have been removed from this PR's scope per reviewer feedback. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
JasonLi1909
left a comment
There was a problem hiding this comment.
Hey @OneSizeFitsQuorum, thanks for the cleanup. I've left some more comments. Overall, make sure to include docstrings, comments, and that names of data classes/functions communicate their purpose. I'll take another look after you're done, thanks!
Per reviewer feedback, restore_order is an implementation detail rather than an actionable user-facing metric. Reverts restore_original_order() to the original simple for-loop and removes the data_iter_blocked_restore_order_seconds Prometheus metric along with all related fields, exports, and tests. The PR now exposes 8 core metrics (5 blocked stages + batches/rows/total). Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Per reviewer feedback, consolidates the dual timing mechanism: - StageTiming now supports context manager protocol (__enter__/ __exit__) to automatically capture start_s/end_s - Timer gains start_s/end_s fields populated by timer() - Pipeline functions (resolve_block_refs, _format_batch, _collate_batch, _finalize_batch) use nested context managers instead of redundant perf_counter() + _record_stage_window() - resolve_block_refs always returns BlockWithTiming, removing the record_timings parameter, Union types, and isinstance branching - Removed _record_stage_window helper (no longer needed) Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
The fetch timing window in resolve_block_refs now spans from when we start waiting for the upstream iterator (blocked on the data pipeline) through ray.get() completion. This captures cross-node transfer and upstream production delays, giving a more complete picture of what blocks the training thread. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Per reviewer feedback, adds clear docstrings to: - BatchTimings (per-batch pipeline-stage timing windows) - BlockWithTiming (resolved block with fetch timing) - BatchTimings.merge_fetch() (multi-block fetch window expansion) - BatchTimings.stages() (stage name/timing iterator) - _report_batch_timings() (overlap-based attribution algorithm) Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
_BatchingIterator can receive blocks from paths other than resolve_block_refs (e.g., doctest examples that pass raw pyarrow Tables). Restore the isinstance check to handle both BlockWithTiming and raw Block objects gracefully. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Per reviewer feedback, removed isinstance check and Union type from _BatchingIterator by ensuring all entry points wrap blocks in BlockWithTiming: - batch_blocks() now wraps raw blocks in BlockWithTiming with zero timing before passing to blocks_to_batches() - _BatchingIterator now assumes all blocks are BlockWithTiming - Removed Union import from util.py This provides a uniform type throughout the batching pipeline while maintaining backward compatibility for external callers of batch_blocks(). Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
- test_util.py: Updated test_resolve_block_refs to expect BlockWithTiming objects and test_blocks_to_batches to wrap raw blocks - block_batching.py: Changed generator expression to map() to avoid holding references to blocks, fixing test_chained_transforms_release_intermediates Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
| """ | ||
|
|
||
| block: Block | ||
| timings: BatchTimings = field(default_factory=BatchTimings) |
There was a problem hiding this comment.
Question: The data model feels off here, an entire block will not populate all of the BatchTimings. The data models should cleanly reflect what is possible and intrinsic to the component, is there a better way to organize this?
| def _wrap_block(b): | ||
| return BlockWithTiming(block=b, timings=BatchTimings()) | ||
|
|
||
| wrapped_blocks = map(_wrap_block, blocks) |
There was a problem hiding this comment.
Is there a way to avoid this? This is purely a shim for the typing. Somehow, we should make the timing optional into BatchIterator. If it's too complex and out of scope, leave a detailed TODO comment for the future.
| ("finalize", self.finalize), | ||
| ) | ||
|
|
||
| def merge_fetch(self, other: "BatchTimings") -> None: |
There was a problem hiding this comment.
Two things here:
- The naming here is confusing as this function is used to merge timings at the block level but the type is BatchTimings. It reads as if batches rather than blocks are being merged. See comment below about the data model of BlockWithTiming.
- If we split merge into data wait + transfer, we will need to change this to merge the stages properly. This may result in some double counting between the wait and transfer timers in the case of batches that span multiple blocks, so we will need to be careful here. For instance:
Block 1: wait=[0,1], transfer=[1,2]
Block 2: wait=[2,3], transfer=[3,4]
Say a batch spans these two blocks, then when we merge these blocks we get
wait=[0,3], transfer=[1,4], resulting in double counting. We will need to avoid this somehow
Per reviewer feedback, consolidates the dual timing mechanism: - New TimeSpan dataclass (pure data: start_s, end_s, duration) - Timer.timer() now yields a thread-local TimeSpan (safe for concurrent use from thread pools) - Removed thread-unsafe start_s/end_s fields from Timer - New _timed() helper yields Optional[TimeSpan] (None when no stats) - Removed StageTiming class entirely - BatchTimings fields are now Optional[TimeSpan] (None = stage didn't execute, eliminating magic-number 0.0 checks) - Pipeline functions use single _timed() context instead of nested StageTiming + Timer.timer() context managers - _merge_span helper handles Optional[TimeSpan] merge logic Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Per reviewer feedback, splits the single fetch stage into two distinct stages to differentiate data production stall vs cross-node transfer stall: - production_wait: blocked on upstream data pipeline (next on ref bundle iterator) - data_transfer: cross-node transfer via ray.get() New Prometheus metrics: - data_iter_blocked_production_wait_seconds - data_iter_blocked_data_transfer_seconds Removed: data_iter_blocked_fetch_seconds (replaced by the two above) resolve_block_refs now times each sub-stage independently using _timed() with the appropriate Timer (iter_get_ref_bundles_s for production_wait, iter_get_s for data_transfer). Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Per reviewer feedback, refines the data model to accurately reflect what each component populates: - New BlockFetchTiming: only production_wait + data_transfer (what a single block actually has) - New BlockFetchResult: replaces BlockWithTiming, carries block + Optional[BlockFetchTiming] (None when no fetch timing) - Removed BlockWithTiming (data model was inaccurate — it carried full BatchTimings but only fetch was populated) - Moved num_rows from BatchTimings to BatchMetadata (it's not a timing, it's batch metadata) - batch_blocks() shim simplified: BlockFetchResult(block=b) with fetch=None instead of creating fake BatchTimings - merge_fetch now accepts BlockFetchTiming (not BatchTimings) Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Per reviewer feedback, replaces string-based stage identifiers and getattr with type-safe alternatives: - New PipelineStage(Enum) in stats.py (avoids circular import) - BatchTimings.stages() returns Tuple[PipelineStage, Optional[TimeSpan]] - DatasetStats.get_blocked_timer(stage) uses match statement (no getattr, no string formatting) - _StatsActor stores blocked gauges in Dict[PipelineStage, Gauge] instead of individual attributes - update_iteration_metrics iterates the dict instead of calling 6 individual .set() methods - Renamed _report_batch_timings → _attribute_blocked_time (more descriptive of what the method does) Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
|
@JasonLi1909 I've addressed all your feedback across 4 commits (TimeSpan consolidation, fetch split, data model refactoring, PipelineStage enum). Cursor Bugbot then flagged 3 remaining issues that I'd like to discuss before fixing: 1. Nested timer double-counts When Proposed fix: In 2. Split fetch spans overlap, violating After splitting fetch into My view: In a pipelined system, multiple stages can be active simultaneously (data_transfer for block N + production_wait for block N+1). Attributing the same blocked interval to both stages is semantically correct — the training thread was indeed blocked while both stages were running. I'd relax the invariant to allow 3. Reorder buffer skews blocked attribution With My view: Since you asked to remove |
Per reviewer feedback, the per-streaming-split-worker `rank` label on
iteration metrics is out of scope for this PR. Reverts the rank label
addition so existing iteration gauges keep their master label set
(`("dataset",)`), and leaves a TODO for a follow-up PR to add `rank`
across all iteration metrics (including the new blocked-attribution
gauges) at once.
- Removes module-level `_create_iteration_tags` and the
`_StatsActor._create_iteration_tags` method
- Reverts `iter_tag_keys` from `("dataset", "rank")` to `("dataset",)`
- `update_iteration_metrics` uses `self._create_tags(dataset_tag)` again
- Drops the `re` import (only used by the removed function)
- Removes `test_create_iteration_tags_extracts_rank` and drops `rank`
from `expected_tags` in `test_update_iteration_metrics_exports_new_iter_metrics`
Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
When prefetching is enabled, `iter_get_ref_bundles_s` was accumulated twice per block: once by `prefetch_batches_locally.get_next_ref_bundle` (which times `next(ref_bundles)`) and again by `resolve_block_refs` (which timed `next(block_ref_iter)` with the same Timer). The upstream wait was therefore double-counted in Prometheus `data_iter_get_ref_bundles_seconds`. `production_wait` is now captured as a bare `TimeSpan` (start/end `perf_counter` only) for overlap attribution, without calling `Timer.add()`. The cumulative `iter_get_ref_bundles_s` remains driven solely by `get_next_ref_bundle` when prefetch is on, and is not tracked when prefetch is off — matching master behavior. `data_transfer` (`iter_get_s`, the `ray.get()` call) is unaffected. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Adds `test_resolve_block_refs_does_not_accumulate_ref_bundles_timer` to guard against re-introducing the nested-timer double-counting bug fixed in the previous commit. The test asserts that: - `iter_get_ref_bundles_s` stays at 0 after `resolve_block_refs` processes a slow upstream iterator (i.e. the function does not accumulate into that Timer; `prefetch_batches_locally` owns it). - `production_wait` `TimeSpan` is still captured per block, so overlap-based blocked attribution continues to work. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
… TODO
Addresses review feedback from JasonLi1909 and Cursor on design-level
threads that don't require code changes, only clarification:
- block_batching.py: expand the BlockFetchResult wrapping comment into a
detailed TODO explaining why the typing shim exists (uniform iterator
type across prefetch/non-prefetch paths) and how a future refactor
could remove it. Per Jason's request to leave a TODO when the fix is
out of scope.
- iter_batches.py: relax the `sum(iter_blocked_*) ≤ iter_total_blocked_s`
invariant in `_attribute_blocked_time`'s docstring to "approximates",
and document two known cases that can push the sum above or below
total by design:
* Split fetch stages overlap (multi-block batches: production_wait
for block N+1 concurrent with data_transfer for block N).
* Reorder buffer wait is unattributed under preserve_order + prefetch
(per-stage TimeSpan recorded at format/collate completion, before
the batch leaves restore_original_order).
These are documented as expected behavior, not bugs.
Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
…ed_timer Closes five test-coverage gaps identified during PR review: - TestTimerTimerSpan (test_stats.py): verifies Timer.timer() yields a fresh TimeSpan per call and accumulates its duration via add(); also covers _timed(None) skipping perf_counter and _timed(Timer) yielding a real span. - TestGetBlockedTimer (test_stats.py, parametrized over 6 stages): verifies DatasetStats.get_blocked_timer(stage) returns the correct Timer attribute via the match statement (previously only covered indirectly through update_iteration_metrics). - TestMergeFetch.data_transfer cases (test_iter_batches.py): extends existing production_wait-only tests to cover data_transfer merging (multiple blocks, overlapping windows, independence from production_wait, None src preservation). - test_resolve_block_refs_accumulates_data_transfer_timer (test_util.py): pairs with the existing ref_bundles non-accumulation test to verify the data_transfer (iter_get_s / ray.get) path IS accumulated and a per-block data_transfer TimeSpan is captured. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Fixes CI lint failures from unformatted code in the Cursor-2 fix and the new TestMergeFetch data_transfer cases. Black 22.10.0 (the version pinned in .pre-commit-config.yaml) collapses two patterns onto single lines: - util.py: `TimeSpan(...) if stats else None` ternary inside the prod_span assignment. - test_iter_batches.py: `BlockFetchTiming(data_transfer=TimeSpan(...))` constructor calls in the new merge tests. No behavior change. ruff and black --check both pass on all changed files. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
| BATCHING = "batching" | ||
| FORMAT = "format" | ||
| COLLATE = "collate" | ||
| FINALIZE = "finalize" |
There was a problem hiding this comment.
"IterationStage" is a better name for this. Also, document with a description for each stage. "Pipeline stages for blocked attribution" is vague- what pipeline stages? what blocked attribution?
There was a problem hiding this comment.
doc added. BTW, I think either name works—they each emphasize a slightly different aspect, and both make sense in this context. Since you seem to prefer it, I've changed it to iteration stage.
…mments, remove gauge dict Addresses 5 new review threads from JasonLi1909 (2026-06-27) plus comment-hygiene cleanup: Renames (Jason #5, #6): - PipelineStage → IterationStage (stats.py, interfaces.py, tests) - _timed → _maybe_time (stats.py, util.py, tests) - IterationStage docstring expanded with per-stage descriptions and Prometheus label mapping note. Comment cleanup: - block_batching.py: typing-shim TODO compressed from 8 to 2 lines, "Out of scope for this PR" removed (Jason #4). - stats.py: rank-label TODO compressed, "this PR's scope" removed. - stats.py: Timer.timer() docstring "thread-local" → "fresh per call" (the span is a fresh instance, not TLS). - util.py: resolve_block_refs docstring Note compressed, "matching master behavior" removed; redundant inline comment replaced with a one-liner pointing at the docstring. - util.py: _pending_timings field gets a one-line comment. - iter_batches.py: _attribute_blocked_time docstring — the 15-line invariant/limitations section replaced with a brief TODO noting the design is open (interval lists as a future option). Structure simplification (Jason #7, #8): - _blocked_gauges: Dict[IterationStage, Gauge] replaced with 6 individual Gauge attributes (iter_blocked_*_s), matching the existing iteration gauge style. - for-loop in update_iteration_metrics replaced with 6 individual .set() calls. - get_blocked_timer() match statement retained (maps stage → Timer on DatasetStats, orthogonal to gauge storage). - Test fixture updated: FakeGauge setup uses individual attributes instead of a dict. All affected tests pass (43 in test_stats + test_util + test_iter_batches). black 22.10.0 and ruff both clean. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
…ocked_time Follow-up to the IterationStage rename (commit 4ac2b5c) — the enum was renamed but a handful of docstrings still said "pipeline stage". Also tightens _attribute_blocked_time per review feedback that it was still too long. - interfaces.py: 4 docstring references "pipeline stage" → "iteration stage" (BatchTimings class doc, stages() doc, BatchMetadata.timings attribute doc). - stats.py: get_blocked_timer docstring "pipeline stage" → "iteration stage". - iter_batches.py: _attribute_blocked_time docstring rewritten — drops the 3 prefetch-config bullets (obvious from the formula), drops the thread-safety paragraph (implementation detail), and folds the invariant + two limitations into a single TODO noting the design is open (interval lists as a future option). Also fixes "pipeline stage" → "stage" here. Net ~20 lines shorter. No behavior change. black 22.10.0 + ruff clean; 25 affected tests pass. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit b13112d. Configure here.
Previously `iter_total_s.add(...)` and the final `_StatsManager.update_iteration_metrics` flush ran only after `yield from batch_iterator` completed normally. On early exit (e.g. `break` in the training loop), the `finally` block only shut down the executor, so `iter_total_s` stayed at zero and Prometheus never got a final flush despite partial iteration. Moves both calls into the `finally` block (before `_on_iteration_end` to keep total time excluding executor shutdown, matching master's placement). Normal completion is unaffected; early exit now records the wall-clock time up to the break and flushes metrics. Note: mid-run staleness (iter_total_s reads 0 during iteration because add() only runs at the end) is a separate pre-existing issue and remains a follow-up. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Six style fixes from self-review of test code: 1. Rename TestTimerTimerSpan → TestTimerSpan (the "Timer" duplication was awkward; the class tests Timer.timer() returning TimeSpan). 2. Remove stale `assert "restore order" not in text` from test_iter_stats_to_string_omits_zero_stages — the restore_order stage was removed in an earlier commit, so the assertion was vacuously true. 3. Merge the separate for-loop that set up blocked-gauge FakeGauges into the main gauge-setup loop in test_update_iteration_metrics_exports_new_iter_metrics, and drop the now-redundant "Blocked gauges are stored as individual attributes" comment (blocked gauges are just attributes now, like the others). 4. TestMergeFetch: the five original methods passed `BatchTimings()` as the `src` argument to `merge_fetch()`, whose signature takes `BlockFetchTiming`. It worked by duck-typing but was the wrong type. Switched to `BlockFetchTiming(production_wait=TimeSpan(...))` to match the signature and the style of the newer data_transfer tests. 5. test_overlap_invariant_sum_leq_total: docstring said "always holds" but the invariant was relaxed to "approximates" with a TODO noting cases that violate it. Tightened to "holds for non-overlapping stages" to reflect what the test actually exercises. 6. _make_span / _make_batch_with_timings: replaced the 0.0-as-sentinel pattern (start==0.0 and end==0.0 → None) with explicit Optional parameters defaulting to None. Mirrors the source-code cleanup Jason requested for the 0.0 sentinel. All affected tests pass (38). black 22.10.0 + ruff clean. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Review feedback noted the documentation read as agent-generated —
excessive on trivial functions, missing on core ones, and heavy on
Sphinx markup / em-dashes / "Note:" callouts that humans don't write.
interfaces.py:
- BlockFetchTiming: drop lifecycle sentence ("Produced by ... merged
into ... by ..."). A 2-field dataclass doesn't need it.
- BlockFetchResult: tighten the None-description.
- BatchTimings: drop "overlap-based attribution" rationale and the
:meth: cross-reference; keep just what the fields mean.
- stages(): drop the redundant docstring restating the signature.
- _merge_span: 5-line docstring → 1 line. The code is 7 lines.
- BatchMetadata.timings: "Iteration-stage timing windows for this
batch" → "Per-stage timing windows."
stats.py:
- IterationStage: drop the 3-line overlap-mechanism explanation
(that's _attribute_blocked_time's job, not the enum's).
- TimeSpan: drop "Created by Timer.timer and carried per-batch for
overlap-based blocked attribution" — system-level rationale doesn't
belong on a 2-field dataclass.
- _maybe_time: 3-line docstring → 1 line; drop :class: markup and
the em-dash aside.
util.py:
- resolve_block_refs: drop "Note:" callout prefix; drop :func:
markup on prefetch_batches_locally.get_next_ref_bundle.
- Inline comments: drop "(1)"/"(2)" numbering; compress the
ray.get TODO to one line.
iter_batches.py:
- _iter_batches: add a short docstring (was missing).
- _attribute_blocked_time: drop "recorded by background threads"
implementation detail; drop the _iter_batches cross-ref; drop the
"Interval lists would give precise..." suggestion from the TODO;
plain-backtick Args instead of ``-wrapped.
- get_next_batch_context / yield_batch_context: add short docstrings
(were missing).
Net: -22 docstring lines, no behavior change. black 22.10.0 + ruff
clean; 38 affected tests pass.
Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
The inline comment "# production_wait: upstream wait (not accumulated; see docstring)" pointed back at the docstring above it — a circular reference that reads as filler. State the fact inline instead. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
…mings Same LLM-tell pattern as the "see docstring" pointer fixed in the previous commit — a "(see <method>)" cross-reference that reads as filler. The reader can find the consumer by grepping. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
The previous docstring cleanup pass missed this one — Timer.timer() still had ``:class:`TimeSpan`` and ``:meth:`add`` Sphinx markup while the rest of the PR's internal docstrings use plain backticks. Switched to plain backticks for consistency. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
Naming (util.py): - prod_start_s / prod_span → production_wait_start / production_wait_span - xfer_span → data_transfer_span Variable names now match the BlockFetchTiming field names they're assigned to, so readers don't have to map abbreviations. Stale test names (test_iter_batches.py): - TestReportBatchTimingsEdgeCases → TestAttributeBlockedTimeEdgeCases (method was renamed _report_batch_timings → _attribute_blocked_time in an earlier commit; the test class name was missed). - _make_report_iterator → _make_test_iterator (same stale "report"). Test hygiene (test_util.py): - Move `from ray.data._internal.stats import DatasetStats` from inside two test functions to a top-level import. - Trim the 8-line regression-test docstring to 2 lines (the source docstring already has the rationale). - Drop the "Pairs with <other test>" cross-reference (brittle to renames, reads as filler). - Drop the weak `assert stats.iter_get_s.get() >= 0.0` (always true for non-negative Timer values; wasn't verifying anything). The structural checks (data_transfer span exists per block) remain. Test hygiene (test_stats.py): - test_each_call_returns_fresh_span: drop the unnecessary `spans` list; use s1/s2 directly. - test_maybe_time_skips_when_timer_none: docstring claimed "skips perf_counter entirely" but the test only checks `span is None`; dropped the unverified claim. All affected tests pass (39). black 22.10.0 + ruff clean. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>
ruff's isort check (--select I) flagged the imports I added (IterationStage, TimeSpan, _maybe_time) as out of alphabetical order. pre-commit runs `ruff --select I --fix --exit-non-zero-on-fix`, which fixes and exits non-zero — failing microcheck CI. Reordered to satisfy ruff's case-insensitive isort. Signed-off-by: OneSizeFitsQuorum <tanxinyu@apache.org>

Summary
Decomposes
iter_total_blocked_sinto per-stage contributions so users can see where the training thread is blocked. Closes #64132, part of RFC #63911.Design
Each pipeline stage records an independent
(start_s, end_s)window viaTimeSpan, a lightweight value yielded byTimer.timer(). The training thread records its blocked window aroundnext(batch_iter). Per-stage attribution = overlap of the two windows:Invariant:
sum(iter_blocked_*)approximatesiter_total_blocked_s. It is a lower bound in the common case, but split fetch stages (production_wait+data_transfer) may overlap by design for multi-block batches — see_attribute_blocked_timedocstring for the full rationale.The fetch stage is split into two:
production_waitspans the upstream wait (blocked on the data pipeline), anddata_transferspans the cross-node transfer viaray.get(). This lets users differentiate data production stall from cross-node transfer stall.New Prometheus Metrics (9)
data_iter_blocked_production_wait_secondsdata_iter_blocked_data_transfer_secondsdata_iter_blocked_batching_secondsdata_iter_blocked_format_secondsdata_iter_blocked_collate_secondsdata_iter_blocked_finalize_secondsdata_iter_batches_totaldata_iter_rows_totaldata_iter_total_secondsPer-rank Prometheus label (
rankextracted from dataset tag's trailingsplit_N) is deferred to a follow-up PR — see TODO instats.py.Changes
interfaces.py—TimeSpan,BlockFetchTiming(per-block: production_wait + data_transfer),BlockFetchResult(block + Optional timing),BatchTimings(per-batch: all 6 stages),merge_fetch/_merge_spaniter_batches.py—_attribute_blocked_time()overlap computation with docstring documenting known limitations (split-fetch overlap, reorder buffer unattributed gap)util.py—resolve_block_refsreturnsIterator[BlockFetchResult];production_waitcaptured as a bare TimeSpan (no accumulation, to avoid double-counting withprefetch_batches_locally.get_next_ref_bundle);data_transferaccumulated intoiter_get_sstats.py— 9 Prometheus Gauges,Timer.timer()yields a freshTimeSpanper call (thread-safe),PipelineStageenum,get_blocked_timer(stage)via match,IterStatsSummaryper-stage breakdowniterator.py— Final metrics flush on iteration endExample Output
Performance
~6 μs/batch overhead. At 10k batches/sec: <0.04% impact on a 10ms training step.
Tests
References
Closes #64132 · Part of RFC #63911
cc @JasonLi1909