[Data] Make PandasBlock.size_bytes deterministic by dragongu · Pull Request #64393 · ray-project/ray

dragongu · 2026-06-27T15:13:42Z

Description

PandasBlockAccessor.size_bytes() estimates string/object column sizes by sampling up to 200 rows via pandas.DataFrame.sample(n=...) without a seed, so two blocks holding identical data report different sizes. That estimate drives block splitting in streaming generators, so a re-executed task can emit a different number of objects than the original attempt — which leaves downstream consumers silently hanging or silently dropping data (the companion #64394 fails fast when that mismatch occurs). This PR fixes the sampling non-determinism by passing random_state=0, and skips sampling entirely when the column fits in the sample. User UDFs remain responsible for their own determinism.

Symptom. A streaming generator task re-executed for lineage reconstruction can leave the pipeline silently hanging or silently dropping data, with no error logged.

Root cause. PandasBlockAccessor.size_bytes() estimates string/object columns by sampling up to 200 rows via pandas.DataFrame.sample(n=...). No seed is passed, so each call samples different rows and two blocks holding identical data report different sizes.
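The effect of the missing seed can be seen with pandas alone. The sketch below is not the Ray implementation: `sampled_bytes` is a hypothetical helper that only sums string lengths over a 200-row sample, and the column contents are invented so that different samples give different totals.

```python
import pandas as pd

# Values of varying length, so which rows get sampled changes the total.
col = pd.Series(["x" * (i % 97) for i in range(10_000)], dtype="object")


def sampled_bytes(series, n=200, random_state=None):
    """Hypothetical helper: total character count of n sampled values."""
    sample = series.sample(n=n, random_state=random_state)
    return int(sample.str.len().sum())


# No seed: each call draws different rows, so the totals usually differ.
unseeded = {sampled_bytes(col) for _ in range(20)}

# Fixed seed: every call draws the same rows, so the total is stable.
seeded = {sampled_bytes(col, random_state=0) for _ in range(20)}

print(len(unseeded), len(seeded))
```

With a fixed `random_state` the set of seeded results collapses to a single value, while the unseeded calls typically produce many distinct totals.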

How it turns into a bug. That estimate drives block splitting:

- OutputBuffer.next() calls size_bytes() to compute target_num_rows, so a task can split the same output into a different number of blocks each time it runs.
- The first successful attempt pins the end-of-stream index to the object count it produced.
- On re-execution, a different split produces a different object count:
  - fewer → downstream consumers block on indices that are never produced (silent hang);
  - more → the extras past the pinned end-of-stream index are dropped by ObjectRefStream::InsertToStream (silent data loss).
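The chain above can be illustrated with a toy model. This is only a sketch under stated assumptions: `num_output_blocks` and `replay_outcome` are hypothetical names, the splitting formula is a simplification rather than the actual OutputBuffer logic, and the byte estimates and the 128,000-byte target are made-up numbers chosen to show how a modest change in the estimate shifts the block count.

```python
import math


def num_output_blocks(num_rows, estimated_bytes, target_max_block_size):
    """Hypothetical splitter: rows per block derived from a size estimate."""
    bytes_per_row = estimated_bytes / num_rows
    target_num_rows = max(1, int(target_max_block_size // bytes_per_row))
    return math.ceil(num_rows / target_num_rows)


def replay_outcome(pinned_count, replayed_count):
    """Classify a replay against the pinned end-of-stream index."""
    if replayed_count < pinned_count:
        return "hang: consumers wait on indices that are never produced"
    if replayed_count > pinned_count:
        return "data loss: objects past the pinned index are dropped"
    return "ok"


# Same 10,000 rows, two different size estimates for the same data.
first = num_output_blocks(10_000, 1_000_000, 128_000)
replay = num_output_blocks(10_000, 1_150_000, 128_000)
print(first, replay, replay_outcome(first, replay))
```

In this toy setup a 15% difference in the estimate is enough to turn 8 output blocks into 9, which is the kind of mismatch the pinned end-of-stream index cannot absorb.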

Fix. Pass random_state=0 to sample(), and skip sampling when the whole column fits in the sample (no more rows than sample_size). Identical input now always yields the same size estimate. This removes one Ray-internal source of streaming-generator object-count non-determinism; user UDFs remain responsible for their own determinism.
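A minimal sketch of the shape of this fix, assuming a standalone estimator rather than the real PandasBlockAccessor code: `estimate_column_bytes` and `SAMPLE_SIZE` are hypothetical names, and measuring values by their UTF-8 length is a simplification chosen for illustration.

```python
import pandas as pd

SAMPLE_SIZE = 200  # assumption: mirrors the "up to 200 rows" described above


def estimate_column_bytes(series, sample_size=SAMPLE_SIZE):
    """Hypothetical estimator: deterministic size estimate for a string column."""
    total = len(series)
    if total == 0:
        return 0
    if total <= sample_size:
        # The whole column fits in the sample: measure it directly, no RNG.
        values = series
    else:
        # Fixed seed: identical data always yields the same sampled rows.
        values = series.sample(n=sample_size, random_state=0)
    sampled = sum(len(str(v).encode("utf-8")) for v in values)
    # Scale the sampled total up to the full column length.
    return int(sampled * total / len(values))


data = [f"str_{i}" for i in range(10_000)]
a = pd.Series(data, dtype="object")
b = pd.Series(data, dtype="object")
print(estimate_column_bytes(a) == estimate_column_bytes(b))
```

The direct path is exact for small columns and involves no randomness, so only the sampled path depends on the seed.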

Scope. Only PandasBlockAccessor.size_bytes() is affected — the Arrow equivalent already uses table.nbytes and is deterministic. Pandas blocks are produced along several common paths:

- a UDF that returns a pandas.DataFrame (e.g. map_batches(fn, batch_format="pandas")), kept as a pandas block by BlockAccessor.batch_to_block;
- batches that Arrow cannot convert (object/ragged columns), which fall back to pandas blocks;
- split_repartition preserving a pandas input schema; and
- the built-in preprocessors (encoder, vectorizer, imputer).

On released branches these paths default to pandas, so the bug is broadly reachable. On master, #61733 added a default that converts many UDF outputs to Arrow, but the fallback and pandas-schema paths above still build pandas blocks.

Related issues

None.

Additional information

Reproduction (before this PR, the two printed sizes frequently differ):

import pandas as pd
from ray.data._internal.pandas_block import PandasBlockAccessor

# Two separate DataFrames built from exactly the same 10,000 strings.
data = [f"str_{i}" for i in range(10_000)]
b1 = pd.DataFrame({"col": pd.Series(data, dtype="string")})
b2 = pd.DataFrame({"col": pd.Series(data, dtype="string")})

# Identical data should report identical sizes; before this PR they often do not.
print(PandasBlockAccessor.for_block(b1).size_bytes())
print(PandasBlockAccessor.for_block(b2).size_bytes())

Test: added TestSizeBytes.test_deterministic_across_blocks, asserting that two blocks holding identical data report the same size_bytes().

gemini-code-assist

Code Review

This pull request ensures that size estimation in PandasBlockAccessor.size_bytes() is deterministic by using a fixed random_state=0 when sampling columns. This prevents issues like silent hangs or data loss during lineage reconstruction. A corresponding unit test was added to verify determinism. The reviewer suggested a performance optimization to bypass sampling when the sample size equals the total size.

PandasBlock.size_bytes estimates the byte size of string/object columns by sampling up to 200 rows with pandas.DataFrame.sample(n=), which uses a fresh random seed on each call by default. The estimate is therefore non-deterministic: two blocks holding identical data can report different sizes. OutputBuffer.next uses size_bytes to decide how to split blocks, so this non-determinism makes a streaming generator task produce a different number of objects across replay attempts (e.g. lineage reconstruction), which can hang or silently drop data downstream.

This passes a fixed random_state to sample() so the estimate is deterministic. When the whole column is sampled (common for blocks under 200 rows), it reads the values directly and skips sample() entirely, avoiding the permutation/RNG/copy overhead; that path has no randomness and is trivially deterministic.

Adds a unit test asserting two blocks with identical data report the same size_bytes.

Signed-off-by: dragongu <andrewgu@vip.qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dragongu requested a review from a team as a code owner June 27, 2026 15:13

gemini-code-assist Bot reviewed Jun 27, 2026

Comment thread python/ray/data/_internal/pandas_block.py Outdated

dragongu force-pushed the fix/pandas_size_bytes_determinism branch from 4839bbe to 1d3d4ec Compare June 27, 2026 15:22

dragongu changed the title ~~[Data] Make PandasBlock.size_bytes deterministic to avoid streaming generator hangs~~ [Data] Make PandasBlock.size_bytes deterministic Jun 27, 2026

dragongu mentioned this pull request Jun 27, 2026

[Core] Fail-fast on streaming generator replay object-count mismatch #64394

Open

ray-gardener Bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Jun 27, 2026

dragongu wants to merge 1 commit into ray-project:master from dragongu:fix/pandas_size_bytes_determinism
