[Data] Make PandasBlock.size_bytes deterministic #64393

Open
dragongu wants to merge 1 commit into ray-project:master from dragongu:fix/pandas_size_bytes_determinism

@dragongu dragongu commented Jun 27, 2026

Contributor

Description

PandasBlockAccessor.size_bytes() estimates string/object column sizes by sampling up to 200 rows via pandas.DataFrame.sample(n=...) without a seed, so two blocks holding identical data can report different sizes. That estimate drives block splitting in streaming generators, so a re-executed task can emit a different number of objects than the original attempt, which leaves downstream consumers silently hanging or silently dropping data (the companion #64394 fails fast when that mismatch occurs). This PR makes the sampling deterministic by passing random_state=0, and skips sampling entirely when the whole column fits in the sample. User UDFs remain responsible for their own determinism.

Symptom. A streaming generator task re-executed for lineage reconstruction can leave the pipeline silently hanging or silently dropping data, with no error logged.

Root cause. PandasBlockAccessor.size_bytes() estimates string/object columns by sampling up to 200 rows via pandas.DataFrame.sample(n=...). No seed is passed, so each call can sample different rows, and two blocks holding identical data can report different sizes.
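
As a rough illustration of why the sampled rows matter, the sketch below scales the average byte length of a chosen subset up to the full column. It uses only the standard library; `estimate_from_rows` is a hypothetical helper and not the code in pandas_block.py, and the hand-picked row ranges merely stand in for two different random draws.

```python
import random
from typing import Sequence


def estimate_from_rows(values: Sequence[str], rows: Sequence[int]) -> int:
    """Scale the average UTF-8 length of the chosen rows to the whole column."""
    sampled_bytes = sum(len(values[i].encode("utf-8")) for i in rows)
    return int(sampled_bytes / len(rows) * len(values))


values = [f"str_{i}" for i in range(10_000)]

# Two hand-picked samples of the same column give different estimates:
# the first ten strings are 5 bytes each, the last ten are 8 bytes each.
low = estimate_from_rows(values, range(0, 10))
high = estimate_from_rows(values, range(9_990, 10_000))

# An unseeded draw picks its rows at random, so the estimate can change
# from one call to the next even though `values` never changes.
unseeded = estimate_from_rows(values, random.sample(range(len(values)), 200))
print(low, high, unseeded)
```

The unseeded result lands somewhere between the two extremes, and nothing ties it to the value a previous call returned.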

How it turns into a bug. That estimate drives block splitting:

  1. OutputBuffer.next() calls size_bytes() to compute target_num_rows, so a task can split the same output into a different number of blocks each time it runs.
  2. The first successful attempt pins the end-of-stream index to the object count it produced.
  3. On re-execution, a different split produces a different object count:
    • fewer → downstream consumers block on indices that are never produced (silent hang);
    • more → the extras past the pinned end-of-stream index are dropped by ObjectRefStream::InsertToStream (silent data loss).
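
The chain above can be modelled in a few lines. This is a simplified sketch under stated assumptions, not Ray's actual OutputBuffer or ObjectRefStream logic: `num_output_blocks`, `replay_outcome`, and the 1,000-byte target block size are all hypothetical, chosen only to show how a small change in the size estimate shifts the object count.

```python
import math


def num_output_blocks(num_rows: int, estimated_bytes: int, target_block_bytes: int) -> int:
    """Hypothetical splitter: derive rows per block from the size estimate."""
    bytes_per_row = estimated_bytes / num_rows
    target_num_rows = max(1, int(target_block_bytes / bytes_per_row))
    return math.ceil(num_rows / target_num_rows)


def replay_outcome(first_attempt_count: int, replay_count: int) -> str:
    """Compare a replay against the end-of-stream index pinned by the first attempt."""
    if replay_count < first_attempt_count:
        return "hang"  # consumers wait on indices that are never produced
    if replay_count > first_attempt_count:
        return "drop"  # objects past the pinned index are discarded
    return "ok"


# Same 10,000 rows, two slightly different size estimates for the same data.
first = num_output_blocks(10_000, 128_000, 1_000)
replay = num_output_blocks(10_000, 129_500, 1_000)
print(first, replay, replay_outcome(first, replay))
```

In this model an estimate that differs by about 1% is enough to move the block count by one, which is all it takes to hit either failure mode.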

Fix. Pass random_state=0 to sample(), and skip sampling when the whole column fits in the sample (no more than sample_size rows). Identical input now always yields the same size estimate. This removes one Ray-internal source of streaming-generator object-count non-determinism; user UDFs remain responsible for their own determinism.
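
A minimal sketch of the idea, assuming a standalone helper rather than the real method: `estimate_string_bytes` and `SAMPLE_SIZE` are hypothetical names, and the byte accounting here (UTF-8 payload only) is a simplification of whatever size_bytes() actually measures. Only `Series.sample(n=..., random_state=...)` is real pandas API.

```python
import pandas as pd

SAMPLE_SIZE = 200  # mirrors the "up to 200 rows" mentioned in the description


def estimate_string_bytes(col: pd.Series, sample_size: int = SAMPLE_SIZE) -> int:
    """Hypothetical helper: estimate the UTF-8 payload size of a string column."""
    n = len(col)
    if n == 0:
        return 0
    if n <= sample_size:
        # The whole column fits in the sample: measure it directly, no RNG involved.
        sampled = col
    else:
        # A fixed seed makes every call pick the same row positions.
        sampled = col.sample(n=sample_size, random_state=0)
    sampled_bytes = sum(len(s.encode("utf-8")) for s in sampled)
    # Scale the sampled average up to the full column length.
    return int(sampled_bytes / len(sampled) * n)


data = [f"str_{i}" for i in range(10_000)]
print(estimate_string_bytes(pd.Series(data)) == estimate_string_bytes(pd.Series(data)))
```

With the seed fixed, two separately constructed series holding the same values produce the same estimate, and short columns are measured exactly.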

Scope. Only PandasBlockAccessor.size_bytes() is affected — the Arrow equivalent already uses table.nbytes and is deterministic. Pandas blocks are produced along several common paths:

  • a UDF that returns a pandas.DataFrame (e.g. map_batches(fn, batch_format="pandas")), kept as a pandas block by BlockAccessor.batch_to_block;
  • batches that Arrow cannot convert (object/ragged columns), which fall back to pandas blocks;
  • split_repartition preserving a pandas input schema; and
  • the built-in preprocessors (encoder, vectorizer, imputer).

On released branches these paths default to pandas, so the bug is broadly reachable. On master, #61733 added a default that converts many UDF outputs to Arrow, but the fallback and pandas-schema paths above still build pandas blocks.

Related issues

None.

Additional information

Reproduction (before this PR, the two printed sizes frequently differ):

```python
import pandas as pd
from ray.data._internal.pandas_block import PandasBlockAccessor

data = [f"str_{i}" for i in range(10_000)]
b1 = pd.DataFrame({"col": pd.Series(data, dtype="string")})
b2 = pd.DataFrame({"col": pd.Series(data, dtype="string")})

print(PandasBlockAccessor.for_block(b1).size_bytes())
print(PandasBlockAccessor.for_block(b2).size_bytes())
```

  • Test: added TestSizeBytes.test_deterministic_across_blocks, asserting that two blocks holding identical data report the same size_bytes().

@dragongu dragongu requested a review from a team as a code owner June 27, 2026 15:13

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request ensures that size estimation in PandasBlockAccessor.size_bytes() is deterministic by using a fixed random_state=0 when sampling columns. This prevents issues like silent hangs or data loss during lineage reconstruction. A corresponding unit test was added to verify determinism. The reviewer suggested a performance optimization to bypass sampling when the sample size equals the total size.

Comment thread python/ray/data/_internal/pandas_block.py Outdated
PandasBlock.size_bytes estimates the byte size of string/object columns
by sampling up to 200 rows with pandas.DataFrame.sample(n=), which uses
a fresh random seed on each call by default. The estimate is therefore
non-deterministic: two blocks holding identical data can report different
sizes.

OutputBuffer.next uses size_bytes to decide how to split blocks, so this
non-determinism makes a streaming generator task produce a different
number of objects across replay attempts (e.g. lineage reconstruction),
which can hang or silently drop data downstream.

This passes a fixed random_state to sample() so the estimate is
deterministic. When the whole column is sampled (common for blocks under
200 rows), it reads the values directly and skips sample() entirely,
avoiding the permutation/RNG/copy overhead; that path has no randomness
and is trivially deterministic.

Adds a unit test asserting two blocks with identical data report the
same size_bytes.

Signed-off-by: dragongu <andrewgu@vip.qq.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dragongu dragongu force-pushed the fix/pandas_size_bytes_determinism branch from 4839bbe to 1d3d4ec Compare June 27, 2026 15:22
@dragongu dragongu changed the title from "[Data] Make PandasBlock.size_bytes deterministic to avoid streaming generator hangs" to "[Data] Make PandasBlock.size_bytes deterministic" Jun 27, 2026
@ray-gardener ray-gardener Bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Jun 27, 2026