Spark: Reduce TestRewriteDataFilesAction data volume to speed up CI by wombatu-kun · Pull Request #16565 · apache/iceberg

wombatu-kun · 2026-05-26T06:37:07Z

Part of #16397.

TestRewriteDataFilesAction is the #2 slowest Spark test class in spark-ci (~12.4 min in the profiling gist linked from #16397). It is parameterized only on formatVersion = [2, 3, 4] — each version is meaningful (v2 position deletes, v3 deletion vectors, v4 Parquet manifests) — so its matrix cannot be trimmed. Its runtime is instead dominated by data volume: a shared SCALE = 400000 consumed by ~50 @TestTemplate methods that each write and then rewrite ~400k rows, across three format versions.

What changed

Most methods only assert on file/snapshot counts and rewrite structure, which do not depend on the absolute row count, so they now use a small SCALE = 400. The few methods whose assertions genuinely depend on large files keep the original volume via a new LARGE_SCALE = 400000 constant, so they stay byte-for-byte equivalent: testBinPackSplitLargeFile, testBinPackCombineMixedFiles, testBinPackCombineMediumFiles, testAutoSortShuffleOutput, and testZOrderSort. This mirrors the sibling TestRewritePositionDeleteFilesAction, which already uses SCALE = 4000.

The same change is applied identically to the v3.5, v4.0, and v4.1 Spark trees.
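
As a rough illustration of why the split is safe (a minimal sketch, not the actual diff): count-based assertions give the same answer at either volume, while a size-dependent assertion only triggers at the large one. Only the two constant names and values come from this PR. The class name ScaleSketch, the helpers writeFiles and filesAfterSplit, and the 100000-row split threshold are hypothetical stand-ins for the real table-writing and rewrite code.

```java
import java.util.ArrayList;
import java.util.List;

class ScaleSketch {
  // Small default: enough for tests that only assert on file/snapshot counts.
  static final int SCALE = 400;
  // Original volume, reserved for tests whose assertions need large files.
  static final int LARGE_SCALE = 400000;

  // Hypothetical stand-in for writing a table: spreads rowCount rows over
  // fileCount "files" and returns the row count of each file
  // (remainder rows are dropped for simplicity).
  static List<Integer> writeFiles(int rowCount, int fileCount) {
    List<Integer> files = new ArrayList<>();
    for (int i = 0; i < fileCount; i++) {
      files.add(rowCount / fileCount);
    }
    return files;
  }

  // Hypothetical stand-in for a bin-pack rewrite: a file with more than
  // splitThresholdRows rows is split into chunks of at most that many rows.
  static int filesAfterSplit(List<Integer> files, int splitThresholdRows) {
    int result = 0;
    for (int rows : files) {
      result += Math.max(1, (rows + splitThresholdRows - 1) / splitThresholdRows);
    }
    return result;
  }

  public static void main(String[] args) {
    // Count-based check: the number of input files is identical at either scale.
    System.out.println(writeFiles(SCALE, 4).size() == writeFiles(LARGE_SCALE, 4).size());
    // Size-based check: only the large volume yields a file big enough to split.
    System.out.println(filesAfterSplit(writeFiles(SCALE, 1), 100000));
    System.out.println(filesAfterSplit(writeFiles(LARGE_SCALE, 1), 100000));
  }
}
```

In this toy model the small volume leaves the single file untouched while the large volume splits it, which is the kind of behavior the five LARGE_SCALE tests rely on.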

Measured impact

Measured locally as the JUnit testsuite time summed across the three formatVersion suites, three runs each, via cleanTest test --no-build-cache (forces real re-execution, no cache):

Run	Mean of 3 runs	Tests
Before	688 s (11.5 min)	171 pass / 0 fail
After	316 s (5.3 min)	171 pass / 0 fail

That is a ~54% reduction (≈60% at warm steady-state). Test counts and pass/fail are unchanged across all three trees, so coverage is preserved — only the data volume shrank.
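
For reference, the ~54% figure follows directly from the two means reported above. The snippet below is only a sketch of that arithmetic; the class name ReductionSketch and the method reductionPercent are illustrative and not part of the change.

```java
import java.util.Locale;

class ReductionSketch {
  // Percentage by which afterSeconds is smaller than beforeSeconds.
  static double reductionPercent(double beforeSeconds, double afterSeconds) {
    return (beforeSeconds - afterSeconds) / beforeSeconds * 100.0;
  }

  public static void main(String[] args) {
    // Means of 3 runs reported above: 688 s before, 316 s after.
    double pct = reductionPercent(688, 316);
    System.out.println(String.format(Locale.ROOT, "%.1f%%", pct)); // prints 54.1%
  }
}
```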

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kevinjqliu · 2026-05-26T16:50:40Z

Thanks for the PR! This was one of the optimizations I observed as well.
Would be great to get some Spark folks to review this; I want to make sure that the smaller SCALE value does not affect the semantics of those tests.

kevinjqliu · 2026-05-26T16:53:05Z

btw TestRewriteDataFilesAction is run sequentially 168 times in CI, totaling ~13 minutes
https://gist.github.com/kevinjqliu/86827bedf9d02adad36881f476484208#top-25--core-spark

so a 50% reduction should drop ~6 minutes in total wall time for spark-ci 🥳

Spark: Reduce TestRewriteDataFilesAction data volume to speed up CI

bb80a8b


github-actions Bot added the spark label May 26, 2026

wombatu-kun wants to merge 1 commit into apache:main from wombatu-kun:issue/16397-rewrite-data-files-test-scale
