Spark: Reduce TestRewriteDataFilesAction data volume to speed up CI #16565

Open
wombatu-kun wants to merge 1 commit into apache:main from wombatu-kun:issue/16397-rewrite-data-files-test-scale

@wombatu-kun
Contributor

Part of #16397.

TestRewriteDataFilesAction is the second-slowest Spark test class in spark-ci (~12.4 min in the profiling gist linked from #16397). It is parameterized only on formatVersion = [2, 3, 4], and each version is meaningful (v2 position deletes, v3 deletion vectors, v4 Parquet manifests), so its matrix cannot be trimmed. Its runtime is instead dominated by data volume: a shared SCALE = 400000 is consumed by ~50 @TestTemplate methods that each write and then rewrite ~400k rows, across three format versions.

What changed

Most methods only assert on file/snapshot counts and rewrite structure, which do not depend on the absolute row count, so they now use a small SCALE = 400. The few methods whose assertions genuinely depend on large files keep the original volume via a new LARGE_SCALE = 400000 constant, so they stay byte-for-byte equivalent: testBinPackSplitLargeFile, testBinPackCombineMixedFiles, testBinPackCombineMediumFiles, testAutoSortShuffleOutput, and testZOrderSort. This mirrors the sibling TestRewritePositionDeleteFilesAction, which already uses SCALE = 4000.
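As a rough illustration of why count-based assertions survive the smaller scale, here is a minimal, self-contained sketch. It is not the code in this PR: only the constant names and values (SCALE = 400, LARGE_SCALE = 400000) come from the description above, while `ScaleSketch` and `writeFiles` are hypothetical stand-ins for the test class and its data-writing helpers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real test writes Iceberg data files through Spark.
class ScaleSketch {
  // Values taken from the PR description.
  static final int SCALE = 400;
  static final int LARGE_SCALE = 400_000;

  // Hypothetical helper: distributes rowCount row ids round-robin
  // across numFiles in-memory "files".
  static List<List<Integer>> writeFiles(int rowCount, int numFiles) {
    List<List<Integer>> files = new ArrayList<>();
    for (int i = 0; i < numFiles; i++) {
      files.add(new ArrayList<>());
    }
    for (int row = 0; row < rowCount; row++) {
      files.get(row % numFiles).add(row);
    }
    return files;
  }

  public static void main(String[] args) {
    // The number of files is fixed by numFiles, not by the row count,
    // so an assertion on file count holds at either scale.
    int smallFiles = writeFiles(SCALE, 4).size();
    int largeFiles = writeFiles(LARGE_SCALE, 4).size();
    System.out.println(smallFiles + " files vs " + largeFiles + " files");
  }
}
```

Tests that assert on file sizes or split behavior are the exception, which is presumably why the five listed methods keep LARGE_SCALE.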

The same change is applied identically to the v3.5, v4.0, and v4.1 Spark trees.

Measured impact

Measured locally as the JUnit testsuite time summed across the three formatVersion suites, three runs each, via cleanTest test --no-build-cache (forces real re-execution, no cache):

| | Mean of 3 runs | Tests |
|---|---|---|
| Before | 688 s (11.5 min) | 171 pass / 0 fail |
| After | 316 s (5.3 min) | 171 pass / 0 fail |

That is a ~54% reduction (≈60% at warm steady-state). Test counts and pass/fail are unchanged across all three trees, so coverage is preserved — only the data volume shrank.
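For anyone re-checking the headline number, the ~54% figure follows directly from the two measured means. The snippet below is only a sketch of that arithmetic; `ReductionSketch` and `percentReduction` are illustrative names, not part of the PR.

```java
import java.util.Locale;

// Hypothetical helper class for re-deriving the reported reduction.
class ReductionSketch {
  // Percentage by which afterSeconds is smaller than beforeSeconds.
  static double percentReduction(double beforeSeconds, double afterSeconds) {
    return 100.0 * (beforeSeconds - afterSeconds) / beforeSeconds;
  }

  public static void main(String[] args) {
    // Means reported above: 688 s before, 316 s after.
    double reduction = percentReduction(688, 316);
    System.out.println(String.format(Locale.ROOT, "%.1f%%", reduction)); // prints 54.1%
  }
}
```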

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the spark label May 26, 2026
@kevinjqliu
Contributor

Thanks for the PR! This was one of the optimizations I observed as well.
Would be great to get some Spark folks to review this; I want to make sure that the smaller SCALE variable does not affect the semantics of those tests.

@kevinjqliu
Contributor

btw TestRewriteDataFilesAction is run sequentially 168 times in CI, totaling ~13 minutes
https://gist.github.com/kevinjqliu/86827bedf9d02adad36881f476484208#top-25--core-spark

so a 50% reduction should drop ~6 minutes in total wall time for spark-ci 🥳
