Skip to content

Spark: Trim TestStructuredStreamingRead3 parameter rows from 8 to 2#16559

Merged
stevenzwu merged 1 commit into
apache:mainfrom
stevenzwu:spark-trim-streaming-read
May 27, 2026
Merged

Spark: Trim TestStructuredStreamingRead3 parameter rows from 8 to 2#16559
stevenzwu merged 1 commit into
apache:mainfrom
stevenzwu:spark-trim-streaming-read

Conversation

@stevenzwu
Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu commented May 24, 2026

Summary

Reduces TestStructuredStreamingRead3 parameter rows from 8 to 2 in v3.5 / v4.0 / v4.1, dropping the catalogs that don't add meaningful coverage for streaming-read semantics:

  • testhive (Hive, async=true) — Hive-metastore baseline.
  • testrest (REST, async=false) — REST is the OSS-strategic catalog.

Removed: testhadoop (HadoopCatalog isn't recommended for production) and spark_catalog (SessionCatalog wrapper differences live in DDL/table-resolution paths, not streaming reads — both SparkCatalog and SparkSessionCatalog resolve to SparkTable once a table is identified as Iceberg, and streaming reads don't exercise the wrapper-specific code paths).

The trim cuts each test class invocation count from 264 → 66 (75% reduction).

TestStructuredStreamingRead3 was the highest-CPU test class in the Spark core CI job (20.3% of total test CPU on (17, 4.1, 2.13, core), ~931 CPU-sec out of 4595). Per-row coverage analysis shows no test method uses assumeThat(catalogName), so no test gets silenced by the trim — every test still runs across both rows.

Axis coverage — before

# catalog async
1 testhive false
2 testhive true
3 testhadoop false
4 testhadoop true
5 testrest false
6 testrest true
7 spark_catalog (Hive) false
8 spark_catalog (Hive) true
Axis Values present (rows)
Catalog testhive (1, 2) · testhadoop (3, 4) · testrest (5, 6) · spark_catalog (7, 8)
async false (1, 3, 5, 7) · true (2, 4, 6, 8)

Axis coverage — after

# catalog async
1 testhive true
2 testrest false
Axis Values present (rows)
Catalog testhive (1) · testrest (2)
async true (1) · false (2)

Both async values still tested; both strategic production catalogs still tested. Joint coverage of (testhive, false) and (testrest, true) is sacrificed — but no test in the class depends on those joint combinations specifically (no assumeThat on catalogName or stacked predicates).

Design rationale

  • Streaming read semantics aren't catalog-specific. A streaming read on an Iceberg table goes through Spark's micro-batch source machinery, the Iceberg SparkTable DSv2 read path, and the IncrementalScan planning logic. The catalog backend is only involved at table-resolution time. Once the table is loaded, the streaming behavior is the same regardless of catalog.
  • testhadoop dropped. HadoopCatalog isn't a production target for Iceberg. The streaming tests don't exercise anything HadoopCatalog-specific.
  • spark_catalog dropped. The SparkSessionCatalog wrapper test is more valuable when it exercises code paths the wrapper actually intercepts (DDL routing, V1-vs-V2 fallback for non-Iceberg tables) — none of which streaming reads do.
  • REST kept at async=false, Hive at async=true — distributes both async values across both catalogs, no implicit "one catalog runs both async modes" preference.

Test plan

  • spotlessCheck passes on all 3 Spark versions.
  • Local smoke run on Spark 4.1: TestStructuredStreamingRead366 tests, 0 skipped, 0 failures, 0 errors (down from 264 invocations originally; 75% reduction confirmed).
  • Verified no test method uses assumeThat(catalogName)git grep "assumeThat" in the file returned 0 matches, so the catalog axis trim cannot silence any test.
  • Full Spark CI run on this branch.

🤖 Generated with Claude Code

Reduces the parameter set in TestStructuredStreamingRead3 from 8 rows
(4 catalogs × async{T,F}) to 2 rows: testhive (async=true) and
testrest (async=false). Streaming read semantics aren't catalog-specific
in any meaningful way, and async-vs-sync planning is the only axis
TestStructuredStreamingRead3 actually exercises beyond basic Spark
streaming behavior. Both async values and the strategic catalog
backends (Hive metastore + REST) remain covered with one row each.

Drops testhadoop (HadoopCatalog isn't recommended for production)
and the spark_catalog SessionCatalog rows (the SessionCatalog wrapper
differences live in DDL/table-resolution paths, not streaming reads).

Each invocation runs 33 streaming tests, so 8→2 rows cuts the class
from 264 to 66 invocations - roughly 75% reduction in CPU time for
this test (the highest-CPU class in the Spark core CI job at 20.3% of
total test CPU).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the spark label May 24, 2026
@stevenzwu stevenzwu marked this pull request as ready for review May 25, 2026 00:53
@kevinjqliu kevinjqliu self-requested a review May 26, 2026 15:03
@kevinjqliu
Copy link
Copy Markdown
Contributor

this is great! these tests are top contributors of github action wall-time
I've benchmark top contributor of spark-ci wall time

#16397
https://gist.github.com/kevinjqliu/86827bedf9d02adad36881f476484208#top-25--core-spark
https://gist.github.com/kevinjqliu/86827bedf9d02adad36881f476484208#top-25--extensions-spark-extensions

@stevenzwu stevenzwu merged commit 36d79e7 into apache:main May 27, 2026
45 checks passed
@stevenzwu
Copy link
Copy Markdown
Contributor Author

thanks @nastra and @kevinjqliu for the review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants