fix(streamer): Include start commit in S3/GCS IncrSource incremental query by yihua · Pull Request #18949 · apache/hudi

yihua · 2026-06-10T00:44:45Z

Describe the issue this Pull Request addresses

S3/GCS cloud-object incremental sources can silently drop records whenever a previous batch persisted a commit#fileKey mid-commit-pagination checkpoint (i.e., the prior batch hit sourceLimit before exhausting the start commit's files). Files in the start commit after the checkpoint key become unreachable, and the persisted checkpoint advances past them as a bare instant.

Affected sources and conditions:

S3EventsHoodieIncrSource / GcsEventsHoodieIncrSource
Triggered when at least one upstream source-events commit exceeds hoodie.deltastreamer.read.source.limit

Common triggers: cold-start backfills against a source table with a big initial commit, bursty event writers, low sourceLimit overrides. Steady-state streams whose upstream commits fit within sourceLimit are unaffected. HoodieIncrSource (non-cloud) does not go through QueryRunner and is unaffected.

Summary and Changelog

Root cause 1: start commit dropped from the scan. QueryRunner.runIncrementalQuery passes queryInfo.getStartInstant() as the Spark START_COMMIT. The Spark V1 incremental relation's findInstantsInRange is (start, end] (start-exclusive), so the start commit is dropped from the scan. The downstream (commit_time || object_key) > 'commit#fileKey' filter in IncrSourceHelper.filterAndGenerateCheckpointBasedOnSourceLimit then matches nothing in the start commit, the empty-batch branch fires, and the new checkpoint is emitted as endInstant with no #fileKey suffix. The next batch resumes past the gap. Regression introduced by ab4c7774a6 [HUDI-8141] Support incremental query with completion time (#11947).
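The interaction described above can be reproduced with a small, self-contained model. This is only a sketch under assumptions: the function names, the row shape, and the `commit#key` concatenation are hypothetical stand-ins for the real relation and for the filter in IncrSourceHelper, not Hudi code. It shows that a start-exclusive range drops the start commit when the start commit itself is used as the lower bound, and that using the previous instant keeps it in the scan.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (commit_time, object_key); hypothetical row shape


def instants_in_range(timeline: List[str], start: str, end: str) -> List[str]:
    """Model of a start-exclusive, end-inclusive range (start, end]."""
    return [t for t in timeline if start < t <= end]


def resume_from_checkpoint(rows: List[Row], timeline: List[str],
                           start_commit_option: str, end: str,
                           checkpoint: str) -> List[Row]:
    """Scan commits in (start_commit_option, end], then keep rows past the checkpoint."""
    commits = set(instants_in_range(timeline, start_commit_option, end))
    scanned = [r for r in rows if r[0] in commits]
    # Assumed shape of the downstream filter: "commit#key" string comparison.
    return [r for r in scanned if f"{r[0]}#{r[1]}" > checkpoint]


PREVIOUS = "20260101000000000"
START = "20260102000000000"
timeline = [PREVIOUS, START]
rows = [
    (PREVIOUS, "data/0.json"),
    (START, "data/a.json"),
    (START, "data/b.json"),
    (START, "data/c.json"),
]
# The prior batch stopped after data/a.json inside the START commit.
checkpoint = f"{START}#data/a.json"

# Lower bound = start commit: the range (START, START] is empty.
dropped = resume_from_checkpoint(rows, timeline, START, START, checkpoint)
# Lower bound = previous instant: the range (PREVIOUS, START] contains START.
resumed = resume_from_checkpoint(rows, timeline, PREVIOUS, START, checkpoint)

print(dropped)
print(resumed)
```

In this model the first call yields an empty batch, which corresponds to the empty-batch branch that emits a bare instant, while the second call surfaces the two files after the checkpoint key.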

Root cause 2: file-group-reader incremental scans return no rows when _hoodie_commit_time is pruned. The incremental relations enforce the commit-time span filters only via parquet filter push-down inside HoodieFileGroupReaderBasedFileFormat. When Spark's column pruning drops _hoodie_commit_time from the scan schema, which happens for count(), isEmpty() (the first operation the cloud sources run on the batch), and any projection without meta fields, parquet evaluates the predicate against a missing column as all-null and filters out every row, so the batch looks empty and the checkpoint again advances as a bare instant. Sessions with HoodieSparkSessionExtension are shielded because AdaptIngestionTargetLogicalRelations injects the same filters into the logical plan; sessions without the extension relied solely on the broken push-down path. Affects both V1 and V2 incremental relations through the file group reader.
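A rough illustration of the pruned-column failure mode, assuming a simplified model in which the pushed-down predicate only sees the columns left in the scan schema and a comparison against a missing (null) column never evaluates to true. The names `span_predicate` and `pushed_down_scan` are hypothetical and do not correspond to Hudi or Parquet APIs.

```python
from typing import Dict, List, Optional

FileRow = Dict[str, str]


def span_predicate(row: FileRow, start: str, end: str) -> Optional[bool]:
    """Commit-time span filter; returns None (SQL NULL) if the column is absent."""
    commit_time = row.get("_hoodie_commit_time")
    if commit_time is None:
        return None
    return start < commit_time <= end


def pushed_down_scan(file_rows: List[FileRow], scan_schema: List[str],
                     start: str, end: str) -> List[FileRow]:
    """Apply the predicate to rows that only carry the scan-schema columns."""
    out = []
    for full_row in file_rows:
        row = {c: full_row[c] for c in scan_schema}  # pruned columns are invisible
        if span_predicate(row, start, end) is True:  # NULL is treated as "drop"
            out.append(row)
    return out


file_rows = [
    {"_hoodie_commit_time": "20260101000000000", "key": "k1"},
    {"_hoodie_commit_time": "20260102000000000", "key": "k2"},
    {"_hoodie_commit_time": "20260103000000000", "key": "k3"},
]
start, end = "20260101000000000", "20260103000000000"

# select *: the commit-time column is in the scan schema, so the filter works.
full_scan = pushed_down_scan(file_rows, ["_hoodie_commit_time", "key"], start, end)
# count()-style scan: no columns requested, so the filter column is pruned.
count_scan = pushed_down_scan(file_rows, [], start, end)

print(len(full_scan), len(count_scan))
```

The second scan returns no rows even though two rows fall inside the span, which is the shape of failure the description attributes to count() and isEmpty().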

Fix.

Force the V1 incremental relation regardless of source table version (QueryRunner): set INCREMENTAL_READ_TABLE_VERSION = 6 unconditionally. Cloud event sources always use V1 checkpoints (commit#fileKey, requested-time) externally, so they always route through the V1 incremental relation internally.
Pass previousInstant (not startInstant) to START_COMMIT (QueryRunner): with start-exclusive semantics, the scan range (previousInstant, end] still includes the start commit needed for commit#fileKey pagination.
Read required-filter columns regardless of Spark's column pruning (HoodieFileGroupReaderBasedFileFormat): columns referenced by the relation's required filters but pruned from the scan schema are now read from the file and projected away after filtering. This restores format-level enforcement: HUDI-6658 moved that enforcement to the HoodieSparkSessionExtension-gated plan-injection rule (leaving sessions without the extension unprotected), and HUDI-7567 removed its last remnants. Snapshot/CDC reads have no required filters, so their read path is unchanged.
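The third item, reading filter-referenced columns and projecting them away afterwards, can be sketched as follows. This is a simplified model under assumptions, not the HoodieFileGroupReaderBasedFileFormat implementation: `scan_with_required_filter_columns` and its parameters are hypothetical names, and rows are plain dictionaries.

```python
from typing import Callable, Dict, List

FileRow = Dict[str, str]


def scan_with_required_filter_columns(
    file_rows: List[FileRow],
    requested_schema: List[str],
    filter_columns: List[str],
    predicate: Callable[[FileRow], bool],
) -> List[FileRow]:
    """Read requested plus filter-only columns, filter, then drop the extras."""
    extra = [c for c in filter_columns if c not in requested_schema]
    read_schema = list(requested_schema) + extra
    out = []
    for full_row in file_rows:
        row = {c: full_row[c] for c in read_schema}
        if predicate(row):
            # Project back to what the caller asked for.
            out.append({c: row[c] for c in requested_schema})
    return out


file_rows = [
    {"_hoodie_commit_time": "20260101000000000", "key": "k1"},
    {"_hoodie_commit_time": "20260102000000000", "key": "k2"},
    {"_hoodie_commit_time": "20260103000000000", "key": "k3"},
]
start, end = "20260101000000000", "20260103000000000"


def in_span(row: FileRow) -> bool:
    return start < row["_hoodie_commit_time"] <= end


# count()-style scan: nothing requested, but the filter column is still read.
count_rows = scan_with_required_filter_columns(
    file_rows, [], ["_hoodie_commit_time"], in_span)
# Narrow projection without meta fields.
narrow_rows = scan_with_required_filter_columns(
    file_rows, ["key"], ["_hoodie_commit_time"], in_span)

print(len(count_rows), narrow_rows)
```

The output rows never expose `_hoodie_commit_time` unless it was requested, so the caller sees the schema it asked for while the span filter is still enforced.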

Tests.

testRealQueryRunnerResumesMidCommitPagination in both TestS3EventsHoodieIncrSource and TestGcsEventsHoodieIncrSource (@ParameterizedTest over source meta-table versions 6/8 x COW/MOR) exercises a real QueryRunner against an on-disk events meta-table with the file group reader enabled, resumes from a mid-commit commit#fileKey checkpoint with sourceLimit smaller than the remaining files, and asserts both the next persisted checkpoint and the exact files passed downstream. The source meta-table is written non-partitioned with the production S3 events schema (no test-only schema changes); the MOR runs assert the second commit's records land in log files so the read exercises log merging. Supporting S3EventsHoodieIncrSourceHarness changes: non-partitioned meta-table writes, multi-record commits, and using the table's commit action type so MOR writes commit as delta commits.
New TestIncrementalReadWithFileGroupReader covers {COW, MOR} x {source v6 + v6 read, source v8 + v6 read, source v8 + v8 read} on a session without HoodieSparkSessionExtension (no plan-injected filters; the file format alone enforces the span filters). Each run writes 3 insert commits (small-file handling keeps a single file group with base files only, validated as a precondition) and 3 update commits (one log file each on MOR, base rewrites on COW, also validated), then queries incremental ranges over base-only (multi-file and single-file), base-plus-some-logs, logs-only, and empty windows, validating the exact row set for select * plus a narrow projection, count(), and isEmpty per range. One key is updated in every update commit, so a range spanning several updates must surface it exactly once with the latest in-range value; a precondition also asserts the latest base file retains rows from multiple commits: the carried-over rows the span filters must exclude. Without the fix, the pruned-schema shapes fail: COW count() returns 0; MOR count() returns only log-merged records with base-file rows silently dropped.

Impact

Behavior change is contained to incremental reads:

The incremental read of the source events meta-table now always uses the V1 relation (requested-time semantics), matching the V1 commit#fileKey checkpoint contract regardless of source table version.
commit#fileKey mid-commit-pagination resumes now re-include the start commit in the scan (fixes the silent data drop).
Incremental queries through the file group reader now return correct results for query shapes that prune _hoodie_commit_time (e.g., count(), isEmpty()), for both V1 and V2 relations.

No API, config, or on-disk format changes; HoodieIncrSource (non-cloud) is untouched.

Risk Level

low

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

…query

S3/GCS cloud-object incremental sources can silently drop records whenever a previous batch persisted a commit#fileKey mid-commit-pagination checkpoint (i.e., the prior batch hit sourceLimit before exhausting the start commit's files). Files in the start commit after the checkpoint key become unreachable, and the persisted checkpoint advances past them as a bare instant.

Root cause: QueryRunner.runIncrementalQuery passes queryInfo.getStartInstant() as the Spark START_COMMIT. The Spark incremental relation's findInstantsInRange is (start, end] (start-exclusive), so the start commit is dropped from the scan. The downstream (commit_time || object_key) > 'commit#fileKey' filter then matches nothing in the start commit, and filterAndGenerateCheckpointBasedOnSourceLimit falls through its empty-batch branch, emitting endInstant as a bare instant with no #fileKey suffix. The next batch resumes past the gap.

Fix: pass queryInfo.getPreviousInstant() so the resulting scan range (previousInstant, end] includes the start commit (startInstant) while preserving start-exclusive relation semantics. Required for cloud-object sources whose commit#fileKey pagination depends on re-reading the start commit to find files past the persisted key.

Adds testRealQueryRunnerResumesMidCommitPagination to both TestS3EventsHoodieIncrSource and TestGcsEventsHoodieIncrSource. The new tests exercise a real QueryRunner against an on-disk Hudi events meta-table, resuming from a mid-commit commit#fileKey checkpoint with sourceLimit smaller than the remaining files. They assert both the next persisted checkpoint and the exact files passed downstream (via captor on loadAsDataset). The existing tests mocked QueryRunner.run() to return inputDs unfiltered for incremental queries and could not catch a START_COMMIT-handling regression.

Long.parseLong(startCommit) - 1 was producing a previousInstant string of shorter length than the real timeline instants, so findInstantsInRange's lexicographic compare excluded the start commit and the empty-batch path silently advanced the checkpoint past it.
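The length mismatch is easy to reproduce with plain string comparison. The sketch below assumes zero-padded test instants (hypothetical values) and a start-exclusive range helper that stands in for findInstantsInRange; the zero-padding shown at the end is only one way to keep the comparison well-defined and is not necessarily what this PR does.

```python
from typing import List


def instants_in_range(timeline: List[str], start: str, end: str) -> List[str]:
    """Start-exclusive, end-inclusive range using lexicographic comparison."""
    return [t for t in timeline if start < t <= end]


# Hypothetical zero-padded instants, as a test harness might use.
timeline = ["00000000000000001", "00000000000000002"]
start_commit = "00000000000000002"
end_commit = "00000000000000002"

# Numeric decrement loses the leading zeros: "1" instead of "00000000000000001".
naive_previous = str(int(start_commit) - 1)
# Re-padding to the original width keeps string ordering consistent.
padded_previous = naive_previous.zfill(len(start_commit))

# "1" sorts after every instant starting with "0", so the range is empty.
with_naive = instants_in_range(timeline, naive_previous, end_commit)
with_padded = instants_in_range(timeline, padded_previous, end_commit)

print(with_naive)
print(with_padded)
```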

The V1 incremental relation (where the QueryRunner fix actually takes effect) is only chosen when the source table version is < 8. The test harness defaults to v8, which routed the test through the V2 relation and broke the assertion.

…ryRunner

Cloud event incremental sources (S3/GCS) always use V1 checkpoints (commit#fileKey, requested-time). They should always route through the V1 incremental relation regardless of the source meta-table's actual version. Parameterizes the regression test on source version {6, 8} to cover both.

The S3 metadata test schema lacked a top-level partition_path field, so the V1 incremental relation's partition-schema lookup failed when the test ran a real read against the on-disk meta-table. The schema now mirrors the GCS test schema.

…umns are pruned

Incremental span filters on _hoodie_commit_time are enforced via parquet push-down, which drops all rows when Spark prunes the column from the scan schema (count(), isEmpty()). Read filter-referenced columns regardless of Spark's pruning and project them away after filtering. Re-enable the file group reader in QueryRunner. Document the getMandatoryFields contract and prune declarations not backed by required filters. Cover COW and MOR (non-partitioned, unchanged schema) source meta-tables in the S3/GCS mid-commit pagination tests.

The fix derives the extra read columns from the required filters only, so the declaration cleanup and contract docs can land separately.

Write records to a partition path consistent with the table config so the non-partitioned meta-tables list correctly, and re-write an existing key in the second commit so MOR runs produce log files. Expand the file group reader incremental test to cover COW/MOR x source table v6/v8 x v6/v8 reads, with preconditions on small-file handling and base/log file layout, incremental ranges over base-only, base-plus-log, log-only, and empty windows, and select *, narrow projection, count(), and isEmpty() per range.

The incremental read path now normalizes START_COMMIT and END_COMMIT via HoodieSqlCommonUtils.formatIncrementalInstant, which rejects single-digit instant times.

codecov-commenter · 2026-06-11T05:56:22Z

Codecov Report

❌ Patch coverage is 60.00000% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.66%. Comparing base (1fd2c36) to head (bb7b1af).
⚠️ Report is 9 commits behind head on master.

Files with missing lines	Patch %	Lines
...parquet/HoodieFileGroupReaderBasedFileFormat.scala	52.94%	5 Missing and 3 partials ⚠️

❗ There is a different number of reports uploaded between BASE (1fd2c36) and HEAD (bb7b1af).

HEAD has 31 uploads less than BASE

Flag	BASE (1fd2c36)	HEAD (bb7b1af)
spark-java-tests	18	0
spark-scala-tests	12	0
utilities	1	0

Additional details and impacted files

@@              Coverage Diff              @@
##             master   #18949       +/-   ##
=============================================
- Coverage     68.26%   53.66%   -14.60%     
+ Complexity    29500    21865     -7635     
=============================================
  Files          2542     2452       -90     
  Lines        142618   132467    -10151     
  Branches      17790    15525     -2265     
=============================================
- Hits          97352    71084    -26268     
- Misses        37261    55745    +18484     
+ Partials       8005     5638     -2367

Flag	Coverage Δ
common-and-other-modules	`45.32% <60.00%> (+0.53%)`	⬆️
hadoop-mr-java-client	`44.77% <ø> (+0.09%)`	⬆️
spark-client-hadoop-common	`48.08% <ø> (+0.02%)`	⬆️
spark-java-tests	`?`
spark-scala-tests	`?`
utilities	`?`

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
...he/hudi/utilities/sources/helpers/QueryRunner.java	`50.00% <100.00%> (+27.50%)`	⬆️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala	`50.99% <52.94%> (-33.03%)`	⬇️

... and 914 files with indirect coverage changes

hudi-bot · 2026-06-11T09:17:40Z

CI report:

0f4d233 UNKNOWN
199e19c UNKNOWN
b95efa8 UNKNOWN
ae54e8f UNKNOWN
40dc1bf UNKNOWN
92d67e5 UNKNOWN
bb7b1af Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

…cremental happy-path test

The NOTE previously claimed the underlying cause was all-null record-key column statistics; investigation showed the actual cause is Spark column pruning on the incremental relation: HoodieFileGroupReaderBasedFileFormat hides its requiredFilters (the commit-time span predicate) from Catalyst, so Spark drops _hoodie_commit_time from the scan schema for count()/isEmpty()/projections-without-meta. Parquet then evaluates the predicate against missing columns as all-null and returns 0 rows. The bug is pre-existing and reproduces without selective meta-field exclusion - this test just happens to be the first one to trip it.

Tracking fix: apache#18949 (augments the read schema with filter-only columns and projects them away after filtering).

Locally validated: with that PR's HoodieFileGroupReaderBasedFileFormat diff applied, both CoW (v6/v9) and MoR (v6/v9) happy-path tests pass with the count() assertion uncommented. Reverted the production change here since PR apache#18949 owns it; left the assertEquals(2, count()) commented out with a reference so we re-enable it once apache#18949 lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

yihua force-pushed the fix-cloud-incr-source-mid-commit-pagination branch from 0f4d233 to 2917bb5 Compare June 10, 2026 00:47

yihua changed the title ~~[HUDI-XXXXX] fix(streamer): Include start commit in S3/GCS IncrSource incremental query~~ fix(streamer): Include start commit in S3/GCS IncrSource incremental query Jun 10, 2026

yihua force-pushed the fix-cloud-incr-source-mid-commit-pagination branch from 2917bb5 to 199e19c Compare June 10, 2026 00:54

yihua force-pushed the fix-cloud-incr-source-mid-commit-pagination branch from 199e19c to b95efa8 Compare June 10, 2026 00:57

yihua added 3 commits June 9, 2026 18:04

Remove unused Arrays import in S3EventsHoodieIncrSourceHarness

ae54e8f

Simplify javadocs for cloud incr source mid-commit pagination tests

40dc1bf

Fix GcsEventsHoodieIncrSource constructor arg order in test

6bd6725

github-actions Bot added the size:M PR with lines of changes in (100, 300] label Jun 10, 2026

yihua and others added 14 commits June 9, 2026 20:02

Fix build

983c499

Fix test

db636ba

Keep getMandatoryFields declarations unchanged

af04f7b

The fix derives the extra read columns from the required filters only, so the declaration cleanup and contract docs can land separately.

Tighten test comments

cd7a6a3

Address review comments

1c9c638

Use individual static imports for Assertions and Mockito

22267a0

Fix broken fully-qualified Mockito.anyInt call

4a94340

Use timestamp-format commit times in mid-commit pagination tests

bb7b1af

The incremental read path now normalizes START_COMMIT and END_COMMIT via HoodieSqlCommonUtils.formatIncrementalInstant, which rejects single-digit instant times.

yihua wants to merge 18 commits into apache:master from yihua:fix-cloud-incr-source-mid-commit-pagination

3 participants
