feat:(DNM) add a lsm-tree based FG reader #18987

Open
danny0405 wants to merge 2 commits into apache:master from danny0405:lsm-reader

@danny0405

Contributor

Describe the issue this Pull Request addresses

RFC-103 introduces an LSM tree file-group layout where base and log files are sorted by record key and merged with a streaming k-way merge. The reader side needs a dedicated implementation for that layout without changing the existing HoodieFileGroupReader path.

The design also uses native parquet log files instead of Avro log files with embedded parquet data blocks. Native data logs use <fileId>_<writeToken>_<instant>_<version>.parquet, and native delete logs use <fileId>_<writeToken>_<instant>_<version>.delete.parquet, so common file-name parsing and file-system view classification need to recognize those files correctly.
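
To make the naming scheme above concrete, here is a minimal parsing sketch. The regular expression, the class name NativeLogNameSketch, and the assumed shapes of the write token (digits-digits-digits) and of the instant and version (digits only) are illustrative assumptions; this is not the pattern the PR adds to FSUtils.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NativeLogNameSketch {
  // Hypothetical pattern for <fileId>_<writeToken>_<instant>_<version>[.delete].parquet.
  // Assumption: the file id contains neither '_' nor '.', and the write token looks like 1-0-1.
  static final Pattern NATIVE_LOG =
      Pattern.compile("^([^._]+)_(\\d+-\\d+-\\d+)_(\\d+)_(\\d+)(\\.delete)?\\.parquet$");

  /** Parsed components of a native log file name. */
  static final class ParsedName {
    final String fileId;
    final String writeToken;
    final String instant;
    final int version;
    final boolean isDelete;

    ParsedName(String fileId, String writeToken, String instant, int version, boolean isDelete) {
      this.fileId = fileId;
      this.writeToken = writeToken;
      this.instant = instant;
      this.version = version;
      this.isDelete = isDelete;
    }
  }

  /** Returns the parsed name, or empty when the name is not a native log file. */
  static Optional<ParsedName> parse(String fileName) {
    Matcher m = NATIVE_LOG.matcher(fileName);
    if (!m.matches()) {
      return Optional.empty();
    }
    return Optional.of(new ParsedName(
        m.group(1), m.group(2), m.group(3), Integer.parseInt(m.group(4)), m.group(5) != null));
  }

  public static void main(String[] args) {
    ParsedName p = parse("abc-123_1-0-1_20260612101500000_2.delete.parquet").get();
    System.out.println(p.fileId + " v" + p.version + " delete=" + p.isDelete);
  }
}
```

Under this sketch, a name with only three underscore-separated parts (assuming base files look like <fileId>_<writeToken>_<instant>.parquet) does not match, because the pattern requires the trailing version component. That is one possible way to keep native log files out of base-file discovery.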

Summary and Changelog

Adds a separate LSM file-group reader for native parquet log files and updates common log-file parsing to recognize RFC-style native parquet data/delete logs.

Commit 1: feat:(DNM) add a lsm-tree based FG reader (f0b63593dedd)

  • Added HoodieLsmFileGroupReader as a separate reader entry point instead of modifying HoodieFileGroupReader.
  • Added LsmFileGroupRecordIterator to perform streaming sorted k-way merge over one active record per base/log file.
  • Implemented the k-way merge with a loser-tree state machine, deterministic same-key ordering, and existing BufferedRecordMerger semantics.
  • Preserved existing tie behavior for equal ordering values by processing sources in merge order: base file first, then log files ordered by instant/version/write token/suffix, so later log records win when ordering values are equal.
  • Read native parquet data logs directly through HoodieReaderContext and added reader-side handling for native delete parquet logs with the fixed delete schema.
  • Added native parquet log parsing in FSUtils and HoodieLogFile, including data log and .delete.parquet delete log names.
  • Updated AbstractTableFileSystemView so native parquet log files are classified as log files and excluded from base-file discovery.
  • Added TestHoodieLogFile coverage for native parquet data/delete log parsing and helper extraction.
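
The tie behavior described in the list above can be sketched as follows. This is a simplified illustration, not the PR's implementation: it uses java.util.PriorityQueue in place of the loser tree, represents records as {key, value} string pairs, and treats all ordering values as equal so that the source with the highest merge order wins. The names KWayMergeSketch, Source, and mergeOrder-as-an-int are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class KWayMergeSketch {
  /** One key-sorted source (base file or log file) holding a single active record. */
  static final class Source {
    final int mergeOrder;              // 0 = base file, 1..n = log files, oldest first
    final Iterator<String[]> records;  // each record is {key, value}
    String[] current;

    Source(int mergeOrder, List<String[]> sorted) {
      this.mergeOrder = mergeOrder;
      this.records = sorted.iterator();
      advance();
    }

    void advance() {
      current = records.hasNext() ? records.next() : null;
    }
  }

  static List<String> merge(List<List<String[]>> sortedSources) {
    // Order by record key first, then by merge order so same-key records pop base-first.
    PriorityQueue<Source> heap = new PriorityQueue<>(
        Comparator.<Source, String>comparing(s -> s.current[0])
            .thenComparingInt(s -> s.mergeOrder));
    for (int i = 0; i < sortedSources.size(); i++) {
      Source s = new Source(i, sortedSources.get(i));
      if (s.current != null) {
        heap.add(s);
      }
    }
    List<String> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      String key = heap.peek().current[0];
      String winner = null;
      // Drain every source whose active record has this key; the last one processed wins.
      while (!heap.isEmpty() && heap.peek().current[0].equals(key)) {
        Source s = heap.poll();
        winner = s.current[1];
        s.advance();
        if (s.current != null) {
          heap.add(s);
        }
      }
      out.add(key + "=" + winner);
    }
    return out;
  }

  static List<String> demo() {
    List<String[]> base = Arrays.asList(new String[] {"a", "base-a"}, new String[] {"b", "base-b"});
    List<String[]> log1 = Arrays.asList(new String[] {"b", "log1-b"}, new String[] {"c", "log1-c"});
    List<String[]> log2 = Arrays.asList(new String[] {"b", "log2-b"}, new String[] {"d", "log2-d"});
    return merge(Arrays.asList(base, log1, log2));
  }

  public static void main(String[] args) {
    System.out.println(demo()); // [a=base-a, b=log2-b, c=log1-c, d=log2-d]
  }
}
```

Only one record per source is held in memory at a time, which is the property that makes the merge streaming. A real reader would delegate the same-key resolution to the record merger instead of always taking the latest source.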

Impact

This adds a new reader implementation for LSM file groups without changing the existing HoodieFileGroupReader behavior. It affects common file-name parsing and file-system view classification for native parquet log files, enabling readers to distinguish native log v2 files from regular parquet base files.

No writer path, table config default, or existing Avro log reader behavior is changed. The main compatibility impact is that RFC-style native parquet log files are now recognized as Hudi log files by common utilities.

Risk Level

medium

The change touches common file parsing and file-system view classification, which are core read-path utilities. The new LSM reader also implements merge ordering semantics that must stay consistent with existing file-group merge behavior. Risk is mitigated by keeping the LSM reader separate from HoodieFileGroupReader, preserving existing merge APIs, and validating with:

  • mvn -pl hudi-common -DskipTests compile
  • mvn -pl hudi-common -DskipITs -Dtest=TestHoodieLogFile test

Documentation Update

none

This PR adds reader implementation and native log-file recognition but does not introduce a new user-facing config, default behavior change, or public documentation surface in this repo. The behavior follows the RFC-103 design.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Jun 12, 2026
@hudi-bot

Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 22.93578% with 336 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.29%. Comparing base (612e327) to head (8eb9994).
⚠️ Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...mon/table/read/lsm/LsmFileGroupRecordIterator.java | 0.00% | 217 Missing ⚠️ |
| ...ommon/table/read/lsm/HoodieLsmFileGroupReader.java | 0.00% | 90 Missing ⚠️ |
| ...mon/table/read/lsm/SpillableLsmRecordIterator.java | 73.33% | 13 Missing and 3 partials ⚠️ |
| ...c/main/java/org/apache/hudi/common/fs/FSUtils.java | 76.47% | 3 Missing and 9 partials ⚠️ |
| ...common/table/view/AbstractTableFileSystemView.java | 66.66% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18987      +/-   ##
============================================
- Coverage     68.24%   67.29%   -0.96%     
+ Complexity    29478    29100     -378     
============================================
  Files          2542     2545       +3     
  Lines        142541   142967     +426     
  Branches      17798    17877      +79     
============================================
- Hits          97281    96203    -1078     
- Misses        37254    38657    +1403     
- Partials       8006     8107     +101     
| Flag | Coverage Δ |
|---|---|
| common-and-other-modules | 44.69% <20.87%> (-0.08%) ⬇️ |
| hadoop-mr-java-client | 44.46% <3.66%> (-0.28%) ⬇️ |
| spark-client-hadoop-common | 21.11% <6.19%> (-26.95%) ⬇️ |
| spark-java-tests | 48.66% <22.01%> (-0.10%) ⬇️ |
| spark-scala-tests | 44.66% <5.27%> (-0.15%) ⬇️ |
| utilities | 37.10% <3.21%> (-0.13%) ⬇️ |

| Files with missing lines | Coverage Δ |
|---|---|
| .../apache/hudi/common/config/HoodieReaderConfig.java | 100.00% <100.00%> (ø) |
| ...va/org/apache/hudi/common/model/HoodieLogFile.java | 95.78% <100.00%> (+0.49%) ⬆️ |
| ...common/table/view/AbstractTableFileSystemView.java | 88.20% <66.66%> (-1.25%) ⬇️ |
| ...c/main/java/org/apache/hudi/common/fs/FSUtils.java | 78.14% <76.47%> (+0.13%) ⬆️ |
| ...mon/table/read/lsm/SpillableLsmRecordIterator.java | 73.33% <73.33%> (ø) |
| ...ommon/table/read/lsm/HoodieLsmFileGroupReader.java | 0.00% <0.00%> (ø) |
| ...mon/table/read/lsm/LsmFileGroupRecordIterator.java | 0.00% <0.00%> (ø) |

... and 160 files with indirect coverage changes

@github-actions github-actions Bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Jun 12, 2026

@hudi-agent hudi-agent left a comment

Contributor

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR adds a dedicated LSM file group reader and updates log-file parsing to recognize native parquet log files. The k-way merge structure and pattern handling look reasonable overall. The inline comments flag a few items worth a closer look, most importantly a regression in getFileIdFromLogPath for archive files and the delete-log ordering_val field, which the schema defines but the read path doesn't consume. Please take a look at the inline comments; after that, this should be ready for a Hudi committer or PMC member to take it from here. There are also a few naming and simplification suggestions: the main one is the magic regex group indices scattered across multiple methods, and the others are minor dead-parameter and redundant-lookup nits in the loser tree.

return logFileMatcher.get().group(1);
}
return FSUtils.getFileId(fileName);
}

Contributor

🤖 Switching to matchLogFile here looks like a regression for archive files — the original LOG_FILE_PATTERN.matcher(...) matched both log|archive, but matchLogFile only returns matches when group(3).equals("log"). The existing testArchiveLogFileName test asserts getFileIdFromLogPath(archive_path) returns the file id, which would now throw InvalidHoodiePathException. Could you preserve the archive-supporting behavior, or use LOG_FILE_PATTERN directly here?
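
A standalone reproduction of the difference described in this comment, as a hedged sketch. LOG_FILE_PATTERN is copied from the diff context shown in this PR; the two helper methods, the class name, the use of IllegalArgumentException as a stand-in for InvalidHoodiePathException, and the sample archive file name are assumptions made for illustration, not the PR's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ArchiveFileIdSketch {
  // Copied from the LOG_FILE_PATTERN visible in the diff context of this PR.
  static final Pattern LOG_FILE_PATTERN =
      Pattern.compile("^\\.([^._]+)_([^.]*)\\.(log|archive)\\.(\\d+)(_((\\d+)-(\\d+)-(\\d+))(\\.cdc)?)?$");

  /** Accepts both "log" and "archive" names, like matching LOG_FILE_PATTERN directly. */
  static String fileIdLogOrArchive(String fileName) {
    Matcher m = LOG_FILE_PATTERN.matcher(fileName);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a log or archive file: " + fileName);
    }
    return m.group(1);
  }

  /** Hypothetical stricter variant: rejects anything whose extension group is not "log". */
  static String fileIdLogOnly(String fileName) {
    Matcher m = LOG_FILE_PATTERN.matcher(fileName);
    if (!m.matches() || !"log".equals(m.group(3))) {
      throw new IllegalArgumentException("Not a log file: " + fileName);
    }
    return m.group(1);
  }

  public static void main(String[] args) {
    String archive = ".commits_.archive.1_1-0-1"; // assumed example of an archive file name
    System.out.println(fileIdLogOrArchive(archive));
    try {
      fileIdLogOnly(archive);
    } catch (IllegalArgumentException e) {
      System.out.println("log-only variant rejects it: " + e.getMessage());
    }
  }
}
```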

- AI-generated; verify before applying. React 👍/👎 to flag quality.

}
if (!areSchemasProjectionEquivalent(fileRequiredSchema, readerSchema) || !requiredSchemaAndRenamedColumns.getRight().isEmpty()) {
UnaryOperator<T> projector = readerContext.getRecordContext()
.projectRecord(fileRequiredSchema, readerSchema, requiredSchemaAndRenamedColumns.getRight());

Contributor

🤖 The DELETE_LOG_SCHEMA defines ordering_val (nullable bytes), but here we always pass OrderingValues.getDefault() rather than reading it from the record. The existing pattern (BufferedRecords.fromDeleteRecord → RecordContext.getOrderingValue(DeleteRecord)) preserves the actual ordering value when it isn't commit-time. If the write side writes a real ordering_val for any event-time-ordered table, those deletes would lose to inserts they should override. Could you confirm what the writer puts in this field and read it accordingly?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

readerContext.getRecordContext().seal(record),
readerSchema,
readerContext.getRecordContext(),
orderingFieldNames,

Contributor

🤖 The State enum values, ReaderState.state, and ReaderState.firstSameKeyIndex look like they're written by popWinner / markLoser / setChampionState but never read by the merge loop or anywhere else. If they're intended for a future optimization (e.g., skipping the inner merge loop when WINNER_WITH_NEW_KEY), worth either wiring that up or removing them to keep the loser-tree implementation easier to follow.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

this.metaClient = hoodieTableMetaClient;
this.storage = storage;

readerContext.setHasLogFiles(this.inputSplit.hasLogFiles());

Contributor

🤖 If a caller invokes multiple iterator getters (e.g. getClosableIterator() then getClosableHoodieRecordIterator()) on the same reader, this line overwrites lsmRecordIterator without closing the previous one — that earlier iterator (and any open parquet readers / spill files it holds) leaks until JVM exit. Could you guard with a null/closed check, or document that only one iterator may be obtained per reader instance?
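
One possible shape for the guard suggested here, as a hedged sketch. The classes SingleIteratorGuardSketch and TrackedIterator and the method openIterator are hypothetical stand-ins for the reader and its lsmRecordIterator field; the sketch fails fast on a second request rather than silently closing the earlier iterator, which is only one of the two options mentioned.

```java
final class SingleIteratorGuardSketch implements AutoCloseable {
  /** Stand-in for an iterator that owns file handles or spill files. */
  static final class TrackedIterator implements AutoCloseable {
    boolean closed;

    @Override
    public void close() {
      closed = true;
    }
  }

  private TrackedIterator current;

  /** Hands out a new iterator only if the previous one has been closed. */
  TrackedIterator openIterator() {
    if (current != null && !current.closed) {
      throw new IllegalStateException(
          "An iterator is already open on this reader; close it before requesting another");
    }
    current = new TrackedIterator();
    return current;
  }

  @Override
  public void close() {
    if (current != null) {
      current.close();
      current = null;
    }
  }

  public static void main(String[] args) {
    try (SingleIteratorGuardSketch reader = new SingleIteratorGuardSketch()) {
      TrackedIterator first = reader.openIterator();
      first.close();
      reader.openIterator(); // allowed because the first iterator was closed
      System.out.println("second iterator opened after closing the first");
    }
  }
}
```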

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Path spillDirectory = Paths.get(spillBasePath);
Files.createDirectories(spillDirectory);
this.spillFile = Files.createTempFile(spillDirectory, SPILL_FILE_PREFIX, SPILL_FILE_SUFFIX).toFile();
this.spillFile.deleteOnExit();

Contributor

🤖 If spill(...) throws (or Files.createTempFile does), and sourceIterator.close() in the finally also throws, the original exception is replaced and lost. Worth catching/suppressing the close exception so the underlying IOException surfaces — debugging "failed to spill" is much easier with the real cause.
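
A minimal sketch of keeping the original failure visible, assuming the source iterator can be treated as a java.io.Closeable. It relies on try-with-resources, which attaches an exception thrown by close() to the primary exception as a suppressed exception instead of replacing it. The names SpillCloseSketch, IoAction, and spillThenClose are hypothetical.

```java
import java.io.Closeable;
import java.io.IOException;

final class SpillCloseSketch {
  /** Hypothetical stand-in for the spill step. */
  interface IoAction {
    void run() throws IOException;
  }

  /** Runs the spill, then closes the source; a close failure never hides a spill failure. */
  static void spillThenClose(IoAction spill, Closeable source) throws IOException {
    try (Closeable closing = source) {
      spill.run();
    }
  }

  /** Both steps fail here; the spill failure stays primary and the close failure is suppressed. */
  static String demo() {
    try {
      spillThenClose(
          () -> { throw new IOException("spill failed"); },
          () -> { throw new IOException("close failed"); });
      return "no exception";
    } catch (IOException e) {
      return e.getMessage() + " / suppressed: " + e.getSuppressed()[0].getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(demo()); // spill failed / suppressed: close failed
  }
}
```

If the spill succeeds and only close() fails, the close exception propagates as the primary exception, so no failure is silently dropped.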

- AI-generated; verify before applying. React 👍/👎 to flag quality.

public static final String PATH_SEPARATOR = "/";
public static final Pattern LOG_FILE_PATTERN =
Pattern.compile("^\\.([^._]+)_([^.]*)\\.(log|archive)\\.(\\d+)(_((\\d+)-(\\d+)-(\\d+))(\\.cdc)?)?$");
public static final Pattern NATIVE_LOG_FILE_PATTERN =

Contributor

🤖 The pattern uses [^_]+ for the file id, but Hudi file ids elsewhere in BASE_FILE_PATTERN allow [a-zA-Z0-9-]+ and LOG_FILE_PATTERN allows [^._]+. Are we sure native log file ids will never contain . or other special characters? Tightening this to [a-zA-Z0-9-]+ (or [^._]+ for consistency with the existing log pattern) would avoid accidental matches against unrelated .parquet paths.
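
The effect of the character class can be seen in a small sketch. Both patterns below are hypothetical and heavily simplified (a file id, an underscore, digits, then .parquet); they are not the PR's NATIVE_LOG_FILE_PATTERN and differ from each other only in the file-id class.

```java
import java.util.regex.Pattern;

final class FileIdCharClassSketch {
  // Hypothetical simplified patterns; only the file-id character class differs.
  static final Pattern LOOSE = Pattern.compile("^([^_]+)_(\\d+)\\.parquet$");
  static final Pattern STRICT = Pattern.compile("^([^._]+)_(\\d+)\\.parquet$");

  static boolean matchesLoose(String name) {
    return LOOSE.matcher(name).matches();
  }

  static boolean matchesStrict(String name) {
    return STRICT.matcher(name).matches();
  }

  public static void main(String[] args) {
    String unrelated = "export.v2.final_7.parquet"; // dots inside the would-be file id
    System.out.println("loose:  " + matchesLoose(unrelated));  // true
    System.out.println("strict: " + matchesStrict(unrelated)); // false
  }
}
```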

- AI-generated; verify before applying. React 👍/👎 to flag quality.

return Integer.compare(left.mergeOrder, right.mergeOrder);
}

private void markLoser(int loserIndex, int winnerIndex, int compareResult) {

Contributor

🤖 nit: compareResult is accepted here but never referenced in the method body — could you drop it from the signature? The sameKey check re-derives the relationship directly from the keys, so the parameter is dead weight that makes callers look like they're communicating something meaningful to markLoser.

- AI-generated; verify before applying. React 👍/👎 to flag quality.


private void markLoser(int loserIndex, int winnerIndex, int compareResult) {
ReaderState<T> loser = leaves.get(loserIndex);
boolean sameKey = leaves.get(loserIndex).current.getRecordKey().equals(leaves.get(winnerIndex).current.getRecordKey());

Contributor

🤖 nit: loser already holds leaves.get(loserIndex), so leaves.get(loserIndex).current.getRecordKey() on this line could just be loser.current.getRecordKey() — a bit easier to read.

- AI-generated; verify before applying. React 👍/👎 to flag quality.
