feat(clean): support clean-by-time retention boundary and archive protection #19041
Open
fhan688 wants to merge 2 commits into
Describe the issue this Pull Request addresses
KEEP_LATEST_BY_HOURS currently picks the first completed instant after the time cutoff as the earliest commit to retain. This can make the cleaner skip files that should already be eligible for cleaning, especially when commit density is sparse around the cutoff.
This PR corrects the clean-by-time retention boundary and makes timeline archival respect the latest completed clean's earliestCommitToRetain (ECTR) consistently across timeline archiver versions. It also exposes the existing cleaner/archive configs through the Flink write path.
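The two boundary rules can be contrasted with a minimal sketch. It is not the Hudi implementation: instants are modeled as plain long timestamps in ascending order, the class and method names are hypothetical, and "at or before the cutoff" is read here as the most recent completed instant that is not after the cutoff, which is an assumption about the intended semantics.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical illustration only; not taken from the Hudi code base.
public class RetentionBoundarySketch {

    // Rule described as the current behavior: first completed instant strictly after the cutoff.
    public static OptionalLong firstCompletedAfter(List<Long> completedAsc, long cutoff) {
        return completedAsc.stream()
                .mapToLong(Long::longValue)
                .filter(t -> t > cutoff)
                .findFirst();
    }

    // Assumed reading of the corrected rule: most recent completed instant at or before the cutoff.
    public static OptionalLong lastCompletedAtOrBefore(List<Long> completedAsc, long cutoff) {
        return completedAsc.stream()
                .mapToLong(Long::longValue)
                .filter(t -> t <= cutoff)
                .max();
    }

    public static void main(String[] args) {
        long cutoff = 500L;
        List<Long> sparse = Arrays.asList(100L, 200L, 900L); // one commit after the cutoff
        List<Long> stale = Arrays.asList(100L, 200L);        // no commit after the cutoff

        System.out.println(firstCompletedAfter(sparse, cutoff));
        System.out.println(lastCompletedAtOrBefore(sparse, cutoff));
        System.out.println(firstCompletedAfter(stale, cutoff));
        System.out.println(lastCompletedAtOrBefore(stale, cutoff));
    }
}
```

In this toy model the first rule returns nothing when no completed instant falls after the cutoff, while the second still yields a boundary as long as one completed instant is old enough.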
Summary and Changelog
This PR adds clean-by-time support improvements without introducing any LSM-table-specific behavior.
Changes:
- KEEP_LATEST_BY_HOURS: the ECTR calculation now chooses the earliest completed instant at or before the retention cutoff.
- TimelineArchiverV1 and TimelineArchiverV2: archival respects the latest completed clean's ECTR when hoodie.archive.block.on.clean.ectr is enabled.
- Flink write path: exposes hoodie.clean.max.commits, hoodie.clean.empty.commit.interval.hours, and hoodie.archive.block.on.clean.ectr.

No code was copied from another project.
Impact
User-facing behavior changes:
- KEEP_LATEST_BY_HOURS computes the clean boundary more accurately against the configured time window.
- This may retain more active timeline instants in some cases when clean ECTR archive blocking is enabled, which is expected for correctness.
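For orientation, the option keys involved can be collected in a plain map, as in the sketch below. The three keys named in this PR are taken from the description. The values are illustrative placeholders, not recommended settings. hoodie.cleaner.policy and hoodie.cleaner.hours.retained are existing Hudi cleaner settings added here for context. The class name is hypothetical, and how the map is handed to a Flink job is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for the option keys; values are placeholders, not defaults.
public class CleanByTimeOptionsSketch {

    public static Map<String, String> options() {
        Map<String, String> opts = new LinkedHashMap<>();
        // Existing cleaner settings that select time-based retention.
        opts.put("hoodie.cleaner.policy", "KEEP_LATEST_BY_HOURS");
        opts.put("hoodie.cleaner.hours.retained", "24");
        // Keys named in this PR; the values below are assumptions for illustration.
        opts.put("hoodie.clean.max.commits", "1");
        opts.put("hoodie.clean.empty.commit.interval.hours", "24");
        opts.put("hoodie.archive.block.on.clean.ectr", "true");
        return opts;
    }

    public static void main(String[] args) {
        options().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```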
Risk Level
medium
The change affects cleaner retention boundary calculation and timeline archival retention decisions. The implementation is conservative: it avoids cleaning past pending instants and only blocks archival on clean ECTR when the existing config is enabled.
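The two conservative guards can be sketched as follows, under stated assumptions. Instants are modeled as long timestamps, the names are hypothetical, "not cleaning past pending instants" is read as capping the boundary at the earliest pending instant, and blocking on clean ECTR is read as keeping every instant at or after the ECTR on the active timeline. None of this is taken from the actual patch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;
import java.util.stream.Collectors;

// Hypothetical illustration of the two guards; not the Hudi implementation.
public class ConservativeBoundarySketch {

    // Assumed guard 1: the clean boundary never moves past the earliest pending instant.
    public static long capAtEarliestPending(long candidateEctr, OptionalLong earliestPending) {
        return earliestPending.isPresent()
                ? Math.min(candidateEctr, earliestPending.getAsLong())
                : candidateEctr;
    }

    // Assumed guard 2: with blocking enabled, only instants strictly before the
    // latest completed clean's ECTR remain eligible for archival.
    public static List<Long> archivable(List<Long> candidates, OptionalLong cleanEctr,
                                        boolean blockOnCleanEctr) {
        if (!blockOnCleanEctr || !cleanEctr.isPresent()) {
            return candidates; // config disabled or no completed clean: nothing is held back
        }
        long ectr = cleanEctr.getAsLong();
        return candidates.stream().filter(t -> t < ectr).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> candidates = Arrays.asList(100L, 200L, 300L);
        System.out.println(capAtEarliestPending(300L, OptionalLong.of(250L)));
        System.out.println(archivable(candidates, OptionalLong.of(200L), true));
        System.out.println(archivable(candidates, OptionalLong.of(200L), false));
    }
}
```

With blocking enabled, fewer instants pass the filter, which matches the note above that more instants may stay on the active timeline.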
Verification:
- mvn -pl hudi-common -am -DskipTests -DskipITs -Dcheckstyle.skip=true test-compile
- mvn -pl hudi-flink-datasource/hudi-flink -am -DskipITs -Dcheckstyle.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestFlinkWriteClients#testCleanByTimeConfigsPropagateToWriteConfig test
- git diff --check

Documentation Update
The Hudi website/config documentation should be updated for the newly exposed Flink options:
- hoodie.clean.max.commits
- hoodie.clean.empty.commit.interval.hours
- hoodie.archive.block.on.clean.ectr

Contributor's checklist