feat(clean): support clean-by-time retention boundary and archive protection #19041

Open
fhan688 wants to merge 2 commits into apache:master from fhan688:support-clean-by-time

@fhan688 fhan688 commented Jun 19, 2026

Contributor

Describe the issue this Pull Request addresses

KEEP_LATEST_BY_HOURS currently picks the first completed instant after the time cutoff as the earliest commit to retain. This can make the cleaner skip files that should already be eligible for cleaning, especially when commit density is sparse around the cutoff.

This PR corrects the clean-by-time retention boundary and makes timeline archival respect the latest completed clean's earliestCommitToRetain consistently across timeline archiver versions.
It also exposes the existing cleaner/archive configs through the Flink write path.

Summary and Changelog

This PR improves clean-by-time support without introducing any LSM-table-specific behavior.

Changes:

  • Update the KEEP_LATEST_BY_HOURS ECTR (earliestCommitToRetain) calculation to choose the earliest completed instant at or before the retention cutoff.
  • Ensure the by-hours ECTR does not move past the earliest pending instant by retaining the completed instant before the pending instant.
  • Add a shared archival utility to derive the earliest instant to retain from the latest completed clean metadata.
  • Apply clean ECTR archive blocking to both TimelineArchiverV1 and TimelineArchiverV2 when hoodie.archive.block.on.clean.ectr is enabled.
  • Expose Flink options for:
    • hoodie.clean.max.commits
    • hoodie.clean.empty.commit.interval.hours
    • hoodie.archive.block.on.clean.ectr
  • Add tests for by-hours ECTR selection, pending instant protection, V2 archival behavior, and Flink config propagation.
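
The by-hours selection and pending-instant protection described above can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: the class and method names are hypothetical, instant times are modeled as equal-length timestamp strings that sort lexicographically, and "at or before the cutoff" is read here as the completed instant closest to the cutoff from below, which is an assumption about the intended semantics.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; names and structure are assumptions, not Hudi APIs.
final class ByHoursEctrSketch {

  // Returns the last instant in an ascending list that is before the bound
  // (or equal to it when inclusive is true).
  static Optional<String> lastBefore(List<String> sortedInstants, String bound, boolean inclusive) {
    String found = null;
    for (String t : sortedInstants) {
      int cmp = t.compareTo(bound);
      if (cmp < 0 || (inclusive && cmp == 0)) {
        found = t;
      } else {
        break; // ascending order: no later instant can qualify
      }
    }
    return Optional.ofNullable(found);
  }

  // Picks the ECTR for a by-hours policy under the assumptions stated above.
  static Optional<String> selectEctr(List<String> completed, String cutoff,
                                     Optional<String> earliestPending) {
    // Assumed reading: completed instant closest to the cutoff, at or before it.
    Optional<String> candidate = lastBefore(completed, cutoff, true);
    // Do not move past the earliest pending instant: fall back to the
    // completed instant just before it.
    if (candidate.isPresent() && earliestPending.isPresent()
        && candidate.get().compareTo(earliestPending.get()) > 0) {
      return lastBefore(completed, earliestPending.get(), false);
    }
    return candidate;
  }

  public static void main(String[] args) {
    List<String> completed =
        Arrays.asList("20260619080000000", "20260619100000000", "20260619150000000");
    System.out.println(selectEctr(completed, "20260619120000000", Optional.empty()));
    System.out.println(selectEctr(completed, "20260619120000000",
        Optional.of("20260619090000000")));
  }
}
```

With sparse commits around the cutoff, picking the first completed instant after the cutoff would select the 15:00 instant in this example, while the sketch selects the 10:00 instant.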

No code was copied from another project.

Impact

User-facing behavior changes:

  • Tables using KEEP_LATEST_BY_HOURS compute the clean boundary more accurately against the configured time window.
  • When archive blocking on clean ECTR is enabled, timeline archiving avoids archiving commits that may still be needed because their data files have not been cleaned yet.
  • Flink writers can configure the existing clean/archive controls through Flink options.

This may retain more active timeline instants in some cases when clean ECTR archive blocking is enabled, which is expected for correctness.
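
As a rough illustration of why more instants can stay on the active timeline, the sketch below filters archival candidates against the ECTR taken from the latest completed clean. It is a simplified model under stated assumptions: the class and method names are hypothetical, instant times are plain sortable strings, and the instant equal to the ECTR is assumed to be retained. The actual archiver logic is not shown in this description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical sketch; not the TimelineArchiverV1/V2 implementation.
final class ArchiveBlockSketch {

  // Returns the candidates that may be archived. When blocking is enabled and a
  // clean ECTR is known, only instants strictly before the ECTR are archivable
  // (assumption: the ECTR instant itself stays on the active timeline).
  static List<String> archivable(List<String> candidates, Optional<String> cleanEctr,
                                 boolean blockOnCleanEctr) {
    if (!blockOnCleanEctr || !cleanEctr.isPresent()) {
      return candidates;
    }
    String ectr = cleanEctr.get();
    return candidates.stream()
        .filter(t -> t.compareTo(ectr) < 0)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> candidates =
        Arrays.asList("20260619080000000", "20260619100000000", "20260619150000000");
    System.out.println(archivable(candidates, Optional.of("20260619100000000"), true));
    System.out.println(archivable(candidates, Optional.of("20260619100000000"), false));
  }
}
```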

Risk Level

medium

The change affects cleaner retention boundary calculation and timeline archival retention decisions. The implementation is conservative: it avoids cleaning past pending instants and only blocks archival on clean ECTR when the existing config is enabled.

Verification:

  • mvn -pl hudi-common -am -DskipTests -DskipITs -Dcheckstyle.skip=true test-compile
  • mvn -pl hudi-flink-datasource/hudi-flink -am -DskipITs -Dcheckstyle.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestFlinkWriteClients#testCleanByTimeConfigsPropagateToWriteConfig test
  • git diff --check

Documentation Update

The Hudi website/config documentation should be updated for the newly exposed Flink options:

  • hoodie.clean.max.commits
  • hoodie.clean.empty.commit.interval.hours
  • hoodie.archive.block.on.clean.ectr
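
For the documentation update, a minimal sketch of how these keys might be collected as string options before being handed to a Flink writer. Only the three keys come from this PR; the class name, the values, and their types (integer, integer, boolean) are illustrative assumptions, not documented defaults.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; values are placeholders, not recommended settings.
final class FlinkCleanOptionsSketch {

  static Map<String, String> cleanByTimeOptions() {
    Map<String, String> options = new LinkedHashMap<>();
    options.put("hoodie.clean.max.commits", "10");                 // assumed integer value
    options.put("hoodie.clean.empty.commit.interval.hours", "24"); // assumed integer value
    options.put("hoodie.archive.block.on.clean.ectr", "true");     // assumed boolean value
    return options;
  }

  public static void main(String[] args) {
    cleanByTimeOptions().forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```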

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@hudi-bot

Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build
