[Discussion WIP] Add utility to both schedule and execute table services on MDT without needing to write to data table #19025

@kbuci

Description

Feature Description

What the feature achieves:

  • Add a utility API that invokes the same steps as HoodieBackedTableMetadataWriter::performTableServices to schedule and execute compaction, clean, and archival on the MDT. This lets a user ensure that the MDT of a dataset has no extra/uncompacted files (which can impact storage footprint or read times) without having to do an unnecessary or "empty" write on the data table.

    • By default, it should hold the table lock of the data table, at least while validating and scheduling clean/compaction plans. If we add support for a separate lock on the MDT, then we can relax this constraint.
      • We can extend this support to clean as well in the future
  • As an extra safety measure, we can add a tunable config to control whether to schedule, execute, or both schedule and execute clean and compaction. This is because, for datasets with an RLI and many record index shards, compaction and clean may require a lot of time or Spark executor resources. As a result, a writer may not have sufficient Spark resources to execute such a plan within a reasonable time bound.

  • Note that this will only trigger and perform table services if the expected criteria/conditions are met. For example, if there is an older inflight instant or not enough accumulated writes, then compaction/clean won't be attempted. We just want a writer to be able to run the same steps that HoodieBackedTableMetadataWriter::performTableServices would go through, except without having to write to the data table.
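
To make the proposed flow concrete, here is a minimal, self-contained sketch of how such a utility could gate scheduling and execution. Everything in it is hypothetical and only for illustration: `MdtTableServiceSketch`, `Mode`, `MdtState`, and `run` are not Hudi APIs, the `ReentrantLock` merely stands in for the data table lock, and the eligibility check is a simplified stand-in for the conditions that HoodieBackedTableMetadataWriter::performTableServices evaluates. Only compaction is modeled; clean and archival would presumably follow the same pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: none of these types exist in Hudi.
final class MdtTableServiceSketch {

    /** Hypothetical tunable: which phases the utility is allowed to run. */
    enum Mode {
        SCHEDULE_ONLY, EXECUTE_ONLY, SCHEDULE_AND_EXECUTE;

        boolean schedules() { return this != EXECUTE_ONLY; }
        boolean executes()  { return this != SCHEDULE_ONLY; }
    }

    /** Stand-in for the MDT timeline state that the real checks would read. */
    static final class MdtState {
        final boolean olderInflightInstant;
        final int deltaCommitsSinceCompaction;
        final int compactionThreshold;
        final List<String> pendingPlans = new ArrayList<>();

        MdtState(boolean olderInflightInstant, int deltaCommits, int threshold) {
            this.olderInflightInstant = olderInflightInstant;
            this.deltaCommitsSinceCompaction = deltaCommits;
            this.compactionThreshold = threshold;
        }
    }

    /** Runs the allowed phases and returns a log of the actions taken. */
    static List<String> run(MdtState mdt, Mode mode, ReentrantLock dataTableLock) {
        List<String> actions = new ArrayList<>();

        if (mode.schedules()) {
            // Validation and scheduling happen under the data table lock.
            dataTableLock.lock();
            try {
                boolean eligible = !mdt.olderInflightInstant
                        && mdt.deltaCommitsSinceCompaction >= mdt.compactionThreshold;
                if (eligible) {
                    mdt.pendingPlans.add("compaction");
                    actions.add("scheduled compaction");
                }
            } finally {
                dataTableLock.unlock();
            }
        }

        if (mode.executes()) {
            // Execution may be expensive, so it is gated separately by the mode.
            for (String plan : mdt.pendingPlans) {
                actions.add("executed " + plan);
            }
            mdt.pendingPlans.clear();
        }
        return actions;
    }

    public static void main(String[] args) {
        MdtState mdt = new MdtState(false, 10, 5);
        System.out.println(run(mdt, Mode.SCHEDULE_AND_EXECUTE, new ReentrantLock()));
    }
}
```

Splitting the mode this way would let a lightly provisioned writer run with `SCHEDULE_ONLY` and leave any pending plan for a better provisioned job running with `EXECUTE_ONLY`. When the criteria are not met, the sketch returns an empty action list rather than failing.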

Why this feature is needed:
This is similar to the original sub-ask in #17908 (comment). Our org has a use case where we need to run such a utility to avoid a buildup of data/instant files in the MDT, which can impact writes and cause storage to grow unbounded. Typically this scenario happens when there is a backfill of clustering/deletePartition writes on a dataset, which do not perform MDT table services. These writes cannot be configured to do so, since they may not have sufficient Spark executors to compact/clean an MDT with a large RLI.
Currently, we work around this by performing an "empty commit" on the data table at a regular cadence (with sufficient Spark resources) to perform this MDT "cleanup". But this is not an ideal solution, as it makes observability more difficult (distinguishing an "empty" write from an "actual" write) and adds more instants to the data table and MDT timelines (the latter required us to add the optimization in #18215 (comment)).

User Experience

How users will use this feature:

  • Configuration changes needed
  • API changes
  • Usage examples

Hudi RFC Requirements

RFC PR link: (if applicable)

Why RFC is/isn't needed:

  • Does this change public interfaces/APIs? (Yes/No)
  • Does this change storage format? (Yes/No)
  • Justification:

Metadata

    Labels

    type:feature (New features and enhancements)
