Skip to content

Core: Expose helper methods of partition statistics#16560

Open
ebyhr wants to merge 1 commit into
apache:mainfrom
ebyhr:ebi/iceberg-partition-public
Open

Core: Expose helper methods of partition statistics#16560
ebyhr wants to merge 1 commit into
apache:mainfrom
ebyhr:ebi/iceberg-partition-public

Conversation

@ebyhr
Copy link
Copy Markdown
Member

@ebyhr ebyhr commented May 25, 2026

#14998 causes build failures in Trino / Starburst.

As I reported #14284, Trino doesn't allow the Hadoop-based Parquet library.
Therefore, the project uses some helper methods instead of relying solely on the reader and writer classes in the Iceberg library.

This PR makes some methods public to resolve the build failure.

@ebyhr ebyhr force-pushed the ebi/iceberg-partition-public branch from 6cfed35 to f967c49 Compare May 25, 2026 03:27
@ebyhr ebyhr force-pushed the ebi/iceberg-partition-public branch from f967c49 to 02fbc95 Compare May 25, 2026 03:28
@github-actions github-actions Bot removed the API label May 25, 2026
@ebyhr ebyhr changed the title Core, API: Expose helper methods of partition statistics Core: Expose helper methods of partition statistics May 25, 2026
@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented May 25, 2026

@pvary @gaborkaszab Could you review this PR when you have time?

@gaborkaszab
Copy link
Copy Markdown
Contributor

Thanks for pinging me on this, @ebyhr ! I wasn't aware of such a limitation from the Trino side.

I checked the functions that you changed to public and IMO they are very low level functions meant for internal usage. I'm just throwing out ideas here, but would it make sense to split the current computeAndWriteStatsFile public function into multiple parts and make the part public that is independent from the Parquet read/write.
In practice the public interface to stats does these steps:

  1. read existing stats (Parquet reading involved)
  2. calculate new stats either full or incremental (independent of file read/write, except Manifest reading)
  3. write new stat file (Parquet writing involved)
    This is an oversimplified summary here, I know, but might worth exploring if we can split 2) into a separate function and make it public. Would. that be sufficient, @ebyhr ?

@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented May 26, 2026

@gaborkaszab I had previously submitted such a PR (#14290), but unfortunately, it wasn't reviewed. I'll restore the PR once we've agreed on the approach.

@gaborkaszab
Copy link
Copy Markdown
Contributor

Thanks for the link, @ebyhr ! After a Quick Look it seems slightly different approach than what I suggested here: It makes the partition stats readers and writers injectable into the computeAndWritePartitionStats function. My suggestion is simpler: we could split up this function into multiple steps and separate the read/write part from the rest. The rest could be public and used by Trino.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants