Implement ParquetFormatModel and update write_file to use the format API by nssalian · Pull Request #3381 · apache/iceberg-python

nssalian · 2026-05-19T03:36:06Z

Continued work on #3100

PR Description

Follow-up to #3119. Implements ParquetFormatWriter and ParquetFormatModel, registers Parquet in the FileFormatFactory, and rewrites write_file to dispatch through the factory using the write.format.default table property. Future formats can be added in a similar way.
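The dispatch described above can be pictured with a minimal, self-contained sketch. This is not the PR's code. Only the names FileFormatFactory and the write.format.default property come from the description; the `FormatModel` dataclass, the `register` and `get` methods, and the `resolve_write_format` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FormatModel:
    """Hypothetical stand-in for a format model such as ParquetFormatModel."""

    name: str


@dataclass
class FileFormatFactory:
    """Hypothetical registry that maps a format name to its model."""

    _models: dict = field(default_factory=dict)

    def register(self, model: FormatModel) -> None:
        # Store under a lowercase key so lookups are case-insensitive.
        self._models[model.name.lower()] = model

    def get(self, name: str) -> FormatModel:
        try:
            return self._models[name.lower()]
        except KeyError:
            # Mirrors the behavior the PR describes for unregistered formats.
            raise ValueError(f"No format model registered for: {name}") from None


def resolve_write_format(factory: FileFormatFactory, properties: dict) -> FormatModel:
    """Pick the model named by write.format.default, falling back to Parquet."""
    return factory.get(properties.get("write.format.default", "parquet"))
```

With only Parquet registered, an empty property map resolves to the Parquet model, while `{"write.format.default": "orc"}` raises a ValueError in this sketch.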

Rationale for this change

The write.format.default table property was never read: the write path was hardcoded to Parquet. This PR makes the property functional. It also threads file_format through _to_requested_schema / ArrowProjectionVisitor / _construct_field so that field ID metadata keys are correct per format (PARQUET:field_id for Parquet; iceberg.id plus iceberg.required for ORC), preparing the write path for ORC support without changing default behavior.
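As a rough illustration of the per-format field ID keys, the sketch below builds the metadata mapping for one field. The key names are the ones listed above; the function name `field_id_metadata`, the bytes encoding, and the "true"/"false" spelling of the required flag are assumptions, not taken from the PR.

```python
def field_id_metadata(file_format: str, field_id: int, required: bool) -> dict:
    """Return hypothetical field metadata for the given file format."""
    fmt = file_format.lower()
    if fmt == "parquet":
        # Parquet carries only the field ID under a single key.
        return {b"PARQUET:field_id": str(field_id).encode()}
    if fmt == "orc":
        # ORC carries the field ID plus a separate required flag.
        return {
            b"iceberg.id": str(field_id).encode(),
            b"iceberg.required": str(required).lower().encode(),  # assumed encoding
        }
    raise ValueError(f"Unsupported file format: {file_format}")
```

Keeping the format-specific keys in one place is what lets the rest of the schema conversion stay format-agnostic; the caller only passes file_format down.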

Are these changes tested?

tests/io/test_format_writers.py adds parametrized tests modeled after Java's BaseFormatModelTests covering round-trip, statistics, null handling, context manager caching, close idempotency, close-without-write, and ORC vs Parquet field ID dispatch.
tests/io/test_pyarrow.py adds test_write_file_parquet_round_trip and test_write_file_dispatches_on_write_format_default exercising the full write_file path.
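Two of the properties named above, close idempotency and close-without-write, can be shown with a small stand-in writer. `LazyWriter` and its `open_sink` parameter are hypothetical and unrelated to the PR's ParquetFormatWriter. The sketch also assumes that closing without writing never opens the sink, which may differ from the PR's behavior.

```python
import io
from typing import Callable, Optional, TextIO


class LazyWriter:
    """Hypothetical writer that opens its sink on first write."""

    def __init__(self, open_sink: Callable[[], TextIO]) -> None:
        self._open_sink = open_sink
        self._sink: Optional[TextIO] = None
        self._closed = False

    def write(self, row: str) -> None:
        if self._closed:
            raise ValueError("writer is closed")
        if self._sink is None:
            # Defer opening until there is something to write.
            self._sink = self._open_sink()
        self._sink.write(row + "\n")

    def close(self) -> None:
        # A second close is a no-op, and so is a close with no writes.
        if self._closed:
            return
        self._closed = True
        if self._sink is not None:
            self._sink.close()

    def __enter__(self) -> "LazyWriter":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close()


# Usage: the context manager closes the sink exactly once.
with LazyWriter(io.StringIO) as writer:
    writer.write("row-1")
writer.close()  # safe: already closed
```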

Are there any user-facing changes?

No, default behavior is unchanged. The one exception is that setting write.format.default to an unregistered format now raises a ValueError.

nssalian · 2026-05-19T03:56:56Z

@kevinjqliu @Fokko @geruh PTAL when you can

Kurtiscwright

Thank you for messaging me on Slack about this PR. I have only one general question about the Python implementation.

Will the Python implementation support non-Arrow use cases?

For example, Spark may want to use a custom columnar format. Would that require translating from Spark to Arrow and then writing to Parquet from Arrow?

This is the question the Rust File Format RFC is trying to answer: can the Spark format be extended to speak Iceberg-specific semantics and then write to Parquet directly, without ever translating away from that Spark format?

rambleraptor

This looks awesome! I love how much it cleans this up. I've got a question, but otherwise I think this is good!

rambleraptor · 2026-05-29T21:50:56Z

+        self._output_file = output_file
+        self._file_schema = file_schema
+        self._properties = properties
+        self._writer: pq.ParquetWriter | None = None


Why not set up the writer here if it's set to None?

Implement ParquetFormatModel and wire write_file to use the format API

e64df3c

nssalian changed the title from "Implement ParquetFormatModel and wire write_file to use the format API" to "Implement ParquetFormatModel and update write_file to use the format API" on May 19, 2026

nssalian marked this pull request as ready for review May 19, 2026 03:51

rambleraptor reviewed May 19, 2026


Comment thread pyiceberg/io/pyarrow.py Outdated

Comment thread tests/io/test_format_writers.py Outdated

Comment thread tests/io/test_format_writers.py Outdated

Comment thread pyiceberg/io/pyarrow.py Outdated

Comment thread pyiceberg/io/pyarrow.py Outdated

nssalian added 2 commits May 21, 2026 13:59

PR Comments

a2d9ea7

Merge remote-tracking branch 'apache/main' into file-format-parquet-impl

4d7194c

nssalian requested a review from rambleraptor May 22, 2026 17:01

Kurtiscwright reviewed May 29, 2026


rambleraptor approved these changes May 29, 2026



nssalian wants to merge 3 commits into apache:main from nssalian:file-format-parquet-impl


3 participants
