Implement ParquetFormatModel and update write_file to use the format API #3381

Open
nssalian wants to merge 3 commits into apache:main from nssalian:file-format-parquet-impl

@nssalian
Contributor

Continued work on #3100

PR Description

Follow-up to #3119. Implements ParquetFormatWriter and ParquetFormatModel, registers Parquet in the FileFormatFactory, and rewrites write_file to dispatch through the factory using the write.format.default table property. Future formats can be added in a similar way.
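
As a rough illustration of the dispatch described above, here is a minimal sketch of a registry keyed by format name. The names `FormatRegistry`, `FormatModel`, and `for_properties` are hypothetical and are not pyiceberg's API; only the `write.format.default` property name, the Parquet default, and the `ValueError` on an unregistered format come from this PR description.

```python
from typing import Callable, Dict

# Property name taken from the PR description; everything else is hypothetical.
WRITE_FORMAT_DEFAULT = "write.format.default"
DEFAULT_FORMAT = "parquet"


class FormatModel:
    """Minimal placeholder for a per-format model (not pyiceberg's class)."""

    def __init__(self, name: str) -> None:
        self.name = name


class FormatRegistry:
    """Maps a lowercase format name to a factory for its model."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], FormatModel]] = {}

    def register(self, name: str, factory: Callable[[], FormatModel]) -> None:
        self._factories[name.lower()] = factory

    def for_properties(self, properties: Dict[str, str]) -> FormatModel:
        # Fall back to Parquet when the property is absent.
        name = properties.get(WRITE_FORMAT_DEFAULT, DEFAULT_FORMAT).lower()
        try:
            factory = self._factories[name]
        except KeyError:
            raise ValueError(f"Unsupported write format: {name}") from None
        return factory()


registry = FormatRegistry()
registry.register("parquet", lambda: FormatModel("parquet"))
```

In this sketch, adding another format would be one more `register` call, which is the kind of extension point the description alludes to.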

Rationale for this change

The write.format.default table property was never read; the write path was hardcoded to Parquet. This PR makes the property functional. It also threads file_format through _to_requested_schema / ArrowProjectionVisitor / _construct_field so that field ID metadata keys are correct per format (PARQUET:field_id for Parquet, iceberg.id plus iceberg.required for ORC), preparing the write path for ORC support without changing default behavior.
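
To make the per-format metadata keys concrete, here is a small sketch. The helper `field_metadata` is hypothetical and is not the PR's `_construct_field`; the key names come from the paragraph above, while the string encoding of the values (for example `"true"`/`"false"`) is an assumption.

```python
from typing import Dict

# Metadata keys as named in the PR description.
PARQUET_FIELD_ID_KEY = "PARQUET:field_id"
ORC_FIELD_ID_KEY = "iceberg.id"
ORC_FIELD_REQUIRED_KEY = "iceberg.required"


def field_metadata(file_format: str, field_id: int, required: bool) -> Dict[str, str]:
    """Return per-format metadata for one field (hypothetical helper)."""
    fmt = file_format.lower()
    if fmt == "parquet":
        return {PARQUET_FIELD_ID_KEY: str(field_id)}
    if fmt == "orc":
        return {
            ORC_FIELD_ID_KEY: str(field_id),
            # Assumed encoding: lowercase "true" / "false".
            ORC_FIELD_REQUIRED_KEY: str(required).lower(),
        }
    raise ValueError(f"Unsupported file format: {file_format}")
```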

Are these changes tested?

  • tests/io/test_format_writers.py adds parametrized tests modeled after Java's BaseFormatModelTests covering round-trip, statistics, null handling, context manager caching, close idempotency, close-without-write, and ORC vs Parquet field ID dispatch.
  • tests/io/test_pyarrow.py adds test_write_file_parquet_round_trip and test_write_file_dispatches_on_write_format_default exercising the full write_file path.
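
For readers unfamiliar with the close-related cases listed above (close idempotency and close-without-write), the behavior can be sketched with a stand-alone writer. `LazyWriter` and `open_sink` are hypothetical names for illustration, not the PR's ParquetFormatWriter, and the sketch assumes the underlying sink is created only on the first write.

```python
import io
from typing import BinaryIO, Callable, Optional


class LazyWriter:
    """Hypothetical writer that opens its sink on the first write."""

    def __init__(self, open_sink: Callable[[], BinaryIO]) -> None:
        self._open_sink = open_sink
        self._sink: Optional[BinaryIO] = None
        self._closed = False
        self.bytes_written = 0

    def write(self, data: bytes) -> None:
        if self._closed:
            raise ValueError("write on closed writer")
        if self._sink is None:
            # Deferred creation: nothing is opened until data arrives.
            self._sink = self._open_sink()
        self.bytes_written += self._sink.write(data)

    def close(self) -> None:
        # Idempotent: a second close is a no-op.
        if self._closed:
            return
        self._closed = True
        # Close-without-write: there is no sink to close.
        if self._sink is not None:
            self._sink.close()

    def __enter__(self) -> "LazyWriter":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close()
```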

Are there any user-facing changes?

No. Default behavior is unchanged. Setting write.format.default to an unregistered format now raises a ValueError.

@nssalian nssalian changed the title from "Implement ParquetFormatModel and wire write_file to use the format API" to "Implement ParquetFormatModel and update write_file to use the format API" May 19, 2026
@nssalian nssalian marked this pull request as ready for review May 19, 2026 03:51
@nssalian
Contributor Author

@kevinjqliu @Fokko @geruh PTAL when you can

Comment thread pyiceberg/io/pyarrow.py Outdated
Comment thread tests/io/test_format_writers.py Outdated
Comment thread tests/io/test_format_writers.py Outdated
Comment thread pyiceberg/io/pyarrow.py Outdated
Comment thread pyiceberg/io/pyarrow.py Outdated
@nssalian nssalian requested a review from rambleraptor May 22, 2026 17:01

@Kurtiscwright Kurtiscwright left a comment

Thank you for messaging me on Slack about this PR. I only have one general question about the Python implementation.

Will the Python implementation support non-Arrow use cases?

For example, Spark may want to use a custom columnar format. Will that require translating from Spark to Arrow and then writing to Parquet from Arrow?

The question the Rust File Format RFC is trying to answer is:
Can the Spark format be extended to express Iceberg-specific semantics and then write to Parquet directly, without ever needing to translate away from that Spark format?

Contributor

@rambleraptor rambleraptor left a comment

This looks awesome! I love how much it cleans this up. I've got a question, but otherwise I think this is good!

Comment thread pyiceberg/io/pyarrow.py
self._output_file = output_file
self._file_schema = file_schema
self._properties = properties
self._writer: pq.ParquetWriter | None = None
Contributor

Why not set up the writer here if it's set to None?


3 participants