Implement ParquetFormatModel and update write_file to use the format API#3381
Implement ParquetFormatModel and update write_file to use the format API#3381nssalian wants to merge 3 commits into
Conversation
|
@kevinjqliu @Fokko @geruh PTAL when you can |
Kurtiscwright
left a comment
There was a problem hiding this comment.
Thank you for messaging me on slack about this PR. I only have one general question about the Python implementation.
Will the Python implementation support non-Arrow use cases?
For example Spark may want to use a custom columnar format, will that require translation from Spark to Arrow and then write to Parquet from Arrow?
The question the Rust File Format RFC is trying to answer:
Can the Spark format be extended to talk Iceberg specific semantics and then write to Parquet directly without ever needing to translate away from that Spark format?
rambleraptor
left a comment
There was a problem hiding this comment.
This looks awesome! I love how much it cleans this up. I've got a question, but otherwise I think this is good!
| self._output_file = output_file | ||
| self._file_schema = file_schema | ||
| self._properties = properties | ||
| self._writer: pq.ParquetWriter | None = None |
There was a problem hiding this comment.
Why not setup the writer here if it's set to None?
Continued work on #3100
PR Description
Follow-up to #3119. Implements
ParquetFormatWriterandParquetFormatModel, registers Parquet in theFileFormatFactory, and rewriteswrite_fileto dispatch through the factory using thewrite.format.defaulttable property. Future formats can be added in a similar way.Rationale for this change
The
write.format.defaulttable property was never read - the write path was hardcoded to Parquet. This PR makes the property functional. Also threadsfile_formatthrough_to_requested_schema/ArrowProjectionVisitor/_construct_fieldso field ID metadata keys are correct per format (PARQUET:field_idfor Parquet,iceberg.idplusiceberg.requiredfor ORC), preparing the write path for ORC support without changing default behavior.Are these changes tested?
tests/io/test_format_writers.pyadds parametrized tests modeled after Java'sBaseFormatModelTestscovering round-trip, statistics, null handling, context manager caching, close idempotency, close-without-write, and ORC vs Parquet field ID dispatch.tests/io/test_pyarrow.pyaddstest_write_file_parquet_round_tripandtest_write_file_dispatches_on_write_format_defaultexercising the fullwrite_filepath.Are there any user-facing changes?
No. Default behavior is unchanged. Setting
write.format.defaultto an unregistered format now raises aValueError.