feat: Add support for writing bloom filters#3265
Conversation
|
Thanks for working on this @renaudb. I've noticed that in the |
|
@Fokko the issue is with Bodo, not Boto. Bodo forces |
b86268b to
efd56e8
Compare
|
@Fokko this should be ready for review. |
| from packaging import version | ||
|
|
||
| MIN_PYARROW_VERSION_SUPPORTING_BLOOM_FILTER_WRITES = "24.0.0" | ||
| if version.parse(pyarrow.__version__) < version.parse(MIN_PYARROW_VERSION_SUPPORTING_BLOOM_FILTER_WRITES): |
There was a problem hiding this comment.
Would it be better to make this explicitly a different error since it is implemented but is gate on the dependency version?
| else: | ||
| file_schema = table_schema | ||
|
|
||
| parquet_writer_kwargs = _get_parquet_writer_kwargs(table_metadata.properties, file_schema) |
There was a problem hiding this comment.
If I understand correctly, neither input varies per file: table_metadata.properties is table-level, and file_schema is derived solely from table_metadata.schema() (just sanitized), so it's identical for every task.
Would it make sense to lift the file_schema derivation and this _get_parquet_writer_kwargs call back out of write_parquet, computing them once per write_file call instead of once per write_parquet call?
| schema (pyiceberg.schema.Schema): The current table schema. | ||
| table_properties (dict[str, str]): The table properties. | ||
| """ | ||
| bloom_filter_options = pre_order_visit( |
There was a problem hiding this comment.
I think parquet_path_to_id_mapping(file_schema) already walks this schema with ID2ParquetPathVisitor for stats. If I'm reading it right, the bloom path adds a couple more passes over the same schema. Would it make sense to do a single ID2ParquetPathVisitor pass and deriving both the bloom options and the stats mapping from it?
| @pytest.mark.integration | ||
| @skip_if_bloom_filter_not_supported | ||
| @pytest.mark.parametrize("format_version", [1, 2]) | ||
| def test_write_parquet_bloom_filter_properties( |
There was a problem hiding this comment.
Would it make sense to assert pq.ParquetWriter is called with bloom_filter_options?
Closes #850
Note: This PR is currently held back by boto requiring
pyarrow<=23.1as bloom filter write support was added in pyarrow 24.Rationale for this change
Add support for writing bloom filters to parquet files. This changes leverages the new
bloom_filter_optionswrite_parquetargument in pyarrow 24.Are these changes tested?
Added tests for the metadata parsing. Added a very basic test for the writing path (there is currently no way to test for the existence of a bloomfilter in a parquet file using pyarrow).
Are there any user-facing changes?
N/A