feat: Add support for writing bloom filters #3265

Open
renaudb wants to merge 3 commits into apache:main from renaudb:renaudb-add-bloom-filters-write

Conversation

@renaudb

@renaudb renaudb commented Apr 21, 2026

Closes #850

Note: This PR is currently held back by boto requiring pyarrow<=23.1 as bloom filter write support was added in pyarrow 24.

Rationale for this change

Add support for writing bloom filters to Parquet files. This change leverages the new bloom_filter_options write_parquet argument in pyarrow 24.

Are these changes tested?

Added tests for the metadata parsing. Added a very basic test for the write path (there is currently no way to check for the existence of a bloom filter in a Parquet file using pyarrow).

Are there any user-facing changes?

N/A

@renaudb renaudb marked this pull request as draft April 21, 2026 21:40
@Fokko
Contributor

Fokko commented May 11, 2026

Thanks for working on this @renaudb. I've noticed that in the uv.lock we're behind on some versions of boto3/botocore; I'm not sure why. Maybe we can bump these manually with uv sync --upgrade (in a separate PR).

@renaudb
Author

renaudb commented May 11, 2026

@Fokko the issue is with Bodo, not Boto. Bodo forces pyarrow>=23.0,<23.1. I saw a bunch of issues filed on their end about how this is too restrictive, but it looks like they are limited in how much they can change it.

https://github.com/bodo-ai/Bodo/blob/main/pyproject.toml#L9

@renaudb renaudb force-pushed the renaudb-add-bloom-filters-write branch from b86268b to efd56e8 Compare May 14, 2026 18:44
@renaudb renaudb marked this pull request as ready for review May 14, 2026 19:04
@renaudb
Author

renaudb commented May 21, 2026

@Fokko this should be ready for review.

@Fokko Fokko self-requested a review May 21, 2026 20:10
Comment thread pyiceberg/io/pyarrow.py
from packaging import version

MIN_PYARROW_VERSION_SUPPORTING_BLOOM_FILTER_WRITES = "24.0.0"
if version.parse(pyarrow.__version__) < version.parse(MIN_PYARROW_VERSION_SUPPORTING_BLOOM_FILTER_WRITES):

Would it be better to make this explicitly a different error, since the feature is implemented but gated on the dependency version?
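
To make the suggestion concrete, here is a minimal sketch of what a dedicated error could look like. The exception name `BloomFilterWriteNotSupportedError` and the helper `check_bloom_filter_write_support` are hypothetical and not part of the PR. The version comparison is a simplified stdlib stand-in for the `packaging.version.parse` call used in the diff, and the installed version is passed in as a string so the sketch runs without pyarrow.

```python
import re

MIN_PYARROW_VERSION_SUPPORTING_BLOOM_FILTER_WRITES = "24.0.0"


class BloomFilterWriteNotSupportedError(RuntimeError):
    """Hypothetical error: the feature is implemented, but the installed pyarrow is too old."""


def _release_tuple(raw: str) -> tuple[int, ...]:
    # Simplified stand-in for packaging.version.parse: keep only the leading
    # numeric release segment and pad it to three components.
    match = re.match(r"\d+(?:\.\d+)*", raw)
    if match is None:
        raise ValueError(f"Unrecognized version string: {raw!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    return parts + (0,) * max(0, 3 - len(parts))


def check_bloom_filter_write_support(installed_version: str) -> None:
    # Raise a dedicated error so callers can tell "dependency too old"
    # apart from "feature not implemented".
    minimum = MIN_PYARROW_VERSION_SUPPORTING_BLOOM_FILTER_WRITES
    if _release_tuple(installed_version) < _release_tuple(minimum):
        raise BloomFilterWriteNotSupportedError(
            f"Writing bloom filters requires pyarrow>={minimum}, "
            f"but pyarrow {installed_version} is installed."
        )
```

In the real code path the argument would presumably be `pyarrow.__version__`; a dedicated exception type also lets tests assert on the gate precisely.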

Comment thread pyiceberg/io/pyarrow.py
else:
file_schema = table_schema

parquet_writer_kwargs = _get_parquet_writer_kwargs(table_metadata.properties, file_schema)

If I understand correctly, neither input varies per file: table_metadata.properties is table-level, and file_schema is derived solely from table_metadata.schema() (just sanitized), so it's identical for every task.

Would it make sense to lift the file_schema derivation and this _get_parquet_writer_kwargs call back out of write_parquet, computing them once per write_file call instead of once per write_parquet call?

Comment thread pyiceberg/io/pyarrow.py
schema (pyiceberg.schema.Schema): The current table schema.
table_properties (dict[str, str]): The table properties.
"""
bloom_filter_options = pre_order_visit(

I think parquet_path_to_id_mapping(file_schema) already walks this schema with ID2ParquetPathVisitor for stats. If I'm reading it right, the bloom path adds a couple more passes over the same schema. Would it make sense to do a single ID2ParquetPathVisitor pass and derive both the bloom options and the stats mapping from it?
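
A simplified sketch of the single-pass idea, not the pyiceberg visitor itself. The `Field` dataclass, `id_to_parquet_path`, and `derive_from_single_pass` are hypothetical; only struct nesting is modeled (real Parquet paths for lists and maps have extra segments). The property prefix is the Iceberg `write.parquet.bloom-filter-enabled.column.` table property; how the PR actually reads it, and the shape of the resulting bloom options, are assumptions here.

```python
from dataclasses import dataclass, field

BLOOM_ENABLED_PREFIX = "write.parquet.bloom-filter-enabled.column."


@dataclass
class Field:
    # Hypothetical, minimal schema node: leaves have no children.
    field_id: int
    name: str
    children: list["Field"] = field(default_factory=list)


def id_to_parquet_path(fields: list[Field], prefix: tuple[str, ...] = ()) -> dict[int, str]:
    # One traversal of the schema: map each leaf field ID to its dotted path.
    mapping: dict[int, str] = {}
    for node in fields:
        path = prefix + (node.name,)
        if node.children:
            mapping.update(id_to_parquet_path(node.children, path))
        else:
            mapping[node.field_id] = ".".join(path)
    return mapping


def derive_from_single_pass(fields: list[Field], properties: dict[str, str]) -> tuple[dict[int, str], list[str]]:
    # Walk the schema once, then derive both results from the same mapping.
    id_to_path = id_to_parquet_path(fields)
    enabled = {
        key[len(BLOOM_ENABLED_PREFIX):]
        for key, value in properties.items()
        if key.startswith(BLOOM_ENABLED_PREFIX) and value.lower() == "true"
    }
    bloom_columns = sorted(path for path in id_to_path.values() if path in enabled)
    return id_to_path, bloom_columns
```

The stats code would consume `id_to_path` directly, while the bloom filter options would be built from `bloom_columns` without walking the schema again.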

@pytest.mark.integration
@skip_if_bloom_filter_not_supported
@pytest.mark.parametrize("format_version", [1, 2])
def test_write_parquet_bloom_filter_properties(

Would it make sense to assert pq.ParquetWriter is called with bloom_filter_options?
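
One possible shape for such an assertion, sketched with `unittest.mock`. To keep it runnable without pyarrow 24, `pq` is a stand-in namespace rather than the real `pyarrow.parquet` module, `write_with_bloom_filters` is a hypothetical helper, and the value passed as `bloom_filter_options` is a placeholder whose real shape is not confirmed here.

```python
from types import SimpleNamespace
from typing import Any
from unittest import mock

# Stand-in for the pyarrow.parquet module so the sketch runs without pyarrow.
pq = SimpleNamespace(ParquetWriter=lambda *args, **kwargs: None)


def write_with_bloom_filters(path: str, schema: Any, bloom_filter_options: Any) -> Any:
    # Hypothetical write helper that forwards the options to the writer.
    return pq.ParquetWriter(path, schema, bloom_filter_options=bloom_filter_options)


def check_writer_receives_bloom_filter_options() -> dict[str, Any]:
    options = {"id": True}  # placeholder value; the real option shape is an assumption
    with mock.patch.object(pq, "ParquetWriter") as writer_cls:
        write_with_bloom_filters("data.parquet", "schema", options)
    # The mock keeps its call record after the patch is undone.
    writer_cls.assert_called_once()
    assert writer_cls.call_args.kwargs["bloom_filter_options"] == options
    return dict(writer_cls.call_args.kwargs)
```

In the actual test the patch target would presumably be `pq.ParquetWriter` as imported in pyiceberg/io/pyarrow.py, which would verify the property-to-kwarg plumbing even though the bloom filter itself cannot be read back.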

Development

Successfully merging this pull request may close these issues.

support bloom-filter writing

3 participants