Push Overture building fetch bbox predicate down to Parquet row groups #368

Open
mividtim wants to merge 13 commits into larsiusprime:master from fabrica-land:sfogliatella/eng-3033-overture-fetch-pushdown

@mividtim

Summary

Replaces the Overture building fetch batch source with a DuckDB-backed Parquet scan that pushes the bbox overlap predicate into row-group pruning, then re-slices streamed pandas chunks into bounded GeoDataFrame batches for the existing streaming stats path.
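The re-slicing step can be illustrated with a minimal sketch. This is not the PR's code: `rebatch` is a hypothetical name, and plain lists stand in for the streamed pandas chunks so the example runs without third-party packages. With DataFrames the same buffering would use `pd.concat` and `iloc` slices instead of list operations.

```python
from typing import Iterable, Iterator, List, Sequence


def rebatch(chunks: Iterable[Sequence], batch_rows: int) -> Iterator[List]:
    """Re-slice variable-sized chunks into batches of at most `batch_rows` rows.

    Hypothetical helper for illustration only. At most one incoming chunk plus
    a partial batch is buffered at a time, so memory stays bounded regardless
    of the total number of rows streamed.
    """
    if batch_rows <= 0:
        raise ValueError("batch_rows must be positive")
    buffer: List = []
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit as many full batches as the buffer currently holds.
        while len(buffer) >= batch_rows:
            yield buffer[:batch_rows]
            buffer = buffer[batch_rows:]
    # Flush the final partial batch, if any rows remain.
    if buffer:
        yield buffer


batches = list(rebatch([[1, 2, 3], [4], [5, 6, 7, 8, 9]], batch_rows=4))
print(batches)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

Every batch except possibly the last has exactly `batch_rows` rows, which is what lets a downstream aggregation assume a fixed upper bound per batch.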

Why

The prior PyArrow dataset scanner can decode far more of the global buildings theme than needed for a small bbox because it does not reliably prune on nested bbox struct subfields. In the AVM worker this caused multi-GB memory spikes for small dense queries. DuckDB can apply the bbox predicate against Parquet row-group statistics, sharply reducing peak memory while preserving the streamed aggregation contract.

Stacking note

This branch is intentionally stacked on #356 (strudel/eng-3033-overture-streaming-upstream). Until #356 merges, this PR's diff includes that streaming foundation plus the DuckDB pushdown commit. After #356 lands, this branch can be rebased so the PR shows only the pushdown layer.

Validation

  • python -m py_compile openavmkit/utilities/overture.py
  • python -m pytest tests/test_overture_streaming.py tests/test_overture_fetch.py -q passed locally on the feature branch and on fabrica-land integration: 7 passed, 1 pandas FutureWarning.

@github-actions

Contributor

Thank you for your contribution.
Please sign our CLA at the following link:
Click here to sign the CLA.

No action is required from you in this PR thread. Once you have signed the CLA externally, a maintainer will verify your signature and record it here on your behalf by commenting:


I affirm that this contributor has signed the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@fabrica-cc-engineering-agent (bot) force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from c0aeaf5 to af1278c on June 22, 2026 21:11
@fabrica-cc-engineering-agent (bot) force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from af1278c to e8adedb on June 22, 2026 21:30
@fabrica-cc-engineering-agent (bot) force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from e8adedb to c0853b1 on June 22, 2026 21:34
@fabrica-cc-engineering-agent (bot) force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from c0853b1 to 19899f8 on June 22, 2026 21:47
@fabrica-cc-engineering-agent (bot) force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from 19899f8 to e6adce5 on June 22, 2026 21:56
@fabrica-cc-engineering-agent (bot) force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from e6adce5 to b160a31 on June 22, 2026 22:05
…ckDB)

The streaming fetch still OOM-killed the 4 GiB AVM worker on wide bounding
boxes: _building_batches read the global Overture buildings theme through
PyArrow's dataset scanner, which does not prune row groups on the nested `bbox`
struct subfields. Streaming the *output* did not bound memory because the
*scan* still materialized a large fraction of the global theme regardless of
bbox: a ~1 km box pulled ~3 GB RSS over ~84 s, and a single 0.05° tile alone
exceeded 4 GiB, so subprocess tiling could never bound it either.

Fetch via DuckDB read_parquet instead, with the bbox-overlap predicate pushed
to Parquet row-group statistics. The matched rows are streamed back with
fetch_df_chunk (so the full building set is still never materialized at once,
preserving the streaming contract the per-parcel stats aggregation relies on).
The same ~1 km fetch returns identical rows at ~0.4 GB; the full get_buildings
call peaks at ~0.57 GB.
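
As a rough sketch of the approach described above (not the code in this commit), the overlap predicate can be written as four comparisons on the `bbox` struct fields `xmin`, `xmax`, `ymin`, and `ymax`, with the query bounds passed as parameters. `build_bbox_query` and `stream_chunks` are hypothetical names, the source path is a placeholder, and the connection is assumed to be a DuckDB connection whose `fetch_df_chunk` returns an empty frame once the result is exhausted.

```python
from typing import Any, Iterator, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def build_bbox_query(source: str, bbox: BBox, columns: Sequence[str]) -> Tuple[str, List[float]]:
    """Build a read_parquet query with a bbox-overlap predicate (illustrative)."""
    xmin, ymin, xmax, ymax = bbox
    cols = ", ".join(f'"{c}"' for c in columns)
    src = source.replace("'", "''")  # escape single quotes in the path literal
    sql = (
        f"SELECT {cols} FROM read_parquet('{src}') "
        "WHERE bbox.xmin <= ? AND bbox.xmax >= ? "
        "AND bbox.ymin <= ? AND bbox.ymax >= ?"
    )
    # Two boxes overlap when each one starts before the other ends, per axis.
    return sql, [xmax, xmin, ymax, ymin]


def stream_chunks(con: Any, sql: str, params: Sequence[float],
                  vectors_per_chunk: int = 8) -> Iterator[Any]:
    """Yield result chunks one at a time instead of materializing all rows.

    `con` is assumed to expose DuckDB's `execute` and `fetch_df_chunk`.
    """
    con.execute(sql, list(params))
    while True:
        chunk = con.fetch_df_chunk(vectors_per_chunk)
        if len(chunk) == 0:  # assumption: an empty chunk marks end of stream
            break
        yield chunk


sql, params = build_bbox_query(
    "s3://example-bucket/buildings/*.parquet",  # placeholder path
    (-122.42, 37.77, -122.41, 37.78),
    ["id", "geometry"],
)
```

With a real connection this would be consumed as `for chunk in stream_chunks(duckdb.connect(), sql, params): ...` once the `httpfs` extension is loaded.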

Building outputs (geometry WKB, height_m_best, floors_best, footprint area) are
bit-for-bit identical to the prior streaming path, verified row-for-row on a
built-up multi-parcel bbox, so there is no model-output change and no version
bump is required.

Pandas (not Arrow) is used because the Overture `geometry` column is a GeoArrow
GEOMETRY type whose DuckDB Arrow export resolves its CRS via the `spatial`
extension catalog (not loaded) and raises an internal error; the pandas path
returns the raw stored WKB verbatim. httpfs is loaded explicitly (INSTALL is an
idempotent no-op when the extension is already present, e.g. baked into the
image).
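
A hedged sketch of the two steps this paragraph describes, assuming a DuckDB connection and pandas chunks whose `geometry` column holds non-null raw WKB. The names `load_httpfs`, `as_bytes`, and `to_geodataframe` are hypothetical, and the `bytearray` normalization is a defensive assumption about how BLOB values arrive in pandas, not something this commit states. EPSG:4326 is used because Overture geometries are WGS 84.

```python
from typing import Any


def load_httpfs(con: Any) -> None:
    """Install (if needed) and load DuckDB's httpfs extension on `con`."""
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")


def as_bytes(value: Any) -> Any:
    """Normalize buffer-like WKB values to bytes; pass anything else through."""
    if isinstance(value, (bytearray, memoryview)):
        return bytes(value)
    return value


def to_geodataframe(df: Any) -> Any:
    """Turn a pandas chunk with a WKB `geometry` column into a GeoDataFrame."""
    import geopandas as gpd  # third-party; imported lazily for this sketch

    geoms = gpd.GeoSeries.from_wkb(df["geometry"].map(as_bytes), crs="EPSG:4326")
    return gpd.GeoDataFrame(df.drop(columns=["geometry"]), geometry=geoms)
```

Decoding the WKB on the Python side keeps the query itself free of any dependency on the `spatial` extension.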

Adds duckdb to requirements; repoints the _building_batches split test at the
new DuckDB stream seam; adds a network-free unit test covering the pushdown
SQL/params, the unavailable-column drop, and the GeoDataFrame construction.

ENG-3033
@fabrica-cc-engineering-agent (bot) force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from 353bd7c to b0b5cb8 on June 22, 2026 22:20