Push Overture building fetch bbox predicate down to Parquet row groups by mividtim · Pull Request #368 · larsiusprime/openavmkit

mividtim · 2026-06-22T20:45:15Z

Summary

Replaces the Overture building fetch batch source with a DuckDB-backed Parquet scan that pushes the bbox overlap predicate into row-group pruning, then re-slices streamed pandas chunks into bounded GeoDataFrame batches for the existing streaming stats path.
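The re-slicing step can be illustrated with a minimal sketch. It is not the code in this PR: `rebatch` is a hypothetical name, and plain Python lists stand in for the streamed pandas chunks and the GeoDataFrame batches. The point is only that uneven input chunks become output batches with a fixed upper bound on row count.

```python
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def rebatch(chunks: Iterable[Sequence[T]], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most `batch_size` rows from arbitrarily sized chunks.

    Hypothetical helper: lists stand in for DataFrame chunks.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    buffer: List[T] = []
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit full batches as soon as enough rows have accumulated.
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    # Flush the final partial batch, if any.
    if buffer:
        yield buffer
```

In this sketch the buffer never holds more than one input chunk plus `batch_size - 1` leftover rows, so downstream memory is bounded by the chunk and batch sizes rather than by the total result size.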

Why

The prior PyArrow dataset scanner can decode far more of the global buildings theme than needed for a small bbox because it does not reliably prune on nested bbox struct subfields. In the AVM worker this caused multi-GB memory spikes for small dense queries. DuckDB can apply the bbox predicate against Parquet row-group statistics, sharply reducing peak memory while preserving the streamed aggregation contract.
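To make the pruning argument concrete, here is a small sketch of how a bbox-overlap predicate can be tested against per-row-group min/max statistics. It assumes the statistics cover the four bbox subfields; `RowGroupStats`, `may_overlap`, and `prune` are hypothetical names for illustration, not DuckDB or PyArrow APIs.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass(frozen=True)
class RowGroupStats:
    """Min/max statistics assumed to exist for the bbox subfields."""
    xmin_min: float  # smallest bbox.xmin in the row group
    xmax_max: float  # largest bbox.xmax in the row group
    ymin_min: float  # smallest bbox.ymin in the row group
    ymax_max: float  # largest bbox.ymax in the row group


def may_overlap(stats: RowGroupStats, query: BBox) -> bool:
    """Return False only when no row in the group can overlap the query."""
    qxmin, qymin, qxmax, qymax = query
    # A row overlaps when bbox.xmin <= qxmax AND bbox.xmax >= qxmin
    # AND bbox.ymin <= qymax AND bbox.ymax >= qymin. If even the most
    # favourable value in the group fails one conjunct, every row fails it.
    return (
        stats.xmin_min <= qxmax
        and stats.xmax_max >= qxmin
        and stats.ymin_min <= qymax
        and stats.ymax_max >= qymin
    )


def prune(groups: Sequence[RowGroupStats], query: BBox) -> List[int]:
    """Indices of row groups that still have to be decoded."""
    return [i for i, g in enumerate(groups) if may_overlap(g, query)]
```

A row group that fails the check is skipped without decoding any of its column data, which is where the memory saving comes from. Groups that pass may still contain non-matching rows, so the predicate is re-applied per row after decoding.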

Stacking note

This branch is intentionally stacked on #356 (strudel/eng-3033-overture-streaming-upstream). Until #356 merges, this PR's diff includes that streaming foundation plus the DuckDB pushdown commit. After #356 lands, this branch can be rebased so the PR shows only the pushdown layer.

Validation

`python -m py_compile openavmkit/utilities/overture.py`
`python -m pytest tests/test_overture_streaming.py tests/test_overture_fetch.py -q`: passed locally on the feature branch and on fabrica-land integration (7 passed, 1 pandas FutureWarning).

github-actions · 2026-06-22T20:45:24Z

Thank you for your contribution.
Please sign our CLA at the following link:
Click here to sign the CLA.

No action is required from you in this PR thread. Once you have signed the CLA externally, a maintainer will verify your signature and record it here on your behalf by commenting:

I affirm that this contributor has signed the CLA

You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

…ckDB)

The streaming fetch still OOM-killed the 4 GiB AVM worker on wide bounding boxes: _building_batches read the global Overture buildings theme through PyArrow's dataset scanner, which does not prune row groups on the nested `bbox` struct subfields. Streaming the *output* did not bound memory because the *scan* still materializes a huge fraction of the global theme regardless of bbox: a ~1 km box pulled ~3 GB RSS over ~84 s, and a single 0.05deg tile alone exceeded 4 GiB, so subprocess tiling could never bound it either.

Fetch via DuckDB read_parquet instead, with the bbox-overlap predicate pushed to Parquet row-group statistics. The matched rows are streamed back with fetch_df_chunk (so the full building set is still never materialized at once, preserving the streaming contract the per-parcel stats aggregation relies on). The same ~1 km fetch returns the identical rows at ~0.4 GB; full get_buildings peak ~0.57 GB.

Building outputs (geometry WKB, height_m_best, floors_best, footprint area) are bit-for-bit identical to the prior streaming path, verified row-for-row on a built-up multi-parcel bbox, so there is no model-output change and no version bump is required.

Pandas (not Arrow) is used because the Overture `geometry` column is a GeoArrow GEOMETRY type whose DuckDB Arrow export resolves its CRS via the `spatial` extension catalog (not loaded) and raises an internal error; the pandas path returns the raw stored WKB verbatim. httpfs is loaded explicitly (INSTALL is an idempotent no-op when the extension is already present, e.g. baked into the image).

Adds duckdb to requirements; repoints the _building_batches split test at the new DuckDB stream seam; adds a network-free unit test covering the pushdown SQL/params, the unavailable-column drop, and the GeoDataFrame construction.

ENG-3033
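A network-free sketch of what the pushdown SQL and its parameters could look like is shown below. It is an illustration under assumptions, not the query in `openavmkit/utilities/overture.py`: `build_bbox_query` is a hypothetical helper, the Parquet location is left to the caller, and the `bbox.xmin`/`bbox.xmax`/`bbox.ymin`/`bbox.ymax` subfield names follow the Overture schema.

```python
from typing import List, Sequence, Tuple


def build_bbox_query(
    parquet_glob: str,
    bbox: Tuple[float, float, float, float],
    columns: Sequence[str],
) -> Tuple[str, List[float]]:
    """Return (sql, params) for a bbox-overlap scan (hypothetical helper)."""
    xmin, ymin, xmax, ymax = bbox
    if xmin > xmax or ymin > ymax:
        raise ValueError("bbox must be (xmin, ymin, xmax, ymax)")
    # Quote identifiers and the path literal; bbox values are bound parameters.
    select_list = ", ".join('"' + c.replace('"', '""') + '"' for c in columns)
    path_literal = "'" + parquet_glob.replace("'", "''") + "'"
    sql = (
        f"SELECT {select_list} "
        f"FROM read_parquet({path_literal}) "
        "WHERE bbox.xmin <= ? AND bbox.xmax >= ? "
        "AND bbox.ymin <= ? AND bbox.ymax >= ?"
    )
    # Parameter order matches the placeholders: query xmax, xmin, ymax, ymin.
    return sql, [xmax, xmin, ymax, ymin]
```

With DuckDB installed, the pair would presumably be passed to `con.execute(sql, params)` and drained with repeated `con.fetch_df_chunk()` calls until an empty frame comes back. Keeping the query construction in a pure function like this is one way to unit-test the SQL and parameters without touching the network.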

Stream Overture parcel building stats

5b5daf8

Harden streaming Overture stats fallback

b5e9ec5

fabrica-cc-engineering-agent Bot force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from c0aeaf5 to af1278c on June 22, 2026 21:11

Harden streaming Overture aggregation

17164a3

fabrica-cc-engineering-agent Bot force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from af1278c to e8adedb on June 22, 2026 21:30

Narrow streaming fixes to duplicate keys

9deed30

fabrica-cc-engineering-agent Bot force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from e8adedb to c0853b1 on June 22, 2026 21:34

Harden Overture cache edge cases

60a8098

fabrica-cc-engineering-agent Bot force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from c0853b1 to 19899f8 on June 22, 2026 21:47

Bypass Overture cache for duplicate keys

79439ce

fabrica-cc-engineering-agent Bot force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from 19899f8 to e6adce5 on June 22, 2026 21:56

Tighten Overture enrichment tests

8f7a5b3

fabrica-cc-engineering-agent Bot force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from e6adce5 to b160a31 on June 22, 2026 22:05

fabrica-cc-engineering-agent Bot added 6 commits June 22, 2026 18:18

Extract Overture streaming test helper

09b81d1

Harden DuckDB Overture fetch path

caa297e

Clarify Overture fetch tests

0034e0b

Tighten Overture fetch regressions

46498f6

Separate Overture fetch and enrichment tests

b0b5cb8

fabrica-cc-engineering-agent Bot force-pushed the sfogliatella/eng-3033-overture-fetch-pushdown branch from 353bd7c to b0b5cb8 on June 22, 2026 22:20

mividtim wants to merge 13 commits into larsiusprime:master from fabrica-land:sfogliatella/eng-3033-overture-fetch-pushdown
