Push Overture building fetch bbox predicate down to Parquet row groups #368
Open
mividtim wants to merge 13 commits into
Thank you for your contribution. I affirm that this contributor has signed the CLA. You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.
…ckDB)

The streaming fetch still OOM-killed the 4 GiB AVM worker on wide bounding boxes: _building_batches read the global Overture buildings theme through PyArrow's dataset scanner, which does not prune row groups on the nested `bbox` struct subfields. Streaming the *output* did not bound memory because the *scan* still materializes a huge fraction of the global theme regardless of bbox. A ~1 km box pulled ~3 GB RSS over ~84 s, and a single 0.05deg tile alone exceeded 4 GiB, so subprocess tiling could never bound it either.

Fetch via DuckDB read_parquet instead, with the bbox-overlap predicate pushed to Parquet row-group statistics. The matched rows are streamed back with fetch_df_chunk, so the full building set is still never materialized at once, preserving the streaming contract the per-parcel stats aggregation relies on. The same ~1 km fetch returns the identical rows at ~0.4 GB; full get_buildings peak ~0.57 GB.

Building outputs (geometry WKB, height_m_best, floors_best, footprint area) are bit-for-bit identical to the prior streaming path, verified row-for-row on a built-up multi-parcel bbox, so there is no model-output change and no version bump is required.

Pandas (not Arrow) is used because the Overture `geometry` column is a GeoArrow GEOMETRY type whose DuckDB Arrow export resolves its CRS via the `spatial` extension catalog (not loaded) and raises an internal error; the pandas path returns the raw stored WKB verbatim. httpfs is loaded explicitly (INSTALL is an idempotent no-op when the extension is already present, e.g. baked into the image).

Adds duckdb to requirements; repoints the _building_batches split test at the new DuckDB stream seam; adds a network-free unit test covering the pushdown SQL/params, the unavailable-column drop, and the GeoDataFrame construction.

ENG-3033
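For readers who want a concrete picture of the fetch described in this commit, here is a minimal sketch, not the code in this PR. The function names `build_bbox_query` and `stream_chunks`, the column list, and the source path are hypothetical. It assumes the Overture `bbox` struct exposes `xmin`, `xmax`, `ymin`, `ymax`, that `read_parquet` accepts a bound parameter for its path, and that `fetch_df_chunk` returns an empty DataFrame once the result is exhausted.

```python
from typing import Iterator, List, Sequence, Tuple


def build_bbox_query(
    source: str,
    bbox: Tuple[float, float, float, float],
    columns: Sequence[str],
) -> Tuple[str, List[object]]:
    """Build the pushdown SQL and its parameters without touching the network.

    `bbox` is (xmin, ymin, xmax, ymax). `columns` must be trusted identifiers,
    because they are interpolated into the SQL text rather than bound.
    """
    qxmin, qymin, qxmax, qymax = bbox
    select_list = ", ".join(columns)
    sql = (
        f"SELECT {select_list} FROM read_parquet(?) "
        "WHERE bbox.xmin <= ? AND bbox.xmax >= ? "
        "AND bbox.ymin <= ? AND bbox.ymax >= ?"
    )
    # Parameter order follows the placeholders: path, then the overlap bounds.
    params: List[object] = [source, qxmax, qxmin, qymax, qymin]
    return sql, params


def stream_chunks(sql: str, params: List[object]) -> Iterator["object"]:
    """Yield pandas DataFrame chunks so the full result is never held at once."""
    import duckdb  # lazy import keeps build_bbox_query dependency-free

    con = duckdb.connect()
    try:
        con.execute("INSTALL httpfs")  # no-op if the extension is already present
        con.execute("LOAD httpfs")
        con.execute(sql, params)
        while True:
            chunk = con.fetch_df_chunk()
            if chunk.empty:  # assumed end-of-result signal
                break
            yield chunk
    finally:
        con.close()
```

Keeping the SQL builder separate from the connection code is one way to make the pushdown predicate and its parameters checkable in a network-free unit test.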
Summary
Replaces the Overture building fetch batch source with a DuckDB-backed Parquet scan that pushes the bbox overlap predicate into row-group pruning, then re-slices streamed pandas chunks into bounded GeoDataFrame batches for the existing streaming stats path.
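The re-slicing step can be pictured with a small sketch. This is an illustration under assumptions, not the PR's implementation: `reslice` is a hypothetical name, it only splits oversized chunks (it does not merge small ones), and it relies on each chunk supporting `len()` and positional `.iloc[start:stop]` slicing, as pandas DataFrames do. Conversion of each slice to a GeoDataFrame is omitted.

```python
from typing import Any, Iterable, Iterator


def reslice(chunks: Iterable[Any], max_rows: int) -> Iterator[Any]:
    """Split streamed chunks so that no yielded batch exceeds max_rows rows."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    for chunk in chunks:
        # Walk the chunk in windows of max_rows; an empty chunk yields nothing.
        for start in range(0, len(chunk), max_rows):
            yield chunk.iloc[start:start + max_rows]
```

Because the generator consumes one upstream chunk at a time, peak memory stays tied to the largest single chunk rather than to the total number of matched rows.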
Why
The prior PyArrow dataset scanner can decode far more of the global buildings theme than needed for a small bbox because it does not reliably prune on nested bbox struct subfields. In the AVM worker this caused multi-GB memory spikes for small dense queries. DuckDB can apply the bbox predicate against Parquet row-group statistics, sharply reducing peak memory while preserving the streamed aggregation contract.
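To make the pruning argument concrete, the sketch below shows how a reader could decide, from min/max statistics alone, whether a row group might contain a building overlapping the query bbox. The `RowGroupStats` container and `may_overlap` function are hypothetical names used for illustration; real Parquet readers, including DuckDB, implement this internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Min/max statistics relevant to the overlap test for one row group."""
    xmin_min: float  # smallest bbox.xmin in the row group
    xmax_max: float  # largest bbox.xmax in the row group
    ymin_min: float  # smallest bbox.ymin in the row group
    ymax_max: float  # largest bbox.ymax in the row group


def may_overlap(stats: RowGroupStats, qxmin: float, qymin: float,
                qxmax: float, qymax: float) -> bool:
    """Return False only when no row in the group can satisfy the predicate.

    Row-level predicate:
        bbox.xmin <= qxmax AND bbox.xmax >= qxmin
        AND bbox.ymin <= qymax AND bbox.ymax >= qymin
    Each conjunct is checked against the most favorable statistic, so the
    test is conservative: True means "must be read", not "definitely matches".
    """
    return (
        stats.xmin_min <= qxmax
        and stats.xmax_max >= qxmin
        and stats.ymin_min <= qymax
        and stats.ymax_max >= qymin
    )
```

A row group whose statistics fail this test can be skipped without decoding any of its rows, which is where the memory saving for small bounding boxes comes from.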
Stacking note
This branch is intentionally stacked on #356 (strudel/eng-3033-overture-streaming-upstream). Until #356 merges, this PR's diff includes that streaming foundation plus the DuckDB pushdown commit. After #356 lands, this branch can be rebased so the PR shows only the pushdown layer.
Validation
python -m py_compile openavmkit/utilities/overture.py
python -m pytest tests/test_overture_streaming.py tests/test_overture_fetch.py -q
Passed locally on the feature branch and on fabrica-land integration: 7 passed, 1 pandas FutureWarning.