Improve coverage pipeline performance #102
Merged
…GeoJSON

Three hot spots, one behavior-preserving pass (the end-to-end replay suite asserts identical outputs):

- The fabric CSV (tens of MB) was re-parsed, and its points rebuilt one Python object at a time, for EVERY coverage file in a recompute. load_fabric_gdf() now parses it once (geometry built vectorized via points_from_xy), and process_data passes the frame to each file's compute instead of letting it reload.
- add_to_db wrote results with iterrows() + per-row ORM objects and a dead IntegrityError-swallowing batch path (kml_data has no unique constraint). It now COPYs the rows in one statement on the session's own connection, keeping the same transaction semantics.
- The retile dumped one giant FeatureCollection through json.dump; it now streams newline-delimited features through orjson, which is also the input shape tippecanoe -P can actually parallelize parsing on.

Coverage compute drops to roughly a third of its former wall clock on a multi-file filing; the retile's serialization cost mostly vanishes.
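The move from one large FeatureCollection to newline-delimited features can be sketched roughly as below. This is only an illustration: it uses the standard-library `json` module instead of orjson so it runs without extra dependencies, and `write_ndjson_features` and the sample features are hypothetical names, not code from this PR.

```python
import io
import json
from typing import Iterable, TextIO


def write_ndjson_features(features: Iterable[dict], out: TextIO) -> int:
    """Write one GeoJSON Feature per line and return how many were written."""
    count = 0
    for feature in features:
        # Compact separators, no indent: each feature stays on a single line.
        out.write(json.dumps(feature, separators=(",", ":")))
        out.write("\n")
        count += 1
    return count


def sample_features():
    # Hypothetical stand-in for the real feature source; a generator, so
    # the full collection is never held in memory at once.
    for i in range(3):
        yield {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-90.0 + i, 40.0]},
            "properties": {"location_id": i},
        }


buf = io.StringIO()
written = write_ndjson_features(sample_features(), buf)
lines = buf.getvalue().splitlines()
```

Because every line is a complete JSON document, a reader can split the input on newlines and parse the chunks independently, which a single enclosing FeatureCollection does not allow.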
…store one tileset

The full tile build is now two tippecanoe runs merged into one tileset: overview zooms (z0-8) keep the density-dropping flags they need, while detail zooms (z9-16) are built with no dropping of any kind, so a tile's bytes are a pure function of the features that intersect it. tippecanoe's --drop-densest-as-needed shares its discovered min-gap across a whole zoom level, which both silently thinned dense areas and would defeat regional tile splicing (an upcoming change).

Also:

- Stop storing the mbtiles file blob, including in folder.copy snapshots (vector_tiles rows are the only serving truth; nothing ever read the blob back).
- Deterministic ordering for the tile feature stream (files by id, fabric points by location_id).
- Flatten the feature arrays fed to create_tiles (two callers nested per-file feature lists, relying on tippecanoe tolerating JSON arrays as ldjson lines).
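The ordering and flattening described above might look something like the following sketch. The function name `ordered_feature_stream` and the input shapes (a dict of per-file feature lists, a list of fabric point features) are assumptions made for illustration; the real create_tiles callers may be structured differently.

```python
from itertools import chain


def ordered_feature_stream(coverage_by_file, fabric_points):
    """Return a flat, deterministically ordered list of features.

    coverage_by_file: dict of file id -> list of features (hypothetical shape).
    fabric_points: list of features carrying properties["location_id"].
    """
    # Files by id; chain.from_iterable flattens the per-file lists so no
    # nested arrays reach the tile builder.
    coverage = chain.from_iterable(
        coverage_by_file[file_id] for file_id in sorted(coverage_by_file)
    )
    # Fabric points by location_id.
    points = sorted(fabric_points, key=lambda f: f["properties"]["location_id"])
    return list(coverage) + points


coverage_by_file = {
    7: [{"type": "Feature", "properties": {"file": 7, "n": 0}}],
    3: [
        {"type": "Feature", "properties": {"file": 3, "n": 0}},
        {"type": "Feature", "properties": {"file": 3, "n": 1}},
    ],
}
fabric_points = [
    {"type": "Feature", "properties": {"location_id": 20}},
    {"type": "Feature", "properties": {"location_id": 10}},
]
stream = ordered_feature_stream(coverage_by_file, fabric_points)
```

With a stable input order and no feature dropping, the same set of features should yield the same stream regardless of dict or query ordering upstream.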
…ilds

An edit now records its polygons' bbox as scoped dirt; the retile that consumes it regenerates only the dirty region's z9-16 tiles (snapped to the z9 grid, expanded by tippecanoe's tile-buffer overhang) and replaces those rows inside the current tileset in one transaction. Because z9-16 is built drop-free, the spliced tiles are byte-identical to a full rebuild's, pinned by test_splice_matches_full_rebuild_byte_for_byte (real tippecanoe).

The splice input is the region's points plus every coverage geometry passed whole: clipping geometries (shapely or tippecanoe's --clip-bounding-box) perturbs polygon simplification deep inside the region; out-of-region output is discarded instead.

Whole-tileset dirt (uploads, recomputes), a splice failure, or no redis all fall back to the full rebuild. Splices leave z0-8 overview tiles slightly stale (an edited dot is sub-pixel there); a new beat task full-rebuilds a spliced folder once it has been quiet for ten minutes.
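Snapping a dirty bbox to the z9 grid can be sketched with standard Web Mercator (XYZ) tile arithmetic, as below. The function names are hypothetical, and the buffer handling is an assumption: it treats the overhang as `buffer_px` units of 1/256 of a tile with a default of 5, which should be checked against the tippecanoe version in use. The PR's actual implementation may differ.

```python
import math


def _tile_xy(lon, lat, zoom):
    """Fractional XYZ tile coordinates for a lon/lat in degrees."""
    n = 2 ** zoom
    x = (lon + 180.0) / 360.0 * n
    y = (1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n
    return x, y


def dirty_tile_range(bbox, zoom=9, buffer_px=5):
    """Inclusive (x0, y0, x1, y1) tile range covering bbox plus buffer.

    bbox is (west, south, east, north) in degrees, within Mercator limits.
    buffer_px is an assumed overhang in 1/256ths of a tile.
    """
    west, south, east, north = bbox
    pad = buffer_px / 256.0
    x0, y0 = _tile_xy(west, north, zoom)  # north-west corner: smallest x and y
    x1, y1 = _tile_xy(east, south, zoom)  # south-east corner: largest x and y
    last = 2 ** zoom - 1

    def clamp(value):
        return max(0, min(last, math.floor(value)))

    return clamp(x0 - pad), clamp(y0 - pad), clamp(x1 + pad), clamp(y1 + pad)


def descendant_range(tile_range, base_zoom, zoom):
    """Tiles at a deeper zoom that sit under an inclusive base-zoom range."""
    shift = zoom - base_zoom
    x0, y0, x1, y1 = tile_range
    return (x0 << shift, y0 << shift,
            ((x1 + 1) << shift) - 1, ((y1 + 1) << shift) - 1)


inside = dirty_tile_range((0.3, -0.4, 0.4, -0.3))
near_edge = dirty_tile_range((0.01, -0.4, 0.4, -0.3))
```

Here `inside` stays within a single z9 tile, while `near_edge` picks up the neighbouring column because its western edge lies within the buffer distance of a tile boundary. `descendant_range` then gives the z10-16 tiles that would be regenerated under that z9 range.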
The BDC accepts multiple technology claims per location, so the export CSV now reports each one by default. For providers who prefer a single row per location, the export_max_service_only site setting (admin settings page, default off) keeps only the fastest claim: download desc, then upload desc, then low-latency first, then lowest technology code as a deterministic tiebreak — honored by both CSV-generation sites and pinned by unit tests. Also adds a golden test that replays archived filings end to end and asserts the generated CSV matches the originally filed bytes (fixtures local-only, skip-if-absent).
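The "fastest claim" ordering can be expressed as a single sort key, sketched below. The field names (`location_id`, `download`, `upload`, `low_latency`, `technology`) and the sample rows are assumptions for illustration; the actual export code may use different column names.

```python
from collections import defaultdict


def claim_rank(claim):
    """Sort key where the smallest value is the preferred claim."""
    return (
        -claim["download"],                # download desc
        -claim["upload"],                  # then upload desc
        0 if claim["low_latency"] else 1,  # then low-latency first
        claim["technology"],               # then lowest technology code
    )


def max_service_only(claims):
    """Keep one claim per location, chosen by claim_rank."""
    by_location = defaultdict(list)
    for claim in claims:
        by_location[claim["location_id"]].append(claim)
    return [min(group, key=claim_rank) for _, group in sorted(by_location.items())]


claims = [
    {"location_id": 1, "technology": 50, "download": 1000, "upload": 1000, "low_latency": True},
    {"location_id": 1, "technology": 40, "download": 1000, "upload": 35, "low_latency": True},
    {"location_id": 2, "technology": 71, "download": 100, "upload": 20, "low_latency": True},
    {"location_id": 2, "technology": 70, "download": 100, "upload": 20, "low_latency": True},
    {"location_id": 3, "technology": 60, "download": 100, "upload": 20, "low_latency": False},
    {"location_id": 3, "technology": 61, "download": 100, "upload": 20, "low_latency": True},
]
kept = max_service_only(claims)
```

Location 1 is decided by upload speed, location 2 by the technology-code tiebreak, and location 3 by latency, so each level of the ordering is exercised once.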
A number of performance improvements to the coverage and tile generation pipeline that do not affect correctness. The changes to the coverage calculation core are mostly efficiency improvements that eliminate redundant computation. Tileset generation (still the slowest stage by wall-clock time) has been changed to allow "splicing" only the changed tiles into an existing tileset: when a user makes a change that affects map tiles, only the tiles in the vicinity of the changed geometries are rebuilt. Overall, this PR reduces wall-clock time by about 75% for most edits.