Improve coverage pipeline performance #102

Merged
shaddi merged 4 commits into main from pipeline-performance on Jun 13, 2026

@shaddi shaddi commented Jun 13, 2026

Member

A number of improvements to the performance of the coverage and tile generation pipeline, none of which affect correctness. The changes to the coverage calculation core are mostly efficiency improvements and the elimination of redundant computation. Tileset generation (still the slowest stage by wall-clock time) has been tweaked to enable "splicing" only the changed tiles into an existing tileset: when a user makes a change that would affect map tiles, only tiles in the vicinity of the changed geometries are rebuilt. Overall, this PR cuts wall-clock time by about 75% for most edits.

shaddi added 4 commits June 12, 2026 23:53
…GeoJSON

Three hot spots, one behavior-preserving pass (the end-to-end replay
suite asserts identical outputs):

- The fabric CSV (tens of MB) was re-parsed — and its points rebuilt
  one Python object at a time — for EVERY coverage file in a recompute.
  load_fabric_gdf() parses it once (geometry built vectorized via
  points_from_xy) and process_data passes the frame to each file's
  compute instead of letting it reload.
- add_to_db wrote results with iterrows() + per-row ORM objects and a
  dead IntegrityError-swallowing batch path (kml_data has no unique
  constraint). It now COPYs the rows in one statement on the session's
  own connection, keeping the same transaction semantics.
- The retile dumped one giant FeatureCollection through json.dump; it
  now streams newline-delimited features through orjson — also the
  input shape tippecanoe -P can actually parallelize parsing on.
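
For reviewers, the parse-once pattern from the first bullet can be sketched roughly as below. This is a stdlib-only illustration, not the code in this PR: the real load_fabric_gdf() builds a GeoDataFrame with points_from_xy, while `load_fabric_once`, `compute_coverage`, `process_files`, and the column names `location_id`/`longitude`/`latitude` are all hypothetical stand-ins.

```python
import csv
import io


def load_fabric_once(csv_text):
    """Parse the fabric CSV a single time into a list of point records.

    Column names are assumptions made for this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "location_id": int(row["location_id"]),
            "lon": float(row["longitude"]),
            "lat": float(row["latitude"]),
        }
        for row in reader
    ]


def compute_coverage(fabric, bbox):
    """Toy per-file compute: ids of fabric points inside a bounding box."""
    west, south, east, north = bbox
    return [
        p["location_id"]
        for p in fabric
        if west <= p["lon"] <= east and south <= p["lat"] <= north
    ]


def process_files(csv_text, file_bboxes):
    """Parse the fabric once, then hand the same object to every file."""
    fabric = load_fabric_once(csv_text)  # hoisted out of the per-file loop
    return {
        name: compute_coverage(fabric, bbox)
        for name, bbox in file_bboxes.items()
    }
```

The point is only the hoisting: the expensive parse happens once per recompute rather than once per coverage file.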

Coverage compute drops to roughly a third of its former wall clock on
a multi-file filing; the retile's serialization cost mostly vanishes.
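
The serialization change can be sketched as follows: instead of materializing one FeatureCollection, each feature is written as its own line. This sketch swaps in the stdlib json module so it runs anywhere; the PR itself uses orjson, whose dumps() returns bytes, so the real writer would target a binary file. `write_ldjson` is a hypothetical name.

```python
import io
import json


def write_ldjson(features, fp):
    """Stream features to fp as newline-delimited JSON, one per line.

    `features` may be any iterable (including a generator), so the full
    collection never has to be held in memory. Returns the count written.
    """
    count = 0
    for feature in features:
        # Compact separators keep each feature on a single short line.
        fp.write(json.dumps(feature, separators=(",", ":")))
        fp.write("\n")
        count += 1
    return count
```

Usage: `write_ldjson(iter_features(), open(path, "w"))`, where `iter_features` is whatever generator yields GeoJSON Feature dicts.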
…store one tileset

The full tile build is now two tippecanoe runs merged into one tileset:
overview zooms (z0-8) keep the density-dropping flags they need, detail
zooms (z9-16) are built with no dropping of any kind so a tile's bytes are
a pure function of the features that intersect it. tippecanoe's
--drop-densest-as-needed shares its discovered min-gap across a whole zoom
level, which both silently thinned dense areas and would defeat regional
tile splicing (an upcoming change).

Also: stop storing the mbtiles file blob (vector_tiles rows are the only
serving truth; nothing ever read the blob back) including in folder.copy
snapshots; deterministic ordering for the tile feature stream (files by id,
fabric points by location_id); flatten the feature arrays fed to
create_tiles (two callers nested per-file feature lists, relying on
tippecanoe tolerating JSON arrays as ldjson lines).
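
A minimal sketch of the ordering and flattening described above, assuming each file is a dict with an `id` and a `features` list and each fabric point carries `properties.location_id`. Those shapes and the name `ordered_feature_stream` are assumptions for illustration, not the PR's actual data model.

```python
from itertools import chain


def ordered_feature_stream(files, fabric_points):
    """Return one flat, deterministically ordered list of features.

    Files are visited in ascending id order and their per-file feature
    lists are flattened; fabric points follow, sorted by location_id.
    """
    file_features = chain.from_iterable(
        f["features"] for f in sorted(files, key=lambda f: f["id"])
    )
    point_features = sorted(
        fabric_points, key=lambda p: p["properties"]["location_id"]
    )
    return list(chain(file_features, point_features))
```

With the same inputs in any order, the output sequence is identical, which is what lets two builds be compared tile by tile.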
…ilds

An edit now records its polygons' bbox as scoped dirt; the retile that
consumes it regenerates only the dirty region's z9-16 tiles (snapped to the
z9 grid, expanded by tippecanoe's tile-buffer overhang) and replaces those
rows inside the current tileset in one transaction. Because z9-16 is built
drop-free, the spliced tiles are byte-identical to a full rebuild's - pinned
by test_splice_matches_full_rebuild_byte_for_byte (real tippecanoe).
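
To make the "snapped to the z9 grid" step concrete, here is a rough sketch of how a dirty bbox might map to tile ranges using standard Web Mercator XYZ tile math. The function names and the `buffer_frac` default are assumptions: 5/256 mirrors tippecanoe's documented default buffer of 5 screen pixels (1/256 of a tile each), and the PR's actual expansion logic may differ.

```python
import math


def lonlat_to_tile_frac(lon, lat, zoom):
    """Fractional XYZ tile coordinates (Web Mercator) for a lon/lat."""
    n = 2 ** zoom
    x = (lon + 180.0) / 360.0 * n
    y = (1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n
    return x, y


def dirty_tile_ranges(bbox, base_zoom=9, max_zoom=16, buffer_frac=5 / 256):
    """Map a (west, south, east, north) bbox to inclusive tile ranges.

    The bbox is padded by buffer_frac of a base-zoom tile, snapped outward
    to whole base-zoom tiles, then scaled to each deeper zoom. Returns
    {zoom: (x_min, y_min, x_max, y_max)}.
    """
    west, south, east, north = bbox
    n = 2 ** base_zoom
    x0f, y0f = lonlat_to_tile_frac(west, north, base_zoom)  # top-left
    x1f, y1f = lonlat_to_tile_frac(east, south, base_zoom)  # bottom-right
    x0 = max(0, math.floor(x0f - buffer_frac))
    y0 = max(0, math.floor(y0f - buffer_frac))
    x1 = min(n - 1, math.floor(x1f + buffer_frac))
    y1 = min(n - 1, math.floor(y1f + buffer_frac))
    ranges = {}
    for z in range(base_zoom, max_zoom + 1):
        scale = 2 ** (z - base_zoom)  # each z9 tile holds scale^2 tiles at z
        ranges[z] = (x0 * scale, y0 * scale,
                     (x1 + 1) * scale - 1, (y1 + 1) * scale - 1)
    return ranges
```

Snapping at z9 means every deeper zoom's range is an exact subdivision of the same z9 tiles, so the set of rows to replace is unambiguous.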

The splice input is the region's points plus every coverage geometry passed
whole: clipping geometries (shapely or tippecanoe's --clip-bounding-box)
perturbs polygon simplification deep inside the region; out-of-region output
is discarded instead.

Whole-tileset dirt (uploads, recomputes), a splice failure, or no redis all
fall back to the full rebuild. Splices leave z0-8 overview tiles slightly
stale (an edited dot is sub-pixel there); a new beat task full-rebuilds a
spliced folder once it has been quiet for ten minutes.
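
The fallback rules above amount to a small decision procedure, sketched here with hypothetical names (`Dirt`, `retile`, `needs_cleanup_rebuild`); the real task wiring, redis checks, and beat schedule are not shown in this PR text and are assumed.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

QUIET_SECONDS = 600  # "quiet for ten minutes"


@dataclass
class Dirt:
    whole_tileset: bool
    bbox: Optional[Tuple[float, float, float, float]] = None


def retile(dirt: Dirt, redis_available: bool,
           splice: Callable[[Tuple[float, float, float, float]], None],
           full_rebuild: Callable[[], None]) -> str:
    """Run a splice when possible, otherwise (or on failure) a full rebuild."""
    if not redis_available or dirt.whole_tileset or dirt.bbox is None:
        full_rebuild()
        return "full"
    try:
        splice(dirt.bbox)
        return "splice"
    except Exception:
        # Any splice failure falls back to the full rebuild.
        full_rebuild()
        return "full"


def needs_cleanup_rebuild(last_splice_at: Optional[float], now: float,
                          quiet_seconds: int = QUIET_SECONDS) -> bool:
    """True once a spliced folder has gone quiet_seconds without a splice."""
    return last_splice_at is not None and now - last_splice_at >= quiet_seconds
```
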
The BDC accepts multiple technology claims per location, so the export CSV
now reports each one by default. For providers who prefer a single row per
location, the export_max_service_only site setting (admin settings page,
default off) keeps only the fastest claim: download desc, then upload desc,
then low-latency first, then lowest technology code as a deterministic
tiebreak — honored by both CSV-generation sites and pinned by unit tests.
Also adds a golden test that replays archived filings end to end and asserts
the generated CSV matches the originally filed bytes (fixtures local-only,
skip-if-absent).
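
The max-service ordering can be expressed as a single sort key. A minimal sketch, assuming each claim is a dict with `location_id`, `download`, `upload`, `low_latency`, and `technology` fields; those field names and `best_claim_per_location` are hypothetical, not the export code itself.

```python
def claim_rank(claim):
    """Sort key: smaller tuple means a 'faster' claim.

    Download desc, then upload desc, then low-latency first, then the
    lowest technology code as a deterministic tiebreak.
    """
    return (
        -claim["download"],
        -claim["upload"],
        0 if claim["low_latency"] else 1,
        claim["technology"],
    )


def best_claim_per_location(claims):
    """Keep only the top-ranked claim for each location_id."""
    best = {}
    for claim in claims:
        current = best.get(claim["location_id"])
        if current is None or claim_rank(claim) < claim_rank(current):
            best[claim["location_id"]] = claim
    # Emit in location_id order so the output is deterministic.
    return [best[loc] for loc in sorted(best)]
```

Because the key is a total order over the four fields, the result does not depend on the order in which claims arrive.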
@shaddi shaddi merged commit baf3ec7 into main Jun 13, 2026
3 checks passed
@shaddi shaddi deleted the pipeline-performance branch June 13, 2026 04:39