Partial partition rewrite for in-place inventory modifications & treatments #368

Open
amarcozzi wants to merge 1 commit into main from 355-partial-rewrite-inventory-modifications

@amarcozzi

Contributor

Problem

In-place modifications and treatments rewrote the entire Parquet dataset on every call. save_parquet_replace wrote the whole inventory to a {id}__rev staging prefix, deleted the live directory, copied staging back, and deleted staging. That swap takes many GCS operations, and its cost scaled with total inventory size regardless of how many rows the operation actually touched: a modification scoped to one corner of the domain paid the same price as one that rewrote every tree.

Closes #355.

Change

Rewrite only the partitions whose content actually changes.

standgen storage.py

  • write_changed_partitions(id, transform): reads every partition in one concurrent gcsfs cat, applies a per-partition DataFrame -> DataFrame transform, and writes back (in one concurrent pipe) only the partitions where not new.equals(old), under their existing names, so a scoped op rewrites only the partitions it overlaps. Replaces save_parquet_replace. Leaves _metadata untouched: the file set is unchanged and every reader re-reads each file's footer, so a stale aggregate never corrupts a read.
  • write_full_partitions(id, df): materialized full rewrite for the one case that re-partitions globally. Drops the now-wrong _metadata/_common_metadata and any stale part files so readers list the directory fresh.
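
The changed-partition check described above can be sketched in a few lines. This is a minimal illustration, not the code in storage.py: it assumes the partitions are already loaded as pandas DataFrames and uses a plain dict as a hypothetical stand-in for the GCS prefix, so the concurrent gcsfs cat/pipe calls are left out. The names write_changed_partitions_sketch, raise_height_in_west, and the x/height columns are made up for the example.

```python
from typing import Callable, Dict, List

import pandas as pd

Transform = Callable[[pd.DataFrame], pd.DataFrame]


def write_changed_partitions_sketch(
    store: Dict[str, pd.DataFrame], transform: Transform
) -> List[str]:
    """Apply transform to every partition; overwrite only those that changed.

    `store` is a hypothetical in-memory stand-in for the partition files.
    Returns the names of the partitions that were rewritten.
    """
    rewritten = []
    for name in sorted(store):
        old = store[name]
        new = transform(old.copy())  # copy so the comparison sees the original
        if not new.equals(old):
            store[name] = new  # same name: the file set does not change
            rewritten.append(name)
    return rewritten


def raise_height_in_west(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical scoped modification: double height where x < 100."""
    mask = df["x"] < 100.0
    df.loc[mask, "height"] = df.loc[mask, "height"] * 2.0
    return df


store = {
    "part.0.parquet": pd.DataFrame({"x": [10.0, 50.0], "height": [5.0, 7.0]}),
    "part.1.parquet": pd.DataFrame({"x": [150.0, 200.0], "height": [6.0, 8.0]}),
}
rewritten = write_changed_partitions_sketch(store, raise_height_in_west)
print(rewritten)
```

Only the partition that overlaps the scoped region is reported as rewritten; the other one compares equal and is left alone.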

Routing

  • Modifications: every mod goes through write_changed_partitions.
  • Treatments route by per-partition expressibility, not metric (a polygon-scoped thin only changes the partitions its polygon overlaps, whatever the metric):
    • diameter and directional basal-area thins → partial rewrite. A directional basal-area thin is reduced to a single diameter cutoff precomputed over the treated population (precompute_cutoffs_directional_cutoff), then applied per-partition. Verified that the cutoff filter reproduces the fastfuels-core thinner exactly (identical kept sets, both directions).
    • proportional basal-area → write_full_partitions. This is random whole-stand removal, which can't be a per-partition filter without diverging from the thinner's RNG.
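
To illustrate why a directional basal-area thin can be expressed as a per-partition filter, here is a rough sketch of a "from below" thin reduced to one diameter cutoff. It is not the fastfuels-core algorithm or the PR's precompute step: the function names, the centimetre units, the target value, and the assumption of distinct diameters (no ties at the cutoff) are all illustrative.

```python
import math
from typing import List


def basal_area_m2(dbh_cm: float) -> float:
    """Cross-sectional stem area in m^2 for a diameter given in cm."""
    return math.pi / 4.0 * (dbh_cm / 100.0) ** 2


def cutoff_from_below(diameters: List[float], target_ba_m2: float) -> float:
    """Diameter of the smallest tree kept when the smallest trees are
    removed first until total basal area is at or below the target."""
    ordered = sorted(diameters)
    remaining = sum(basal_area_m2(d) for d in ordered)
    for d in ordered:
        if remaining <= target_ba_m2:
            return d  # this tree and every larger one are kept
        remaining -= basal_area_m2(d)
    return math.inf  # target not reachable without removing every tree


# Hypothetical inventory split across three partitions (diameters in cm).
partitions = [[10.0, 30.0], [20.0, 40.0], [50.0]]

# The cutoff is computed once over the whole treated population...
cutoff = cutoff_from_below([d for part in partitions for d in part], 0.35)

# ...and then applied independently to each partition.
kept = [[d for d in part if d >= cutoff] for part in partitions]
print(cutoff, kept)
```

In this toy case the last partition is untouched by the filter, so under the changed-partition rule it would not be rewritten.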

API (cache.py / router.py / schema.py)

  • /data/metadata and /data/{i} now read num_partitions / total_rows / columns from the dask DataFrame (per-file footers) instead of the aggregated _metadata, so counts stay correct after a partial rewrite leaves _metadata stale. The per-partition row-count list is dropped from the metadata response (get a partition's count from /data/{i}); partition data is served by index.

Testing

  • standgen unit (196) + API cache/schema (48), ruff clean.
  • Integration suites all green against real GCS/Firestore + a live API server: standgen 27/27 (incl. directional-cutoff, proportional full-rewrite, spatially-scoped scope), API data-router 17/17, duplicate-router 12/12.

Benchmark

Adds services/api/benchmarks/bench_inplace_inventory.py — times in-place modifications end-to-end against the deployed pipeline (domain → PIM grid → PIM inventory → timed modify, cold rep discarded).
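
The "cold rep discarded, warm median" measurement can be sketched as below. This is an assumption about the general shape of such a timing loop, not the contents of bench_inplace_inventory.py; warm_median and its injectable timer parameter are hypothetical.

```python
import statistics
import time
from typing import Callable, List


def warm_median(
    op: Callable[[], None],
    warm_reps: int = 6,
    timer: Callable[[], float] = time.perf_counter,
) -> float:
    """Run op warm_reps + 1 times and return the median of the warm runs.

    The first (cold) run is timed like the others but its sample is thrown
    away, so one-off startup costs do not skew the result.
    """
    samples: List[float] = []
    for rep in range(warm_reps + 1):
        start = timer()
        op()
        elapsed = timer() - start
        if rep > 0:  # rep 0 is the cold run
            samples.append(elapsed)
    return statistics.median(samples)


# Usage with a stand-in operation; a real benchmark would call the API here.
print(warm_median(lambda: sum(range(10_000)), warm_reps=6))
```

Passing the timer in makes the loop testable with a fake clock.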

Baseline against the currently-deployed full-rewrite standgen (547k trees / 16 partitions, 6 warm reps):

| Scenario | Warm median |
| --- | --- |
| scoped (sub-region) | 2.24 s |
| global (all trees) | 2.24 s |

scoped == global is the full-rewrite signature: both scenarios rewrite all 16 partitions regardless of scope. After this PR is deployed, scoped should write only the partitions its region overlaps and drop below global (and global should fall too as the staging swap becomes an in-place overwrite). The after-deploy A/B will be added as a comment.

In-place modifications and treatments rewrote the entire Parquet dataset
via a delete + copy-back staging swap, regardless of how many rows the
operation actually touched. Rewrite only the partitions whose content
changes.

standgen storage:
- write_changed_partitions(id, transform): read every partition in one
  concurrent gcsfs cat, apply the per-partition transform, and write back
  only the partitions where not new.equals(old) in one concurrent pipe.
  Leaves _metadata untouched (the file set is unchanged; readers re-read
  each footer). Replaces save_parquet_replace.
- write_full_partitions(id, df): materialized full rewrite for the one
  case that re-partitions globally; drops the now-wrong _metadata /
  _common_metadata and stale part files so readers list fresh.

Modifications route every mod through write_changed_partitions. Treatments
route by per-partition expressibility, not metric: diameter and directional
basal-area thins (the latter reduced to a single precomputed diameter cutoff
that reproduces the fastfuels-core thinner exactly) rewrite only changed
partitions, including polygon-scoped thins; only proportional basal-area
(random whole-stand removal) falls back to write_full_partitions.

API metadata/data endpoints read num_partitions / total_rows / columns from
the dask DataFrame (per-file footers) instead of the aggregated _metadata,
so counts stay correct after a partial rewrite leaves _metadata stale. Drops
the per-partition row-count list from the /data metadata response.

Adds an end-to-end benchmark (benchmarks/bench_inplace_inventory.py) that
times in-place modifications against the deployed API + standgen pipeline.

Closes #355

Development

Successfully merging this pull request may close these issues.

Replace inventory in-place rewrite (delete + copy-back) with an atomic _metadata commit and stats-pruned partial partition rewrites
