Partial partition rewrite for in-place inventory modifications & treatments by amarcozzi · Pull Request #368 · silvxlabs/FastFuels-API-v2

amarcozzi · 2026-06-16T04:52:46Z

Problem

In-place modifications and treatments rewrote the entire Parquet dataset on every call. save_parquet_replace wrote the whole inventory to a {id}__rev staging prefix, deleted the live directory, copied staging back, and deleted staging — a many-GCS-op swap whose cost scaled with total inventory size, regardless of how many rows the operation actually touched. A modification scoped to one corner of the domain paid the same price as one that rewrote every tree.

Closes #355.

Change

Rewrite only the partitions whose content actually changes.

standgen storage.py

- write_changed_partitions(id, transform): reads every partition in one concurrent gcsfs cat, applies a per-partition DataFrame -> DataFrame transform, and writes back (one concurrent pipe) only the partitions where not new.equals(old), under their existing names. A scoped op rewrites only the partitions it overlaps. Replaces save_parquet_replace. Leaves _metadata untouched: the file set is unchanged and every reader re-reads each file's footer, so a stale aggregate never corrupts a read.
- write_full_partitions(id, df): materialized full rewrite for the one case that re-partitions globally; drops the now-wrong _metadata/_common_metadata and any stale part files so readers list the directory fresh.
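For reviewers, the read / transform / compare / write-back flow can be sketched roughly as below. This is an illustration, not the code in storage.py: the in-memory `Store` dict stands in for the GCS prefix (the real function uses a concurrent gcsfs cat and pipe), and `write_changed_partitions_sketch`, `raise_height_in_west`, and the column names are hypothetical.

```python
from typing import Callable, Dict, List

import pandas as pd

# Hypothetical stand-in for a GCS prefix: partition file name -> DataFrame.
Store = Dict[str, pd.DataFrame]


def write_changed_partitions_sketch(
    store: Store, transform: Callable[[pd.DataFrame], pd.DataFrame]
) -> List[str]:
    """Apply `transform` to every partition; overwrite only those that changed."""
    # Read phase (in the PR: one concurrent read of all partition files).
    old_parts = dict(store)

    # Transform phase: one DataFrame in, one DataFrame out, per partition.
    changed = {}
    for name, old in old_parts.items():
        new = transform(old.copy())
        if not new.equals(old):
            changed[name] = new

    # Write phase: changed partitions go back under their existing names,
    # so the set of files in the prefix stays the same.
    store.update(changed)
    return sorted(changed)


def raise_height_in_west(df: pd.DataFrame) -> pd.DataFrame:
    """Example scoped modification: only trees with x < 50 are touched."""
    west = df["x"] < 50.0
    df.loc[west, "height"] = df.loc[west, "height"] + 1.0
    return df


store = {
    "part.0.parquet": pd.DataFrame({"x": [10.0, 20.0], "height": [5.0, 6.0]}),
    "part.1.parquet": pd.DataFrame({"x": [60.0, 70.0], "height": [7.0, 8.0]}),
}
rewritten = write_changed_partitions_sketch(store, raise_height_in_west)
print(rewritten)  # ['part.0.parquet']
```

Only the partition that overlaps the scoped region is written back; the other keeps its existing file untouched.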

Routing

Modifications: every mod goes through write_changed_partitions.
Treatments route by per-partition expressibility, not metric (a polygon-scoped thin only changes the partitions its polygon overlaps, whatever the metric):
- diameter and directional basal-area thins → partial rewrite. Directional basal-area is reduced to a single diameter cutoff precomputed over the treated population (precompute_cutoffs → _directional_cutoff), then applied per-partition. Verified the cutoff filter reproduces the fastfuels-core thinner exactly (identical kept sets, both directions).
- proportional basal-area (random whole-stand removal — can't be a per-partition filter without diverging from the thinner's RNG) → write_full_partitions.
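To illustrate why a directional basal-area thin can be applied per partition, here is a minimal sketch of a thin-from-below reduced to one diameter cutoff. It assumes distinct diameters, diameters in centimetres, and a thinner that removes the smallest stems until remaining basal area is at or below the target; these are assumptions for the example, not confirmed details of fastfuels-core or `_directional_cutoff`, and the function names are hypothetical. A thin-from-above would mirror it with the sort reversed.

```python
import math
from typing import List, Optional


def basal_area(diameter_cm: float) -> float:
    """Cross-sectional area of one stem in m^2, from a diameter in cm."""
    return math.pi * (diameter_cm / 200.0) ** 2


def from_below_cutoff(diameters: List[float], target_ba: float) -> Optional[float]:
    """Smallest diameter that survives a thin-from-below to `target_ba`.

    Computed once over the whole treated population. Returns None when the
    stand is already at or below the target, so nothing is removed.
    """
    remaining = sum(basal_area(d) for d in diameters)
    if remaining <= target_ba:
        return None
    ordered = sorted(diameters)
    for i, d in enumerate(ordered):
        remaining -= basal_area(d)
        if remaining <= target_ba:
            # Stems up to and including index i are removed.
            return ordered[i + 1] if i + 1 < len(ordered) else math.inf
    return math.inf


def apply_cutoff(partition: List[float], cutoff: Optional[float]) -> List[float]:
    """Per-partition filter: needs only the precomputed cutoff, no global state."""
    if cutoff is None:
        return list(partition)
    return [d for d in partition if d >= cutoff]


partitions = [[10.0, 30.0], [20.0, 40.0]]
population = [d for part in partitions for d in part]
cutoff = from_below_cutoff(population, target_ba=0.2)
kept = [apply_cutoff(part, cutoff) for part in partitions]
print(cutoff, kept)  # 30.0 [[30.0], [40.0]]
```

Because the cutoff is computed over the pooled population first, each partition can be filtered independently and the result matches a sequential smallest-first removal over the whole stand (under the distinct-diameter assumption above).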

API (cache.py / router.py / schema.py)

/data/metadata and /data/{i} now read num_partitions / total_rows / columns from the dask DataFrame (per-file footers) instead of the aggregated _metadata, so counts stay correct after a partial rewrite leaves _metadata stale. Per-partition row-count list dropped from the metadata response (get a partition's count from /data/{i}); partition data served by index.

Testing

standgen unit (196) + API cache/schema (48), ruff clean.
Integration suites all green against real GCS/Firestore + a live API server: standgen 27/27 (incl. directional-cutoff, proportional full-rewrite, and spatially-scoped cases), API data-router 17/17, duplicate-router 12/12.

Benchmark

Adds services/api/benchmarks/bench_inplace_inventory.py — times in-place modifications end-to-end against the deployed pipeline (domain → PIM grid → PIM inventory → timed modify, cold rep discarded).

Baseline against the currently-deployed full-rewrite standgen (547k trees / 16 partitions, 6 warm reps):

| Scenario | Warm median |
|---|---|
| `scoped` (sub-region) | 2.24 s |
| `global` (all trees) | 2.24 s |

scoped == global is the full-rewrite signature — both rewrite all 16 partitions regardless of scope. After this PR is deployed, scoped should write only the partitions its region overlaps and drop below global (and global should fall too as the staging swap becomes an in-place overwrite). The after-deploy A/B will be added as a comment.
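The timing method (one discarded cold rep, then the median of the warm reps) can be sketched as below. This is a simplified illustration rather than the contents of bench_inplace_inventory.py; `warm_median` and the no-op operation are hypothetical, and the real benchmark times a modify request against the deployed pipeline.

```python
import statistics
import time
from typing import Callable, List


def warm_median(op: Callable[[], object], warm_reps: int = 6) -> float:
    """Run `op` once cold (discarded), then return the median warm time in seconds."""
    op()  # cold rep: absorbs cache warm-up, not timed
    timings: List[float] = []
    for _ in range(warm_reps):
        start = time.perf_counter()
        op()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


calls = []
median_s = warm_median(lambda: calls.append(1))
print(len(calls))  # 7 (1 cold + 6 warm)
```

Using the median rather than the mean keeps a single slow rep from dominating the reported number.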

In-place modifications and treatments rewrote the entire Parquet dataset via a delete + copy-back staging swap, regardless of how many rows the operation actually touched. Rewrite only the partitions whose content changes.

standgen storage:
- write_changed_partitions(id, transform): read every partition in one concurrent gcsfs cat, apply the per-partition transform, and write back only the partitions where not new.equals(old) in one concurrent pipe. Leaves _metadata untouched (the file set is unchanged; readers re-read each footer). Replaces save_parquet_replace.
- write_full_partitions(id, df): materialized full rewrite for the one case that re-partitions globally; drops the now-wrong _metadata / _common_metadata and stale part files so readers list fresh.

Modifications route every mod through write_changed_partitions. Treatments route by per-partition expressibility, not metric: diameter and directional basal-area thins (the latter reduced to a single precomputed diameter cutoff that reproduces the fastfuels-core thinner exactly) rewrite only changed partitions, including polygon-scoped thins; only proportional basal-area (random whole-stand removal) falls back to write_full_partitions.

API metadata/data endpoints read num_partitions / total_rows / columns from the dask DataFrame (per-file footers) instead of the aggregated _metadata, so counts stay correct after a partial rewrite leaves _metadata stale. Drops the per-partition row-count list from the /data metadata response.

Adds an end-to-end benchmark (benchmarks/bench_inplace_inventory.py) that times in-place modifications against the deployed API + standgen pipeline.

Closes #355

amarcozzi wants to merge 1 commit into main from 355-partial-rewrite-inventory-modifications
