Partial partition rewrite for in-place inventory modifications & treatments#368
Open
amarcozzi wants to merge 1 commit into
Open
Partial partition rewrite for in-place inventory modifications & treatments#368amarcozzi wants to merge 1 commit into
amarcozzi wants to merge 1 commit into
Conversation
In-place modifications and treatments rewrote the entire Parquet dataset via a delete + copy-back staging swap, regardless of how many rows the operation actually touched. Rewrite only the partitions whose content changes. standgen storage: - write_changed_partitions(id, transform): read every partition in one concurrent gcsfs cat, apply the per-partition transform, and write back only the partitions where not new.equals(old) in one concurrent pipe. Leaves _metadata untouched (the file set is unchanged; readers re-read each footer). Replaces save_parquet_replace. - write_full_partitions(id, df): materialized full rewrite for the one case that re-partitions globally; drops the now-wrong _metadata / _common_metadata and stale part files so readers list fresh. Modifications route every mod through write_changed_partitions. Treatments route by per-partition expressibility, not metric: diameter and directional basal-area thins (the latter reduced to a single precomputed diameter cutoff that reproduces the fastfuels-core thinner exactly) rewrite only changed partitions, including polygon-scoped thins; only proportional basal-area (random whole-stand removal) falls back to write_full_partitions. API metadata/data endpoints read num_partitions / total_rows / columns from the dask DataFrame (per-file footers) instead of the aggregated _metadata, so counts stay correct after a partial rewrite leaves _metadata stale. Drops the per-partition row-count list from the /data metadata response. Adds an end-to-end benchmark (benchmarks/bench_inplace_inventory.py) that times in-place modifications against the deployed API + standgen pipeline. Closes #355
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
In-place modifications and treatments rewrote the entire Parquet dataset on every call.
save_parquet_replacewrote the whole inventory to a{id}__revstaging prefix, deleted the live directory, copied staging back, and deleted staging — a many-GCS-op swap whose cost scaled with total inventory size, regardless of how many rows the operation actually touched. A modification scoped to one corner of the domain paid the same price as one that rewrote every tree.Closes #355.
Change
Rewrite only the partitions whose content actually changes.
standgen
storage.pywrite_changed_partitions(id, transform)— reads every partition in one concurrent gcsfscat, applies a per-partitionDataFrame -> DataFrametransform, and writes back (one concurrentpipe) only the partitions wherenot new.equals(old), under their existing names. A scoped op rewrites only the partitions it overlaps. Replacessave_parquet_replace. Leaves_metadatauntouched — the file set is unchanged and every reader re-reads each file's footer, so a stale aggregate never corrupts a read.write_full_partitions(id, df)— materialized full rewrite for the one case that re-partitions globally; drops the now-wrong_metadata/_common_metadataand any stale part files so readers list the directory fresh.Routing
write_changed_partitions.precompute_cutoffs→_directional_cutoff), then applied per-partition. Verified the cutoff filter reproduces the fastfuels-core thinner exactly (identical kept sets, both directions).write_full_partitions.API (
cache.py/router.py/schema.py)/data/metadataand/data/{i}now readnum_partitions/total_rows/columnsfrom the dask DataFrame (per-file footers) instead of the aggregated_metadata, so counts stay correct after a partial rewrite leaves_metadatastale. Per-partition row-count list dropped from the metadata response (get a partition's count from/data/{i}); partition data served by index.Testing
Benchmark
Adds
services/api/benchmarks/bench_inplace_inventory.py— times in-place modifications end-to-end against the deployed pipeline (domain → PIM grid → PIM inventory → timed modify, cold rep discarded).Baseline against the currently-deployed full-rewrite standgen (547k trees / 16 partitions, 6 warm reps):
scoped(sub-region)global(all trees)scoped == globalis the full-rewrite signature — both rewrite all 16 partitions regardless of scope. After this PR is deployed,scopedshould write only the partitions its region overlaps and drop belowglobal(andglobalshould fall too as the staging swap becomes an in-place overwrite). The after-deploy A/B will be added as a comment.