Vs30 refactor#45
Conversation
Evaluates approaches for testing grid/points pipeline consistency across the full NZ domain and all model versions. Settles on 3x3 local grids per test point with pre-computed deliberate + random point selection and a two-tier fast/slow test structure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Five tasks: generate test points fixture, register slow marker, add config-loading helper, rewrite test with full NZ/all-model coverage, and verify existing tests still pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…del versions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… empty, return [] immediately before any work is done
Only jaehwi_v1p0 sets fill_gaps: true; all other model configs and test fixtures set fill_gaps: false. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add fill_gaps to required fields in load_model_config - Add fill_gaps param to grid_pipeline, guard Stage 6 gap-fill block - Add fill_gaps param to points_pipeline, guard gap-fill block Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Pass fill_gaps from config in grid/points commands - Add --fill-gaps/--no-fill-gaps to grid_custom/points_custom - Add fill_gaps to run_points_pipeline helper - Update all test call sites to pass fill_gaps - Update gapfill-design.md to reflect config flag Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduce peak memory in fill_nodata_grid by restricting coordinate computation and KDTree construction to a dilated neighborhood around nodata gaps, rather than the full 161M-pixel grid. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
~25% of the grid is ocean nodata. Dilating the full nodata mask would expand from the entire coastline and defeat the optimization. Instead, exclude water (GID=0) and offshore pixels first, then dilate only around the small set of fillable on-land gaps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three tasks: verify baseline, add helper + import, rewrite fill_nodata_grid with classify-before-dilate algorithm. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r import Preparation for memory-optimized fill_nodata_grid. The helper computes pixel center coordinates in float32, and maximum_filter will be used for efficient separable dilation of the fillable mask.
Restrict coordinate computation and KDTree construction to a small neighborhood around fillable on-land gaps instead of the full grid. This reduces peak memory from 10+ GB to a few MB for typical inputs. Algorithm: exclude water (GID=0) in index space, run coastline check on the small candidate set, dilate only the fillable mask to define the donor search region, then build the KDTree from valid pixels in that neighborhood. Uses float32 coordinates for spatial lookup and indexes back into the original float64 arrays for donor values. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ntions Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request executes a comprehensive refactoring of the VS30 package, introducing a modular architecture, a Typer-based CLI, and a YAML-driven configuration system for model versions. Key enhancements include updated dependencies for Python 3.11+, a robust suite of benchmark and consistency tests, and a standardized coordinate system using NZTM2000. Feedback highlights several critical areas for optimization and correction: the MVN spatial adjustment in spatial.py should utilize a spatial index to resolve performance bottlenecks caused by high computational complexity; the Bayesian update logic in category.py must clip the posterior standard deviation to correctly maintain minimum uncertainty; the exponential correlation function should not enforce a minimum distance at zero to avoid incorrect variance calculations; and the gap-filling logic in pipeline.py should be batched to improve efficiency for large point sets.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop unused threadpoolctl, declare pyproj as a test dep (used in test_benchmarks), and ignore test_benchmarks under DEP001 since it's imported across test files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve the 19 non-qcore ty diagnostics on this branch:
- gapfill.create_local_grid_config: ``round()`` the float bounds back to
int so they match ``GridConfig`` annotations.
- spatial.prepare_observation_data: narrow ``slope_array`` /
``coast_dist_array`` inside the GEOLOGY branch (function-entry check
already raises if they're missing).
- pipeline.compute_component_grid: restructure the
``if/elif/else`` over ``posterior_df`` into ``if/else { if/else }`` so
the inner branches can narrow ``categorical_model_csv``.
- pipeline.grid_pipeline: introduce ``GridPipelineResult`` TypedDict so
each key has its precise value type, removing union noise at
call-sites (``fill_one_point_via_local_grid``, tests).
- pipeline.points_pipeline: convert the ``... if run_X else None``
conditional-expression assignments to ``if/else`` statements and add
``type guard`` asserts so ty can narrow downstream uses.
Standardise type-guard asserts (including pre-existing ones) on a
``# type guard: <runtime invariant>`` comment so readers know they
exist for the static checker.
The 2 remaining ``unresolved-import: qcore`` diagnostics are
environment-specific and persist in CI; they're unrelated to this fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI runs ``deptry -ddg test,types``, classifying the ``test`` extras group as dev. DEP004 then fires on any dev-classified package imported in tests/. pytest is already on the ignore list for the same reason. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rior Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the per-pixel O(N_obs) np.partition scan in select_observations_for_pixel with a per-pixel KDTree query. ObservationData carries the tree, built lazily in __post_init__ from locations. ~22x per-call speedup on the viktor 35,706-obs benchmark (microbenchmark); pinned raster/points benchmarks and grid/points consistency tests pass unchanged across all four model versions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…NTERMEDIATE_ARRAY_MEMORY_GB Preparation for batched-KDTree-query refactor which reuses the same memory budget for intp+float64 chunked query-result arrays. The constant no longer describes only boolean arrays; rename and update docstrings to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Batched KDTree query producing (N_pix, k) indices and distances matrices in one Cython call. Not yet wired in; Task 3 hoists selection out of the per-pixel update function and routes both adjustment loops through this helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hoist observation selection out of the per-pixel update and call the new select_observations_for_pixel_batch helper once per chunk in the grid loop and once total in the points loop. compute_spatial_adjustment_for_pixel now accepts obs_indices as a required parameter; max_dist_m and max_points are dropped from its signature. All pinned benchmark rasters and grid/points consistency tests pass unchanged across all four model versions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaced by select_observations_for_pixel_batch (Task 2) which both adjustment loops now call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Thread max_spatial_intermediate_array_memory_gb through compute_spatial_adjustment_on_grid so user-supplied caps reach the new chunk loop (previously stopped at find_affected_pixels). - Rename chunk_indices/chunk_distances to batch_indices/batch_distances in compute_spatial_pixel_adjustments and align compute_spatial_point_adjustments so both callers of select_observations_for_pixel_batch use the same variable names. - Restore a one-line comment near corr_zero explaining why it is slightly less than 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop defensive np.atleast_2d from select_observations_for_pixel_batch. - Expand the (N_pix, k) reshape comment to describe scipy's k=1 squeeze. - Encapsulate the dual coordinate-space dance: find_affected_pixels now returns (affected_flat_indices, affected_locs) and compute_spatial_pixel_adjustments takes them as parallel single-space inputs. Tests adapted. - Extract the per-slot byte size into BYTES_PER_KDTREE_QUERY_SLOT. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ption Update the stale flag name in Usage.md and broaden the description to cover both storage shapes the budget now caps (boolean bbox arrays plus KDTree query-result arrays). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Addressed the comments from the gemini-code-assist bot |
…_pixels Replaces the axis-aligned bbox check in find_affected_pixels with one tree.query(grid_locs, k=1, distance_upper_bound=max_dist_m) call, using the KDTree already built in ObservationData.__post_init__ for the MVN inner loop. Drops the now-dead max_spatial_intermediate_array_memory_gb and model_type parameters from the signature, deletes the only-caller grid_points_in_bbox helper, and refreshes the constant's docstring to describe its single remaining use site (the MVN chunk loop). Bumps MAX_SPATIAL_INTERMEDIATE_ARRAY_MEMORY_GB default from 1.0 to 4.0 to better fit modern workstation memory while users on smaller machines retain the CLI flag override. wiki/Usage.md updated to match. Benchmarks on the prefilter_perf branch show 6-13x speedup at low N_obs models and ~300-450x speedup on viktor_cpt_clustering. KDTree returns a strict subset of bbox's affected pixels; the ~0.3-1% bbox-corner pixels that disappear were short-circuiting in the MVN inner loop anyway with a tiny corr_zero stdv shrinkage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| - **Source**: `sites_load.load_mcgann_vs()` | ||
| - **Description**: Surface wave measurements from the Canterbury region | ||
| - **Count**: ~400 sites | ||
| - **Source label**: `mcgann` | ||
|
|
||
| #### 2. Wotherspoon et al. (2015) | ||
| - **Source**: `sites_load.load_wotherspoon_vs()` | ||
| - **Description**: Additional surface wave measurements from Canterbury | ||
| - **Count**: ~40 sites | ||
| - **Source label**: `wotherspoon` | ||
|
|
||
| #### 3. Kaiser et al. (2017) - Filtered | ||
| - **Source**: `sites_load.load_kaiseretal_vs()` |
There was a problem hiding this comment.
sites_load.py doesn't exist anymore? Not sure where the functions live now
There was a problem hiding this comment.
Good catch — this README hadn't had a careful pass before the PR, and following up on your comment surfaced enough staleness (and a few inaccuracies) that I rewrote the whole thing (ea74b28). sites_load.py was legacy and removed in the refactor, so those loaders aren't in the package anymore; the CSVs now ship pre-generated, with sites_load.py kept only as a provenance note for where the data came from.
Removes stale sites_load.py / dev/observations / config.yaml references (PR #45) and inaccurate dataset details. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-opened after history rewrite
This PR is a re-opened version of an earlier one. Between then and now, I used
git filter-repoto remove some unused benchmark files from the repo's history. The rewrite worked as intended but changed every commit's SHA, which orphaned the previous branch frommasterand made the old PR unmergeable, so I closed it and re-targeted the same work against the rewritten history. The content is unchanged; only the underlying SHAs have shifted.Important
If you have an existing local clone of this repository, delete it and re-clone from scratch before pulling these changes. Your local commits share no history with
originanymore -git pullon top of an old clone will fail or produce a broken state. A freshgit cloneis the safe move.Summary
This PR continues the Vs30 package refactor, addressing prior review feedback, adding two new bundled models and support for custom YAML configurations, and incorporating other improvements.
Addressed review feedback
Several review comments from the previous PR have all been addressed. Most do not warrant a specific call-out but some structural changes are worth flagging:
cli.py) is now just a CLI wrapper (without pipeline orchestration code)constants.py. Values that were previously declared in each of the bundled model config files are now defined once in code; the config files are correspondingly leaner.StrEnumrather than their config file name. The four valid names live asFixedModelVersionmembers invs30/constants.py, giving a single canonical source for the supported set.Additional development work
jaehwi_v1p0model (in addition to the existingmodified_foster_2019andviktor_cpt_clustering), along with the gap-filling functionality it requires.foster_2019_approx). Exact reproduction wasn't pursued because the published model has known flaws.Why points and grid are separate pipelines
A natural design question is why the package maintains two separate pipelines rather than collapsing them into one. Both pipelines produce identical results at every location (verified by the correspondence tests above); they are kept separate purely for performance. Each direction of unification has its own performance problem:
vs30 gridcall the points pipeline at each pixel. This loses the bulk array operations that the grid pipeline gets from GDAL: bulk raster reads, bulk sampling, bulk spatial adjustment. At national-grid sizes (millions of pixels) it would be impractically slow.vs30 pointsrun the grid pipeline and sample the desired sites from it. This would compute Vs30 across an entire grid regardless of how many sites are requested, making typical points-mode queries needlessly slow.