Skip to content

Vs30 refactor#45

Open
AndrewRidden-Harper wants to merge 455 commits into
masterfrom
vs30_refactor
Open

Vs30 refactor#45
AndrewRidden-Harper wants to merge 455 commits into
masterfrom
vs30_refactor

Conversation

@AndrewRidden-Harper

@AndrewRidden-Harper AndrewRidden-Harper commented May 13, 2026

Copy link
Copy Markdown
Contributor

Re-opened after history rewrite

This PR is a re-opened version of an earlier one. Between then and now, I used git filter-repo to remove some unused benchmark files from the repo's history. The rewrite worked as intended but changed every commit's SHA, which orphaned the previous branch from master and made the old PR unmergeable, so I closed it and re-targeted the same work against the rewritten history. The content is unchanged; only the underlying SHAs have shifted.

Important

If you have an existing local clone of this repository, delete it and re-clone from scratch before pulling these changes. Your local commits share no history with origin anymore - git pull on top of an old clone will fail or produce a broken state. A fresh git clone is the safe move.

Summary

This PR continues the Vs30 package refactor, addressing prior review feedback, adding two new bundled models and support for custom YAML configurations, and incorporating other improvements.

Addressed review feedback

Several review comments from the previous PR have all been addressed. Most do not warrant a specific call-out but some structural changes are worth flagging:

  • Removed CLI access to individual pipeline steps (interested users can still call the functions)
  • CLI module (cli.py) is now just a CLI wrapper (without pipeline orchestration code)
  • Shared parameters lifted from per-model configs into constants.py. Values that were previously declared in each of the bundled model config files are now defined once in code; the config files are correspondingly leaner.
  • Bundled model versions are specified by a StrEnum rather than their config file name. The four valid names live as FixedModelVersion members in vs30/constants.py, giving a single canonical source for the supported set.

Additional development work

  • Ensured exact correspondence between points and grid modes. New tests verify identical outputs at tricky locations like geological category boundaries, missing-value edges in the terrain raster, and similar edge cases.
  • Tightened grid-bounds semantics to ensure unambiguous pixel-edge interpretation and eliminate the potential for inconsistencies from GDAL tie-breaking behaviour.
  • Added the jaehwi_v1p0 model (in addition to the existing modified_foster_2019 and viktor_cpt_clustering), along with the gap-filling functionality it requires.
  • Added a near-reproduction of the Foster et al. (2019) model (foster_2019_approx). Exact reproduction wasn't pursued because the published model has known flaws.
  • Added support for custom YAML model configurations. Users can pass a YAML config path in place of a bundled model name to define their own model variants without modifying the package.
  • Added small end-to-end benchmark tests with reference outputs generated by the legacy code, to verify reproduction.
  • Removed the Python multiprocessing option. Testing across a wide range of scenarios showed it was consistently slower than the single-process path, which is already multi-threaded via the BLAS used by NumPy/SciPy.
  • Other performance touch-ups — einsum-based MVN distances, gapfill memory optimisation, Float32 coast-distance arrays, etc.
  • Added a brief GitHub wiki.

Why points and grid are separate pipelines

A natural design question is why the package maintains two separate pipelines rather than collapsing them into one. Both pipelines produce identical results at every location (verified by the correspondence tests above); they are kept separate purely for performance. Each direction of unification has its own performance problem:

  • Have vs30 grid call the points pipeline at each pixel. This loses the bulk array operations that the grid pipeline gets from GDAL: bulk raster reads, bulk sampling, bulk spatial adjustment. At national-grid sizes (millions of pixels) it would be impractically slow.
  • Have vs30 points run the grid pipeline and sample the desired sites from it. This would compute Vs30 across an entire grid regardless of how many sites are requested, making typical points-mode queries needlessly slow.

AndrewRidden-Harper and others added 30 commits April 1, 2026 15:04
Evaluates approaches for testing grid/points pipeline consistency
across the full NZ domain and all model versions. Settles on 3x3
local grids per test point with pre-computed deliberate + random
point selection and a two-tier fast/slow test structure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Five tasks: generate test points fixture, register slow marker,
add config-loading helper, rewrite test with full NZ/all-model
coverage, and verify existing tests still pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…del versions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… empty, return [] immediately before any work is done
Only jaehwi_v1p0 sets fill_gaps: true; all other model configs and test
fixtures set fill_gaps: false.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add fill_gaps to required fields in load_model_config
- Add fill_gaps param to grid_pipeline, guard Stage 6 gap-fill block
- Add fill_gaps param to points_pipeline, guard gap-fill block

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Pass fill_gaps from config in grid/points commands
- Add --fill-gaps/--no-fill-gaps to grid_custom/points_custom
- Add fill_gaps to run_points_pipeline helper
- Update all test call sites to pass fill_gaps
- Update gapfill-design.md to reflect config flag

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduce peak memory in fill_nodata_grid by restricting coordinate
computation and KDTree construction to a dilated neighborhood around
nodata gaps, rather than the full 161M-pixel grid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
~25% of the grid is ocean nodata. Dilating the full nodata mask would
expand from the entire coastline and defeat the optimization. Instead,
exclude water (GID=0) and offshore pixels first, then dilate only around
the small set of fillable on-land gaps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three tasks: verify baseline, add helper + import, rewrite
fill_nodata_grid with classify-before-dilate algorithm.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r import

Preparation for memory-optimized fill_nodata_grid. The helper computes
pixel center coordinates in float32, and maximum_filter will be used for
efficient separable dilation of the fillable mask.
Restrict coordinate computation and KDTree construction to a small
neighborhood around fillable on-land gaps instead of the full grid.
This reduces peak memory from 10+ GB to a few MB for typical inputs.

Algorithm: exclude water (GID=0) in index space, run coastline check
on the small candidate set, dilate only the fillable mask to define
the donor search region, then build the KDTree from valid pixels in
that neighborhood. Uses float32 coordinates for spatial lookup and
indexes back into the original float64 arrays for donor values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AndrewRidden-Harper and others added 3 commits May 12, 2026 15:11
…ntions

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request executes a comprehensive refactoring of the VS30 package, introducing a modular architecture, a Typer-based CLI, and a YAML-driven configuration system for model versions. Key enhancements include updated dependencies for Python 3.11+, a robust suite of benchmark and consistency tests, and a standardized coordinate system using NZTM2000. Feedback highlights several critical areas for optimization and correction: the MVN spatial adjustment in spatial.py should utilize a spatial index to resolve performance bottlenecks caused by high computational complexity; the Bayesian update logic in category.py must clip the posterior standard deviation to correctly maintain minimum uncertainty; the exponential correlation function should not enforce a minimum distance at zero to avoid incorrect variance calculations; and the gap-filling logic in pipeline.py should be batched to improve efficiency for large point sets.

Comment thread vs30/spatial.py Outdated
Comment thread vs30/category.py
Comment thread vs30/correlations.py
Comment thread vs30/pipeline.py
AndrewRidden-Harper and others added 14 commits May 13, 2026 12:59
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop unused threadpoolctl, declare pyproj as a test dep (used in
test_benchmarks), and ignore test_benchmarks under DEP001 since it's
imported across test files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve the 19 non-qcore ty diagnostics on this branch:

- gapfill.create_local_grid_config: ``round()`` the float bounds back to
  int so they match ``GridConfig`` annotations.
- spatial.prepare_observation_data: narrow ``slope_array`` /
  ``coast_dist_array`` inside the GEOLOGY branch (function-entry check
  already raises if they're missing).
- pipeline.compute_component_grid: restructure the
  ``if/elif/else`` over ``posterior_df`` into ``if/else { if/else }`` so
  the inner branches can narrow ``categorical_model_csv``.
- pipeline.grid_pipeline: introduce ``GridPipelineResult`` TypedDict so
  each key has its precise value type, removing union noise at
  call-sites (``fill_one_point_via_local_grid``, tests).
- pipeline.points_pipeline: convert the ``... if run_X else None``
  conditional-expression assignments to ``if/else`` statements and add
  ``type guard`` asserts so ty can narrow downstream uses.

Standardise type-guard asserts (including pre-existing ones) on a
``# type guard: <runtime invariant>`` comment so readers know they
exist for the static checker.

The 2 remaining ``unresolved-import: qcore`` diagnostics are
environment-specific and persist in CI; they're unrelated to this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI runs ``deptry -ddg test,types``, classifying the ``test`` extras
group as dev. DEP004 then fires on any dev-classified package imported
in tests/. pytest is already on the ignore list for the same reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rior

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the per-pixel O(N_obs) np.partition scan in
select_observations_for_pixel with a per-pixel KDTree query.
ObservationData carries the tree, built lazily in __post_init__ from
locations. ~22x per-call speedup on the viktor 35,706-obs benchmark
(microbenchmark); pinned raster/points benchmarks and grid/points
consistency tests pass unchanged across all four model versions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…NTERMEDIATE_ARRAY_MEMORY_GB

Preparation for batched-KDTree-query refactor which reuses the same memory
budget for intp+float64 chunked query-result arrays. The constant no longer
describes only boolean arrays; rename and update docstrings to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Batched KDTree query producing (N_pix, k) indices and distances matrices in
one Cython call. Not yet wired in; Task 3 hoists selection out of the
per-pixel update function and routes both adjustment loops through this
helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hoist observation selection out of the per-pixel update and call the new
select_observations_for_pixel_batch helper once per chunk in the grid loop
and once total in the points loop. compute_spatial_adjustment_for_pixel
now accepts obs_indices as a required parameter; max_dist_m and max_points
are dropped from its signature.

All pinned benchmark rasters and grid/points consistency tests pass
unchanged across all four model versions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaced by select_observations_for_pixel_batch (Task 2) which both
adjustment loops now call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Thread max_spatial_intermediate_array_memory_gb through
  compute_spatial_adjustment_on_grid so user-supplied caps reach the
  new chunk loop (previously stopped at find_affected_pixels).
- Rename chunk_indices/chunk_distances to batch_indices/batch_distances
  in compute_spatial_pixel_adjustments and align compute_spatial_point_adjustments
  so both callers of select_observations_for_pixel_batch use the same
  variable names.
- Restore a one-line comment near corr_zero explaining why it is slightly
  less than 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop defensive np.atleast_2d from select_observations_for_pixel_batch.
- Expand the (N_pix, k) reshape comment to describe scipy's k=1 squeeze.
- Encapsulate the dual coordinate-space dance: find_affected_pixels now
  returns (affected_flat_indices, affected_locs) and
  compute_spatial_pixel_adjustments takes them as parallel single-space
  inputs. Tests adapted.
- Extract the per-slot byte size into BYTES_PER_KDTREE_QUERY_SLOT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ption

Update the stale flag name in Usage.md and broaden the description to
cover both storage shapes the budget now caps (boolean bbox arrays plus
KDTree query-result arrays).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AndrewRidden-Harper

Copy link
Copy Markdown
Contributor Author

Addressed the comments from the gemini-code-assist bot

AndrewRidden-Harper and others added 3 commits May 15, 2026 10:14
…_pixels

Replaces the axis-aligned bbox check in find_affected_pixels with one
tree.query(grid_locs, k=1, distance_upper_bound=max_dist_m) call, using
the KDTree already built in ObservationData.__post_init__ for the MVN
inner loop. Drops the now-dead max_spatial_intermediate_array_memory_gb
and model_type parameters from the signature, deletes the only-caller
grid_points_in_bbox helper, and refreshes the constant's docstring to
describe its single remaining use site (the MVN chunk loop).

Bumps MAX_SPATIAL_INTERMEDIATE_ARRAY_MEMORY_GB default from 1.0 to 4.0
to better fit modern workstation memory while users on smaller machines
retain the CLI flag override. wiki/Usage.md updated to match.

Benchmarks on the prefilter_perf branch show 6-13x speedup at low N_obs
models and ~300-450x speedup on viktor_cpt_clustering. KDTree returns a
strict subset of bbox's affected pixels; the ~0.3-1% bbox-corner pixels
that disappear were short-circuiting in the MVN inner loop anyway with
a tiny corr_zero stdv shrinkage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread vs30/resources/observations/README.md Outdated
Comment on lines +14 to +26
- **Source**: `sites_load.load_mcgann_vs()`
- **Description**: Surface wave measurements from the Canterbury region
- **Count**: ~400 sites
- **Source label**: `mcgann`

#### 2. Wotherspoon et al. (2015)
- **Source**: `sites_load.load_wotherspoon_vs()`
- **Description**: Additional surface wave measurements from Canterbury
- **Count**: ~40 sites
- **Source label**: `wotherspoon`

#### 3. Kaiser et al. (2017) - Filtered
- **Source**: `sites_load.load_kaiseretal_vs()`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sites_load.py doesn't exist anymore? Not sure where the functions live now

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — this README hadn't had a careful pass before the PR, and following up on your comment surfaced enough staleness (and a few inaccuracies) that I rewrote the whole thing (ea74b28). sites_load.py was legacy and removed in the refactor, so those loaders aren't in the package anymore; the CSVs now ship pre-generated, with sites_load.py kept only as a provenance note for where the data came from.

Removes stale sites_load.py / dev/observations / config.yaml references (PR #45) and inaccurate dataset details.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants