Vs30 refactor by AndrewRidden-Harper · Pull Request #45 · ucgmsim/Vs30

AndrewRidden-Harper · 2026-05-13T00:36:17Z

Re-opened after history rewrite

This PR is a re-opened version of an earlier one. Between then and now, I used git filter-repo to remove some unused benchmark files from the repo's history. The rewrite worked as intended but changed every commit's SHA, which orphaned the previous branch from master and made the old PR unmergeable, so I closed it and re-targeted the same work against the rewritten history. The content is unchanged; only the underlying SHAs have shifted.

Important

If you have an existing local clone of this repository, delete it and re-clone from scratch before pulling these changes. Your local commits share no history with origin anymore - git pull on top of an old clone will fail or produce a broken state. A fresh git clone is the safe move.

Summary

This PR continues the Vs30 package refactor, addressing prior review feedback, adding two new bundled models and support for custom YAML configurations, and incorporating other improvements.

Addressed review feedback

Several review comments from the previous PR have all been addressed. Most do not warrant a specific call-out but some structural changes are worth flagging:

Removed CLI access to individual pipeline steps (interested users can still call the functions)
CLI module (cli.py) is now just a CLI wrapper (without pipeline orchestration code)
Shared parameters lifted from per-model configs into constants.py. Values that were previously declared in each of the bundled model config files are now defined once in code; the config files are correspondingly leaner.
Bundled model versions are specified by a StrEnum rather than their config file name. The four valid names live as FixedModelVersion members in vs30/constants.py, giving a single canonical source for the supported set.

Additional development work

Ensured exact correspondence between points and grid modes. New tests verify identical outputs at tricky locations like geological category boundaries, missing-value edges in the terrain raster, and similar edge cases.
Tightened grid-bounds semantics to ensure unambiguous pixel-edge interpretation and eliminate the potential for inconsistencies from GDAL tie-breaking behaviour.
Added the jaehwi_v1p0 model (in addition to the existing modified_foster_2019 and viktor_cpt_clustering), along with the gap-filling functionality it requires.
Added a near-reproduction of the Foster et al. (2019) model (foster_2019_approx). Exact reproduction wasn't pursued because the published model has known flaws.
Added support for custom YAML model configurations. Users can pass a YAML config path in place of a bundled model name to define their own model variants without modifying the package.
Added small end-to-end benchmark tests with reference outputs generated by the legacy code, to verify reproduction.
Removed the Python multiprocessing option. Testing across a wide range of scenarios showed it was consistently slower than the single-process path, which is already multi-threaded via the BLAS used by NumPy/SciPy.
Other performance touch-ups — einsum-based MVN distances, gapfill memory optimisation, Float32 coast-distance arrays, etc.
Added a brief GitHub wiki.

Why points and grid are separate pipelines

A natural design question is why the package maintains two separate pipelines rather than collapsing them into one. Both pipelines produce identical results at every location (verified by the correspondence tests above); they are kept separate purely for performance. Each direction of unification has its own performance problem:

Have vs30 grid call the points pipeline at each pixel. This loses the bulk array operations that the grid pipeline gets from GDAL: bulk raster reads, bulk sampling, bulk spatial adjustment. At national-grid sizes (millions of pixels) it would be impractically slow.
Have vs30 points run the grid pipeline and sample the desired sites from it. This would compute Vs30 across an entire grid regardless of how many sites are requested, making typical points-mode queries needlessly slow.

Evaluates approaches for testing grid/points pipeline consistency across the full NZ domain and all model versions. Settles on 3x3 local grids per test point with pre-computed deliberate + random point selection and a two-tier fast/slow test structure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Five tasks: generate test points fixture, register slow marker, add config-loading helper, rewrite test with full NZ/all-model coverage, and verify existing tests still pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…del versions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… empty, return [] immediately before any work is done

Only jaehwi_v1p0 sets fill_gaps: true; all other model configs and test fixtures set fill_gaps: false. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add fill_gaps to required fields in load_model_config - Add fill_gaps param to grid_pipeline, guard Stage 6 gap-fill block - Add fill_gaps param to points_pipeline, guard gap-fill block Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Pass fill_gaps from config in grid/points commands - Add --fill-gaps/--no-fill-gaps to grid_custom/points_custom - Add fill_gaps to run_points_pipeline helper - Update all test call sites to pass fill_gaps - Update gapfill-design.md to reflect config flag Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reduce peak memory in fill_nodata_grid by restricting coordinate computation and KDTree construction to a dilated neighborhood around nodata gaps, rather than the full 161M-pixel grid. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

~25% of the grid is ocean nodata. Dilating the full nodata mask would expand from the entire coastline and defeat the optimization. Instead, exclude water (GID=0) and offshore pixels first, then dilate only around the small set of fillable on-land gaps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three tasks: verify baseline, add helper + import, rewrite fill_nodata_grid with classify-before-dilate algorithm. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…r import Preparation for memory-optimized fill_nodata_grid. The helper computes pixel center coordinates in float32, and maximum_filter will be used for efficient separable dilation of the fillable mask.

Restrict coordinate computation and KDTree construction to a small neighborhood around fillable on-land gaps instead of the full grid. This reduces peak memory from 10+ GB to a few MB for typical inputs. Algorithm: exclude water (GID=0) in index space, run coastline check on the small candidate set, dilate only the fillable mask to define the donor search region, then build the KDTree from valid pixels in that neighborhood. Uses float32 coordinates for spatial lookup and indexes back into the original float64 arrays for donor values. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ntions Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request executes a comprehensive refactoring of the VS30 package, introducing a modular architecture, a Typer-based CLI, and a YAML-driven configuration system for model versions. Key enhancements include updated dependencies for Python 3.11+, a robust suite of benchmark and consistency tests, and a standardized coordinate system using NZTM2000. Feedback highlights several critical areas for optimization and correction: the MVN spatial adjustment in spatial.py should utilize a spatial index to resolve performance bottlenecks caused by high computational complexity; the Bayesian update logic in category.py must clip the posterior standard deviation to correctly maintain minimum uncertainty; the exponential correlation function should not enforce a minimum distance at zero to avoid incorrect variance calculations; and the gap-filling logic in pipeline.py should be batched to improve efficiency for large point sets.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop unused threadpoolctl, declare pyproj as a test dep (used in test_benchmarks), and ignore test_benchmarks under DEP001 since it's imported across test files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolve the 19 non-qcore ty diagnostics on this branch: - gapfill.create_local_grid_config: ``round()`` the float bounds back to int so they match ``GridConfig`` annotations. - spatial.prepare_observation_data: narrow ``slope_array`` / ``coast_dist_array`` inside the GEOLOGY branch (function-entry check already raises if they're missing). - pipeline.compute_component_grid: restructure the ``if/elif/else`` over ``posterior_df`` into ``if/else { if/else }`` so the inner branches can narrow ``categorical_model_csv``. - pipeline.grid_pipeline: introduce ``GridPipelineResult`` TypedDict so each key has its precise value type, removing union noise at call-sites (``fill_one_point_via_local_grid``, tests). - pipeline.points_pipeline: convert the ``... if run_X else None`` conditional-expression assignments to ``if/else`` statements and add ``type guard`` asserts so ty can narrow downstream uses. Standardise type-guard asserts (including pre-existing ones) on a ``# type guard: <runtime invariant>`` comment so readers know they exist for the static checker. The 2 remaining ``unresolved-import: qcore`` diagnostics are environment-specific and persist in CI; they're unrelated to this fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI runs ``deptry -ddg test,types``, classifying the ``test`` extras group as dev. DEP004 then fires on any dev-classified package imported in tests/. pytest is already on the ignore list for the same reason. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rior Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the per-pixel O(N_obs) np.partition scan in select_observations_for_pixel with a per-pixel KDTree query. ObservationData carries the tree, built lazily in __post_init__ from locations. ~22x per-call speedup on the viktor 35,706-obs benchmark (microbenchmark); pinned raster/points benchmarks and grid/points consistency tests pass unchanged across all four model versions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…NTERMEDIATE_ARRAY_MEMORY_GB Preparation for batched-KDTree-query refactor which reuses the same memory budget for intp+float64 chunked query-result arrays. The constant no longer describes only boolean arrays; rename and update docstrings to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Batched KDTree query producing (N_pix, k) indices and distances matrices in one Cython call. Not yet wired in; Task 3 hoists selection out of the per-pixel update function and routes both adjustment loops through this helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Hoist observation selection out of the per-pixel update and call the new select_observations_for_pixel_batch helper once per chunk in the grid loop and once total in the points loop. compute_spatial_adjustment_for_pixel now accepts obs_indices as a required parameter; max_dist_m and max_points are dropped from its signature. All pinned benchmark rasters and grid/points consistency tests pass unchanged across all four model versions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaced by select_observations_for_pixel_batch (Task 2) which both adjustment loops now call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Thread max_spatial_intermediate_array_memory_gb through compute_spatial_adjustment_on_grid so user-supplied caps reach the new chunk loop (previously stopped at find_affected_pixels). - Rename chunk_indices/chunk_distances to batch_indices/batch_distances in compute_spatial_pixel_adjustments and align compute_spatial_point_adjustments so both callers of select_observations_for_pixel_batch use the same variable names. - Restore a one-line comment near corr_zero explaining why it is slightly less than 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Drop defensive np.atleast_2d from select_observations_for_pixel_batch. - Expand the (N_pix, k) reshape comment to describe scipy's k=1 squeeze. - Encapsulate the dual coordinate-space dance: find_affected_pixels now returns (affected_flat_indices, affected_locs) and compute_spatial_pixel_adjustments takes them as parallel single-space inputs. Tests adapted. - Extract the per-slot byte size into BYTES_PER_KDTREE_QUERY_SLOT. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ption Update the stale flag name in Usage.md and broaden the description to cover both storage shapes the budget now caps (boolean bbox arrays plus KDTree query-result arrays). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AndrewRidden-Harper · 2026-05-14T02:41:26Z

Addressed the comments from the gemini-code-assist bot

…_pixels Replaces the axis-aligned bbox check in find_affected_pixels with one tree.query(grid_locs, k=1, distance_upper_bound=max_dist_m) call, using the KDTree already built in ObservationData.__post_init__ for the MVN inner loop. Drops the now-dead max_spatial_intermediate_array_memory_gb and model_type parameters from the signature, deletes the only-caller grid_points_in_bbox helper, and refreshes the constant's docstring to describe its single remaining use site (the MVN chunk loop). Bumps MAX_SPATIAL_INTERMEDIATE_ARRAY_MEMORY_GB default from 1.0 to 4.0 to better fit modern workstation memory while users on smaller machines retain the CLI flag override. wiki/Usage.md updated to match. Benchmarks on the prefilter_perf branch show 6-13x speedup at low N_obs models and ~300-450x speedup on viktor_cpt_clustering. KDTree returns a strict subset of bbox's affected pixels; the ~0.3-1% bbox-corner pixels that disappear were short-circuiting in the MVN inner loop anyway with a tiny corr_zero stdv shrinkage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

joelridden · 2026-05-18T03:29:10Z

+- **Source**: `sites_load.load_mcgann_vs()`
+- **Description**: Surface wave measurements from the Canterbury region
+- **Count**: ~400 sites
+- **Source label**: `mcgann`
+
+#### 2. Wotherspoon et al. (2015)
+- **Source**: `sites_load.load_wotherspoon_vs()`
+- **Description**: Additional surface wave measurements from Canterbury
+- **Count**: ~40 sites
+- **Source label**: `wotherspoon`
+
+#### 3. Kaiser et al. (2017) - Filtered
+- **Source**: `sites_load.load_kaiseretal_vs()`


sites_load.py doesn't exist anymore? Not sure where the functions live now

Good catch — this README hadn't had a careful pass before the PR, and following up on your comment surfaced enough staleness (and a few inaccuracies) that I rewrote the whole thing (ea74b28). sites_load.py was legacy and removed in the refactor, so those loaders aren't in the package anymore; the CSVs now ship pre-generated, with sites_load.py kept only as a provenance note for where the data came from.

Removes stale sites_load.py / dev/observations / config.yaml references (PR #45) and inaccurate dataset details. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AndrewRidden-Harper and others added 30 commits April 1, 2026 15:04

gave tests more descriptive names

840f2a6

del docs for working

0b3b94d

add consistency test point generation script and fixture CSV

c23606e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix docstring and hoist coast_boundary in point generation script

2567214

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

register slow pytest marker in pyproject.toml

33b83b8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

add load_fixed_model_config helper to test conftest

44c0692

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rewrite grid/points consistency test with full NZ coverage and all mo…

44c3427

…del versions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

handle asymmetric NaN in consistency test failure messages

ecac661

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ruff linter fix

88550d7

numpydocs linter fix

4067843

ruff linter fix

3c02fb7

numpydoc linter fix

776ad48

ty type check fix

367b2fc

ty type check fix

a480f46

replace np.isclose with pytest.approx in tests

0105230

fix Bayesian update col type bug

cfdc8f6

skip slow tests by default

445bacf

An early-exit guard was added at line 598 if affected_flat_indices is…

6730f10

… empty, return [] immediately before any work is done

fix FutureWarning (wrong dtype)

f60131b

del unused benchmarks

2be379b

feat: add fill_gaps boolean to all model and test YAML configs

0d1651b

Only jaehwi_v1p0 sets fill_gaps: true; all other model configs and test fixtures set fill_gaps: false. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: add implementation plan for gapfill memory optimization

913f575

Three tasks: verify baseline, add helper + import, rewrite fill_nodata_grid with classify-before-dilate algorithm. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

refactor(gapfill): add _pixel_coords_float32 helper and maximum_filte…

2e99397

…r import Preparation for memory-optimized fill_nodata_grid. The helper computes pixel center coordinates in float32, and maximum_filter will be used for efficient separable dilation of the fillable mask.

AndrewRidden-Harper and others added 3 commits May 12, 2026 15:11

refactor(tests): trim docstrings, drop low-value tests, and fix conve…

662ebf8

…ntions Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(wiki): rewrite Home and Usage for CLI newcomers

3257dc3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

refactor: drop dev/ scripts (preserved on dev_scripts_archive)

5c4fcad

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

Comment thread vs30/spatial.py Outdated

Comment thread vs30/category.py

Comment thread vs30/correlations.py

Comment thread vs30/pipeline.py

AndrewRidden-Harper and others added 14 commits May 13, 2026 12:59

refactor(gapfill): tidy docstrings and comments

61c1a69

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

docs: satisfy numpydoc-lint in constants/raster/spatial

ae3e17d

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(constants): clarify MIN_SIGMA is applied to the prior, not poste…

9db6efe

…rior Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

refactor(spatial): remove per-pixel select_observations_for_pixel

076e451

Replaced by select_observations_for_pixel_batch (Task 2) which both adjustment loops now call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AndrewRidden-Harper requested review from claudio525, joelridden, lispandfound and sungeunbae May 14, 2026 02:41

AndrewRidden-Harper self-assigned this May 14, 2026

AndrewRidden-Harper requested a review from elshawy May 14, 2026 21:39

AndrewRidden-Harper and others added 3 commits May 15, 2026 10:14

docs: minor correction

5678fd5

ruff fix: del unused import

65dcba9

joelridden reviewed Jun 2, 2026

View reviewed changes

docs(observations): rewrite README as a concise high-level overview

ea74b28

Removes stale sites_load.py / dev/observations / config.yaml references (PR #45) and inaccurate dataset details. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vs30 refactor#45

Vs30 refactor#45
AndrewRidden-Harper wants to merge 455 commits into
masterfrom
vs30_refactor

AndrewRidden-Harper commented May 13, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AndrewRidden-Harper commented May 14, 2026

Uh oh!

joelridden May 18, 2026

Uh oh!

AndrewRidden-Harper Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AndrewRidden-Harper commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-opened after history rewrite

Summary

Addressed review feedback

Additional development work

Why points and grid are separate pipelines

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AndrewRidden-Harper commented May 14, 2026

Uh oh!

joelridden May 18, 2026

Choose a reason for hiding this comment

Uh oh!

AndrewRidden-Harper Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AndrewRidden-Harper commented May 13, 2026 •

edited

Loading