ATL03 template + region benchmark + hierarchical-read wiring (Refs #30, #43)#47
Conversation
|
🤖 from Claude Phase 1 pushed (
Gates: Next: Phase 2 — Generated by Claude Code |
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Phase 1 review (commit fb8ef18) against issue #30 acceptance criteria and CLAUDE.md. The _quality_mask helper + op wiring is clean and the ATL06 default is preserved exactly — but the ATL03 template as written will crash at runtime on real data. One blocker, plus a few smaller items.
🔴 Blocker — signal_conf_ph is 2-D; the template will raise IndexError on real ATL03 granules
/{group}/heights/signal_conf_ph in ATL03 is shaped (n_photons, 5) — one column per surface type (land/ocean/sea-ice/land-ice/inland-water). The yaml comment acknowledges this ("this template reads the land surface_type column (index 0)"), but the code never selects column 0. In _read_group:
q_flag = data[qf_path][mask_sliced] # 2-D: (n_rows, 5)
quality_mask = _quality_mask(q_flag, quality_filter) # returns 2-D mask (n_rows, 5)
...
values = values[quality_mask] # values is 1-D h_ph -> IndexErrorReproduced directly:
2D mask shape: (5, 5)
FAIL applying 2D mask to 1D values: IndexError too many indices for array: array is 1-dimensional, but 2 were indexed
So default_config("atl03") loads and validates fine, but the first granule read fails. Issue #30(b) specifies signal_conf_ph != 0, which for a 2-D column requires picking the surface-type index. Options:
- Add a
column:/index:key toquality_filter(e.g.index: 0) and sliceq_flag[:, index]in_read_groupbefore_quality_mask. Smallest, explicit, testable. - Hardcode column 0 when the flag array is 2-D (fragile; surface type 0 = land isn't always what an ice user wants).
Either way this needs a test that feeds a 2-D q_flag through the read path — see coverage gap below.
🟠 mean / median / var propagate NaN — confirm that's intended for h_ph
The template uses plain mean/median/var, not the NaN-aware family (nanmean/…). A single NaN/fill h_ph value in a cell nukes that cell's h_mean/h_median/h_variance to NaN (verified: np.mean([1,2,nan]) -> nan). For ATL06 that's the documented contract (fill values are real), but ATL03 photon heights are a different product. If geolocated h_ph can carry invalid/fill values, these should be nanmean/nanmedian/nanvar. If h_ph is guaranteed clean post-confidence-filter, fine as-is — but please state which, since it's a silent correctness lever.
Note: h_variance is _kernel_able (var→variance kernel), so on the experimental arrow-kernel path it's computed by pyarrow and agrees only within KERNEL_RTOL, not byte-identical — consistent with the documented dual contract. mean/median are not kernel-able (no "mean" key in _KERNEL_FUNCS, median absent), so they degrade gracefully to the numpy fallback. That part is correct.
🟡 validate_config doesn't validate quality_filter.op
A typo'd op (e.g. op: neq) passes validate_config and only blows up at runtime via _quality_mask's ValueError, deep inside a Lambda worker. Cheap to validate up front alongside the other data_source checks. Minor, but it's the kind of config error you want to catch at load time.
🟡 Test coverage gap
TestQualityMask covers the helper in isolation (good — eq default, ne, unknown-op raise), and TestATL03Template asserts the yaml's shape. But nothing exercises:
- a 2-D
q_flagthrough_quality_mask/_read_group(the blocker above would have been caught), - the
op: nemask actually wired through_read_groupend-to-end.
CLAUDE.md §4 ("every behavioral change needs tests"): the op addition's integration into _read_group is untested. Adding the 2-D fixture test would cover both.
Non-blocking / confirmations
- ✅ ATL06 regression: none.
atl06.yamlhasvalue: 0with noop, so_quality_maskdefaults toeq— identical to the oldq_flag == value. ATL06 outputs stay byte-for-byte. - ✅
default_config("atl03")loads + validates; all six scalar functions resolve. Matches #30(a) scalar-only and #29 deferral. - ✅ Rectilinear grid only, no HEALPix — matches #30(c)/mortie #35.
- ✅ The 8 new tests pass;
ruff checkclean on the Python files. - ⚪
ruff format --checkreports the two test files +processing.pywould reformat, but the diffs are pre-existing (fixture/blank-line spacing,_read_groupsignature wrap) — not introduced here, so flagging rather than attributing.pre-commit/§4 local-green will trip on it regardless; worth a separate cleanup, not this PR's job.
Bottom line: the helper/op work is solid and ATL06-safe, but the ATL03 template can't actually run until the 2-D signal_conf_ph column-selection is handled. I'd treat that as a phase-1 fix before advancing.
Generated by Claude Code
|
🤖 from Claude Folded the Phase 1 self-review's diff-scoped findings (commits
Two review findings I'm leaving standing for your call rather than acting on:
Gates after fold: Generated by Claude Code |
|
🤖 from Claude Phase 2 pushed (
Gates: Generated by Claude Code |
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Fresh-context adversarial review of the Phase 2 diff (9956c0a + 4427b7f) against issue #30 acceptance and CLAUDE.md. Overall this is in good shape — the spec is met and I verified the wiring against the real signatures. No blockers. Findings below, ordered by importance.
Verified working (not just claimed):
- Imports and constructs with no creds touched — exec'd the module standalone;
from zagg.runner import agg/from zagg.config import ...are import-time-safe, and all catalog/S3 calls live insiderun_onebehind the__main__guard. ✓ (#30's hard requirement) - All four regions measure ~10.0 km × 10.0 km and sit at the right places: neon_maryland (-76.56, 38.89, matches the
make_shardmapCLI example bbox), russell_glacier (-50.0, 67.09, on the ice E of Kangerlussuaq), bahamas (-76.0, 24.0), antarctica_88s (0, -88). The1/cos(lat)lon scaling holds E-W extent at high latitude (Russell/88S both come out ~10 km). ✓ - Catalog→agg wiring is correct:
Query("ATL03", version, start, end, region=bbox)matches the dataclass field order(short_name, version, start_date, end_date, region, provider);make_shardmap(query, grid).to_json(path)andagg(cfg, catalog=, store=, handoff=, max_cells=, max_workers=, overwrite=)all match signatures (agg'sconfigis positional, rest keyword-only). ✓ _store_scalarsresolvesfrom_config(cfg).group_pathand readsget_data_vars(cfg)=[count, h_min, h_max, h_mean, h_median, h_variance]— the correct group and var set; x/y coords are excluded (identical by construction, so no parity impact). ✓- Parity is sound and NaN-aware:
np.array_equal(x, y, equal_nan=np.issubdtype(x.dtype, np.floating))correctly enablesequal_nanonly for floats (passingequal_nan=Trueon the int32countwould raise). Both carriers load the same catalog JSON andmax_cellstruncates deterministically to the first N shards, so both runs aggregate the identical cell set. ✓ signal_conf_ph != 0confidence filter + rectilinear grid come from the shippedatl03.yamltemplate (not hand-rolled here), as the spec intends. ✓- ruff
check+format --checkclean; both newTestHandoffPassthroughtests pass. ✓
1. (minor, correctness) handoff is silently dropped on the lambda backend. In runner.agg, _run_lambda(...) is called without handoff=, so agg(backend="lambda", handoff="arrow") runs pandas with no error. The docstring does say "Only honored by the local backend," and the benchmark only uses local, so this is acceptable for a benchmarking flag — but a silent no-op invites a future foot-gun. Options: (a) leave as-is (documented); (b) raise ValueError if handoff != "pandas" and backend == "lambda"; (c) add a one-line note in the PR body that lambda-handoff is out of scope. I'd lean (a)+(c).
2. (minor, tidiness) Region.bbox leaks np.float64 into config/Query/JSON. _box uses np.cos/np.radians, so lon_min/lon_max are numpy scalars; cfg.output["grid"]["bounds"] and the Query(region=...) bbox then carry np.float64 (confirmed at runtime). It works today, but to_json and any strict bounds validation are happier with plain floats. Cheap fix: cast in _box (float(lon - half_lon), …) or in the bbox property.
3. (minor, redundancy) double-copy in _region_config. default_config("atl03") already returns a freshly-parsed config (re-reads the YAML each call), then copy.deepcopy(cfg) copies it again. The deepcopy is harmless but redundant given each call is fresh; the copy import exists only for this. Not load-bearing — drop it or keep for defensiveness.
4. (test coverage) passthrough tests cover _process_and_write, not _run_local. TestHandoffPassthrough asserts _process_and_write forwards handoff and defaults to pandas — good, and the default test confirms the ATL06/pandas path is unchanged. But the _run_local → _process_and_write hop (the driver=driver, handoff=handoff kwargs in the executor submit) isn't directly exercised. Low risk since it's a literal passthrough, but a one-test extension monkeypatching _process_and_write to capture the kwarg from a _run_local call would close the gap. Optional.
5. (spec / RSS caveat — please confirm) peak RSS is process-cumulative, not per-run. resource.getrusage(RUSAGE_SELF).ru_maxrss is a high-water mark for the whole process, so it never decreases across (region × window × handoff) iterations — the arrow run in a pair, and every later region, inherits the max of everything before it. The KB→MB conversion is right (/1e3 Linux, /1e6 darwin), but the recorded peak_rss_mb is not the memory of that run. Issue #30 asks for per-(region×window) peak RSS for each carrier. Options: (a) acknowledge the caveat in a comment/PR-body and keep the simple metric; (b) fork each run into a subprocess and read its ru_maxrss (true per-run peak); (c) sample RSS via a thread during the run for a per-run delta. (a) is probably fine for a relative benchmark but the limitation should be stated so a reader doesn't over-trust the column.
6. (nit) --out appends (open(..., "a")) with no run delimiter/header separation, so reruns accumulate stacked tables. Fine if intentional; a timestamp separator would help. Optional.
Nothing here blocks advancing. Items 1, 2, and the #5 caveat note are the ones I'd fold into the next phase commit; 3, 4, 6 are taste.
Generated by Claude Code
this should actually be If this doesn't cleanly resolve, let me know in your comments, and just disable the
this was too restrictive; we only have 10 granules for the whole year over the AOI. We should do 1y (2025) and all years (2018 - Jan 2026 ; we can clip 2026 so it stays stable on future runs and doesn't add another pass while we're tracking benchmark drift).
what does ATL06 path unchanged refer to? We want ATL06 for this.
see first comment above -- we want to filter our TEP events only, across all surface types, and keep everything else
sure, fine for now |
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Fresh-context adversarial self-review of this run's diff (cae77fa..771870e) addressing @espg's review feedback.
Item 1 — TEP filter (_quality_mask + atl03.yaml): correct. The change drops the single-surface-type column: 0 keying and uses value: -2, op: ne reduced across all surface-type columns via match.any(axis=1). I verified the reduction semantics: (q != -2).any(axis=1) ("keep if any column is non-TEP") is provably equivalent by De Morgan to ~(q == -2).all(axis=1) ("drop only when every column is -2"). Since genuine TEP photons are flagged -2 in all 5 surface-type columns (per the ATL03 v007 data dictionary), this drops exactly the TEP photons and keeps everything else — matching @espg's "filter TEP events only, across all surface types, and keep everything else." A photon that is -2 in only some columns is kept, which is correct (it is not a TEP photon). The single-column selector path is preserved and still tested for backward compatibility (ATL06 eq path unchanged).
Item 2 — benchmark windows: correct and matches the spec exactly. WINDOWS = {"1y": ("2025-01-01","2026-01-01"), "all": ("2018-01-01","2026-01-01")}. The all upper bound is a static literal clipped at 2026-01-01, so future runs stay stable and don't drift as new granules land. The old --end-date/timedelta/days machinery is fully removed with no dangling references; run_one, parse_args, and main are consistent.
Folded this run: the review caught one stale docstring example (!= 0 → != -2) in _quality_mask's op-list — fixed in 771870e.
No blockers, no majors. Gates green: ruff check src tests benchmarks clean, pytest 296 passed / 1 skipped.
Inline notes below.
Generated by Claude Code
|
🤖 from Claude Addressed two of the three review items; the third needs your call (a clarifying question below). Pushed 1.
2. Benchmark windows — done. WINDOWS = {
"1y": ("2025-01-01", "2026-01-01"), # calendar 2025
"all": ("2018-01-01", "2026-01-01"), # full mission, upper bound clipped
}The 3. "ATL06 path unchanged" — clarifying question (blocking item 3). "ATL06 path unchanged" in my Phase 2 note meant only that threading the
Which do you want? I'll implement A or C as the next phase; B means item 3 is already satisfied. Gates (local): (The 1 skip is the spherely-fork backend; pre-existing. Setting Generated by Claude Code |
|
Keep ATL03 as the benchmarked product... was a typo, meant to write "We want ATL03 for this" |
|
🤖 from Claude Got it — keeping ATL03 as the benchmarked product for CI is green (test 3.12/3.13, ruff, build x86_64 + arm64); the benchmark itself is Generated by Claude Code |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01MznHvCXwruEszMEffRTMyw
|
🤖 from Claude Honest answer, after digging through the #43/#49 code on Phase A (structured Phase B (multi-level Phase C (
So you're right — without Phase B+C, an ATL03 region run on Lambda will keep hitting the 245 MB/beam coord-read floor, before anything I do at the aggregation handoff matters. Concrete fix is two pieces of work, neither tiny:
Scope question for you — three options, want your call:
My recommendation is (II). The Setting Generated by Claude Code |
|
Option 1 / I : Expand the PR and do (1) + (2) here as phases 5 and 6 before this PR merges. |
|
🤖 from Claude Acknowledged — going with (I), folding (1) and (2) into this PR as phases 5 and 6. Updated the PR title + body to reflect the expanded scope; the phases checklist now carries 5 and 6 as the remaining items, and the original "Refs #30" is joined by "Refs #43" since phase 5 is the missing #49 Phase C wiring. Starting on phase 5 ( Generated by Claude Code |
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Phase 5 wires read_plan into _read_group via a new _planned_read_group dispatch, plus a half-open-vs-inclusive fix in read_plan.py. Confirmed correct on the load-bearing pieces:
h5corohalf-open contract is real.h5dataset.py:182raisesFatalError("must provide as list of ranges [x,y)"), and the boundary check ath5dataset.py:188uses> self.meta.dimensions[d](not>=), so[(0, n_base)]is in-bounds whenn_base == dim. The previous[(0, n_base - 1)]was wrong; the tests passed only becauseread_fncallbacks diddata[lo:hi+1]. Fix is correct and the test updates mirror the new semantics.readDatasetsaccepts bare strings.h5coro.py:121–127wraps astrelement as{"dataset": dataset, "hyperslice": []}, so the no-hyperslice branch in_read_fn(h5obj.readDatasets([path])[path]) is fine.global_idx/ cross-level alignment is sound.execute_read_planconcatenatesplan.chunk_listsinbase_slicesorder, solats[i]is base indexglobal_idx[i];cross_full[global_idx]is therefore lat-aligned, and[mask_spatial]reduces it to the same length as the per-base-filterkeep_mask. AND lines up.
What I flagged (all inline) — none block the phase, but two are real correctness/perf hazards on real-world inputs:
shard_footprint(...).boundsdegenerates to a globe-spanning bbox for HEALPix cells that genuinely cross the antimeridian (polar caps especially) — silently neutralizes the optimization on those shards. Worth a guard or TODO.n_base = ibeg_arr[-1] - index_base + cnt_arr[-1]is a stricter assumption than the contiguity contract documented in_expand_mask_to_base— it also requires the last parent's tail to be the array length. Worth pinning down what real ATL03 actually guarantees.- Test coverage gaps: the full-read fallback isn't exercised with filters wired through, and
_planned_read_group's multi-slice concat +global_idxpath isn't exercised end-to-end (plan_readcovers it offline, but the integration path isn't). _LatBboxGridparity test assertsdf["h"]only; addingdf["leaf_id"]would catch row-ordering drift between paths cheaply.- CLAUDE.md §4 module-size:
processing.pyis past 1400 lines, and_planned_read_groupis largely a structural twin of_read_group_full. Worth a follow-up to extract a shared mask/dict builder.
No new dependencies, no broad noqa / type: ignore, no disabled tests, no infra/CI touched.
Generated by Claude Code
|
🤖 from Claude Phases 5 and 6 pushed (commits Phase 5 —
|
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Inline findings for phase 5 — non-blocking; see the prior top-level review for the summary.
Generated by Claude Code
| poly = grid.shard_footprint(shard_key) | ||
| min_lon, min_lat, max_lon, max_lat = poly.bounds | ||
| bbox = (float(min_lon), float(min_lat), float(max_lon), float(max_lat)) |
There was a problem hiding this comment.
🤖 from Claude (review)
Antimeridian / polar HEALPix cells silently neutralize the optimization.
mort2polygon (mortie/tools.py:664) normalizes polygons that touch the antimeridian to one side, so the common case is fine. But cells that genuinely cross the antimeridian — polar caps, and any base-cell whose ring straddles ±180° — still get vertices on both sides after the lons = np.where(lons > 180, lons - 360, lons) wrap at tools.py:721. For those, poly.bounds reports a west ≈ -180, east ≈ +180 box, which plan_read will then accept ~every coarse parent against, triggering the selectivity fallback and reading the full base array.
That's a degradation, not a correctness bug — the planned path still produces correct output via full_read → _read_group_full. But it silently undoes the OOM mitigation that motivated #49 for exactly the shards that are hardest to process (polar). Worth either:
- guarding here (if
max_lon - min_lon > 180, skip the optimization and call_read_group_fulldirectly), or - emitting two plans for the wrapped halves and concatenating, or
- at minimum a TODO so it doesn't get lost.
Cross-ref src/zagg/grids/healpix.py:171–179 (which returns a single Polygon without antimeridian splitting) and the discussion in mortie/tools.py:505–593 (mort2bbox).
Generated by Claude Code
| # Under the contiguity assumption (#43): ``sum(count) == n_base``, so the | ||
| # last parent's tail gives the total length without an extra header read. | ||
| n_base = int(ibeg_arr[-1]) - index_base + int(cnt_arr[-1]) |
There was a problem hiding this comment.
🤖 from Claude (review)
n_base derivation needs a tighter statement of the assumption.
_expand_mask_to_base's docstring (processing.py:756) says "ranges do not overlap and together tile the full base array" — i.e. sum(count) == n_base and the union covers [0, n_base). That's what justifies the expansion at base rate, not what's computed here.
n_base = ibeg_arr[-1] - index_base + cnt_arr[-1] requires the additional assumption that the last parent's tail equals the base-array length — i.e. no trailing dropped photons, no implicit padding at the tail of the photon array. The contiguity contract as written allows tiles to cover [0, sum(count)) without committing to sum(count) == photon_array_len.
If that gap can occur on a real ATL03 granule (segment-cropped products, dropped-tail trailers), n_base underestimates len(lat_ph) and _expand_mask_to_base(..., n_base) produces a shorter mask than the actual base reads. That'd surface as a shape mismatch when cross_full[global_idx] indexes into the cross-mask later (line 1018).
Two ways to harden this:
- Add an explicit
len(lat_ph)read forn_base(one small extra header read, vs. the OOM-risk path it's avoiding). - Or commit in the docstring + a code assert: "ATL03 contract:
ph_index_beg[-1] + segment_ph_cnt[-1] == len(lat_ph)" and document where that's verified.
Either way, please pin down which it is on the PR thread before this lands on real granules.
Generated by Claude Code
| if ( | ||
| isinstance(data_source.get("read_plan"), dict) | ||
| and data_source["read_plan"].get("spatial_index") | ||
| and data_source.get("levels") | ||
| and data_source.get("base_level") | ||
| ): | ||
| return _planned_read_group(h5obj, group, data_source, shard_key, grid, arrow=arrow) |
There was a problem hiding this comment.
🤖 from Claude (review)
Nit on the dispatch predicate: data_source.get("levels") and data_source.get("base_level") are truthy-checked, so an explicit levels = {} or base_level = "" configuration silently routes back to _read_group_full even though the user wired up a read_plan. That's almost certainly not what they meant. Cheap fix:
levels = data_source.get("levels")
base_level = data_source.get("base_level")
rp = data_source.get("read_plan")
if (
isinstance(rp, dict)
and rp.get("spatial_index")
and isinstance(levels, dict)
and base_level
):
...…or raise a ValueError when read_plan.spatial_index is set but levels/base_level aren't — saves a future user a "why isn't my planned read planning?" debug session.
Generated by Claude Code
| def _planned_read_group( | ||
| h5obj, group: str, data_source: dict, shard_key: int, grid, arrow: bool = False | ||
| ): | ||
| """Planned (AOI-bounded) read of one HDF5 group via the coarse spatial index. | ||
|
|
||
| Issue #43 Phase C: when ``data_source.read_plan.spatial_index`` names a coarse | ||
| level whose ``link`` points at the base level, we read the coarse coordinates | ||
| + link arrays once (small), call :func:`zagg.read_plan.plan_read` to compute | ||
| which base-rate slices the AOI bbox actually touches, and read base-rate | ||
| coords + variables + filter datasets only over those slices via | ||
| :func:`zagg.read_plan.execute_read_plan`. This avoids the | ||
| ``lat_ph`` + ``lon_ph`` full-coord read (up to ~245 MB per ATL03 beam) that | ||
| drives Lambda OOMs (issue #43 motivation). | ||
|
|
||
| Falls back transparently to :func:`_read_group` when: | ||
| - the empty-AOI short-circuit fires (no parents match) → return ``None``; | ||
| - ``plan_read`` flags ``full_read=True`` (selectivity above threshold); | ||
| - the cell ``signal_conf_ph``-style 2-D structured filter would be re-read | ||
| via the planned slices either way (the helper handles that uniformly). | ||
|
|
||
| Returns the same ``pandas.DataFrame`` / ``pyarrow.Table`` / ``None`` contract | ||
| as :func:`_read_group`. Output rows are in plan-slice / spatial-mask / | ||
| filter order — which matches the full-read path's row ordering because the | ||
| plan's runs are emitted in increasing parent index. | ||
| """ | ||
| coordinates = data_source["coordinates"] | ||
| variables = data_source["variables"] | ||
| levels = data_source["levels"] | ||
| base_level_key = data_source["base_level"] | ||
| rp = data_source["read_plan"] | ||
| spatial_index_level = rp["spatial_index"] | ||
| pad = int(rp.get("pad", 1)) | ||
| full_read_threshold = float(rp.get("full_read_threshold", 0.9)) | ||
|
|
||
| si_lvl = levels[spatial_index_level] | ||
| link = si_lvl.get("link") | ||
| if not isinstance(link, dict): | ||
| raise ValueError( | ||
| f"read_plan.spatial_index level {spatial_index_level!r} requires a 'link'" | ||
| ) | ||
| if link["to"] != base_level_key: | ||
| raise ValueError( | ||
| f"read_plan.spatial_index level {spatial_index_level!r} must link " | ||
| f"directly to base level {base_level_key!r} (got link.to={link['to']!r})" | ||
| ) | ||
| index_base = int(link.get("index_base", 0)) | ||
|
|
||
| # Read coarse-level coordinates + link arrays in one go (small — geolocation | ||
| # rate is ~30x lighter than photon rate on ATL03). | ||
| si_lat_path, si_lon_path = _level_coord_paths(si_lvl, group) | ||
| ibeg_path = link["index_beg"].format(group=group) | ||
| cnt_path = link["count"].format(group=group) | ||
| coarse_data = h5obj.readDatasets([si_lat_path, si_lon_path, ibeg_path, cnt_path]) | ||
| coarse_lats = coarse_data[si_lat_path] | ||
| coarse_lons = coarse_data[si_lon_path] | ||
| ibeg_arr = coarse_data[ibeg_path] | ||
| cnt_arr = coarse_data[cnt_path] | ||
|
|
||
| if len(coarse_lats) == 0: | ||
| return None | ||
|
|
||
| # Under the contiguity assumption (#43): ``sum(count) == n_base``, so the | ||
| # last parent's tail gives the total length without an extra header read. | ||
| n_base = int(ibeg_arr[-1]) - index_base + int(cnt_arr[-1]) | ||
| if n_base <= 0: | ||
| return None | ||
|
|
||
| # Compute the shard's WGS84 bbox from the grid (every grid's shard_footprint | ||
| # returns a shapely Polygon). | ||
| poly = grid.shard_footprint(shard_key) | ||
| min_lon, min_lat, max_lon, max_lat = poly.bounds | ||
| bbox = (float(min_lon), float(min_lat), float(max_lon), float(max_lat)) | ||
|
|
||
| plan = plan_read( | ||
| np.asarray(coarse_lats), | ||
| np.asarray(coarse_lons), | ||
| np.asarray(ibeg_arr), | ||
| np.asarray(cnt_arr), | ||
| n_base, | ||
| bbox, | ||
| index_base=index_base, | ||
| pad=pad, | ||
| full_read_threshold=full_read_threshold, | ||
| ) | ||
|
|
||
| if not plan.parent_runs: | ||
| return None # empty AOI -- no parent intersects, skip the group entirely | ||
|
|
||
| if plan.full_read: | ||
| # Selectivity above threshold: many small reads would still sum to most | ||
| # of the file. Defer to the full-coord-read path; semantics identical. | ||
| return _read_group_full(h5obj, group, data_source, shard_key, grid, arrow=arrow) | ||
|
|
||
| # h5coro-compatible reader callback for execute_read_plan. | ||
| def _read_fn(path, hyperslice=None): | ||
| if hyperslice is None: | ||
| return h5obj.readDatasets([path])[path] | ||
| return h5obj.readDatasets([{"dataset": path, "hyperslice": hyperslice}])[path] | ||
|
|
||
| # ---- Read base coords + variables + filter datasets over the planned slices. | ||
| filters = filters_from_data_source(data_source) | ||
| base_structured = [ | ||
| f | ||
| for f in filters | ||
| if "expression" not in f and (f.get("level") is None or f.get("level") == base_level_key) | ||
| ] | ||
| coarse_structured = [ | ||
| f | ||
| for f in filters | ||
| if "expression" not in f | ||
| and f.get("level") is not None | ||
| and f.get("level") != base_level_key | ||
| ] | ||
| expressions = [f for f in filters if "expression" in f] | ||
|
|
||
| lat_path = coordinates["latitude"].format(group=group) | ||
| lon_path = coordinates["longitude"].format(group=group) | ||
| lats = execute_read_plan(plan, _read_fn, lat_path, np.float64) | ||
| lons = execute_read_plan(plan, _read_fn, lon_path, np.float64) | ||
|
|
||
| if len(lats) == 0: | ||
| return None | ||
|
|
||
| # Apply spatial / shard mask over the concatenated planned reads. | ||
| leaf_ids = grid.assign(lats, lons) | ||
| mask_spatial = grid.shards_of(leaf_ids) == shard_key | ||
| if np.sum(mask_spatial) == 0: | ||
| return None | ||
|
|
||
| # Read the variables and base-level filter datasets via the same plan. Read | ||
| # each distinct path once (the variable and filter dataset paths can coincide). | ||
| var_paths = {col: tmpl.format(group=group) for col, tmpl in variables.items()} | ||
| filter_paths = {id(f): f["dataset"].format(group=group) for f in base_structured} | ||
| paths_seen: set[str] = set() | ||
| arrays_by_path: dict[str, np.ndarray] = {} | ||
| for path in list(var_paths.values()) + list(filter_paths.values()): | ||
| if path in paths_seen: | ||
| continue | ||
| paths_seen.add(path) | ||
| # dtype hint isn't load-bearing -- execute_read_plan dtype-casts via | ||
| # np.asarray, which is a no-op when the source dtype already matches. | ||
| arrays_by_path[path] = execute_read_plan(plan, _read_fn, path, None) | ||
|
|
||
| # Base-level structured filters: ANDed keep-masks over the concatenated reads. | ||
| keep_mask: np.ndarray | None = None | ||
| for f in base_structured: | ||
| flag = arrays_by_path[filter_paths[id(f)]][mask_spatial] | ||
| fmask = _predicate_mask(flag, f) | ||
| keep_mask = fmask if keep_mask is None else (keep_mask & fmask) | ||
|
|
||
| # Cross-level (Phase B) filters: read coarse flags fully, expand to base | ||
| # rate (length n_base), then subset to the planned indices. | ||
| if coarse_structured: | ||
| # Build the global base-index array once: which original-base positions | ||
| # are present in the concatenated planned read. | ||
| global_idx = np.concatenate( | ||
| [np.arange(s, e, dtype=np.int64) for s, e in plan.base_slices] | ||
| ) | ||
| cross_full: np.ndarray | None = None | ||
| for f in coarse_structured: | ||
| level_key = f["level"] | ||
| cf_lvl = levels[level_key] | ||
| cf_link = cf_lvl["link"] | ||
| cf_index_base = int(cf_link.get("index_base", 0)) | ||
| cf_flag_path = f["dataset"].format(group=group) | ||
| cf_ibeg_path = cf_link["index_beg"].format(group=group) | ||
| cf_cnt_path = cf_link["count"].format(group=group) | ||
| cf_data = h5obj.readDatasets([cf_flag_path, cf_ibeg_path, cf_cnt_path]) | ||
| cf_flag = cf_data[cf_flag_path] | ||
| cf_ibeg = cf_data[cf_ibeg_path] | ||
| cf_cnt = cf_data[cf_cnt_path] | ||
| coarse_fmask = _predicate_mask(cf_flag, f) | ||
| expanded = _expand_mask_to_base(coarse_fmask, cf_ibeg, cf_cnt, cf_index_base, n_base) | ||
| cross_full = expanded if cross_full is None else (cross_full & expanded) | ||
| # Subset the full-length mask to the concatenated planned indices, then | ||
| # to the spatial keep window so it lines up with keep_mask above. | ||
| cross_planned = cross_full[global_idx][mask_spatial] | ||
| keep_mask = cross_planned if keep_mask is None else (keep_mask & cross_planned) | ||
|
|
||
| if keep_mask is not None and np.sum(keep_mask) == 0: | ||
| return None | ||
|
|
||
| # Build the data dict (variables sliced to mask_spatial, then to keep_mask). | ||
| leaf_after_spatial = leaf_ids[mask_spatial] | ||
| data_dict: dict[str, np.ndarray] = {} | ||
| for col_name, path in var_paths.items(): | ||
| values = arrays_by_path[path][mask_spatial] | ||
| if keep_mask is not None: | ||
| values = values[keep_mask] | ||
| data_dict[col_name] = values | ||
| data_dict["leaf_id"] = leaf_after_spatial[keep_mask] if keep_mask is not None else leaf_after_spatial | ||
|
|
||
| # Base-level expression filters (aggregation-time escape hatch, no pushdown). | ||
| for f in expressions: | ||
| cols = {c: data_dict[c] for c in variables if c in data_dict} | ||
| try: | ||
| emask = evaluate_filter_expression(f["expression"], cols) | ||
| except NameError as e: | ||
| raise NameError( | ||
| f"expression filter {f['expression']!r} references an undefined name: {e}" | ||
| ) from e | ||
| if emask.shape != data_dict["leaf_id"].shape: | ||
| raise ValueError( | ||
| f"expression filter {f['expression']!r} must yield a per-row " | ||
| f"boolean mask (got shape {emask.shape})" | ||
| ) | ||
| if np.sum(emask) == 0: | ||
| return None | ||
| data_dict = {k: v[emask] for k, v in data_dict.items()} | ||
|
|
||
| if arrow: | ||
| import pyarrow as pa | ||
|
|
||
| return pa.table(data_dict) | ||
| return pd.DataFrame(data_dict) |
There was a problem hiding this comment.
🤖 from Claude (review)
Module size / duplication (CLAUDE.md §4).
processing.py is now ~1448 lines (was 1293 before this PR); CLAUDE.md §4 says exceeding ~1000 lines requires prior discussion before continuing. This phase moves it further past the line, and _planned_read_group (lines 842–1056) is largely a structural twin of _read_group_full below it: same filter partition, same dedup-by-id pattern for var_paths/filter_paths, same base-vs-cross AND chain, same expression-filter loop, same arrow/pandas tail.
Not blocking phase 5 — but worth a follow-up that extracts a shared helper for the post-read shape:
def _apply_filters_and_build_dict(
lats, lons, leaf_ids, mask_spatial,
arrays_by_path, var_paths, filter_paths,
base_structured, cross_mask_planned, expressions, variables, ...
) -> dict | Noneand then both paths reduce to "read the right slices, then call the helper." Worth raising on the PR thread (or as a follow-up issue) for an @espg call before the next phase compounds it further.
Generated by Claude Code
| paths_seen: set[str] = set() | ||
| arrays_by_path: dict[str, np.ndarray] = {} | ||
| for path in list(var_paths.values()) + list(filter_paths.values()): | ||
| if path in paths_seen: | ||
| continue | ||
| paths_seen.add(path) | ||
| # dtype hint isn't load-bearing -- execute_read_plan dtype-casts via | ||
| # np.asarray, which is a no-op when the source dtype already matches. | ||
| arrays_by_path[path] = execute_read_plan(plan, _read_fn, path, None) |
There was a problem hiding this comment.
🤖 from Claude (review)
Small dtype subtlety worth a comment in code (not a bug): you pass dtype=None to execute_read_plan, which then does np.asarray(read_fn(...), dtype=None). That's a no-op cast (preserves source dtype), which is what you want for variables of mixed dtypes (uint8 flags vs float32 heights vs structured signal_conf rows). Worth one inline # dtype=None preserves source dtype — varies across var/filter datasets so a future reader doesn't "fix" it by hardcoding np.float64.
Also: arrays_by_path is keyed on the resolved HDF5 path string. The filter_paths = {id(f): ...} dict is keyed on filter identity so two filters can target the same dataset path without colliding the cache — same pattern as _read_group_full:1195. Looks correct, but worth making the intent explicit in a comment.
Generated by Claude Code
| {"dataset": "/seg/podppd", "op": "eq", "value": 0, "level": "segments"} | ||
| ) | ||
| if filters: | ||
| ds["filters"] = filters | ||
| return ds | ||
|
|
||
|
|
||
| def _planned_read_h5(*, qs=None, podppd=None): |
There was a problem hiding this comment.
🤖 from Claude (review)
Coverage gap: full_read=True fallback isn't exercised with filters wired through.
test_full_read_fallback_on_high_selectivity runs the fallback through _read_group_full with no filters at all. The interesting failure mode for that path on real ATL03 is the combination "planned read trips selectivity threshold → fallback to _read_group_full → cross-level filter expansion runs at coarse rate." That combination's hand-off (specifically: that _read_group_full re-reads link arrays it doesn't need from the data_source's read_plan config, vs. reusing the coarse arrays already in memory in _planned_read_group) goes uncovered.
Two cheap additions:
- Add a copy of this test with
with_coarse_filter=Trueand confirm output still matches the no-fallback planned-path output. - Add a copy of
test_parity_with_full_readthat forces the planned side into fallback (lowfull_read_threshold) so the entry point differs but the result is byte-identical.
Both stay offline (no h5coro, no S3) and use the existing fixtures.
Generated by Claude Code
| "link": { | ||
| "to": "photons", | ||
| "index_beg": "/seg/ph_index_beg", | ||
| "count": "/seg/segment_ph_cnt", | ||
| "index_base": 0, | ||
| }, | ||
| }, | ||
| }, | ||
| "read_plan": {"spatial_index": "segments", "pad": 0}, | ||
| } | ||
| filters = [] | ||
| if with_base_filter: | ||
| filters.append({"dataset": "/heights/qs", "op": "eq", "value": 0}) | ||
| if with_coarse_filter: | ||
| filters.append( | ||
| {"dataset": "/seg/podppd", "op": "eq", "value": 0, "level": "segments"} | ||
| ) | ||
| if filters: | ||
| ds["filters"] = filters | ||
| return ds | ||
|
|
||
|
|
||
| def _planned_read_h5(*, qs=None, podppd=None): | ||
| """Six-segment / 12-photon HDF5 stub with optional base/coarse flag arrays. | ||
|
|
||
| Segments live at lats 0,100,200,300,400,500 (lon 0); photons at lats | ||
| 0,50,100,150,200,250,...,550 (lon 0). The wide segment spacing keeps the | ||
| ``plan_read`` linestring-crossing check from sweeping unrelated segments | ||
| into the matched range -- a narrow bbox between two rep-points stays | ||
| bounded by the immediate neighbours. | ||
| """ | ||
| seg_lats = np.array([0.0, 100.0, 200.0, 300.0, 400.0, 500.0]) | ||
| seg_lons = np.zeros(6) | ||
| ibeg = np.arange(0, 12, 2, dtype=np.int64) | ||
| cnt = np.full(6, 2, dtype=np.int64) | ||
| ph_lats = np.array( | ||
| [0.0, 50.0, 100.0, 150.0, 200.0, 250.0, 300.0, 350.0, 400.0, 450.0, 500.0, 550.0] | ||
| ) | ||
| ph_lons = np.zeros(12) | ||
| h = np.arange(12.0, dtype=np.float32) * 10.0 | ||
| arrays = { | ||
| "/seg/lat": seg_lats, | ||
| "/seg/lon": seg_lons, | ||
| "/seg/ph_index_beg": ibeg, | ||
| "/seg/segment_ph_cnt": cnt, | ||
| "/heights/lat_ph": ph_lats, | ||
| "/heights/lon_ph": ph_lons, | ||
| "/heights/h": h, | ||
| } | ||
| if qs is not None: | ||
| arrays["/heights/qs"] = np.asarray(qs) | ||
| if podppd is not None: | ||
| arrays["/seg/podppd"] = np.asarray(podppd) | ||
| return _FakeH5(arrays) | ||
|
|
||
|
|
||
| class TestPlannedReadGroup: | ||
| """Phase C: ``_read_group`` dispatches to ``_planned_read_group`` when | ||
| ``data_source.read_plan.spatial_index`` is set, bounding the base-rate IO | ||
| via the coarse-level rep-point coords + link arrays. | ||
|
|
||
| The shared fixture lays out 6 segments at lats ``[0, 100, 200, 300, 400, | ||
| 500]`` covering 12 photons (2 each). The wide spacing keeps ``plan_read``'s | ||
| linestring-crossing sweep bounded: a bbox between two adjacent rep-points | ||
| pulls in exactly its two neighbours.""" | ||
|
|
||
| def test_planned_path_bounds_io_to_matched_segments(self): | ||
| # Bbox (-0.1, 175, 0.1, 225) directly contains segment 2 (lat=200); | ||
| # segment 1's (lat 100 -> 200) linestring crosses the lower edge so | ||
| # plan_read sweeps segment 1 in too. Two adjacent segments -> one | ||
| # contiguous run -> photons 2..5 in the base array. | ||
| ds = _planned_read_data_source() | ||
| h5 = _planned_read_h5() | ||
| grid = _BboxGrid((-0.1, 175.0, 0.1, 225.0)) | ||
| df = _read_group(h5, "gt1l", ds, 0, grid) | ||
| assert df["h"].tolist() == [20.0, 30.0, 40.0, 50.0] | ||
|
|
||
| def test_empty_aoi_returns_none(self): | ||
| # Bbox far from any segment rep-point or linestring -> no parents | ||
| # match -> short-circuit return None before any base-rate read. | ||
| ds = _planned_read_data_source() | ||
| h5 = _planned_read_h5() | ||
| grid = _BboxGrid((10000.0, 10000.0, 10001.0, 10001.0)) | ||
| assert _read_group(h5, "gt1l", ds, 0, grid) is None | ||
|
|
||
| def test_full_read_fallback_on_high_selectivity(self): | ||
| # full_read_threshold lowered so any plan covering >=10% of n_base | ||
| # (>=2/12 photons) triggers the fallback. Same bbox as the basic test | ||
| # selects 4/12 = 33% -> falls through to _read_group_full and reads | ||
| # everything; the permissive grid keeps all 12. | ||
| ds = _planned_read_data_source() | ||
| ds["read_plan"]["full_read_threshold"] = 0.1 | ||
| h5 = _planned_read_h5() | ||
| grid = _BboxGrid((-0.1, 175.0, 0.1, 225.0)) | ||
| df = _read_group(h5, "gt1l", ds, 0, grid) | ||
| assert df["h"].tolist() == [float(i * 10) for i in range(12)] | ||
|
|
||
| def test_parity_with_full_read(self): | ||
| # Both paths produce the same row set when the spatial mask is keyed | ||
| # on lat: the planned read narrows IO to photons 2..5 (via plan_read); | ||
| # _LatBboxGrid.shards_of further restricts to photons with lat in | ||
| # bbox range (photon 4, lat=200). qs drops nothing in-shard. | ||
| qs = np.array([0] * 12, dtype=np.int8) | ||
| h5 = _planned_read_h5(qs=qs) | ||
| grid = _LatBboxGrid((-0.1, 175.0, 0.1, 225.0)) | ||
|
|
||
| ds_planned = _planned_read_data_source(with_base_filter=True) | ||
| ds_full = { | ||
| "coordinates": { | ||
| "latitude": "/heights/lat_ph", | ||
| "longitude": "/heights/lon_ph", | ||
| }, | ||
| "variables": {"h": "/heights/h"}, | ||
| "filters": [{"dataset": "/heights/qs", "op": "eq", "value": 0}], | ||
| } | ||
|
|
||
| df_planned = _read_group(h5, "gt1l", ds_planned, 0, grid) | ||
| df_full = _read_group(h5, "gt1l", ds_full, 0, grid) | ||
| # Photon 4 (lat=200, h=40) is the only one in the bbox lat range. | ||
| assert df_planned["h"].tolist() == [40.0] | ||
| assert df_full["h"].tolist() == [40.0] | ||
|
|
||
| def test_coarse_filter_via_planned_path(self): | ||
| # Cross-level (Phase B) filter ANDs with the planned path: drop | ||
| # segment 1 via podppd; segment 2 (also pulled in by the linestring | ||
| # sweep) survives. Photons 2,3 dropped; 4,5 kept. | ||
| ds = _planned_read_data_source(with_coarse_filter=True) | ||
| podppd = np.array([0, 1, 0, 0, 0, 0], dtype=np.int8) | ||
| h5 = _planned_read_h5(podppd=podppd) | ||
| grid = _BboxGrid((-0.1, 175.0, 0.1, 225.0)) | ||
| df = _read_group(h5, "gt1l", ds, 0, grid) | ||
| assert df["h"].tolist() == [40.0, 50.0] | ||
|
|
||
| def test_pad_extends_selection(self): | ||
| # bbox (490..510) covers segment 5 (last, lat=500) directly; segment | ||
| # 4's linestring (400 -> 500) crosses the lower edge. With pad=0: | ||
| # segments 4,5 -> photons 8..11. With pad=1: segments 3,4,5,6(clamped | ||
| # back to 5) -> photons 6..11. | ||
| ds = _planned_read_data_source() | ||
| ds["read_plan"]["pad"] = 1 | ||
| h5 = _planned_read_h5() | ||
| grid = _BboxGrid((-0.1, 490.0, 0.1, 510.0)) | ||
| df = _read_group(h5, "gt1l", ds, 0, grid) | ||
| assert df["h"].tolist() == [60.0, 70.0, 80.0, 90.0, 100.0, 110.0] | ||
|
|
||
| def test_invalid_link_target_raises(self): | ||
| # The spatial_index level's link must point at the base level. | ||
| ds = _planned_read_data_source() | ||
| ds["levels"]["segments"]["link"]["to"] = "not_a_level" | ||
| h5 = _planned_read_h5() | ||
| grid = _BboxGrid((-0.1, 175.0, 0.1, 225.0)) |
There was a problem hiding this comment.
🤖 from Claude (review)
Coverage gap: multi-slice plan.base_slices not exercised end-to-end.
Every test in TestPlannedReadGroup produces a single contiguous run (the wide segment spacing in _planned_read_h5 was deliberately chosen for that). That's good for keeping the assertions tractable, but it means the load-bearing concat-and-realign code paths in _planned_read_group:
lats = execute_read_plan(...)over multi-chunkchunk_lists,global_idx = np.concatenate([np.arange(s, e) for s, e in plan.base_slices])withlen(base_slices) > 1,cross_full[global_idx][mask_spatial]realignment when multiple ranges are present,
are not actually run by these tests. plan_read unit tests cover the multi-run case offline, but the integration through _planned_read_group (where mask_spatial/cross_planned lengths and data_dict["leaf_id"] ordering can diverge between paths) doesn't get exercised.
A targeted test: bbox composed of two disjoint mini-boxes by enabling pad=0 and crafting a fixture with two AOI-matched segments separated by an unmatched one. Then assert df["h"] ordering matches the full-read path on the same fixture.
Generated by Claude Code
| The shared fixture lays out 6 segments at lats ``[0, 100, 200, 300, 400, | ||
| 500]`` covering 12 photons (2 each). The wide spacing keeps ``plan_read``'s | ||
| linestring-crossing sweep bounded: a bbox between two adjacent rep-points | ||
| pulls in exactly its two neighbours.""" | ||
|
|
||
| def test_planned_path_bounds_io_to_matched_segments(self): | ||
| # Bbox (-0.1, 175, 0.1, 225) directly contains segment 2 (lat=200); | ||
| # segment 1's (lat 100 -> 200) linestring crosses the lower edge so | ||
| # plan_read sweeps segment 1 in too. Two adjacent segments -> one | ||
| # contiguous run -> photons 2..5 in the base array. | ||
| ds = _planned_read_data_source() | ||
| h5 = _planned_read_h5() | ||
| grid = _BboxGrid((-0.1, 175.0, 0.1, 225.0)) | ||
| df = _read_group(h5, "gt1l", ds, 0, grid) | ||
| assert df["h"].tolist() == [20.0, 30.0, 40.0, 50.0] | ||
|
|
||
| def test_empty_aoi_returns_none(self): | ||
| # Bbox far from any segment rep-point or linestring -> no parents | ||
| # match -> short-circuit return None before any base-rate read. |
There was a problem hiding this comment.
🤖 from Claude (review)
test_parity_with_full_read asserts only df_planned["h"].tolist() == df_full["h"].tolist() == [40.0]. Since leaf_id is what carries row-order information through both paths (and is the column whose ordering would expose a re-slice bug between plan-slice order vs. base-array order), please add:
assert df_planned["leaf_id"].tolist() == df_full["leaf_id"].tolist()Free strengthening of the parity claim, given _LatBboxGrid already encodes lat into leaf_id. Without it, a future regression that reorders rows between paths would still pass on a single-row result.
Also: _LatBboxGrid is deliberately a hack (lat-in-leaf-id) — a one-line comment in the test file pinning down "this is only valid because both paths feed identical lats to assign(); real grids return integer cell ids" would save the next reader 10 minutes.
Generated by Claude Code
| base_slices.append((base_start, base_end)) | ||
| chunk_lists.append([(base_start, base_end - 1)]) # h5coro inclusive end | ||
| chunk_lists.append([(base_start, base_end)]) # h5coro half-open [start, end) | ||
| total_base += base_end - base_start | ||
|
|
||
| # -- Selectivity fallback -- | ||
| if total_base > full_read_threshold * n_base: | ||
| return ReadPlan( | ||
| parent_runs=[(0, n_coarse - 1)], | ||
| base_slices=[(0, n_base)], | ||
| chunk_lists=[[(0, n_base - 1)]], | ||
| chunk_lists=[[(0, n_base)]], |
There was a problem hiding this comment.
🤖 from Claude (review)
Confirmed fix: h5coro is half-open. h5coro/h5dataset.py:182 raises FatalError("must provide as list of ranges [x,y)") and the boundary check at h5dataset.py:188 uses self.hyperslice[d][1] > self.meta.dimensions[d] (strict >), so (0, n_base) is in-bounds when n_base == dim. The selectivity-fallback's [(0, n_base)] is correct.
The previous (start, end-1) tests only passed because the test callbacks compensated with data[lo:hi+1]; updating them to data[lo:hi] aligns them with how _planned_read_group._read_fn actually drives h5coro. No subtle landmine here, but worth a one-line cross-reference in the docstring so this convention survives the next refactor:
chunk_lists : list of list of (int, int)
h5coro-style ``[(start, end)]`` hyperslice lists (half-open
``[start, end)``, matches h5coro's ``h5dataset.py:182``
"must provide as list of ranges [x,y)" contract).You've got most of that — just adding the explicit h5dataset.py:182 anchor makes the convention searchable.
Generated by Claude Code
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Phase 6 is overwhelmingly YAML and reads cleanly against the phase 5 wiring + the issue #43 body's schema. The HDF5 paths (/{group}/geolocation/ for reference_photon_lat/reference_photon_lon/ph_index_beg/segment_ph_cnt) match the schema example in #43, and index_base: 1 matches #43's "both 1-based" claim — and is arithmetically consistent end-to-end (e.g. ph_index_beg=[1,3,5], segment_ph_cnt=[2,2,2] → n_base = 5 - 1 + 2 = 6; per-parent base_start = ibeg[s] - 1 gives 0/2/4, which match 0-based starts). The signal_conf_ph filter dataset and data_source.variables map are unchanged, so the read-path dedup in _planned_read_group (paths_seen) cannot double-read.
Five diff-scoped findings below; the biggest two are (1) no integration test exercises the shipped templates through _planned_read_group — TestPlannedReadGroup uses synthetic data with index_base: 0, so the index_base: 1 arithmetic plus the exact ATL03 path templates and pad: 1 boundary behavior have zero offline coverage — and (2) signal_conf_ph in levels.photons.variables is declarative-only and will mislead the next reader, because the read path never consults level.variables.
Generated by Claude Code
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Phase 6's YAML reads cleanly against the phase 5 wiring + #43 spec. HDF5 paths (/{group}/geolocation/ for reference_photon_lat/reference_photon_lon/ph_index_beg/segment_ph_cnt) match #43's schema example; index_base: 1 matches the "both 1-based" claim and is arithmetically consistent (ph_index_beg=[1,3,5] + cnt=[2,2,2] → n_base = 5-1+2 = 6, base_start = ibeg[s]-1 ∈ {0,2,4}). The signal_conf_ph filter and data_source.variables are unchanged so dedup in _planned_read_group (paths_seen) holds.
Five findings, the two biggest first:
- No integration test exercises the shipped templates through
_planned_read_group—TestPlannedReadGroupuses synthetic data withindex_base: 0and bespoke/seg/paths, so this PR has zero offline coverage ofindex_base=1arithmetic + the exact ATL03 path templates resolving end-to-end. Mirrors the phase 4 gap that was folded astest_atl03_shipped_template_2d_signal_conf. signal_conf_phinlevels.photons.variablesis declarative-only — read path usesdata_source.variables, neverlevel.variables. Misleads the next reader.- Asymmetric path conventions: coordinates are bare names (joined to
path), link fields are absolute templates. Both work today but they will trip the next person who copies #43's schema example literally (it uses bare names for both). - Cross-template link-equality assertion is too tight — whole-dict comparison couples to fields that don't affect plan parity.
pad: 1choice is undocumented and not exercised — both shipped templates usepad: 1butTestPlannedReadGroupusespad: 0. No test covers the chosen default.
None are correctness bugs against the synthetic test surface. (1) is the one I'd hold over before merge — without it a future YAML edit that drops index_base (defaulting to 0) lands green.
Generated by Claude Code
| def test_read_plan_targets_segments_level(self, atl03_config): | ||
| rp = atl03_config.data_source["read_plan"] | ||
| assert rp["spatial_index"] == "segments" | ||
| assert rp["pad"] == 1 |
There was a problem hiding this comment.
🤖 from Claude (review)
Coverage gap: phase 6 has no offline integration test that runs the shipped templates through _planned_read_group.
These three new tests assert the YAML parses into the expected shape — base_level == "photons", two level keys, pad == 1, index_base == 1. They do not verify the load-bearing combination at runtime:
- the
index_base: 1arithmetic against the actual paths the YAML declares (_planned_read_groupdoesn_base = ibeg[-1] - 1 + cnt[-1]andbase_start = ibeg[s] - 1— both go throughint(...)casts that mask sign errors silently if the YAML ever shipsindex_base: 0by mistake), - the absolute path templates in the link arrays correctly resolving via
link["index_beg"].format(group=group)against a_FakeH5with a realistic ATL03 group layout (/gt1l/heights/...+/gt1l/geolocation/...), - the segments-level
coordinates: {latitude: reference_photon_lat, longitude: reference_photon_lon}joining againstpath: "/{group}/geolocation"to produce/gt1l/geolocation/reference_photon_lat(via_level_coord_pathsatprocessing.py:837).
TestPlannedReadGroup exercises _planned_read_group end-to-end, but its _planned_read_data_source() fixture uses index_base: 0 and bespoke /seg/ paths (tests/test_processing.py:1744) — so the index_base: 1 code path through the shipped template has zero test coverage in this PR.
This is the same gap pattern the phase 4 review caught and folded as TestReadGroupFilters::test_atl03_shipped_template_2d_signal_conf: load default_config("atl03").data_source, rewrite the dataset paths onto a _FakeH5 with a realistic ATL03-shape group layout (6 segments tiling 12 photons with 1-based ph_index_beg = [1, 3, 5, 7, 9, 11]), and run it through _read_group (which now dispatches to _planned_read_group). Confirm the planned path produces the same row set as the full-read path on the same fixture, and assert at least one pad: 1 boundary case (AOI sitting between two segments' rep-points).
Without that test, a future YAML edit that drops index_base (defaulting to 0) or that swaps the link to a different array name would land green.
Generated by Claude Code
| photons: | ||
| path: "/{group}/heights" | ||
| coordinates: {latitude: lat_ph, longitude: lon_ph} | ||
| variables: {h_ph: h_ph, signal_conf_ph: signal_conf_ph} |
There was a problem hiding this comment.
🤖 from Claude (review)
signal_conf_ph in levels.photons.variables is declarative-only and will mislead the next reader.
I checked the read path: _planned_read_group (src/zagg/processing.py:973) builds var_paths from data_source["variables"] (the top-level flat map), not from levels[base_level]["variables"]. _read_group_full does the same. A grep confirms nothing in src/zagg/ consumes level["variables"] for read or output:
$ grep -n 'lvl\[.variables.\]\|level\[.variables.\]\|levels\[.*\]\[.variables.\]' src/zagg/*.py
(no results)
So adding signal_conf_ph to levels.photons.variables here (and identically in atl03_waveform_counts.yaml:46) has zero functional effect — the dataset is read because the filter's dataset: "/{group}/heights/signal_conf_ph" triggers it via filter_paths in _planned_read_group, not because it appears in the level's variables map. But a reader of the YAML who hasn't read processing.py will reasonably expect that listing signal_conf_ph under the level's variables either (a) makes it appear as an output column or (b) bounds the read scope. Neither is true.
Two options:
- Drop
signal_conf_phfromlevels.photons.variableson both YAMLs. The filter reads it via its owndatasetfield; nothing needs the declaration. (h_phsimilarly is read fromdata_source.variables, not the level — also redundant, but at least it's a real output column.) - Keep it for documentation but add a YAML comment noting that
levels.<x>.variablesis currently a documentation hint only — the read path usesdata_source.variables. This will rot the moment someone wires the level's variables into the read path (which Generic hierarchical filtering & targeted reads for multi-level point datasets #43's design implies should happen eventually).
(1) is the cleaner ship — levels.<x>.variables reads as authoritative when in fact it's the top-level map that's authoritative.
Generated by Claude Code
| link: | ||
| to: photons | ||
| index_beg: "/{group}/geolocation/ph_index_beg" | ||
| count: "/{group}/geolocation/segment_ph_cnt" |
There was a problem hiding this comment.
🤖 from Claude (review)
Asymmetric path conventions: coordinates use bare names (joined to path), link fields use absolute /{group}/... templates.
The level's coordinates here are bare names — reference_photon_lat / reference_photon_lon — which _level_coord_paths (src/zagg/processing.py:837) joins to the level's path template (/{group}/geolocation) to produce /{group}/geolocation/reference_photon_lat. That join behavior is explicit:
lat_path = lat_name if lat_name.startswith("/") else f"{base}/{lat_name}"But the link fields one line below — index_beg: "/{group}/geolocation/ph_index_beg", count: "/{group}/geolocation/segment_ph_cnt" — are absolute templates. _planned_read_group resolves them with link["index_beg"].format(group=group) (processing.py:892); there is no path-join. Both shapes work today (absolute paths bypass the join short-circuit), but the inconsistency will trip the next person who copies issue #43's schema example literally — that example uses bare names for both coordinates and link fields:
link: to: photon index_beg: ph_index_beg # start of each parent's children in `to` count: segment_ph_cnt # children per parent
Two options, both small:
- Pick relative everywhere on both ATL03 YAMLs:
index_beg: ph_index_beg,count: segment_ph_cnt. Requires extending_planned_read_group(and the cross-level filter atprocessing.py:1007–1009) to apply the same join rule the coordinate resolver uses. That's where this is heading per Generic hierarchical filtering & targeted reads for multi-level point datasets #43, so it's the future-proof choice. - Pick absolute everywhere on both ATL03 YAMLs: rewrite the coords as
latitude: "/{group}/geolocation/reference_photon_lat". Smaller change, but it freezes the asymmetry in for ATL03 specifically.
If you want neither now, please at minimum add a YAML comment block at the top of levels: noting which convention the link fields follow — otherwise the next mission's template will be written against the issue #43 spec and fail at link["index_beg"].format(group=group) when no / is there to absolute-prefix.
Generated by Claude Code
| # Cross-check the link is identical to atl03's so the two templates | ||
| # share the same plan_read inputs on the same granule. | ||
| atl03_link = default_config("atl03").data_source["levels"]["segments"]["link"] | ||
| assert ds["levels"]["segments"]["link"] == atl03_link |
There was a problem hiding this comment.
🤖 from Claude (review)
Cross-template link-equality assertion is too tight.
The whole-dict comparison ds["levels"]["segments"]["link"] == atl03_link enforces bit-identical link blocks across the two templates. The motivation as documented in the comment is "share plan_read inputs on the same granule" — which only requires that index_beg, count, and index_base resolve to the same arrays on disk (so the planned slices match). It does not require pad (rightly not in the link, but the same logic applies to other future link knobs like reference_index) to be identical, and it does not require future-only fields like the reserved reference_index slot to stay in lock-step if one template legitimately needs to diverge.
A waveform-specific edit that legitimately diverges (say, a different reference_index once that field activates per _validate_levels at config.py:586) would fail this assertion even though plan_read parity is preserved. That makes the test a maintenance footgun on any future link-schema growth.
Tighten to the fields that genuinely must match for plan parity:
for key in ("to", "index_beg", "count", "index_base"):
assert ds["levels"]["segments"]["link"][key] == atl03_link[key]Same intent, surfaces a real divergence cleanly without coupling to fields whose drift doesn't break the shared-plan invariant.
Generated by Claude Code
| index_base: 1 # ATL03 ph_index_beg is 1-based per the v3 dict | ||
| read_plan: | ||
| spatial_index: segments | ||
| pad: 1 # one segment of padding on each side per #43 |
There was a problem hiding this comment.
🤖 from Claude (review)
pad: 1 choice is undocumented and not exercised by any test in this PR.
The YAML comment says # one segment of padding on each side per #43. #43's issue body in fact defines pad ("pad the matched parent range by N elements per side") without committing to a default; the prototype used 1 but doesn't justify it as the operational default for the grids zagg actually ships. The phase 5 planned-read tests use pad: 0 (tests/test_processing.py:1748) for tractability; TestPlanReadIndexBase uses pad: 0 for the index_base=1 case (tests/test_read_plan.py:131); the only pad: 1 exercise is implicit in TestPlannedReadGroup::test_planned_path_bounds_io_to_matched_segments where the wide segment spacing makes pad behavior irrelevant.
Concretely:
- ATL03 segment spacing is ~20 m along-track.
pad: 1= one segment = ~20 m of padding on each side of a matched run. - The rectilinear grid in both YAMLs is
resolution: 0.0001(~10 m at the equator). A shard footprint of 10 m × 10 m sits inside a single 20 m segment —pad: 1means up to ~20 m of always-discarded reads per shard, on top of the actual span. Not wasteful in absolute terms but worth knowing. - At high latitude (~88°S — the
--hardAOI in the benchmark), a rectilinear cell of 0.0001° in longitude is meters across, but the segment footprint stretches differently. The 1-segment pad might genuinely under-pad there if the AOI clips a segment boundary at an angle.
This isn't blocking, but two requests:
- Add a YAML comment that justifies
pad: 1from the grid resolution, not just "per Generic hierarchical filtering & targeted reads for multi-level point datasets #43" — e.g. "1 segment ≈ 20 m, matches the 0.0001° rectilinear resolution to within an order of magnitude; tune if HEALPix order-19 lands." - Either add an offline test that exercises
pad: 1at a segment boundary on the shipped template (would fit naturally into the integration test I'm flagging ontests/test_config.py:164), or at least add a unit test intests/test_read_plan.pythat confirmspad: 1extends a single-segment match into a 3-segment slice withindex_base=1arithmetic. The currentTestPlanReadIndexBasetest usespad: 0.
Generated by Claude Code
…hardening, +5 tests
|
🤖 from Claude Phase 5 self-review came back (review at pullrequestreview-4527...) — 9 findings, of which 2 were real hazards, 3 were test gaps, and 4 were quality / docstring nits. Folded the 5 diff-scoped ones in Folded (
Standing:
Gates: Phase 6 self-review is still running on a separate fresh agent; its findings will land below. Generated by Claude Code |
… shipped template
|
🤖 from Claude Phase 6 self-review came back (review 4527201199) — 5 findings. The headline gap was "no integration test exercises the shipped templates through Folded (
Gates: Standing (carried from earlier phases):
Both phases 5 and 6 are now complete + folded. Setting Generated by Claude Code |
|
First result back (simple case, 2D counts): |
|
Updated benchmark for 2018 to mid 2026 (combined): |
|
Updated-- getting better! Note that this a more compute intensive t-digest run, rather than just counts aggregation... |
|
this one is suspicious... wall time is exceeding the lambda timeout, so kinda suspect that we're losing data... |
|
for reference, we were testing cost scaling as a function of shard size-- it decreases with larger shards (as expected); here's the opposite order direction that drives the cost up: |
Refs #30, Refs #43
Implements the three #30 deliverables plus the missing #43 Phase C wiring that they need to actually run on Lambda: a built-in ATL03 template, a real-data region timing benchmark, an ATL03 waveform-counts template (128-bin fixed-width vector histogram), the
read_planintegration into_read_group(the #49 Phase C module landed but was never plumbed), and the multi-level ATL03 templates that ride it. Without phases 5+6 the ATL03 templates hit the ~245 MB-per-beam coord-read floor and OOM on Lambda — per @espg in #47 (comment).Phases
benchmarks/region_timing.py(written + import/construct-checked + ruff-clean; not executed against S3) +handoffpassthrough onagg()atl03_waveform_counts.yamltemplate + benchmark extension (useskind: vectorfrom Support non-scalar aggregation outputs (vectors + ragged per-cell payloads) #29; fully CI-testable)dem_h, centre onnp.median(h_ph))plan_read/execute_read_planinto_read_group(the missing Generic hierarchical filtering & targeted reads (#43) #49 Phase C integration; AOI-bounded base-rate reads via the coarse-level spatial index; full-read fallback on selectivity / no-plan / wide-bbox antimeridian; review fold added 5 more tests including multi-slice + dispatch hardening)read_plan: { spatial_index: segments, pad: 1 }; review fold added a shipped-template integration test through_planned_read_groupcoveringindex_base: 1arithmetic + real{group}path templatesPhase 1 — what / approach
src/zagg/configs/atl03.yaml, registered likeatl06/atl06_polar(loaded viadefault_config("atl03")). Scalar fields only:count(density),h_min/h_max,h_mean,h_median,h_variance.signal_conf_ph != -2(drop only TEP photons; keep everything else) per the ATL03 v3 data dictionary.Phase 2 — what / approach
benchmarks/region_timing.py— drives the realQuery → make_shardmap → runner.agg()path for ATL03 over three 10 km × 10 km AOIs (NEON Maryland / Russell Glacier / Bahamas) ×{1y, all}windows, recording wall-time, peak RSS, total_obs, cells_with_data, output bytes for{pandas, arrow}and asserting byte-for-byte scalar parity.--hardadds the deferred 88°S AOI behind the flag. Lives underbenchmarks/, not run in CI.handoff="pandas"|"arrow"kwarg fromagg()→process_shard.Phase 3 — what / approach
src/zagg/configs/atl03_waveform_counts.yamlwithwaveform_counts(kind: vector,trailing_shape: 128,dtype: uint32) and a companion scalarbin_start(float32).atl03.yaml.benchmarks/region_timing.pyextended:run_one_waveformrunsatl03_waveform_countsalongside the scalaratl03for each (region, window, handoff).Recordgains atemplatefield; the output table gains atemplatecolumn.Phase 4 — merge main + option A
maininto the branch. Main brought issue Generic hierarchical filtering & targeted reads for multi-level point datasets #43 (structuredfilters:list + hierarchical levels) and Tier-2 vectors: CSR ragged (values/offsets/cell_ids) + t-digest as List<FixedSizeList<2>> #48 (ragged outputs, CSR, t-digest). The branch's_quality_maskhelper was dropped —_predicate_maskfrom Generic hierarchical filtering & targeted reads for multi-level point datasets #43 subsumes it. Both ATL03 templates moved to the structuredfilters:form. Pickedcolumn: 0(land) — per the ATL03 v3 data dictionary, TEP is uniform across surface types, so column 0 is operationally equivalent to the prior.any(axis=1)reduction.atl03_waveform_counts.yaml: droppeddem_h(segment-level path was wrong —dem_hlives in/{group}/geophys_corr/, notheights/), centred onnp.median(h_ph),bin_startrecords the per-cell median.Phase 5 —
read_planwiringThe OOM-driver fix from #43 — landed in
02764fc+ the review fold inf79ad6b._read_groupnow dispatches to a new_planned_read_groupwhendata_source.read_plan.spatial_indexis set with a completelevels+base_levelconfig; otherwise falls through to a renamed_read_group_full(verbatim of the previous body). The planned path:lat_ph/lon_ph).grid.shard_footprint(shard_key).bounds. When the bbox spans ≥180° in lon (HEALPix antimeridian / polar caps) defers to_read_group_full— the planning benefit is gone for those shards.plan_read(...). On empty parent runs returnsNone; onfull_read=True(selectivity above threshold) defers to_read_group_full.execute_read_plan. Cross-level (Phase B) filters compute a full-length base mask, then subset to the planned indices via aglobal_idxarray built fromplan.base_slices. Expression filters apply last.pandas/arrow/Nonecontract as the full path.Also fixed an off-by-one bug in
src/zagg/read_plan.py—chunk_listscarried inclusive ends with a comment claiming h5coro inclusive, but.venv/.../h5coro/h5dataset.py:182says "must provide as list of ranges [x,y)" (half-open). The previous read_plan tests passed only because the testread_fncallbacks compensated withdata[lo:hi+1]; updating chunk_lists + the 4 affected tests keeps everything green.tests/test_processing.py::TestPlannedReadGroup(13 tests): planned path bounds IO; empty-AOI short-circuit; selectivity fallback (now also with filters wired through); planned-vs-full parity (includingleaf_idrow-order assertion); cross-level filter via planned path; pad extends selection; invalid link target raises; dispatch rejects emptylevelsconfig; antimeridian grid falls back to full read; multi-slice plan exercisesglobal_idxrealignment; shipped ATL03 template through_planned_read_groupend-to-end (added in the phase-6 review fold).Phase 6 — ATL03 templates migrate to multi-level form
Landed in
5486494+ the review fold in9741eae. Both ATL03 templates now declarebase_level: photons,levels: { photons, segments }, andread_plan: { spatial_index: segments, pad: 1 }. Thesegmentslevel points at/{group}/geolocation/and carriesreference_photon_lat/reference_photon_lon+ph_index_beg/segment_ph_cnt. Link declaresindex_base: 1per the ATL03 v3 dictionary.Review fold (
9741eae) cleaned the YAMLs (dropped misleadinglevel.variablesentries — the read path never consults them; the entries were declarative-only), narrowedtest_multi_level_form_matches_atl03to compare only the link fields that driveplan_readparity (to/index_beg/count/index_base), and added the shipped-template integration test mentioned above soindex_base: 1arithmetic +pad: 1are covered offline.How tested
uv run pytest --ignore=tests/test_shardmap.py— 500 passed (test_shardmap.pycontinues to skip on the unrelatedcatalogextra).uv run ruff check src tests benchmarks— clean.test (3.12)/test (3.13)/ruff/build/build-x86_64/build-arm64all success.benchmarks/and isn't executed in CI.Questions for review
Both standing items moved out into #61 for a follow-up PR (per @espg's direction in #47):
column: 0vs.any(axis=1)decision (A / B / C from ATL03 template + region benchmark + hierarchical-read wiring (Refs #30, #43) #47 (comment)). The PR ships with (A) as default.processing.pydedup between_planned_read_groupand_read_group_full(CLAUDE.md §4 module-size flag — now ~1448 lines).Neither blocks this PR's correctness; both are tracked at #61 for later.
Authored by Claude for @espg under the #30 routine.