Skip to content

ATL03 template + region benchmark + hierarchical-read wiring (Refs #30, #43)#47

Merged
espg merged 18 commits into
mainfrom
claude/30-benchmark-template
Jun 18, 2026
Merged

ATL03 template + region benchmark + hierarchical-read wiring (Refs #30, #43)#47
espg merged 18 commits into
mainfrom
claude/30-benchmark-template

Conversation

@espg

@espg espg commented Jun 16, 2026

Copy link
Copy Markdown
Member

Refs #30, Refs #43

Implements the three #30 deliverables plus the missing #43 Phase C wiring that they need to actually run on Lambda: a built-in ATL03 template, a real-data region timing benchmark, an ATL03 waveform-counts template (128-bin fixed-width vector histogram), the read_plan integration into _read_group (the #49 Phase C module landed but was never plumbed), and the multi-level ATL03 templates that ride it. Without phases 5+6 the ATL03 templates hit the ~245 MB-per-beam coord-read floor and OOM on Lambda — per @espg in #47 (comment).

Phases

  • Phase 1 — ATL03 template + validation tests (fully CI-testable)
  • Phase 2 — benchmarks/region_timing.py (written + import/construct-checked + ruff-clean; not executed against S3) + handoff passthrough on agg()
  • Phase 3 — atl03_waveform_counts.yaml template + benchmark extension (uses kind: vector from Support non-scalar aggregation outputs (vectors + ragged per-cell payloads) #29; fully CI-testable)
  • Phase 4 — merge main + option A (drop dem_h, centre on np.median(h_ph))
  • Phase 5 — wire plan_read/execute_read_plan into _read_group (the missing Generic hierarchical filtering & targeted reads (#43) #49 Phase C integration; AOI-bounded base-rate reads via the coarse-level spatial index; full-read fallback on selectivity / no-plan / wide-bbox antimeridian; review fold added 5 more tests including multi-slice + dispatch hardening)
  • Phase 6 — migrate the two ATL03 templates to the multi-level form with read_plan: { spatial_index: segments, pad: 1 }; review fold added a shipped-template integration test through _planned_read_group covering index_base: 1 arithmetic + real {group} path templates

Phase 1 — what / approach

  • New built-in config src/zagg/configs/atl03.yaml, registered like atl06/atl06_polar (loaded via default_config("atl03")). Scalar fields only: count (density), h_min/h_max, h_mean, h_median, h_variance.
  • Confidence filter signal_conf_ph != -2 (drop only TEP photons; keep everything else) per the ATL03 v3 data dictionary.
  • Grid is rectilinear; HEALPix order-19 (~10 m) waits on mortie docs: mark SSO execution-role path out of date #35.

Phase 2 — what / approach

  • benchmarks/region_timing.py — drives the real Query → make_shardmap → runner.agg() path for ATL03 over three 10 km × 10 km AOIs (NEON Maryland / Russell Glacier / Bahamas) × {1y, all} windows, recording wall-time, peak RSS, total_obs, cells_with_data, output bytes for {pandas, arrow} and asserting byte-for-byte scalar parity. --hard adds the deferred 88°S AOI behind the flag. Lives under benchmarks/, not run in CI.
  • Threaded handoff="pandas"|"arrow" kwarg from agg()process_shard.

Phase 3 — what / approach

  • New built-in config src/zagg/configs/atl03_waveform_counts.yaml with waveform_counts (kind: vector, trailing_shape: 128, dtype: uint32) and a companion scalar bin_start (float32).
  • Same TEP confidence filter and rectilinear grid as atl03.yaml.
  • benchmarks/region_timing.py extended: run_one_waveform runs atl03_waveform_counts alongside the scalar atl03 for each (region, window, handoff). Record gains a template field; the output table gains a template column.

Phase 4 — merge main + option A

Phase 5 — read_plan wiring

The OOM-driver fix from #43 — landed in 02764fc + the review fold in f79ad6b. _read_group now dispatches to a new _planned_read_group when data_source.read_plan.spatial_index is set with a complete levels + base_level config; otherwise falls through to a renamed _read_group_full (verbatim of the previous body). The planned path:

  • Reads the coarse-level coords + link arrays in one call (~30× smaller than per-photon lat_ph/lon_ph).
  • Computes the shard's bbox from grid.shard_footprint(shard_key).bounds. When the bbox spans ≥180° in lon (HEALPix antimeridian / polar caps) defers to _read_group_full — the planning benefit is gone for those shards.
  • Calls plan_read(...). On empty parent runs returns None; on full_read=True (selectivity above threshold) defers to _read_group_full.
  • Reads base coords + variables + base-filter datasets via execute_read_plan. Cross-level (Phase B) filters compute a full-length base mask, then subset to the planned indices via a global_idx array built from plan.base_slices. Expression filters apply last.
  • Same pandas / arrow / None contract as the full path.

Also fixed an off-by-one bug in src/zagg/read_plan.pychunk_lists carried inclusive ends with a comment claiming h5coro inclusive, but .venv/.../h5coro/h5dataset.py:182 says "must provide as list of ranges [x,y)" (half-open). The previous read_plan tests passed only because the test read_fn callbacks compensated with data[lo:hi+1]; updating chunk_lists + the 4 affected tests keeps everything green.

tests/test_processing.py::TestPlannedReadGroup (13 tests): planned path bounds IO; empty-AOI short-circuit; selectivity fallback (now also with filters wired through); planned-vs-full parity (including leaf_id row-order assertion); cross-level filter via planned path; pad extends selection; invalid link target raises; dispatch rejects empty levels config; antimeridian grid falls back to full read; multi-slice plan exercises global_idx realignment; shipped ATL03 template through _planned_read_group end-to-end (added in the phase-6 review fold).

Phase 6 — ATL03 templates migrate to multi-level form

Landed in 5486494 + the review fold in 9741eae. Both ATL03 templates now declare base_level: photons, levels: { photons, segments }, and read_plan: { spatial_index: segments, pad: 1 }. The segments level points at /{group}/geolocation/ and carries reference_photon_lat/reference_photon_lon + ph_index_beg/segment_ph_cnt. Link declares index_base: 1 per the ATL03 v3 dictionary.

Review fold (9741eae) cleaned the YAMLs (dropped misleading level.variables entries — the read path never consults them; the entries were declarative-only), narrowed test_multi_level_form_matches_atl03 to compare only the link fields that drive plan_read parity (to / index_beg / count / index_base), and added the shipped-template integration test mentioned above so index_base: 1 arithmetic + pad: 1 are covered offline.

How tested

  • uv run pytest --ignore=tests/test_shardmap.py500 passed (test_shardmap.py continues to skip on the unrelated catalog extra).
  • uv run ruff check src tests benchmarks — clean.
  • CI on the head commit: test (3.12) / test (3.13) / ruff / build / build-x86_64 / build-arm64 all success.
  • Benchmark imports + constructs cleanly without creds; the region-timing run itself lives under benchmarks/ and isn't executed in CI.

Questions for review

Both standing items moved out into #61 for a follow-up PR (per @espg's direction in #47):

Neither blocks this PR's correctness; both are tracked at #61 for later.


Authored by Claude for @espg under the #30 routine.

@espg espg added the implement label Jun 16, 2026
@espg

espg commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

🤖 from Claude

Phase 1 pushed (fb8ef18 phase 1 of issue #30): ATL03 template + validation tests.

Gates: pytest 288 passed / 1 skipped (spherely fork); ruff-check + codespell clean on changed files. Two items raised under "Questions for review" in the PR body (the signal_conf_ph 2-D surface-type column, and pre-existing ruff format repo drift from the pinned-hook version mismatch).

Next: Phase 2benchmarks/region_timing.py (NEON Maryland / Russell Glacier / Bahamas × {1d,5d,15d,30d,91d,1y}; --hard 88°S deferred behind a flag; written + import-checked + ruff-clean, not run against S3).


Generated by Claude Code

@espg espg left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Phase 1 review (commit fb8ef18) against issue #30 acceptance criteria and CLAUDE.md. The _quality_mask helper + op wiring is clean and the ATL06 default is preserved exactly — but the ATL03 template as written will crash at runtime on real data. One blocker, plus a few smaller items.


🔴 Blocker — signal_conf_ph is 2-D; the template will raise IndexError on real ATL03 granules

/{group}/heights/signal_conf_ph in ATL03 is shaped (n_photons, 5) — one column per surface type (land/ocean/sea-ice/land-ice/inland-water). The yaml comment acknowledges this ("this template reads the land surface_type column (index 0)"), but the code never selects column 0. In _read_group:

q_flag = data[qf_path][mask_sliced]          # 2-D: (n_rows, 5)
quality_mask = _quality_mask(q_flag, quality_filter)  # returns 2-D mask (n_rows, 5)
...
values = values[quality_mask]                # values is 1-D h_ph -> IndexError

Reproduced directly:

2D mask shape: (5, 5)
FAIL applying 2D mask to 1D values: IndexError too many indices for array: array is 1-dimensional, but 2 were indexed

So default_config("atl03") loads and validates fine, but the first granule read fails. Issue #30(b) specifies signal_conf_ph != 0, which for a 2-D column requires picking the surface-type index. Options:

  1. Add a column:/index: key to quality_filter (e.g. index: 0) and slice q_flag[:, index] in _read_group before _quality_mask. Smallest, explicit, testable.
  2. Hardcode column 0 when the flag array is 2-D (fragile; surface type 0 = land isn't always what an ice user wants).

Either way this needs a test that feeds a 2-D q_flag through the read path — see coverage gap below.

🟠 mean / median / var propagate NaN — confirm that's intended for h_ph

The template uses plain mean/median/var, not the NaN-aware family (nanmean/…). A single NaN/fill h_ph value in a cell nukes that cell's h_mean/h_median/h_variance to NaN (verified: np.mean([1,2,nan]) -> nan). For ATL06 that's the documented contract (fill values are real), but ATL03 photon heights are a different product. If geolocated h_ph can carry invalid/fill values, these should be nanmean/nanmedian/nanvar. If h_ph is guaranteed clean post-confidence-filter, fine as-is — but please state which, since it's a silent correctness lever.

Note: h_variance is _kernel_able (var→variance kernel), so on the experimental arrow-kernel path it's computed by pyarrow and agrees only within KERNEL_RTOL, not byte-identical — consistent with the documented dual contract. mean/median are not kernel-able (no "mean" key in _KERNEL_FUNCS, median absent), so they degrade gracefully to the numpy fallback. That part is correct.

🟡 validate_config doesn't validate quality_filter.op

A typo'd op (e.g. op: neq) passes validate_config and only blows up at runtime via _quality_mask's ValueError, deep inside a Lambda worker. Cheap to validate up front alongside the other data_source checks. Minor, but it's the kind of config error you want to catch at load time.

🟡 Test coverage gap

TestQualityMask covers the helper in isolation (good — eq default, ne, unknown-op raise), and TestATL03Template asserts the yaml's shape. But nothing exercises:

  • a 2-D q_flag through _quality_mask / _read_group (the blocker above would have been caught),
  • the op: ne mask actually wired through _read_group end-to-end.

CLAUDE.md §4 ("every behavioral change needs tests"): the op addition's integration into _read_group is untested. Adding the 2-D fixture test would cover both.

Non-blocking / confirmations

  • ATL06 regression: none. atl06.yaml has value: 0 with no op, so _quality_mask defaults to eq — identical to the old q_flag == value. ATL06 outputs stay byte-for-byte.
  • default_config("atl03") loads + validates; all six scalar functions resolve. Matches #30(a) scalar-only and #29 deferral.
  • ✅ Rectilinear grid only, no HEALPix — matches #30(c)/mortie #35.
  • ✅ The 8 new tests pass; ruff check clean on the Python files.
  • ruff format --check reports the two test files + processing.py would reformat, but the diffs are pre-existing (fixture/blank-line spacing, _read_group signature wrap) — not introduced here, so flagging rather than attributing. pre-commit/§4 local-green will trip on it regardless; worth a separate cleanup, not this PR's job.

Bottom line: the helper/op work is solid and ATL06-safe, but the ATL03 template can't actually run until the 2-D signal_conf_ph column-selection is handled. I'd treat that as a phase-1 fix before advancing.


Generated by Claude Code

@espg

espg commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

🤖 from Claude

Folded the Phase 1 self-review's diff-scoped findings (commits ca52b90, 3893d2d):

  • Blocker fixed — 2-D signal_conf_ph. The review correctly caught that signal_conf_ph is (n_photons, 5), so the old path would have built a 2-D mask and crashed on h_ph[mask]. _quality_mask now takes an optional column selector (slices the surface-type column before comparing; raises if a 2-D flag has no column, or a column is set on a 1-D flag). The template sets column: 0 (land). Added 4 tests including one that runs the shipped template's filter against a 2-D flag.
  • op validation. validate_config now rejects any quality_filter.op other than eq/ne at load time (was runtime-only). Added a test.

Two review findings I'm leaving standing for your call rather than acting on:

  1. NaN-aware reductions. The template uses plain mean/median/var. If post-filter h_ph can carry NaN/fill, those nuke the cell. ATL06 treats its h_li as clean post-quality-filter; should ATL03 do the same, or switch to nanmean/nanmedian/nanvar? (One-line change per field if you want nan-aware.)
  2. The signal_conf_ph land-column default (index 0) — fine for Maryland, but Bahamas bathymetry wants an ocean/sea index and Russell wants land-ice. The column knob makes this per-config; the benchmark (Phase 2) will pick per region. Flagging in case you'd prefer a different default.

Gates after fold: pytest 292 passed / 1 skipped; ruff check src tests clean. Moving on to Phase 2 (benchmarks/region_timing.py).


Generated by Claude Code

@espg

espg commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

🤖 from Claude

Phase 2 pushed (9956c0a, 4427b7f): the real-data region timing benchmark + a handoff passthrough.

  • benchmarks/region_timing.py — drives the real Query → make_shardmap → runner.agg() path for ATL03 over the three 10 km × 10 km AOIs (NEON Maryland / Russell Glacier / Bahamas) × {1d,5d,15d,30d,91d,1y}, recording wall-time, peak RSS, total_obs, cells_with_data, output bytes for {pandas, arrow} and asserting byte-for-byte scalar parity. --hard adds the deferred 88°S AOI behind the flag. Output to stdout + optional --out file. Lives under benchmarks/, not run in CI (needs earthaccess/S3 creds); actual S3 execution is left to a credentialed session.
  • To run both carriers through the public path, threaded handoff="pandas"|"arrow" from agg() down to process_shard (default pandas; ATL06 path unchanged). Added TestHandoffPassthrough.

Gates: pytest 295 passed / 1 skipped; ruff check src tests benchmarks clean; the benchmark imports + constructs cleanly without creds (region boxes verified ~10 km, window parsing, NaN-aware parity helper). Both phases are now complete — PR body checklist is done. Marking the PR ready for @espg review; leaving it draft + implement per the routine (a human merges).


Generated by Claude Code

@espg espg left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Fresh-context adversarial review of the Phase 2 diff (9956c0a + 4427b7f) against issue #30 acceptance and CLAUDE.md. Overall this is in good shape — the spec is met and I verified the wiring against the real signatures. No blockers. Findings below, ordered by importance.

Verified working (not just claimed):

  • Imports and constructs with no creds touched — exec'd the module standalone; from zagg.runner import agg / from zagg.config import ... are import-time-safe, and all catalog/S3 calls live inside run_one behind the __main__ guard. ✓ (#30's hard requirement)
  • All four regions measure ~10.0 km × 10.0 km and sit at the right places: neon_maryland (-76.56, 38.89, matches the make_shardmap CLI example bbox), russell_glacier (-50.0, 67.09, on the ice E of Kangerlussuaq), bahamas (-76.0, 24.0), antarctica_88s (0, -88). The 1/cos(lat) lon scaling holds E-W extent at high latitude (Russell/88S both come out ~10 km). ✓
  • Catalog→agg wiring is correct: Query("ATL03", version, start, end, region=bbox) matches the dataclass field order (short_name, version, start_date, end_date, region, provider); make_shardmap(query, grid).to_json(path) and agg(cfg, catalog=, store=, handoff=, max_cells=, max_workers=, overwrite=) all match signatures (agg's config is positional, rest keyword-only). ✓
  • _store_scalars resolves from_config(cfg).group_path and reads get_data_vars(cfg) = [count, h_min, h_max, h_mean, h_median, h_variance] — the correct group and var set; x/y coords are excluded (identical by construction, so no parity impact). ✓
  • Parity is sound and NaN-aware: np.array_equal(x, y, equal_nan=np.issubdtype(x.dtype, np.floating)) correctly enables equal_nan only for floats (passing equal_nan=True on the int32 count would raise). Both carriers load the same catalog JSON and max_cells truncates deterministically to the first N shards, so both runs aggregate the identical cell set. ✓
  • signal_conf_ph != 0 confidence filter + rectilinear grid come from the shipped atl03.yaml template (not hand-rolled here), as the spec intends. ✓
  • ruff check + format --check clean; both new TestHandoffPassthrough tests pass. ✓

1. (minor, correctness) handoff is silently dropped on the lambda backend. In runner.agg, _run_lambda(...) is called without handoff=, so agg(backend="lambda", handoff="arrow") runs pandas with no error. The docstring does say "Only honored by the local backend," and the benchmark only uses local, so this is acceptable for a benchmarking flag — but a silent no-op invites a future foot-gun. Options: (a) leave as-is (documented); (b) raise ValueError if handoff != "pandas" and backend == "lambda"; (c) add a one-line note in the PR body that lambda-handoff is out of scope. I'd lean (a)+(c).

2. (minor, tidiness) Region.bbox leaks np.float64 into config/Query/JSON. _box uses np.cos/np.radians, so lon_min/lon_max are numpy scalars; cfg.output["grid"]["bounds"] and the Query(region=...) bbox then carry np.float64 (confirmed at runtime). It works today, but to_json and any strict bounds validation are happier with plain floats. Cheap fix: cast in _box (float(lon - half_lon), …) or in the bbox property.

3. (minor, redundancy) double-copy in _region_config. default_config("atl03") already returns a freshly-parsed config (re-reads the YAML each call), then copy.deepcopy(cfg) copies it again. The deepcopy is harmless but redundant given each call is fresh; the copy import exists only for this. Not load-bearing — drop it or keep for defensiveness.

4. (test coverage) passthrough tests cover _process_and_write, not _run_local. TestHandoffPassthrough asserts _process_and_write forwards handoff and defaults to pandas — good, and the default test confirms the ATL06/pandas path is unchanged. But the _run_local_process_and_write hop (the driver=driver, handoff=handoff kwargs in the executor submit) isn't directly exercised. Low risk since it's a literal passthrough, but a one-test extension monkeypatching _process_and_write to capture the kwarg from a _run_local call would close the gap. Optional.

5. (spec / RSS caveat — please confirm) peak RSS is process-cumulative, not per-run. resource.getrusage(RUSAGE_SELF).ru_maxrss is a high-water mark for the whole process, so it never decreases across (region × window × handoff) iterations — the arrow run in a pair, and every later region, inherits the max of everything before it. The KB→MB conversion is right (/1e3 Linux, /1e6 darwin), but the recorded peak_rss_mb is not the memory of that run. Issue #30 asks for per-(region×window) peak RSS for each carrier. Options: (a) acknowledge the caveat in a comment/PR-body and keep the simple metric; (b) fork each run into a subprocess and read its ru_maxrss (true per-run peak); (c) sample RSS via a thread during the run for a per-run delta. (a) is probably fine for a relative benchmark but the limitation should be stated so a reader doesn't over-trust the column.

6. (nit) --out appends (open(..., "a")) with no run delimiter/header separation, so reruns accumulate stacked tables. Fine if intentional; a timestamp separator would help. Optional.

Nothing here blocks advancing. Items 1, 2, and the #5 caveat note are the ones I'd fold into the next phase commit; 3, 4, 6 are taste.


Generated by Claude Code

@espg

espg commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

Confidence filter signal_conf_ph != 0 (retain flags 1–4) per @espg (#30 (comment)). The existing quality_filter only did equality

this should actually be signal_conf_ph != -2 ; we want to filter the TEP events, and keep everything else. We may decide to switch to signal_conf_ph != [-2, 0] in the future though

If this doesn't cleanly resolve, let me know in your comments, and just disable the signal_conf_ph flag for now.

Sweeps three 10 km × 10 km AOIs (NEON Maryland, Russell Glacier, Bahamas bathymetry) over windows {1d, 5d, 15d, 30d, 91d, 1y}

this was too restrictive; we only have 10 granules for the whole year over the AOI. We should do 1y (2025) and all years (2018 - Jan 2026 ; we can clip 2026 so it stays stable on future runs and doesn't add another pass while we're tracking benchmark drift).

To let the benchmark drive both carriers through the public path, I threaded an optional handoff="pandas"|"arrow" kwarg from agg() → _run_local → _process_and_write → process_shard (default "pandas", ATL06 path unchanged; process_shard already validated the value).

what does ATL06 path unchanged refer to? We want ATL06 for this.

signal_conf_ph surface-type default. Template uses the land column (index 0). Bahamas bathymetry likely wants an ocean/sea index and Russell a land-ice index — the new column knob makes this per-config; the benchmark picks per region if you want region-specific surface types (currently all use the template default).

see first comment above -- we want to filter our TEP events only, across all surface types, and keep everything else

mean / median / var propagate NaN — confirm that's intended for h_ph

sure, fine for now

@espg espg removed the waiting label Jun 17, 2026

@espg espg left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Fresh-context adversarial self-review of this run's diff (cae77fa..771870e) addressing @espg's review feedback.

Item 1 — TEP filter (_quality_mask + atl03.yaml): correct. The change drops the single-surface-type column: 0 keying and uses value: -2, op: ne reduced across all surface-type columns via match.any(axis=1). I verified the reduction semantics: (q != -2).any(axis=1) ("keep if any column is non-TEP") is provably equivalent by De Morgan to ~(q == -2).all(axis=1) ("drop only when every column is -2"). Since genuine TEP photons are flagged -2 in all 5 surface-type columns (per the ATL03 v007 data dictionary), this drops exactly the TEP photons and keeps everything else — matching @espg's "filter TEP events only, across all surface types, and keep everything else." A photon that is -2 in only some columns is kept, which is correct (it is not a TEP photon). The single-column selector path is preserved and still tested for backward compatibility (ATL06 eq path unchanged).

Item 2 — benchmark windows: correct and matches the spec exactly. WINDOWS = {"1y": ("2025-01-01","2026-01-01"), "all": ("2018-01-01","2026-01-01")}. The all upper bound is a static literal clipped at 2026-01-01, so future runs stay stable and don't drift as new granules land. The old --end-date/timedelta/days machinery is fully removed with no dangling references; run_one, parse_args, and main are consistent.

Folded this run: the review caught one stale docstring example (!= 0!= -2) in _quality_mask's op-list — fixed in 771870e.

No blockers, no majors. Gates green: ruff check src tests benchmarks clean, pytest 296 passed / 1 skipped.

Inline notes below.


Generated by Claude Code

@espg

espg commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

🤖 from Claude

Addressed two of the three review items; the third needs your call (a clarifying question below). Pushed cae77fa, 6b6c74a, 771870e.

1. signal_conf_ph TEP filter — implemented the clean fix (not disabled). Changed the template from signal_conf_ph != 0 on the land column to != -2 reduced across all surface types, exactly per your "filter our TEP events only, across all surface types, and keep everything else."

  • src/zagg/configs/atl03.yaml: quality_filter is now value: -2, op: ne with no column selector (the old column: 0 single-surface-type keying is gone).
  • src/zagg/processing.py: _quality_mask now reduces a 2-D flag across columns when no column is given:
    if match.ndim == 2:
        return match.any(axis=1)   # ne: keep if ANY surface type is non-TEP
    This fit cleanly into the existing op/column mechanism — no new masking DSL needed, so no need to fall back to disabling the flag. The semantics: (q != -2).any(axis=1)~(q == -2).all(axis=1), i.e. a photon is dropped only when it is -2 across every surface type (true TEP per the ATL03 v007 data dictionary); everything else is kept. The column selector is still supported (and tested) for backward compatibility; ATL06's eq path is unchanged. Tests updated: test_confidence_filter_drops_tep, test_2d_ne_reduces_across_surface_types, test_2d_eq_reduces_across_surface_types, test_atl03_template_filter_runs.
  • Future != [-2, 0] is a small extension on this same reduction path when you want it.

2. Benchmark windows — done. benchmarks/region_timing.py now sweeps exactly two windows instead of {1d,5d,15d,30d,91d,1y}:

WINDOWS = {
    "1y":  ("2025-01-01", "2026-01-01"),   # calendar 2025
    "all": ("2018-01-01", "2026-01-01"),   # full mission, upper bound clipped
}

The all upper bound is a static 2026-01-01 literal so future runs stay stable and don't drift as new granules land. The --hard 88°S AOI flag is unchanged. Removed the now-unused --end-date/days/timedelta machinery; run_one takes an explicit (start, end) range.

3. "ATL06 path unchanged" — clarifying question (blocking item 3). "ATL06 path unchanged" in my Phase 2 note meant only that threading the handoff="pandas"|"arrow" kwarg leaves ATL06's default code path byte-for-byte identical (default stays "pandas"), since the benchmark currently drives ATL03. Your "We want ATL06 for this" could mean a few different things, so rather than guess a product switch I'd like you to pick:

  • (A) The region benchmark should run on the ATL06 product instead of ATL03 — more granules per AOI, and it exercises the production pandas-vs-arrow comparison on the product you actually ship. (I'd point Query("ATL06", ...) and use default_config("atl06")/atl06_polar, adjust the per-region grid bounds, and keep the TEP/signal_conf_ph filter out since that's ATL03-specific.)
  • (B) Keep ATL03 as the benchmarked product — the "ATL06 path unchanged" line was only confirming the default handoff stays identical — and no benchmark change is needed here.
  • (C) Benchmark both ATL03 and ATL06 (add ATL06 as a second product axis alongside the existing three regions × two windows × two carriers).

Which do you want? I'll implement A or C as the next phase; B means item 3 is already satisfied.

Gates (local):

ruff check src tests benchmarks  ->  All checks passed!
pytest -q                        ->  296 passed, 1 skipped

(The 1 skip is the spherely-fork backend; pre-existing. ruff format repo-wide drift from the pinned-hook version mismatch is untouched per CLAUDE.md §4 — none of it is on lines this run changed.)

Setting waiting on the PR pending your answer to item 3. The two clear fixes above land regardless.


Generated by Claude Code

@espg espg added the waiting label Jun 17, 2026
@espg

espg commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

Keep ATL03 as the benchmarked product...

was a typo, meant to write "We want ATL03 for this"

@espg espg removed the waiting label Jun 17, 2026
@espg

espg commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

🤖 from Claude

Got it — keeping ATL03 as the benchmarked product for benchmarks/region_timing.py (your "ATL06" was the typo). That resolves the only open question on this PR, and it needs no code change: the built-in atl03.yaml template and the region benchmark already target ATL03, so the prior signal_conf_ph != -2 (TEP across surface types) filter and the 1y/all windows stand as-is.

CI is green (test 3.12/3.13, ruff, build x86_64 + arm64); the benchmark itself is benchmarks/-only and not run in CI (needs earthaccess/S3 creds). All phases complete — leaving implement + waiting (awaiting your review/merge).


Generated by Claude Code

@espg

espg commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

blocked on plan resolution in #30 for scope expansion of this pr.

unblocked now, see recent discussion in #30

We''ll need to also resolve merge conflicts for this branch ( #40 landed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01MznHvCXwruEszMEffRTMyw
@espg

espg commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

are the aggregation templates implementing hierarchical filtering from #43 and #49 ? They won't work without the filtering (i.e., out of memory errors on the lambda workers)

@espg

espg commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

🤖 from Claude

Honest answer, after digging through the #43/#49 code on main:

Phase A (structured filters:) — yes, both templates use it. The TEP filter in atl03.yaml and atl03_waveform_counts.yaml is the structured form (op: ne, column: 0, value: -2), via the read path that #49 landed (_predicate_mask).

Phase B (multi-level levels: / base_level: / link:) — no. The templates are still flat single-level. No coarse-rate filters are declared (e.g. podppd_flag at segment rate is not in the templates). The Phase B machinery in _read_group is wired and tested (TestReadGroupCrossLevel in tests/test_processing.py:1552) — the templates just don't use it.

Phase C (read_plan / targeted hierarchical reads) — no, and there is a wiring gap. This is the OOM driver from your #43 issue body ("lat_ph+lon_ph full read is up to 245 MB per strong beam ... dominant driver of the Lambda OOMs on the ATL03 SERC benchmark"). I went looking for the read_plan integration and found:

$ grep -rn "read_plan\|plan_read\|spatial_index\|ReadPlan" src/zagg/ --include="*.py" | grep -v read_plan.py
(no results)

src/zagg/read_plan.py exists with plan_read + execute_read_plan + ReadPlan, and tests/test_read_plan.py (16 offline tests) exercises the pure functions, but nothing imports or calls them from _read_group (or anywhere else). PR #49's body said "Wired into _read_group" but the wiring isn't there on main at 4a54c85_read_group still does a full h5obj.readDatasets(coord_paths) of lat_ph/lon_ph for every group before any spatial filtering. So adding read_plan: { spatial_index: ..., pad: ... } to the YAMLs today would be ignored.

So you're right — without Phase B+C, an ATL03 region run on Lambda will keep hitting the 245 MB/beam coord-read floor, before anything I do at the aggregation handoff matters.

Concrete fix is two pieces of work, neither tiny:

  1. Wire plan_read/execute_read_plan into _read_group (the missing Generic hierarchical filtering &amp; targeted reads (#43) #49 Phase C integration). On a group, when data_source.read_plan.spatial_index names a coarse level:

    • Full-read the coarse level's lat/lon (~30× smaller than lat_ph/lon_ph).
    • Compute the shard's bbox, call plan_read(...) with the link arrays.
    • Hyperslice the base-rate coords + variables + filter datasets via execute_read_plan.
    • Falls back to today's full-read path when read_plan is absent or selectivity exceeds the threshold.
    • Needs integration tests against _FakeH5 + an offline-runnable parity check that the bounded read produces the same per-cell stats as the full-read path.
  2. Migrate the two ATL03 templates to the multi-level form + declare read_plan: { spatial_index: segment, pad: 1 }. Once (1) lands, this is mostly YAML:

    • Two levels: photon (/{group}/heights) and segment (/{group}/geolocation — carries reference_photon_lat/reference_photon_lon, ph_index_beg, segment_ph_cnt).
    • Link segment -> photon with index_beg: ph_index_beg, count: segment_ph_cnt, index_base: 1 (ATL03 is 1-based per Generic hierarchical filtering & targeted reads for multi-level point datasets #43).
    • Optional follow-on: add a level: segment filter on podppd_flag (cross-level Phase B).

Scope question for you — three options, want your call:

My recommendation is (II). The _read_group wiring is its own self-contained change with its own correctness story (parity with full-read path), and treating it as a #43 follow-up matches what actually happened — the module landed, the wiring didn't. The two ATL03 templates then migrate cleanly in this PR once that's in.

Setting waiting on this PR pending your call.


Generated by Claude Code

@espg

espg commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

Option 1 / I : Expand the PR and do (1) + (2) here as phases 5 and 6 before this PR merges.

@espg espg removed the waiting label Jun 18, 2026
@espg espg changed the title ATL03 template + real-data region benchmark (Refs #30) ATL03 template + region benchmark + hierarchical-read wiring (Refs #30, #43) Jun 18, 2026
@espg

espg commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

🤖 from Claude

Acknowledged — going with (I), folding (1) and (2) into this PR as phases 5 and 6. Updated the PR title + body to reflect the expanded scope; the phases checklist now carries 5 and 6 as the remaining items, and the original "Refs #30" is joined by "Refs #43" since phase 5 is the missing #49 Phase C wiring.

Starting on phase 5 (plan_read / execute_read_plan_read_group). Will land it with an offline _FakeH5 parity test (the bounded read produces the same per-cell stats as the full-read path), the empty-AOI short-circuit, and the selectivity fallback. Clearing waiting.


Generated by Claude Code

@espg espg left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Phase 5 wires read_plan into _read_group via a new _planned_read_group dispatch, plus a half-open-vs-inclusive fix in read_plan.py. Confirmed correct on the load-bearing pieces:

  • h5coro half-open contract is real. h5dataset.py:182 raises FatalError("must provide as list of ranges [x,y)"), and the boundary check at h5dataset.py:188 uses > self.meta.dimensions[d] (not >=), so [(0, n_base)] is in-bounds when n_base == dim. The previous [(0, n_base - 1)] was wrong; the tests passed only because read_fn callbacks did data[lo:hi+1]. Fix is correct and the test updates mirror the new semantics.
  • readDatasets accepts bare strings. h5coro.py:121–127 wraps a str element as {"dataset": dataset, "hyperslice": []}, so the no-hyperslice branch in _read_fn (h5obj.readDatasets([path])[path]) is fine.
  • global_idx / cross-level alignment is sound. execute_read_plan concatenates plan.chunk_lists in base_slices order, so lats[i] is base index global_idx[i]; cross_full[global_idx] is therefore lat-aligned, and [mask_spatial] reduces it to the same length as the per-base-filter keep_mask. AND lines up.

What I flagged (all inline) — none block the phase, but two are real correctness/perf hazards on real-world inputs:

  1. shard_footprint(...).bounds degenerates to a globe-spanning bbox for HEALPix cells that genuinely cross the antimeridian (polar caps especially) — silently neutralizes the optimization on those shards. Worth a guard or TODO.
  2. n_base = ibeg_arr[-1] - index_base + cnt_arr[-1] is a stricter assumption than the contiguity contract documented in _expand_mask_to_base — it also requires the last parent's tail to be the array length. Worth pinning down what real ATL03 actually guarantees.
  3. Test coverage gaps: the full-read fallback isn't exercised with filters wired through, and _planned_read_group's multi-slice concat + global_idx path isn't exercised end-to-end (plan_read covers it offline, but the integration path isn't).
  4. _LatBboxGrid parity test asserts df["h"] only; adding df["leaf_id"] would catch row-ordering drift between paths cheaply.
  5. CLAUDE.md §4 module-size: processing.py is past 1400 lines, and _planned_read_group is largely a structural twin of _read_group_full. Worth a follow-up to extract a shared mask/dict builder.

No new dependencies, no broad noqa / type: ignore, no disabled tests, no infra/CI touched.


Generated by Claude Code

@espg

espg commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

🤖 from Claude

Phases 5 and 6 pushed (commits 02764fc phase 5 of issue #30, 5486494 phase 6 of issue #30). Both phases run; self-review subagents are running against each in parallel.

Phase 5 — _read_groupplan_read wiring

src/zagg/processing.py:

  • New _planned_read_group(...) (~155 LoC) + _level_coord_paths(level, group) helper.
  • _read_group dispatches to _planned_read_group when data_source.read_plan.spatial_index is set + levels/base_level present; otherwise to a renamed _read_group_full(...) (verbatim of the previous body).
  • Planned path: reads coarse-level coords + link arrays in one call → computes shard bbox via grid.shard_footprint(shard_key).boundsplan_read(...) → on empty parent runs returns None; on full_read=True defers to _read_group_full; otherwise reads base coords + variables + base-filter datasets via execute_read_plan. Cross-level (Phase B) filters compute a full-length base mask, then subset to the planned indices via a global_idx array built from plan.base_slices. Expression filters apply last. Same pandas / arrow / None contract as the full path.

src/zagg/read_plan.py:

  • Off-by-one bug fix. The pre-merge code did chunk_lists.append([(base_start, base_end - 1)]) with the comment "h5coro inclusive end". h5coro's actual contract per h5dataset.py:182 is "must provide as list of ranges [x,y)" — half-open. Without this fix, plan_read's chunks would have under-read by 1 element per slice against real h5coro. The previous read_plan tests passed only because the test read_fn callbacks compensated for the off-by-one (data[lo:hi+1]); updating those to data[lo:hi] keeps all 16 tests green.

tests/test_processing.py::TestPlannedReadGroup:

  • 7 new tests: planned path bounds IO; empty-AOI short-circuit; full-read fallback on selectivity; planned-vs-full parity; cross-level filter via planned path; pad extends selection; invalid link target raises. Permissive _BboxGrid + strict _LatBboxGrid test stubs. Six segments at lats [0, 100, 200, 300, 400, 500] (wide spacing keeps the linestring-crossing sweep bounded to immediate neighbours).

Phase 6 — ATL03 templates migrate to multi-level form

src/zagg/configs/atl03.yaml + src/zagg/configs/atl03_waveform_counts.yaml:

  • Both templates now declare base_level: photons, levels: { photons, segments }, and read_plan: { spatial_index: segments, pad: 1 }.
  • The segments level points at /{group}/geolocation/ and carries reference_photon_lat/reference_photon_lon + ph_index_beg/segment_ph_cnt. Link declares index_base: 1 per the ATL03 v3 dictionary.

tests/test_config.py:

  • TestATL03Template::test_multi_level_form_for_planned_reads + test_read_plan_targets_segments_level.
  • TestATL03WaveformCountsTemplate::test_multi_level_form_matches_atl03 cross-checks the segments link is identical between the two templates so they share plan_read inputs on the same granule.

Gates

uv run pytest --ignore=tests/test_shardmap.py  →  494 passed
uv run ruff check src tests benchmarks         →  All checks passed!

test_shardmap.py continues to skip on the catalog extra; unrelated.

Standing question (still from phase 4)

The column: 0 vs .any(axis=1) semantic divergence in the structured filter — three options (A/B/C) laid out at #47 (comment). Phase 6's multi-level migration does not change this — the filter is still base-level with column: 0.

Self-review findings will land below as inline comments; will fold the diff-scoped ones and surface anything substantive.


Generated by Claude Code

@espg espg left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Inline findings for phase 5 — non-blocking; see the prior top-level review for the summary.


Generated by Claude Code

Comment thread src/zagg/processing.py
Comment on lines +911 to +913
poly = grid.shard_footprint(shard_key)
min_lon, min_lat, max_lon, max_lat = poly.bounds
bbox = (float(min_lon), float(min_lat), float(max_lon), float(max_lat))

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Antimeridian / polar HEALPix cells silently neutralize the optimization.

mort2polygon (mortie/tools.py:664) normalizes polygons that touch the antimeridian to one side, so the common case is fine. But cells that genuinely cross the antimeridian — polar caps, and any base-cell whose ring straddles ±180° — still get vertices on both sides after the lons = np.where(lons > 180, lons - 360, lons) wrap at tools.py:721. For those, poly.bounds reports a west ≈ -180, east ≈ +180 box, which plan_read will then accept ~every coarse parent against, triggering the selectivity fallback and reading the full base array.

That's a degradation, not a correctness bug — the planned path still produces correct output via full_read_read_group_full. But it silently undoes the OOM mitigation that motivated #49 for exactly the shards that are hardest to process (polar). Worth either:

  • guarding here (if max_lon - min_lon > 180, skip the optimization and call _read_group_full directly), or
  • emitting two plans for the wrapped halves and concatenating, or
  • at minimum a TODO so it doesn't get lost.

Cross-ref src/zagg/grids/healpix.py:171–179 (which returns a single Polygon without antimeridian splitting) and the discussion in mortie/tools.py:505–593 (mort2bbox).


Generated by Claude Code

Comment thread src/zagg/processing.py Outdated
Comment on lines +903 to +905
# Under the contiguity assumption (#43): ``sum(count) == n_base``, so the
# last parent's tail gives the total length without an extra header read.
n_base = int(ibeg_arr[-1]) - index_base + int(cnt_arr[-1])

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

n_base derivation needs a tighter statement of the assumption.

_expand_mask_to_base's docstring (processing.py:756) says "ranges do not overlap and together tile the full base array" — i.e. sum(count) == n_base and the union covers [0, n_base). That's what justifies the expansion at base rate, not what's computed here.

n_base = ibeg_arr[-1] - index_base + cnt_arr[-1] requires the additional assumption that the last parent's tail equals the base-array length — i.e. no trailing dropped photons, no implicit padding at the tail of the photon array. The contiguity contract as written allows tiles to cover [0, sum(count)) without committing to sum(count) == photon_array_len.

If that gap can occur on a real ATL03 granule (segment-cropped products, dropped-tail trailers), n_base underestimates len(lat_ph) and _expand_mask_to_base(..., n_base) produces a shorter mask than the actual base reads. That'd surface as a shape mismatch when cross_full[global_idx] indexes into the cross-mask later (line 1018).

Two ways to harden this:

  1. Add an explicit len(lat_ph) read for n_base (one small extra header read, vs. the OOM-risk path it's avoiding).
  2. Or commit in the docstring + a code assert: "ATL03 contract: ph_index_beg[-1] + segment_ph_cnt[-1] == len(lat_ph)" and document where that's verified.

Either way, please pin down which it is on the PR thread before this lands on real granules.


Generated by Claude Code

Comment thread src/zagg/processing.py Outdated
Comment on lines +1086 to +1092
if (
isinstance(data_source.get("read_plan"), dict)
and data_source["read_plan"].get("spatial_index")
and data_source.get("levels")
and data_source.get("base_level")
):
return _planned_read_group(h5obj, group, data_source, shard_key, grid, arrow=arrow)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Nit on the dispatch predicate: data_source.get("levels") and data_source.get("base_level") are truthy-checked, so an explicit levels = {} or base_level = "" configuration silently routes back to _read_group_full even though the user wired up a read_plan. That's almost certainly not what they meant. Cheap fix:

levels = data_source.get("levels")
base_level = data_source.get("base_level")
rp = data_source.get("read_plan")
if (
    isinstance(rp, dict)
    and rp.get("spatial_index")
    and isinstance(levels, dict)
    and base_level
):
    ...

…or raise a ValueError when read_plan.spatial_index is set but levels/base_level aren't — saves a future user a "why isn't my planned read planning?" debug session.


Generated by Claude Code

Comment thread src/zagg/processing.py
Comment on lines +842 to +1056
def _planned_read_group(
h5obj, group: str, data_source: dict, shard_key: int, grid, arrow: bool = False
):
"""Planned (AOI-bounded) read of one HDF5 group via the coarse spatial index.

Issue #43 Phase C: when ``data_source.read_plan.spatial_index`` names a coarse
level whose ``link`` points at the base level, we read the coarse coordinates
+ link arrays once (small), call :func:`zagg.read_plan.plan_read` to compute
which base-rate slices the AOI bbox actually touches, and read base-rate
coords + variables + filter datasets only over those slices via
:func:`zagg.read_plan.execute_read_plan`. This avoids the
``lat_ph`` + ``lon_ph`` full-coord read (up to ~245 MB per ATL03 beam) that
drives Lambda OOMs (issue #43 motivation).

Falls back transparently to :func:`_read_group` when:
- the empty-AOI short-circuit fires (no parents match) → return ``None``;
- ``plan_read`` flags ``full_read=True`` (selectivity above threshold);
- the cell ``signal_conf_ph``-style 2-D structured filter would be re-read
via the planned slices either way (the helper handles that uniformly).

Returns the same ``pandas.DataFrame`` / ``pyarrow.Table`` / ``None`` contract
as :func:`_read_group`. Output rows are in plan-slice / spatial-mask /
filter order — which matches the full-read path's row ordering because the
plan's runs are emitted in increasing parent index.
"""
coordinates = data_source["coordinates"]
variables = data_source["variables"]
levels = data_source["levels"]
base_level_key = data_source["base_level"]
rp = data_source["read_plan"]
spatial_index_level = rp["spatial_index"]
pad = int(rp.get("pad", 1))
full_read_threshold = float(rp.get("full_read_threshold", 0.9))

si_lvl = levels[spatial_index_level]
link = si_lvl.get("link")
if not isinstance(link, dict):
raise ValueError(
f"read_plan.spatial_index level {spatial_index_level!r} requires a 'link'"
)
if link["to"] != base_level_key:
raise ValueError(
f"read_plan.spatial_index level {spatial_index_level!r} must link "
f"directly to base level {base_level_key!r} (got link.to={link['to']!r})"
)
index_base = int(link.get("index_base", 0))

# Read coarse-level coordinates + link arrays in one go (small — geolocation
# rate is ~30x lighter than photon rate on ATL03).
si_lat_path, si_lon_path = _level_coord_paths(si_lvl, group)
ibeg_path = link["index_beg"].format(group=group)
cnt_path = link["count"].format(group=group)
coarse_data = h5obj.readDatasets([si_lat_path, si_lon_path, ibeg_path, cnt_path])
coarse_lats = coarse_data[si_lat_path]
coarse_lons = coarse_data[si_lon_path]
ibeg_arr = coarse_data[ibeg_path]
cnt_arr = coarse_data[cnt_path]

if len(coarse_lats) == 0:
return None

# Under the contiguity assumption (#43): ``sum(count) == n_base``, so the
# last parent's tail gives the total length without an extra header read.
n_base = int(ibeg_arr[-1]) - index_base + int(cnt_arr[-1])
if n_base <= 0:
return None

# Compute the shard's WGS84 bbox from the grid (every grid's shard_footprint
# returns a shapely Polygon).
poly = grid.shard_footprint(shard_key)
min_lon, min_lat, max_lon, max_lat = poly.bounds
bbox = (float(min_lon), float(min_lat), float(max_lon), float(max_lat))

plan = plan_read(
np.asarray(coarse_lats),
np.asarray(coarse_lons),
np.asarray(ibeg_arr),
np.asarray(cnt_arr),
n_base,
bbox,
index_base=index_base,
pad=pad,
full_read_threshold=full_read_threshold,
)

if not plan.parent_runs:
return None # empty AOI -- no parent intersects, skip the group entirely

if plan.full_read:
# Selectivity above threshold: many small reads would still sum to most
# of the file. Defer to the full-coord-read path; semantics identical.
return _read_group_full(h5obj, group, data_source, shard_key, grid, arrow=arrow)

# h5coro-compatible reader callback for execute_read_plan.
def _read_fn(path, hyperslice=None):
if hyperslice is None:
return h5obj.readDatasets([path])[path]
return h5obj.readDatasets([{"dataset": path, "hyperslice": hyperslice}])[path]

# ---- Read base coords + variables + filter datasets over the planned slices.
filters = filters_from_data_source(data_source)
base_structured = [
f
for f in filters
if "expression" not in f and (f.get("level") is None or f.get("level") == base_level_key)
]
coarse_structured = [
f
for f in filters
if "expression" not in f
and f.get("level") is not None
and f.get("level") != base_level_key
]
expressions = [f for f in filters if "expression" in f]

lat_path = coordinates["latitude"].format(group=group)
lon_path = coordinates["longitude"].format(group=group)
lats = execute_read_plan(plan, _read_fn, lat_path, np.float64)
lons = execute_read_plan(plan, _read_fn, lon_path, np.float64)

if len(lats) == 0:
return None

# Apply spatial / shard mask over the concatenated planned reads.
leaf_ids = grid.assign(lats, lons)
mask_spatial = grid.shards_of(leaf_ids) == shard_key
if np.sum(mask_spatial) == 0:
return None

# Read the variables and base-level filter datasets via the same plan. Read
# each distinct path once (the variable and filter dataset paths can coincide).
var_paths = {col: tmpl.format(group=group) for col, tmpl in variables.items()}
filter_paths = {id(f): f["dataset"].format(group=group) for f in base_structured}
paths_seen: set[str] = set()
arrays_by_path: dict[str, np.ndarray] = {}
for path in list(var_paths.values()) + list(filter_paths.values()):
if path in paths_seen:
continue
paths_seen.add(path)
# dtype hint isn't load-bearing -- execute_read_plan dtype-casts via
# np.asarray, which is a no-op when the source dtype already matches.
arrays_by_path[path] = execute_read_plan(plan, _read_fn, path, None)

# Base-level structured filters: ANDed keep-masks over the concatenated reads.
keep_mask: np.ndarray | None = None
for f in base_structured:
flag = arrays_by_path[filter_paths[id(f)]][mask_spatial]
fmask = _predicate_mask(flag, f)
keep_mask = fmask if keep_mask is None else (keep_mask & fmask)

# Cross-level (Phase B) filters: read coarse flags fully, expand to base
# rate (length n_base), then subset to the planned indices.
if coarse_structured:
# Build the global base-index array once: which original-base positions
# are present in the concatenated planned read.
global_idx = np.concatenate(
[np.arange(s, e, dtype=np.int64) for s, e in plan.base_slices]
)
cross_full: np.ndarray | None = None
for f in coarse_structured:
level_key = f["level"]
cf_lvl = levels[level_key]
cf_link = cf_lvl["link"]
cf_index_base = int(cf_link.get("index_base", 0))
cf_flag_path = f["dataset"].format(group=group)
cf_ibeg_path = cf_link["index_beg"].format(group=group)
cf_cnt_path = cf_link["count"].format(group=group)
cf_data = h5obj.readDatasets([cf_flag_path, cf_ibeg_path, cf_cnt_path])
cf_flag = cf_data[cf_flag_path]
cf_ibeg = cf_data[cf_ibeg_path]
cf_cnt = cf_data[cf_cnt_path]
coarse_fmask = _predicate_mask(cf_flag, f)
expanded = _expand_mask_to_base(coarse_fmask, cf_ibeg, cf_cnt, cf_index_base, n_base)
cross_full = expanded if cross_full is None else (cross_full & expanded)
# Subset the full-length mask to the concatenated planned indices, then
# to the spatial keep window so it lines up with keep_mask above.
cross_planned = cross_full[global_idx][mask_spatial]
keep_mask = cross_planned if keep_mask is None else (keep_mask & cross_planned)

if keep_mask is not None and np.sum(keep_mask) == 0:
return None

# Build the data dict (variables sliced to mask_spatial, then to keep_mask).
leaf_after_spatial = leaf_ids[mask_spatial]
data_dict: dict[str, np.ndarray] = {}
for col_name, path in var_paths.items():
values = arrays_by_path[path][mask_spatial]
if keep_mask is not None:
values = values[keep_mask]
data_dict[col_name] = values
data_dict["leaf_id"] = leaf_after_spatial[keep_mask] if keep_mask is not None else leaf_after_spatial

# Base-level expression filters (aggregation-time escape hatch, no pushdown).
for f in expressions:
cols = {c: data_dict[c] for c in variables if c in data_dict}
try:
emask = evaluate_filter_expression(f["expression"], cols)
except NameError as e:
raise NameError(
f"expression filter {f['expression']!r} references an undefined name: {e}"
) from e
if emask.shape != data_dict["leaf_id"].shape:
raise ValueError(
f"expression filter {f['expression']!r} must yield a per-row "
f"boolean mask (got shape {emask.shape})"
)
if np.sum(emask) == 0:
return None
data_dict = {k: v[emask] for k, v in data_dict.items()}

if arrow:
import pyarrow as pa

return pa.table(data_dict)
return pd.DataFrame(data_dict)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Module size / duplication (CLAUDE.md §4).

processing.py is now ~1448 lines (was 1293 before this PR); CLAUDE.md §4 says exceeding ~1000 lines requires prior discussion before continuing. This phase moves it further past the line, and _planned_read_group (lines 842–1056) is largely a structural twin of _read_group_full below it: same filter partition, same dedup-by-id pattern for var_paths/filter_paths, same base-vs-cross AND chain, same expression-filter loop, same arrow/pandas tail.

Not blocking phase 5 — but worth a follow-up that extracts a shared helper for the post-read shape:

def _apply_filters_and_build_dict(
    lats, lons, leaf_ids, mask_spatial,
    arrays_by_path, var_paths, filter_paths,
    base_structured, cross_mask_planned, expressions, variables, ...
) -> dict | None

and then both paths reduce to "read the right slices, then call the helper." Worth raising on the PR thread (or as a follow-up issue) for an @espg call before the next phase compounds it further.


Generated by Claude Code

Comment thread src/zagg/processing.py
Comment on lines +975 to +983
paths_seen: set[str] = set()
arrays_by_path: dict[str, np.ndarray] = {}
for path in list(var_paths.values()) + list(filter_paths.values()):
if path in paths_seen:
continue
paths_seen.add(path)
# dtype hint isn't load-bearing -- execute_read_plan dtype-casts via
# np.asarray, which is a no-op when the source dtype already matches.
arrays_by_path[path] = execute_read_plan(plan, _read_fn, path, None)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Small dtype subtlety worth a comment in code (not a bug): you pass dtype=None to execute_read_plan, which then does np.asarray(read_fn(...), dtype=None). That's a no-op cast (preserves source dtype), which is what you want for variables of mixed dtypes (uint8 flags vs float32 heights vs structured signal_conf rows). Worth one inline # dtype=None preserves source dtype — varies across var/filter datasets so a future reader doesn't "fix" it by hardcoding np.float64.

Also: arrays_by_path is keyed on the resolved HDF5 path string. The filter_paths = {id(f): ...} dict is keyed on filter identity so two filters can target the same dataset path without colliding the cache — same pattern as _read_group_full:1195. Looks correct, but worth making the intent explicit in a comment.


Generated by Claude Code

Comment thread tests/test_processing.py
Comment on lines +1755 to +1762
{"dataset": "/seg/podppd", "op": "eq", "value": 0, "level": "segments"}
)
if filters:
ds["filters"] = filters
return ds


def _planned_read_h5(*, qs=None, podppd=None):

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Coverage gap: full_read=True fallback isn't exercised with filters wired through.

test_full_read_fallback_on_high_selectivity runs the fallback through _read_group_full with no filters at all. The interesting failure mode for that path on real ATL03 is the combination "planned read trips selectivity threshold → fallback to _read_group_full → cross-level filter expansion runs at coarse rate." That combination's hand-off (specifically: that _read_group_full re-reads link arrays it doesn't need from the data_source's read_plan config, vs. reusing the coarse arrays already in memory in _planned_read_group) goes uncovered.

Two cheap additions:

  • Add a copy of this test with with_coarse_filter=True and confirm output still matches the no-fallback planned-path output.
  • Add a copy of test_parity_with_full_read that forces the planned side into fallback (low full_read_threshold) so the entry point differs but the result is byte-identical.

Both stay offline (no h5coro, no S3) and use the existing fixtures.


Generated by Claude Code

Comment thread tests/test_processing.py
Comment on lines +1740 to +1890
"link": {
"to": "photons",
"index_beg": "/seg/ph_index_beg",
"count": "/seg/segment_ph_cnt",
"index_base": 0,
},
},
},
"read_plan": {"spatial_index": "segments", "pad": 0},
}
filters = []
if with_base_filter:
filters.append({"dataset": "/heights/qs", "op": "eq", "value": 0})
if with_coarse_filter:
filters.append(
{"dataset": "/seg/podppd", "op": "eq", "value": 0, "level": "segments"}
)
if filters:
ds["filters"] = filters
return ds


def _planned_read_h5(*, qs=None, podppd=None):
"""Six-segment / 12-photon HDF5 stub with optional base/coarse flag arrays.

Segments live at lats 0,100,200,300,400,500 (lon 0); photons at lats
0,50,100,150,200,250,...,550 (lon 0). The wide segment spacing keeps the
``plan_read`` linestring-crossing check from sweeping unrelated segments
into the matched range -- a narrow bbox between two rep-points stays
bounded by the immediate neighbours.
"""
seg_lats = np.array([0.0, 100.0, 200.0, 300.0, 400.0, 500.0])
seg_lons = np.zeros(6)
ibeg = np.arange(0, 12, 2, dtype=np.int64)
cnt = np.full(6, 2, dtype=np.int64)
ph_lats = np.array(
[0.0, 50.0, 100.0, 150.0, 200.0, 250.0, 300.0, 350.0, 400.0, 450.0, 500.0, 550.0]
)
ph_lons = np.zeros(12)
h = np.arange(12.0, dtype=np.float32) * 10.0
arrays = {
"/seg/lat": seg_lats,
"/seg/lon": seg_lons,
"/seg/ph_index_beg": ibeg,
"/seg/segment_ph_cnt": cnt,
"/heights/lat_ph": ph_lats,
"/heights/lon_ph": ph_lons,
"/heights/h": h,
}
if qs is not None:
arrays["/heights/qs"] = np.asarray(qs)
if podppd is not None:
arrays["/seg/podppd"] = np.asarray(podppd)
return _FakeH5(arrays)


class TestPlannedReadGroup:
"""Phase C: ``_read_group`` dispatches to ``_planned_read_group`` when
``data_source.read_plan.spatial_index`` is set, bounding the base-rate IO
via the coarse-level rep-point coords + link arrays.

The shared fixture lays out 6 segments at lats ``[0, 100, 200, 300, 400,
500]`` covering 12 photons (2 each). The wide spacing keeps ``plan_read``'s
linestring-crossing sweep bounded: a bbox between two adjacent rep-points
pulls in exactly its two neighbours."""

def test_planned_path_bounds_io_to_matched_segments(self):
# Bbox (-0.1, 175, 0.1, 225) directly contains segment 2 (lat=200);
# segment 1's (lat 100 -> 200) linestring crosses the lower edge so
# plan_read sweeps segment 1 in too. Two adjacent segments -> one
# contiguous run -> photons 2..5 in the base array.
ds = _planned_read_data_source()
h5 = _planned_read_h5()
grid = _BboxGrid((-0.1, 175.0, 0.1, 225.0))
df = _read_group(h5, "gt1l", ds, 0, grid)
assert df["h"].tolist() == [20.0, 30.0, 40.0, 50.0]

def test_empty_aoi_returns_none(self):
# Bbox far from any segment rep-point or linestring -> no parents
# match -> short-circuit return None before any base-rate read.
ds = _planned_read_data_source()
h5 = _planned_read_h5()
grid = _BboxGrid((10000.0, 10000.0, 10001.0, 10001.0))
assert _read_group(h5, "gt1l", ds, 0, grid) is None

def test_full_read_fallback_on_high_selectivity(self):
# full_read_threshold lowered so any plan covering >=10% of n_base
# (>=2/12 photons) triggers the fallback. Same bbox as the basic test
# selects 4/12 = 33% -> falls through to _read_group_full and reads
# everything; the permissive grid keeps all 12.
ds = _planned_read_data_source()
ds["read_plan"]["full_read_threshold"] = 0.1
h5 = _planned_read_h5()
grid = _BboxGrid((-0.1, 175.0, 0.1, 225.0))
df = _read_group(h5, "gt1l", ds, 0, grid)
assert df["h"].tolist() == [float(i * 10) for i in range(12)]

def test_parity_with_full_read(self):
# Both paths produce the same row set when the spatial mask is keyed
# on lat: the planned read narrows IO to photons 2..5 (via plan_read);
# _LatBboxGrid.shards_of further restricts to photons with lat in
# bbox range (photon 4, lat=200). qs drops nothing in-shard.
qs = np.array([0] * 12, dtype=np.int8)
h5 = _planned_read_h5(qs=qs)
grid = _LatBboxGrid((-0.1, 175.0, 0.1, 225.0))

ds_planned = _planned_read_data_source(with_base_filter=True)
ds_full = {
"coordinates": {
"latitude": "/heights/lat_ph",
"longitude": "/heights/lon_ph",
},
"variables": {"h": "/heights/h"},
"filters": [{"dataset": "/heights/qs", "op": "eq", "value": 0}],
}

df_planned = _read_group(h5, "gt1l", ds_planned, 0, grid)
df_full = _read_group(h5, "gt1l", ds_full, 0, grid)
# Photon 4 (lat=200, h=40) is the only one in the bbox lat range.
assert df_planned["h"].tolist() == [40.0]
assert df_full["h"].tolist() == [40.0]

def test_coarse_filter_via_planned_path(self):
# Cross-level (Phase B) filter ANDs with the planned path: drop
# segment 1 via podppd; segment 2 (also pulled in by the linestring
# sweep) survives. Photons 2,3 dropped; 4,5 kept.
ds = _planned_read_data_source(with_coarse_filter=True)
podppd = np.array([0, 1, 0, 0, 0, 0], dtype=np.int8)
h5 = _planned_read_h5(podppd=podppd)
grid = _BboxGrid((-0.1, 175.0, 0.1, 225.0))
df = _read_group(h5, "gt1l", ds, 0, grid)
assert df["h"].tolist() == [40.0, 50.0]

def test_pad_extends_selection(self):
# bbox (490..510) covers segment 5 (last, lat=500) directly; segment
# 4's linestring (400 -> 500) crosses the lower edge. With pad=0:
# segments 4,5 -> photons 8..11. With pad=1: segments 3,4,5,6(clamped
# back to 5) -> photons 6..11.
ds = _planned_read_data_source()
ds["read_plan"]["pad"] = 1
h5 = _planned_read_h5()
grid = _BboxGrid((-0.1, 490.0, 0.1, 510.0))
df = _read_group(h5, "gt1l", ds, 0, grid)
assert df["h"].tolist() == [60.0, 70.0, 80.0, 90.0, 100.0, 110.0]

def test_invalid_link_target_raises(self):
# The spatial_index level's link must point at the base level.
ds = _planned_read_data_source()
ds["levels"]["segments"]["link"]["to"] = "not_a_level"
h5 = _planned_read_h5()
grid = _BboxGrid((-0.1, 175.0, 0.1, 225.0))

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Coverage gap: multi-slice plan.base_slices not exercised end-to-end.

Every test in TestPlannedReadGroup produces a single contiguous run (the wide segment spacing in _planned_read_h5 was deliberately chosen for that). That's good for keeping the assertions tractable, but it means the load-bearing concat-and-realign code paths in _planned_read_group:

  • lats = execute_read_plan(...) over multi-chunk chunk_lists,
  • global_idx = np.concatenate([np.arange(s, e) for s, e in plan.base_slices]) with len(base_slices) > 1,
  • cross_full[global_idx][mask_spatial] realignment when multiple ranges are present,

are not actually run by these tests. plan_read unit tests cover the multi-run case offline, but the integration through _planned_read_group (where mask_spatial/cross_planned lengths and data_dict["leaf_id"] ordering can diverge between paths) doesn't get exercised.

A targeted test: bbox composed of two disjoint mini-boxes by enabling pad=0 and crafting a fixture with two AOI-matched segments separated by an unmatched one. Then assert df["h"] ordering matches the full-read path on the same fixture.


Generated by Claude Code

Comment thread tests/test_processing.py
Comment on lines +1801 to +1819
The shared fixture lays out 6 segments at lats ``[0, 100, 200, 300, 400,
500]`` covering 12 photons (2 each). The wide spacing keeps ``plan_read``'s
linestring-crossing sweep bounded: a bbox between two adjacent rep-points
pulls in exactly its two neighbours."""

def test_planned_path_bounds_io_to_matched_segments(self):
# Bbox (-0.1, 175, 0.1, 225) directly contains segment 2 (lat=200);
# segment 1's (lat 100 -> 200) linestring crosses the lower edge so
# plan_read sweeps segment 1 in too. Two adjacent segments -> one
# contiguous run -> photons 2..5 in the base array.
ds = _planned_read_data_source()
h5 = _planned_read_h5()
grid = _BboxGrid((-0.1, 175.0, 0.1, 225.0))
df = _read_group(h5, "gt1l", ds, 0, grid)
assert df["h"].tolist() == [20.0, 30.0, 40.0, 50.0]

def test_empty_aoi_returns_none(self):
# Bbox far from any segment rep-point or linestring -> no parents
# match -> short-circuit return None before any base-rate read.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

test_parity_with_full_read asserts only df_planned["h"].tolist() == df_full["h"].tolist() == [40.0]. Since leaf_id is what carries row-order information through both paths (and is the column whose ordering would expose a re-slice bug between plan-slice order vs. base-array order), please add:

assert df_planned["leaf_id"].tolist() == df_full["leaf_id"].tolist()

Free strengthening of the parity claim, given _LatBboxGrid already encodes lat into leaf_id. Without it, a future regression that reorders rows between paths would still pass on a single-row result.

Also: _LatBboxGrid is deliberately a hack (lat-in-leaf-id) — a one-line comment in the test file pinning down "this is only valid because both paths feed identical lats to assign(); real grids return integer cell ids" would save the next reader 10 minutes.


Generated by Claude Code

Comment thread src/zagg/read_plan.py
Comment on lines 163 to +172
base_slices.append((base_start, base_end))
chunk_lists.append([(base_start, base_end - 1)]) # h5coro inclusive end
chunk_lists.append([(base_start, base_end)]) # h5coro half-open [start, end)
total_base += base_end - base_start

# -- Selectivity fallback --
if total_base > full_read_threshold * n_base:
return ReadPlan(
parent_runs=[(0, n_coarse - 1)],
base_slices=[(0, n_base)],
chunk_lists=[[(0, n_base - 1)]],
chunk_lists=[[(0, n_base)]],

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Confirmed fix: h5coro is half-open. h5coro/h5dataset.py:182 raises FatalError("must provide as list of ranges [x,y)") and the boundary check at h5dataset.py:188 uses self.hyperslice[d][1] > self.meta.dimensions[d] (strict >), so (0, n_base) is in-bounds when n_base == dim. The selectivity-fallback's [(0, n_base)] is correct.

The previous (start, end-1) tests only passed because the test callbacks compensated with data[lo:hi+1]; updating them to data[lo:hi] aligns them with how _planned_read_group._read_fn actually drives h5coro. No subtle landmine here, but worth a one-line cross-reference in the docstring so this convention survives the next refactor:

chunk_lists : list of list of (int, int)
    h5coro-style ``[(start, end)]`` hyperslice lists (half-open
    ``[start, end)``, matches h5coro's ``h5dataset.py:182``
    "must provide as list of ranges [x,y)" contract).

You've got most of that — just adding the explicit h5dataset.py:182 anchor makes the convention searchable.


Generated by Claude Code

@espg espg left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Phase 6 is overwhelmingly YAML and reads cleanly against the phase 5 wiring + the issue #43 body's schema. The HDF5 paths (/{group}/geolocation/ for reference_photon_lat/reference_photon_lon/ph_index_beg/segment_ph_cnt) match the schema example in #43, and index_base: 1 matches #43's "both 1-based" claim — and is arithmetically consistent end-to-end (e.g. ph_index_beg=[1,3,5], segment_ph_cnt=[2,2,2]n_base = 5 - 1 + 2 = 6; per-parent base_start = ibeg[s] - 1 gives 0/2/4, which match 0-based starts). The signal_conf_ph filter dataset and data_source.variables map are unchanged, so the read-path dedup in _planned_read_group (paths_seen) cannot double-read.

Five diff-scoped findings below; the biggest two are (1) no integration test exercises the shipped templates through _planned_read_groupTestPlannedReadGroup uses synthetic data with index_base: 0, so the index_base: 1 arithmetic plus the exact ATL03 path templates and pad: 1 boundary behavior have zero offline coverage — and (2) signal_conf_ph in levels.photons.variables is declarative-only and will mislead the next reader, because the read path never consults level.variables.


Generated by Claude Code

@espg espg left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Phase 6's YAML reads cleanly against the phase 5 wiring + #43 spec. HDF5 paths (/{group}/geolocation/ for reference_photon_lat/reference_photon_lon/ph_index_beg/segment_ph_cnt) match #43's schema example; index_base: 1 matches the "both 1-based" claim and is arithmetically consistent (ph_index_beg=[1,3,5] + cnt=[2,2,2]n_base = 5-1+2 = 6, base_start = ibeg[s]-1 ∈ {0,2,4}). The signal_conf_ph filter and data_source.variables are unchanged so dedup in _planned_read_group (paths_seen) holds.

Five findings, the two biggest first:

  1. No integration test exercises the shipped templates through _planned_read_groupTestPlannedReadGroup uses synthetic data with index_base: 0 and bespoke /seg/ paths, so this PR has zero offline coverage of index_base=1 arithmetic + the exact ATL03 path templates resolving end-to-end. Mirrors the phase 4 gap that was folded as test_atl03_shipped_template_2d_signal_conf.
  2. signal_conf_ph in levels.photons.variables is declarative-only — read path uses data_source.variables, never level.variables. Misleads the next reader.
  3. Asymmetric path conventions: coordinates are bare names (joined to path), link fields are absolute templates. Both work today but they will trip the next person who copies #43's schema example literally (it uses bare names for both).
  4. Cross-template link-equality assertion is too tight — whole-dict comparison couples to fields that don't affect plan parity.
  5. pad: 1 choice is undocumented and not exercised — both shipped templates use pad: 1 but TestPlannedReadGroup uses pad: 0. No test covers the chosen default.

None are correctness bugs against the synthetic test surface. (1) is the one I'd hold over before merge — without it a future YAML edit that drops index_base (defaulting to 0) lands green.


Generated by Claude Code

Comment thread tests/test_config.py
def test_read_plan_targets_segments_level(self, atl03_config):
rp = atl03_config.data_source["read_plan"]
assert rp["spatial_index"] == "segments"
assert rp["pad"] == 1

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Coverage gap: phase 6 has no offline integration test that runs the shipped templates through _planned_read_group.

These three new tests assert the YAML parses into the expected shape — base_level == "photons", two level keys, pad == 1, index_base == 1. They do not verify the load-bearing combination at runtime:

  • the index_base: 1 arithmetic against the actual paths the YAML declares (_planned_read_group does n_base = ibeg[-1] - 1 + cnt[-1] and base_start = ibeg[s] - 1 — both go through int(...) casts that mask sign errors silently if the YAML ever ships index_base: 0 by mistake),
  • the absolute path templates in the link arrays correctly resolving via link["index_beg"].format(group=group) against a _FakeH5 with a realistic ATL03 group layout (/gt1l/heights/... + /gt1l/geolocation/...),
  • the segments-level coordinates: {latitude: reference_photon_lat, longitude: reference_photon_lon} joining against path: "/{group}/geolocation" to produce /gt1l/geolocation/reference_photon_lat (via _level_coord_paths at processing.py:837).

TestPlannedReadGroup exercises _planned_read_group end-to-end, but its _planned_read_data_source() fixture uses index_base: 0 and bespoke /seg/ paths (tests/test_processing.py:1744) — so the index_base: 1 code path through the shipped template has zero test coverage in this PR.

This is the same gap pattern the phase 4 review caught and folded as TestReadGroupFilters::test_atl03_shipped_template_2d_signal_conf: load default_config("atl03").data_source, rewrite the dataset paths onto a _FakeH5 with a realistic ATL03-shape group layout (6 segments tiling 12 photons with 1-based ph_index_beg = [1, 3, 5, 7, 9, 11]), and run it through _read_group (which now dispatches to _planned_read_group). Confirm the planned path produces the same row set as the full-read path on the same fixture, and assert at least one pad: 1 boundary case (AOI sitting between two segments' rep-points).

Without that test, a future YAML edit that drops index_base (defaulting to 0) or that swaps the link to a different array name would land green.


Generated by Claude Code

Comment thread src/zagg/configs/atl03.yaml Outdated
photons:
path: "/{group}/heights"
coordinates: {latitude: lat_ph, longitude: lon_ph}
variables: {h_ph: h_ph, signal_conf_ph: signal_conf_ph}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

signal_conf_ph in levels.photons.variables is declarative-only and will mislead the next reader.

I checked the read path: _planned_read_group (src/zagg/processing.py:973) builds var_paths from data_source["variables"] (the top-level flat map), not from levels[base_level]["variables"]. _read_group_full does the same. A grep confirms nothing in src/zagg/ consumes level["variables"] for read or output:

$ grep -n 'lvl\[.variables.\]\|level\[.variables.\]\|levels\[.*\]\[.variables.\]' src/zagg/*.py
(no results)

So adding signal_conf_ph to levels.photons.variables here (and identically in atl03_waveform_counts.yaml:46) has zero functional effect — the dataset is read because the filter's dataset: "/{group}/heights/signal_conf_ph" triggers it via filter_paths in _planned_read_group, not because it appears in the level's variables map. But a reader of the YAML who hasn't read processing.py will reasonably expect that listing signal_conf_ph under the level's variables either (a) makes it appear as an output column or (b) bounds the read scope. Neither is true.

Two options:

  • Drop signal_conf_ph from levels.photons.variables on both YAMLs. The filter reads it via its own dataset field; nothing needs the declaration. (h_ph similarly is read from data_source.variables, not the level — also redundant, but at least it's a real output column.)
  • Keep it for documentation but add a YAML comment noting that levels.<x>.variables is currently a documentation hint only — the read path uses data_source.variables. This will rot the moment someone wires the level's variables into the read path (which Generic hierarchical filtering & targeted reads for multi-level point datasets #43's design implies should happen eventually).

(1) is the cleaner ship — levels.<x>.variables reads as authoritative when in fact it's the top-level map that's authoritative.


Generated by Claude Code

link:
to: photons
index_beg: "/{group}/geolocation/ph_index_beg"
count: "/{group}/geolocation/segment_ph_cnt"

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Asymmetric path conventions: coordinates use bare names (joined to path), link fields use absolute /{group}/... templates.

The level's coordinates here are bare names — reference_photon_lat / reference_photon_lon — which _level_coord_paths (src/zagg/processing.py:837) joins to the level's path template (/{group}/geolocation) to produce /{group}/geolocation/reference_photon_lat. That join behavior is explicit:

lat_path = lat_name if lat_name.startswith("/") else f"{base}/{lat_name}"

But the link fields one line below — index_beg: "/{group}/geolocation/ph_index_beg", count: "/{group}/geolocation/segment_ph_cnt" — are absolute templates. _planned_read_group resolves them with link["index_beg"].format(group=group) (processing.py:892); there is no path-join. Both shapes work today (absolute paths bypass the join short-circuit), but the inconsistency will trip the next person who copies issue #43's schema example literally — that example uses bare names for both coordinates and link fields:

link:
  to: photon
  index_beg: ph_index_beg          # start of each parent's children in `to`
  count: segment_ph_cnt            # children per parent

Two options, both small:

  • Pick relative everywhere on both ATL03 YAMLs: index_beg: ph_index_beg, count: segment_ph_cnt. Requires extending _planned_read_group (and the cross-level filter at processing.py:1007–1009) to apply the same join rule the coordinate resolver uses. That's where this is heading per Generic hierarchical filtering & targeted reads for multi-level point datasets #43, so it's the future-proof choice.
  • Pick absolute everywhere on both ATL03 YAMLs: rewrite the coords as latitude: "/{group}/geolocation/reference_photon_lat". Smaller change, but it freezes the asymmetry in for ATL03 specifically.

If you want neither now, please at minimum add a YAML comment block at the top of levels: noting which convention the link fields follow — otherwise the next mission's template will be written against the issue #43 spec and fail at link["index_beg"].format(group=group) when no / is there to absolute-prefix.


Generated by Claude Code

Comment thread tests/test_config.py Outdated
# Cross-check the link is identical to atl03's so the two templates
# share the same plan_read inputs on the same granule.
atl03_link = default_config("atl03").data_source["levels"]["segments"]["link"]
assert ds["levels"]["segments"]["link"] == atl03_link

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Cross-template link-equality assertion is too tight.

The whole-dict comparison ds["levels"]["segments"]["link"] == atl03_link enforces bit-identical link blocks across the two templates. The motivation as documented in the comment is "share plan_read inputs on the same granule" — which only requires that index_beg, count, and index_base resolve to the same arrays on disk (so the planned slices match). It does not require pad (rightly not in the link, but the same logic applies to other future link knobs like reference_index) to be identical, and it does not require future-only fields like the reserved reference_index slot to stay in lock-step if one template legitimately needs to diverge.

A waveform-specific edit that legitimately diverges (say, a different reference_index once that field activates per _validate_levels at config.py:586) would fail this assertion even though plan_read parity is preserved. That makes the test a maintenance footgun on any future link-schema growth.

Tighten to the fields that genuinely must match for plan parity:

for key in ("to", "index_beg", "count", "index_base"):
    assert ds["levels"]["segments"]["link"][key] == atl03_link[key]

Same intent, surfaces a real divergence cleanly without coupling to fields whose drift doesn't break the shared-plan invariant.


Generated by Claude Code

index_base: 1 # ATL03 ph_index_beg is 1-based per the v3 dict
read_plan:
spatial_index: segments
pad: 1 # one segment of padding on each side per #43

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

pad: 1 choice is undocumented and not exercised by any test in this PR.

The YAML comment says # one segment of padding on each side per #43. #43's issue body in fact defines pad ("pad the matched parent range by N elements per side") without committing to a default; the prototype used 1 but doesn't justify it as the operational default for the grids zagg actually ships. The phase 5 planned-read tests use pad: 0 (tests/test_processing.py:1748) for tractability; TestPlanReadIndexBase uses pad: 0 for the index_base=1 case (tests/test_read_plan.py:131); the only pad: 1 exercise is implicit in TestPlannedReadGroup::test_planned_path_bounds_io_to_matched_segments where the wide segment spacing makes pad behavior irrelevant.

Concretely:

  • ATL03 segment spacing is ~20 m along-track. pad: 1 = one segment = ~20 m of padding on each side of a matched run.
  • The rectilinear grid in both YAMLs is resolution: 0.0001 (~10 m at the equator). A shard footprint of 10 m × 10 m sits inside a single 20 m segment — pad: 1 means up to ~20 m of always-discarded reads per shard, on top of the actual span. Not wasteful in absolute terms but worth knowing.
  • At high latitude (~88°S — the --hard AOI in the benchmark), a rectilinear cell of 0.0001° in longitude is meters across, but the segment footprint stretches differently. The 1-segment pad might genuinely under-pad there if the AOI clips a segment boundary at an angle.

This isn't blocking, but two requests:

  1. Add a YAML comment that justifies pad: 1 from the grid resolution, not just "per Generic hierarchical filtering & targeted reads for multi-level point datasets #43" — e.g. "1 segment ≈ 20 m, matches the 0.0001° rectilinear resolution to within an order of magnitude; tune if HEALPix order-19 lands."
  2. Either add an offline test that exercises pad: 1 at a segment boundary on the shipped template (would fit naturally into the integration test I'm flagging on tests/test_config.py:164), or at least add a unit test in tests/test_read_plan.py that confirms pad: 1 extends a single-segment match into a 3-segment slice with index_base=1 arithmetic. The current TestPlanReadIndexBase test uses pad: 0.

Generated by Claude Code

@espg

espg commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

🤖 from Claude

Phase 5 self-review came back (review at pullrequestreview-4527...) — 9 findings, of which 2 were real hazards, 3 were test gaps, and 4 were quality / docstring nits. Folded the 5 diff-scoped ones in f79ad6b fold phase 5 review. Module-size + duplication flag (CLAUDE.md §4) is real but pre-existing-on-main and noted under Standing Items below.

Folded (f79ad6b):

  • Antimeridian / polar-cap guard. grid.shard_footprint(shard_key).bounds collapses an antimeridian-crossing or polar-cap shard to a near-globe-spanning bbox, which silently intersects every segment and neutralises the IO bound on the shards that need it most. Added an explicit guard in _planned_read_group: when the bbox lon span ≥ 180°, fall back to _read_group_full instead of pretending to plan. New test test_antimeridian_grid_falls_back_to_full_read asserts the fallback fires.
  • n_base derivation hardened. Was int(ibeg_arr[-1]) - index_base + int(cnt_arr[-1]) which is only correct under Generic hierarchical filtering & targeted reads for multi-level point datasets #43's strict-contiguity assumption (no trailing photons dropped, no gaps). Changed to int(np.asarray(cnt_arr).sum()), identical to the old expression under contiguity but making the assumption explicit, with a comment that a non-contiguous future granule format would need its own Generic hierarchical filtering & targeted reads for multi-level point datasets #43 follow-up.
  • Dispatch predicate hardening. _read_group previously truthy-checked levels/base_level, so a config typo with an empty {} would silently route back to _read_group_full. Now: when read_plan.spatial_index is set, an empty levels or missing base_level raises explicitly. New test test_dispatch_rejects_empty_levels covers this.
  • Multi-slice plan + global_idx alignment test. The original TestPlannedReadGroup exercised only single-contiguous-run plans; the multi-slice global_idx realignment path through _planned_read_group was un-tested end-to-end. New test_multi_slice_plan_global_idx_alignment constructs a 10-segment track that zig-zags through the AOI lat band (so plan_read produces two disjoint base-slices), adds a cross-level podppd filter on a specific segment, and asserts the dropped photon lands on the correct row through the global_idx realignment.
  • Full-read fallback carries filters. The original fallback test ran with no filters. New test_full_read_fallback_carries_filters exercises a base-level structured filter through the selectivity-fallback path (threshold lowered so the plan trips the fallback), asserting the filter still drops the right rows.
  • Parity test row-order coverage. Original test_parity_with_full_read asserted only df["h"]. New test_parity_with_full_read_includes_leaf_id adds a sibling assertion on df["leaf_id"] to lock in row-order parity (under the _LatBboxGrid stub leaf_id carries the lat used to assign the row, so the comparison is real).

Standing:

  • Module size (src/zagg/processing.py now ~1448 lines, CLAUDE.md §4 targets ~1000). _planned_read_group is a structural twin of _read_group_full — both follow the same "read coords → spatial mask → variables → structured filters → cross-level mask → expressions → output" template. Worth refactoring out a shared post-read helper (_apply_filters_and_build_output(...)) so the two paths share the back half. Raising as a follow-up — substantial enough to be its own change, and the review subagent flagged it as a §4 nit rather than a blocker. Want your call on whether to do it in this PR (as a phase 7 cleanup) or open a separate issue.
  • Standing from phase 4: the column: 0 vs .any(axis=1) semantic divergence (A/B/C options) — still open from ATL03 template + region benchmark + hierarchical-read wiring (Refs #30, #43) #47 (comment).

Gates: pytest --ignore=tests/test_shardmap.py499 passed. ruff check src tests benchmarks clean.

Phase 6 self-review is still running on a separate fresh agent; its findings will land below.


Generated by Claude Code

@espg

espg commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

🤖 from Claude

Phase 6 self-review came back (review 4527201199) — 5 findings. The headline gap was "no integration test exercises the shipped templates through _planned_read_group" — exactly the shape of the phase-4 gap we caught with test_atl03_shipped_template_2d_signal_conf. All five folded in 9741eae fold phase 6 review.

Folded (9741eae):

  • Integration test for the shipped ATL03 template through _planned_read_group. New test_shipped_atl03_template_through_planned_read (tests/test_processing.py) loads default_config("atl03") and runs it against an ATL03-shaped _FakeH5 with real {group} path templates (/gt1l/heights/, /gt1l/geolocation/), 1-based ph_index_beg, the 2-D signal_conf_ph filter, and pad: 1 from the shipped config. Without this, a future YAML edit dropping index_base: 1 (defaulting to 0) would land green on the synthetic _planned_read_h5 fixture.
  • signal_conf_ph dropped from levels.photons.variables. The read path consults the top-level data_source.variables + filter datasets — never level.variables — so the entry was declarative-only and misleading. Same for ph_index_beg/segment_ph_cnt at the segments level (they're referenced through link.index_beg/link.count, not the variables map). Both YAMLs cleaned up; coordinates stay since they document what's at each level even if _read_group doesn't read them.
  • Path-convention comment. Added a comment on both YAMLs noting that coordinates uses bare names joined to level.path while link index_beg/count are absolute path templates — both forms are accepted by _level_coord_paths in processing.py. The asymmetry is intentional (link arrays may live outside the level's path) but now explicitly called out so the next reader copying Generic hierarchical filtering & targeted reads for multi-level point datasets #43's schema example doesn't trip on it.
  • Cross-template link-equality narrowed. test_multi_level_form_matches_atl03 now compares only the four link fields that affect plan_read parity (to, index_beg, count, index_base) rather than the whole link dict. Other level fields (documentation-only variables, coord names, formatting) can legitimately diverge across templates without affecting the plan, so the loose comparison won't false-positive on cosmetic edits.
  • pad: 1 coverage. The new shipped-template integration test exercises pad: 1 end-to-end (the bbox pulls in segments {1, 2} directly + via linestring sweep; pad: 1 widens to {0, 1, 2, 3}). test_pad_extends_selection continues to cover pad: 1 over the synthetic fixture, so both pad: 0 and pad: 1 paths now have offline coverage.

Gates: pytest --ignore=tests/test_shardmap.py500 passed. ruff check src tests benchmarks clean.

Standing (carried from earlier phases):

  • column: 0 vs .any(axis=1) semantic divergence (A/B/C options from ATL03 template + region benchmark + hierarchical-read wiring (Refs #30, #43) #47 (comment)). Phase 6's multi-level migration does not change this.
  • processing.py module size (~1448 lines, CLAUDE.md §4 targets ~1000) and the duplication between _planned_read_group and _read_group_full. Phase-5 review's suggestion was to extract a shared _apply_filters_and_build_output(...) helper as a phase 7 cleanup. Want your call: do it in this PR, or open a follow-up issue?

Both phases 5 and 6 are now complete + folded. Setting waiting.


Generated by Claude Code

@espg espg added the waiting label Jun 18, 2026
@espg espg marked this pull request as ready for review June 18, 2026 17:58
@espg espg merged commit 02febd6 into main Jun 18, 2026
12 checks passed
@espg espg deleted the claude/30-benchmark-template branch June 18, 2026 18:00
@espg

espg commented Jun 19, 2026

Copy link
Copy Markdown
Member Author

First result back (simple case, 2D counts):

  (zagg) espg@fedora ~/s/zagg (main)> python -m zagg --config atl03_counts_neon_utm.yaml \                                                                                                                                         [15
  tasks]
                                              --catalog shardmap_atl03_neon_utm.json \
                                              --backend lambda --output-creds sourcecoop_out.json \
                                              --overwrite
  Processing 16 of 16 cells (lambda)
  You're now authenticated with NASA Earthdata Login
  Using token with expiration date 08/18/2026
  Loading cached SSO token for smdc
  Account concurrency: limit=1000, current=1, padding=100, available=899 -> using 16 workers
  Done: 8 cells, 142,253 obs, 0 errors, 122.1s
  Lambda compute: 1140s total, 2280 GB-s, ~$0.03

  Done: 8 cells with data, 142,253 obs, 0 errors, 122.1s
  Lambda compute: 1140s total, 2280 GB-s, ~$0.03
  Output: s3://us-west-2.opendata.source.coop/englacial/zagg/benchmarks/atl03_counts_neon_utm.zarr

@espg

espg commented Jun 20, 2026

Copy link
Copy Markdown
Member Author

Updated benchmark for 2018 to mid 2026 (combined):

(zagg) espg@fedora ~/s/zagg (main)> python -m zagg --config atl03_counts_neon_utm.yaml \                                                                                                                                        
                                            --catalog shardmap_atl03_neon_utm_2018_2026_beams.json \
                                            --backend lambda --output-creds sourcecoop_out.json \
                                            --overwrite
Processing 16 of 16 cells (lambda)
You're now authenticated with NASA Earthdata Login
Using token with expiration date 08/18/2026
Loading cached SSO token for smdc
Account concurrency: limit=1000, current=1, padding=100, available=899 -> using 16 workers
Done: 16 cells, 2,139,598 obs, 0 errors, 463.4s
Lambda compute: 4729s total, 9458 GB-s, ~$0.13

Done: 16 cells with data, 2,139,598 obs, 0 errors, 463.4s
Lambda compute: 4729s total, 9458 GB-s, ~$0.13
Output: s3://us-west-2.opendata.source.coop/englacial/zagg/benchmarks/atl03_counts_neon_utm.zarr

@espg

espg commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

Updated-- getting better! Note that this a more compute intensive t-digest run, rather than just counts aggregation...

(zagg) espg@Max ~/s/zagg (main)> python -m zagg \                                                  
                                       --config src/zagg/configs/atl03_tdigest_healpix.yaml \
                                       --catalog neon_tdigest_shardmap.json \
                                       --store s3://sliderule-public/test_neon_order11.zarr \
                                       --region us-west-2 \
                                       --backend lambda --overwrite
Processing 22 of 22 cells (lambda)
You're now authenticated with NASA Earthdata Login
Using token with expiration date 08/18/2026
Loading cached SSO token for smdc
Account concurrency: limit=1000, current=1, padding=100, available=899 -> using 22 workers
Done: 22 cells, 5,391,489 obs, 0 errors, 188.3s
Lambda compute: 2594s total, 5189 GB-s, ~$0.07

Done: 22 cells with data, 5,391,489 obs, 0 errors, 188.3s
Lambda compute: 2594s total, 5189 GB-s, ~$0.07
Output: s3://sliderule-public/test_neon_order11.zarr

@espg

espg commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

this one is suspicious... wall time is exceeding the lambda timeout, so kinda suspect that we're losing data...

(zagg) espg@Max ~/s/zagg (main)> python -m zagg \                                                           
                                       --config src/zagg/configs/atl03_tdigest_healpix.yaml \
                                       --catalog neon_tdigest_shardmap.json \
                                       --store s3://sliderule-public/test_neon_order10.zarr \
                                       --region us-west-2 \
                                       --backend lambda --overwrite
Processing 9 of 9 cells (lambda)
You're now authenticated with NASA Earthdata Login
Using token with expiration date 08/18/2026
Loading cached SSO token for smdc
Account concurrency: limit=1000, current=0, padding=100, available=900 -> using 9 workers
Done: 9 cells, 10,130,471 obs, 0 errors, 1150.6s
Lambda compute: 1705s total, 3411 GB-s, ~$0.05

Done: 9 cells with data, 10,130,471 obs, 0 errors, 1150.6s
Lambda compute: 1705s total, 3411 GB-s, ~$0.05
Output: s3://sliderule-public/test_neon_order10.zarr

@espg

espg commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

for reference, we were testing cost scaling as a function of shard size-- it decreases with larger shards (as expected); here's the opposite order direction that drives the cost up:

(zagg) espg@Max ~/s/zagg (main)> python -m zagg \                                         
                                       --config src/zagg/configs/atl03_tdigest_healpix.yaml \
                                       --catalog neon_tdigest_shardmap.json \
                                       --store s3://sliderule-public/test_neon_full.zarr \
                                       --region us-west-2 \
                                       --backend lambda --overwrite
Processing 196 of 196 cells (lambda)
You're now authenticated with NASA Earthdata Login
Using token with expiration date 08/18/2026
Loading cached SSO token for smdc
Account concurrency: limit=1000, current=1, padding=100, available=899 -> using 196 workers
  [  50/196] 0.8 cells/s
  [ 100/196] 1.5 cells/s
  [ 150/196] 1.9 cells/s
Done: 196 cells, 2,602,459 obs, 0 errors, 165.4s
Lambda compute: 13726s total, 27451 GB-s, ~$0.37

Done: 196 cells with data, 2,602,459 obs, 0 errors, 165.4s
Lambda compute: 13726s total, 27451 GB-s, ~$0.37
Output: s3://sliderule-public/test_neon_full.zarr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants