Fix gain_bias dem_h broadcast dropping all photons on empty ATL03 segments#129
Conversation
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Adversarial review of all three phases against issue #116 and CLAUDE.md. The fix is correct and in scope — no blocking findings; the comments below are minor (wording/grammar) plus confirmations.
Correctness (verified, no bug):
- The
cnt == 0guard in both_broadcast_segment_to_baseand_expand_mask_to_baseonly short-circuits empty parents. A non-empty segment withbeg < 0still raises, andbeg + cnt > total_base_sizestill raises for non-empty segments. Both existing raise-contracts hold —test_negative_beg_raises(cnt=[3]) andtest_broadcast_out_of_bounds_raises(cnt=[2,5]) use non-empty counts, so the guard never bypasses them. n_base = int(cnt_arr.sum())is unaffected by empties (they contribute 0), so broadcast/expand outputs stay length-aligned withseg_global_idx/global_idxon both full and planned paths. The planned path's cross-level call (_expand_mask_to_baseat read.py:460) gets the same skip.- WARNING-promotion is not noisy: a legitimately-empty group returns
Nonewithout raising, so it never incrementsread_errorsor logs here — only genuine read failures do.
Test quality: the parity test reproduces the 0-vs-N drop and would fail without the fix (pre-fix broadcast raises on the first empty under index_base=1); the planned-path test covers the separate _planned_read_group code path. Good coverage.
Scope/conventions: stays on the @espg-confirmed fix (skip guards + WARNING promotion + read_errors metadata/schema field); no index-space reconciliation, no swallowed errors, no weakened tests, terse and style-matching. Compliant.
Diff-scoped nits to consider folding (all non-blocking):
- worker.py: the "All group reads raised" message overstates the mixed case (some groups may have returned
Nonerather than raised); "No data and N group read(s) raised" is more precise — and update the coupled exact-match assertion at tests line 4638 in the same commit. - "(1 read errors)" — pluralize on N or use "(N read error(s))".
- read.py: optionally note in the guard comment that it intentionally bypasses the
beg < 0validation for empty segments (correct, since empties map to zero base rows).
Generated by Claude Code
| # that gives ``beg = 0 - 1 = -1`` and would raise below, which is what | ||
| # made the gain_bias dem_h broadcast drop every photon. Skip them. | ||
| if cnt == 0: | ||
| continue |
There was a problem hiding this comment.
🤖 from Claude (review)
Correctness of the guard checks out: it only continues when cnt == 0, so a non-empty segment with beg < 0 still hits the ValueError below, and the beg + cnt > total_base_size raise is likewise untouched for non-empty segments. I confirmed both raise-contracts hold — test_negative_beg_raises (cnt=[3]) and test_broadcast_out_of_bounds_raises (cnt=[2,5]) both use non-empty counts, so the guard never short-circuits them.
One behavioral subtlety worth a one-line note in the comment (non-blocking): the guard now bypasses the beg < 0 validation entirely for any cnt == 0 segment, regardless of what index_beg actually holds. That is exactly right for real ATL03 (empties carry ph_index_beg == 0), but it does mean a hypothetically corrupt empty segment (e.g. ph_index_beg = -5, cnt = 0) is now silently skipped rather than flagged. Since an empty segment maps to zero base rows by definition, skipping is the correct semantics — just calling it out so the looser validation is intentional, not an oversight.
Generated by Claude Code
| # raise below; skip them, mirroring the non-empty-only contract | ||
| # ``read_plan.plan_read`` already uses (its ``cnt > 0`` skip). | ||
| if cnt == 0: | ||
| continue |
There was a problem hiding this comment.
🤖 from Claude (review)
The twin guard in _expand_mask_to_base matches the broadcast helper exactly and is consistent with the planned-path call at line 460 (_expand_mask_to_base(coarse_fmask, cf_ibeg, cf_cnt, ...)), so cross-level filters get the same empty-parent skip. No n_base drift either: n_base = int(cnt_arr.sum()) (line 348) is unaffected by empties since they contribute 0, so the broadcast/expand outputs stay length-aligned with seg_global_idx/global_idx. Looks correct.
Generated by Claude Code
| metadata["error"] = f"All group reads raised ({read_errors} read errors)" | ||
| else: | ||
| logger.info(f" No data after filtering for shard {shard_key} - skipping") | ||
| metadata["error"] = "No data after filtering" |
There was a problem hiding this comment.
🤖 from Claude (review)
Minor wording nit (non-blocking): this branch fires on not all_reads and read_errors, which is reached whenever no group produced data and at least one raised — but some of the other groups may have returned None (legitimately empty) rather than raised. In that mixed case the message "Read raised on all groups" / "All group reads raised (N read errors)" overstates it (not literally all groups raised; some were simply empty). The read_errors count is accurate; only the prose is slightly too strong. Consider e.g. "No data and N group read(s) raised" to stay precise. Not a correctness problem — the WARNING + count still surface the real failure, which is the point of issue #116.
The WARNING-promotion itself is sound and won't get noisy: a legitimately-empty group returns None without raising (it doesn't increment read_errors or log here), so only genuine read failures hit this path.
Generated by Claude Code
| 0, | ||
| _ShardGrid(), | ||
| ) | ||
| assert len(gain_bias) == len(tdigest) == 8 |
There was a problem hiding this comment.
🤖 from Claude (review)
This parity test is the right regression for issue #116 and would genuinely fail without the fix: with index_base=1 and the first empty segment at ph_index_beg=0, cnt=0, the pre-fix broadcast computes beg = 0 - 1 = -1 and raises, so the with_dem_h=True read returns 0 rows while with_dem_h=False returns 8 — exactly the "gain_bias 0 vs tdigest N" drop. Asserting len(gain_bias) == len(tdigest) == 8 plus the exact per-photon dem_h values pins both the count parity and the alignment. The companion planned-path test (test_planned_read_skips_empty_segments_index_base_1) covers the separate _planned_read_group path with a bbox run that starts past file index 0 and straddles the empty — good, since that path was not exercised by the full-read parity test. Solid coverage.
Generated by Claude Code
| # A raised read is a real error, NOT a legitimately-empty read (issue | ||
| # #116): the shard reports the read-error path and counts the failure, | ||
| # rather than the misleading "No data after filtering". | ||
| assert meta["error"] == "All group reads raised (1 read errors)" |
There was a problem hiding this comment.
🤖 from Claude (review)
Updating this assertion is correct, not a weakened test (§4): the worker's behavior genuinely changed — a read that raised now reports the read-error path and counts the failure instead of the misleading "No data after filtering". Asserting both meta["error"] and meta["read_errors"] == 1 locks in the new contract. Note this string assertion is exact-match coupled to the message wording; if you adopt the "No data and N group read(s) raised" rewording I suggested on worker.py, update this literal in the same commit.
Grammar nit (cosmetic): "(1 read errors)" reads awkwardly for the singular case. Low priority given it's an internal diagnostic string, but "(N read error(s))" or pluralize-on-N would tidy it.
Generated by Claude Code
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Fresh-context review of fold commit baa6878 only (7c7fe9e..baa6878). No blocking findings — the fold is correct.
- worker.py message reword renders exactly
No data after filtering (1 group reads raised)atread_errors == 1, matching the updated test assertion intests/test_processing.pyverbatim, and no longer overclaims "all groups raised" (empty groups returnNone, only raises bumpread_errors). - read.py guard comments accurately describe the
cnt == 0skip and are correctly scoped per function (_expand_mask_to_basebypasses onlybeg < 0;_broadcast_segment_to_basebypassesbeg < 0/beg + cnt > base), both noting non-empty parents still raise.
No grammar, style, or CLAUDE.md issues introduced by this commit. Earlier accepted phases were not re-litigated.
Generated by Claude Code
| f" No data after filtering for shard {shard_key} and " | ||
| f"{read_errors} group read(s) raised - skipping" | ||
| ) | ||
| metadata["error"] = f"No data after filtering ({read_errors} group reads raised)" |
There was a problem hiding this comment.
🤖 from Claude (review)
Message reword is correct and accurate. With read_errors == 1 the f-string renders exactly No data after filtering (1 group reads raised), which matches the updated test assertion in tests/test_processing.py verbatim. And the new phrasing avoids the earlier overclaim: a legitimately-empty group returns None (no exception, never enters all_reads) while only a raised read increments read_errors, so not all_reads with read_errors > 0 genuinely means "no data AND ≥1 raised" — some groups may simply have returned None. No "all groups raised" overstatement remains. Comment at lines 268-272 documents this distinction accurately.
Generated by Claude Code
| # ``read_plan.plan_read`` already uses (its ``cnt > 0`` skip). Skipping | ||
| # intentionally bypasses the ``beg < 0`` validation for these parents -- | ||
| # correct, since they map to zero base rows (a non-empty parent with | ||
| # ``beg < 0`` still raises). |
There was a problem hiding this comment.
🤖 from Claude (review)
Expanded guard comments are accurate and correctly differentiated per function. _expand_mask_to_base has only the beg < 0 check below, and this comment scopes its bypass claim to exactly that ("bypasses the beg < 0 validation"). The companion comment in _broadcast_segment_to_base correctly widens it to "beg < 0 / beg + cnt > base" because that function carries both checks. Both correctly state that a non-empty parent with an invalid range still raises. No correctness issue.
Generated by Claude Code
|
🤖 from Claude Status: all three phases are complete and both fresh-context self-reviews folded (no blocking findings). Locally green — CI runners are still queued/in-progress at the time of this run, so the draft is left in place (not flipped to ready) pending CI green. Next run: once CI is green and nothing is waiting on @espg, flip to ready for review. Note for @espg: Refs #117 — the gain_bias OOM is a distinct root cause re-exposed once this fix re-enables the waveform path; whether to fold #117 onto this branch (per the "land together" tracking note) is raised under Questions for review since it would change this PR's agreed scope. Generated by Claude Code |
Closes #116
Root cause
On the same shard/granules the
gain_biastemplate read 0 observations whiletdigestread ~200k. The onlydata_sourcedifference is the segment-leveldem_hread on thesegmentslevel (src/zagg/configs/atl03_gain_bias_healpix.yaml), which routesgain_biasthrough the segment→photon broadcast (_read_segment_broadcasts/_broadcast_segment_to_baseinsrc/zagg/processing/read.py) thattdigestnever touches.Real ATL03 marks empty segments with
count == 0andph_index_beg == 0. Underindex_base: 1(atl03_gain_bias_healpix.yaml), such a segment givesbeg = 0 - 1 = -1 < 0, which raisedindex_beg_arr[..]=0 is less than index_base=1in_broadcast_segment_to_base. The per-group read inworker.pywas wrapped in a bareexcept Exceptionlogged at DEBUG, so the raise was swallowed for every beam/granule,all_readsstayed empty, and the worker reported the misleadingNo data after filtering.tdigest, with no broadcast, never raised.read_plan.plan_readalready skips empty (cnt == 0) segments;_broadcast_segment_to_base(and its twin_expand_mask_to_base) had no such guard. This was empirically confirmed in the issue thread: every beam on every granule raised the empty-segmentValueError, and acnt == 0guard alone restored byte-identical obs totdigest. The "index-space reconciliation" (step 3 of the original diagnosis) is not needed and was confirmed out of scope by @espg.Approach (phases)
worker.py(a raised read is always a real error; a legitimately-empty group returnsNone, so this does not get noisy). Track a per-shardread_errorscount inProcessingMetadataand, when every group raised, report the real error instead ofNo data after filtering.if cnt == 0: continueguard (before thebeg < 0check) to_broadcast_segment_to_base, mirroringplan_read's non-empty-only contract. Apply the identical guard to the twin_expand_mask_to_base(same latent bug; @espg approved the consistent guard).ph_index_beg == 0, cnt == 0) underindex_base = 1, plus a planned-path run that does not start at file index 0.What changed
src/zagg/processing/read.py—cnt == 0skip in_broadcast_segment_to_baseand_expand_mask_to_base.src/zagg/processing/worker.py— per-group read error → WARNING;read_errorscounter; all-raised path reports a real error.src/zagg/schema.py—read_errors: NotRequired[int]onProcessingMetadata.tests/test_processing.py— new regression tests + updatedtest_read_exception_still_releases_in_finallyto assert the new (correct) read-error reporting.How tested
gain_biasreads the same observation count astdigeston a table with empty segments (the unit analogue of the locally-observed20,710 == 20,710), that the broadcast does not raise underindex_base=1withcnt == 0empties, and thatdem_hlands on the correct photons on both the full and planned read paths.tests/test_processing.pyandtests/test_runner.pygreen viauv run --extra test python -m pytest.ruff check --select=E,F,W,I --ignore=E501 src testsandruff format --checkon touched files pass.Questions for review
Closes #116 #117). This PR is scoped to thedem_hbroadcast (the gain_bias template reads 0 observations where tdigest reads 200k on the same shard (dem_h segment read) #116 root cause) only — Order-10 single-shard processing OOMs at 2 GB (gain_bias) — 2 GB is marginal for order-10 #117 is a distinct root cause (full-grid vector allocation) inside the same template and re-surfaces once this fix re-enables the waveform path. Should the Order-10 single-shard processing OOMs at 2 GB (gain_bias) — 2 GB is marginal for order-10 #117 footprint work be folded onto this branch, or land as its own PR? Left to @espg since folding it in changes this PR's agreed scope.pandas/h5corostubs, thephase_timings: dict | Noneunion, theDataSourceDictarg-type) on lines this PR does not touch; the newread_errorsfield and assignments are clean. Not "fixed" here per the convention on pre-existing unrelated failures — flagging instead.Generated by Claude Code