Skip to content

Fix gain_bias dem_h broadcast dropping all photons on empty ATL03 segments#129

Merged
espg merged 4 commits into
mainfrom
claude/116-gain-bias-dem-h
Jun 29, 2026
Merged

Fix gain_bias dem_h broadcast dropping all photons on empty ATL03 segments#129
espg merged 4 commits into
mainfrom
claude/116-gain-bias-dem-h

Conversation

@espg

@espg espg commented Jun 29, 2026

Copy link
Copy Markdown
Member

Closes #116

Root cause

On the same shard/granules the gain_bias template read 0 observations while tdigest read ~200k. The only data_source difference is the segment-level dem_h read on the segments level (src/zagg/configs/atl03_gain_bias_healpix.yaml), which routes gain_bias through the segment→photon broadcast (_read_segment_broadcasts / _broadcast_segment_to_base in src/zagg/processing/read.py) that tdigest never touches.

Real ATL03 marks empty segments with count == 0 and ph_index_beg == 0. Under index_base: 1 (atl03_gain_bias_healpix.yaml), such a segment gives beg = 0 - 1 = -1 < 0, which raised index_beg_arr[..]=0 is less than index_base=1 in _broadcast_segment_to_base. The per-group read in worker.py was wrapped in a bare except Exception logged at DEBUG, so the raise was swallowed for every beam/granule, all_reads stayed empty, and the worker reported the misleading No data after filtering. tdigest, with no broadcast, never raised.

read_plan.plan_read already skips empty (cnt == 0) segments; _broadcast_segment_to_base (and its twin _expand_mask_to_base) had no such guard. This was empirically confirmed in the issue thread: every beam on every granule raised the empty-segment ValueError, and a cnt == 0 guard alone restored byte-identical obs to tdigest. The "index-space reconciliation" (step 3 of the original diagnosis) is not needed and was confirmed out of scope by @espg.

Approach (phases)

  • Phase 1 — make the failure visible. Promote the swallowed per-group read error from DEBUG → WARNING in worker.py (a raised read is always a real error; a legitimately-empty group returns None, so this does not get noisy). Track a per-shard read_errors count in ProcessingMetadata and, when every group raised, report the real error instead of No data after filtering.
  • Phase 2 — fix the broadcast. Add an if cnt == 0: continue guard (before the beg < 0 check) to _broadcast_segment_to_base, mirroring plan_read's non-empty-only contract. Apply the identical guard to the twin _expand_mask_to_base (same latent bug; @espg approved the consistent guard).
  • Phase 3 — regression tests. Realistic ATL03-style segment table with interspersed empty segments (ph_index_beg == 0, cnt == 0) under index_base = 1, plus a planned-path run that does not start at file index 0.

What changed

  • src/zagg/processing/read.pycnt == 0 skip in _broadcast_segment_to_base and _expand_mask_to_base.
  • src/zagg/processing/worker.py — per-group read error → WARNING; read_errors counter; all-raised path reports a real error.
  • src/zagg/schema.pyread_errors: NotRequired[int] on ProcessingMetadata.
  • tests/test_processing.py — new regression tests + updated test_read_exception_still_releases_in_finally to assert the new (correct) read-error reporting.

How tested

  • New regression tests assert gain_bias reads the same observation count as tdigest on a table with empty segments (the unit analogue of the locally-observed 20,710 == 20,710), that the broadcast does not raise under index_base=1 with cnt == 0 empties, and that dem_h lands on the correct photons on both the full and planned read paths.
  • tests/test_processing.py and tests/test_runner.py green via uv run --extra test python -m pytest.
  • ruff check --select=E,F,W,I --ignore=E501 src tests and ruff format --check on touched files pass.

Questions for review

  1. Order-10 single-shard processing OOMs at 2 GB (gain_bias) — 2 GB is marginal for order-10 #117 sequencing. The tracking note asks for gain_bias template reads 0 observations where tdigest reads 200k on the same shard (dem_h segment read) #116 + Order-10 single-shard processing OOMs at 2 GB (gain_bias) — 2 GB is marginal for order-10 #117 (gain_bias OOM) to land together as one PR (Closes #116 #117). This PR is scoped to the dem_h broadcast (the gain_bias template reads 0 observations where tdigest reads 200k on the same shard (dem_h segment read) #116 root cause) only — Order-10 single-shard processing OOMs at 2 GB (gain_bias) — 2 GB is marginal for order-10 #117 is a distinct root cause (full-grid vector allocation) inside the same template and re-surfaces once this fix re-enables the waveform path. Should the Order-10 single-shard processing OOMs at 2 GB (gain_bias) — 2 GB is marginal for order-10 #117 footprint work be folded onto this branch, or land as its own PR? Left to @espg since folding it in changes this PR's agreed scope.
  2. mypy baseline. Running mypy ad-hoc on the touched modules surfaces only pre-existing errors (missing pandas/h5coro stubs, the phase_timings: dict | None union, the DataSourceDict arg-type) on lines this PR does not touch; the new read_errors field and assignments are clean. Not "fixed" here per the convention on pre-existing unrelated failures — flagging instead.

Generated by Claude Code

@espg espg added the implement label Jun 29, 2026

@espg espg left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Adversarial review of all three phases against issue #116 and CLAUDE.md. The fix is correct and in scope — no blocking findings; the comments below are minor (wording/grammar) plus confirmations.

Correctness (verified, no bug):

  • The cnt == 0 guard in both _broadcast_segment_to_base and _expand_mask_to_base only short-circuits empty parents. A non-empty segment with beg < 0 still raises, and beg + cnt > total_base_size still raises for non-empty segments. Both existing raise-contracts hold — test_negative_beg_raises (cnt=[3]) and test_broadcast_out_of_bounds_raises (cnt=[2,5]) use non-empty counts, so the guard never bypasses them.
  • n_base = int(cnt_arr.sum()) is unaffected by empties (they contribute 0), so broadcast/expand outputs stay length-aligned with seg_global_idx/global_idx on both full and planned paths. The planned path's cross-level call (_expand_mask_to_base at read.py:460) gets the same skip.
  • WARNING-promotion is not noisy: a legitimately-empty group returns None without raising, so it never increments read_errors or logs here — only genuine read failures do.

Test quality: the parity test reproduces the 0-vs-N drop and would fail without the fix (pre-fix broadcast raises on the first empty under index_base=1); the planned-path test covers the separate _planned_read_group code path. Good coverage.

Scope/conventions: stays on the @espg-confirmed fix (skip guards + WARNING promotion + read_errors metadata/schema field); no index-space reconciliation, no swallowed errors, no weakened tests, terse and style-matching. Compliant.

Diff-scoped nits to consider folding (all non-blocking):

  1. worker.py: the "All group reads raised" message overstates the mixed case (some groups may have returned None rather than raised); "No data and N group read(s) raised" is more precise — and update the coupled exact-match assertion at tests line 4638 in the same commit.
  2. "(1 read errors)" — pluralize on N or use "(N read error(s))".
  3. read.py: optionally note in the guard comment that it intentionally bypasses the beg < 0 validation for empty segments (correct, since empties map to zero base rows).

Generated by Claude Code

# that gives ``beg = 0 - 1 = -1`` and would raise below, which is what
# made the gain_bias dem_h broadcast drop every photon. Skip them.
if cnt == 0:
continue

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Correctness of the guard checks out: it only continues when cnt == 0, so a non-empty segment with beg < 0 still hits the ValueError below, and the beg + cnt > total_base_size raise is likewise untouched for non-empty segments. I confirmed both raise-contracts hold — test_negative_beg_raises (cnt=[3]) and test_broadcast_out_of_bounds_raises (cnt=[2,5]) both use non-empty counts, so the guard never short-circuits them.

One behavioral subtlety worth a one-line note in the comment (non-blocking): the guard now bypasses the beg < 0 validation entirely for any cnt == 0 segment, regardless of what index_beg actually holds. That is exactly right for real ATL03 (empties carry ph_index_beg == 0), but it does mean a hypothetically corrupt empty segment (e.g. ph_index_beg = -5, cnt = 0) is now silently skipped rather than flagged. Since an empty segment maps to zero base rows by definition, skipping is the correct semantics — just calling it out so the looser validation is intentional, not an oversight.


Generated by Claude Code

# raise below; skip them, mirroring the non-empty-only contract
# ``read_plan.plan_read`` already uses (its ``cnt > 0`` skip).
if cnt == 0:
continue

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

The twin guard in _expand_mask_to_base matches the broadcast helper exactly and is consistent with the planned-path call at line 460 (_expand_mask_to_base(coarse_fmask, cf_ibeg, cf_cnt, ...)), so cross-level filters get the same empty-parent skip. No n_base drift either: n_base = int(cnt_arr.sum()) (line 348) is unaffected by empties since they contribute 0, so the broadcast/expand outputs stay length-aligned with seg_global_idx/global_idx. Looks correct.


Generated by Claude Code

metadata["error"] = f"All group reads raised ({read_errors} read errors)"
else:
logger.info(f" No data after filtering for shard {shard_key} - skipping")
metadata["error"] = "No data after filtering"

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Minor wording nit (non-blocking): this branch fires on not all_reads and read_errors, which is reached whenever no group produced data and at least one raised — but some of the other groups may have returned None (legitimately empty) rather than raised. In that mixed case the message "Read raised on all groups" / "All group reads raised (N read errors)" overstates it (not literally all groups raised; some were simply empty). The read_errors count is accurate; only the prose is slightly too strong. Consider e.g. "No data and N group read(s) raised" to stay precise. Not a correctness problem — the WARNING + count still surface the real failure, which is the point of issue #116.

The WARNING-promotion itself is sound and won't get noisy: a legitimately-empty group returns None without raising (it doesn't increment read_errors or log here), so only genuine read failures hit this path.


Generated by Claude Code

Comment thread tests/test_processing.py
0,
_ShardGrid(),
)
assert len(gain_bias) == len(tdigest) == 8

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

This parity test is the right regression for issue #116 and would genuinely fail without the fix: with index_base=1 and the first empty segment at ph_index_beg=0, cnt=0, the pre-fix broadcast computes beg = 0 - 1 = -1 and raises, so the with_dem_h=True read returns 0 rows while with_dem_h=False returns 8 — exactly the "gain_bias 0 vs tdigest N" drop. Asserting len(gain_bias) == len(tdigest) == 8 plus the exact per-photon dem_h values pins both the count parity and the alignment. The companion planned-path test (test_planned_read_skips_empty_segments_index_base_1) covers the separate _planned_read_group path with a bbox run that starts past file index 0 and straddles the empty — good, since that path was not exercised by the full-read parity test. Solid coverage.


Generated by Claude Code

Comment thread tests/test_processing.py Outdated
# A raised read is a real error, NOT a legitimately-empty read (issue
# #116): the shard reports the read-error path and counts the failure,
# rather than the misleading "No data after filtering".
assert meta["error"] == "All group reads raised (1 read errors)"

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Updating this assertion is correct, not a weakened test (§4): the worker's behavior genuinely changed — a read that raised now reports the read-error path and counts the failure instead of the misleading "No data after filtering". Asserting both meta["error"] and meta["read_errors"] == 1 locks in the new contract. Note this string assertion is exact-match coupled to the message wording; if you adopt the "No data and N group read(s) raised" rewording I suggested on worker.py, update this literal in the same commit.

Grammar nit (cosmetic): "(1 read errors)" reads awkwardly for the singular case. Low priority given it's an internal diagnostic string, but "(N read error(s))" or pluralize-on-N would tidy it.


Generated by Claude Code

@espg espg left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Fresh-context review of fold commit baa6878 only (7c7fe9e..baa6878). No blocking findings — the fold is correct.

  • worker.py message reword renders exactly No data after filtering (1 group reads raised) at read_errors == 1, matching the updated test assertion in tests/test_processing.py verbatim, and no longer overclaims "all groups raised" (empty groups return None, only raises bump read_errors).
  • read.py guard comments accurately describe the cnt == 0 skip and are correctly scoped per function (_expand_mask_to_base bypasses only beg < 0; _broadcast_segment_to_base bypasses beg < 0 / beg + cnt > base), both noting non-empty parents still raise.

No grammar, style, or CLAUDE.md issues introduced by this commit. Earlier accepted phases were not re-litigated.


Generated by Claude Code

f" No data after filtering for shard {shard_key} and "
f"{read_errors} group read(s) raised - skipping"
)
metadata["error"] = f"No data after filtering ({read_errors} group reads raised)"

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Message reword is correct and accurate. With read_errors == 1 the f-string renders exactly No data after filtering (1 group reads raised), which matches the updated test assertion in tests/test_processing.py verbatim. And the new phrasing avoids the earlier overclaim: a legitimately-empty group returns None (no exception, never enters all_reads) while only a raised read increments read_errors, so not all_reads with read_errors > 0 genuinely means "no data AND ≥1 raised" — some groups may simply have returned None. No "all groups raised" overstatement remains. Comment at lines 268-272 documents this distinction accurately.


Generated by Claude Code

# ``read_plan.plan_read`` already uses (its ``cnt > 0`` skip). Skipping
# intentionally bypasses the ``beg < 0`` validation for these parents --
# correct, since they map to zero base rows (a non-empty parent with
# ``beg < 0`` still raises).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Expanded guard comments are accurate and correctly differentiated per function. _expand_mask_to_base has only the beg < 0 check below, and this comment scopes its bypass claim to exactly that ("bypasses the beg < 0 validation"). The companion comment in _broadcast_segment_to_base correctly widens it to "beg < 0 / beg + cnt > base" because that function carries both checks. Both correctly state that a non-empty parent with an invalid range still raises. No correctness issue.


Generated by Claude Code

@espg

espg commented Jun 29, 2026

Copy link
Copy Markdown
Member Author

🤖 from Claude

Status: all three phases are complete and both fresh-context self-reviews folded (no blocking findings). Locally green — ruff check --select=E,F,W,I --ignore=E501 src tests, ruff format --check on touched files, and pytest for the affected suites (test_processing.py, test_runner.py) all pass. The fix is the @espg-confirmed cnt == 0 skip guard in _broadcast_segment_to_base and its _expand_mask_to_base twin, plus the per-group read error promoted to WARNING with a read_errors count; index-space reconciliation was confirmed out of scope.

CI runners are still queued/in-progress at the time of this run, so the draft is left in place (not flipped to ready) pending CI green. Next run: once CI is green and nothing is waiting on @espg, flip to ready for review. Note for @espg: Refs #117 — the gain_bias OOM is a distinct root cause re-exposed once this fix re-enables the waveform path; whether to fold #117 onto this branch (per the "land together" tracking note) is raised under Questions for review since it would change this PR's agreed scope.


Generated by Claude Code

@espg espg added the waiting label Jun 29, 2026
@espg espg marked this pull request as ready for review June 29, 2026 20:39
@espg espg merged commit fe8e6da into main Jun 29, 2026
7 checks passed
@espg espg deleted the claude/116-gain-bias-dem-h branch June 29, 2026 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gain_bias template reads 0 observations where tdigest reads 200k on the same shard (dem_h segment read)

2 participants