sort/hash grouping refactor (#30) by espg · Pull Request #33 · englacial/zagg

espg · 2026-06-12T23:13:36Z

Refs #30

What this changes

Reworks the per-shard aggregation hot path in processing.py. Phase 1 replaced the O(n_children × n_obs) per-child boolean-mask loop with an O(n_obs log n_obs) sort/hash grouping and changed calculate_cell_statistics to accept dict[str, np.ndarray] instead of pd.DataFrame. Phase 2 adds an additive Arrow handoff carrier alongside pandas (default stays pandas). An experimental pyarrow-kernel reducer (handoff="arrow-kernel") reduces the kernel-able stats via pyarrow.compute hash-aggregate; it is opt-in and the byte-identical numpy default path is untouched.

Phase 1 — sort/hash group split (`_build_groups`)

One np.argsort (O(n log n)) instead of n_children boolean masks (O(n_children × n_obs)). For a typical HEALPix O6→O12 shard: 4,096 boolean scans → 1 sort. The per-cell slices are numpy views into the sorted arrays. calculate_cell_statistics takes dict[str, np.ndarray]; scalar outputs are byte-for-byte identical (order-independent stats; stable sort preserves within-cell order).
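
The split described above can be sketched roughly as follows. This is a minimal illustration, not the actual `_build_groups` from processing.py: the name `group_columns_sketch` and the boundary computation via `np.flatnonzero` are assumptions; only the stable argsort, the per-column gather, and the empty-input guard are taken from this thread.

```python
import numpy as np


def group_columns_sketch(col_dict, cell_col):
    """Sort observations by cell id once; return sorted columns and per-cell slices."""
    # A stable sort keeps the original row order within each cell.
    sort_idx = np.argsort(cell_col, kind="stable")
    sorted_cells = cell_col[sort_idx]
    # Fancy indexing makes one reordered copy per column.
    col_arrays = {col: arr[sort_idx] for col, arr in col_dict.items()}
    if len(sorted_cells) == 0:
        return col_arrays, {}
    # Positions where the cell id changes mark the start of a new group.
    boundaries = np.flatnonzero(sorted_cells[1:] != sorted_cells[:-1]) + 1
    starts = np.concatenate([[0], boundaries])
    ends = np.concatenate([boundaries, [len(sorted_cells)]])
    cell_to_slice = {
        int(sorted_cells[s]): (int(s), int(e)) for s, e in zip(starts, ends)
    }
    return col_arrays, cell_to_slice


cells = np.array([7, 3, 7, 5, 3], dtype=np.int64)
cols = {"h_li": np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)}
col_arrays, cell_to_slice = group_columns_sketch(cols, cells)
start, end = cell_to_slice[7]
cell_7 = col_arrays["h_li"][start:end]  # contiguous view into the sorted copy
```

Each child cell then costs one dict lookup and one slice instead of a full boolean scan over all observations.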

Phase 2 — Arrow handoff carrier (additive, opt-in)

process_shard(..., handoff="pandas"|"arrow"), default "pandas" → no behavior change unless opted in.
Carrier-agnostic core _group_columns(col_dict, cell_col); both carriers feed the same numpy arrays into the same reductions, so scalar outputs stay byte-for-byte identical (asserted in TestArrowHandoff).
benchmarks/handoff_bench.py — synthetic, CI-runnable harness timing {mask-loop, pandas-group, arrow-group} with a parity assertion.

Experimental — pyarrow-kernel reducer (`handoff="arrow-kernel"`, opt-in)

Reduces count/min/max/variance/(unweighted)mean via one vectorised TableGroupBy.aggregate pass; remaining (weighted mean, expression, quantile) fields fall back to the per-cell numpy path.
count/min/max are exact vs numpy, including NaN: pyarrow min/max kernels skip NaN, so _kernel_aggregate detects NaN per group and propagates it to match numpy. Float mean/variance agree within a documented KERNEL_RTOL (~1 ULP), not byte-identical — hence opt-in and gated.
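
The "detect NaN per group and propagate" step can be illustrated with numpy alone. This is a sketch under assumptions, not the real `_kernel_aggregate`: `np.fmin.reduceat` stands in for a grouped min kernel that skips NaN, and the variable names are made up for the example.

```python
import numpy as np

# Values already sorted by cell; `starts` marks the first row of each group.
values = np.array([1.0, np.nan, 3.0, 4.0, 2.0], dtype=np.float64)
starts = np.array([0, 3])  # group 0 = rows 0..2, group 1 = rows 3..4

# np.fmin ignores NaN, standing in here for a NaN-skipping grouped min kernel.
skipping_min = np.fmin.reduceat(values, starts)

# Flag groups that contain any NaN, then overwrite their result with NaN
# so the output matches what np.min would return for each group.
has_nan = np.logical_or.reduceat(np.isnan(values), starts)
numpy_like_min = np.where(has_nan, np.nan, skipping_min)
```

The same mask can be reused for max, since numpy propagates NaN through both reductions.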

Dual aggregation contract (this run)

Documents and tests the user contract @espg asked to make explicit (#33 comment):

Default, fully-supported contract: any aggregation expressible in numpy, including the NaN-aware family (np.nanmean, np.nanvar, np.nanmax, np.nanmin, np.nansum, np.nanstd, …). These already resolve with no code change — resolve_function does getattr(np, name) for a bare/np.-prefixed name and import for a dotted path — so the whole nan* family is usable directly from the agg template and runs through calculate_cell_statistics with numpy's own NaN semantics. No gap to close. Added test_numpy_nan_aware_functions proving correct NaN-aware per-cell results, plus a User contract docstring section and a Dual aggregation contract note in the EXPERIMENTAL block.
Arrow kernels are an opt-in acceleration for the kernel-able subset only — they do not narrow the contract. Note: Arrow NULL ≠ float NaN (skip_nulls does not skip NaN), so arrow kernels are not drop-in nan-operators; the experimental path replicates numpy's NaN behaviour by hand rather than conflating nulls and NaN.
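
For illustration, the name resolution described in the first bullet might look something like this. It is a hedged sketch, not the repository's `resolve_function`: the function name `resolve_function_sketch` and the exact prefix handling are assumptions, and nested numpy paths such as `np.linalg.norm` are deliberately out of scope.

```python
import importlib

import numpy as np


def resolve_function_sketch(name):
    """Resolve 'nanmean' / 'np.nanmean' via numpy, or 'pkg.mod.func' via import."""
    if name.startswith("np."):
        return getattr(np, name[len("np."):])
    if "." in name:
        # Dotted path: import the module part, then fetch the attribute.
        module_path, _, attr = name.rpartition(".")
        return getattr(importlib.import_module(module_path), attr)
    return getattr(np, name)


cell = np.array([1.0, np.nan, 3.0])
mean_skipping_nan = resolve_function_sketch("nanmean")(cell)
max_skipping_nan = resolve_function_sketch("np.nanmax")(cell)
plain_mean = resolve_function_sketch("mean")(cell)  # NaN propagates
```

Because the lookup is a plain `getattr` on numpy, the NaN-aware functions behave exactly as numpy defines them, with no Arrow null handling involved.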

How it was tested

Synced with main (merge, not rebase, to avoid force-pushing the shared branch; §1/§2 of CLAUDE.md): git merge origin/main was a clean merge with no conflicts (processing.py/tests untouched by the merged PRs). main brought in #32 (Rectilinear grid: chunk-driven auto-padding + run enablement, WIP; requires-python <3.14), concurrency, shardmap/rectilinear, and docs.
test_numpy_nan_aware_functions (NaN-aware contract), plus the existing TestBuildGroups / TestArrowHandoff / TestKernelHandoff / TestProcessShardKernelBranch suites.
Full suite post-merge: 265 passed, 1 skipped; ruff check --select=E,F,W,I --ignore=E501 (ruff 0.14.10, the pinned CI version) clean on the touched files; benchmarks/handoff_bench.py runs and asserts parity.

Phases

Phase 1: sort/hash grouping + dict interface for calculate_cell_statistics
Phase 2: Arrow handoff carrier (handoff="arrow"), _group_columns core, _read_group Arrow mode, synthetic benchmark + parity test — additive, pandas remains default
Experimental: opt-in pyarrow.compute hash-aggregate kernels (handoff="arrow-kernel") for the kernel-able subset, with documented KERNEL_RTOL and numpy-matching NaN min/max semantics
Numpy nan-op contract documented + tested; dual-contract note added
Phase 3 (deferred → tracked on #30): real-data benchmark on ATL03 regions (NEON Maryland / Russell Glacier / Bahamas × time windows), wall-time + peak RSS, pandas vs arrow vs arrow-kernel (needs earthaccess/S3)
Phase 4 (deferred → tracked on #30): write_dataframe_to_zarr Arrow-table sibling for the Arrow path

Phases 3 and 4 are split out to a follow-up per @espg's request to finalize and merge this PR now. They will be tracked off #30 unless @espg prefers a dedicated tracking issue (asked on the thread — opening a new issue is a side-effecting action I'm leaving to @espg).

Questions for review

Confirm the dual contract framing matches your intent: numpy (incl. nan*) is the supported surface; arrow kernels are opt-in acceleration only, kept gated behind handoff="arrow-kernel" pending the phase-3 benchmark.
Track phases 3/4 on #30, or open a dedicated follow-up issue?

espg

🤖 from Claude (review)

Phase 1 review — sort/hash grouping refactor

CI is currently red (1 failure, test 3.12). Several additional issues found. Ordered by severity.

1. CI failure — missed call site in `test_config.py` (blocking)

tests/test_config.py::TestEquivalence::test_config_matches_calculate_cell_statistics is still calling calculate_cell_statistics(synthetic_df) with a pd.DataFrame. The if cell_data else 0 guard on line 159 hits pandas' __bool__, which raises:

ValueError: The truth value of a DataFrame is ambiguous.

The PR updated tests/test_processing.py but missed this test in tests/test_config.py. That test must be updated (or a compat shim added — see item 3 below) before CI goes green.

2. `_build_groups` crashes on empty DataFrame (correctness bug)

If df_all has zero rows, cell_col is also empty. The boundary arithmetic produces:

starts = np.concatenate([[0], []])  # → [0]
ends   = np.concatenate([[], [0]])  # → [0]
# zip([0], [0]) → one iteration
int(sorted_cells[0])  # IndexError — sorted_cells is empty

Any call path where df_all can be empty (e.g. a shard with no observations that nevertheless reaches process_shard) will crash. There is no test for this case. A guard like:

if len(cell_col) == 0:
    return {col: df_all[col].values for col in df_all.columns}, {}

at the top of _build_groups would fix it.

3. Breaking API change with no compat shim — callers outside `test_processing.py` were not swept

The calculate_cell_statistics signature change is breaking. The CI failure (item 1) proves at least one caller in test_config.py was not updated. Before marking this draft ready, a full grep for calculate_cell_statistics across the repo is warranted to catch any remaining callers. Consider whether a thin shim (checking isinstance(cell_data, pd.DataFrame) and extracting .values) is preferable to a mechanical sweep — the PR description's "Questions for review" already flags this; it needs an answer before phase 1 lands.

4. `if cell_data else 0` is the wrong emptiness guard

Even after fixing the CI failure, the guard len(next(iter(cell_data.values()))) if cell_data else 0 has a subtle gap: a dict with keys but zero-length arrays ({"h_li": np.array([]), "s_li": np.array([])}) is truthy, so n_obs will correctly be 0. That case is actually fine.

The real problem is the guard does not express intent: it was written to protect against an empty dict {} falling into next(iter(...)). The new test test_empty_data_returns_zeros_and_nans passes an explicit {"h_li": np.array([]), ...} (non-empty dict with zero-length arrays), so the cell_data else 0 branch is dead in all tested paths. A cleaner and unambiguous form:

arrays = list(cell_data.values())
n_obs = len(arrays[0]) if arrays else 0

Note that this form does not make a DataFrame input work: DataFrame.values is a property rather than a method, so cell_data.values() would raise TypeError instead of reaching bool(df). The right fix for DataFrame callers is still item 3.

5. `_empty` sentinel is shared across all no-data cells — mutation hazard

_empty: dict[str, np.ndarray] = {col: arr[:0] for col, arr in col_arrays.items()}

_empty is reused by reference for every child with no observations. calculate_cell_statistics does not mutate its input today; the one code path that comes close is the one that resolves eval()-based params:

ns = {"__builtins__": {}, "np": np, "numpy": np, **cell_data}
resolved_params[pkey] = eval(pval, ns)

**cell_data unpacks into ns (a copy), so the arrays themselves are not mutated. That's safe today. But the dict _empty is the same object every iteration — if anyone ever writes cell_data[new_key] = ... inside calculate_cell_statistics, it would silently corrupt all subsequent empty-cell calls. Given the PR description already calls this out under "Questions for review," at minimum add a comment next to _empty noting the shared-reference contract, or use _empty as a template and shallow-copy per use: cell_data = dict(_empty).
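
A small sketch of the hazard and of the shallow-copy mitigation. `careless_reducer` is a hypothetical stand-in for a future change to calculate_cell_statistics that writes into its input; nothing in the current code does this.

```python
import numpy as np

col_arrays = {"h_li": np.array([1.0, 2.0]), "s_li": np.array([0.1, 0.2])}
empty_template = {col: arr[:0] for col, arr in col_arrays.items()}


def careless_reducer(cell_data):
    # Hypothetical future bug: the reducer adds a key to its input dict.
    cell_data["scratch"] = np.zeros(1)
    return len(cell_data)


# Passing a shallow copy per call keeps the template's key set intact.
careless_reducer(dict(empty_template))
keys_after_copy = sorted(empty_template)

# Passing the shared dict directly lets the mutation leak to later cells.
careless_reducer(empty_template)
keys_after_alias = sorted(empty_template)
```

The shallow copy protects the dict's keys only; the zero-length arrays themselves are still shared between calls.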

6. Parity test uses sorted order vs original order — floating-point results may differ (test correctness)

test_statistics_match_old_approach asserts exact equality (np.testing.assert_array_equal) between:

new_data: values extracted in sorted cell-id order from col_arrays
old_data: values extracted in original DataFrame row order via boolean mask

For order-independent aggregations (min, max, count) this is fine. For mean, std, weighted average, or any expression aggregation, floating-point summation is order-dependent. With float32 (lower precision, more rounding), a mismatch of 1 ULP is plausible. The test happens to pass because the RNG seed (7) doesn't hit a rounding boundary, but this is not a reliable guarantee.

The PR description claims "byte-for-byte identical outputs." That claim is not proven by this test as written: the test compares new-path-in-sorted-order vs reference-path-in-original-order. If h_mean or h_sigma differ by 1 ULP, the test will fail in CI on a different platform or numpy version. Either:

a) use assert_allclose with a tight rtol (acknowledging the claim is "within tolerance" not "bit-identical"), or
b) document explicitly that the sort reorders observations and float results are only reproducible at float64 precision, or
c) change the reference to also sort its mask-selected values before passing to calculate_cell_statistics, making the comparison truly order-matched.

Option (c) is the most honest demonstration of parity and costs one np.sort call per cell in the test.
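
To make the order-dependence concern concrete, here is a minimal float32 example. It only shows that accumulation order can change a sum; it says nothing about whether the test in question would actually fail. The helper `left_to_right_sum` is made up for the illustration.

```python
import numpy as np


def left_to_right_sum(values):
    """Accumulate in float32, strictly in the given order."""
    total = np.float32(0.0)
    for v in values:
        total = np.float32(total + np.float32(v))
    return total


a = [1e8, 1.0, -1e8]  # the 1.0 is absorbed by 1e8 before the cancellation
b = [1e8, -1e8, 1.0]  # same values, but the cancellation happens first
sum_a = left_to_right_sum(a)
sum_b = left_to_right_sum(b)
```

Near 1e8 the spacing between adjacent float32 values is 8, so adding 1.0 is lost in the first ordering and kept in the second.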

7. `test_slice_counts_match_per_cell_mask` compares sorted values, not ordered slices

np.testing.assert_array_equal(np.sort(new_vals), np.sort(old_vals))

This verifies multiset identity (same elements), not that the slice is contiguous or the sort is stable. The test name says "counts match" but the assertion checks multiset equality of values: stronger than needed for a boundary test, weaker than needed for an ordering test. Not a bug, but the name and assertion are slightly at odds. Renaming to test_slice_values_match_per_cell_mask, or keeping the sorted comparison and adding a comment that explains the sort, would make intent clearer.

8. No test for `_build_groups` with empty input

Follows directly from item 2. After the crash fix, add:

def test_empty_dataframe(self):
    df = pd.DataFrame({"h_li": np.array([]), "s_li": np.array([])})
    col_arrays, cell_to_slice = _build_groups(df, np.array([], dtype=np.int64))
    assert cell_to_slice == {}
    assert col_arrays["h_li"].shape == (0,)

9. No end-to-end test for `process_shard` with the new code path

process_shard is the function that actually changed behavior. The new test suite covers _build_groups and calculate_cell_statistics in isolation, but there is no test that exercises process_shard end-to-end through the new _build_groups → dict-slice → calculate_cell_statistics path. A smoke test (even a small synthetic one) would catch regressions like item 1 at the integration level. At minimum, there should be a note under "Questions for review" whether the existing integration tests in test_integration.py cover this path — the coverage report shows processing.py at only 33%, which suggests they may not.

10. `noqa: S307` is a no-op under this repo's ruff config (minor)

Ruff is configured with select = [E, F, W, I, N]. Rule S307 is in the S (bandit) ruleset, which is not selected. The # noqa: S307 comments are harmless but misleading — they imply the rule is active and being suppressed. This carried over from pre-refactor code; worth removing or converting to a comment explaining the eval is intentional.

Summary table

| # | Severity | Category | Blocks merge? |
|---|----------|----------|---------------|
| 1 | Critical | CI failure: missed caller in `test_config.py` | Yes |
| 2 | High | Crash on empty DataFrame in `_build_groups` | Yes |
| 3 | High | Breaking API: callers not fully swept | Yes |
| 4 | Medium | Wrong emptiness guard (`if cell_data`) | Recommend fix |
| 5 | Medium | Shared `_empty` sentinel mutation hazard | Recommend comment or copy |
| 6 | Medium | Parity test doesn't prove bit-identical claim | Recommend fix |
| 7 | Low | Test name vs assertion mismatch | Nit |
| 8 | Low | Missing empty-input test for `_build_groups` | Recommend add |
| 9 | Low | No `process_shard` integration smoke test | Recommend or explain |
| 10 | Nit | Dead `noqa: S307` comment | Nit |

Items 1–3 block merge; item 1 is the failure currently turning CI red. Items 4–6 are correctness/reliability concerns I'd want addressed before this leaves draft. Items 7–10 can be resolved in-thread or deferred to phase 2 if explained.

Generated by Claude Code

espg · 2026-06-12T23:21:33Z

🤖 from Claude

Follow-up commit (faa09c4) addresses three findings from the adversarial review pass:

Critical — CI red (test_config.py)
calculate_cell_statistics now expects dict[str, np.ndarray] but TestEquivalence::test_config_matches_calculate_cell_statistics was still passing a pd.DataFrame — triggering ValueError: The truth value of a DataFrame is ambiguous in the empty-check guard. Fixed by converting inline:

cell_data = {col: synthetic_df[col].values for col in synthetic_df.columns}
expected = calculate_cell_statistics(cell_data)

High — _build_groups IndexError on empty shard
When cell_col is empty (all granules filtered out), sorted_cells is empty. starts = [0], ends = [0], so zip(starts, ends) yields one (0, 0) pair, and sorted_cells[0] raises IndexError. Fixed with an early-return guard:

if len(sorted_cells) == 0:
    return col_arrays, {}

Medium — weak parity assertion in test_slice_counts_match_per_cell_mask
np.testing.assert_array_equal(np.sort(new_vals), np.sort(old_vals)) only proved set equality, not element order. Since _build_groups uses kind="stable", within-cell element order is preserved, so the stronger assertion np.testing.assert_array_equal(new_vals, old_vals) should hold — and does. This matters because floating-point operations on float32 are order-dependent; the weaker test wouldn't catch a sort that scrambled element order within a cell.
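
The ordering claim can be checked on synthetic data along these lines. This is a sketch with made-up data, not the repository's test; `np.searchsorted` is used here only as a convenient way to find each cell's slice.

```python
import numpy as np

rng = np.random.default_rng(7)
cell_col = rng.integers(0, 4, size=50)
h_li = rng.normal(size=50).astype(np.float32)

sort_idx = np.argsort(cell_col, kind="stable")
sorted_cells = cell_col[sort_idx]
sorted_h = h_li[sort_idx]

order_matches = True
for cell in np.unique(cell_col):
    start = np.searchsorted(sorted_cells, cell, side="left")
    end = np.searchsorted(sorted_cells, cell, side="right")
    new_vals = sorted_h[start:end]
    old_vals = h_li[cell_col == cell]  # original row order via boolean mask
    order_matches &= bool(np.array_equal(new_vals, old_vals))
```

With `kind="stable"` the slice and the mask selection agree element by element, so no `np.sort` is needed on either side of the comparison.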

Generated by Claude Code

espg · 2026-06-12T23:52:31Z

The _build_groups helper copies sorted column arrays; for large shards (10M+ obs) this is ~3× the observation size in extra memory. Is this acceptable for the Lambda 2 GB budget, or should the sort be done in-place on df_all?

This seems like a problem. Moving to ATL03 from ATL06 is orders of magnitude more data; we'd like to become more efficient, not less. Isn't this why we're moving to arrow as a backend? Is this just a transitory side effect (i.e., because we haven't completed the other phases), or are we explicitly trading computational enhancements for memory bloat?

calculate_cell_statistics is now a breaking change for any external callers passing pd.DataFrame. Is there a known external caller that needs a compatibility shim, or is this internal-only?

The case for this is the future temporal module (in #12 and #15), along with the eventual user antarctic_AR_dataset. What's the fuller picture of what's going on with calculate_cell_statistics?

The _empty sentinel (zero-length slices of each column) is shared across all empty cells. If calculate_cell_statistics were ever to modify its inputs, this would be a bug. Currently it doesn't — worth calling out explicitly?

Not sure. Do you mean on the data side (i.e., in zarr), or internally for our own in-memory usage? Where's the seam here?

espg · 2026-06-12T23:53:35Z

Claude, you can implement phase 2 in parallel to answering my questions above

espg · 2026-06-13T16:41:49Z

🤖 from Claude

Phase 2 pushed (8140437) — the additive Arrow handoff carrier, plus a CI-runnable synthetic benchmark and a pandas-vs-arrow parity test. Then answering your three questions, because they shape what phase 2 is for.

Phase 2 — what landed

process_shard(..., handoff="pandas"|"arrow"), default "pandas" → zero behavior change unless opted in.
Extracted a carrier-agnostic core _group_columns(col_dict, cell_col); _build_groups (pandas) is now a thin wrapper over it, and the Arrow path feeds the same numpy arrays via to_numpy(zero_copy_only=False). Because the reductions run on identical numpy arrays, scalar outputs are byte-for-byte identical — asserted in TestArrowHandoff.
_read_group(..., arrow=True) returns a pyarrow.Table; the Arrow path does concat_tables(...).combine_chunks() then zero-copy column extraction.
benchmarks/handoff_bench.py: times {mask-loop, pandas-group, arrow-group} on synthetic obs, asserting parity. On 300k obs / 1024 cells: mask-loop 1.55s, pandas-group 1.23s @ 16.8 MB, arrow-group 1.30s @ 12.2 MB (Python-domain peak).
I deliberately did not route mean/var through pyarrow.compute hash-aggregate kernels: grouped pyarrow float32 reductions are not byte-for-byte equal to numpy (different summation), which would break this issue's "identical outputs" criterion. That kernel sub-lever needs your call on whether to accept kernel-level float divergence — flagged as phase 2b in the checklist rather than smuggled in.

Q1 — memory bloat at ATL03 scale

We are not trading compute for memory; phase 2 is the structural fix for exactly this worry.

The "~3×" in my phase-1 note was conservative. The real cost is one reordered copy per column — the [sort_idx] gather. Any grouping that yields contiguous per-cell slices must materialize that permutation; numpy has no in-place gather. So peak during grouping is ~2× the obs columns (original + sorted), transitory, released after. The old mask loop wasn't free either — every df_all[cell_col == child] copied the matched rows into a fresh per-cell DataFrame. We replaced n_cells small copies (+ n_cells DataFrame objects) with one big gather.
The Arrow carrier lowers the constant: to_numpy(zero_copy_only=False) on a non-null int64/float32 column shares the Arrow buffer (no copy), and Arrow has no BlockManager/Index overhead. Hence 12.2 vs 16.8 MB above — the ~27% gap is the pandas object/Index overhead Arrow drops, which is what scales badly to ATL03 volume.
If peak still matters on a real shard, the cheap follow-up is dropping df_all immediately after grouping (we hold original + sorted simultaneously today). I left that out until phase 3 quantifies it on real data rather than guessing.

So: phase 1 cut compute O(n_cells·n)→O(n log n) at ~flat memory; phase 2 cuts the memory constant. The ATL03-scale answer is "Arrow carrier + measure on a real shard (phase 3)."

Q2 — `calculate_cell_statistics` breaking change, fuller picture

Internal-only today: the sole caller is process_shard. The signature went pd.DataFrame → dict[str, np.ndarray] on purpose — that dict is the lowest common denominator across every backend (#20): pandas .values, Arrow to_numpy, and a future vaex/cluster groupby all yield numpy arrays. Keeping it pandas-typed would re-couple the reducer to pandas, the exact thing #30 unwinds.

The future consumers you name fit this better, not worse. The temporal module (#12/#15) and antarctic_AR_dataset do per-group array→dict reduction; dict[str, ndarray] is the right shared interface for them. And the non-scalar outputs in #29 (Support non-scalar aggregation outputs: vectors + ragged per-cell payloads) extend this same function to return ndarray/bytes per field; the dict-in contract is what makes that clean.
If a real external caller passes a DataFrame, a one-line shim restores compat: if isinstance(cell_data, pd.DataFrame): cell_data = {c: cell_data[c].values for c in cell_data.columns}. I left it out (no known external caller; CLAUDE.md §4 favors no speculative abstraction). If antarctic_AR_dataset pins the DataFrame signature, say so and I'll add the shim.
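
Spelled out as a standalone helper, the shim mentioned above could look like this. `as_column_dict` is a hypothetical name; the body is the one-liner from the bullet, and the `bool(df)` check reproduces the error that turned CI red.

```python
import numpy as np
import pandas as pd


def as_column_dict(cell_data):
    """Hypothetical shim: accept a DataFrame or a dict of arrays, return a dict."""
    if isinstance(cell_data, pd.DataFrame):
        return {c: cell_data[c].values for c in cell_data.columns}
    return cell_data


df = pd.DataFrame({"h_li": [1.0, 2.0], "s_li": [0.1, 0.2]})
converted = as_column_dict(df)

# Without the shim, a truthiness check on the DataFrame raises.
try:
    bool(df)
    raised = False
except ValueError:
    raised = True
```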

Q3 — the `_empty` sentinel seam

Purely in-memory, not the zarr/data side. The seam is in process_shard: _empty = {col: arr[:0] ...} is built once and handed as cell_data to every no-observation child cell, so all empty cells alias the same zero-length arrays. Safe today because calculate_cell_statistics is read-only — it special-cases n_obs == 0 and returns fill values without ever indexing the arrays. The latent bug would be a future field path that mutates cell_data in place (e.g. an in-place vector accumulator under #29) — then every empty cell would corrupt a shared buffer.

It's a read-only contract that's honored but unenforced. Options: (a) keep it + a one-line comment stating the read-only assumption; (b) fresh _empty per cell (defensive but re-introduces the per-cell allocation phase 1 removed); (c) rely on the existing n_obs==0 early return (which already means the arrays are never touched). I lean (a) and will add the comment. #29's vector path returns new arrays rather than mutating, so the contract holds forward; worth stating explicitly so it stays that way.

Happy to fold the del df_all, the compat shim, and the _empty read-only comment into the next push if you want any of them now.

Generated by Claude Code

espg

🤖 from Claude (review)

Adversarial review of the latest commit only (8140437, phase 2 — Arrow handoff). Phase 1 (f0e3890) was not re-reviewed except where phase 2 touches it. Not approving / not requesting changes — leaving findings for the author.

Overall: The grouping refactor is clean and the carrier-agnostic _group_columns / _build_groups split is a sensible seam. The calculate_cell_statistics dict-of-arrays conversion is faithful to the old .values math. Lever 1 (sort/hash grouping) looks behavior-preserving and is reasonably tested via TestBuildGroups. My concerns are concentrated on the byte-for-byte parity claim and the benchmark / test coverage of the new Arrow path — i.e. exactly the acceptance criteria the issue calls out.

Must-fix / discuss before this is "done":

Parity rests on an unguarded precondition (inline at processing.py arrow branch). to_numpy(zero_copy_only=False) only reproduces pandas' .values byte-for-byte for dense, null-free columns. h5coro delivers that today, but nothing asserts it, and #29 (non-scalar / potentially null-bearing columns) is explicitly the next consumer. A null-bearing integer column would upcast to float64+NaN on the Arrow side and diverge silently. Assert or document the precondition, and flag it under "Questions for review."
The production Arrow branch in process_shard has no test (inline). TestArrowHandoff re-implements the pa.table → combine_chunks → to_numpy → _group_columns sequence inside the test instead of calling process_shard(handoff="arrow") or a shared helper. The real lines — pa.concat_tables(...).combine_chunks(), n_obs_total, the col_dict extraction, _empty plumbing — are uncovered, and the multi-table concat_tables ordering (the one thing the single-table test can't reach) is never exercised. Suggest extracting a small _concat_and_group(reads, grid, use_arrow) helper that's directly testable with two in-memory carriers.

Should-address:

Benchmark doesn't meet the issue's criterion (inline on handoff_bench.py). It's synthetic (no _read_group/concat_tables/real density), and tracemalloc "peak_MB" omits numpy buffers — where a real shard's memory actually lives — yet is printed without a caveat. Fine as an additive CI-runnable harness, but it does not satisfy "wall-time + peak memory on a real shard," so don't tick that box. The parity assert is also narrow (asserts argsort determinism more than carrier equivalence).
"View"/zero-copy framing is half-true (inline on _group_columns). arr[sort_idx] is a full reorder copy for both carriers; only the post-sort [start:end] slices are views. The docstring and benchmark framing should not imply Arrow extraction is zero-copy here — it isn't.

Nits:

n_obs = len(next(iter(cell_data.values()))) no longer enforces equal column lengths the way the DataFrame did (inline).
cells_with_data is now incremented only in the present-cell arm; equivalent to the old count only because cell_to_slice never holds zero-length slices. A targeted test on a shard with empty children would lock down this metadata field (inline).

Lint/style refactors bundled into this commit (multi-line open_array, warning-string joins, blank lines after import) are cosmetic and fine. No # noqa/# type: ignore abuse introduced; module stays well under the ~1000-line guideline. Pandas remains the default and the handoff validation is correct, so the additive/opt-in requirement is met.

Generated by Claude Code

espg · 2026-06-13T16:46:26Z

+    if use_arrow:
+        import pyarrow as pa
+
+        table = pa.concat_tables(all_reads).combine_chunks()
+        n_obs_total = table.num_rows
+        leaf_ids = table.column("leaf_id").to_numpy(zero_copy_only=False)
+        cell_col = grid.cells_of(leaf_ids)
+        col_dict = {
+            name: table.column(name).to_numpy(zero_copy_only=False) for name in table.column_names
+        }
+        col_arrays, cell_to_slice = _group_columns(col_dict, cell_col)


🤖 from Claude (review)

The byte-for-byte parity claim hinges on an unguarded assumption: that to_numpy(zero_copy_only=False) reproduces the exact same numpy array (dtype + values) that pandas' .values would. That holds only for dense, null-free, non-extension columns — which is what h5coro currently delivers, so this likely works in practice. But the issue (#30) itself flags the footgun, and nothing here enforces it:

Nulls: if any Arrow column ever carried nulls, to_numpy(zero_copy_only=False) upcasts an integer column to float64 and fills with NaN, while the pandas carrier (built from the same dense numpy) would not. The two paths would then diverge silently — exactly the failure the "byte-for-byte" criterion is meant to rule out.

dtype drift on leaf_id: line 464 runs cells_of on table.column("leaf_id").to_numpy(...) whereas the pandas branch (line ~473) runs it on df_all["leaf_id"].values. If Arrow yields a different integer width or a read-only buffer, grid.cells_of could in principle produce a different cell_col. Worth confirming cells_of is dtype-insensitive here.

Recommend either (a) asserting null-free / expected dtype on the extracted columns, or (b) at minimum documenting the precondition in the docstring so a future non-scalar/null-bearing column (#29) doesn't quietly break parity. This is the central correctness risk of the phase, so please call it out under "Questions for review" even if you judge it safe today.

Generated by Claude Code

espg · 2026-06-13T16:46:33Z

+    if use_arrow:
+        import pyarrow as pa
+
+        table = pa.concat_tables(all_reads).combine_chunks()


🤖 from Claude (review)

The production Arrow branch in process_shard is entirely untested. TestArrowHandoff.test_arrow_grouping_matches_pandas rebuilds the pa.table(...).combine_chunks() → to_numpy → _group_columns sequence inline in the test, rather than driving it through process_shard(handoff="arrow"). So the actual lines here — pa.concat_tables(all_reads).combine_chunks(), n_obs_total = table.num_rows, the col_dict comprehension, and the _empty/slice plumbing below — have zero coverage. A typo or column-name mismatch in this branch would pass CI.

I understand process_shard can't be unit-tested end to end without S3/h5coro, but the concat→extract→group block could be lifted into a small carrier-agnostic helper (e.g. _concat_and_group(reads, grid, use_arrow)) that is directly testable with two in-memory tables / DataFrames. That would let a test exercise the real code rather than a re-implementation of it, and would catch divergence in the concat_tables ordering specifically (multi-table concat is the one thing the single-table pa.table(...) test path never touches).

Generated by Claude Code

espg · 2026-06-13T16:46:46Z

+    """
+    sort_idx = np.argsort(cell_col, kind="stable")
+    sorted_cells = cell_col[sort_idx]
+    col_arrays = {col: arr[sort_idx] for col, arr in col_dict.items()}
+    if len(sorted_cells) == 0:


🤖 from Claude (review)

Docstring says "each cell's observations form a contiguous slice, so col_arrays[col][start:end] is a view." That second clause is the load-bearing claim the issue makes about zero-copy extraction, but col_arrays is produced by arr[sort_idx] (line 53) — fancy indexing, which always copies. The per-cell [start:end] of that already-copied, reordered array is a view, but the array as a whole is a full reorder copy of the input — for both carriers. So the "zero-copy slice" framing in #30 is only half-true: the dominant cost is the argsort reorder copy, paid identically by pandas and Arrow, not the slicing. Worth tightening the docstring (and the benchmark's framing — see handoff_bench.py) so the benchmark conclusion isn't read as "Arrow extraction is zero-copy" when it isn't here.
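
The copy-versus-view distinction is easy to confirm with `np.shares_memory`. A minimal sketch on made-up data:

```python
import numpy as np

arr = np.arange(10, dtype=np.float32)
cell_col = np.array([2, 0, 1, 0, 2, 1, 0, 2, 1, 0])
sort_idx = np.argsort(cell_col, kind="stable")

reordered = arr[sort_idx]    # fancy indexing: a new buffer
cell_slice = reordered[0:4]  # basic slicing: a view into `reordered`

gather_is_copy = not np.shares_memory(reordered, arr)
slice_is_view = np.shares_memory(cell_slice, reordered)
```

Only the second step is free; the gather allocates a full-size array per column regardless of which carrier supplied the input.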

Generated by Claude Code

espg · 2026-06-13T16:46:57Z

+
+This is the CI-runnable half of #30's benchmark: it isolates the grouping
+algorithm and the carrier representation cost with no I/O, so it runs anywhere


🤖 from Claude (review)

This benchmark does not actually measure what #30 asks for, and the gap should be stated plainly rather than only deferred to "phase 3":

It does not benchmark the production code paths. run_arrow() calls pa.table(col_dict).combine_chunks() on a single synthetic table — it never exercises pa.concat_tables (the multi-beam concat that process_shard actually runs), nor _read_group, nor the real density regime. Likewise the "carrier" difference being measured is mostly an artifact of constructing a pd.DataFrame vs a pa.table over the same numpy dict, then immediately converting back to numpy.

The tracemalloc "peak_MB" column is misleading as printed. The docstring honestly notes it misses numpy buffers, but the output table just prints peak_MB with no caveat — and numpy buffers are precisely where the bulk of a real shard's memory lives. A reader scanning the table will over-trust it. Either label the column (e.g. py_peak_MB) or print the caveat alongside the table.

The parity assertion is real but narrow: stats_equal uses exact !=, which is good for byte-for-byte — but it only compares the scalar dict outputs, and both run_pandas/run_arrow feed agg_grouped the same calculate_cell_statistics, so it's largely asserting that argsort is deterministic. The meaningful divergence vector (null/chunked Arrow columns, concat_tables ordering) isn't reachable from this synthetic.

None of this blocks landing an additive benchmark, but the issue's acceptance criterion is "a benchmark table reports wall-time + peak memory for pandas vs Arrow on a real shard" — this satisfies neither "real shard" nor a faithful peak-memory measure, so it shouldn't be presented as closing that criterion.
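
One possible way to report process-level peak memory, which would include numpy and Arrow buffers, is the standard-library `resource` module. This is a sketch, not what handoff_bench.py does; `peak_rss_mb` is a hypothetical helper and it works on Unix only.

```python
import resource
import sys


def peak_rss_mb():
    """Process-wide peak resident set size in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


peak = peak_rss_mb()
```

Because ru_maxrss is a high-water mark for the whole process, each carrier would need to run in its own fresh subprocess for the numbers to be comparable.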

Generated by Claude Code

espg · 2026-06-13T16:47:06Z

    agg_fields = get_agg_fields(config)

-    if len(df_cell) == 0:
+    n_obs = len(next(iter(cell_data.values()))) if cell_data else 0


🤖 from Claude (review)

n_obs = len(next(iter(cell_data.values()))) if cell_data else 0 — minor robustness note. This takes the length of an arbitrary first column to stand in for the old len(df_cell). That's correct as long as every column is equal-length (the docstring asserts it), but it silently returns whatever the first column's length is if that invariant is ever violated, rather than failing. The old len(df_cell) had the DataFrame enforce equal column lengths at construction. Not worth a guard on the hot path, but flagging that the equal-length precondition is now unchecked.
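For reference, the unchecked precondition could be made explicit in a debug-only or test-only helper. A sketch follows; `check_equal_length` is a hypothetical name, not a function in this PR.

```python
import numpy as np


def check_equal_length(cell_data: dict[str, np.ndarray]) -> int:
    """Return the shared column length, raising if columns disagree.

    Hypothetical helper for illustration; not part of the PR.
    """
    lengths = {col: len(arr) for col, arr in cell_data.items()}
    distinct = set(lengths.values())
    if len(distinct) > 1:
        raise ValueError(f"columns differ in length: {lengths}")
    # An empty dict has no columns, so it holds zero observations.
    return distinct.pop() if distinct else 0


n_obs = check_equal_length({"h_li": np.zeros(3), "s_li": np.ones(3)})
```

Calling it from a test rather than from the hot path would keep the per-cell cost unchanged.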

Generated by Claude Code

espg · 2026-06-13T16:47:13Z

+    _empty: dict[str, np.ndarray] = {col: arr[:0] for col, arr in col_arrays.items()}
+
    cells_with_data = 0
-    cell_col = grid.cells_of(df_all["leaf_id"].values)
    for i, child_morton in enumerate(children):
-        df_cell = df_all[cell_col == child_morton]
-        if len(df_cell) > 0:
+        if child_morton in cell_to_slice:
+            start, end = cell_to_slice[child_morton]
+            cell_data: dict[str, np.ndarray] = {
+                col: arr[start:end] for col, arr in col_arrays.items()
+            }


🤖 from Claude (review)

_empty is rebuilt as a fresh dict of zero-length slices and reused for every absent child. Two small points:

It aliases the same _empty dict object across all empty cells. calculate_cell_statistics only reads it, so this is safe today — but it's a latent footgun if a future change ever mutates cell_data in place. A comment noting the intentional aliasing would help.

Behavior check vs the old loop: previously every child went through df_all[cell_col == child_morton] and calculate_cell_statistics(df_cell) even when empty, and cells_with_data incremented only when len(df_cell) > 0. The new branch increments cells_with_data only inside the if child_morton in cell_to_slice arm — which is equivalent iff cell_to_slice never contains a key with a zero-length slice. Given _group_columns only emits keys for observed cells, that holds. Worth a one-line test asserting cells_with_data matches the old count on a shard with some empty children, since this metric feeds metadata and isn't covered.
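The equivalence argued above can be sketched as a small self-contained check. `group_slices` is an illustrative stand-in for the grouping step, not the PR's `_group_columns`, and the cell ids are made up.

```python
import numpy as np


def group_slices(cell_ids: np.ndarray) -> dict[int, tuple[int, int]]:
    """Sort once, then map each observed cell id to its [start, end) slice.

    Illustrative stand-in; not the PR's _group_columns.
    """
    if len(cell_ids) == 0:
        return {}
    sorted_ids = cell_ids[np.argsort(cell_ids, kind="stable")]
    # Positions where the sorted id changes mark group boundaries.
    starts = np.flatnonzero(np.r_[True, sorted_ids[1:] != sorted_ids[:-1]])
    ends = np.r_[starts[1:], len(sorted_ids)]
    return {int(sorted_ids[s]): (int(s), int(e)) for s, e in zip(starts, ends)}


children = np.array([10, 11, 12, 13])      # 12 and 13 have no observations
cell_ids = np.array([11, 10, 11, 11, 10])

cell_to_slice = group_slices(cell_ids)

# New-style count: membership in the slice map.
new_count = sum(1 for child in children if int(child) in cell_to_slice)
# Old-style count: one boolean mask per child.
old_count = sum(1 for child in children if np.any(cell_ids == child))
```

Because keys are emitted only at group boundaries of observed ids, no slice in the map can be zero-length, which is the condition the equivalence rests on.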

Generated by Claude Code

espg · 2026-06-13T16:50:11Z

🤖 from Claude

A fresh-context review pass flagged two parity concerns worth grounding for you (the inline comments stand for your call — I'm not resolving them):

Null-column divergence (to_numpy(zero_copy_only=False) vs .values). Correct in general, but it cannot arise in this pipeline today: the per-group columns are built in _read_group from raw h5coro reads (data[path][mask_sliced]) — dense numpy arrays with no nullmask — so pa.table(data_dict) produces null-free columns and the round-trip is dtype-exact. So scalar parity holds for the real ATL06/ATL03 path now. The reviewer's point is that it's an unguarded precondition; I'd address it with an explicit dense/null-free assertion in the arrow branch — cheap hardening I'll add if you want it enforced.
The process_shard arrow branch isn't tested end-to-end (TestArrowHandoff re-runs the extract/group sequence inline rather than driving the production code, and never exercises concat_tables). Fair — the clean fix is to extract a _concat_and_group(all_reads, grid, handoff) helper used by both process_shard and the test, so the test covers the real carrier path including multi-table concat ordering. That's a small refactor; I left it as review feedback rather than churn the PR unprompted.

Both are real and both are improvements I'm happy to push (null guard + _concat_and_group helper/test) — just confirm you want them and I'll land them on this branch. Nothing here breaks current parity; it's about enforcing and testing the guarantee rather than resting it on the read path staying null-free.

Generated by Claude Code

espg · 2026-06-14T05:51:32Z

Both are real and both are improvements I'm happy to push (null guard + _concat_and_group helper/test)

yup, land them in this branch.

can you say more about this?

I deliberately did not route mean/var through pyarrow.compute hash-aggregate kernels: grouped pyarrow float32 reductions are not byte-for-byte equal to numpy (different summation), which would break this issue's "identical outputs" criterion. That kernel sub-lever needs your call on whether to accept kernel-level float divergence — flagged as phase 2b in the checklist rather than smuggled in.

Is the pyarrow mean/var expected to be a performance win? It might be worth discussing whether we should loosen the acceptance criteria to check np.allclose rather than enforce byte compatibility. Flesh this out; I want the trade space of pros and cons on this decision.

espg · 2026-06-15T02:35:08Z

🤖 from Claude

Pushed 39d24b6 — landed the two review items you green-lit, then the pyarrow-kernel trade space you asked for.

Landed (review findings)

_concat_and_group(all_reads, grid, handoff) helper — extracted the concat+extract+group seam out of process_shard so the Arrow path is driven end-to-end, including multi-table pa.concat_tables ordering, by both the production code and the test. process_shard now calls it; no behavior change.
Null-free guard on the arrow carrier — _concat_and_group raises ValueError("arrow handoff requires null-free columns; …") if any concatenated column has a non-zero null_count. The read path (_read_group) builds tables from dense h5coro reads with no null mask, so to_numpy(zero_copy_only=False) is dtype-exact today; this makes that precondition enforced instead of resting on the read path staying null-free.
Tests: test_concat_and_group_arrow_matches_pandas drives the helper with three reads of differing length (exercises concat offsets/ordering) and asserts byte-identical per-cell stats pandas-vs-arrow; test_concat_and_group_arrow_rejects_nulls asserts the guard fires. Full suite 215 passed, 1 skipped; ruff check + ruff format --check clean; no new mypy findings (the two pre-existing processing.py ones remain).

pyarrow.compute hash-aggregate kernels — the trade space (phase 2b)

Is the pyarrow mean/var expected to be a performance win? Should we loosen acceptance to np.allclose?

Is it a win? Potentially yes for the kernel-able stats (count/min/max/sum/mean/var), but it's an untested lever and a partial one:

pa.TableGroupBy.aggregate (Acero) does group-by + reduction in one vectorized C++ pass (SIMD, no Python-level per-cell loop), replacing both our np.argsort and the per-child Python reduction loop. For an O6→O12 shard (4,096 cells) at ATL03 volume that's the most plausible speedup in the pipeline.
But it only moves the reductions. We still need the per-cell slices for everything non-kernel-able — Support non-scalar aggregation outputs (vectors + ragged per-cell payloads) #29's fixed-width vectors, quantiles, any custom reducer — so unless all fields are kernel-able it's a mixed path, not a wholesale replacement. And our current group is already O(n log n); the synthetic bench had arrow-group ≈ pandas-group on wall time, so the kernel win is real only if a real-ATL03 (phase 3) measurement shows it.

The byte-identical vs np.allclose fork is the actual decision, and I'd not loosen the global criterion. Why the kernels diverge: pyarrow uses different (often more accurate — pairwise/Kahan-ish) float summation than numpy's naive accumulation, so on float32 means over millions of obs you get last-1–2-ULP differences (~1e-6–1e-4 relative). Scientifically negligible, but it has costs:

| | Keep byte-identical (default) | Loosen to `np.allclose` globally |
|---|---|---|
| Regression signal | Strongest: exact cross-carrier/refactor equality, cheap | Weaker: a too-loose rtol can mask a real grouping/slice bug as "close enough" |
| Determinism of the stored product | Same bytes regardless of carrier, which is clean for your #24 drift-tracking | Output becomes carrier-dependent; two runs (pandas vs kernel) differ in low ULPs, a new non-determinism to document |
| Backend freedom | Forbids kernel/polars/duckdb summation orders | Unlocks them |
| #29 dependency | Vector path inherits an exact contract | Inherits a tolerance it must also honor |

Recommendation: keep byte-identical as the contract for the pandas↔arrow-carrier equivalence (it's free — this PR already guarantees it, both carriers feed identical numpy arrays), and make the kernel reduction a separate opt-in (reducer="kernel" / handoff="arrow-kernel") governed by an explicit, documented assert_allclose tolerance (e.g. rtol=1e-6 for float32 means) rather than loosening the whole suite. That quarantines the float divergence behind a flag, keeps the default path #29 builds on provably exact, and defers building it at all until the phase-3 real-ATL03 benchmark shows a material win. If the win is marginal, we don't take on the determinism cost.

So my call: don't loosen the global criterion; gate kernels behind an opt-in + documented tolerance, decided by the phase-3 measurement. If you'd rather I prototype the kernel path now (behind the flag) so phase 3 can benchmark all three (mask / numpy-group / kernel), say the word.
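To make the two contracts concrete, here is a small NumPy-only sketch. It does not call pyarrow; it only shows that float32 results depend on summation order and how an exact check and a tolerance check treat a one-ULP difference. The `rtol=1e-6` value is the example tolerance mentioned above, not a measured bound.

```python
import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)

# float32 addition is not associative: the grouping changes the result.
left = (a + c) + b    # the 1.0 is absorbed by 1e8 before the cancellation
right = (a + b) + c   # the cancellation happens first, so the 1.0 survives

# Two float32 values one ULP apart, as two different reducers might produce.
v = np.float32(1.0)
w = np.nextafter(v, np.float32(2.0))

exact_equal = bool(v == w)                                   # byte-identical contract
close_enough = bool(np.isclose(v, w, rtol=1e-6, atol=0.0))   # tolerance contract
```

The exact check rejects the pair and the tolerance check accepts it, which is why the tolerance is kept behind the opt-in flag rather than applied to the whole suite.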

Status

All your diff-scoped requests are addressed. Remaining phases now hinge on you: phase 2b needs the kernel/tolerance decision above, and phase 3 (real-ATL03 wall-time + RSS, pandas vs arrow) needs earthaccess/S3 access I don't have in this environment. Phase 4 (Arrow-table write_dataframe_to_zarr sibling) is additive and I can do it on request. Marking waiting for your 2b call; happy to clear it and continue the moment you decide. Draft until then.

Generated by Claude Code

espg

🤖 from Claude (review)

Reviewed commit 39d24b6 against the two prior findings @espg approved landing.

Behavior preservation — verified. _concat_and_group is a faithful extraction of the prior inline process_shard code. Arrow branch is byte-identical to before (concat_tables().combine_chunks(), to_numpy(zero_copy_only=False), _group_columns, n_obs_total = table.num_rows); pandas branch unchanged (pd.concat(ignore_index=True), len(df_all), _build_groups). The branch selector switched from the local use_arrow bool to re-checking handoff == "arrow", but since handoff is validated to {"pandas","arrow"} upstream and use_arrow = handoff == "arrow", the two are equivalent. grid.cells_of usage and grouping are identical on both paths.

Null guard — correct. table.column(n).null_count sums nulls across the (already combined) chunks; nonzero is truthy, so the guard fires on any nullable column. This catches the real divergence: a null would become NaN under to_numpy(zero_copy_only=False), silently corrupting either a leaf_id group key or a stat. Failing loudly is the right contract.

Tests — substantive, not vacuous. test_concat_and_group_arrow_matches_pandas uses three reads of differing length (40/7/53), genuinely exercising the multi-table concat_tables path and offset handling, and compares per-cell stats pandas-vs-arrow with assert_array_equal (byte-for-byte). The guard test feeds a real arrow null and asserts ValueError/null-free. _IdentityGrid is a fair stand-in — both carriers hand cells_of a plain ndarray, so np.asarray is a no-op that isolates carrier mechanics from grid semantics as the docstring claims. PR-comment claims all check out against the diff.

Two minor, non-blocking observations inline. No correctness issues found; nothing blocks the next phase.

Generated by Claude Code

espg

🤖 from Claude (review)

Two minor inline observations on the new tests (both non-blocking). Full analysis is in the prior summary comment on this commit — behavior preservation, the null guard, and parity tests all check out.

Generated by Claude Code

espg · 2026-06-15T02:36:31Z

+        table = pa.table(
+            {
+                "h_li": pa.array([1.0, None, 3.0], type=pa.float32()),
+                "s_li": pa.array([0.1, 0.2, 0.3], type=pa.float32()),


🤖 from Claude (review)

The guard test injects the null into h_li (a value column). The code comment two functions up specifically justifies the guard by the risk of a grouping divergence — i.e. a null in leaf_id, which to_numpy(zero_copy_only=False) would turn into NaN and silently mis-key the sort. The guard is generic so it covers both, but the test as written never exercises the leaf_id-null case it was motivated by. Consider moving the None into the leaf_id column (or adding a second case) so the test pins the scenario the comment calls out. Non-blocking.

Generated by Claude Code

espg · 2026-06-15T02:36:34Z

+
+    def test_concat_and_group_arrow_rejects_nulls(self):
+        """The arrow carrier must fail loudly on null columns, not silently diverge."""
+        pa = pytest.importorskip("pyarrow")


🤖 from Claude (review)

_IdentityGrid is now defined identically in both test_concat_and_group_arrow_matches_pandas and test_concat_and_group_arrow_rejects_nulls. Minor DRY nit — a single class-level (or module-level) stub would avoid the duplication. Non-blocking.

Generated by Claude Code

espg · 2026-06-15T06:21:50Z

Recommendation: keep byte-identical as the contract for the pandas↔arrow-carrier equivalence (it's free — this PR already guarantees it, both carriers feed identical numpy arrays), and make the kernel reduction a separate opt-in (reducer="kernel" / handoff="arrow-kernel") governed by an explicit, documented assert_allclose tolerance (e.g. rtol=1e-6 for float32 means) rather than loosening the whole suite. That quarantines the float divergence behind a flag, keeps the default path #29 builds on provably exact, and defers building it at all until the phase-3 real-ATL03 benchmark shows a material win. If the win is marginal, we don't take on the determinism cost.

So my call: don't loosen the global criterion; gate kernels behind an opt-in + documented tolerance, decided by the phase-3 measurement. If you'd rather I prototype the kernel path now (behind the flag) so phase 3 can benchmark all three (mask / numpy-group / kernel), say the word.

I'll take your recommendation and keep the byte-identical contract. Let's prototype the kernel path now (behind the flag) so phase 3 can benchmark all three (mask / numpy-group / kernel). You can mark the kernel path code experimental -- I don't want us to keep it and contribute to code bloat if it doesn't have a measurable and real impact on run time efficiency... but the only way to know will be to test and benchmark it after we implement it.

espg · 2026-06-15T10:40:01Z

🤖 from Claude

Folded in the two HIGH findings from the phase-2b self-review (commit 82f3496).

HIGH-1 — kernel NaN semantics (the false "exact" claim). Fixed by matching numpy, not just documenting.
I verified the divergence against pyarrow 24.0.0 and it's narrower than the review framed but still real:

| stat | arrow on `[1,2,NaN,4]` | numpy |
|---|---|---|
| `count` | 4 | 4 (NaN is a value, not a null) |
| `min`/`max` | 1.0 / 4.0 (skips NaN) | NaN / NaN (propagates) |
| `mean`/`variance` | NaN | NaN (already agree on NaN input) |

So the only true divergence was min/max silently skipping NaN. Rather than weaken the "count/min/max exact" contract, I restored numpy parity for it: _kernel_aggregate now detects NaN per group on each min/max source column (a np.isnan boolean column reduced with a per-group max, done in the same single group-by pass) and overwrites those groups' min/max with NaN — src/zagg/processing.py:398-422. skip_nulls does not help here because float NaN is not an Arrow null. With this, count/min/max are bit-identical to numpy including on NaN input; mean/variance still propagate NaN in both paths and only diverge within KERNEL_RTOL on finite floats. Updated the now-accurate contract in the EXPERIMENTAL block (:282-292), the _kernel_aggregate docstring (:345-362), and the process_shard docstring (:580-585).
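The detect-and-overwrite idea can be illustrated in pure NumPy. This is an analogue under stated assumptions, not the `_kernel_aggregate` code: `np.fmin` stands in for a NaN-skipping min kernel, and the group layout (`starts`) is made up.

```python
import numpy as np

# Two groups laid out contiguously: values[0:4] and values[4:6].
values = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0], dtype=np.float32)
starts = np.array([0, 4])

# np.fmin ignores NaN, standing in for a NaN-skipping min kernel.
skipping_min = np.fmin.reduceat(values, starts)

# Per-group "any NaN" flag: reduce the isnan mask with a per-group max.
nan_mask = np.isnan(values).astype(np.uint8)
has_nan = np.maximum.reduceat(nan_mask, starts).astype(bool)

# Overwrite flagged groups with NaN to recover numpy's propagating semantics.
patched_min = np.where(has_nan, np.nan, skipping_min)

# Reference: plain numpy min per group, which propagates NaN.
reference = np.array([values[0:4].min(), values[4:6].min()])
```

After the overwrite, `patched_min` and `reference` agree element-wise, including the NaN in the first group.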

HIGH-2 — tests didn't exercise the production process_shard(handoff="arrow-kernel") branch or any NaN data.
Added TestProcessShardKernelBranch (tests/test_processing.py:480-590) driving the real branch — concat→combine_chunks(), null guard, grid.cells_of, and the handoff validation — by monkeypatching _read_group/h5coro.H5Coro so no I/O is needed:

test_kernel_branch_matches_default_path — kernel vs arrow carrier over two reads: count/min/max exact, variance within KERNEL_RTOL.
test_kernel_branch_nan_input — a NaN in h_li propagates to that cell's min/max (numpy semantics), count unaffected, and the null guard does not trip (NaN ≠ Arrow null).
test_invalid_handoff_rejected — handoff="bogus" raises before any read.

Plus test_kernel_nan_matches_numpy_semantics in TestKernelHandoff (:408-447) asserting _kernel_aggregate matches the numpy reference on clean / one-NaN / all-NaN cells. test_processing.py 23 → 27 tests; full suite 218 → 222 passed, 1 skipped.

LOW findings: all three addressed as notes (no behavior change): (a) _kernel_aggregate docstring now states it does not re-check the Arrow-null contract and that the caller enforces it (:355-362); (b) noted the "average"→"mean" mapping is dead for shipped configs (:303-306); (c) noted the fallback recomputes the full stats dict and discards the kernel-able ones (:425-430).

Verification: ruff check / ruff format --check clean (against the pinned v0.14.10), pytest -v green (222 passed, 1 skipped — pre-existing deprecation warnings only), no new mypy errors and no # type: ignore/# noqa. Benchmark parity self-check still passes; added a one-line note to its docstring that its synthetic data is NaN-free so the NaN path is covered in tests, not there (benchmarks/handoff_bench.py:11-17). No new deps.

Nothing blocked. Leaving the PR as-is (draft, implement); not marking ready.

Generated by Claude Code

espg · 2026-06-15T23:01:18Z

numpy has a full suite of NaN-compatible operators: nanmean, nanvar, nanmax, etc. Does arrow have similar equivalents? I don't want to lose the thread too much here; the user contract is that we can execute aggregation functions that are expressible in numpy, and we certainly don't want to lose that. That said, I don't mind having the implementation provide arrow variants when they are genuinely faster.

Perhaps the angle to look at this is one of expanding the contract with the user? i.e., we support anything expressible in numpy and any arrow kernels. That would make things more explicit, and also not our problem anymore: the user defines what gets called in the agg template, and they make the choice of numpy vs arrow; we just document examples of both cases, with some benchmarks, and then they get to decide which hatch to use.

I'd like to finalize this asap and get it merged. We can split phases 3 and 4 back out to an issue to track, and then get them in a follow-up PR later. Do you have any questions (or thoughts) on implementing the above?

One other note-- #32 (and other PRs) were just merged to main, so check if this branch needs a rebase.

espg · 2026-06-16T00:08:17Z

🤖 from Claude

Answering your arrow-equivalents question and finalizing per your comment.

Does numpy's nan-family work end-to-end? Yes — already, with no code change. resolve_function resolves any bare/np.-prefixed name via getattr(np, name) and any dotted path via import, so np.nanmean/np.nanvar/np.nanmax/np.nanmin/np.nansum/np.nanstd/… are all usable directly from the agg template and run through calculate_cell_statistics with numpy's own NaN semantics. There was no gap to close — I added test_numpy_nan_aware_functions proving correct NaN-aware per-cell output (NaN-skipping mean/min/max/var/sum/std on NaN-bearing input), and documented the contract in the calculate_cell_statistics docstring + the EXPERIMENTAL block.
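The resolution rule described above (bare or `np.`-prefixed names via `getattr(np, name)`, dotted paths via import) can be sketched as follows. `resolve_by_name` is a hypothetical reimplementation for illustration; the real `resolve_function` may differ in detail.

```python
import importlib
from typing import Callable

import numpy as np


def resolve_by_name(name: str) -> Callable:
    """Resolve 'nanmean', 'np.nanmean', or a dotted path such as 'math.fsum'.

    Hypothetical sketch; not the project's resolve_function.
    """
    bare = name[len("np."):] if name.startswith("np.") else name
    if "." not in bare:
        # Bare and np.-prefixed names are looked up on the numpy namespace.
        return getattr(np, bare)
    # Other dotted paths are imported: the part before the last dot is the module.
    module_path, attr = name.rsplit(".", 1)
    return getattr(importlib.import_module(module_path), attr)


values = np.array([1.0, np.nan, 3.0, 5.0])
nan_aware = float(resolve_by_name("np.nanmean")(values))   # skips the NaN
plain = float(resolve_by_name("mean")(values))             # propagates the NaN
```

In this sketch a dotted path has to be importable as written, so a numpy submodule function would be spelled with the full `numpy.` module path.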

Does arrow have nan-equivalents? Not really — and that's why kernels stay opt-in. Arrow compute has mean/min_max/variance with skip_nulls, but an Arrow NULL is a distinct missing-value bit, not a float NaN — skip_nulls does not skip NaN. So there's no arrow "nanmean" kernel that's a drop-in for np.nanmean. The existing experimental kernel path doesn't pretend otherwise: it matches numpy's NaN behaviour by hand (count/min/max are exact vs numpy including NaN — pyarrow min/max skip NaN, so _kernel_aggregate detects NaN per group and propagates it; float mean/variance agree within the documented KERNEL_RTOL ~1 ULP, not byte-identical).

Dual contract — documented as you framed it. The supported user contract is "any aggregation expressible in numpy (including the nan* family)"; arrow kernels are an opt-in acceleration for the kernel-able subset, gated behind handoff="arrow-kernel", kept only if the phase-3 benchmark justifies them. This is now spelled out in the docstring, the EXPERIMENTAL block ("Dual aggregation contract"), and the PR body.

Main rebase status. Per CLAUDE.md (§1: never force-push; §2: don't rebase a shared pushed branch), I synced via git merge origin/main rather than git rebase. It was a clean merge with no conflicts — processing.py and its tests weren't touched by the merged PRs. main brought in #32 (requires-python <3.14), the concurrency module, shardmap/rectilinear changes, and docs.

Tests. Full suite post-merge: 265 passed, 1 skipped. ruff check --select=E,F,W,I --ignore=E501 (ruff 0.14.10, the pinned CI version) clean on the touched files. benchmarks/handoff_bench.py runs and asserts parity. (Note: the local uv env pulled ruff 0.15.17, whose formatter reflows several pre-existing files unrelated to this PR — I did not touch those; CI uses the pinned 0.14.10, under which my two files are clean.)

Phases 3 & 4 deferred. As you asked, I split phase 3 (real-ATL03 wall-time/RSS benchmark) and phase 4 (Arrow-table write_dataframe_to_zarr sibling) out of this PR so it can finalize/merge now. I've left them in the PR checklist marked deferred and noted they'll be tracked off #30. One question for you: do you want a dedicated follow-up tracking issue, or keep them on #30? Opening a new issue is a side-effecting action, so I'm leaving that call to you rather than creating one.

This PR is finalized / merge-ready pending your call on the two questions above (dual-contract framing + where to track 3/4). Kept as a draft — I won't mark it ready or merge.

Generated by Claude Code

espg · 2026-06-16T00:09:38Z

🤖 from Claude (review)

Fresh adversarial self-review of this run's commit (document numpy nan-op contract; test nan-aware aggregation): no defects found.

Verified against resolve_function/calculate_cell_statistics:

Docstring/comment claims are accurate: bare and np.-prefixed names resolve via getattr(np, ...), dotted paths via import, and the default reduction path (float(func(values))) carries numpy's own NaN semantics with no special-casing. The EXPERIMENTAL block's skip_nulls-skips-NULL-not-NaN claim is correct.
The test's minimal PipelineConfig(aggregation=…) constructs fine (all dataclass fields defaulted), and the trace through calculate_cell_statistics is clean for all six nan* fields (no expression, no params, source present, non-empty branch).
The three guard assertions (not isnan(h_nanmean), h_nanmax == 5.0, h_nanmin == 1.0 for input [1.0, nan, 3.0, 5.0]) are genuinely discriminating — a non-NaN-aware op would yield NaN and fail them — so a real NaN-handling regression couldn't slip through.

Nothing to fold; standing for your review.

Generated by Claude Code

phase 1 of issue #30: sort/hash grouping refactor

f0e3890

espg mentioned this pull request Jun 12, 2026

Summary tracking issue espg/mortie#39

Closed

fix review findings from PR #33

faa09c4

This was referenced Jun 13, 2026

Support non-scalar aggregation outputs (vectors + ragged per-cell payloads) #29

Closed

Refactor the per-cell aggregation handoff: sort/hash grouping + Arrow path (additive, benchmarked) #30

Closed

phase 2 of issue #30: arrow handoff path

8140437

espg mentioned this pull request Jun 15, 2026

CLAUDE.md: per-issue claude/ branches, multi-PR, and PR label states #37

Merged

espg changed the title ~~phase 1 of issue #30: sort/hash grouping refactor~~ sort/hash grouping refactor ( #30 ) Jun 15, 2026

espg added the implement label Jun 15, 2026

espg mentioned this pull request Jun 15, 2026

runner.py lambda orchestrator: default max_workers exhausts client file descriptors, silently dropping cells #28

Closed

address review: null-free arrow guard + _concat_and_group helper/test (#30)

39d24b6

espg added the waiting label Jun 15, 2026

fold self-review: dedup grid stub, null-guard test targets leaf_id (#30)

d580746

espg removed the waiting label Jun 15, 2026

claude added 2 commits June 15, 2026 10:17

phase 2b of issue #30: experimental pyarrow kernel reducer

6194852

fold review: kernel NaN semantics + process_shard tests (#30)

82f3496

espg added the waiting label Jun 15, 2026

espg mentioned this pull request Jun 15, 2026

Non-scalar aggregation outputs (Tier-1 vectors): config declaration + vector cell statistics #40

Merged

7 tasks

espg removed the waiting label Jun 15, 2026

claude added 2 commits June 16, 2026 00:01

Merge remote-tracking branch 'origin/main' into claude/epic-wozniak-yw0eoz

421d1d7

document numpy nan-op contract; test nan-aware aggregation (#30)

f6b24f1

espg added the waiting label Jun 16, 2026

espg marked this pull request as ready for review June 16, 2026 00:26

espg merged commit 64bfb67 into main Jun 16, 2026
12 checks passed

espg pushed a commit that referenced this pull request Jun 16, 2026

fix review findings from PR #33

5dd10ff

This was referenced Jun 16, 2026

rename parent_morton event field to shard_key (#24) #42

Merged

ATL03 template + region benchmark + hierarchical-read wiring (Refs #30, #43) #47

Merged

Temporal aggregation #12

Open

espg deleted the claude/epic-wozniak-yw0eoz branch June 18, 2026 18:56

This was referenced Jun 20, 2026

Claude Summary Tracking (June 2026) #50

Open

Follow up items for the week of June 23, 2026 #109

Open

espg mentioned this pull request Jun 29, 2026

Lambda layer omits pyarrow → vector aggregators (tdigest) crash with "No module named 'pyarrow'" #130

Closed


espg left a comment

Phase 1 review — sort/hash grouping refactor

1. CI failure — missed call site in test_config.py (blocking)

2. _build_groups crashes on empty DataFrame (correctness bug)

3. Breaking API change with no compat shim — callers outside test_processing.py were not swept

4. if cell_data else 0 is the wrong emptiness guard

5. _empty sentinel is shared across all no-data cells — mutation hazard

6. Parity test uses sorted order vs original order — floating-point results may differ (test correctness)

7. test_slice_counts_match_per_cell_mask compares sorted values, not ordered slices

8. No test for _build_groups with empty input

9. No end-to-end test for process_shard with the new code path

10. noqa: S307 is a no-op under this repo's ruff config (minor)

Summary table

espg commented Jun 13, 2026

Phase 2 — what landed

Q1 — memory bloat at ATL03 scale

Q2 — calculate_cell_statistics breaking change, fuller picture

Q3 — the _empty sentinel seam
