sort/hash grouping refactor ( #30 )#33
Conversation
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Phase 1 review — sort/hash grouping refactor
CI is currently red (1 failure, test 3.12). Several additional issues found. Ordered by severity.
1. CI failure — missed call site in test_config.py (blocking)
tests/test_config.py::TestEquivalence::test_config_matches_calculate_cell_statistics is still calling calculate_cell_statistics(synthetic_df) with a pd.DataFrame. The if cell_data else 0 guard on line 159 hits pandas' __bool__, which raises:
ValueError: The truth value of a DataFrame is ambiguous.
The PR updated tests/test_processing.py but missed this test in tests/test_config.py. That test must be updated (or a compat shim added — see item 3 below) before CI goes green.
2. _build_groups crashes on empty DataFrame (correctness bug)
If df_all has zero rows, cell_col is also empty. The boundary arithmetic produces:
starts = np.concatenate([[0], []]) # → [0]
ends = np.concatenate([[], [0]]) # → [0]
# zip([0], [0]) → one iteration
int(sorted_cells[0]) # IndexError — sorted_cells is emptyAny call path where df_all can be empty (e.g. a shard with no observations that nevertheless reaches process_shard) will crash. There is no test for this case. A guard like:
if len(cell_col) == 0:
return {col: df_all[col].values for col in df_all.columns}, {}at the top of _build_groups would fix it.
3. Breaking API change with no compat shim — callers outside test_processing.py were not swept
The calculate_cell_statistics signature change is breaking. The CI failure (item 1) proves at least one caller in test_config.py was not updated. Before marking this draft ready, a full grep for calculate_cell_statistics across the repo is warranted to catch any remaining callers. Consider whether a thin shim (checking isinstance(cell_data, pd.DataFrame) and extracting .values) is preferable to a mechanical sweep — the PR description's "Questions for review" already flags this; it needs an answer before phase 1 lands.
4. if cell_data else 0 is the wrong emptiness guard
Even after fixing the CI failure, the guard len(next(iter(cell_data.values()))) if cell_data else 0 has a subtle gap: a dict with keys but zero-length arrays ({"h_li": np.array([]), "s_li": np.array([])}) is truthy, so n_obs will correctly be 0. That case is actually fine.
The real problem is the guard does not express intent: it was written to protect against an empty dict {} falling into next(iter(...)). The new test test_empty_data_returns_zeros_and_nans passes an explicit {"h_li": np.array([]), ...} (non-empty dict with zero-length arrays), so the cell_data else 0 branch is dead in all tested paths. A cleaner and unambiguous form:
arrays = list(cell_data.values())
n_obs = len(arrays[0]) if arrays else 0This also works correctly if cell_data is ever a DataFrame (it will hit len(df.iloc[:, 0]) rather than bool(df) — but the right fix is still item 3, not relying on this).
5. _empty sentinel is shared across all no-data cells — mutation hazard
_empty: dict[str, np.ndarray] = {col: arr[:0] for col, arr in col_arrays.items()}_empty is reused by reference for every child with no observations. calculate_cell_statistics does not mutate its input today, but the code path that resolves eval()-based params does:
ns = {"__builtins__": {}, "np": np, "numpy": np, **cell_data}
resolved_params[pkey] = eval(pval, ns)**cell_data unpacks into ns (a copy), so the arrays themselves are not mutated. That's safe today. But the dict _empty is the same object every iteration — if anyone ever writes cell_data[new_key] = ... inside calculate_cell_statistics, it would silently corrupt all subsequent empty-cell calls. Given the PR description already calls this out under "Questions for review," at minimum add a comment next to _empty noting the shared-reference contract, or use _empty as a template and shallow-copy per use: cell_data = dict(_empty).
6. Parity test uses sorted order vs original order — floating-point results may differ (test correctness)
test_statistics_match_old_approach asserts exact equality (np.testing.assert_array_equal) between:
new_data: values extracted in sorted cell-id order fromcol_arraysold_data: values extracted in original DataFrame row order via boolean mask
For commutative aggregations (min, max, count) this is fine. For mean, std, weighted average, or any expression aggregation, floating-point summation is order-dependent. With float32 (lower precision, more rounding), a mismatch of 1 ULP is plausible. The test happens to pass because the RNG seed (7) doesn't hit a rounding boundary — but this is not a reliable guarantee.
The PR description claims "byte-for-byte identical outputs." That claim is not proven by this test as written: the test compares new-path-in-sorted-order vs reference-path-in-original-order. If h_mean or h_sigma differ by 1 ULP, the test will fail in CI on a different platform or numpy version. Either:
a) use assert_allclose with a tight rtol (acknowledging the claim is "within tolerance" not "bit-identical"), or
b) document explicitly that the sort reorders observations and float results are only reproducible at float64 precision, or
c) change the reference to also sort its mask-selected values before passing to calculate_cell_statistics, making the comparison truly order-matched.
Option (c) is the most honest demonstration of parity and costs one np.sort call per cell in the test.
7. test_slice_counts_match_per_cell_mask compares sorted values, not ordered slices
np.testing.assert_array_equal(np.sort(new_vals), np.sort(old_vals))This verifies multiset identity (same elements), not that the slice is contiguous or the sort is stable. The test name says "counts match" but actually checks value-set membership — stronger than needed for a boundary test, weaker than needed for an ordering test. Not a bug, but the name and assertion are slightly at odds. Renaming to test_slice_values_match_per_cell_mask or using np.testing.assert_array_equal(np.sort(…), np.sort(…)) with a comment explaining the sort would make intent clearer.
8. No test for _build_groups with empty input
Follows directly from item 2. After the crash fix, add:
def test_empty_dataframe(self):
df = pd.DataFrame({"h_li": np.array([]), "s_li": np.array([])})
col_arrays, cell_to_slice = _build_groups(df, np.array([], dtype=np.int64))
assert cell_to_slice == {}
assert col_arrays["h_li"].shape == (0,)9. No end-to-end test for process_shard with the new code path
process_shard is the function that actually changed behavior. The new test suite covers _build_groups and calculate_cell_statistics in isolation, but there is no test that exercises process_shard end-to-end through the new _build_groups → dict-slice → calculate_cell_statistics path. A smoke test (even a small synthetic one) would catch regressions like item 1 at the integration level. At minimum, there should be a note under "Questions for review" whether the existing integration tests in test_integration.py cover this path — the coverage report shows processing.py at only 33%, which suggests they may not.
10. noqa: S307 is a no-op under this repo's ruff config (minor)
Ruff is configured with select = [E, F, W, I, N]. Rule S307 is in the S (bandit) ruleset, which is not selected. The # noqa: S307 comments are harmless but misleading — they imply the rule is active and being suppressed. This carried over from pre-refactor code; worth removing or converting to a comment explaining the eval is intentional.
Summary table
| # | Severity | Category | Blocks merge? |
|---|---|---|---|
| 1 | Critical | CI failure — missed caller in test_config.py |
Yes |
| 2 | High | Crash on empty DataFrame in _build_groups |
Yes |
| 3 | High | Breaking API — callers not fully swept | Yes |
| 4 | Medium | Wrong emptiness guard (if cell_data) |
Recommend fix |
| 5 | Medium | Shared _empty sentinel mutation hazard |
Recommend comment or copy |
| 6 | Medium | Parity test doesn't prove bit-identical claim | Recommend fix |
| 7 | Low | Test name vs assertion mismatch | Nit |
| 8 | Low | Missing empty-input test for _build_groups |
Recommend add |
| 9 | Low | No process_shard integration smoke test |
Recommend or explain |
| 10 | Nit | Dead noqa: S307 comment |
Nit |
Items 1–3 must be fixed before CI goes green. Items 4–6 are correctness/reliability concerns I'd want addressed before this leaves draft. Items 7–10 can be resolved in-thread or deferred to phase 2 if explained.
Generated by Claude Code
|
🤖 from Claude Follow-up commit ( Critical — CI red ( cell_data = {col: synthetic_df[col].values for col in synthetic_df.columns}
expected = calculate_cell_statistics(cell_data)High — if len(sorted_cells) == 0:
return col_arrays, {}Medium — weak parity assertion in Generated by Claude Code |
This seems like a problem. Moving to ATL03 from ATL06 is orders of magnitude more data; we'd like to become more efficient , not less. Isn't this why we're moving to arrow as a backend? Is this just a transitory side effect (i.e., because we haven't completed the other phases), or are we explicitly trading computational enhancements for memory bloat?
The case for this is the future temporal module (in #12 and #15 ), along with the eventual user antarctic_AR_dataset . What's the fuller picture of what's going on with calculate_cell_statistics?
Not sure-- do you mean on the data side (i.e., in zarr), or internally for our own in memory usage? Where's the seam here? |
|
Claude, you can implement phase 2 in parallel to answering my questions above |
|
🤖 from Claude Phase 2 pushed ( Phase 2 — what landed
Q1 — memory bloat at ATL03 scaleWe are not trading compute for memory; phase 2 is the structural fix for exactly this worry.
So: phase 1 cut compute O(n_cells·n)→O(n log n) at ~flat memory; phase 2 cuts the memory constant. The ATL03-scale answer is "Arrow carrier + measure on a real shard (phase 3)." Q2 —
|
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Adversarial review of the latest commit only (8140437, phase 2 — Arrow handoff). Phase 1 (f0e3890) was not re-reviewed except where phase 2 touches it. Not approving / not requesting changes — leaving findings for the author.
Overall: The grouping refactor is clean and the carrier-agnostic _group_columns / _build_groups split is a sensible seam. The calculate_cell_statistics dict-of-arrays conversion is faithful to the old .values math. Lever 1 (sort/hash grouping) looks behavior-preserving and is reasonably tested via TestBuildGroups. My concerns are concentrated on the byte-for-byte parity claim and the benchmark / test coverage of the new Arrow path — i.e. exactly the acceptance criteria the issue calls out.
Must-fix / discuss before this is "done":
-
Parity rests on an unguarded precondition (inline at
processing.pyarrow branch).to_numpy(zero_copy_only=False)only reproduces pandas'.valuesbyte-for-byte for dense, null-free columns. h5coro delivers that today, but nothing asserts it, and #29 (non-scalar / potentially null-bearing columns) is explicitly the next consumer. A null-bearing integer column would upcast to float64+NaN on the Arrow side and diverge silently. Assert or document the precondition, and flag it under "Questions for review." -
The production Arrow branch in
process_shardhas no test (inline).TestArrowHandoffre-implements thepa.table → combine_chunks → to_numpy → _group_columnssequence inside the test instead of callingprocess_shard(handoff="arrow")or a shared helper. The real lines —pa.concat_tables(...).combine_chunks(),n_obs_total, thecol_dictextraction,_emptyplumbing — are uncovered, and the multi-tableconcat_tablesordering (the one thing the single-table test can't reach) is never exercised. Suggest extracting a small_concat_and_group(reads, grid, use_arrow)helper that's directly testable with two in-memory carriers.
Should-address:
-
Benchmark doesn't meet the issue's criterion (inline on
handoff_bench.py). It's synthetic (no_read_group/concat_tables/real density), andtracemalloc"peak_MB" omits numpy buffers — where a real shard's memory actually lives — yet is printed without a caveat. Fine as an additive CI-runnable harness, but it does not satisfy "wall-time + peak memory on a real shard," so don't tick that box. The parity assert is also narrow (asserts argsort determinism more than carrier equivalence). -
"View"/zero-copy framing is half-true (inline on
_group_columns).arr[sort_idx]is a full reorder copy for both carriers; only the post-sort[start:end]slices are views. The docstring and benchmark framing should not imply Arrow extraction is zero-copy here — it isn't.
Nits:
n_obs = len(next(iter(cell_data.values())))no longer enforces equal column lengths the way the DataFrame did (inline).cells_with_datais now incremented only in the present-cell arm; equivalent to the old count only becausecell_to_slicenever holds zero-length slices. A targeted test on a shard with empty children would lock down this metadata field (inline).
Lint/style refactors bundled into this commit (multi-line open_array, warning-string joins, blank lines after import) are cosmetic and fine. No # noqa/# type: ignore abuse introduced; module stays well under the ~1000-line guideline. Pandas remains the default and the handoff validation is correct, so the additive/opt-in requirement is met.
Generated by Claude Code
| if use_arrow: | ||
| import pyarrow as pa | ||
|
|
||
| table = pa.concat_tables(all_reads).combine_chunks() | ||
| n_obs_total = table.num_rows | ||
| leaf_ids = table.column("leaf_id").to_numpy(zero_copy_only=False) | ||
| cell_col = grid.cells_of(leaf_ids) | ||
| col_dict = { | ||
| name: table.column(name).to_numpy(zero_copy_only=False) for name in table.column_names | ||
| } | ||
| col_arrays, cell_to_slice = _group_columns(col_dict, cell_col) |
There was a problem hiding this comment.
🤖 from Claude (review)
The byte-for-byte parity claim hinges on an unguarded assumption: that to_numpy(zero_copy_only=False) reproduces the exact same numpy array (dtype + values) that pandas' .values would. That holds only for dense, null-free, non-extension columns — which is what h5coro currently delivers, so this likely works in practice. But the issue (#30) itself flags the footgun, and nothing here enforces it:
- Nulls: if any Arrow column ever carried nulls,
to_numpy(zero_copy_only=False)upcasts an integer column tofloat64and fills with NaN, while the pandas carrier (built from the same dense numpy) would not. The two paths would then diverge silently — exactly the failure the "byte-for-byte" criterion is meant to rule out. - dtype drift on
leaf_id: line 464 runscells_ofontable.column("leaf_id").to_numpy(...)whereas the pandas branch (line ~473) runs it ondf_all["leaf_id"].values. If Arrow yields a different integer width or a read-only buffer,grid.cells_ofcould in principle produce a differentcell_col. Worth confirmingcells_ofis dtype-insensitive here.
Recommend either (a) asserting null-free / expected dtype on the extracted columns, or (b) at minimum documenting the precondition in the docstring so a future non-scalar/null-bearing column (#29) doesn't quietly break parity. This is the central correctness risk of the phase, so please call it out under "Questions for review" even if you judge it safe today.
Generated by Claude Code
| if use_arrow: | ||
| import pyarrow as pa | ||
|
|
||
| table = pa.concat_tables(all_reads).combine_chunks() |
There was a problem hiding this comment.
🤖 from Claude (review)
The production Arrow branch in process_shard is entirely untested. TestArrowHandoff.test_arrow_grouping_matches_pandas rebuilds the pa.table(...).combine_chunks() → to_numpy → _group_columns sequence inline in the test, rather than driving it through process_shard(handoff="arrow"). So the actual lines here — pa.concat_tables(all_reads).combine_chunks(), n_obs_total = table.num_rows, the col_dict comprehension, and the _empty/slice plumbing below — have zero coverage. A typo or column-name mismatch in this branch would pass CI.
I understand process_shard can't be unit-tested end to end without S3/h5coro, but the concat→extract→group block could be lifted into a small carrier-agnostic helper (e.g. _concat_and_group(reads, grid, use_arrow)) that is directly testable with two in-memory tables / DataFrames. That would let a test exercise the real code rather than a re-implementation of it, and would catch divergence in the concat_tables ordering specifically (multi-table concat is the one thing the single-table pa.table(...) test path never touches).
Generated by Claude Code
| """ | ||
| sort_idx = np.argsort(cell_col, kind="stable") | ||
| sorted_cells = cell_col[sort_idx] | ||
| col_arrays = {col: arr[sort_idx] for col, arr in col_dict.items()} | ||
| if len(sorted_cells) == 0: |
There was a problem hiding this comment.
🤖 from Claude (review)
Docstring says "each cell's observations form a contiguous slice, so col_arrays[col][start:end] is a view." That second clause is the load-bearing claim the issue makes about zero-copy extraction, but col_arrays is produced by arr[sort_idx] (line 53) — fancy indexing, which always copies. The per-cell [start:end] of that already-copied, reordered array is a view, but the array as a whole is a full reorder copy of the input — for both carriers. So the "zero-copy slice" framing in #30 is only half-true: the dominant cost is the argsort reorder copy, paid identically by pandas and Arrow, not the slicing. Worth tightening the docstring (and the benchmark's framing — see handoff_bench.py) so the benchmark conclusion isn't read as "Arrow extraction is zero-copy" when it isn't here.
Generated by Claude Code
|
|
||
| This is the CI-runnable half of #30's benchmark: it isolates the grouping | ||
| algorithm and the carrier representation cost with no I/O, so it runs anywhere |
There was a problem hiding this comment.
🤖 from Claude (review)
This benchmark does not actually measure what #30 asks for, and the gap should be stated plainly rather than only deferred to "phase 3":
-
It does not benchmark the production code paths.
run_arrow()callspa.table(col_dict).combine_chunks()on a single synthetic table — it never exercisespa.concat_tables(the multi-beam concat thatprocess_shardactually runs), nor_read_group, nor the real density regime. Likewise the "carrier" difference being measured is mostly an artifact of constructing apd.DataFramevs apa.tableover the same numpy dict, then immediately converting back to numpy. -
The
tracemalloc"peak_MB" column is misleading as printed. The docstring honestly notes it misses numpy buffers, but the output table just printspeak_MBwith no caveat — and numpy buffers are precisely where the bulk of a real shard's memory lives. A reader scanning the table will over-trust it. Either label the column (e.g.py_peak_MB) or print the caveat alongside the table. -
The parity assertion is real but narrow:
stats_equaluses exact!=, which is good for byte-for-byte — but it only compares the scalar dict outputs, and bothrun_pandas/run_arrowfeedagg_groupedthe samecalculate_cell_statistics, so it's largely asserting that argsort is deterministic. The meaningful divergence vector (null/chunked Arrow columns,concat_tablesordering) isn't reachable from this synthetic.
None of this blocks landing an additive benchmark, but the issue's acceptance criterion is "a benchmark table reports wall-time + peak memory for pandas vs Arrow on a real shard" — this satisfies neither "real shard" nor a faithful peak-memory measure, so it shouldn't be presented as closing that criterion.
Generated by Claude Code
| agg_fields = get_agg_fields(config) | ||
|
|
||
| if len(df_cell) == 0: | ||
| n_obs = len(next(iter(cell_data.values()))) if cell_data else 0 |
There was a problem hiding this comment.
🤖 from Claude (review)
n_obs = len(next(iter(cell_data.values()))) if cell_data else 0 — minor robustness note. This takes the length of an arbitrary first column to stand in for the old len(df_cell). That's correct as long as every column is equal-length (the docstring asserts it), but it silently returns whatever the first column's length is if that invariant is ever violated, rather than failing. The old len(df_cell) had the DataFrame enforce equal column lengths at construction. Not worth a guard on the hot path, but flagging that the equal-length precondition is now unchecked.
Generated by Claude Code
| _empty: dict[str, np.ndarray] = {col: arr[:0] for col, arr in col_arrays.items()} | ||
|
|
||
| cells_with_data = 0 | ||
| cell_col = grid.cells_of(df_all["leaf_id"].values) | ||
| for i, child_morton in enumerate(children): | ||
| df_cell = df_all[cell_col == child_morton] | ||
| if len(df_cell) > 0: | ||
| if child_morton in cell_to_slice: | ||
| start, end = cell_to_slice[child_morton] | ||
| cell_data: dict[str, np.ndarray] = { | ||
| col: arr[start:end] for col, arr in col_arrays.items() | ||
| } |
There was a problem hiding this comment.
🤖 from Claude (review)
_empty is rebuilt as a fresh dict of zero-length slices and reused for every absent child. Two small points:
- It aliases the same
_emptydict object across all empty cells.calculate_cell_statisticsonly reads it, so this is safe today — but it's a latent footgun if a future change ever mutatescell_datain place. A comment noting the intentional aliasing would help. - Behavior check vs the old loop: previously every child went through
df_all[cell_col == child_morton]andcalculate_cell_statistics(df_cell)even when empty, andcells_with_dataincremented only whenlen(df_cell) > 0. The new branch incrementscells_with_dataonly inside theif child_morton in cell_to_slicearm — which is equivalent iffcell_to_slicenever contains a key with a zero-length slice. Given_group_columnsonly emits keys for observed cells, that holds. Worth a one-line test assertingcells_with_datamatches the old count on a shard with some empty children, since this metric feedsmetadataand isn't covered.
Generated by Claude Code
|
🤖 from Claude A fresh-context review pass flagged two parity concerns worth grounding for you (the inline comments stand for your call — I'm not resolving them):
Both are real and both are improvements I'm happy to push (null guard + Generated by Claude Code |
yup, land them in this branch. can you say more about this?
Is the pyarrow mean/var expected to be a performance win? It might be worth discussing if we should loosen the acceptance critera to check for np.close rather than enforce byte compatibility. Flesh this out-- want the trade space of pros/cons on this decision? |
|
🤖 from Claude Pushed Landed (review findings)
pyarrow.compute hash-aggregate kernels — the trade space (phase 2b)
Is it a win? Potentially yes for the kernel-able stats (count/min/max/sum/mean/var), but it's an untested lever and a partial one:
The byte-identical vs
Recommendation: keep byte-identical as the contract for the pandas↔arrow-carrier equivalence (it's free — this PR already guarantees it, both carriers feed identical numpy arrays), and make the kernel reduction a separate opt-in ( So my call: don't loosen the global criterion; gate kernels behind an opt-in + documented tolerance, decided by the phase-3 measurement. If you'd rather I prototype the kernel path now (behind the flag) so phase 3 can benchmark all three (mask / numpy-group / kernel), say the word. StatusAll your diff-scoped requests are addressed. Remaining phases now hinge on you: phase 2b needs the kernel/tolerance decision above, and phase 3 (real-ATL03 wall-time + RSS, pandas vs arrow) needs earthaccess/S3 access I don't have in this environment. Phase 4 (Arrow-table Generated by Claude Code |
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Reviewed commit 39d24b6 against the two prior findings @espg approved landing.
Behavior preservation — verified. _concat_and_group is a faithful extraction of the prior inline process_shard code. Arrow branch is byte-identical to before (concat_tables().combine_chunks(), to_numpy(zero_copy_only=False), _group_columns, n_obs_total = table.num_rows); pandas branch unchanged (pd.concat(ignore_index=True), len(df_all), _build_groups). The branch selector switched from the local use_arrow bool to re-checking handoff == "arrow", but since handoff is validated to {"pandas","arrow"} upstream and use_arrow = handoff == "arrow", the two are equivalent. grid.cells_of usage and grouping are identical on both paths.
Null guard — correct. table.column(n).null_count sums nulls across the (already combined) chunks; nonzero is truthy, so the guard fires on any nullable column. This catches the real divergence: a null would become NaN under to_numpy(zero_copy_only=False), silently corrupting either a leaf_id group key or a stat. Failing loudly is the right contract.
Tests — substantive, not vacuous. test_concat_and_group_arrow_matches_pandas uses three reads of differing length (40/7/53), genuinely exercising the multi-table concat_tables path and offset handling, and compares per-cell stats pandas-vs-arrow with assert_array_equal (byte-for-byte). The guard test feeds a real arrow null and asserts ValueError/null-free. _IdentityGrid is a fair stand-in — both carriers hand cells_of a plain ndarray, so np.asarray is a no-op that isolates carrier mechanics from grid semantics as the docstring claims. PR-comment claims all check out against the diff.
Two minor, non-blocking observations inline. No correctness issues found; nothing blocks the next phase.
Generated by Claude Code
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Two minor inline observations on the new tests (both non-blocking). Full analysis is in the prior summary comment on this commit — behavior preservation, the null guard, and parity tests all check out.
Generated by Claude Code
| table = pa.table( | ||
| { | ||
| "h_li": pa.array([1.0, None, 3.0], type=pa.float32()), | ||
| "s_li": pa.array([0.1, 0.2, 0.3], type=pa.float32()), |
There was a problem hiding this comment.
🤖 from Claude (review)
The guard test injects the null into h_li (a value column). The code comment two functions up specifically justifies the guard by the risk of a grouping divergence — i.e. a null in leaf_id, which to_numpy(zero_copy_only=False) would turn into NaN and silently mis-key the sort. The guard is generic so it covers both, but the test as written never exercises the leaf_id-null case it was motivated by. Consider moving the None into the leaf_id column (or adding a second case) so the test pins the scenario the comment calls out. Non-blocking.
Generated by Claude Code
|
|
||
| def test_concat_and_group_arrow_rejects_nulls(self): | ||
| """The arrow carrier must fail loudly on null columns, not silently diverge.""" | ||
| pa = pytest.importorskip("pyarrow") |
There was a problem hiding this comment.
🤖 from Claude (review)
_IdentityGrid is now defined identically in both test_concat_and_group_arrow_matches_pandas and test_concat_and_group_arrow_rejects_nulls. Minor DRY nit — a single class-level (or module-level) stub would avoid the duplication. Non-blocking.
Generated by Claude Code
I'll take your recommendation and keep the byte-identical contract. Let's prototype the kernel path now (behind the flag) so phase 3 can benchmark all three (mask / numpy-group / kernel). You can mark the kernel path code |
|
🤖 from Claude Folded in the two HIGH findings from the phase-2b self-review (commit HIGH-1 — kernel NaN semantics (the false "exact" claim). Fixed by matching numpy, not just documenting.
So the only true divergence was min/max silently skipping NaN. Rather than weaken the "count/min/max exact" contract, I restored numpy parity for it: HIGH-2 — tests didn't exercise the production
Plus LOW findings: all three addressed as notes (no behavior change): (a) Verification: Nothing blocked. Leaving the PR as-is (draft, Generated by Claude Code |
|
numpy as a full suite of nan compatible operators -- nanmean , nanvar, nanmax, etc. Does arrow have similar equivalents? I don't want to loose the thread too much here; the user contract is that we can execute aggregation functions that are expressible in numpy, and we certainly don't want to lose that. That said, I don't mind having the implementation provide arrow variants when they are genuinely faster. Perhaps the angle to look at this is one of expanding the contract with the user? i.e., we support anything expressible in numpy and any arrow kernels. That would make thing more explicit, and also not our problem anymore-- the user defines what gets called in the agg template, and they make the choice of numpy vs arrow; we just document examples of both cases, with some benchmarks, and then they get to decide which hatch to use. I'd like to finalize this asap, and get it merged-- we can separate phase 3 and 4 back to issue to track, and then get them in a follow-up PR later. Do you have any questions (or thoughts) on implementing the above? One other note-- #32 (and other PRs) were just merged to main, so check if this branch needs a rebase. |
|
🤖 from Claude Answering your arrow-equivalents question and finalizing per your comment. Does numpy's nan-family work end-to-end? Yes — already, with no code change. Does arrow have nan-equivalents? Not really — and that's why kernels stay opt-in. Arrow compute has Dual contract — documented as you framed it. The supported user contract is "any aggregation expressible in numpy (including the Main rebase status. Per CLAUDE.md (§1: never force-push; §2: don't rebase a shared pushed branch), I synced via Tests. Full suite post-merge: 265 passed, 1 skipped. Phases 3 & 4 deferred. As you asked, I split phase 3 (real-ATL03 wall-time/RSS benchmark) and phase 4 (Arrow-table This PR is finalized / merge-ready pending your call on the two questions above (dual-contract framing + where to track 3/4). Kept as a draft — I won't mark it ready or merge. Generated by Claude Code |
|
🤖 from Claude (review) Fresh adversarial self-review of this run's commit ( Verified against
Nothing to fold; standing for your review. Generated by Claude Code |
Refs #30
What this changes
Reworks the per-shard aggregation hot path in
processing.py. Phase 1 replaced the O(n_children × n_obs) per-child boolean-mask loop with an O(n_obs log n_obs) sort/hash grouping and changedcalculate_cell_statisticsto acceptdict[str, np.ndarray]instead ofpd.DataFrame. Phase 2 adds an additive Arrow handoff carrier alongside pandas (default stays pandas). An experimental pyarrow-kernel reducer (handoff="arrow-kernel") reduces the kernel-able stats viapyarrow.computehash-aggregate; it is opt-in and the byte-identical numpy default path is untouched.Phase 1 — sort/hash group split (
_build_groups)One
np.argsort(O(n log n)) instead ofn_childrenboolean masks (O(n_children × n_obs)). For a typical HEALPix O6→O12 shard: 4,096 boolean scans → 1 sort. The per-cell slices are numpy views into the sorted arrays.calculate_cell_statisticstakesdict[str, np.ndarray]; scalar outputs are byte-for-byte identical (order-independent stats; stable sort preserves within-cell order).Phase 2 — Arrow handoff carrier (additive, opt-in)
process_shard(..., handoff="pandas"|"arrow"), default"pandas"→ no behavior change unless opted in._group_columns(col_dict, cell_col); both carriers feed the same numpy arrays into the same reductions, so scalar outputs stay byte-for-byte identical (asserted inTestArrowHandoff).benchmarks/handoff_bench.py— synthetic, CI-runnable harness timing{mask-loop, pandas-group, arrow-group}with a parity assertion.Experimental — pyarrow-kernel reducer (
handoff="arrow-kernel", opt-in)TableGroupBy.aggregatepass; remaining (weighted mean, expression, quantile) fields fall back to the per-cell numpy path.count/min/maxare exact vs numpy, including NaN: pyarrow min/max kernels skip NaN, so_kernel_aggregatedetects NaN per group and propagates it to match numpy. Floatmean/varianceagree within a documentedKERNEL_RTOL(~1 ULP), not byte-identical — hence opt-in and gated.Dual aggregation contract (this run)
Documents and tests the user contract @espg asked to make explicit (#33 comment):
np.nanmean,np.nanvar,np.nanmax,np.nanmin,np.nansum,np.nanstd, …). These already resolve with no code change —resolve_functiondoesgetattr(np, name)for a bare/np.-prefixed name and import for a dotted path — so the wholenan*family is usable directly from the agg template and runs throughcalculate_cell_statisticswith numpy's own NaN semantics. No gap to close. Addedtest_numpy_nan_aware_functionsproving correct NaN-aware per-cell results, plus aUser contractdocstring section and aDual aggregation contractnote in the EXPERIMENTAL block.skip_nullsdoes not skip NaN), so arrow kernels are not drop-in nan-operators; the experimental path replicates numpy's NaN behaviour by hand rather than conflating nulls and NaN.How it was tested
main(merge, not rebase, to avoid force-pushing the shared branch — §1/§2 of CLAUDE.md):git merge origin/mainwas a clean merge with no conflicts (processing.py/tests untouched by the merged PRs).mainbrought in Rectilinear grid: chunk-driven auto-padding + run enablement (WIP) #32 (requires-python<3.14), concurrency, shardmap/rectilinear, and docs.test_numpy_nan_aware_functions(NaN-aware contract), plus the existingTestBuildGroups/TestArrowHandoff/TestKernelHandoff/TestProcessShardKernelBranchsuites.ruff check --select=E,F,W,I --ignore=E501(ruff 0.14.10, the pinned CI version) clean on the touched files;benchmarks/handoff_bench.pyruns and asserts parity.Phases
calculate_cell_statisticshandoff="arrow"),_group_columnscore,_read_groupArrow mode, synthetic benchmark + parity test — additive, pandas remains defaultpyarrow.computehash-aggregate kernels (handoff="arrow-kernel") for the kernel-able subset, with documentedKERNEL_RTOLand numpy-matching NaN min/max semanticswrite_dataframe_to_zarrArrow-table sibling for the Arrow pathPhases 3 and 4 are split out to a follow-up per @espg's request to finalize and merge this PR now. They will be tracked off #30 unless @espg prefers a dedicated tracking issue (asked on the thread — opening a new issue is a side-effecting action I'm leaving to @espg).
Questions for review
nan*) is the supported surface; arrow kernels are opt-in acceleration only, kept gated behindhandoff="arrow-kernel"pending the phase-3 benchmark.