Non-scalar aggregation outputs (Tier-1 vectors): config declaration + vector cell statistics #40
espg
left a comment
🤖 from Claude (review)
Fresh-context adversarial review of phase 1 (config/schema declaration + validation for non-scalar outputs, issue #29). Scope is correctly held to declaration/validation only — no runtime/writer/signature work leaked in, which is right for this phase.
Overall: solid and in-scope. Scalar default is backward-compatible (kind defaults to "scalar", scalar fields short-circuit with no new required keys; the test_scalar_default_backward_compatible + test_atl06_validates cases confirm pre-existing configs can't newly raise). The bool guard (isinstance(dim, bool)) is correct given isinstance(True, int) is True, and it's tested. OUTPUT_KINDS defined after validate_config is fine — it's referenced at call time, not import time.
Findings below are minor / polish — none block advancing to the next phase.
1. Docstring overreach — fill/pad_weight sentinel (config.py:225). The _validate_output_kind docstring claims a field may declare "a fill/pad_weight sentinel," but nothing in this phase reads or validates those keys. Either drop the mention or mark it as a later phase, so the docstring doesn't promise validation that isn't there.
2. Test gap — per-element check inside a list (tests/test_config.py, TestOutputKind). _validate_trailing_shape loops over each dim and rejects non-int / bool / < 1 entries, but every "bad element" test exercises the scalar value (trailing_shape="64" hits the type branch; 0/True are single ints). There's no case like trailing_shape=[16, "x"], [16, 0], or [16, True] that reaches the loop body with a valid leading dim — so the in-loop rejection of a bad element within an otherwise-valid sequence is untested. Adding one such case closes the coverage hole.
3. np.dtype catches only TypeError (config.py:251). Verified against numpy 2.4.6 and the lambda-pinned-equivalent path: bad dtype strings/objects all raise TypeError today, so this is correct in practice. Still, np.dtype has historically raised ValueError for some malformed inputs; broadening to except (TypeError, ValueError) is a cheap hedge against a numpy-version drift turning a config typo into an uncaught crash. Low priority — note, not a blocker.
4. Minor — OUTPUT_KINDS interpolated raw (config.py:244). f"(allowed: {OUTPUT_KINDS}; ...)" renders the literal tuple repr ('scalar', 'vector') in the user-facing error. Cosmetic; a joined form reads slightly cleaner.
No correctness bugs found in the validation logic itself. Acceptance criteria for phase 1 (declare vector kind + trailing_shape + dtype; scalar unchanged; ragged recognized-but-rejected) are met and tested.
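A minimal sketch of the checks behind findings 2 and 4, for illustration only. It is not the repository's `_validate_trailing_shape`: the function names `validate_trailing_shape` and `kind_error`, the error wording, and the assumed contents of `OUTPUT_KINDS` are all hypothetical.

```python
OUTPUT_KINDS = ("scalar", "vector")  # assumed contents, taken from the repr quoted in finding 4


def validate_trailing_shape(trailing_shape):
    """Hypothetical stand-in for the per-dim check described in finding 2."""
    # Accept a single int or a list/tuple of ints; reject bool up front
    # because isinstance(True, int) is True.
    if isinstance(trailing_shape, bool) or not isinstance(trailing_shape, (int, list, tuple)):
        raise ValueError(f"trailing_shape must be an int or a sequence of ints, got {trailing_shape!r}")
    dims = [trailing_shape] if isinstance(trailing_shape, int) else list(trailing_shape)
    for dim in dims:
        # The in-loop rejection: a bad element inside an otherwise valid sequence.
        if isinstance(dim, bool) or not isinstance(dim, int) or dim < 1:
            raise ValueError(f"trailing_shape entries must be ints >= 1, got {dim!r}")
    return tuple(dims)


def kind_error(kind):
    # Joined form instead of the raw tuple repr in the user-facing message.
    return f"unknown kind {kind!r} (allowed: {', '.join(OUTPUT_KINDS)})"
```

Under these assumptions, `[16, "x"]`, `[16, 0]`, and `[16, True]` each reach the loop body with a valid leading dim, which is the case finding 2 says is currently untested.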
Generated by Claude Code
|
🤖 from Claude Phase 1 status: landed and green. Config field-kind declaration (Option B: Ran the fresh-context adversarial self-review (the COMMENT review above). Folded its diff-scoped findings in commit
Ordering: keeping this Generated by Claude Code |
|
🤖 from Claude #33 merged → this PR is unblocked. Status of this run:
Next step: your pick of A/B/C for the phase-3 container, then I'll implement it + phases 4–6. Also a quick confirm that Generated by Claude Code |
|
🤖 from Claude (review) Fresh adversarial pass over the phase-2 diff ( Verified good
Standing finding (not fixed — out of phase-2 scope; flagging for @espg)
No other defects. Tests cover the shape/dtype/sentinel/wrong-width/expression-rejected paths. Generated by Claude Code |
This seems trivial to allow, we should do it now while we're thinking about it
Are we sure a pd.DataFrame column can't hold a 2-D block? jbbutler does exactly this in this notebook, which encodes 2-D arrays of binary masks. I believe the code that modifies pandas to do this is in the same repo...
Don't like this, lists of objects in python have the wrong smell.
Possible option, better than (A)... but if we're going to return something alongside df_out, is pyarrow an escape hatch here? We have arrow already wired in as an alternative option to pandas -- can arrow deal with 2D arrays, and can we just default to arrow for that case? Keep in mind that we'll be doing t-digest as a vector output almost immediately as a follow-on, so if one of these options is a better fit for that, I'd like that flagged. |
|
🤖 from Claude Answering your phase-3 carrier questions. Net: I now lean toward an Arrow On "can pandas hold a 2-D block?" — yes, you're right, and jbbutler's approach is the reason. pandas' internal BlockManager is genuinely n-dimensional, and you can surface "one column = one per-cell vector" cleanly via a pandas
On "is pyarrow the escape hatch / can Arrow do 2-D / default to Arrow for that case?" — yes to all three, and this is my recommendation. Arrow carries non-scalar per-cell payloads natively:
#30 already landed the Arrow handoff path ( So my updated recommendation is B′ = Option B carried in Arrow, not a numpy dict:
Why this is the best t-digest fit (your flag): t-digest is Two decisions I'd like from you before I build phase 3:
Meanwhile, the vector- Generated by Claude Code |
espg
left a comment
🤖 from Claude (review)
Adversarial self-review of commit 3fda3c4 (vector expression fields, issue #29). Reviewed against /CLAUDE.md and the issue criteria. Scope limited to this commit; the Phase-3 per-cell container fork is intentionally not requested here.
Verdict: solid, ships as-is. The change is small, well-factored, and correct on the points that matter:
- Scalar `evaluate_expression` contract is preserved byte-identically. The wrapper now calls `float(eval_expression_raw(...))`; `eval_expression_raw` reuses the same `ns` (same `__builtins__={}`, same `np`/`numpy`/`abs`/`len`/`min`/`max`/`sum` set, `**columns`), so the scalar path is unchanged. Security namespace is unchanged: no new exec surface.
- Vector-expression path routes through the same coercion as the vector `function` path (`eval_expression_raw` → `_coerce_field_value` → `trailing_shape`/dtype check), so shape/dtype enforcement and the wrong-width `ValueError` are shared, not duplicated.
- The `len`/`count` guard is sound and correctly scoped. It lives in the `kind == "vector"` branch and keys on `meta.get("function")`. Because `function` and `expression` are validated mutually-exclusive earlier (config.py ~L176), the guard can never collide with an expression field, and it matches the actual short-circuit in `calculate_cell_statistics` (the `func_name in ("len","count")` branch only runs when `expression` is falsy). Tests cover both `len` and `count`.
- Tests added on both sides (config validation + processing behavior). I ran `TestVectorOutputs` + `TestOutputKind` locally: 28 passed.
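The wrapper relationship in the first bullet can be sketched as follows. This is a hedged illustration, not the code in the commit: the namespace here is stdlib-only (the real one reportedly also exposes `np`/`numpy`), and the function bodies are assumptions that only reuse the names mentioned in the review.

```python
def eval_expression_raw(expression, columns):
    """Evaluate `expression` in a restricted namespace and return the raw result."""
    # Assumption: stdlib-only namespace; np/numpy are omitted to keep the sketch self-contained.
    ns = {
        "__builtins__": {},
        "abs": abs, "len": len, "min": min, "max": max, "sum": sum,
        **columns,
    }
    return eval(expression, ns)


def evaluate_expression(expression, columns):
    """Scalar wrapper: same namespace, result coerced to float."""
    return float(eval_expression_raw(expression, columns))


scalar = evaluate_expression("max(h) - min(h)", {"h": [1.0, 5.0]})
vector = eval_expression_raw("[min(h), max(h)]", {"h": [1.0, 5.0]})
```

Because both entry points share one namespace builder, the scalar path cannot drift from the vector path. Note that emptying `__builtins__` narrows the namespace but is not a complete sandbox on its own.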
Findings below are minor — one real test-coverage gap, the rest are nits. None block the next phase.
- [minor coverage gap] The two empty-cell expression tests don't exercise the new expression code. `calculate_cell_statistics` returns early via `_empty_cell_value` when `n_obs == 0`, so `eval_expression_raw` is never called for an empty cell; the test would pass even if the expression were `np.min` on an empty array (which would otherwise warn/raise). See inline note. These tests are correct, just mislabeled as exercising the expression path; consider an assertion or comment clarifying they hit `_empty_cell_value`.
- [nit] `eval_expression_raw` is exposed at module level (no leading underscore) and imported into processing.py. The sibling `_coerce_field_value`/`_empty_cell_value`/`_field_sentinel` are private. Since the only consumer is `calculate_cell_statistics` and the public expression entry point remains `evaluate_expression`, `_eval_expression_raw` would better signal intent, unless you intend it as public API. Worth a deliberate decision.
- [nit] `eval_expression_raw` has no return annotation; its docstring documents `Any` but the signature is bare. `-> Any` (with the import) would match the documented contract and the repo's typed style (mypy runs in CI). Optional.
No correctness or security defects found within scope.
Generated by Claude Code
espg
left a comment
🤖 from Claude (review)
Inline notes for commit 3fda3c4 (vector expression fields, issue #29). See the summary review for the full verdict — this change is correct and ships as-is; the items below are one minor coverage gap plus nits, none blocking.
Generated by Claude Code
| assert edges.dtype == np.dtype("float32") | ||
| np.testing.assert_array_equal(edges, [1.0, 5.0]) | ||
|
|
||
| def test_vector_expression_empty_cell_gets_sentinel(self): |
🤖 from Claude (review)
[diff-scoped, minor] This test (and test_vector_expression_empty_cell_numeric_sentinel below) doesn't actually exercise the new vector-expression code. With {"h": np.array([])}, n_obs == 0, so calculate_cell_statistics returns early via _empty_cell_value before the if expression: branch — eval_expression_raw is never called. I confirmed locally that the expression here (np.array([np.min(h), np.max(h)]), which would raise/warn on an empty array) is never evaluated. The test is correct in what it asserts (empty cell → sentinel-filled trailing_shape), but the docstring "the same shape filled with the fill_value sentinel" implies the expression path; it's really covering _empty_cell_value, which is the same code the vector-function empty-cell tests already cover. Either note that explicitly, or to genuinely cover the expression path use an expression that's safe on empty input and assert it isn't what's returned.
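The control flow behind this note, reduced to a hypothetical stdlib-only sketch. `cell_value` and its `width` parameter are invented for illustration and stand in for the early return through `_empty_cell_value`; the real function signatures are not shown in this thread.

```python
def cell_value(expression, h, width, fill_value):
    """Hypothetical reduction of the empty-cell control flow described above."""
    n_obs = len(h)
    if n_obs == 0:
        # Early return: the expression below is never evaluated for an empty cell.
        return [fill_value] * width
    return eval(expression, {"__builtins__": {}, "min": min, "max": max, "h": h})


expr = "[min(h), max(h)]"  # min() of an empty sequence raises ValueError if evaluated

empty = cell_value(expr, [], 2, float("nan"))
populated = cell_value(expr, [1.0, 5.0], 2, float("nan"))
```

The empty-cell call succeeds even though `expr` would raise on empty input, which is why a passing empty-cell test says nothing about the expression path.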
Generated by Claude Code
|
|
||
| def evaluate_expression(expression: str, columns: dict[str, np.ndarray]) -> float: | ||
| """Evaluate an expression string in a restricted namespace. | ||
| def eval_expression_raw(expression: str, columns: dict[str, np.ndarray]): |
🤖 from Claude (review)
[diff-scoped, nit] Two small things on this new function:
- It's exposed module-level (no leading underscore) and imported by `processing.py`, while its siblings `_coerce_field_value`/`_empty_cell_value`/`_field_sentinel` are private and the public expression entry point stays `evaluate_expression`. If this isn't intended as public API, `_eval_expression_raw` would better match the surrounding convention (CLAUDE.md §4 "match the surrounding code"). A deliberate call either way.
- No return annotation: the docstring promises `Any` but the signature is bare. Since mypy runs in CI, adding `-> Any` (with the `from typing import Any` import) would make the signature match the documented contract. Optional.
Generated by Claude Code
| # ``len``/``count`` short-circuit to a scalar obs count in | ||
| # ``calculate_cell_statistics``; pairing them with kind 'vector' would | ||
| # silently emit a scalar, so reject the nonsensical combination. | ||
| if meta.get("function") in ("len", "count"): |
🤖 from Claude (review)
[diff-scoped, confirmation — no change requested] Confirming this guard is reachable and matches runtime: function/expression are validated mutually-exclusive earlier (~L176), so this only fires for function-driven vector fields, never for expression fields — and it mirrors the func_name in ("len","count") short-circuit in calculate_cell_statistics, which only runs when expression is falsy. So the guard and the runtime short-circuit agree, and there's no expression-vs-len ambiguity. Good. Leaving this as a note, not a request.
Generated by Claude Code
|
🤖 from Claude Landed the vector- Pushed (
Tests: Phase 3 still awaits your call. I posted my recommendation on the carrier in the comment above: Arrow Generated by Claude Code |
approved
also approved |
|
🤖 from Claude Phase 3 landed ( CI is green on the head commit: Process note: the run that pushed phase 3 hit a transient API 500 right after the push, before posting this status or running the phase-3 self-review — so a fresh-context adversarial review of Remaining (continuing next run, no @espg decision pending — the carrier is settled):
PR stays Generated by Claude Code |
|
🤖 from Claude (review) Fresh-context adversarial review of commit Verdict: correct and in-scope; scalar byte-identity is properly guaranteed and tested. Findings (none block the merge of phase 3, but they shape phases 4–6):
CI is green (test 3.12/3.13, ruff, builds). Findings 1–3 fold naturally into phases 4–6 (continuing next run), so they don't gate landing the carrier. Generated by Claude Code |
|
...why did you stop implementing, aren't your instructions to continue until you require instructions / approval? |
espg
left a comment
🤖 from Claude (review)
Phase-4 review (range 4d5a844..93ee562), scoped to the output-field-signature work for issue #29.
Overall: clean and on-target. output_field_signature is canonical and deterministic — sorted by name, trailing_shape rendered as a list, ()→[], and the whole structure round-trips through json.dumps/loads unchanged (verified locally against default_config("atl06")). The nests_with field-set check is symmetric (both grids compute the same canonical list and compare for equality), and get_output_signature already normalizes dtype to None when unset, so the None-vs-explicit-dtype case is consistent across grids. Docstrings updated on both grids + the base Protocol. Nice.
Issues / things to weigh:
- Coverage gap: rectilinear `nests_with` field-set branch is untested (real gap, non-blocking). The new branch in `src/zagg/grids/rectilinear.py` `nests_with` is the only genuinely-new behavioral path for that family; the Healpix tests in `tests/test_grids.py` don't exercise it, and the cross-family Healpix↔rectilinear case short-circuits on `isinstance` before reaching the field check, so it doesn't cover it either. `tests/test_rectilinear.py` already has the `_grid(cfg, ...)` helper and same-CRS nesting tests (~lines 276–296); add one case where two same-CRS, whole-ratio, origin-aligned rectilinear grids differ only in their output-field set and assert `not a.nests_with(b)` (+ symmetric). Without it, a regression that drops the rectilinear check stays green. Per CLAUDE.md §4 ("every behavioral change needs tests").
- Backward-compat consequence of adding `output_fields` to `signature()` (flag for @espg, likely acceptable-by-design). `_check_signature` (`src/zagg/runner.py` ~268) does exact `expected != actual` on `grid.signature()`, and `grid_signature` is persisted verbatim in the ShardMap JSON (`src/zagg/catalog/shardmap.py` `to_json`/`from_dict`). Any ShardMap built before this commit has no `output_fields` key, so it now mismatches and raises "ShardMap was built for a different grid than this run config." That's arguably correct (the gate exists to refuse schema drift, and `signature()` is already a hard equality check), but it silently invalidates pre-existing catalogs with a message that says "different grid" rather than "stale/regenerate." Worth a one-line note under "Questions for review," and consider whether the mismatch message should hint at regeneration (the loader already documents `python -m zagg.catalog` for the pre-Phase-5 case). No code change required if @espg treats this as a schema-version bump.
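Both points above come down to how a canonical field signature behaves under equality. The sketch below is an assumption-laden illustration, not the repository's `output_field_signature`: the function body, the `cell_order` key, and the sample field metadata are hypothetical.

```python
def output_field_signature(fields):
    """Hypothetical canonical form: sorted by name, trailing_shape rendered as a list."""
    return [
        {
            "name": name,
            "kind": meta.get("kind", "scalar"),
            "trailing_shape": list(meta.get("trailing_shape", ())),  # () becomes []
            "dtype": meta.get("dtype"),  # None when unset
        }
        for name, meta in sorted(fields.items())
    ]


# Same schema, different insertion order and tuple-vs-list spelling.
fields_a = {"edges": {"kind": "vector", "trailing_shape": (2,), "dtype": "float32"}, "count": {}}
fields_b = {"count": {}, "edges": {"kind": "vector", "trailing_shape": [2], "dtype": "float32"}}

# "cell_order" is a made-up stand-in for whatever else the grid signature holds.
sig_new = {"cell_order": 8, "output_fields": output_field_signature(fields_a)}
sig_old = {"cell_order": 8}  # a ShardMap persisted before output_fields existed

stale_catalog_mismatch = sig_old != sig_new
```

With exact equality on the signature, two grids with the same schema compare equal regardless of declaration order, while a signature persisted without `output_fields` can never match a new one. That is the regeneration consequence flagged for @espg.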
Nits (non-blocking):
- `output_field_signature` docstring has a double space: `() for scalar` (`src/zagg/config.py` ~441). Trivial.
- `test_signature_is_json_serializable` (`tests/test_grids.py` ~286) asserts `json.loads(json.dumps(sig)) == json.loads(json.dumps(sig))`: comparing two serializations of the same object, which is trivially true and doesn't prove round-trip fidelity. To test what the docstring claims, assert `json.loads(json.dumps(sig["output_fields"])) == sig["output_fields"]` (deserialized == original), which would catch a stray tuple / np-scalar leaking in. The current form is a tautology.
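The difference between the two assertions is easy to show with the standard library. In this sketch the `sig` value is a made-up example with a tuple deliberately leaked in; it is not taken from the test suite.

```python
import json

# Hypothetical signature with a tuple leaked into trailing_shape.
sig = {"output_fields": [{"name": "edges", "trailing_shape": (2,)}]}

# Tautological form: both sides are serialized, so the tuple is hidden on both.
tautology_holds = json.loads(json.dumps(sig)) == json.loads(json.dumps(sig))

# Round-trip form: JSON only has arrays, so the deserialized list != the original tuple.
round_trip_holds = json.loads(json.dumps(sig["output_fields"])) == sig["output_fields"]
```

Here `tautology_holds` is `True` and `round_trip_holds` is `False`, so only the second form would fail on the leak.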
None of these block advancing; (1) is the one I'd fold into the next commit.
Generated by Claude Code
espg
left a comment
🤖 from Claude (review)
Phase-5 review of 93ee562..9c736f9 (the dense vector writer template). Scope limited to this diff. Overall the implementation is correct and the tests pass locally — pytest tests/test_grids.py::TestVectorTemplate tests/test_rectilinear.py::TestNesting::test_differing_output_field_set_rejected → 4 passed. I verified field preservation, multi-dim trailing, and scalar non-regression by running _spec() directly. Findings below; none blocking.
Verified correct
- `vector_array_spec` produces `(*spatial, *trailing)` shape with the trailing dim as one whole chunk: rectilinear `(64,64)` → `(64,64,4)` chunk grid; healpix matches `(4**(8-6), 4)`. Multi-dim `trailing_shape=[3,2]` → shape `(...,3,2)`, chunk `(64,64,3,2)`. Good: this satisfies the writer's `block_idx = chunk_idx + (0,)*len(trailing)` invariant since the trailing block index is always 0.
- The `type(base)(**{**base.model_dump(), ...})` reconstruction does not lose or corrupt fields. I confirmed empirically that `codecs`, `chunk_key_encoding`, `fill_value`, `dtype`, and `storage_transformers` on the vector array are identical to the scalar `count` array (`m.codecs == sc.codecs`, `m.chunk_key_encoding == sc.chunk_key_encoding`, both `True`; `fill_value=0`/`dtype=int64` preserved). pydantic_zarr's `ArraySpec.model_dump` recursively dumps nested NamedConfigs to dicts and re-validates them on construction, so this round-trips cleanly. The inline comment about setting `shape` + `dimension_names` together (vs chained `with_*`) is accurate: `validate_dimension_names` (v3.py:222) is a `mode="after"` validator that would reject a transient rank mismatch.
- Scalar fields are untouched: the early `if not trailing: return base` means a pure-scalar config yields `_spec()` members with unchanged `('cells',)`/`('y','x')` dim_names and no extra axes. No scalar regression.
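The shape arithmetic in the first bullet, written out as a small sketch. `vector_shapes` is a hypothetical helper rather than `vector_array_spec` itself, and the `(640, 640)` rectilinear array shape is an assumed example size.

```python
def vector_shapes(spatial_shape, spatial_chunk, trailing_shape):
    """Hypothetical helper: extend spatial shape/chunk by a whole-chunk trailing payload."""
    trailing = tuple(trailing_shape)
    if not trailing:
        # Scalar field: nothing is appended, the base spec is used unchanged.
        return tuple(spatial_shape), tuple(spatial_chunk)
    shape = (*spatial_shape, *trailing)
    chunk = (*spatial_chunk, *trailing)  # each trailing dim is one whole chunk
    return shape, chunk


rect_shape, rect_chunk = vector_shapes((640, 640), (64, 64), [3, 2])
hpx_shape, hpx_chunk = vector_shapes((12 * 4**8,), (4 ** (8 - 6),), [4])

# Number of blocks along each trailing axis; 1 means the block index there is always 0.
rect_trailing_blocks = tuple(s // c for s, c in zip(rect_shape[2:], rect_chunk[2:]))
```

Since every trailing axis has exactly one block, appending `(0,) * len(trailing)` to the spatial chunk index addresses the full payload.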
Real issue (low severity) — invariant is documented, not enforced (src/zagg/grids/base.py:40-45, src/zagg/processing.py:273-279): The single-trailing-chunk contract is the load-bearing coupling between the template and the writer, but nothing guards it. If a future edit chunks the trailing dim (e.g. chunk_shape=(*base_chunk_shape, *some_smaller)), the writer's block_idx = chunk_idx + (0,)*len(trailing) would silently write only block 0 of the trailing axis and drop the rest — no error, corrupt output. The two doc blocks cross-reference each other well, but consider a cheap runtime assertion at the write site (e.g. assert the target array's trailing chunk sizes equal values.shape[1:]) or in vector_array_spec itself, so the invariant fails loud rather than producing partial data. Flagging for @espg; out of scope to add silently.
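One way the fail-loud guard could look, as a sketch under stated assumptions. `trailing_block_index` and its arguments are hypothetical and do not mirror the writer's actual signature; `array_chunks` stands for the target array's chunk shape.

```python
def trailing_block_index(chunk_idx, array_chunks, trailing_shape):
    """Hypothetical guard: refuse to write if the trailing payload is split across chunks."""
    trailing = tuple(trailing_shape)
    n_spatial = len(array_chunks) - len(trailing)
    trailing_chunks = tuple(array_chunks[n_spatial:])
    if trailing_chunks != trailing:
        raise ValueError(
            f"trailing dims must be one whole chunk: chunks {trailing_chunks} != shape {trailing}"
        )
    # With the payload in a single chunk, its block index is 0 on every trailing axis.
    return tuple(chunk_idx) + (0,) * len(trailing)


ok_single = trailing_block_index((5,), (16, 4), (4,))
ok_multi = trailing_block_index((1, 2), (64, 64, 3, 2), (3, 2))
```

A call such as `trailing_block_index((5,), (16, 2), (4,))` raises instead of silently writing only block 0 of the trailing axis.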
Nit (non-blocking) — trailing dimension_names derived from base_dims[-1] (base.py:72): dim_names = (*base_dims, *(f"{base_dims[-1]}_v{i}" ...)). On rectilinear this yields ('y','x','x_v0') — the payload axis is named after the x spatial axis, which is semantically misleading (the trailing axis is not an x-axis). It's unique and harmless, but a field-derived or fixed prefix (e.g. hist_v0, or just v0) would read better. Uniqueness is currently guaranteed only incidentally (each field is a separate array, so cross-field collision isn't possible; within a field the _v{i} suffix disambiguates). Minor.
Nit (non-blocking) — test coverage gaps: TestVectorTemplate covers the happy path well (trailing shape + whole-chunk + dtype for both grids, dim_names extension for healpix). Two small gaps worth a follow-up, not a blocker: (1) no test asserts codecs/fill_value/chunk_key_encoding survive the model_dump reconstruction — I verified it manually but a regression here would be silent (the dtype assert wouldn't catch a dropped codec); (2) no multi-dim trailing_shape template test (e.g. (k,2)) — the writer's (0,)*len(trailing) and the chunk grid both depend on len(trailing)>1 working, and only the single-axis case is exercised here.
Line-length note: base.py:74 (chunk_grid = NamedConfig(...)) looks near/over 100 chars — confirm ruff check src is clean (E501 is ignored in this repo per CLAUDE.md §7, so it won't fail CI, but worth a glance).
Net: correct and well-documented; the only thing I'd push on before merge is making the single-trailing-chunk invariant fail-loud rather than relying on a docstring, since a silent break there corrupts data.
Generated by Claude Code
espg
left a comment
🤖 from Claude (review)
Phase-6 review — scope is the 9c736f9..1511853 diff only (the new TestVectorRoundTrip in tests/test_processing.py). Pre-existing pinned-ruff format drift in this file is on main and out of scope.
Verdict: solid, meaningful round-trip. The block-index math, trailing-dim selection, and NaN-skip assertion all check out against the source. Findings below are mostly nits; one is a substantive coverage gap worth a decision.
What the test gets right (verified against source)
- Block-index/trailing-dim math is correct. `block_index` for fullsphere returns the parent's HEALPix nested id (healpix.py:160-164), and `set_block_selection(block_idx, ...)` writes block `b` → rows `[b*n, (b+1)*n)`. The test's `block_start = chunk_idx[0] * n` matches that, and mirrors `TestWriteDataframeToZarr` (test_processing.py:51). Good.
- Shape assertion uses the constant, not a magic 12. `group["edges"].shape == (HEALPIX_BASE_CELLS * 4**child_order, 2)` lines up exactly with `vector_array_spec` extending the spatial shape (healpix.py:98, base.py:71) by the whole trailing dim. The new `HEALPIX_BASE_CELLS` import is the right call.
- The padding/NaN-skip assertion is genuinely meaningful. `np.nanmean(got, axis=0)` over a `(n, 2)` block with only rows 0 and 3 populated and rows 1,2 NaN-padded, asserted against `[(1-2)/2, (9+4)/2]`, actually proves the reducer skips the padding sentinel; it isn't a tautology, since a non-NaN-aware mean would give a different (NaN) answer. `assert_array_equal` on the populated rows (not `allclose`) is the right strictness for an exact round-trip.
- `column_names[:2] == ["count","edges"]` holds: `get_data_vars` preserves dict-insertion order and coords are appended after (`_build_output`, processing.py:209-214).
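The NaN-skip argument can be reproduced in isolation. In this sketch the row values are assumptions chosen to match the quoted expectation `[(1-2)/2, (9+4)/2]`; the block number `b` is arbitrary and nothing is read from a real Zarr group.

```python
import numpy as np

n = 4
got = np.full((n, 2), np.nan, dtype="float32")  # rows 1 and 2 stay NaN-padded
got[0] = [1.0, 9.0]   # assumed populated row
got[3] = [-2.0, 4.0]  # assumed populated row

nan_aware = np.nanmean(got, axis=0)  # skips the padding rows
naive = np.mean(got, axis=0)         # NaN propagates through both columns

# Block b covers rows [b*n, (b+1)*n) along the cell axis.
b = 7
block_rows = (b * n, (b + 1) * n)
```

The NaN-aware reducer yields `[-0.5, 6.5]` while the plain mean is NaN in both columns, which is why the assertion distinguishes the two.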
Real issue (non-blocking, but the standing #1 finding asked for e2e)
- The "writer was untested e2e" gap is only partly closed; the leaf→stats half is still bypassed. The test fabricates `stats` dicts by hand and feeds them straight to `_build_output`; it never runs `calculate_cell_statistics`/the expression path that produces a real vector block. So the path under test is `_build_output → _arrow_column (FixedSizeList) → write_dataframe_to_zarr → set_block_selection → Zarr read`, which is the writer→Zarr→reader seam (the literal phase-6 ask), but the `edges` config carries `expression: "np.array([np.min(h), np.max(h)])"` and `kind: vector`, and nothing in this test ever evaluates that expression. It's dead config that reads like it's exercised. Either (a) drive one cell through the real statistic eval so the expression→vector block is covered too, or (b) drop the unused `expression`/`source` keys and add a one-line comment that the stats are synthetic. As written a reviewer could reasonably believe the expression path is tested when it isn't.
Reader-requirement assessment (the explicit question)
- Using the existing `open_group`/Zarr API with no new reader code is the right call for #29 phase 6: agreed, no helper needed. The trailing dim is a normal Zarr axis; `group["edges"][block_start:block_start+n]` is exactly how a consumer reads it, and the test demonstrates the full contract (selection + sentinel + reducer). Adding a bespoke reader helper would be speculative abstraction (CLAUDE.md §4) with no caller. The one thing I'd want before calling the "reader" box checked: a sentence in the PR body stating explicitly that "reader = the stock Zarr trailing-dim selection, demonstrated by this test" so the decision is on record rather than implicit.
Nits (non-blocking)
- Coverage is single-trailing-dim + HEALPix only. The whole point of the `block_idx = chunk_idx + (0,)*len(trailing)` generalization (processing.py:279) and the `vector_0`/`vector_1` naming (base.py:74-77) is multi-dim trailing shapes (the t-digest `(k,2)` case named throughout the docstrings). A `trailing_shape: (k, 2)` case would exercise `(0,)*len(trailing)` with `len>1`; currently `len(trailing)` is always 1 here, so the multi-dim block-index arithmetic is asserted by no test. Similarly the rectilinear grid's `block_index` (rectilinear.py:270, a 2-tuple) never sees a vector write. Both are reasonable to defer, but worth a one-line "Questions for review" note on whether phase 6 should cover at least the `(k,2)` case given it's the motivating future shape.
- local import placement. `from zagg.config import PipelineConfig` sits inside `_vector_cfg` (+30) while `from mortie import geo2mort` is inside the test method; fine functionally, but `PipelineConfig` is already importable at module scope and the surrounding `TestNumpyNanAware...` test (line 123) also does the in-body import, so this matches local style. No change needed; flagging only for consistency awareness.
- `parent`/`children` naming vs `n`. `n = len(children)` is immediately re-asserted to equal `4**(child_order-parent_order)` (+ `assert n == ...`). The assert is harmless but redundant with the comment on the same line; it documents the invariant, so I'd keep it. Minor.
Nothing here blocks advancing. Finding #1 is the one I'd most want addressed (or explicitly waved off in the PR body) since it's the residual of the original e2e-coverage finding; #3 is the strongest candidate for an @espg scope decision.
Generated by Claude Code
|
🤖 from Claude Continued past the carrier decision (it's settled — approved) and landed phases 4–6, completing the #29 Tier-1 vector path end-to-end. Pushed (
Self-review: ran a fresh-context adversarial pass after each phase (COMMENT-event reviews on Tests: Two things for your eyes (non-blocking):
The PR body checklist is updated (phases 1–6 ✅; Tier-2 ragged/t-digest is the explicit follow-up). Draft, Generated by Claude Code |
no, it's fine as is.
understood -- it makes sense to do this as a new issue / separate PR.
Does this mean that this PR is ready for review / can be merged? |
|
🤖 from Claude Yes — the Tier-1 vector path (phases 1–6) is complete and CI-green:
CI on the head commit is green ( Per the routine I keep this a draft and don't self-mark ready-for-review or merge — but it is ready for your review/merge whenever you'd like. Leaving it Generated by Claude Code |
Closes #29.
What this does (#29, Tier-1 fixed-width vectors)
Option B output signatures: each aggregation field declares its output kind + trailing shape + dtype. The cell→table→Zarr path now carries non-scalar per-cell payloads end-to-end via the @espg-approved Arrow `FixedSizeList<C>` carrier (B′): scalar configs are byte-for-byte unchanged; a config with any `vector` field routes through Arrow and stores the payload on a trailing Zarr dimension.
Per the settled plan in #29 (comment): Option B (kind + trailing_shape + dtype), CSR for ragged deferred to Tier 2, Tier 1 (fixed vector) first, leaf-emit only, Lambda + local scope.
Approach
- config.py: per-field `kind`/`trailing_shape`/`dtype` metadata, validated by `_validate_output_kind`; `get_output_signature(meta)` is the single normalized read point.
- `calculate_cell_statistics` returns a per-cell ndarray of `trailing_shape`/dtype for `vector` fields (with the schema-declared `fill_value` sentinel on empty cells); scalar fields byte-identical. Plus vector-`expression` enablement and a `vector`+`len`/`count` guard.
- `FixedSizeList` carrier. When any `vector` field is present, the cell→table handoff routes through Arrow (reusing the `_concat_and_group`/`_group_columns` seam from #33, the sort/hash grouping refactor for #30): scalars become plain Arrow columns, vectors become `FixedSizeList<C>`. Pure-scalar configs unchanged.
- `OutputGrid.signature()` now carries an Option-B output-field set `{name, kind, trailing_shape, dtype}` per agg variable (via the new `config.output_field_signature`, JSON-canonical, sorted by name), and `nests_with()` requires a matching field-kind set so co-aggregated grids must share a scalar/vector schema. Both grids + the protocol docstring updated.
- `grids.base.vector_array_spec`, wired into both grids' `_spec()`) creates a `(*spatial_shape, *trailing_shape)` array for each `vector` field, chunking the trailing payload dim whole so the writer's `block_idx = chunk_idx + (0,)*len(trailing)` holds. The single-trailing-chunk invariant is documented and now enforced at the write site (raises if a target array splits the trailing dim).
- `write_dataframe_to_zarr` (exercising the trailing-dim `set_block_selection`), reads back through the trailing-dim selection, asserts populated cells round-trip exactly, empty cells carry the NaN sentinel, and a NaN-aware reducer (`np.nanmean`) skips the padding. Reads use the existing Zarr `open_group` API; the trailing payload is a normal axis, so no new reader code is needed.
Phases checklist
- `calculate_cell_statistics` returns ndarray-valued entries for `vector` fields (+ vector-`expression`, `vector`+`len`/`count` guard).
- `FixedSizeList` carrier through grouping/assembly (approved B′).
- `OutputGrid.signature()` → `{name, kind, trailing_shape, dtype}` + `nests_with()` field-kind check.
- `values`/`offsets`/`cell_ids`) + t-digest as `List<FixedSizeList<2>>`.
How tested
- `uv run pytest -v`: full suite 316 passed, 1 skipped (the skip is the non-PyPI `spherely` SpatialIndex fork, expected per CLAUDE.md §7). New coverage: `TestOutputFieldSignature`/`TestVectorTemplate` (test_grids.py), rectilinear `nests_with` field-set test, `TestVectorRoundTrip` (test_processing.py: the end-to-end write→read round-trip + the trailing-chunk-invariant guard).
- `uv run ruff check src tests`: passes. The lines added in this run are format-clean under the pinned hook ruff (v0.14.10); pre-existing repo-wide `ruff format` drift on `main` (grid/processing files predate this PR) is left untouched per §4, and CI's ruff job is `ruff check` only (fail_level: none), so format drift does not gate.
Questions for review
- `signature()` now includes `output_fields`: a ShardMap built before this change (no `output_fields` key) will no longer equality-match `grid.signature()` and the runner will report a grid mismatch. This is correct-by-design (the shard map must pin the output schema), but existing catalogs need regeneration (`python -m zagg.catalog`). Flag if you'd rather soften the runner mismatch message to hint at regeneration.
- `List<FixedSizeList<2>>`; that lands as the Tier-2 follow-up (CSR ragged), not here.
Generated by Claude Code