arro3-core carrier for the Lambda worker; drop pyarrow + arrow-kernel (#130) #131
espg
left a comment
🤖 from Claude (review)
Adversarial review of Phase 1 only (commit 8bc813d, parent bcd98e4) — touches deployment/aws/build_layer.sh and pyproject.toml.
Verdict: looks correct, no blocking findings. Each acceptance criterion from #130 checks out:
- pyarrow added + pinned, no drift. pyarrow==24.0.0 is added to the build_layer.sh install line (build_layer.sh:62) and to the lambda extra (pyproject.toml:49), at the same pin: the extra and the build agree, satisfying the "no drift" criterion. The pin is consistent in spirit with the existing numpy==2.2.6 / pandas==2.2.3 pins.
- Bulk strip fixed. pyarrow* is removed from the rm -rf bloat line, and the now-false "pyarrow stays stripped" comment is replaced with an accurate one describing the kept-but-component-stripped state.
- Component-strip set is coherent. Removed: libarrow_flight*, libarrow_substrait*, libgandiva*, libarrow_dataset* + _flight*.so / _substrait*.so / _gandiva*.so / _dataset*.so. I checked the globs against pyarrow's actual library naming: none of the libarrow_* removal globs collide with the KEEP set (libarrow.so* is "libarrow." not "libarrow_"; libarrow_acero* / libarrow_python* have distinct suffixes from _flight / _substrait / _dataset), and the _*.so cython globs don't catch _acero*.so / _compute*.so / _parquet*.so. KEEP items (core, python, parquet, acero+compute for the arrow-kernel reducer per #130 (2)) are all preserved.
- 250 MB gate intact. The size check at build_layer.sh:98-105 is untouched. Per the PR body, actual fit is verified by the CI lambda-build Size Report, not locally, which is the correct call given arm64 can't be built here.
- Style/scope. The new rm -f block mirrors the existing strip-block style (multi-line \-continued, 2>/dev/null || true), terse, no dead code. No CI-workflow edits, no live-AWS calls.
Non-blocking notes (no action required for phase 1):
- _dataset*.so also strips _dataset_parquet*.so / _dataset_orc*.so (the datasets-API parquet/orc integration). This is intended and safe: catalog parquet I/O goes through pyarrow.parquet (_parquet*.so, kept), not the datasets layer. Flagging only so it's a conscious decision on record.
- As the PR body itself notes, the issue body attributes the crash to tdigest but the vector (pyarrow) path is actually gain_bias. Phase 1's fix is target-agnostic, so this doesn't affect the diff; just confirming the grounding correction is acknowledged.
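
The glob-collision check described in this review can be reproduced mechanically. The following is a minimal sketch, not the check that was actually run: the glob patterns are the ones quoted in the review, while the file names (version suffixes, cpython tags) are hypothetical stand-ins rather than names read from the wheel.

```python
from fnmatch import fnmatchcase

# Removal globs quoted in the review.
REMOVE_GLOBS = [
    "libarrow_flight*", "libarrow_substrait*", "libgandiva*", "libarrow_dataset*",
    "_flight*.so", "_substrait*.so", "_gandiva*.so", "_dataset*.so",
]

# Hypothetical file names standing in for the KEEP set (core, python,
# parquet, acero, compute). Real wheels carry different suffixes.
KEEP_FILES = [
    "libarrow.so.2400",
    "libarrow_python.so.2400",
    "libarrow_acero.so.2400",
    "libparquet.so.2400",
    "_acero.cpython-312-aarch64-linux-gnu.so",
    "_compute.cpython-312-aarch64-linux-gnu.so",
    "_parquet.cpython-312-aarch64-linux-gnu.so",
]


def collisions(globs, files):
    """Return (glob, file) pairs where a removal glob would delete a kept file."""
    return [(g, f) for g in globs for f in files if fnmatchcase(f, g)]


# No removal glob matches a kept file name.
print(collisions(REMOVE_GLOBS, KEEP_FILES))

# The _dataset*.so glob also catches the datasets-API parquet module.
print(fnmatchcase("_dataset_parquet.cpython-312-aarch64-linux-gnu.so", "_dataset*.so"))
```

This only checks name collisions. It says nothing about whether the removed libraries are needed at import time.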
Generated by Claude Code
espg
left a comment
🤖 from Claude (review)
Adversarial fresh-context review of phase 2a only (commit 4e0d6ca, parent 8bc813d): threading handoff into the Lambda backend, default "pandas", byte-identical default event.
Verdict: correct on every acceptance criterion of #130 that this phase covers. The single most important property holds: in _invoke_lambda_cell the "handoff" event key is added only via if handoff and handoff != "pandas":, exactly mirroring the if profile: opt-in, so a default run's payload is byte-identical to the pre-handoff path. Confirmed:
- Threading is complete and defaults are consistent at every hop: agg(handoff="pandas") -> _run_lambda(handoff="pandas") -> _invoke_lambda_cell(handoff="pandas") -> event["handoff"] (only if non-default) -> lambda_handler event.get("handoff", "pandas") -> process_shard(..., handoff=...). process_shard (worker.py:48) accepts handoff: str = "pandas" keyword-only, so an absent key consumes the default.
- The handler/runner tests are meaningful, not tautological, and cover both directions: explicit "arrow" / "arrow-kernel" is forwarded into the event and reaches process_shard; default adds no key and the consumed value stays "pandas".
- Docstring corrected: the misleading "Only honored by the "local" backend" note is gone, and the docstring now accurately describes lambda forwarding + the byte-identical default.
- No .github/workflows/ edits, no live-AWS calls, no broad # noqa / # type: ignore, terse and matches the surrounding profile pattern, modules well under ~1000 lines.
One substantive finding (non-blocking): the run_benchmark.py additions — all_target_names (run-all excludes provisional_targets) and _resolve_target (provisional fallback) — are new behavioral logic that ships untested at this commit, and the handoff thread into agg in run_target isn't exercised by dry-run (dry-run skips agg). The PR body's test_provisional_targets_excluded_from_merge_matrix and the provisional_targets manifest data land in phase 2b, not here. Per CLAUDE.md §4 that's the same PR, so it's a phasing observation rather than a blocker — just confirm 2b covers all three new functions. Inline notes detail this plus a minor dead KeyError branch in _resolve_target (unreachable from the main() CLI path, which pre-validates and raises SystemExit first).
Nothing here blocks advancing to 2b.
Generated by Claude Code

    # Only add the key when profiling, so default runs stay byte-identical (#100).
    if profile:
        event["profile"] = True
    # Only add the key for a non-default carrier, so default (pandas) runs stay

🤖 from Claude (review)
The byte-identical-default guard is correct and is the most important property of this phase: if handoff and handoff != "pandas" adds the "handoff" key only for a non-default carrier, exactly mirroring the if profile: opt-in two lines up. _captured_event(..., handoff="pandas") -> "handoff" not in event confirms it. No notes here, just confirming the load-bearing invariant holds.
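
To make the invariant concrete, here is a minimal sketch of the opt-in pattern on both the sender and the receiver side. `build_event`, `read_handoff`, and the `shard_id` field are hypothetical stand-ins for illustration; only the two `if` guards and the `event.get("handoff", "pandas")` default are taken from the diff and the review text.

```python
import json


def build_event(shard_id, profile=False, handoff="pandas"):
    """Hypothetical stand-in for the event construction in _invoke_lambda_cell."""
    event = {"shard_id": shard_id}  # "shard_id" is an illustrative field name
    # Opt-in keys only: a default run adds nothing to the payload.
    if profile:
        event["profile"] = True
    if handoff and handoff != "pandas":
        event["handoff"] = handoff
    return event


def read_handoff(event):
    """Receiver side: an absent key falls back to the pandas default."""
    return event.get("handoff", "pandas")


default = build_event("s0")
explicit_pandas = build_event("s0", handoff="pandas")
arrow = build_event("s0", handoff="arrow")

# Default and explicit-pandas payloads serialize to the same bytes.
assert json.dumps(default, sort_keys=True) == json.dumps(explicit_pandas, sort_keys=True)
assert "handoff" not in default
assert read_handoff(default) == "pandas"
assert read_handoff(arrow) == "arrow"
```

Because the key is omitted rather than set to the default value, a payload produced before the handoff parameter existed and one produced with the default are indistinguishable on the wire.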
Generated by Claude Code

    if name in provisional:
        return provisional[name]
    known = list(targets) + list(provisional)
    raise KeyError(f"unknown target '{name}'; have {known}")

🤖 from Claude (review)
This KeyError branch is effectively unreachable from the only in-tree caller. main() pre-validates every name against known = targets | provisional_targets and raises SystemExit (line 212) before run_target -> _resolve_target ever runs, so _resolve_target only sees names already known to be in one block or the other. The KeyError path can only fire via a direct API call to run_target/_resolve_target. That's fine as a defensive guard, but note it's dead for the CLI path and therefore untested — see the summary's coverage point.
Generated by Claude Code

        return (base / rel).resolve()


    def all_target_names(manifest: dict) -> list[str]:

🤖 from Claude (review)
all_target_names (run-all excludes provisional_targets) and _resolve_target (run a provisional by explicit --target) are new behavioral logic, but this commit adds no test for either. The runner-side handoff plumbing (target.get("handoff", "pandas") -> agg(handoff=...)) is also untested here — and unlike _invoke_lambda_cell, it isn't exercised by the dry-run path (dry-run skips the agg call entirely, so the new handoff line in run_target runs but its value is discarded). The PR body says test_provisional_targets_excluded_from_merge_matrix pins the exclusion, but that test (and the provisional_targets manifest data it needs) lands in phase 2b, not here — at this commit tests/test_benchmark.py has no provisional/handoff reference. Per CLAUDE.md §4 it's the same PR, so this is acceptable phasing rather than a blocker, but flagging that as of 2a these branches ship untested. The existing test_main_dry_run_writes_outputs harness makes a provisional_targets-resolution test cheap; consider landing it alongside 2a or confirming 2b covers all three new functions (all_target_names, _resolve_target, the handoff thread into agg).
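
A cheap test of the kind suggested above could look roughly like the following. This is a sketch under stated assumptions: the manifest shape, the target names, and the body of `all_target_names` are hypothetical, and `resolve_target` only mirrors the provisional-fallback and `KeyError` lines visible in the diff (the `targets` lookup before them is assumed).

```python
def all_target_names(manifest: dict) -> list[str]:
    """Names that "run all" iterates: committed targets only (assumed body)."""
    return list(manifest.get("targets", {}))


def resolve_target(manifest: dict, name: str) -> dict:
    """Look a name up in targets first, then fall back to provisional_targets."""
    targets = manifest.get("targets", {})
    provisional = manifest.get("provisional_targets", {})
    if name in targets:
        return targets[name]
    if name in provisional:
        return provisional[name]
    known = list(targets) + list(provisional)
    raise KeyError(f"unknown target '{name}'; have {known}")


# Hypothetical manifest: one committed target, one provisional target.
MANIFEST = {
    "targets": {"committed_a": {"config": "a.yaml"}},
    "provisional_targets": {"provisional_b": {"config": "b.yaml", "handoff": "arrow"}},
}


def test_provisional_excluded_from_run_all():
    names = all_target_names(MANIFEST)
    assert names == ["committed_a"]
    assert set(names).isdisjoint(MANIFEST["provisional_targets"])


def test_provisional_resolves_by_explicit_name():
    target = resolve_target(MANIFEST, "provisional_b")
    assert target.get("handoff", "pandas") == "arrow"
    # A committed target without a handoff key consumes the default.
    assert resolve_target(MANIFEST, "committed_a").get("handoff", "pandas") == "pandas"


def test_unknown_name_raises():
    try:
        resolve_target(MANIFEST, "missing")
    except KeyError:
        return
    raise AssertionError("expected KeyError")


test_provisional_excluded_from_run_all()
test_provisional_resolves_by_explicit_name()
test_unknown_name_raises()
```

The third test covers the defensive `KeyError` branch directly, since the CLI path cannot reach it.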
Generated by Claude Code

    # read/index/aggregate deltas; the write phase runs here, so we bracket it
    # below and merge it in. Default (no key) leaves the worker path unchanged.
    profile = event.get("profile", False)
    # Per-cell carrier/reducer (issue #130). Absent key -> "pandas", the

🤖 from Claude (review)
Handler read/forward is correct: event.get("handoff", "pandas") -> process_shard(..., handoff=handoff), and process_shard's signature (worker.py:48, handoff: str = "pandas") accepts it keyword-only, so an absent key consumes the byte-identical default. The two new handler tests assert both directions (explicit "arrow-kernel" reaches process_shard; no key -> captured["handoff"] == "pandas") and are meaningful, not tautological. No change requested.
Generated by Claude Code

    and ``"arrow"`` produce byte-for-byte identical scalar outputs (#30);
    ``"arrow-kernel"`` agrees within ``KERNEL_RTOL`` (#33). Opt-in for
    benchmarking. Honored by both the ``"local"`` and ``"lambda"`` backends
    (issue #130): the lambda backend forwards it into each cell event, and the

🤖 from Claude (review)
Docstring fix is good: the previously misleading "Only honored by the "local" backend" line is gone, replaced by an accurate "Honored by both "local" and "lambda" backends ... default "pandas" keeps the event payload byte-identical (no key)" note. The threading through agg -> _run_lambda -> _invoke_lambda_cell is complete and the defaults (handoff="pandas") are consistent at every hop, matching the existing profile parameter's shape. The two TestProfilePlumbing additions assert agg(handoff=...) reaches _run_lambda in both directions. Clean.
Generated by Claude Code
espg
left a comment
🤖 from Claude (review)
Adversarial fresh-context review of phase 2b only (commit 88f3dba: scalar config + provisional_targets block + tests). The 2a runner plumbing (all_target_names, _resolve_target, handoff threading) is present at this commit and the four new tests pass against a clean checkout. No blocking findings; everything below is confirmation plus one nit.
Verified against issue #130 acceptance + the (4) decision + CLAUDE.md:
- Config is genuinely scalar + kernel-able. Loaded atl03_scalar_healpix_o11.yaml: fields count (len), h_min (min), h_max (max), h_variance (var); all kind == "scalar", none with expression / weights, no kind: keys at all. Every function is in _KERNEL_FUNCS and passes _kernel_able, so handoff="arrow-kernel" runs it end-to-end (no silent fallback). test_scalar_config_is_kernel_able asserts both kind=="scalar" and _kernel_able(meta) per field, which is meaningful, not tautological (it would catch a vector/ragged or expression field).
- Three provisional targets, one config + one shard, correct handoffs. scalar_pandas→pandas, scalar_arrow→arrow, scalar_kernel→arrow-kernel, all on healpix_o11. The arrow-kernel target points at the scalar config, not a ragged one; tdigest stays pandas-only, as required.
- Decision (4), PR-tree-only, holds. They live under a sibling provisional_targets key, not targets. all_target_names() returns only targets, and test_provisional_targets_excluded_from_merge_matrix pins disjointness and targets ⊆ all_names, so "run all" can't iterate them and the test is meaningful (not just asserting the key is absent).
- Committed targets matrix unchanged. Diff vs parent 4e0d6ca: identical 8-target matrix; no arrow target snuck in.
- Shardmap reuse is sound. Config's data_source / read_plan and the output: grid block are byte-identical to atl03_tdigest_healpix_o11.yaml, so the densest-shard healpix_o11 map (static shard_key, aggregator-independent) pairs correctly; the grid output signature only differs in agg fields.
- Accuracy coverage exists (pre-existing). pandas==arrow byte-identical and kernel-within-KERNEL_RTOL are covered by tests/test_processing.py (kernel-vs-numpy, arrow-matches-pandas) and benchmarks/handoff_bench.py. Phase 2b's tests correctly assert structure/wiring rather than re-deriving accuracy, so there is no fabricated coverage.
- JSON valid; tests terse; no weakened/broad-ignore; no CI-workflow or live-AWS changes in this commit. ruff check tests/test_benchmark.py clean.
One nit inline.
Generated by Claude Code
|
🤖 from Claude Status: all four phases landed; the layer-size gate is confirmed green by CI on both architectures ( Notes for review:
Leaving this Generated by Claude Code |
this is fine.
don't care
... does this let us run benchmarks here, in this thread?? |
|
/benchmark |
|
🤖 from Claude The pyarrow strip can't work for 24.0.0 —
|
| set | size | strippable? |
|---|---|---|
| mandatory core: libarrow 45.9, compute 14.5, parquet 11.3, substrait 4.9, dataset 2.2, python 2.1, acero 1.7 (+ lib/_compute cython) | ~88 MB | no (import needs all) |
| Flight (libarrow_flight 23.4 + python_flight + _flight) | ~25 MB | yes (leaf) |
| unused fs/format cython (_azurefs/_gcsfs/_hdfs/_orc/_s3fs/encryption/cpp_tests/_substrait) | ~3 MB | yes |
| include/ + tests/ | ~8 MB | yes |
Importable minimum is ~100 MB unzipped, dominated by the 88 MB core. That's ~18 MB more than the current (broken) layer, which only "fit" by deleting required libs.
Gate math — it doesn't close on either arch
- x86_64 was 252 MB / 250 with substrait+dataset+parquet already deleted (and Flight already stripped). Restoring those three to make it importable adds ~18 MB → ~270 MB (~20 MB over).
- arm64 was 234 MB under the same broken strip; +~18 MB → ~252 MB — also over 250.
- Flight is already out of those numbers, so there's no ~20 MB left to reclaim — the rest of the layer (numpy/pandas/pyproj/shapely/odc-geo/h5coro/mortie + the 29 MB function) is all required.
Conclusion: pyarrow 24.0.0 cannot be slimmed to fit the 250 MB combined Lambda gate while remaining importable. Path (A) — ship pyarrow, strip to fit — is not viable.
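
The gate arithmetic above can be written out as a short calculation. This is only a restatement of the figures quoted in this comment (250 MB gate, ~18 MB to restore the deleted libraries, 252 MB and 234 MB measured under the broken strip); the names are illustrative.

```python
GATE_MB = 250     # combined Lambda size gate
RESTORE_MB = 18   # approximate cost of restoring substrait + dataset + parquet

# Layer sizes measured under the broken strip, in MB (figures from the list above).
BROKEN_LAYER_MB = {"x86_64": 252, "arm64": 234}


def importable_size(arch: str) -> int:
    """Projected size once the deleted-but-required libraries are restored."""
    return BROKEN_LAYER_MB[arch] + RESTORE_MB


for arch in BROKEN_LAYER_MB:
    size = importable_size(arch)
    print(f"{arch}: {size} MB, {size - GATE_MB:+d} MB relative to the gate")
# x86_64: 270 MB, +20 MB relative to the gate
# arm64: 252 MB, +2 MB relative to the gate
```

Both projected sizes land above the gate, which is why path (A) is ruled out on either architecture.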
Decision is forced back to (B) or (C)
- (B) Drop pyarrow from the deployed worker. Write vectors straight from numpy (the write path already round-trips to_numpy(zero_copy_only=False), so the Table is a pure intermediate), import-gate pyarrow.compute so it's needed only when handoff=arrow-kernel is explicitly set, and don't bundle pyarrow in the layer at all. Removes the entire ~100 MB problem.
- (C) arro3-core (~7 MB) for the carrier + drop pyarrow: same outcome for the layer; can't cover pyarrow.compute (the kernel reducer).
Both have one consequence worth weighing against the earlier "keep arrow-kernel, benchmark-to-decide" call: with pyarrow out of the layer, the arrow-kernel reducer can't run on Lambda — only locally. Given a 250 MB gate vs. an unstrippable ~100 MB pyarrow, that looks unavoidable.
Recommendation: pivot this PR to (B) — numpy vector writes + pyarrow.compute import-gated-and-unbundled, keeping the handoff plumbing + scalar configs so arrow/arrow-kernel can still be benchmarked locally. The size gate then passes for real (pyarrow leaves the layer entirely), and gain_bias stops crashing. Flagging for your call before I change the approach — this reopens the (B)-vs-(C) question from #130 with the hard numbers that now make (A) impossible.
(Aside: the CI benchmark separately failed at earthaccess.login() with ENETUNREACH to urs.earthdata.nasa.gov — a runner→NASA reachability issue, likely IPv6 or runner egress, not the token. Local orchestration against the deployed function is the workaround that surfaced the bug above.)
Generated by Claude Code
|
Do (C) -- arro3-core
What (if anything) can we use out of arro3-compute ? Or are we just using arro3-core given that the Acero TableGroupBy.aggregate hash-aggregate isn't in either (correct me if I'm wrong and it is in arro3-compute). |
|
🤖 from Claude arro3-compute answer + the (C) shape I'll implementShort answer to your question: you're right — the Acero arro3-compute, verified against the installed wheel (v0.8.1, Jun 2026)I installed These are element-wise arithmetic (
The Disposition: per your "Do (C)" call, What (C) looks like in code — verified empiricallyI installed
So the on-disk numpy blocks are unchanged; only the intermediate Arrow library changes. No carrier-seam rework, no design fork — proceeding with (C). Plan (phased, one commit each, self-review per phase)
New dependency note: I'll update the PR body to the (C) pivot as I land the phases. Generated by Claude Code |
espg
left a comment
🤖 from Claude (review)
Fresh-context adversarial review of the pyarrow→arro3-core pivot (commits 984f9bd, 6a3c19c, 0192fce, 3913a0b; diff a5012d4..HEAD). Reviewed against the 7 intent points plus the arro3-vs-pyarrow API-divergence and empty-carrier concerns. Overall this is clean and correct — carrier byte-identity holds, the worker imports no pyarrow on default/vector/arrow paths, arrow-kernel is fully excised with no orphan imports, and deps↔layer agree. Verified locally in the worktree venv (arro3 0.8.1): the 13 arrow/vector carrier tests pass, runner/lambda/benchmark suites pass (131 passed / 10 skipped), import zagg.processing has no orphaned names, the synthetic benchmark runs (parity: OK (pandas == arrow)), and ruff is clean on the diff (the one N818 in registry.py is pre-existing, outside this PR).
Findings, severity-ranked:
LOW-1 — Latent edge: arro3 0.8.1 cannot build a 0-length array, but write.py's empty-arrow path can never reach it (so it's safe, but the dead guard is now misleading).
Confirmed locally: Table.from_pydict / fixed_size_list_array with a 0-length input raises ValueError: 0-length dimension not currently supported. pyarrow built 0-row tables fine. I traced every construction site and the vector/arrow carrier is never built with 0 rows in practice:
- _build_output sizes columns to n_cells = len(children) = prod(chunk_shape) > 0 (a chunk is a fixed cell grid; empty cells get fill values, never a 0-row block).
- _read_group returns None on an emptied read (read.py:521), and process_shard early-returns an empty pandas DataFrame when not all_reads (worker.py:260) before any arro3 concat, so _concat_and_group's Table.from_batches(..., schema=all_reads[0].schema) and _read_group's Table.from_pydict never see an empty input.
So no live bug. But the consequence is that _carrier_empty(carrier) -> carrier.num_rows == 0 (write.py:91) and the num_rows == 0 skip in the write loop are now unreachable for the arrow carrier — an arro3 table with 0 rows can't be constructed to be tested. Not blocking; worth either a one-line comment that the arro3 carrier can't be empty by construction, or leaving as-is for the pandas path (which still can be empty). Flagging so the dead branch isn't mistaken for live empty-arrow handling later.
LOW-2 — Doc nit (non-blocking): aggregate.py still imports get_agg_fields/get_output_signature etc. correctly and get_data_vars was removed cleanly — verified no orphan. No action; just confirming the __init__.py/__all__ cleanup is complete (KERNEL_RTOL, _KERNEL_FUNCS, _kernel_able, _kernel_aggregate all gone from imports AND __all__).
Point-by-point confirmations:
- Carrier byte-identity: PASS. Array.from_numpy(np.ascontiguousarray(...)), fixed_size_list_array(...), and list_flatten(col).to_numpy().reshape(n, width) round-trip exactly. FSL detection via type.list_size is robust: verified list_size is 3 for an FSL and None for a scalar Array, so the "width is not None" test is correct. uint64 morton survives to_numpy() as uint64 (checked). Contiguity is forced via np.ascontiguousarray at every from_numpy; no endianness risk (numpy-native).
- Worker pyarrow-free: PASS. worker.py / aggregate.py have no import pyarrow; the arrow handoff uses arro3.core (_concat_and_group), and read.py's arrow=True branch builds arro3.core.Table.from_pydict (NOT pa.table), a lazy import, arro3 only. The new test_default_and_vector_write_path_works_without_pyarrow blocks pyarrow via a subprocess meta-path finder and exercises both carriers: a genuine guard, passes.
- arrow-kernel removed: PASS. No arrow-kernel / _kernel_* / KERNEL_RTOL / scalar_kernel in src/tests/benchmarks/targets (only accurate historical-context comments remain). process_shard validation is now pandas|arrow; scalar_kernel_healpix_o11 dropped from targets.json. Stale .pyc files in __pycache__ still match but are build artifacts, not source.
- lambda extra ↔ build_layer.sh: PASS. Both pin arro3-core==0.8.1, no pyarrow; the pyarrow component-strip block is fully removed; core dep is arro3-core>=0.8.1 (floor consistent with the == pins). pyarrow is in the catalog extra.
- pyarrow retained for catalog only: PASS. catalog/sources.py is the sole live pyarrow user; pyarrow moved to the catalog extra; core deps no longer list it.
- Tests meaningful: PASS. Migrated TestArrowHandoff / TestVectorCarrier assert real round-trip behavior and pandas↔arrow byte-equality; the null-guard test still trips on an Arrow null; no broad # noqa / # type: ignore; deleted assertions are all kernel-path-specific (legitimately gone).
- Terseness / line limit: PASS. aggregate.py shrank to 650 lines. runner.py is 1162 lines but this PR only edits a docstring there (pre-existing size, not introduced here). Comments are accurate post-pivot.
No approve/request-changes per routine policy — COMMENT only. Nothing here blocks the next phase; LOW-1 is the only item worth a follow-up note.
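
For readers unfamiliar with the meta-path technique used by the pyarrow-free guard test, here is a minimal in-process sketch. It is not the test from this PR (which, per the review, runs in a subprocess); `BlockPyarrow` and `pyarrow_is_blocked` are hypothetical names, and only the standard-library import machinery is used.

```python
import importlib
import sys
from importlib.abc import MetaPathFinder


class BlockPyarrow(MetaPathFinder):
    """Meta-path finder that makes any pyarrow import fail."""

    def find_spec(self, fullname, path=None, target=None):
        if fullname == "pyarrow" or fullname.startswith("pyarrow."):
            raise ImportError(f"{fullname} is blocked for this test")
        return None  # defer to the normal finders for everything else


def pyarrow_is_blocked() -> bool:
    """Install the blocker, try the import, and report whether it was refused."""
    sys.modules.pop("pyarrow", None)  # forget any earlier import
    blocker = BlockPyarrow()
    sys.meta_path.insert(0, blocker)  # consulted before the regular finders
    try:
        importlib.import_module("pyarrow")
    except ImportError:
        return "pyarrow" not in sys.modules
    finally:
        sys.meta_path.remove(blocker)
    return False


print(pyarrow_is_blocked())
```

In a real guard test, the code under test (the default and vector write paths) would run while the blocker is installed, so any accidental pyarrow import fails loudly instead of passing silently on a machine that happens to have pyarrow.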
Generated by Claude Code
|
🤖 from Claude Path-C pivot landed — deployed worker is now pyarrow-free (arro3-core)All phases of the (C) pivot are pushed (commits What changed
Verification
Two items for your call (in the PR body's "Questions for review"):
Only CI confirmation remains (the Generated by Claude Code |
|
🤖 from Claude
|
| carrier | result | max mem |
|---|---|---|
| gain_bias (arro3 vector write) | ✓ | 1278 MB |
| scalar_pandas | ✓ | 1725 MB |
| scalar_arrow (arro3 read handoff) | ✗ timeout | 2048 MB (cap) |
Root cause (measured, not inferred). RSS-traced on Linux/arm64 with the layer's exact numpy==2.2.6 + arro3-core==0.8.1:
- from_numpy / from_pydict / from_batches are zero-copy (Arrow adopts the numpy buffer, confirmed by mutation visibility + flat RSS).
- The single forced copy is combine_chunks().to_numpy() in _concat_and_group (contiguous materialization, unavoidable).
- The bug: the chunked Arrow table (and the per-read tables in all_reads) are kept alive through _group_columns, so the pooled data is held twice during grouping. Pandas avoids this for free (pd.concat, then inputs fall out of scope).
The fix — free the Arrow buffers the instant the columns are numpy, before grouping:

    cols = {n: table.column(n).combine_chunks().to_numpy() for n in table.column_names}
    del table, batches
    all_reads.clear()  # not used after this call (worker.py)

Proof:
- Correct: on the real NEON median shard through the actual process_shard, handoff=arrow (fixed) is byte-identical to pandas: equal total_obs / cells_with_data and equal output-column checksums.
- Survives the cap: same workload under --memory=1600m: unfixed → OOM-kill (exit 137), fixed → exit 0, identical checksum.
- Lighter than pandas: at matched scale (6×3M photons) peak RSS is pandas 1456 MB vs fixed-arro3 1019 MB (~30% lower), because pandas holds 3× (all_reads + concat + grouped) while fixed-arro3 holds 2×. (The pandas branch could take the same del all_reads hygiene and match.)
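
The 3× versus 2× lifetime pattern can be illustrated without arro3 or numpy. The sketch below models it with plain `bytes` buffers and the standard-library `tracemalloc`; it is a stand-in for the buffer lifetimes only, not the worker's code, and the function names and sizes are hypothetical.

```python
import tracemalloc


def concat_and_group(reads, release_inputs):
    """Toy pipeline: concatenate the inputs, then allocate a grouped copy."""
    combined = b"".join(reads)      # the one unavoidable copy
    if release_inputs:
        reads.clear()               # drop the per-read buffers before grouping
    grouped = bytearray(combined)   # stand-in for the grouping stage's allocation
    return len(grouped)


def peak_bytes(release_inputs, n_reads=6, read_size=2_000_000):
    """Peak traced allocation for one run of the toy pipeline."""
    tracemalloc.start()
    reads = [bytes(read_size) for _ in range(n_reads)]
    concat_and_group(reads, release_inputs)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


unfixed = peak_bytes(release_inputs=False)  # inputs + combined + grouped live together
fixed = peak_bytes(release_inputs=True)     # at most two of the three live at once
print(f"unfixed peak: {unfixed / 1e6:.0f} MB, fixed peak: {fixed / 1e6:.0f} MB")
```

Releasing the inputs as soon as the combined buffer exists removes one full copy from the peak, which is the same effect the `del table, batches` / `all_reads.clear()` change is reported to have on the worker.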
Also added opt-in per-stage RSS logging (ZAGG_PROFILE_RSS=1) in _concat_and_group so a run prints exactly where it peaks, for both carriers.
Landing note
Reviving handoff='arrow' means undoing the two transition commits — 96dfa09 (drop arrow / pandas-only) and 2155fc4 (region_timing pandas-only) — then applying the lifetime fix on top. The fix is staged; the branch will be updated to restore the arrow path + fix. The arro3 vector write carrier (gain_bias) was never affected and stays as-is.
Net: the pyarrow→arro3 migration stands; the arro3 read carrier is revivable as the lighter, faster option, with the lifetime fix.
2155fc4 to
b57c3f0
|
🤖 from Claude (review) Fresh-context adversarial review of the arro3-core migration + the memory-lifetime fix ( Memory-lifetime fix (
|
|
🤖 from Claude (review) Final pre-merge adversarial review at
Tests: Verdict: ready to merge. The two "Questions for review" items (validate the fixed read carrier on the dense shard on real Lambda; config-driven handoff in #132) are post-merge follow-ons, not blockers. |
Closes #130
What this does
Migrates the deployed Lambda worker's Arrow carrier from pyarrow to arro3-core (path C, per @espg's Do (C) decision), and makes the arro3 / arrow carrier the default for production and the benchmark matrix.
pyarrow 24.0.0's bindings hard-link a ~100 MB unstrippable C++ core that can't fit the 250 MB combined Lambda gate while remaining importable; arro3-core is ~7 MB, zero required runtime deps, importable inside the gate. The deployed worker + layer are now pyarrow-free; pyarrow survives only in the off-Lambda catalog extra (stac-geoparquet hard-requires it), which never runs on the worker.
The carriers
- gain_bias → 200,676 obs on process-shard-test). On-disk numpy blocks are byte-for-byte identical to the prior pyarrow output (FixedSizeList via fixed_size_list_array + list_flatten(col).to_numpy().reshape(n, width)).
- handoff="arrow") → arro3-core: diagnosed, fixed, and now the default. (Story below.)
- arrow-kernel reducer DROPPED: arro3 has no hash-aggregate. Removed _kernel_* / KERNEL_RTOL, the worker branch, the scalar_kernel target, and the kernel tests.
The read-carrier failure: not a native crash but a memory-lifetime bug (fixed)
The arro3 read carrier first appeared to crash the Lambda worker (scalar_arrow), so it was dropped (96dfa09). Re-investigating with the actual CloudWatch logs showed it was not a native crash but memory exhaustion:
- gain_bias (arro3 vector write)
- scalar_pandas
- scalar_arrow (arro3 read handoff, unfixed): REPORT … Duration: 720000 ms Max Memory Used: 2048 MB Status: timeout. The worker pegged the 2 GB cap and thrashed to the 720 s wall, surfacing to the orchestrator as ConnectionClosedError("connection closed, no response").
Root cause (measured, not inferred). Array.from_numpy / from_pydict / from_batches are zero-copy; the single forced copy is combine_chunks().to_numpy(). _concat_and_group kept the chunked Arrow table (and the per-read tables in all_reads) alive through _group_columns, so the pooled data was held twice. Fix (free the Arrow buffers the instant the columns are numpy):
Proof: peak RSS 1004 → 524 MB; under a 2 GB cap the unfixed run OOMs (exit 137) and the fixed run survives (exit 0) with byte-identical output; on the real NEON median shard handoff=arrow is byte-identical to pandas through process_shard; at matched scale fixed-arro3 (1019 MB) is lighter than pandas (1456 MB, which holds 3× vs arrow's 2×). Full diagnosis: #130 comment; fresh-context adversarial review (no blocking issues): #131 review.
Default switch → arro3 / arrow
- agg() and the runner dispatch layer default handoff="arrow". The wire protocol is unchanged ("absent event key = pandas"): arrow runs inject the key; an explicit handoff="pandas" omits it (byte-identical legacy path).
- process_shard's library default and the handler's absent-default stay "pandas" (legacy-safe: explicit pandas keeps working).
- "handoff": "arrow", so the retained series tracks the carrier production actually runs.
- handoff (an aggregation-YAML field, so other sensors / nullable sources can choose pandas); the read carrier requires dense, null-free columns, so a global default isn't right for every sensor. Not blocking this PR.
Phases
- processing/write.py); byte-identical output.
- catalog (out of lambda); arro3-core in core + lambda; build_layer.sh swap + strip-block removal.
- scalar_pandas / scalar_arrow A/B baseline) + pyarrow-free worker guard test.
- handoff="arrow": diagnosed (CloudWatch: OOM→timeout), fixed (free Arrow buffers before grouping), revived; opt-in ZAGG_PROFILE_RSS per-stage RSS trace added.
- arrow made the default carrier (production agg()/runner + the 8 benchmark targets); docstrings/comments + the two default-assertion tests updated to arrow.
How tested
- test_runner / test_lambda_handler / test_benchmark after the default flip; the two test_default_handoff_is_* tests updated to assert arrow); ruff + ruff format clean.
- handoff=arrow output == pandas on the real median shard.
- pyarrow import, asserting pyarrow never enters sys.modules.
- process-shard-test): gain_bias (arro3 write) ✅ 200,676 obs; scalar_pandas ✅ 200,676 obs. The fixed arro3 read carrier will be exercised by the next merge's arrow benchmark across all 8 targets.
Questions for review
- REPORT mem stays well under 2048 MB.
- ==0.8.1 (lambda) / >=0.8.1 (core); bump policy is yours.