Fix un-awaited store.exists() Zarr-template check in lambda_handler#128
Conversation
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Phase-1 adversarial review of the deployment/aws/lambda_handler.py template-existence guard (issue #118, locked option (B)). The change is correct and lands the intent. Findings below are diff-scoped, mostly minor, to fold into phase 2.
Correctness — the bug is genuinely fixed. open_group(...) is zarr's high-level sync API, so there is no coroutine left unawaited and no RuntimeWarning; the guard now actually fires on a missing template instead of always-truthy-coroutine dead code. Verified the cited reference path (src/zagg/readers/tdigest_tensor.py uses zarr.open_group(store, path=..., mode="r", zarr_format=...) at lines 292/358), and resolved zarr in uv.lock is 3.2.1, matching the empirical verification. The probe's return value is intentionally discarded — fine for an existence check.
1. Exception breadth vs the cited reference (note, not a blocker). GroupNotFoundError is the right, precise catch per the locked decision (it is a FileNotFoundError subclass in zarr 3.2.1, so semantics are sound and backend-uniform across LocalStore/ObjectStore). Worth noting the named reference readers/tdigest_tensor.py:221-223 actually catches (FileNotFoundError, KeyError, ValueError) — and the bare open_group calls at 292/358 catch nothing. So the comment's "mirrors the open-and-catch pattern in readers/tdigest_tensor.py" is slightly aspirational: the closest open-and-catch there is the open_array block (broader catch), not an open_group one. Consider tightening the comment wording to "mirrors the open-and-catch approach" rather than implying an identical pattern exists for open_group.
2. Edge case worth a phase-2 test: node present but wrong kind. If grid.group_path resolves to an array (or otherwise a non-group node) rather than being absent, open_group(mode="r") will not raise GroupNotFoundError — it surfaces a different error and would escape the guard, producing an uncaught exception instead of the intended 500/"template not found". This is an unlikely state for a real template, but since phase 2 adds tests, a "node-exists-but-not-a-group" case (and confirming current behavior is acceptable) would harden the guard. Not requesting a code change in phase 1.
3. Error-message intent preserved. Switching the path in error_msg from {store_path}/{template_key} (where template_key = "{group_path}/zarr.json") to {store_path}/{grid.group_path} is an improvement — it now names the group being probed rather than a synthesized metadata key, and the load-bearing "Zarr template not found" substring is retained, so acceptance/test assertions keyed on that substring still pass. The 500 + metadata["error"] + json.dumps(metadata) body is byte-for-byte the same shape as before. Good.
4. Convention / import style (minor). tdigest_tensor.py accesses the API as import zarr → zarr.open_group; this diff introduces from zarr import open_group. Both are fine and the local-import form is arguably cleaner here, but if matching the nearest in-repo zarr caller is desired, import zarr + qualified calls would be more consistent (§4 "match the surrounding code"). Non-blocking.
5. Scope. Confinement is clean: only the two imports and the single call site changed, nothing else under deployment/aws/. No CI/infra/deploy-script edits (§1 respected). Phase 1 alone is internally consistent (imports resolve, no syntax/dead-code issues); per the plan, tests are deferred to phase 2 — flagging that this phase carries no test, as expected, not as a defect.
No blocking issues; recommend folding (1)/(2)/(4) into phase 2 alongside the tests.
Generated by Claude Code
espg
left a comment
There was a problem hiding this comment.
🤖 from Claude (review)
Phase-2 review (tests + one-comment handler fold), head 1f3ba37. Verdict: the fix and the test wiring are correct; the catch precision matches the locked option (B); one test assertion is toothless as written. No blockers.
(1) test_missing_template_returns_500 genuinely catches the bug — confirmed.
The test patches handler_mod.open_group to raise GroupNotFoundError and asserts 500 + "Zarr template not found". Against pre-fix production code (if not store.exists(template_key):), the handler never called open_group at all — it called the async store.exists(), whose un-awaited coroutine is truthy, so not <coroutine> was always False, the guard was dead, and the handler proceeded to write and return 200. So as a regression against the old code path this test asserts 500 where the bug returned 200. Good.
(2) test_present_template_proceeds_without_runtimewarning — the no-RuntimeWarning assertion is toothless.
The path is fully mocked: open_group is replaced with lambda *a, **k: MagicMock() and store is a MagicMock, so no real coroutine is ever created and store.exists() is never invoked. warnings.simplefilter("error", RuntimeWarning) therefore has nothing to convert — the "coroutine ... was never awaited" warning can only originate from the real async store.exists(), which the fix deleted and this test never exercises. The test still has value as a "present template → 200" check, but the simplefilter("error", RuntimeWarning) wrapper implies it guards the original foot-gun when it cannot. Suggest one of: (a) drop the warnings machinery and rename to test_present_template_proceeds (it's really a 200-path assertion), or (b) keep the warning filter but make it meaningful — e.g. drive the pre-fix construct against a real async-.exists() store in a dedicated test, or assert via gc/asyncio that no coroutine object leaks. Lowest-friction is (a); the genuine regression coverage already lives in test_missing_template_returns_500 and the present-path catch is exercised by TestProcessEventProfile/WriteLoop.
(3) Stub change is correct and minimal. Both surviving store.exists.return_value = ... lines are removed (WriteLoop._patch and Profile._patch); rg store.exists over the file is clean — no dead references left. Patching handler_mod.open_group is the right seam: the handler does from zarr import open_group (module-level name on the handler module, line 60), so monkeypatch.setattr(handler_mod, "open_group", ...) intercepts the exact call. GroupNotFoundError(grid_stub.group_path) is a valid single-arg construction — BaseZarrError.__init__ treats a lone arg as a pre-formatted message (zarr/errors.py), so no template-format mismatch.
(4) Convention / metadata. Naming and structure match the surrounding Test* classes; _patch/_event helpers mirror the existing pattern; comments are terse and grounded in #118. The fake_process_shard in the new class seeds cells_with_data/total_obs/duration_s (plus shard_key/error) — exactly the keys the 200-path structured log and body read (handler lines 369-372), and chunk_results is non-empty so the write/existence branch is actually entered. Good.
(5) Catch precision matches acceptance. open_group(..., mode="r") raises GroupNotFoundError specifically on a missing group (zarr/api/asynchronous.py: raise GroupNotFoundError(msg) after the read-mode open misses). GroupNotFoundError and ArrayNotFoundError both subclass NodeNotFoundError, but a missing group is always the GroupNotFoundError leaf, so catching only GroupNotFoundError (not NodeNotFoundError) misses no legitimately-missing-template case. A present-but-wrong-type node surfaces as NodeTypeValidationError and correctly propagates as a real error rather than "missing" — matching the handler comment and the locked decision.
Net: fold (2) (prefer dropping the warnings wrapper / renaming). Everything else is sound; not blocking.
Generated by Claude Code
|
🤖 from Claude Both phases complete and all adversarial-review findings folded; Generated by Claude Code |
Lambda benchmark —
|
| target | obs | runtime (s) | cost/shard | cost/100 km² | % timeout | mem (MB) | % cap |
|---|---|---|---|---|---|---|---|
| gain_bias_healpix_o11 | 0 | 180.7 | $0.00482 | $0.04755 | 25% | 1051 | 51% |
| tdigest_healpix_o11 | 200,676 | 161.2 | $0.00430 | $0.04241 | 22% | 1603 | 78% |
| gain_bias_healpix_o10 | 0 | 256.8 | $0.00685 | $0.01689 | 36% | 1147 | 56% |
| tdigest_healpix_o10 | 0 | 0.0 | $0.00000 | $0.00000 | n/a | n/a | n/a |
| gain_bias_rect_3km | 0 | 185.2 | $0.00494 | $0.05488 | 26% | 1051 | 51% |
| tdigest_rect_3km | 207,911 | 168.0 | $0.00448 | $0.04978 | 23% | 1568 | 77% |
| gain_bias_rect_6km | 0 | 228.1 | $0.00608 | $0.01690 | 32% | 1133 | 55% |
| tdigest_rect_6km | 745,342 | 201.9 | $0.00538 | $0.01496 | 28% | 1829 | 89% |
arm64 · 2.0 GB · $1.33334e-05/GB-s · one shard/target · pre-merge runs are not retained.
Closes #118
What
deployment/aws/lambda_handler.pyguarded the Zarr write path withif not store.exists(template_key):, butstore(fromopen_store) is a zarr v3Storewhose.exists()isasync def. The call returned a truthy coroutine, so the guard never fired (dead code) and the runtime emittedRuntimeWarning: coroutine 'ObjectStore.exists' was never awaited. A genuinely missing template slipped past, surfacing later as an opaque write failure instead of a clean 500.Approach — option (B)
Per the locked decision (@espg, "We're going with option (B)" and the grounding comment): replace the async
store.exists()call with an open-and-catch via zarr's high-level sync API, catching the missing-node exception. This is in the same open-and-catch spirit assrc/zagg/readers/tdigest_tensor.py.GroupNotFoundError(not bareFileNotFoundError) is caught: it is the precise, intent-revealing, backend-uniform error (verified raised identically onLocalStoreand the obstore-backedObjectStorein zarr 3.2.1).FileNotFoundErroronly catches these today via the quirkyBaseZarrError(ValueError, FileNotFoundError)MRO, which a future zarr could tidy away — re-introducing the exact never-fires bug this issue is about. A present-but-malformed/wrong-type template now surfaces as a real error rather than as "missing," which is the desirable behavior.The change is confined to the existence-check call site + the two zarr imports. No other infra/deploy edits.
Phases
open_group+except GroupNotFoundError) and add the imports.store.existsreturn a bool; under (B) the handler callsopen_group(...), so the missing-template case now makesopen_groupraiseGroupNotFoundError. AddedTestTemplateExistenceGuard: a mock-based regression test asserting the 500 + "Zarr template not found" body (fails against the pre-fix un-awaited code, which fell through to 200), plus real-MemoryStoretests that exercise the un-mockedopen_grouppath (missing group → 500; present group → 200 with noRuntimeWarning; present-but-array node not masked as "missing").How tested
pytest -v tests/test_lambda_handler.py— 21 passed (incl. the newTestTemplateExistenceGuardcases).pytest -q(excluding the pre-existingtest_shardmap.pycollection failure below) — 846 passed, 9 skipped.ruff check --select=E,F,W,I --ignore=E501 src testsandruff format --checkon the touched files — green.>=3.1.5floor) thatopen_group(store, mode="r")on a missing group raisesGroupNotFoundErroron bothLocalStoreand the obstore-backedObjectStore.Questions for review
None — the option-(B) decision and the exact exception (
GroupNotFoundError) are locked in the issue thread.Pre-existing, unrelated:
tests/test_shardmap.pyfails to collect in this env due to a missingstac_geoparquet(catalog extra not installed) — not touched by this PR; flagged, not fixed.