Skip to content

Fix un-awaited store.exists() Zarr-template check in lambda_handler#128

Merged
espg merged 3 commits into
mainfrom
claude/118-store-exists
Jun 29, 2026
Merged

Fix un-awaited store.exists() Zarr-template check in lambda_handler#128
espg merged 3 commits into
mainfrom
claude/118-store-exists

Conversation

@espg

@espg espg commented Jun 29, 2026

Copy link
Copy Markdown
Member

Closes #118

What

deployment/aws/lambda_handler.py guarded the Zarr write path with if not store.exists(template_key):, but store (from open_store) is a zarr v3 Store whose .exists() is async def. The call returned a truthy coroutine, so the guard never fired (dead code) and the runtime emitted RuntimeWarning: coroutine 'ObjectStore.exists' was never awaited. A genuinely missing template slipped past, surfacing later as an opaque write failure instead of a clean 500.

Approach — option (B)

Per the locked decision (@espg, "We're going with option (B)" and the grounding comment): replace the async store.exists() call with an open-and-catch via zarr's high-level sync API, catching the missing-node exception. This is in the same open-and-catch spirit as src/zagg/readers/tdigest_tensor.py.

from zarr import open_group
from zarr.errors import GroupNotFoundError
...
try:
    open_group(store, path=grid.group_path, mode="r", zarr_format=3)
except GroupNotFoundError:
    error_msg = f"Zarr template not found at {store_path}/{grid.group_path}"
    logger.error(error_msg)
    metadata["error"] = error_msg
    return {"statusCode": 500, "body": json.dumps(metadata)}

GroupNotFoundError (not bare FileNotFoundError) is caught: it is the precise, intent-revealing, backend-uniform error (verified raised identically on LocalStore and the obstore-backed ObjectStore in zarr 3.2.1). FileNotFoundError only catches these today via the quirky BaseZarrError(ValueError, FileNotFoundError) MRO, which a future zarr could tidy away — re-introducing the exact never-fires bug this issue is about. A present-but-malformed/wrong-type template now surfaces as a real error rather than as "missing," which is the desirable behavior.

The change is confined to the existence-check call site + the two zarr imports. No other infra/deploy edits.

Phases

  • Phase 1 — fix the call site (open_group + except GroupNotFoundError) and add the imports.
  • Phase 2 — fix the tests that masked the bug. The stub previously made store.exists return a bool; under (B) the handler calls open_group(...), so the missing-template case now makes open_group raise GroupNotFoundError. Added TestTemplateExistenceGuard: a mock-based regression test asserting the 500 + "Zarr template not found" body (fails against the pre-fix un-awaited code, which fell through to 200), plus real-MemoryStore tests that exercise the un-mocked open_group path (missing group → 500; present group → 200 with no RuntimeWarning; present-but-array node not masked as "missing").

How tested

  • pytest -v tests/test_lambda_handler.py — 21 passed (incl. the new TestTemplateExistenceGuard cases).
  • Broader pytest -q (excluding the pre-existing test_shardmap.py collection failure below) — 846 passed, 9 skipped.
  • ruff check --select=E,F,W,I --ignore=E501 src tests and ruff format --check on the touched files — green.
  • Verified empirically against the installed zarr (3.2.1, the pinned >=3.1.5 floor) that open_group(store, mode="r") on a missing group raises GroupNotFoundError on both LocalStore and the obstore-backed ObjectStore.

Questions for review

None — the option-(B) decision and the exact exception (GroupNotFoundError) are locked in the issue thread.

Pre-existing, unrelated: tests/test_shardmap.py fails to collect in this env due to a missing stac_geoparquet (catalog extra not installed) — not touched by this PR; flagged, not fixed.

@espg espg added the implement label Jun 29, 2026

@espg espg left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Phase-1 adversarial review of the deployment/aws/lambda_handler.py template-existence guard (issue #118, locked option (B)). The change is correct and lands the intent. Findings below are diff-scoped, mostly minor, to fold into phase 2.

Correctness — the bug is genuinely fixed. open_group(...) is zarr's high-level sync API, so there is no coroutine left unawaited and no RuntimeWarning; the guard now actually fires on a missing template instead of always-truthy-coroutine dead code. Verified the cited reference path (src/zagg/readers/tdigest_tensor.py uses zarr.open_group(store, path=..., mode="r", zarr_format=...) at lines 292/358), and resolved zarr in uv.lock is 3.2.1, matching the empirical verification. The probe's return value is intentionally discarded — fine for an existence check.

1. Exception breadth vs the cited reference (note, not a blocker). GroupNotFoundError is the right, precise catch per the locked decision (it is a FileNotFoundError subclass in zarr 3.2.1, so semantics are sound and backend-uniform across LocalStore/ObjectStore). Worth noting the named reference readers/tdigest_tensor.py:221-223 actually catches (FileNotFoundError, KeyError, ValueError) — and the bare open_group calls at 292/358 catch nothing. So the comment's "mirrors the open-and-catch pattern in readers/tdigest_tensor.py" is slightly aspirational: the closest open-and-catch there is the open_array block (broader catch), not an open_group one. Consider tightening the comment wording to "mirrors the open-and-catch approach" rather than implying an identical pattern exists for open_group.

2. Edge case worth a phase-2 test: node present but wrong kind. If grid.group_path resolves to an array (or otherwise a non-group node) rather than being absent, open_group(mode="r") will not raise GroupNotFoundError — it surfaces a different error and would escape the guard, producing an uncaught exception instead of the intended 500/"template not found". This is an unlikely state for a real template, but since phase 2 adds tests, a "node-exists-but-not-a-group" case (and confirming current behavior is acceptable) would harden the guard. Not requesting a code change in phase 1.

3. Error-message intent preserved. Switching the path in error_msg from {store_path}/{template_key} (where template_key = "{group_path}/zarr.json") to {store_path}/{grid.group_path} is an improvement — it now names the group being probed rather than a synthesized metadata key, and the load-bearing "Zarr template not found" substring is retained, so acceptance/test assertions keyed on that substring still pass. The 500 + metadata["error"] + json.dumps(metadata) body is byte-for-byte the same shape as before. Good.

4. Convention / import style (minor). tdigest_tensor.py accesses the API as import zarrzarr.open_group; this diff introduces from zarr import open_group. Both are fine and the local-import form is arguably cleaner here, but if matching the nearest in-repo zarr caller is desired, import zarr + qualified calls would be more consistent (§4 "match the surrounding code"). Non-blocking.

5. Scope. Confinement is clean: only the two imports and the single call site changed, nothing else under deployment/aws/. No CI/infra/deploy-script edits (§1 respected). Phase 1 alone is internally consistent (imports resolve, no syntax/dead-code issues); per the plan, tests are deferred to phase 2 — flagging that this phase carries no test, as expected, not as a defect.

No blocking issues; recommend folding (1)/(2)/(4) into phase 2 alongside the tests.


Generated by Claude Code

@espg espg left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Phase-2 review (tests + one-comment handler fold), head 1f3ba37. Verdict: the fix and the test wiring are correct; the catch precision matches the locked option (B); one test assertion is toothless as written. No blockers.

(1) test_missing_template_returns_500 genuinely catches the bug — confirmed.
The test patches handler_mod.open_group to raise GroupNotFoundError and asserts 500 + "Zarr template not found". Against pre-fix production code (if not store.exists(template_key):), the handler never called open_group at all — it called the async store.exists(), whose un-awaited coroutine is truthy, so not <coroutine> was always False, the guard was dead, and the handler proceeded to write and return 200. So as a regression against the old code path this test asserts 500 where the bug returned 200. Good.

(2) test_present_template_proceeds_without_runtimewarning — the no-RuntimeWarning assertion is toothless.
The path is fully mocked: open_group is replaced with lambda *a, **k: MagicMock() and store is a MagicMock, so no real coroutine is ever created and store.exists() is never invoked. warnings.simplefilter("error", RuntimeWarning) therefore has nothing to convert — the "coroutine ... was never awaited" warning can only originate from the real async store.exists(), which the fix deleted and this test never exercises. The test still has value as a "present template → 200" check, but the simplefilter("error", RuntimeWarning) wrapper implies it guards the original foot-gun when it cannot. Suggest one of: (a) drop the warnings machinery and rename to test_present_template_proceeds (it's really a 200-path assertion), or (b) keep the warning filter but make it meaningful — e.g. drive the pre-fix construct against a real async-.exists() store in a dedicated test, or assert via gc/asyncio that no coroutine object leaks. Lowest-friction is (a); the genuine regression coverage already lives in test_missing_template_returns_500 and the present-path catch is exercised by TestProcessEventProfile/WriteLoop.

(3) Stub change is correct and minimal. Both surviving store.exists.return_value = ... lines are removed (WriteLoop._patch and Profile._patch); rg store.exists over the file is clean — no dead references left. Patching handler_mod.open_group is the right seam: the handler does from zarr import open_group (module-level name on the handler module, line 60), so monkeypatch.setattr(handler_mod, "open_group", ...) intercepts the exact call. GroupNotFoundError(grid_stub.group_path) is a valid single-arg construction — BaseZarrError.__init__ treats a lone arg as a pre-formatted message (zarr/errors.py), so no template-format mismatch.

(4) Convention / metadata. Naming and structure match the surrounding Test* classes; _patch/_event helpers mirror the existing pattern; comments are terse and grounded in #118. The fake_process_shard in the new class seeds cells_with_data/total_obs/duration_s (plus shard_key/error) — exactly the keys the 200-path structured log and body read (handler lines 369-372), and chunk_results is non-empty so the write/existence branch is actually entered. Good.

(5) Catch precision matches acceptance. open_group(..., mode="r") raises GroupNotFoundError specifically on a missing group (zarr/api/asynchronous.py: raise GroupNotFoundError(msg) after the read-mode open misses). GroupNotFoundError and ArrayNotFoundError both subclass NodeNotFoundError, but a missing group is always the GroupNotFoundError leaf, so catching only GroupNotFoundError (not NodeNotFoundError) misses no legitimately-missing-template case. A present-but-wrong-type node surfaces as NodeTypeValidationError and correctly propagates as a real error rather than "missing" — matching the handler comment and the locked decision.

Net: fold (2) (prefer dropping the warnings wrapper / renaming). Everything else is sound; not blocking.


Generated by Claude Code

@espg espg added the waiting label Jun 29, 2026
@espg

espg commented Jun 29, 2026

Copy link
Copy Markdown
Member Author

🤖 from Claude

Both phases complete and all adversarial-review findings folded; waiting for review. Catches GroupNotFoundError from the high-level open_group check (option B). pytest -v tests/test_lambda_handler.py is green (21 passed, incl. real-store guard tests that exercise the un-mocked open_group path), and the broader suite is 846 passed / 9 skipped; ruff check + format clean. Left in draft per routine policy.


Generated by Claude Code

@espg espg marked this pull request as ready for review June 29, 2026 21:42
@espg espg merged commit e60796e into main Jun 29, 2026
8 checks passed
@github-actions

Copy link
Copy Markdown
Contributor

Lambda benchmark — 8f749bc

target obs runtime (s) cost/shard cost/100 km² % timeout mem (MB) % cap
gain_bias_healpix_o11 0 180.7 $0.00482 $0.04755 25% 1051 51%
tdigest_healpix_o11 200,676 161.2 $0.00430 $0.04241 22% 1603 78%
gain_bias_healpix_o10 0 256.8 $0.00685 $0.01689 36% 1147 56%
tdigest_healpix_o10 0 0.0 $0.00000 $0.00000 n/a n/a n/a
gain_bias_rect_3km 0 185.2 $0.00494 $0.05488 26% 1051 51%
tdigest_rect_3km 207,911 168.0 $0.00448 $0.04978 23% 1568 77%
gain_bias_rect_6km 0 228.1 $0.00608 $0.01690 32% 1133 55%
tdigest_rect_6km 745,342 201.9 $0.00538 $0.01496 28% 1829 89%

arm64 · 2.0 GB · $1.33334e-05/GB-s · one shard/target · pre-merge runs are not retained.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

lambda_handler: store.exists() coroutine never awaited — Zarr-template existence check is a no-op

2 participants