Dispatcher abstraction: Executor protocol + generic dispatch loop by espg · Pull Request #76 · englacial/zagg

espg · 2026-06-22T19:32:46Z

Closes #63.

What this does

Extracts the fan-out → retry → measured-cost loop that lived as two bespoke functions in runner.py (_run_local / _run_lambda) into a clean seam in a new src/zagg/dispatch.py, so temporal (#12) and future cluster backends (#20) inherit local + Lambda execution behind one interface.

Implements the (B)+(C) design locked with @espg on the issue thread (comment, recap):

Executor protocol (@runtime_checkable): preflight(n_cells) -> PreflightReport, submit(payload) -> Future, measure_cost(result) -> CellCost, finalize() -> RunReport, shutdown().
RetryPolicy(max_attempts, backoff, classify) as a separate strategy object, with defaults LAMBDA_RETRY / LOCAL_RETRY. The only per-backend variation is classify — errno-24 / EMFILE is not retryable (run-fatal, re-raised), boto3 throttling is.
Generic dispatch() loop returning a structured RunReport. Per-result counting differs between backends, so it lives in a caller-supplied accumulate callback rather than being baked in.
LocalExecutor (thread pool, zero metered cost) + LambdaExecutor (boto3 fan-out; pricing model in measure_cost).

Approach / what stayed put (per the locked answers)

Public agg(backend="local"|"lambda") signature is verbatim. The executor is an internal seam selected from backend, not a new public param.
concurrency.py stays a helper module, now called from inside LambdaExecutor.preflight() (via the runner's injected _preflight) rather than inline.
runner keeps cost presentation (it formats gb_seconds / estimated_cost_usd off the RunReport); dispatch() only returns structured data.
Concurrency orchestrator: mid-run capacity re-check + grant cloudwatch:GetMetricStatistics in template.yaml #56 (mid-run capacity re-check) is out of scope — no hook implemented beyond the existing preflight seam.
The boto3 seams (_invoke_lambda_cell / _invoke_lambda_setup / _invoke_lambda_finalize / compute_available_workers / ThreadPoolExecutor) are injected into the executors off the runner module namespace, so dispatch.py needs no boto3 import and the existing tests that monkeypatch those names keep binding the exact objects in use.

Phases

Phase 1 — lift _run_lambda → LambdaExecutor + generic dispatch() + RetryPolicy.
Phase 2 — port _run_local → LocalExecutor on the same loop (validates the protocol on the trivial case).
Phase 3 — concurrency probe behind LambdaExecutor.preflight(); dispatch() returns RunReport, runner keeps cost presentation.
Phase 4 — tests: protocol conformance for both executors + byte-identical summary/cost pins.

How it was tested (byte-identical evidence)

The spatial production path must stay byte-identical — Zarr store, run-summary dict keys, and per-cell Lambda event payload all unchanged. Pinned by:

tests/test_runner.py::TestInvokeLambdaCellEvent — per-cell event payload bytes (unchanged on this branch).
New tests/test_runner.py::TestSummaryKeysByteIdentical — pins the exact summary-dict key set and the data/error counters for both backends through the dispatch refactor, plus the Lambda cost math (4 cells × 2 s × 2 GB = 16 GB-s, × $0.0000133334).
tests/test_runner_concurrency.py — the probe-clamp / FD-exhaustion wiring still passes unchanged (pool sized to the clamped worker count; errno-24 surfaces with ulimit guidance).
New tests/test_dispatch.py — RetryPolicy defaults + classifiers (throttling retryable, EMFILE not), Executor protocol conformance (isinstance(..., Executor) for both), LambdaExecutor.measure_cost pricing, LocalExecutor zero-cost, and the generic loop's accumulate/cost/on_submit_error behavior.

Local green:

ruff check --select=E,F,W,I --ignore=E501 on touched files — clean.
ruff format --check on the new files (dispatch.py, test_dispatch.py) — clean. (runner.py / test_runner.py are not ruff-format-clean on main, and CI's lint.yml only runs ruff check — so I deliberately did not whole-file-reformat them, to avoid ~200 lines of unrelated churn; only the logical edits are present.)
mypy (pre-commit) — no errors in dispatch.py / runner.py / the new tests.
pytest -q — 536 passed, 1 skipped.

Questions for review

3 pre-existing test failures, not mine. tests/test_integration.py::{test_full_integration,test_multiple_parent_cells} and tests/test_processing.py::TestWriteDataframeToZarr::test_write_dataframe_to_zarr fail on a clean origin/main checkout (verified by stashing my diff) — they're unrelated to this change, so I left them per the "don't fix unrelated CI failures, flag them" rule. Flagging here.
runner.py left un-ruff-format'd (see above) to keep the diff reviewable. Say the word if you'd rather take the whole-file reformat in this PR.
RetryPolicy is currently advisory — the in-process executors apply retry inside submit (Lambda's existing _invoke_lambda_cell retry; local runs once), matching the pre-refactor behavior so the spatial path stays byte-identical. The policy object is threaded through dispatch() so a future loop-level retry (and the Backend support (i.e., local processing support) #20 cluster backends) can consult one policy without a signature change. OK to land the seam advisory now, or do you want the loop to drive retry centrally in this PR?
Temporal aggregation infrastructure (Refs #12) #70 rebases onto this (its functional dispatch.py → this protocol shape), as agreed — no action here.

🤖 Generated with Claude Code

https://claude.ai/code/session_01EN7X53ZAyaCca9rjwv9qZw

Generated by Claude Code

espg

🤖 from Claude (review)

Fresh-context adversarial self-review of the dispatcher refactor. I checked the byte-identical surface hard: per-cell Lambda event payload, run-summary keys/values, counting rules (incl. the "No granules found" / "No data after filtering" no-data sentinels and the local raised-vs-meta-error asymmetry), the probe→client→setup→fan-out→finalize ordering, max_pool_connections sizing, progress-log cadence, FD-exhaustion re-raise, and pool-shutdown-in-finally. All faithful — except one real finding, now folded in this push:

estimated_cost_usd was no longer byte-identical — the loop accumulated per-cell cost_usd, which differs from origin/main's single multiply over summed time in the last FP ULP (~83% of heterogeneous-duration runs). Fixed by computing cost once off the summed compute_time_s in runner._run_lambda; added a mixed-duration test that pins it. See the two inline comments + the test comment.

Everything else verified correct; no further diff-scoped changes. The 3 unrelated pre-existing test failures (test_integration.py::{test_full_integration,test_multiple_parent_cells}, test_processing.py::TestWriteDataframeToZarr::test_write_dataframe_to_zarr) are flagged in the PR body's "Questions for review" — they fail on a clean origin/main checkout, so left untouched.

Generated by Claude Code

espg · 2026-06-22T19:48:18Z

+    # the per-cell CellCost.cost_usd is for the report's structured breakdown.
+    total_lambda_time = report.cost.compute_time_s
+    memory_gb = LAMBDA_MEMORY_GB
    gb_seconds = total_lambda_time * memory_gb


🤖 from Claude (review)

Folded (was the one blocking finding). Earlier in this branch the runner read estimated_cost straight off report.cost.cost_usd, i.e. the loop's sum of per-cell (duration_i · 2.0) · price. That is not bit-equal to origin/main's single multiply over the summed time, (Σ duration · 2.0) · price — IEEE-754 makes the two orders differ in the last ULP for ~83% of heterogeneous-duration runs (verified: 0.00035466843999999996 vs 0.00035466844 for durations [0.1, 0.2, 0.3, 12.7]). Since estimated_cost_usd is a run-summary value and the hard requirement is byte-identical summaries, this line now computes gb_seconds/estimated_cost once off report.cost.compute_time_s, restoring the exact pre-refactor arithmetic order. gb_seconds was always safe (×2.0 is exact in binary FP). The per-cell CellCost.cost_usd stays for the report's structured breakdown only.

Generated by Claude Code

espg · 2026-06-22T19:48:28Z

+        cost = executor.measure_cost(result)
+        report.cost.compute_time_s += cost.compute_time_s
+        report.cost.gb_seconds += cost.gb_seconds
+        report.cost.cost_usd += cost.cost_usd


🤖 from Claude (review)

Note for #20/#12 backends: the loop still accumulates cost.cost_usd per cell here, which is fine for the report's structured per-cell breakdown, but consumers that need a byte-stable aggregate dollar figure should derive it once from the summed compute_time_s (as runner._run_lambda now does), not from this running cost_usd sum — the two differ in the last FP ULP. gb_seconds accumulation is exact and safe to read directly.

Generated by Claude Code

espg · 2026-06-22T19:48:38Z

+        assert summary["price_per_gb_sec"] == 0.0000133334
+        assert summary["estimated_cost_usd"] == 16.0 * 0.0000133334
+
+    def test_lambda_cost_byte_identical_with_mixed_durations(self, monkeypatch, atl06_config):


🤖 from Claude (review)

Folded (test-gap finding). The other cost test (test_lambda_summary_keys_and_cost) uses uniform per-cell durations (all 2.0), the one case where per-cell-accumulation and multiply-once happen to agree — so it could not have caught the ULP divergence above. This test uses heterogeneous durations [0.1, 0.2, 0.3, 12.7] (with max_workers=1 so completion order is deterministic) and asserts estimated_cost_usd == (Σ duration · 2.0) · price. It fails against the per-cell-accumulation version and passes against the folded fix, pinning the byte-identical requirement.

Generated by Claude Code

claude added 4 commits June 22, 2026 19:25

phase 1 of issue #63

7efae05

phase 2 of issue #63

7ad199f

phase 3 of issue #63

6ce4c05

phase 4 of issue #63

17565f1

espg added the implement label Jun 22, 2026

fold review: byte-identical lambda cost arithmetic + mixed-duration test

cec7d37

espg commented Jun 22, 2026

View reviewed changes

espg added the waiting label Jun 22, 2026

espg marked this pull request as ready for review June 22, 2026 19:49

espg merged commit a2697ec into main Jun 22, 2026
12 checks passed

espg pushed a commit that referenced this pull request Jun 22, 2026

merge main: reconcile onto #73 registry + #76 dispatch (#12)

fbfd865

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Dispatcher abstraction: Executor protocol + generic dispatch loop#76

Dispatcher abstraction: Executor protocol + generic dispatch loop#76
espg merged 5 commits into
mainfrom
claude/63-dispatcher

espg commented Jun 22, 2026

Uh oh!

espg left a comment

Uh oh!

espg Jun 22, 2026

Uh oh!

espg Jun 22, 2026

Uh oh!

espg Jun 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

espg commented Jun 22, 2026

What this does

Approach / what stayed put (per the locked answers)

Phases

How it was tested (byte-identical evidence)

Questions for review

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

espg Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

espg Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

espg Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants