Skip to content

Dispatcher abstraction: Executor protocol + generic dispatch loop#76

Merged
espg merged 5 commits into
mainfrom
claude/63-dispatcher
Jun 22, 2026
Merged

Dispatcher abstraction: Executor protocol + generic dispatch loop#76
espg merged 5 commits into
mainfrom
claude/63-dispatcher

Conversation

@espg

@espg espg commented Jun 22, 2026

Copy link
Copy Markdown
Member

Closes #63.

What this does

Extracts the fan-out → retry → measured-cost loop that lived as two bespoke functions in runner.py (_run_local / _run_lambda) into a clean seam in a new src/zagg/dispatch.py, so temporal (#12) and future cluster backends (#20) inherit local + Lambda execution behind one interface.

Implements the (B)+(C) design locked with @espg on the issue thread (comment, recap):

  • Executor protocol (@runtime_checkable): preflight(n_cells) -> PreflightReport, submit(payload) -> Future, measure_cost(result) -> CellCost, finalize() -> RunReport, shutdown().
  • RetryPolicy(max_attempts, backoff, classify) as a separate strategy object, with defaults LAMBDA_RETRY / LOCAL_RETRY. The only per-backend variation is classify — errno-24 / EMFILE is not retryable (run-fatal, re-raised), boto3 throttling is.
  • Generic dispatch() loop returning a structured RunReport. Per-result counting differs between backends, so it lives in a caller-supplied accumulate callback rather than being baked in.
  • LocalExecutor (thread pool, zero metered cost) + LambdaExecutor (boto3 fan-out; pricing model in measure_cost).

Approach / what stayed put (per the locked answers)

  • Public agg(backend="local"|"lambda") signature is verbatim. The executor is an internal seam selected from backend, not a new public param.
  • concurrency.py stays a helper module, now called from inside LambdaExecutor.preflight() (via the runner's injected _preflight) rather than inline.
  • runner keeps cost presentation (it formats gb_seconds / estimated_cost_usd off the RunReport); dispatch() only returns structured data.
  • Concurrency orchestrator: mid-run capacity re-check + grant cloudwatch:GetMetricStatistics in template.yaml #56 (mid-run capacity re-check) is out of scope — no hook implemented beyond the existing preflight seam.
  • The boto3 seams (_invoke_lambda_cell / _invoke_lambda_setup / _invoke_lambda_finalize / compute_available_workers / ThreadPoolExecutor) are injected into the executors off the runner module namespace, so dispatch.py needs no boto3 import and the existing tests that monkeypatch those names keep binding the exact objects in use.

Phases

  • Phase 1 — lift _run_lambdaLambdaExecutor + generic dispatch() + RetryPolicy.
  • Phase 2 — port _run_localLocalExecutor on the same loop (validates the protocol on the trivial case).
  • Phase 3 — concurrency probe behind LambdaExecutor.preflight(); dispatch() returns RunReport, runner keeps cost presentation.
  • Phase 4 — tests: protocol conformance for both executors + byte-identical summary/cost pins.

How it was tested (byte-identical evidence)

The spatial production path must stay byte-identical — Zarr store, run-summary dict keys, and per-cell Lambda event payload all unchanged. Pinned by:

  • tests/test_runner.py::TestInvokeLambdaCellEvent — per-cell event payload bytes (unchanged on this branch).
  • New tests/test_runner.py::TestSummaryKeysByteIdentical — pins the exact summary-dict key set and the data/error counters for both backends through the dispatch refactor, plus the Lambda cost math (4 cells × 2 s × 2 GB = 16 GB-s, × $0.0000133334).
  • tests/test_runner_concurrency.py — the probe-clamp / FD-exhaustion wiring still passes unchanged (pool sized to the clamped worker count; errno-24 surfaces with ulimit guidance).
  • New tests/test_dispatch.pyRetryPolicy defaults + classifiers (throttling retryable, EMFILE not), Executor protocol conformance (isinstance(..., Executor) for both), LambdaExecutor.measure_cost pricing, LocalExecutor zero-cost, and the generic loop's accumulate/cost/on_submit_error behavior.

Local green:

  • ruff check --select=E,F,W,I --ignore=E501 on touched files — clean.
  • ruff format --check on the new files (dispatch.py, test_dispatch.py) — clean. (runner.py / test_runner.py are not ruff-format-clean on main, and CI's lint.yml only runs ruff check — so I deliberately did not whole-file-reformat them, to avoid ~200 lines of unrelated churn; only the logical edits are present.)
  • mypy (pre-commit) — no errors in dispatch.py / runner.py / the new tests.
  • pytest -q — 536 passed, 1 skipped.

Questions for review

  1. 3 pre-existing test failures, not mine. tests/test_integration.py::{test_full_integration,test_multiple_parent_cells} and tests/test_processing.py::TestWriteDataframeToZarr::test_write_dataframe_to_zarr fail on a clean origin/main checkout (verified by stashing my diff) — they're unrelated to this change, so I left them per the "don't fix unrelated CI failures, flag them" rule. Flagging here.
  2. runner.py left un-ruff-format'd (see above) to keep the diff reviewable. Say the word if you'd rather take the whole-file reformat in this PR.
  3. RetryPolicy is currently advisory — the in-process executors apply retry inside submit (Lambda's existing _invoke_lambda_cell retry; local runs once), matching the pre-refactor behavior so the spatial path stays byte-identical. The policy object is threaded through dispatch() so a future loop-level retry (and the Backend support (i.e., local processing support) #20 cluster backends) can consult one policy without a signature change. OK to land the seam advisory now, or do you want the loop to drive retry centrally in this PR?
  4. Temporal aggregation infrastructure (Refs #12) #70 rebases onto this (its functional dispatch.py → this protocol shape), as agreed — no action here.

🤖 Generated with Claude Code

https://claude.ai/code/session_01EN7X53ZAyaCca9rjwv9qZw


Generated by Claude Code

@espg espg added the implement label Jun 22, 2026

@espg espg left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Fresh-context adversarial self-review of the dispatcher refactor. I checked the byte-identical surface hard: per-cell Lambda event payload, run-summary keys/values, counting rules (incl. the "No granules found" / "No data after filtering" no-data sentinels and the local raised-vs-meta-error asymmetry), the probe→client→setup→fan-out→finalize ordering, max_pool_connections sizing, progress-log cadence, FD-exhaustion re-raise, and pool-shutdown-in-finally. All faithful — except one real finding, now folded in this push:

  • estimated_cost_usd was no longer byte-identical — the loop accumulated per-cell cost_usd, which differs from origin/main's single multiply over summed time in the last FP ULP (~83% of heterogeneous-duration runs). Fixed by computing cost once off the summed compute_time_s in runner._run_lambda; added a mixed-duration test that pins it. See the two inline comments + the test comment.

Everything else verified correct; no further diff-scoped changes. The 3 unrelated pre-existing test failures (test_integration.py::{test_full_integration,test_multiple_parent_cells}, test_processing.py::TestWriteDataframeToZarr::test_write_dataframe_to_zarr) are flagged in the PR body's "Questions for review" — they fail on a clean origin/main checkout, so left untouched.


Generated by Claude Code

Comment thread src/zagg/runner.py
# the per-cell CellCost.cost_usd is for the report's structured breakdown.
total_lambda_time = report.cost.compute_time_s
memory_gb = LAMBDA_MEMORY_GB
gb_seconds = total_lambda_time * memory_gb

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Folded (was the one blocking finding). Earlier in this branch the runner read estimated_cost straight off report.cost.cost_usd, i.e. the loop's sum of per-cell (duration_i · 2.0) · price. That is not bit-equal to origin/main's single multiply over the summed time, (Σ duration · 2.0) · price — IEEE-754 makes the two orders differ in the last ULP for ~83% of heterogeneous-duration runs (verified: 0.00035466843999999996 vs 0.00035466844 for durations [0.1, 0.2, 0.3, 12.7]). Since estimated_cost_usd is a run-summary value and the hard requirement is byte-identical summaries, this line now computes gb_seconds/estimated_cost once off report.cost.compute_time_s, restoring the exact pre-refactor arithmetic order. gb_seconds was always safe (×2.0 is exact in binary FP). The per-cell CellCost.cost_usd stays for the report's structured breakdown only.


Generated by Claude Code

Comment thread src/zagg/dispatch.py
cost = executor.measure_cost(result)
report.cost.compute_time_s += cost.compute_time_s
report.cost.gb_seconds += cost.gb_seconds
report.cost.cost_usd += cost.cost_usd

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Note for #20/#12 backends: the loop still accumulates cost.cost_usd per cell here, which is fine for the report's structured per-cell breakdown, but consumers that need a byte-stable aggregate dollar figure should derive it once from the summed compute_time_s (as runner._run_lambda now does), not from this running cost_usd sum — the two differ in the last FP ULP. gb_seconds accumulation is exact and safe to read directly.


Generated by Claude Code

Comment thread tests/test_runner.py
assert summary["price_per_gb_sec"] == 0.0000133334
assert summary["estimated_cost_usd"] == 16.0 * 0.0000133334

def test_lambda_cost_byte_identical_with_mixed_durations(self, monkeypatch, atl06_config):

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 from Claude (review)

Folded (test-gap finding). The other cost test (test_lambda_summary_keys_and_cost) uses uniform per-cell durations (all 2.0), the one case where per-cell-accumulation and multiply-once happen to agree — so it could not have caught the ULP divergence above. This test uses heterogeneous durations [0.1, 0.2, 0.3, 12.7] (with max_workers=1 so completion order is deterministic) and asserts estimated_cost_usd == (Σ duration · 2.0) · price. It fails against the per-cell-accumulation version and passes against the folded fix, pinning the byte-identical requirement.


Generated by Claude Code

@espg espg added the waiting label Jun 22, 2026
@espg espg marked this pull request as ready for review June 22, 2026 19:49
@espg espg merged commit a2697ec into main Jun 22, 2026
12 checks passed
espg pushed a commit that referenced this pull request Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dispatcher abstraction as shared infrastructure (#20, #12)

2 participants