Benchmark matrix: per-target AOI override (latitude studies + 88S stress target) by espg · Pull Request #123 · englacial/zagg

espg · 2026-06-29T11:21:47Z

Closes #121

What

The Lambda benchmark matrix is single-AOI by construction: tests/data/benchmark/targets.json has one top-level aoi/temporal/cmr, and the drift test (tests/test_benchmark_shardmap.py) rebuilds every shard map over that single AOI. This makes the top-level aoi/temporal/cmr defaults and lets a shard-map entry override any of them, falling back to the default when a key is absent — enabling runtime-vs-latitude studies and a deliberate 88S density stress target.

Runner (.github/scripts/run_benchmark.py) and CI (.github/workflows/lambda-benchmark.yml) are unchanged — the runner is AOI-agnostic (it dispatches the pinned shard_key), so only the drift test consumes the AOI. The change is confined to the tests/ fixture/test/loader layer.

Approach

Drift test resolves AOI/temporal/CMR via a small resolve_aoi_temporal_cmr(sm_meta) helper: sm_meta.get("aoi", MANIFEST["aoi"]) and likewise for temporal/cmr. An entry with no override resolves by identity to the top-level default, so existing NEON shard maps rebuild exactly as today.
New AOI fixture antarctic_88s.geojson (a box at the ±88° ICESat-2 turning latitude) ships ready to use.
README documents the override semantics, the Layout table entry, and an end-to-end recipe for adding a different-AOI target (override JSON + --polygon antarctic_88s.geojson build + the "high-latitude → OOM/timeout is the expected result" caveat tied to Order-10 single-shard processing OOMs at 2 GB (gain_bias) — 2 GB is marginal for order-10 #117 / Don't retry deterministic Lambda failures (OOM); expose max_retries and run the benchmark with max_retries=1 #119).

Phases

Phase 1 — override + fallback resolution in the drift test; non-network unit tests for the resolution (override-used / default-inherited-by-identity / partial-override); README override semantics.
Phase 2 — antarctic_88s.geojson AOI fixture + fixture-validity test; README Layout entry + end-to-end different-AOI recipe.
Phase 3 (review fold) — benchmark plot polish requested on the PR thread. Merged main to pick up Capture per-shard max memory and color benchmark markers by % of cap #125's memory color-coding, then fixed the chart in .github/scripts/plot_series.py: (1) the memory-coloured cost markers were hidden behind the runtime markers — runtime markers are now hollow circles (was filled squares) so the colour shows through, and the cost scatter is enlarged so the green→red memory signal reads clearly; (2) shared x-axis (sharex=True) so only the bottom-row panels carry the commit labels; (3) the colorbar is now horizontal, placed just below the figure title instead of a cramped vertical bar on the right. Re-rendered from the committed series.parquet (no benchmark re-run); figure posted to the thread for review.
- Round 3 (review fold of comments 4835363044 / 4835378682). Three more colorbar/layout tweaks: (1) added inter-column wspace padding so a panel's right-axis labels no longer overlap the next column's left-axis labels; (2) added a dual axis on the colorbar showing absolute MB alongside the existing % of cap (secondary_xaxis, MB = fraction * cap_mb); (3) clipped the memory normalisation to the observed min..max (no longer anchored at 0) so the green→red gradient spans only real data. While here, dropped the global sharex=True: it stamps one target's commit labels onto every panel when commit sets differ, so each panel now keeps its own x-axis and the upper rows' labels are hidden via labelbottom=False. Re-rendered from the committed series.parquet (no re-run, no network) and posted updated previews to the thread.
- Round 4 (review fold of comment 4835498668). Raised the cost (host) axes above the twinx() runtime axis and hid the host patch so the memory-coloured cost circles render on top of the runtime open-circle/dashed glyph; bumped the marker s=60 -> s=90.
- Round 5 (review fold of comment 4836608338). (1) Meaningful data-driven panel layout (_panel_layout): left column = rect grids, right column = HEALPix; each row pairs the families at matching aggregator + shard-size rank; rows ordered largest-shard-first. (2) Documented the layout + zeros-are-failures conventions in tests/data/benchmark/README.md. (3) Dropped failed-run zeros — a zero cost/runtime breaks the line (NaN, no dip to 0) and is flagged with a non-connected x marker on both the cost and runtime series.

How tested

uv run ruff check --select=E,F,W,I --ignore=E501 tests/... — clean.
uv run ruff format --check on touched files — clean.
uv run pytest tests/test_benchmark.py -q — 39 passed. Phase 3 added test_make_figure_circle_runtime_and_bottom_row_labels (round runtime markers + only the bottom-row panels carry x tick labels); round 3 added test_make_figure_uneven_commit_sets_label_per_panel, which uses differing commit sets across targets (one target missing an early merge) and asserts each bottom panel's tick labels match THAT panel's own commits in order — the misalignment class a uniform-commit fixture can't catch. Rounds 4-5 added test_make_figure_cost_scatter_drawn_above_runtime_line, test_panel_layout_rect_left_healpix_right_largest_on_top, and test_make_figure_drops_failed_run_zeros. The existing Capture per-shard max memory and color benchmark markers by % of cap #125 colour-array / colorbar tests still pass. New tests from earlier phases:
- test_override_resolution_falls_back_to_defaults — absent override returns the top-level objects by identity (the byte-identical guarantee for existing entries).
- test_override_resolution_uses_overrides / ..._partial_override — overrides win; omitted keys fall back.
- test_antarctic_88s_aoi_fixture_loads_near_turning_latitude — the AOI loads and sits in the high-density polar band.
test_targets_manifest_consistent still passes — no existing target entry changed, so the committed NEON shard maps stay byte-identical.

The drift test itself is network + ZAGG_BENCHMARK_DRIFT=1 gated (and needs spherely for rect maps), so it does not run in the unit suite; the resolution logic it now uses is covered by the non-network unit tests above.

Questions for review

Live 88S target deferred to a pure-data follow-on. Issue Benchmark matrix: support per-target AOI (latitude studies + a polar/88S stress target) #121 frames adding the actual latitude/polar targets as "pure data … once this lands." Building the real sm_healpix_o11_88s.json shard map requires a CMR fetch (cmr.earthdata.nasa.gov), which is outside this routine's outbound allowlist (proxy returns 403), so I cannot build/pin an honest map here. This PR ships the mechanism + the AOI geojson + the documented recipe, but does not commit a fabricated shard map or a live target entry referencing it (that would either bake fake CMR granules into the fixture or break test_targets_manifest_consistent). Two options: (1) merge the mechanism now and add the real 88S target as a follow-on data PR once a network-capable run builds + pins the map; (2) you build/pin sm_healpix_o11_88s.json and push it to this branch. I recommend (1).
Phase 3 merged main into this branch to bring in Capture per-shard max memory and color benchmark markers by % of cap #125's color-coding foundation (the chart fixes build on top of it). The merge was conflict-free and the PR base/state stays clean.

Generated by Claude Code

espg

🤖 from Claude (review)

Adversarial review of #123 against the #121 acceptance criteria and CLAUDE.md. Net: meets the acceptance bar. Verified locally on Python 3.12: the 4 new tests pass, full test_benchmark.py is green (23 passed, 3 skipped), the drift test still collects its 4 params, and ruff check / ruff format --check are clean on both touched test files.

Acceptance check:

Byte-identical fallback (the load-bearing guarantee): holds. resolve_aoi_temporal_cmr returns sm_meta.get(key, MANIFEST[key]), so a no-override entry gets the same object by identity. All four existing shard maps in targets.json (healpix_o11, healpix_o10, rect_3km, rect_6km) carry no aoi/temporal/cmr override, and test_override_resolution_falls_back_to_defaults asserts is identity — NEON resolves exactly as before.
Drift test uses resolved values: the old cmr, temporal, aoi = MANIFEST[...] top-level reads are replaced by resolve_aoi_temporal_cmr(sm_meta). Correct, and it fixed the previously-reversed unpack order as a side effect.
Runner + CI unchanged: confirmed via git diff origin/main...HEAD — only tests/data/benchmark/{README.md,antarctic_88s.geojson}, tests/test_benchmark.py, tests/test_benchmark_shardmap.py changed. run_benchmark.py, lambda-benchmark.yml, and targets.json are untouched.
Coverage: exercises override-used, default-inherited, and partial-override — good.
README ↔ code: the recipe's key names (aoi.file, temporal, cmr) and the catalog CLI flags (--config/--short-name/--version/--start-date/--end-date/--polygon/--backend/--output) all match the resolution code and src/zagg/catalog/__init__.py.
No CLAUDE.md violations: no deps added, no CI/infra touched, no new # noqa (the # noqa: E402 is pre-existing), both files well under the ~1000-line limit.

Findings are nits only (inline). Nothing blocks.

Generated by Claude Code

espg

🤖 from Claude (review)

Inline nits attached (4). All non-blocking — see the earlier top-level review for the full acceptance-criteria pass. Verified locally on Python 3.12: new tests green, test_benchmark.py 23 passed / 3 skipped, drift test collects, ruff clean. Byte-identical NEON fallback, resolved-AOI drift test, and untouched runner/CI/targets.json all confirmed.

Generated by Claude Code

espg · 2026-06-29T11:25:20Z

+        sm_meta.get("aoi", MANIFEST["aoi"]),
+        sm_meta.get("temporal", MANIFEST["temporal"]),
+        sm_meta.get("cmr", MANIFEST["cmr"]),
+    )


🤖 from Claude (review)

Nit (non-blocking): resolve_aoi_temporal_cmr returns the manifest dicts by reference, so a caller that mutated aoi/temporal/cmr would mutate the shared MANIFEST default for every other entry. Current callers (drift test, the three new unit tests) only read, so this is safe today — flagging only so the contract ("returns shared references, treat read-only") is explicit if a future caller mutates the result. No change needed now.

Generated by Claude Code

espg · 2026-06-29T11:25:25Z

+
+   ```json
+   "shardmaps": {
+     "healpix_o11_88s": { "path": "shardmaps/sm_healpix_o11_88s.json",


🤖 from Claude (review)

Nit (non-blocking): the override example shows only an aoi override and says "Omit temporal/cmr to inherit." Worth one clarifying line that a cmr override, when used, needs short_name/version/provider/footprint (the keys the drift test reads) but not backend — the drift test reads backend from the committed shard map's metadata.backend, not from cmr. Without this, a future author may copy the top-level cmr (which does carry backend) and assume cmr.backend is consulted. Matches the test fixture in test_override_resolution_uses_overrides, which correctly omits backend.

Generated by Claude Code

espg · 2026-06-29T11:25:31Z

+    # A shard map with no override inherits the top-level aoi/temporal/cmr
+    # *by identity* -- existing NEON entries resolve byte-identically to today.
+    import test_benchmark_shardmap as drift
+


🤖 from Claude (review)

Nit (non-blocking): the override-resolution unit tests live in test_benchmark.py but import test_benchmark_shardmap as drift to reach the helper. That cross-imports a slow+skipif(ZAGG_BENCHMARK_DRIFT) module from the fast suite; it works because resolve_aoi_temporal_cmr/MANIFEST are module-level (the pytestmark skip only gates the test functions, not import), and it relies on bench_metrics resolving via the sys.path.insert(.github/scripts) that test_benchmark_shardmap performs at import. Confirmed green locally on 3.12. Just noting the coupling — these three tests would arguably read more naturally co-located with the helper in test_benchmark_shardmap.py (where the AOI-resolution logic lives), but that's a placement preference, not a defect.

Generated by Claude Code

espg · 2026-06-29T11:25:36Z

+    parts = load_polygon(str(BENCH / "antarctic_88s.geojson"))
+    minx, miny, maxx, maxy = polygon_to_bbox(parts)
+    assert miny <= -87.5 and maxy <= -87.0  # well into the high-density polar band
+    assert -90.0 <= miny < maxy <= -80.0


🤖 from Claude (review)

Confirmed: fixture loads via the real load_polygon and the bbox assertions hold (miny=-88.0, maxy=-87.7). Honest-flagging is good — the docstring states the AOI ships before its built shard map / live target, and targets.json correctly has no healpix_o11_88s entry yet, so the drift test still parametrizes only the four existing NEON maps. One forward-looking note: when the 88°S shard map is added with its aoi override, test_targets_manifest_consistent will re-derive its densest shard from the committed map (AOI-independent, reads metadata), so the override won't perturb that test — but the drift test will then rebuild over the 88°S box, which (per the README) is expected to be a heavy/failing stress target. No action this PR.

Generated by Claude Code

espg · 2026-06-29T16:11:32Z

There's still some warts in our benchmarking, and I'd like to address them and fix them in this PR before it merges:

The results in https://github.com/englacial/zagg/blob/main/docs/deployment/benchmark.md (which reference https://raw.githubusercontent.com/englacial/zagg/benchmarks/site/cost_per_shard.png and https://raw.githubusercontent.com/englacial/zagg/benchmarks/site/cost_per_100km2.png ), don't actually use the color coding that Capture per-shard max memory and color benchmark markers by % of cap #125 implemented. Looking at the comment posted benchmarks in Capture per-shard max memory and color benchmark markers by % of cap #125 , it seems to be reading the memory, but not plotting it correctly. Change the marker from squares to circles, and make sure that colors plot correctly.
It's hard to read cost_per_shard.png and cost_per_100km2.png because of visual clutter. One place that's easy to cleanup-- the x-axis is sharded by all subplots in both figures. Please make the x-axis shared; i.e., only the bottom row should be labeled with the git commit label, the rows above don't need it repeated.
The color bar is similarly cluttered and hard to read; change it from being vertical to horizontal, and move it so it's placed just below the figure title.

You can test these changes for me to review so that I can confirm they're working as expected -- with out rerunning any benchmarks, since we can just read the parquet data and re-render the plots. Do that, and post an updated figure for me to review. You can do this by (ideally) posting the figure as a comment in this thread-- or as a (worse) fallback, by adding the png renders as git committed files (which we'll have to git rm from this branch later before merging).

github-actions · 2026-06-29T16:48:21Z

Lambda benchmark — `75d28ed`

target	obs	runtime (s)	cost/shard	cost/100 km²	% timeout	mem (MB)	% cap
gain_bias_healpix_o11	0	201.1	$0.00536	$0.05291	22%	n/a	n/a
tdigest_healpix_o11	200,676	192.3	$0.00513	$0.05060	21%	n/a	n/a
gain_bias_healpix_o10	0	280.9	$0.00749	$0.01848	31%	n/a	n/a
tdigest_healpix_o10	939,217	241.8	$0.00645	$0.01591	27%	n/a	n/a
gain_bias_rect_3km	0	0.0	$0.00000	$0.00000	n/a	n/a	n/a
tdigest_rect_3km	207,911	189.7	$0.00506	$0.05620	21%	n/a	n/a
gain_bias_rect_6km	0	241.4	$0.00644	$0.01788	27%	n/a	n/a
tdigest_rect_6km	0	0.0	$0.00000	$0.00000	n/a	n/a	n/a

arm64 · 2.0 GB · $1.33334e-05/GB-s · one shard/target · pre-merge runs are not retained.

… colorbar)

espg · 2026-06-29T17:30:29Z

🤖 from Claude

Folded the three benchmark-plot fixes from the review comment (4834632174) into this PR. All re-rendered from the committed series.parquet on the benchmarks branch — no benchmark was re-run, no network.

What changed in .github/scripts/plot_series.py:

Color-coding (Capture per-shard max memory and color benchmark markers by % of cap #125) now plots correctly + squares → circles. The memory-coloured cost markers were being completely hidden behind the runtime markers (cost and runtime track each other, so the filled orange squares sat right on top of the colour). Runtime markers are now hollow circles so the green→red memory colour shows through, and the cost scatter is enlarged so the signal reads at a glance. You can now see, e.g., gain_bias_rect_3km at cbd4828 sitting near the red end (~93% of cap) and the NaN-memory points rendering uncoloured.
Shared x-axis (sharex=True) — only the bottom-row panels carry the commit labels; the rows above no longer repeat them.
Colorbar is now horizontal, placed just below the figure title instead of the cramped vertical bar on the right.

Preview renders (committed temporarily under docs/deployment/_review_renders/ — to be git rm'd before merge per your note):

Notes:

To pick up Capture per-shard max memory and color benchmark markers by % of cap #125's color foundation (the fixes build on it), I merged main into this branch — conflict-free; PR stays mergeable_state: clean.
Tests: pytest tests/test_benchmark.py -q → 34 passed, including a new test_make_figure_shared_x_and_circle_runtime (asserts round runtime markers + that only the bottom-row panels carry x labels). The existing Capture per-shard max memory and color benchmark markers by % of cap #125 colour-array / colorbar tests still pass. ruff check/ruff format --check clean on touched files.
One residual, non-blocking and not part of the three asks: the right-axis runtime tick labels still abut the next column's left-axis labels in the 2-wide grid. Left as-is for now; happy to widen the inter-column gap in a follow-up if you'd like it tightened too.

The live 88S target is still deferred (needs a CMR fetch this routine can't reach) — see the "Questions for review" note in the PR description.

Generated by Claude Code

espg

🤖 from Claude (review)

Adversarial review of the latest phase (commits 59f35fb benchmark plot fixes + b7c7872 review renders), scoped to plot_series.py + tests/test_benchmark.py.

One real correctness bug in the shared-x labelling, plus a test that can't catch it. Details inline; summarised here:

Per-target x labels are wrong whenever targets don't all share the same commit set (correctness). make_figure now builds one global label list — labels = [str(c)[:7] for c in hist.drop_duplicates("commit").sort_values("timestamp")["commit"]] — and applies that same list to every bottom panel via bottom.set_xticklabels(labels), while each panel's data is plotted at xs = range(len(sub)) (its own per-target merge count) and ticks are fixed at x = range(nx) with nx = max len over targets. The premise in the comment ("same merge points per target") does not hold in general: targets get added/removed over time, and the densest-shard run for a target can be skipped on a given merge, so per-target commit sets diverge. Two concrete failure modes, both verified by running the pushed code:
- Mislabelled ticks. With t1 = merges {c1, c2} and t2 = {c0, c1, c2} (t1 missing the earliest merge), both panels render labelled c0, c1, c2. t1's leftmost point is its c1 but is labelled c0 — every t1 label is shifted by one and wrong. A reader comparing a regression against the wrong commit is the exact failure this chart exists to prevent.
- Silent truncation, not the expected crash. With four targets carrying six distinct commits but only two each (nx = 2, len(labels) = 6), set_xticklabels does not raise (sharex + a later autoscale replaces the FixedLocator, so matplotlib clamps the label list silently); all bottom panels end up showing the last-applied 2 labels. So the length mismatch fails quietly rather than loudly.
The pre-fold code computed labels per-target from sub["commit"] — that was correct. The fix is to keep labels per-target (label each bottom panel from its own sub["commit"]), or to reindex every target onto a shared, ordered union of commits so position 0..k means the same commit in every panel. As written, "share one label list across panels" is only valid when all targets have identical commit sets.
The new test only exercises the uniform-data case, so it can't catch (1) (test strength). test_make_figure_shared_x_and_circle_runtime builds rows = [_rec_row(f"c{i}", t) for i in range(2) for t in ("t1","t2","t3","t4")] — every target has the identical two commits {c0, c1}, which is exactly the one configuration where global labels happen to be correct. Add a case where targets have differing/uneven commit sets (e.g. one target missing an early merge) and assert each bottom panel's tick labels match that panel's commits, in order. That assertion fails on the current code.
The runtime-marker and top-row-empty assertions themselves look sound. markers collects every line's marker across all axes and asserts "o" in markers and "s" not in markers — the cost connecting line uses "-" (marker None), so "o" uniquely identifies the runtime line; robust. The top-vs-bottom split keys on get_subplotspec().rowspan.start and checks get_xticklabels() text is empty on the top row — that is a genuine assertion (it would catch labels leaking onto the top row), not a false pass, given matplotlib clears shared inner-tick labels.
Review-render PNGs under docs/deployment/_review_renders/ (commit b7c7872) are binary artifacts that CLAUDE.md §3 says not to commit. Noting this is the authorized fallback ("remove before merge"); flagging only so it isn't forgotten — git rm them before this PR goes ready/merges.

Everything else in the plot diff (hollow-circle runtime markers, s=60 scatter enlargement, horizontal location="top" colorbar, suptitle moved above the colorbar) matches the stated intent and reads cleanly. ruff passes. No other findings.

Generated by Claude Code

espg

🤖 from Claude (review)

Phase-3 plot-fix review (only 59f35fb + c7d7080; phases 1–2 already reviewed). I reproduced make_figure's logic with matplotlib on a 2×2 grid and confirmed all three asks land:

Color show-through (#125): runtime markers are now hollow (marker="o", markerfacecolor="none") and the cost scatter is enlarged (s=60); the green→red memory color reads through the runtime circle. ✓
Shared x-axis: sharex=True plus the per-column "last populated panel" labeling produces empty top-row tick labels and populated bottom-row labels — verified empirically. ✓
Horizontal colorbar below title: fig.colorbar(..., orientation="horizontal", location="top", ...) returns orientation == "horizontal" with no error and sits above the panels / below suptitle. ✓

No correctness bugs found. The new test is meaningful, not a tautology — "o" in markers and "s" not in markers would fail on the old "s--", and the top-empty / bottom-populated label assertions would fail if labels were applied to all panels or none. Stays well under the ~1000-line limit; no new deps.

Two non-blocking nits inline plus the expected pre-merge cleanup of the authorized docs/deployment/_review_renders/*.png preview images (250 KB each, confirmed committed). None of these block readiness; fold at your discretion.

Generated by Claude Code

espg

🤖 from Claude (review)

Inline nits for the phase-3 plot diff (both non-blocking, fold at your discretion). Two more standing notes not worth an inline anchor:

Colorbar args are redundant (cosmetic): orientation="horizontal" together with location="top" — location="top" already implies a horizontal bar above the axes, so orientation is redundant. Harmless (they agree), but you could drop orientation= for terseness.
Pre-merge cleanup: the authorized preview renders docs/deployment/_review_renders/cost_per_shard.png and cost_per_100km2.png (~250 KB each) are committed on this branch; git rm them before marking ready/merging, as agreed on the thread.

Generated by Claude Code

espg · 2026-06-29T17:37:12Z

+            sub["runtime_s"],
+            linestyle="--",
+            marker="o",
+            markerfacecolor="none",


🤖 from Claude (review)

Nit (not a bug): I traced this per-column "last populated panel" logic for the odd-grid case (e.g. 3 targets → 2×2 with one blank panel). For column 1 the last populated index is 1, which is the top-row panel — and that's correct, since the panel directly beneath it is blank, so it should carry the labels. The new test only exercises the even (2×2, 4-target) grid, so this trickier odd-grid branch is untested. Consider a small parametrization (e.g. 3 targets) to lock in that the bottom-of-column panel gets labels even when it lands on the top row. Non-blocking.

Generated by Claude Code

espg · 2026-06-29T17:37:17Z

+    assert plot_series.make_figure(
+        update_series.load_series(series), "cost_per_shard_usd", "c", out
+    )
+    fig = captured["fig"]


🤖 from Claude (review)

Nit on the fixture, not this assertion: _rec_row derives timestamp as ...0{len(commit) % 10}Z, so the two commits here (c0, c1 — both length 2) get the same timestamp ...02Z. The labels are now per-target from sub["commit"] (good — c7d7080 removed the cross-target timestamp ordering), so the tie is harmless today. But if a future assertion ever depends on merge-point ordering, tied timestamps would make this fixture silently unreliable. Cheap fix: key the timestamp off the loop index i instead of len(commit). Non-blocking.

Generated by Claude Code

espg · 2026-06-29T17:40:39Z

this is getting better, but still needs some tweaking:

The subfigure columns are two close together-- please separate them with some padding so that the y-axis labels don't overlap
Add a dual axis on the colorbar that also shows megabytes of memory
We don't need the full range on the memory colorbar -- clip it to the observed range of memory (i.e., don't start it at zero)

espg · 2026-06-29T17:42:21Z

Also, provide updated png previews for me to review as a comment please

espg · 2026-06-29T17:53:50Z

🤖 from Claude

Folded the round-3 plot tweaks from the two review comments (4835363044, 4835378682) into .github/scripts/plot_series.py, re-rendered both figures from the committed series.parquet (no benchmark re-run, no network), and refreshed the previews under docs/deployment/_review_renders/.

What changed:

(1) Column padding. Added inter-column wspace to the subplot grid so a panel's right-axis (runtime) tick labels no longer overlap the next column's left-axis (cost) labels — the residual I'd flagged last round.

(2) Dual-axis colorbar with MB. The bar still encodes peak memory as % of cap on top; a second axis below it now reads the same scale in absolute MB (secondary_xaxis with MB = fraction × cap_mb, cap read from memory_gb in the parquet = 2 GB). So a point reads, e.g., ~93% ≈ ~1900 MB at a glance.

(3) Clipped to observed range. The memory normalisation no longer anchors at 0; vmin/vmax are now the observed min/max fraction (~52% .. ~93% in this series), so the green→red gradient spans only real data.

While in there I also fixed a latent bug the new test surfaced: the global sharex=True collapses all panels onto one x-axis, so when targets have differing commit sets it stamps one target's commit labels onto every panel. Each panel now keeps its own x-axis and the upper rows' labels are hidden via labelbottom=False — preserving the "only the bottom row is labelled" look without the misalignment.

Test added: test_make_figure_uneven_commit_sets_label_per_panel builds a fixture where one target is missing an early merge (uneven commit sets across targets) and asserts each bottom panel's tick labels match THAT panel's own commits in order — the class of bug a uniform-commit fixture can't catch. pytest tests/test_benchmark.py -q → 35 passed; ruff check/ruff format --check clean on touched files.

Updated previews (committed temporarily under docs/deployment/_review_renders/, still to be git rm'd before merge):

Generated by Claude Code

espg

🤖 from Claude (review)

Round 3 (benchmark-plot tweaks), reviewing commit 7332bf9 against parent c7d7080. Verdict: functionally sound and close to mergeable — the matplotlib changes are correct — but there are a few diff-scoped follow-ups worth a fold before ready-for-review.

What's correct (verified, no action):

Dual-axis colorbar. secondary_xaxis(-1.0, functions=(f*cap_mb, mb/cap_mb)) on a location="top" bar is right: the bar's x data coords are fraction-of-cap, so the forward/inverse correctly align MB with %. The set_major_formatter(lambda v, _pos: ...) signature is valid, and the secondary axis sits below the bar (no overlap with the top ticks/label).
Clipped normalisation. Spanning observed non-None min/max (don't anchor at 0), [0,1] fallback when no memory, and the ±0.01 single-value spread all handle the cases correctly; memory_fractions returns None (filtered) rather than NaN, and the scatter still uses plotnonfinite=True for missing markers. Matches the "don't start at zero; span observed range" ask.
sharex removal + per-column labelling. The last = max(i for i in range(n) if i%ncols==col) + tick_params(labelbottom=False) logic is correct with no off-by-one; default=None is unreachable given ncols=min(2,n). This fixes the real latent shared-x stamping bug.
wspace=0.45 is reasonable for the 2-wide twin-axis collision and composes fine with bbox_inches="tight".
plot_series.py is well under the ~1000-line limit.

Diff-scoped findings (inline):

(1) test_make_figure_shared_x_and_circle_runtime — name and docstring claim "the x-axis is shared," but this same commit removes sharex=True. Rename/reword so the test's stated intent matches the manual-hide implementation it now guards.
(2) test_make_figure_uneven_commit_sets_label_per_panel — the n=2 fixture is a single-row 1x2 grid, so the row-hiding loop (the actual fix, per make_figure's "hide the upper rows" comment) never fires and it's unclear the test would fail on pre-fix code. Bump to 4 uneven targets → 2x2 to truly exercise the stacked-row misalignment. Plus a legibility nit on the len(commit) % 10 ordering trick.
(3) The headline MB twin axis has no test asserting it's populated/correct — a one-line check against cbar.ax.child_axes would lock it. Non-blocking.

Out-of-scope / convention flag (no inline — binary files):

This commit adds two committed PNGs under docs/deployment/_review_renders/ (cost_per_100km2.png, cost_per_shard.png). The _review_renders path reads like throwaway review aids, i.e. generated artifacts — CLAUDE.md §3 says don't commit generated artifacts. If they're intended permanent doc assets that's fine, but please confirm; otherwise drop them (and add the dir to .gitignore). Raising for the author's decision rather than treating it as a required change.

None of these block the chart rendering; (1) and (2) are the substantive folds (test intent vs. implementation). With those tightened — and the artifact question resolved — this round is good to go.

Generated by Claude Code

espg · 2026-06-29T17:56:05Z

+def test_make_figure_shared_x_and_circle_runtime(tmp_path, monkeypatch):
+    # The benchmark-plot polish (PR #123 fold of #125's chart): runtime markers
+    # are circles (not squares) so the memory-coloured cost marker shows through,
+    # and the x-axis is shared so only the bottom-row panels carry commit labels.


🤖 from Claude (review)

Name/comment is now stale relative to this same commit. This round's make_figure change removes sharex=True from plt.subplots — the "bottom row only is labelled" effect is now produced by the manual tick_params(labelbottom=False) pass, not by a shared x-axis. But this test is named ..._shared_x_... and its docstring asserts "the x-axis is shared so only the bottom-row panels carry commit labels." That's the property the code deliberately abandoned (precisely because shared-x stamped one target's labels onto all panels). Suggest renaming to something like test_make_figure_circle_runtime_and_bottom_row_labels and rewording the comment to describe the manual hide, so the test's stated intent matches the implementation it guards.

Generated by Claude Code

espg · 2026-06-29T17:56:18Z

+def test_make_figure_uneven_commit_sets_label_per_panel(tmp_path, monkeypatch):
+    # Regression for the per-target x-label misalignment class (fixed in c7d7080):
+    # when targets share x but have *different* commit sets (one missing an early
+    # merge), each bottom panel must carry ITS OWN commits in order -- not another
+    # panel's labels. A uniform-commit fixture can't catch this; an uneven one can.
+    pytest.importorskip("matplotlib")
+    import matplotlib
+
+    matplotlib.use("Agg")
+    import matplotlib.pyplot as plt
+    import plot_series
+
+    captured = {}
+    monkeypatch.setattr(plt, "close", lambda fig: captured.setdefault("fig", fig))
+
+    # Two targets in one column-pair. ``_rec_row`` derives the timestamp from the
+    # commit-name length, so name lengths set the merge order. ``early`` has 3
+    # points; ``late`` skipped the first merge and has only 2.
+    early = [_rec_row(c, "t_early") for c in ("a", "bb", "ccc")]
+    late = [_rec_row(c, "t_late") for c in ("dd", "eee")]  # missing the len-1 merge
+    series = tmp_path / "series.parquet"
+    update_series.save_series(update_series.records_to_frame(early + late), series)
+    out = tmp_path / "fig.png"
+    assert plot_series.make_figure(
+        update_series.load_series(series), "cost_per_shard_usd", "c", out
+    )
+    fig = captured["fig"]
+
+    panels = {ax.get_title(): ax for ax in fig.axes if ax.get_title()}
+    # n=2 -> single row, so both panels are "bottom" and carry their own labels.


🤖 from Claude (review)

The regression target here is the per-target x-label misalignment across the grid, and make_figure's own comment frames it as "hide the upper rows' commit labels by hand." But this fixture has only 2 targets → nrows=1, a single-row 1x2 grid, and the test even acknowledges it ("n=2 -> single row, so both panels are 'bottom'"). In a single row there are no stacked rows, so the labelbottom=False hiding loop never fires for any panel — every panel is its column's last. That means this test exercises the differing-commit-sets path but does not exercise the row-hiding logic that is the actual fix, and it's not obvious it would fail on the pre-fix shared-x code (whose cross-panel stamping is a multi-row phenomenon). Suggest bumping to 4 uneven targets (e.g. t_early/t_late/t_a/t_b with differing commit sets) so it lands a 2x2 grid: then assert the top-row panels are blank and the bottom-row panels each carry their own commits in order. That makes it a true regression for the stacked-row misalignment and removes the reliance on the single-row coincidence.

Secondary nit: ordering via len(commit) % 10 (name length → timestamp) is clever but obscure — a reader has to reverse-engineer why "a"/"bb"/"ccc" sort that way. Passing explicit timestamps (or an explicit ts= kwarg to _rec_row) would make the merge order legible at the call site.

Generated by Claude Code

espg · 2026-06-29T17:56:34Z

+    cbar.ax.xaxis.set_major_formatter(lambda v, _pos: f"{v:.0%}")
+    # Twin axis (below the bar) reading the same scale in absolute MB: the bar's
+    # data coords are fraction-of-cap, so MB = fraction * cap_mb (and back).
+    mb_ax = cbar.ax.secondary_xaxis(-1.0, functions=(lambda f: f * cap_mb, lambda mb: mb / cap_mb))


🤖 from Claude (review)

The dual-axis math is correct — the colorbar's x data coords are fraction-of-cap, so the forward/inverse (f*cap_mb, mb/cap_mb) place MB ticks aligned with the % scale, and the (v, _pos) formatter signature is right. No change required there.

One robustness note, not a blocker: there is no test asserting the MB axis is actually populated/correct (e.g. that mb_ax exists and its forward maps vmax -> vmax*cap_mb). Given this is the headline feature of the round and matplotlib's secondary_xaxis tick placement is easy to silently break (empty axis if the parent has no ticks, wrong side, etc.), a one-line assertion in the existing colorbar test — grab the secondary axis from cbar.ax.child_axes and check f(vmax) == pytest.approx(vmax * cap_mb) — would lock the behavior. Worth considering since the visual artifacts (docs/deployment/_review_renders/*.png) are the only current "proof" it renders.

Generated by Claude Code

espg · 2026-06-29T17:59:05Z

closer... we need the circles plotted on top of the line; right now the circle symbol z-order puts them behind the open circle and dotted line glyph (also, bump them up one radius size please).

…ar MB axis

espg · 2026-06-29T18:01:33Z

🤖 from Claude

Round 3 done (column padding, colorbar MB dual-axis, clipped memory range, plus an uneven-commit-set regression test); fresh-context self-review folded in 1aa343c (renamed the now-stale sharex test, bumped the uneven case to a 2x2 grid so the row-hiding path fires, and added a colorbar MB twin-axis assertion). Gates green on touched files (pytest tests/test_benchmark.py -q = 36 passed, ruff clean). Awaiting your review of the new previews above.

Generated by Claude Code

espg · 2026-06-29T18:07:44Z

🤖 from Claude

Folded the round-4 tweaks from your comment (4835498668) into .github/scripts/plot_series.py, re-rendered both figures from the committed series.parquet (no benchmark re-run, no network), and refreshed the previews under docs/deployment/_review_renders/.

(1) Cost circles now plot on top of the runtime line. The runtime line lives on a twinx() axis, which matplotlib draws after (above) the host axis regardless of per-artist zorder — that's why the open-circle/dashed glyph was sitting over the coloured cost circles. Fixed by raising the host (cost) axes above the twin (ax.set_zorder(rt.get_zorder() + 1)) and dropping the host's opaque patch (ax.patch.set_visible(False)) so the runtime line still shows through the gaps. The runtime line's own zorder is set to 1, below the cost scatter.

(2) Marker bumped up one size — s=60 → s=90.

Test added: test_make_figure_cost_scatter_drawn_above_runtime_line asserts the cost panel's axes z-order is above the runtime twin's and that the host patch is hidden — a stable, non-brittle check of the layering. pytest tests/test_benchmark.py -q → 37 passed; ruff check/ruff format --check clean on touched files.

Updated previews (committed temporarily under docs/deployment/_review_renders/, still to be git rm'd before merge):

Generated by Claude Code

espg

🤖 from Claude (review)

Round-4 review of db317b0 (parent 1aa343c), scoped to the cost-circle-over-runtime-line layering fix + marker bump. Verdict: mergeable as-is. Both of @espg's asks land, the matplotlib idiom is the correct/standard one, and the new test is a real guard (fails on pre-fix code). Only non-blocking nits inline.

What's correct (verified):

Layering fix. ax.set_zorder(rt.get_zorder() + 1) + ax.patch.set_visible(False) is exactly the documented fix for the twinx "twin draws on top" gotcha — across axes, only axes-level zorder reorders; per-artist zorder can't. Host→1, twin→0, host patch hidden so it doesn't occlude the runtime line beneath. No white-background loss (figure patch still backs the panel), no gridline double-render (none enabled), no colorbar / bbox_inches="tight" interaction. The (2) cost circles now sit over the runtime open-circle/dashed glyph.
Marker bump. s=60 → s=90 is a sensible one-step bump; with bbox_inches="tight" there's no edge-clipping concern.
Test. test_make_figure_cost_scatter_drawn_above_runtime_line asserts host-zorder > twin-zorder and host patch hidden — both fail on the default-zorder pre-fix code, and the bbox-equality twin match is robust (twins share host position; colorbar bounds differ).
plot_series.py ~210 lines, tests/test_benchmark.py ~640 — both well under the ~1000-line limit. ruff-clean style, matches surrounding code.

Non-blocking nits (inline):

(1) The artist-level zorder=1 on the runtime rt.plot(...) is inert for the cost-vs-runtime layering — that's decided by the axes-level zorder set below. Harmless but misleading-as-written; lean toward dropping it for terseness, or comment it.
(2) The test asserts the mechanism (zorder + hidden patch) rather than rendered pixels — correct call (pixels would be brittle); optionally also assert the scatter's s == 90 to lock the marker-bump ask too.

Nothing here blocks ready-for-review.

Generated by Claude Code

espg · 2026-06-29T18:09:42Z

+        )
        rt.set_ylabel("runtime (s)", color="C1")
        rt.tick_params(axis="y", labelcolor="C1")
+        ax.set_zorder(rt.get_zorder() + 1)


🤖 from Claude (review)

Layering fix is correct. ax.set_zorder(rt.get_zorder() + 1) + ax.patch.set_visible(False) is exactly the documented matplotlib idiom for the twinx "twin paints on top" gotcha — the twin is added to the figure after the host, so it draws above regardless of per-artist zorder, and the only lever that reorders across axes is the axes-level zorder. Raising the host above the twin then drops the host's opaque patch so the (now-on-top) host axes doesn't occlude the runtime line beneath it. Default axes zorder is 0, so this sets host=1, twin=0 — works.

No white-background regression: only the axes patch is hidden, not the figure patch, so savefig's default white figure facecolor still backs the panel. No gridlines are enabled (ax.grid() is never called), so nothing double-renders, and there's no interaction with the colorbar (separate axes) or bbox_inches="tight". Verified the (2) cost circles now sit over the runtime glyph.

Generated by Claude Code

espg · 2026-06-29T18:09:55Z

+            marker="o",
+            markerfacecolor="none",
+            color="C1",
+            zorder=1,


🤖 from Claude (review)

Nit (non-blocking): this artist-level zorder=1 on the runtime line does nothing for the cost-vs-runtime layering. Per-artist zorder only orders artists within the same axes; the runtime line lives on rt while the cost scatter lives on ax, so what decides which is on top is the axes-level zorder you set two lines below — not this. It's harmless (it agrees with the cost line's zorder=1 on the host, and doesn't fight the axes-level fix), but as written it reads like it's contributing to the fix when it isn't.

Two options, either fine: (1) drop it (the cost scatter's zorder=2 over the cost line's zorder=1 on the host axis is what the run actually relies on, and that's unaffected); or (2) keep it but add a one-line comment that it only orders the runtime line vs. anything else on the twin, so a future reader doesn't mistake it for the cross-axes lever. Leaning toward (1) for terseness (CLAUDE.md §4).

Generated by Claude Code

espg · 2026-06-29T18:10:08Z

+
+def test_make_figure_colorbar_has_mb_twin_axis(tmp_path, monkeypatch):
+    # The colorbar carries a second axis reading the same scale in absolute MB
+    # (issue #120 polish): MB = fraction-of-cap * cap_mb. cap_mb = 2 GB here, so a


🤖 from Claude (review)

The new test is meaningful and would fail on pre-fix code, so it genuinely guards the fix:

panel.get_zorder() > twin.get_zorder(): pre-fix both axes default to zorder 0, so 0 > 0 is False → fails. Post-fix host=1, twin=0 → passes. Real assertion.

panel.patch.get_visible() is False: axes patch defaults to visible → pre-fix this is True → fails. Real assertion.

Matching the twin by ax.bbox.bounds == panel.bbox.bounds is robust here: twinx() shares the host's position so their bounds are equal, while the colorbar axes has different bounds; with a single panel the comprehension yields exactly one match (the (twin,) unpack would loudly fail if not).

One framing nit (non-blocking): panel.get_zorder() > twin.get_zorder() plus patch.get_visible() is False is asserting the mechanism (the documented idiom), not directly the rendered outcome (circles visually above the line). That's acceptable and the right call — a pixel assertion on the rendered PNG would be brittle (and over-engineered), and these two properties are precisely the two-part idiom that produces the outcome. If you want one extra cheap belt-and-suspenders signal without touching pixels, you could also assert the cost PathCollection's s array is 90 here so the "bump the marker one size" ask (2) is locked too — but that's optional, fold at your discretion.

Generated by Claude Code

espg · 2026-06-29T18:11:57Z

🤖 from Claude

Round 4 done: cost circles now render on top of the runtime line (host-axes z-order raised above the twinx() runtime axis + host patch hidden) and the marker is bumped one size (s=60 → s=90); refreshed previews are in the comment above. Fresh-context self-review (verdict: mergeable as-is) folded in f1ec81b — dropped an inert artist-level zorder on the runtime line and added a s == 90 assertion to the new layering test. Gates green on touched files (pytest tests/test_benchmark.py -q = 37 passed, ruff clean). Awaiting your review of the new previews.

Generated by Claude Code

espg · 2026-06-29T20:10:56Z

A few more small changes:

The rows and columns should be meaningful. The left column should be all rect plots, the right column should be all healpix plots. Similar resolutions at the same agg function (i.e., tdigest) should be on the same row. The rows should also be sorted so that the largest shards are at the top of the plots, and then decend going down the rows.
Everything above should be added to benchmarks docs (tests/data/benchmark/README.md), so that future plots follow the same conventions as they are added or rearranged.
Can we drop the zeros? Zero cost / time is indicative of a failed run, and shouldn't be plotted... we should only connect line segments that are non-zero. If we need a marker to make the x-axis work, we could add an 'X' marker or similar that isn't line connected.

…nventions

espg · 2026-06-29T20:24:31Z

🤖 from Claude

Folded the round-5 changes from your comment (4836608338) into .github/scripts/plot_series.py + tests/data/benchmark/README.md, re-rendered both figures from the committed series.parquet (no benchmark re-run, no network), and refreshed the previews under docs/deployment/_review_renders/.

(1) Meaningful row/column layout. New _panel_layout places panels by a data-driven rule (nothing hardcoded): left column = rect grids, right column = HEALPix; a row pairs the two families at the same aggregator and the same shard-size rank within their family, so rect_6km lines up with healpix_o10 and rect_3km with healpix_o11, with tdigest/gain_bias rows aligned. Rows are ordered largest-shard-first (size ranked from shard_area_km2), descending down the grid. The resulting target→(row,col) for today's eight:

row 0 (largest):  gain_bias_rect_6km  | gain_bias_healpix_o10
row 1:            tdigest_rect_6km    | tdigest_healpix_o10
row 2:            gain_bias_rect_3km  | gain_bias_healpix_o11
row 3 (smallest): tdigest_rect_3km    | tdigest_healpix_o11

(2) Documented the conventions in tests/data/benchmark/README.md (new "Plot layout" section): column = grid family, row = agg + resolution-rank aligned, rows largest-shard-first, and the zeros-are-failures rule from (3) — so future targets follow the same layout as they're added/rearranged.

(3) Dropped failed-run zeros. A zero cost/runtime is a crashed shard, not a measurement: the connecting line now breaks at that merge (NaN — it never dips to 0), for both the cost line+scatter and the runtime line. To keep the x-axis/commit alignment, a failed run is flagged with a non-line-connected grey x marker (distinct from the cost circle / runtime open-circle). You can see this on the cbd4828 column for the four targets whose latest run failed.

Tests: extended the layout/label tests to the new placement and added test_panel_layout_rect_left_healpix_right_largest_on_top (asserts the exact row/col grid) and test_make_figure_drops_failed_run_zeros (the line is NaN at the zero, never 0, and an x marker is drawn). pytest tests/test_benchmark.py -q → 39 passed; ruff check/ruff format --check clean on touched files.

Updated previews (committed temporarily under docs/deployment/_review_renders/, still to be git rm'd before merge):

One thing I had to decide (correct me if you'd prefer otherwise): for the vertical ordering of the four rows I made shard size the primary sort and aggregator secondary, so the grid is strictly largest→smallest top-to-bottom with the two aggregators of a given size class adjacent (gain_bias above tdigest). The alternative reading — group by aggregator first, size second — would interleave differently; happy to flip if you intended agg-major.

Generated by Claude Code

espg

🤖 from Claude (review)

Round-5 review (commit 97dbb87, parent f1ec81b), scoped to the meaningful-layout / drop-zeros work in plot_series.py + tests/data/benchmark/README.md requested in comment 4836608338. Verdict: all three asks land; mergeable as-is, with a couple of judgment-call polish items worth a quick fold.

What's correct (verified against the full file on the branch):

(1) Layout. _panel_layout is data-driven, not hardcoded: col = 0 if grid.startswith("rect") else 1 (rect left / healpix right), shard-size rank computed per family so the two families align row-for-row, rows ordered largest-shard-first via sorted({(rank, agg)}). The row order is size-major, agg-minor (the ambiguity the author flagged) — a defensible reading of "largest shards at top, descending," and code/test/README agree. test_panel_layout_... pins the exact 4x2 grid and would fail on a column swap or row mis-order.
(2) README. New "Plot layout" section documents columns=grid-family, rows=agg+resolution-rank aligned + largest-first with the agg tie-break, and zeros-are-failures — matches the implemented rule exactly, tie-break included.
(3) Zeros. Both cost and runtime lines NaN out zeros (v != 0), so neither line dips to/connects 0; cost failures get a non-connected grey x. test_make_figure_drops_failed_run_zeros proves the line is NaN at the zero, no connected 0 survives, and an x marker exists — fails on pre-fix code.

Diff-scoped judgment calls (fold at discretion, none block):

Cost fail x at data-y=0 re-expands the axis floor back to 0 — the line no longer dips, but the marker pins ymin≈0, so the panel still reads against a 0 baseline. Consider drawing it in ax.get_xaxis_transform() axes-fraction y, or document the 0-anchor.
No x failure marker on the runtime axis — partial against "apply to both," though the cost x already flags that merge for both series. Either document that or mirror the marker.
Latent collision: two same-aggregator targets in one column with identical shard_area_km2 but different resolutions map to the same (rank, agg) cell and one is silently overwritten. Can't fire on today's data; rank by (area, grid_size) to harden the data-driven claim.

Nits: fail-marker test matches by len(get_offsets())==1 (fragile only if a panel ever has exactly 1 commit); NaN cost is a silent line-break with no x (intended distinction, just confirming); no test for the all-rect ncols=1 path; README could note the x is cost-axis-only.

Inline details on each. I only review — folding is the author's call.

Generated by Claude Code

espg · 2026-06-29T20:27:31Z

+            if d["col"] == col:
+                d["rank"] = rank[d["area"]]
+    # One row per (size-rank, aggregator); largest shard at the top.
+    row_keys = sorted({(d["rank"], d["agg"]) for d in rows})


🤖 from Claude (review)

Row ordering is size-major, aggregator-minor: sorted({(rank, agg)}) sorts on rank first (0 = largest, ascending = descending size, good) then breaks ties alphabetically on the aggregator. So with two sizes x two aggs the rows are gain_bias@large, tdigest@large, gain_bias@small, tdigest@small — exactly what the test pins.

This is a defensible reading of @espg's "largest shards at top, descending" — it keeps the primary sort on size and only uses agg as a tie-break, which is what "rows sorted largest-shard-first" most naturally means. The alternative (agg-major: all gain_bias rows, then all tdigest) would not be largest-first across the whole grid, so I think the chosen reading is correct. Just confirming this matches the author's flagged ambiguity and the README ("same-size rows break ties on the aggregator name") — code, test, and README are mutually consistent. No change needed; flagging so @espg can veto if agg-major was actually wanted. [diff-scoped, judgment call]

Generated by Claude Code

espg · 2026-06-29T20:27:44Z

+        areas = sorted({d["area"] for d in rows if d["col"] == col}, reverse=True)
+        rank = {a: i for i, a in enumerate(areas)}
+        for d in rows:
+            if d["col"] == col:
+                d["rank"] = rank[d["area"]]
+    # One row per (size-rank, aggregator); largest shard at the top.
+    row_keys = sorted({(d["rank"], d["agg"]) for d in rows})
+    row_of = {k: i for i, k in enumerate(row_keys)}
+    nrows = len(row_keys)


🤖 from Claude (review)

Edge case: two targets in the same column+aggregator with identical shard_area_km2 but different resolutions collide into one cell. rank is keyed on the distinct area value, so two different-resolution rect targets that happen to share the exact same float area get the same rank; with the same aggregator they map to the same (rank, agg) row key, and grid[row][col] = target silently overwrites — one target vanishes from the figure with no error.

In today's data the four rect/healpix sizes are distinct, so this can't fire now, and the author flagged area-ties to @espg. But the layout claims to be data-driven for new targets, and the natural collision key is resolution, not area. Cheap hardening: rank by (area, grid_size) (or fall back to grid label) so equal-area-different-resolution targets still get distinct ranks. At minimum worth a one-line comment that equal areas within a family must not occur. [diff-scoped, low severity — latent]

Generated by Claude Code

espg · 2026-06-29T20:27:57Z

+                plotnonfinite=True,
+            )
+            if fail:  # failed-run markers, not joined to the line
+                ax.scatter(fail, [0] * len(fail), marker="x", color="0.5", s=60, zorder=3)


🤖 from Claude (review)

The fail 'x' at y=0 re-expands the cost axis back down to 0, partially undermining the "line never dips to 0" goal. Once zeros are NaN'd out of line, the left y-axis would autoscale tightly to the real cost range (e.g. 0.005–0.006). But this scatter(fail, [0]*n, ...) adds a real data point at y=0, so matplotlib's autoscale pulls ylim back down to ~0 — the real points then cluster at the very top and the 'x' sits alone at the bottom, against a baseline that now reaches 0. The line doesn't dip to 0 (good, the ask is met literally), but visually the axis floor is still 0, which is the impression the ask was trying to avoid.

Two options if you want the cost panel to read cleanly: (a) draw the 'x' in axes-fraction coords pinned near the bottom (transform=ax.get_xaxis_transform(), y≈0.02) so it flags the failed x without forcing data-space ymin=0; or (b) accept it as-is and note in the README that failures anchor the axis at 0. Either is fine — flagging because "break the line at zeros" was specifically to stop the 0-dip, and pinning the marker at data-y=0 reintroduces a 0 floor. [diff-scoped, judgment call]

Generated by Claude Code

espg · 2026-06-29T20:28:05Z

+            # ``ax``'s opaque patch -- otherwise the runtime glyph sits on top of
+            # the cost circles. Zero runtimes break the line the same way.
+            runtime = sub["runtime_s"].to_numpy(dtype=float)
+            rt_line = [v if v != 0 else float("nan") for v in runtime]


🤖 from Claude (review)

Runtime zeros break the line (good) but there is no failure 'x' marker on the runtime/right axis — only the cost axis gets one. @espg's ask was "apply to both cost and runtime series." This is arguably acceptable: a failed shard zeros both cost and runtime at the same x, so the cost 'x' already flags that merge as a failure and a second 'x' would be redundant clutter. But it's a literal partial-coverage against "both series," so calling it out as a judgment call: either (a) leave as-is and document in the README that the single cost 'x' flags the failure for both series, or (b) add a matching runtime 'x' on the twin axis for symmetry. I lean (a). [diff-scoped, judgment call]

Generated by Claude Code

espg · 2026-06-29T20:28:17Z

+            # non-connected 'X' so the x-axis/commit alignment is kept and the
+            # failure reads as a failure, not a real 0.
+            cost = sub[cost_col].to_numpy(dtype=float)
+            line = [v if v != 0 else float("nan") for v in cost]


🤖 from Claude (review)

v != 0 is the right "failed run" test for this data (costs/runtimes are pinned-shard measurements, never legitimately exactly 0.0; a near-zero real value like 1e-6 is non-zero and correctly kept). Two small notes, neither blocking:

NaN cost (missing metric, not a failed run) passes v != 0 (NaN != 0 → True), so it stays in line as a NaN break without a fail 'x' — i.e. silently a gap. That's a reasonable distinction (missing-metric ≠ crashed-run), just confirming it's intended rather than overlooked.

A failed run that emitted a tiny non-zero cost (partial write before crash) wouldn't be caught by == 0. Out of scope for this PR — the series records 0 for failures by convention — but worth knowing the marker keys strictly on exact 0. [diff-scoped, nit]

Generated by Claude Code

espg · 2026-06-29T20:28:27Z

+    assert 0.0 not in [y for y in ys if not np.isnan(y)]
+    # A distinct, non-line 'x' marker flags the failed run.
+    fails = [
+        c for c in panel.collections if isinstance(c, PathCollection) and len(c.get_offsets()) == 1


🤖 from Claude (review)

Matching the fail marker by len(c.get_offsets()) == 1 is robust for this fixture (3 commits → cost scatter has 3 offsets, fail scatter has 1, runtime is a Line2D not a PathCollection) and the test does prove the three things that matter: the cost line is NaN at the zero index, no connected 0 survives, and a single-offset 'x' collection exists. It would fail on pre-fix code (which plotted a connected 0). Good test.

One fragility note: the == 1 heuristic would mis-identify if a panel ever had exactly one commit total — then the cost scatter also has 1 offset and the assertion could match the wrong collection (or a non-failure panel could spuriously "have a fail marker"). Cheap tightening: assert on the marker itself, e.g. select the collection whose paths are the 'x' marker, or assert its single offset's x equals the failed index (1) and its y equals 0. Not blocking — the fixture guarantees ≥2 commits. [diff-scoped, nit]

Generated by Claude Code

espg · 2026-06-29T20:28:38Z

+# --- panel layout + failed-run handling (issue #121 review) ---------------
+
+
+def test_panel_layout_rect_left_healpix_right_largest_on_top():


🤖 from Claude (review)

test_panel_layout_rect_left_healpix_right_largest_on_top is a genuine, non-tautological pin of the rule: it asserts the exact 4x2 grid with rect in col 0 / healpix in col 1, largest shards on top, and agg as the row tie-break. It would fail if columns were swapped, if rows were mis-ordered, or if the size-rank alignment broke — i.e. it locks all of @espg's layout asks. Good.

Two gaps the suite doesn't cover (optional, not blocking): (a) the all-rect → ncols=1 path (does make_figure render a 1-col figure without the right-column logic tripping?), and (b) a column with only one family present. Both are reachable configurations the layout claims to handle data-drivenly; a small assertion on _panel_layout for an all-rect input would lock the ncols branch. [diff-scoped, nit]

Generated by Claude Code

espg · 2026-06-29T20:28:47Z

+
+When adding or rearranging targets, keep these conventions — the layout is derived
+from `grid_type` / `aggregator` / `shard_area_km2`, so getting that metadata right
+in `targets.json` is what places the panel.


🤖 from Claude (review)

README ↔ code match is accurate on all three asks: columns = grid family (rect left / healpix right), rows = aggregator + shard-size-rank aligned across families with "same-size rows break ties on the aggregator name" (matches sorted({(rank, agg)})), largest-first descending, and the zeros-are-failures rule (line breaks, non-connected x marker). Good — the documented convention is exactly the implemented rule, including the tie-break.

One doc accuracy nit tied to the runtime-marker finding on plot_series.py: the README says the failure "is shown as a non-line-connected x marker" without noting it's drawn only on the cost axis, not runtime. If you keep the cost-only marker, add a half-sentence that the single cost x flags the failure for both series (so the doc isn't read as promising an x on the runtime axis too). [diff-scoped, nit]

Generated by Claude Code

… layout tests

espg · 2026-06-29T20:33:29Z

🤖 from Claude

Round 5 done (meaningful data-driven panel layout: rect left / healpix right, rows agg+resolution-aligned largest-first; documented the conventions in tests/data/benchmark/README.md; dropped failed-run zeros — line breaks at zeros + non-connected x marker). Fresh-context self-review folded in 43b8a01: pinned the failure x in axes-fraction coords (with an explicit x-limit) so a failed run no longer drags the cost axis floor back to 0 — the cost axes now autoscale tight to the real range; hardened _panel_layout to rank on (area, grid_size) so equal-area/different-resolution targets can't collide into one cell; added all-rect-single-column and same-area-distinct-resolution layout tests; tightened the failure-marker assertion; and noted the cost-only x flags both series in the README. Gates green on touched files (pytest tests/test_benchmark.py -q = 41 passed, ruff clean). The ?round=5 previews above reflect the folded render. One judgment call for you: rows are size-major / aggregator-minor (largest-shard-first, gain_bias above tdigest within a size class) — say the word if you wanted aggregator-major instead. Awaiting your review.

Generated by Claude Code

claude added 2 commits June 29, 2026 11:16

phase 1 of issue #121

26347c7

phase 2 of issue #121

935953b

espg added the implement label Jun 29, 2026

espg commented Jun 29, 2026

View reviewed changes

fold review: clarify cmr override excludes backend

75d28ed

espg added the waiting label Jun 29, 2026

espg mentioned this pull request Jun 29, 2026

Claude Summary Tracking (June 2026) #50

Open

espg removed the waiting label Jun 29, 2026

This was referenced Jun 29, 2026

Order-10 single-shard processing OOMs at 2 GB (gain_bias) — 2 GB is marginal for order-10 #117

Open

On-demand benchmark gate matches /benchmark as a substring → URLs/paths trigger billed runs #126

Closed

claude added 3 commits June 29, 2026 17:25

merge main (#125 memory color-coding) into per-target-aoi

2630d23

fold review: benchmark plot fixes (circles, shared x-axis, horizontal…

59f35fb

… colorbar)

review renders: benchmark plot fixes preview (remove before merge)

b7c7872

fold review: per-target x labels so ticks stay length-matched

c7d7080

espg added the waiting label Jun 29, 2026

espg commented Jun 29, 2026

View reviewed changes

espg removed the waiting label Jun 29, 2026

fold review: colorbar MB dual-axis + clipped range + column padding

7332bf9

espg commented Jun 29, 2026

View reviewed changes

fold review: rename stale test, 2x2 uneven-commit case, assert colorb…

1aa343c

…ar MB axis

espg added the waiting label Jun 29, 2026

fold review: cost circles above runtime line + larger marker

db317b0

espg commented Jun 29, 2026

View reviewed changes

fold review: drop inert runtime zorder, assert marker size

f1ec81b

espg removed the waiting label Jun 29, 2026

fold review: meaningful panel layout + drop failed-run zeros + doc co…

97dbb87

…nventions

espg commented Jun 29, 2026

View reviewed changes

fold review: pin failure marker off the axis floor + rank tie-break +…

43b8a01

… layout tests

espg added the waiting label Jun 29, 2026

espg marked this pull request as ready for review June 29, 2026 20:38

espg merged commit 0fcac90 into main Jun 29, 2026
11 checks passed

		# --- panel layout + failed-run handling (issue #121 review) ---------------


		def test_panel_layout_rect_left_healpix_right_largest_on_top():

Uh oh!

Conversation

espg commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Approach

Phases

How tested

Questions for review

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

espg commented Jun 29, 2026

Uh oh!

github-actions Bot commented Jun 29, 2026

Lambda benchmark — 75d28ed

Uh oh!

espg commented Jun 29, 2026

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

espg commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

espg commented Jun 29, 2026

Uh oh!

espg commented Jun 29, 2026

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

espg commented Jun 29, 2026

Uh oh!

espg commented Jun 29, 2026

Uh oh!

espg commented Jun 29, 2026

Uh oh!

espg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

espg commented Jun 29, 2026

Uh oh!

espg commented Jun 29, 2026

Uh oh!

espg commented Jun 29, 2026

espg commented Jun 29, 2026 •

edited

Loading

Lambda benchmark — `75d28ed`

espg commented Jun 29, 2026 •

edited

Loading