
fix(baker): clamp bbox to image space and squeeze overflowing rows #76

Open: jakebromberg wants to merge 3 commits into main from fix/baker-bbox-overflow-and-clamp



jakebromberg commented Jun 15, 2026

Summary

  • _maybe_prepend_missing_first_line now clamps the synthetic first-line y to max(0, lines[0] - median_gap), eliminating negative-y bboxes that cause the verifier UI's ctx.drawImage to clip outside the page.
  • _assign_row_bboxes now proportionally squeezes row_height when sum(spans) * row_height > available, eliminating zero-height bboxes when the model over-emits rows for the printed grid.
  • _merge_with_spans drops rows with empty raw_text and null notes. They render as empty input fields with no visual cue in the verifier UI, and scripts.derive_truth skips them anyway (its docstring is explicit), so they only break row-count parity downstream.
  • _merge_with_spans preserves the predecessor's row_index through continuation absorption and blank-row drops — the resulting sequence is sparse, matching the contract in core/continuations.py. verifier/app.js's add-row handler now mints new indices off max(existing) + 1 instead of entries.length, so a sparse sequence no longer collides.
  • New --pdf-path / --page-number CLI flags on scripts/make_verifier_bundle.py so one-shot re-bake scripts can produce correct (stem, pdf_path, page_number) job keys without staging under data/results/<rel>/page-NN.json.
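
The proportional squeeze in the second bullet can be sketched as follows. This is a minimal illustration, not the code in `_assign_row_bboxes`: the function name `assign_row_offsets` and its signature are hypothetical, and it returns y offsets relative to the top of the available body rather than full bboxes.

```python
def assign_row_offsets(spans, row_height, available):
    """Return one (y_start, y_end) offset pair per row.

    Hypothetical helper: when the rows need more height than is available,
    row_height is shrunk by available / needed so every row stays non-zero.
    """
    needed = sum(spans) * row_height
    if needed > available and needed > 0:
        row_height = row_height * available / needed

    offsets = []
    cursor = 0.0
    for span in spans:
        height = span * row_height
        offsets.append((cursor, cursor + height))
        cursor += height
    return offsets


# Four row units of 100 px do not fit in 200 px, so each unit shrinks to 50 px.
print(assign_row_offsets([1, 2, 1], row_height=100, available=200))
# [(0.0, 50.0), (50.0, 150.0), (150.0, 200.0)]
```

Scaling every row by the same factor keeps double-height rows twice as tall as their neighbours, which is the property the old y2 clamp lost once the cursor hit the bottom.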

Context

PR #65's review surfaced two baker geometry bugs latent on main:

  • .seed/verifier/1990-08aug0106-page24.bundle.json had row_bbox = [1263, -237, 2550, 152] in top_right row 0 — negative y1 from the unclamped _maybe_prepend_missing_first_line.
  • .seed/verifier/1990-04apr1318-page11.bundle.json had row 7 collapsed to a zero-height bbox at y=4074 once the natural-shape pipeline correctly tagged row 3 as double_height and shifted the overflow boundary.
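
A check for both failure modes could look like the sketch below. It assumes only that a bbox is a `[x1, y1, x2, y2]` list, as in the example above; the function name `find_bad_bboxes` is hypothetical, and how bboxes are pulled out of a bundle file is left out because the bundle schema is not shown here.

```python
def find_bad_bboxes(bboxes):
    """Return (position, problem) pairs for bboxes that would render as blank crops."""
    problems = []
    for position, (x1, y1, x2, y2) in enumerate(bboxes):
        if y1 < 0:
            problems.append((position, "negative-y1"))
        if y2 <= y1:
            problems.append((position, "zero-height"))
    return problems


sample = [
    [1263, -237, 2550, 152],   # the negative-y1 bbox quoted above
    [1263, 4074, 2550, 4074],  # zero height at y=4074; x values are made up
    [1263, 152, 2550, 540],    # a healthy row, values made up
]
print(find_bad_bboxes(sample))
# [(0, 'negative-y1'), (1, 'zero-height')]
```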

Both render as blank crops in the verifier UI, silently hiding the underlying row from review. Closes #74.

An earlier revision of this PR resolved the sparse row_index problem by reindexing 0..n-1 inside the baker; a code review caught that this would silently break applyVerifiedToBundle's overlay join for any pre-existing .verified.json saved against a non-reindexed bundle. The fix was moved to the UI's add-row handler so the baker can keep emitting sparse indices and existing overlays still align.
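
To make the hazard concrete, here is a small Python model of a join keyed on `row_index`. It is an illustration only: the real join lives in `applyVerifiedToBundle` in `verifier/app.js`, and the dict shapes and the name `apply_overlay` are assumptions.

```python
def apply_overlay(bundle_rows, verified_rows):
    """Attach each verified row's bbox by looking up its row_index in the bundle."""
    bbox_by_index = {row["row_index"]: row["row_bbox"] for row in bundle_rows}
    return [
        {**row, "row_bbox": bbox_by_index.get(row["row_index"])}
        for row in verified_rows
    ]


# Overlay saved against a sparse bundle: row 1 was absorbed as a continuation.
verified = [{"row_index": 0, "raw_text": "a"}, {"row_index": 2, "raw_text": "b"}]

sparse_bundle = [
    {"row_index": 0, "row_bbox": [0, 0, 10, 10]},
    {"row_index": 2, "row_bbox": [0, 10, 10, 30]},
]
reindexed_bundle = [
    {"row_index": 0, "row_bbox": [0, 0, 10, 10]},
    {"row_index": 1, "row_bbox": [0, 10, 10, 30]},
]

print([r["row_bbox"] for r in apply_overlay(sparse_bundle, verified)])
# [[0, 0, 10, 10], [0, 10, 10, 30]]
print([r["row_bbox"] for r in apply_overlay(reindexed_bundle, verified)])
# [[0, 0, 10, 10], None]
```

With more rows after the absorbed one, a reindexed bundle could also match an overlay row to the wrong bbox rather than miss outright, which is harder to spot than a missing crop.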

Test plan

  • .venv/bin/pytest tests/unit/test_make_verifier_bundle.py — 48 tests green
  • .venv/bin/pytest — 582 passed, 8 skipped (default markers)
  • .venv/bin/ruff check . — clean
  • .venv/bin/ruff format --check . — clean
  • .venv/bin/mypy core cli.py — clean
  • After merge: re-bake the 64 bundles on PR #65's branch (chore(seed): re-bake 64 untouched bundles with natural-shape pipeline) with this baker and verify zero negative-y1 and zero overflow-collapse bboxes

Five baker fixes surfaced by PR #65's bundle output.

scripts/make_verifier_bundle.py:

- _maybe_prepend_missing_first_line: clamp the inferred first-line y to max(0, lines[0] - median_gap). Without the clamp, the verifier UI's ctx.drawImage gets a negative sy and renders whitespace instead of the actual row. Observed on .seed/verifier/1990-08aug0106-page24.bundle.json (top_right row 0 y1 = -237).

- _assign_row_bboxes: when sum(spans) * row_height exceeds the available body height, shrink row_height proportionally so every entry gets a non-zero bbox. The prior fallback let y_cursor reach y2 and collapsed every subsequent row to a zero-height strip, silently hiding entries from review. Observed on .seed/verifier/1990-04apr1318-page11.bundle.json (one model-emitted row collapsed at y=4074).

- _merge_with_spans: drop rows with empty raw_text and null notes. They render as empty input fields with no visual cue in the verifier UI, and scripts.derive_truth silently drops them downstream, breaking row-count parity. Illegible-tagged blank rows are preserved.

- _merge_with_spans: reindex row_index 0..n-1 after the continuation merge. The verifier UI's add-row handler computes the new row's row_index as entries.length, which collides with a preserved predecessor's row_index once the sequence becomes sparse.

- main: accept --pdf-path and --page-number for one-shot re-bake scripts that produce bundle outputs outside data/results/<rel>/page-NN.json. The flags must be passed together; without them, the existing result-path inference still applies.
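
One way to enforce the "passed together" rule with `argparse` is sketched below. Only the two flag names come from this commit; the `parse_args` wrapper, the `int` type on `--page-number`, and the error wording are assumptions, and the script's other arguments are omitted.

```python
import argparse


def parse_args(argv):
    """Parse the two optional flags and reject the case where only one is given."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--pdf-path")
    parser.add_argument("--page-number", type=int)
    args = parser.parse_args(argv)

    # Either both flags are present or neither is; anything else is a usage error.
    if (args.pdf_path is None) != (args.page_number is None):
        parser.error("--pdf-path and --page-number must be passed together")
    return args
```

Calling `parse_args(["--pdf-path", "scan.pdf"])` exits with a usage error, while an empty argument list leaves both values as `None` so the caller can fall back to result-path inference.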

tests/unit/test_make_verifier_bundle.py:

- New tests for each fix above. The fallback-overflow test was renamed from _clamps_to_quadrant_bottom to _squeezes_when_lines_undercover; the old name codified the broken zero-height behavior.

The earlier commit reindexed `row_index` 0..n-1 inside `_merge_with_spans` to avoid the verifier UI's add-row collision (the handler used `entries.length`, which collides with a sparse predecessor index). That fix had a silent backward-compat hazard: `verifier/app.js`'s `applyVerifiedToBundle` joins a `.verified.json` overlay onto its bundle by `row_index`. Any pre-existing `.verified.json` saved against a non-reindexed bundle would, on the next re-bake, have every row downstream of an absorbed continuation silently lose its bbox (the join would miss).

This commit moves the fix from the baker to the UI:

scripts/make_verifier_bundle.py:
- _merge_with_spans no longer reindexes. The predecessor's row_index is preserved through continuation absorption and blank-row drops, leaving the sequence sparse. This matches the documented contract in core/continuations.py.

verifier/app.js:
- The add-row handler now mints `row_index = max(existing) + 1` (via reduce with -1 sentinel) instead of `entries.length`. Sparse sequences no longer cause collisions.
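
The minting rule is shown here in Python for consistency with the baker code. The handler itself is JavaScript in `verifier/app.js`; `next_row_index` and the entry dict shape are hypothetical stand-ins.

```python
from functools import reduce


def next_row_index(entries):
    """Return max(existing row_index) + 1, or 0 when there are no entries."""
    highest = reduce(lambda acc, entry: max(acc, entry["row_index"]), entries, -1)
    return highest + 1


sparse = [{"row_index": 0}, {"row_index": 2}]
print(len(sparse))             # 2, which collides with the existing row_index 2
print(next_row_index(sparse))  # 3
print(next_row_index([]))      # 0
```

The -1 sentinel makes the empty case fall out of the same expression instead of needing a separate branch.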

tests/unit/test_make_verifier_bundle.py:
- The reindex test is replaced with one asserting the predecessor's row_index is preserved through absorption.
- The drop-blank-rows test now asserts the surviving row_index is sparse (0, 2 not 0, 1).
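
The sparse result that test asserts can be reproduced with a small filter. This is a sketch under assumptions: rows are modelled as dicts with `raw_text`, `notes` and `row_index` keys, the name `drop_blank_rows` is hypothetical, and whitespace-only text is treated as empty, which the real `_merge_with_spans` may or may not do.

```python
def drop_blank_rows(rows):
    """Keep rows that have text or a non-null note; never renumber row_index."""
    return [
        row
        for row in rows
        if (row.get("raw_text") or "").strip() or row.get("notes") is not None
    ]


rows = [
    {"row_index": 0, "raw_text": "first entry", "notes": None},
    {"row_index": 1, "raw_text": "", "notes": None},  # blank with no note: dropped
    {"row_index": 2, "raw_text": "third entry", "notes": None},
]
print([row["row_index"] for row in drop_blank_rows(rows)])
# [0, 2]
```
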

`test_assign_row_bboxes_clamps_y_cursor_at_y1_in_fallback` asserted that every fallback-path row bbox has `y1 >= quad_bbox.y1`. The implementation doesn't make that guarantee — `_maybe_prepend_missing_first_line` clamps to `max(0, ...)` in image space, not to the quadrant top, and the function's own comment explicitly allows row 0 to dip slightly above y1 ("handwriting that bled into the header band — keep that"). The test passed only because its chosen input (lines=[100, 300, 500], quad y1=200) is dropped to [300, 500] by `_filter_misattributed_leading_lines` and never triggers the prepend at all.

The negative-y image-space clamp is already covered directly by `test_maybe_prepend_missing_first_line_never_returns_negative_y`.
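
The difference between the two guarantees is easy to see in a sketch of the clamp. Only the expression `max(0, lines[0] - median_gap)` comes from this PR; the function name `inferred_first_line_y`, the way `median_gap` is computed here, and the sample values are assumptions.

```python
from statistics import median


def inferred_first_line_y(lines):
    """Estimate the y of a missing first line, clamped to the image top (y=0).

    Needs at least two detected lines so that a gap can be measured.
    """
    gaps = [lower - upper for upper, lower in zip(lines, lines[1:])]
    median_gap = median(gaps)
    return max(0, lines[0] - median_gap)


quad_y1 = 200  # hypothetical quadrant top

# The estimate lands above the quadrant top but is still inside the image.
y = inferred_first_line_y([250, 450, 650])
print(y >= 0, y >= quad_y1)  # True False

# A large gap would push the estimate negative; the clamp pins it to 0.
print(inferred_first_line_y([100, 400, 700]))  # 0
```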
Development

Successfully merging this pull request may close these issues.

Baker geometry: negative-y1 bboxes + zero-height overflow rows
