fix(baker): clamp bbox to image space and squeeze overflowing rows#76
Open
jakebromberg wants to merge 3 commits into
Open
fix(baker): clamp bbox to image space and squeeze overflowing rows#76jakebromberg wants to merge 3 commits into
jakebromberg wants to merge 3 commits into
Conversation
Five baker fixes surfaced by PR #65's bundle output. scripts/make_verifier_bundle.py: - _maybe_prepend_missing_first_line: clamp the inferred first-line y to max(0, lines[0] - median_gap). Without the clamp, the verifier UI's ctx.drawImage gets a negative sy and renders whitespace instead of the actual row. Observed on .seed/verifier/1990-08aug0106-page24.bundle.json (top_right row 0 y1 = -237). - _assign_row_bboxes: when sum(spans) * row_height exceeds the available body height, shrink row_height proportionally so every entry gets a non-zero bbox. The prior fallback let y_cursor reach y2 and collapsed every subsequent row to a zero-height strip, silently hiding entries from review. Observed on .seed/verifier/1990-04apr1318-page11.bundle.json (one model-emitted row collapsed at y=4074). - _merge_with_spans: drop rows with empty raw_text and null notes. They render as empty input fields with no visual cue in the verifier UI, and scripts.derive_truth silently drops them downstream, breaking row-count parity. Illegible-tagged blank rows are preserved. - _merge_with_spans: reindex row_index 0..n-1 after the continuation merge. The verifier UI's add-row handler computes the new row's row_index as entries.length, which collides with a preserved predecessor's row_index once the sequence becomes sparse. - main: accept --pdf-path and --page-number for one-shot re-bake scripts that produce bundle outputs outside data/results/<rel>/page-NN.json. The flags must be passed together; without them, the existing result-path inference still applies. tests/unit/test_make_verifier_bundle.py: - New tests for each fix above. The fallback-overflow test was renamed from _clamps_to_quadrant_bottom to _squeezes_when_lines_undercover; the old name codified the broken zero-height behavior.
The earlier commit reindexed `row_index` 0..n-1 inside `_merge_with_spans` to avoid the verifier UI's add-row collision (the handler used `entries.length`, which collides with a sparse predecessor index). That fix had a silent backward-compat hazard: `verifier/app.js`'s `applyVerifiedToBundle` joins a `.verified.json` overlay onto its bundle by `row_index`. Any pre-existing `.verified.json` saved against a non-reindexed bundle would, on the next re-bake, have every row downstream of an absorbed continuation silently lose its bbox (the join would miss). This commit moves the fix from the baker to the UI: scripts/make_verifier_bundle.py: - _merge_with_spans no longer reindexes. The predecessor's row_index is preserved through continuation absorption and blank-row drops, leaving the sequence sparse. This matches the documented contract in core/continuations.py. verifier/app.js: - The add-row handler now mints `row_index = max(existing) + 1` (via reduce with -1 sentinel) instead of `entries.length`. Sparse sequences no longer cause collisions. tests/unit/test_make_verifier_bundle.py: - The reindex test is replaced with one asserting the predecessor's row_index is preserved through absorption. - The drop-blank-rows test now asserts the surviving row_index is sparse (0, 2 not 0, 1).
`test_assign_row_bboxes_clamps_y_cursor_at_y1_in_fallback` asserted that every fallback-path row bbox has `y1 >= quad_bbox.y1`. The implementation doesn't make that guarantee — `_maybe_prepend_missing_first_line` clamps to `max(0, ...)` in image space, not to the quadrant top, and the function's own comment explicitly allows row 0 to dip slightly above y1 ("handwriting that bled into the header band — keep that"). The test passed only because its chosen input (lines=[100, 300, 500], quad y1=200) is dropped to [300, 500] by `_filter_misattributed_leading_lines` and never triggers the prepend at all.
The negative-y image-space clamp is already covered directly by `test_maybe_prepend_missing_first_line_never_returns_negative_y`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
_maybe_prepend_missing_first_linenow clamps the synthetic first-line y tomax(0, lines[0] - median_gap), eliminating negative-y bboxes that cause the verifier UI'sctx.drawImageto clip outside the page._assign_row_bboxesnow proportionally squeezesrow_heightwhensum(spans) * row_height > available, eliminating zero-height bboxes when the model over-emits rows for the printed grid._merge_with_spansdrops rows with emptyraw_textand nullnotes. They render as empty input fields with no visual cue in the verifier UI, andscripts.derive_truthskips them anyway (its docstring is explicit), so they only break row-count parity downstream._merge_with_spanspreserves the predecessor'srow_indexthrough continuation absorption and blank-row drops — the resulting sequence is sparse, matching the contract incore/continuations.py.verifier/app.js's add-row handler now mints new indices offmax(existing) + 1instead ofentries.length, so a sparse sequence no longer collides.--pdf-path/--page-numberCLI flags onscripts/make_verifier_bundle.pyso one-shot re-bake scripts can produce correct(stem, pdf_path, page_number)job keys without staging underdata/results/<rel>/page-NN.json.Context
PR #65's review surfaced two baker geometry bugs latent on
main:.seed/verifier/1990-08aug0106-page24.bundle.jsonhadrow_bbox = [1263, -237, 2550, 152]intop_rightrow 0 — negative y1 from the unclamped_maybe_prepend_missing_first_line..seed/verifier/1990-04apr1318-page11.bundle.jsonhad row 7 collapsed to a zero-height bbox at y=4074 once the natural-shape pipeline correctly tagged row 3 asdouble_heightand shifted the overflow boundary.Both render as blank crops in the verifier UI, silently hiding the underlying row from review. Closes #74.
An earlier revision of this PR resolved the sparse
row_indexproblem by reindexing 0..n-1 inside the baker; a code review caught that this would silently breakapplyVerifiedToBundle's overlay join for any pre-existing.verified.jsonsaved against a non-reindexed bundle. The fix was moved to the UI's add-row handler so the baker can keep emitting sparse indices and existing overlays still align.Test plan
.venv/bin/pytest tests/unit/test_make_verifier_bundle.py— 48 tests green.venv/bin/pytest— 582 passed, 8 skipped (default markers).venv/bin/ruff check .— clean.venv/bin/ruff format --check .— clean.venv/bin/mypy core cli.py— clean