Problem
Gemini's Entry.notes field on Phase 1 output has three distinct failure modes, measured against 20 hand-verified pages from 1990-04apr0106 (1074 emitted rows, ~50 per page). All three are structural: the model is misreading or not surfacing visual cues for row-spanning and crossing-out.
These are distinct from the raw_text content errors (#58 covers the related type_raw blank-rate failure) — the row-level text can be correct while the notes field is wrong.
Why this matters
double_height false negatives shift bbox geometry. The verifier's per-row cropper uses span counts derived from notes. A handwritten 2-row entry that Gemini emits as notes=None produces two crops that each contain half the handwriting. Alex sees a visually broken row strip and has to set double_height to fix it. The page-34 bbox bug (PR #60, "fix(verifier): share inter-block boundary line between top and bottom partitions") is unrelated, but the cropper here is downstream of the same span concept.
continuation is never emitted. The Phase 1 prompt mentions it (see core/prompts.py) and the read-time merge in core/continuations.py is wired to handle it, but the model doesn't surface it. Alex has to set continuation manually for every multi-line song entry, and read-time merging never fires in the corpus.
crossed_out over-emission wastes Alex's time. Roughly three of every four crossed_out tags that Gemini emits are removed during verification (precision is about 22%).
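A minimal sketch of why a missed double_height tag shifts crop geometry. It assumes the cropper stacks bands top to bottom and gives each entry one printed row unless notes says double_height. The function row_bands, its parameters, and the pixel values are hypothetical and are not the verifier's actual code.

```python
from __future__ import annotations

from typing import Optional


def row_bands(
    notes_per_entry: list[Optional[str]],
    top: float,
    row_height: float,
) -> list[tuple[float, float]]:
    """Return a (y0, y1) crop band per entry, stacking from `top` downward.

    Assumption: an entry tagged "double_height" consumes two printed rows,
    every other entry consumes one.
    """
    bands = []
    y = top
    for notes in notes_per_entry:
        span = 2 if notes == "double_height" else 1
        bands.append((y, y + span * row_height))
        y += span * row_height
    return bands


# Entry 0 is handwritten across two printed rows; entry 1 follows it.
tagged = row_bands(["double_height", None], top=0.0, row_height=10.0)
missed = row_bands([None, None], top=0.0, row_height=10.0)

print(tagged)  # entry 0 covers both rows, entry 1 starts below it
print(missed)  # entry 0 gets only the upper half; entry 1's crop lands on the lower half
```

Under these assumptions, the missed tag both truncates the first crop and pulls the next entry's crop up onto the rest of the same handwriting, which matches the "two crops that each contain half" symptom.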
Desired end state
Per-row notes matches the visual cues on the page at parity with the metadata fields (which are already at 0 corrections / 19 pages). Concretely:
continuation: emitted on multi-line handwritten entries, specifically where one song's title or notes wrap onto the printed line below.
crossed_out: emitted only when a strike-through actually crosses the song name (not when a margin doodle bleeds into the row).
double_height: emitted on all entries whose handwriting visually spans two printed-grid rows.
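To make the continuation case concrete, here is an illustrative fold of a tagged row into the row above it. This is not core.continuations.merge_continuations, whose signature and on-disk format are not shown in this issue. The function name, the dict keys, and the assumption that the tag sits on the lower line are all hypothetical.

```python
from __future__ import annotations


def fold_continuations_sketch(rows: list[dict]) -> list[dict]:
    """Append each row tagged "continuation" to the text of the row before it.

    Assumption: each row is a dict with "raw_text" and "notes" keys, and the
    continuation tag is carried by the lower (wrapping) line.
    """
    merged: list[dict] = []
    for row in rows:
        if row.get("notes") == "continuation" and merged:
            prev = merged[-1]
            prev["raw_text"] = f'{prev["raw_text"]} {row["raw_text"]}'.strip()
        else:
            merged.append(dict(row))  # copy so the caller's rows are not mutated
    return merged


rows = [
    {"raw_text": "TITLE PART ONE", "notes": None},
    {"raw_text": "PART TWO", "notes": "continuation"},
    {"raw_text": "NEXT SONG", "notes": None},
]
folded = fold_continuations_sketch(rows)
print(folded)
```

If the model never emits the tag, the second row stays a separate entry and a fold like this has nothing to act on.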
Where
core/prompts.py — PAGE_EXTRACTION_PROMPT and quadrant-prompt counterparts. The current language around notes likely needs (a) explicit definition of each value with a worked example, (b) a stronger directive to emit continuation on multi-line entries.
data/verifier-pulled-refresh/*.corrections.json — empirical dataset of (Gemini-emitted, Alex-corrected) pairs. Use for evaluation; don't burn API tokens iterating without measuring.
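One way to measure before iterating is to tally how often each emitted notes value was changed to each corrected value. The sketch below works on in-memory (emitted, corrected) pairs. Loading them from the *.corrections.json files is left out because their schema is not described here, and tally_transitions and the toy pairs are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# (Gemini-emitted notes, Alex-corrected notes); None means no tag.
Pair = Tuple[Optional[str], Optional[str]]


def tally_transitions(pairs: Iterable[Pair]) -> Counter:
    """Count each (emitted, corrected) change; unchanged rows are skipped."""
    return Counter(
        (emitted, corrected)
        for emitted, corrected in pairs
        if emitted != corrected
    )


# Toy data for illustration only, not taken from the verified corpus.
toy_pairs = [
    ("crossed_out", None),
    ("crossed_out", None),
    ("crossed_out", "double_height"),
    (None, "double_height"),
    ("double_height", "double_height"),
]

transitions = tally_transitions(toy_pairs)
for (emitted, corrected), count in transitions.most_common():
    print(f"{emitted} -> {corrected}: {count}")
```

Running the same tally before and after a prompt change gives a per-tag view of which corrections went away and which new ones appeared.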
Constraints
Don't restructure the notes enum. The four current values (continuation, double_height, crossed_out, illegible) plus other cover the observed cases. The fix is reading them correctly, not extending the vocabulary.
The read-time core/continuations.py merge must keep working. Its on-disk format is unchanged, so a prompt fix that makes the model surface continuation should slot in transparently.
Don't break the strong-performing fields (page_date_raw, hour_raw, jock_raw, the structural quadrant detection). Prompt edits should be scoped to the notes-column instruction block, not a rewrite of the whole prompt.
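A hedged sketch of what a scoped notes-column block could look like while leaving the vocabulary untouched. The wording is illustrative and is not the current text of PAGE_EXTRACTION_PROMPT. The definitions of illegible and other, the use of null for untagged rows, and the names NOTES_BLOCK_SKETCH and missing_values are assumptions.

```python
# The existing vocabulary, unchanged: four named values plus "other".
NOTES_VALUES = ("continuation", "double_height", "crossed_out", "illegible", "other")

# Illustrative wording only; one definition and one worked example per value.
NOTES_BLOCK_SKETCH = """\
notes: exactly one of continuation, double_height, crossed_out, illegible, other,
or null when none applies.

- continuation: the handwriting on this printed line completes the song title
  or notes that began on the line above.
  Example: the line above holds the first words of a title and this line holds
  the rest of it. Tag this line continuation.
- double_height: one entry's handwriting visually spans two printed-grid rows.
  Example: tall lettering that fills two rows for a single song.
- crossed_out: a strike-through crosses the song name itself.
  Counter-example: a margin doodle that bleeds into the row is NOT crossed_out.
- illegible: the handwriting cannot be read. (assumed meaning)
- other: a visual cue that none of the values above covers. (assumed meaning)
"""


def missing_values(block: str) -> list:
    """Return the vocabulary values that the prompt block never mentions."""
    return [value for value in NOTES_VALUES if value not in block]


print(missing_values(NOTES_BLOCK_SKETCH))
```

A check like missing_values can run in a unit test so that a later prompt edit cannot silently drop one of the values.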
Acceptance criteria
On a held-out subset of the verified corpus, re-extraction shows crossed_out precision ≥ 60% (vs. current ~22%), continuation recall ≥ 60% on Alex-marked rows, and a double_height false-negative rate (FNR) ≤ 15% (vs. current 29%).
No regression on raw_text substring match rate or on page/quadrant metadata edit counts.
Existing core.continuations.merge_continuations tests still pass; if the prompt now produces continuation tags, add an end-to-end test that the read-time merge folds them as expected.
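The three thresholds above can be computed from (emitted, corrected) pairs roughly as follows. This sketch treats Alex's corrected value as ground truth and defines FNR as 1 minus recall. Both are assumptions about how the issue's numbers were derived, and the helper names and toy data are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# (Gemini-emitted notes, Alex-corrected notes); None means no tag.
Pair = Tuple[Optional[str], Optional[str]]


def precision(pairs: Sequence[Pair], tag: str) -> Optional[float]:
    """Share of rows emitted as `tag` that Alex left as `tag`."""
    kept = [corrected == tag for emitted, corrected in pairs if emitted == tag]
    return sum(kept) / len(kept) if kept else None


def recall(pairs: Sequence[Pair], tag: str) -> Optional[float]:
    """Share of rows Alex marked `tag` that were already emitted as `tag`."""
    found = [emitted == tag for emitted, corrected in pairs if corrected == tag]
    return sum(found) / len(found) if found else None


def false_negative_rate(pairs: Sequence[Pair], tag: str) -> Optional[float]:
    """Share of rows Alex marked `tag` that the model failed to tag."""
    r = recall(pairs, tag)
    return None if r is None else 1.0 - r


# Toy data for illustration only, not taken from the verified corpus.
toy_pairs = [
    ("crossed_out", "crossed_out"),
    ("crossed_out", None),
    ("crossed_out", "double_height"),
    (None, "double_height"),
    ("double_height", "double_height"),
    (None, "continuation"),
    (None, None),
]

print(precision(toy_pairs, "crossed_out"))
print(recall(toy_pairs, "continuation"))
print(false_negative_rate(toy_pairs, "double_height"))
```

Each helper returns None rather than 0 when a tag never occurs, so an empty held-out subset cannot pass a threshold by accident.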
Notes for implementer
Worked examples carry more weight than enumerations here, especially for continuation (which is a visual concept that's hard to define in prose: "the line below has handwriting that completes the song name above").
The corpus is 19 pages from a single PDF, all from the same era and the same DJ rotation; whatever prompt changes you make should be evaluated against held-out pages from different PDFs / decades before scaling. The existing external_api golden run is the natural gate.
The notes-column prompt section may already share content with type_raw (#58, "Phase 1: type_raw blanked on 44% of rows despite prompt requesting it"); be careful not to combine the two prompt edits in one PR, since they have independent acceptance criteria.
Related
#58: type_raw blanked on 44% of rows. Same Phase-1-prompt-quality cluster but different field.
data/verifier-pulled-refresh/*.corrections.json (20 pages, ~1074 rows).
Problem: correction breakdown by notes value (table flattened in extraction; only these fragments survive, [...] marks lost text)
crossed_out: [...] → double_height, 2 → continuation
continuation: [...]
double_height: [...] → continuation, 2 → other, 1 → illegible; 36 false negatives (Alex sets None → double_height)
illegible: [...]