Phase 1: notes column failure modes (crossed_out over-emitted, continuation never emitted, double_height missed) #61

@jakebromberg

Problem

Gemini's Entry.notes field on Phase 1 output has three distinct failure modes, measured against 20 hand-verified pages from 1990-04apr0106 (1074 emitted rows, ~50 per page). All three are structural: the model is misreading or not surfacing visual cues for row-spanning and crossing-out.

| Note value | Gemini emits | Alex keeps | Alex removes / reclassifies | Headline |
|---|---|---|---|---|
| crossed_out | 27 | 6 | 16 removed; 3 → double_height, 2 → continuation | 78% wrong |
| continuation | 0 | n/a | Alex sets 29 from scratch: 19 None→cont, 8 double_height→cont, 2 crossed_out→cont | 100% missed |
| double_height | 102 | 88 | 6 removed; 8 → continuation, 2 → other, 1 → illegible; 36 false negatives (Alex sets None→double_height) | FNR 36/126 = 29% |
| illegible | 0 | n/a | Alex sets 4 | mostly missed |
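
The headline figures follow directly from the counts in the table. A minimal sketch of that arithmetic, using only the numbers quoted above (the variable names are illustrative, and the 126 denominator for double_height is taken as stated rather than re-derived):

```python
# Counts copied from the table above; variable names are illustrative only.
crossed_out_emitted = 27
crossed_out_kept = 6
crossed_out_precision = crossed_out_kept / crossed_out_emitted
crossed_out_wrong = 1 - crossed_out_precision  # removed or reclassified

continuation_emitted = 0
# None->cont + double_height->cont + crossed_out->cont
continuation_truth = 19 + 8 + 2
continuation_recall = continuation_emitted / continuation_truth

double_height_false_negatives = 36
double_height_truth = 126  # denominator as stated in the table
double_height_fnr = double_height_false_negatives / double_height_truth

print(f"crossed_out precision: {crossed_out_precision:.0%}")
print(f"crossed_out wrong:     {crossed_out_wrong:.0%}")
print(f"continuation recall:   {continuation_recall:.0%}")
print(f"double_height FNR:     {double_height_fnr:.0%}")
```

The crossed_out precision of roughly 22% is the complement of the "78% wrong" headline.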

These are distinct from the raw_text content errors (#58 covers the related type_raw blank-rate failure) — the row-level text can be correct while the notes field is wrong.

Why this matters

  • double_height false negatives shift bbox geometry. The verifier's per-row cropper uses span counts derived from notes. A handwritten 2-row entry that Gemini emits as notes=None produces two crops that each contain half the handwriting. Alex sees a visually broken row strip and has to set double_height to fix it. The page-34 bbox bug (PR #60, "fix(verifier): share inter-block boundary line between top and bottom partitions") is unrelated, but the cropper here is downstream of the same span concept.
  • continuation is never emitted. The Phase 1 prompt mentions it (see core/prompts.py) and the read-time merge in core/continuations.py is wired to handle it, but the model doesn't surface it. Alex has to set continuation manually for every multi-line song entry, and read-time merging never fires in the corpus.
  • crossed_out over-emission wastes Alex's time. Roughly four of every five crossed_out tags (21 of 27) are removed or reclassified.
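
To illustrate the first point, here is a minimal sketch of how a span count derived from notes could drive per-row crop bands. This is not the verifier's actual cropper: `Row`, `crop_spans`, and the fixed 40-pixel row height are all hypothetical, chosen only to show why a missed double_height splits one entry across two crops.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Row:
    """Hypothetical stand-in for one emitted Phase 1 row."""
    text: str
    notes: Optional[str] = None


def crop_spans(rows: List[Row], row_height: int = 40) -> List[Tuple[int, int]]:
    """Return one (top, bottom) pixel band per emitted row.

    Assumption: a double_height row consumes two printed-grid rows,
    every other row consumes one.
    """
    spans = []
    top = 0
    for row in rows:
        span = 2 if row.notes == "double_height" else 1
        bottom = top + span * row_height
        spans.append((top, bottom))
        top = bottom
    return spans


# "Song A" is handwritten across two grid rows (0-80 px on the page).
missed = crop_spans([Row("Song A"), Row("Song B")])
tagged = crop_spans([Row("Song A", "double_height"), Row("Song B")])

print(missed)  # [(0, 40), (40, 80)]: Song A's handwriting is split across both crops
print(tagged)  # [(0, 80), (80, 120)]: Song A gets one full-height crop
```

In the untagged case every later crop on the page is also shifted up by one grid row, which is the geometry shift described above.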

Desired end state

Per-row notes matches the visual cues on the page at parity with the metadata fields (which are already at 0 corrections / 19 pages). Concretely:

  • continuation: emitted on multi-line handwritten entries. Specifically the case where one song's title or notes wrap onto the printed line below.
  • crossed_out: emitted only when a strike-through actually crosses the song name (not when a margin doodle bleeds into the row).
  • double_height: emitted on all entries whose handwriting visually spans two printed-grid rows.

Where

  • core/prompts.py: PAGE_EXTRACTION_PROMPT and its quadrant-prompt counterparts. The current language around notes likely needs (a) an explicit definition of each value with a worked example, and (b) a stronger directive to emit continuation on multi-line entries.
  • data/verifier-pulled-refresh/*.corrections.json — empirical dataset of (Gemini-emitted, Alex-corrected) pairs. Use for evaluation; don't burn API tokens iterating without measuring.
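
One way to use the corrections files for measurement is to tally (emitted, corrected) transitions before and after each prompt change. The sketch below assumes a schema that is not confirmed by this issue: each `*.corrections.json` file is taken to be a JSON list of objects with hypothetical `emitted_notes` and `corrected_notes` keys. Adjust the key names to the real on-disk format.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, List


def load_records(directory: str) -> List[Dict]:
    """Read every *.corrections.json file under `directory`.

    Assumption: each file holds a JSON list of per-row objects.
    """
    records: List[Dict] = []
    for path in sorted(Path(directory).glob("*.corrections.json")):
        records.extend(json.loads(path.read_text(encoding="utf-8")))
    return records


def tally_transitions(records: Iterable[Dict]) -> Counter:
    """Count (emitted, corrected) note pairs; None means no note was set."""
    return Counter(
        (r.get("emitted_notes"), r.get("corrected_notes")) for r in records
    )


# Tiny in-memory sample so the sketch runs without the real data directory.
sample = [
    {"emitted_notes": "crossed_out", "corrected_notes": None},
    {"emitted_notes": "crossed_out", "corrected_notes": "crossed_out"},
    {"emitted_notes": None, "corrected_notes": "continuation"},
    {"emitted_notes": "double_height", "corrected_notes": "continuation"},
]
transitions = tally_transitions(sample)
for (emitted, corrected), n in sorted(transitions.items(), key=str):
    print(f"{emitted} -> {corrected}: {n}")
```

Against the real corpus, the call would look like `tally_transitions(load_records("data/verifier-pulled-refresh"))`, again subject to the schema assumption.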

Constraints

  • Don't restructure the notes enum. The four current values (continuation, double_height, crossed_out, illegible) plus other cover the observed cases. The fix is reading them correctly, not extending the vocabulary.
  • The read-time core/continuations.py merge must keep working with its on-disk format unchanged, so a prompt fix that makes the model surface continuation should slot in transparently.
  • Don't break the strong-performing fields (page_date_raw, hour_raw, jock_raw, the structural quadrant detection). Keep prompt edits scoped to the notes-column instruction block; this is not a rewrite.

Acceptance criteria

  • On a held-out subset of the verified corpus, re-extraction shows crossed_out precision ≥ 60% (vs. current ~22%), continuation recall ≥ 60% on Alex-marked rows, and double_height FNR ≤ 15% (vs. current 29%).
  • No regression on raw_text substring match rate or on page/quadrant metadata edit counts.
  • Existing core.continuations.merge_continuations tests still pass; if the prompt now produces continuation tags, add an end-to-end test that the read-time merge folds them as expected.
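
The three thresholds in the first criterion can be checked mechanically. A minimal sketch, assuming rows are available as (emitted, corrected) note pairs; `notes_metrics` and `passes_gate` are hypothetical helpers, and a double_height false negative is taken here to mean any Alex-marked double_height row that Gemini tagged as something else:

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # (Gemini-emitted, Alex-corrected)


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the metric is undefined."""
    return numerator / denominator if denominator else None


def notes_metrics(pairs: List[Pair]) -> Dict[str, Optional[float]]:
    co_emitted = sum(1 for e, _ in pairs if e == "crossed_out")
    co_kept = sum(1 for e, c in pairs if e == c == "crossed_out")
    cont_truth = sum(1 for _, c in pairs if c == "continuation")
    cont_hit = sum(1 for e, c in pairs if e == c == "continuation")
    dh_truth = sum(1 for _, c in pairs if c == "double_height")
    dh_missed = sum(
        1 for e, c in pairs if c == "double_height" and e != "double_height"
    )
    return {
        "crossed_out_precision": ratio(co_kept, co_emitted),
        "continuation_recall": ratio(cont_hit, cont_truth),
        "double_height_fnr": ratio(dh_missed, dh_truth),
    }


def passes_gate(m: Dict[str, Optional[float]]) -> bool:
    """Apply the thresholds from the acceptance criteria above.

    An undefined metric (None) fails the gate rather than passing silently.
    """
    if any(v is None for v in m.values()):
        return False
    return (
        m["crossed_out_precision"] >= 0.60
        and m["continuation_recall"] >= 0.60
        and m["double_height_fnr"] <= 0.15
    )


# Synthetic pairs for illustration only; not measurements from the corpus.
pairs: List[Pair] = (
    [("crossed_out", "crossed_out")] * 3
    + [("crossed_out", None)]
    + [("continuation", "continuation")] * 2
    + [(None, "continuation")]
    + [("double_height", "double_height")] * 9
    + [(None, "double_height")]
)
metrics = notes_metrics(pairs)
print(metrics)
print(passes_gate(metrics))
```

Treating an undefined metric as a failure is a deliberate choice here: a held-out subset with no Alex-marked continuation rows should not count as meeting the recall target.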

Notes for implementer

  • Worked examples carry more weight than enumerations here, especially for continuation (which is a visual concept that's hard to define in prose: "the line below has handwriting that completes the song name above").
  • The 19-pages-1-PDF corpus is the same era and same DJ rotation; whatever prompt changes you make should be evaluated against held-out pages from different PDFs / decades before scaling. The existing external_api golden run is the natural gate.
  • The notes-column prompt section may already share content with type_raw (#58, "Phase 1: type_raw blanked on 44% of rows despite prompt requesting it"). Don't combine the two prompt edits in one PR; they have independent acceptance criteria.

Labels: bug