Phase 1: notes column failure modes (crossed_out over-emitted, continuation never emitted, double_height missed)

## Problem

Gemini's `Entry.notes` field on Phase 1 output has three distinct failure modes, measured against 20 hand-verified pages from `1990-04apr0106` (1074 emitted rows, ~50 per page). All three are structural: the model is misreading or not surfacing visual cues for row-spanning and crossing-out.

| Note value | Gemini emits | Alex keeps | Alex removes / reclassifies | Headline |
|---|---|---|---|---|
| `crossed_out` | 27 | 6 | 16 removed; 3 → `double_height`, 2 → `continuation` | **78% wrong** |
| `continuation` | 0 | — | — (Alex sets 29 from scratch: 19 None→cont, 8 double_height→cont, 2 crossed_out→cont) | **100% missed** |
| `double_height` | 102 | 88 | 6 removed; 8 → `continuation`, 2 → other, 1 → `illegible`; **36 false negatives** (Alex sets None→double_height) | FNR 36/126 = 29% |
| `illegible` | 0 | — | — (Alex sets 4) | mostly missed |

These are distinct from the `raw_text` content errors (#58 covers the related `type_raw` blank-rate failure) — the row-level text can be correct while the `notes` field is wrong.

## Why this matters

- **`double_height` false negatives shift bbox geometry.** The verifier's per-row crop cropper uses span counts derived from `notes`. A handwritten 2-row entry that Gemini emits as `notes=None` produces two crops that each contain half the handwriting. Alex sees a visually broken row strip and has to set `double_height` to fix it. The page-34 bbox bug (PR #60) is unrelated but the cropper here is downstream of the same span concept.
- **`continuation` is never emitted.** The Phase 1 prompt mentions it (see `core/prompts.py`) and the read-time merge in `core/continuations.py` is wired to handle it, but the model doesn't surface it. Alex has to set continuation manually for every multi-line song entry, and read-time merging never fires in the corpus.
- **`crossed_out` over-emission wastes Alex's time.** Three of every four crossed-out tags are removed.

## Desired end state

Per-row `notes` matches the visual cues on the page at parity with the metadata fields (which are already at 0 corrections / 19 pages). Concretely:

- `continuation`: emitted on multi-line handwritten entries. Specifically the case where one song's title or notes wrap onto the printed line below.
- `crossed_out`: emitted only when a strike-through actually crosses the song name (not when a margin doodle bleeds into the row).
- `double_height`: emitted on all entries whose handwriting visually spans two printed-grid rows.

## Where

- `core/prompts.py` — `PAGE_EXTRACTION_PROMPT` and quadrant-prompt counterparts. The current language around `notes` likely needs (a) explicit definition of each value with a worked example, (b) a stronger directive to emit `continuation` on multi-line entries.
- `data/verifier-pulled-refresh/*.corrections.json` — empirical dataset of (Gemini-emitted, Alex-corrected) pairs. Use for evaluation; don't burn API tokens iterating without measuring.

## Constraints

- Don't restructure the `notes` enum. The four current values (`continuation`, `double_height`, `crossed_out`, `illegible`) plus `other` cover the observed cases. The fix is reading them correctly, not extending the vocabulary.
- The read-time `core/continuations.py` merge must keep working — its on-disk format unchanged, so a prompt fix that makes the model surface continuation should slot in transparently.
- Don't break the strong-performing fields (`page_date_raw`, `hour_raw`, `jock_raw`, the structural quadrant detection). Prompt edits scoped to the notes-column instruction block, not a rewrite.

## Acceptance criteria

- [ ] On a held-out subset of the verified corpus, re-extraction shows the `crossed_out` precision ≥ 60% (vs. current ~22%), `continuation` recall ≥ 60% on Alex-marked rows, and `double_height` FNR ≤ 15% (vs. current 29%).
- [ ] No regression on `raw_text` substring match rate or on page/quadrant metadata edit counts.
- [ ] Existing `core.continuations.merge_continuations` tests still pass; if the prompt now produces continuation tags, add an end-to-end test that the read-time merge folds them as expected.

## Notes for implementer

- Worked examples carry more weight than enumerations here, especially for `continuation` (which is a visual concept that's hard to define in prose: "the line below has handwriting that completes the song name above").
- The 19-pages-1-PDF corpus is the same era and same DJ rotation; whatever prompt changes you make should be evaluated against held-out pages from different PDFs / decades before scaling. The existing `external_api` golden run is the natural gate.
- The `notes`-column prompt section may already share content with `type_raw` (#58); be careful not to combine the two prompt edits in one PR — they have independent acceptance criteria.

## Related

- #58 — `type_raw` blanked on 44% of rows. Same Phase-1-prompt-quality cluster but different field.
- #44, #47 — closed; the over-emit hypothesis the row-count check targeted doesn't appear in the corpus.
- Empirical source: `data/verifier-pulled-refresh/*.corrections.json` (20 pages, ~1074 rows).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 1: notes column failure modes (crossed_out over-emitted, continuation never emitted, double_height missed) #61

Problem

Why this matters

Desired end state

Where

Constraints

Acceptance criteria

Notes for implementer

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Note value	Gemini emits	Alex keeps	Alex removes / reclassifies	Headline
`crossed_out`	27	6	16 removed; 3 → `double_height`, 2 → `continuation`	78% wrong
`continuation`	0	—	— (Alex sets 29 from scratch: 19 None→cont, 8 double_height→cont, 2 crossed_out→cont)	100% missed
`double_height`	102	88	6 removed; 8 → `continuation`, 2 → other, 1 → `illegible`; 36 false negatives (Alex sets None→double_height)	FNR 36/126 = 29%
`illegible`	0	—	— (Alex sets 4)	mostly missed

Phase 1: notes column failure modes (crossed_out over-emitted, continuation never emitted, double_height missed) #61

Description

Problem

Why this matters

Desired end state

Where

Constraints

Acceptance criteria

Notes for implementer

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions