Problem
The PDF extraction pipeline's table detector (tableDetector.ts) stops each table region at page boundaries. If a table spans multiple pages but only has a header row on the first page, the continuation rows on subsequent pages are not detected as part of the table. They get classified as prose or are lost entirely.
Root Cause
findRegionEnd returns when lines[i].page !== headerPage (line 42). This means each table region is confined to a single page. On page 2, there's no header candidate to start a new table region, so the continuation data rows are never captured.
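To make the failure concrete, here is a minimal sketch of the behavior described above. It is not the actual source of tableDetector.ts: the `Line` shape, the function signature, and the sample data are assumptions for illustration, and any other stop conditions the real function has are omitted.

```typescript
// Hypothetical line shape; the real type in tableDetector.ts may differ.
interface Line {
  page: number;
  text: string;
}

// Sketch of the reported behavior: the scan returns at the first line
// whose page differs from the header's page (exclusive end index).
function findRegionEnd(lines: Line[], headerIndex: number): number {
  const headerPage = lines[headerIndex]?.page;
  for (let i = headerIndex + 1; i < lines.length; i++) {
    if (lines[i]?.page !== headerPage) {
      return i; // region is cut off at the page boundary
    }
  }
  return lines.length;
}

// Illustrative data: header and one row on page 1, one continuation row on page 2.
const lines: Line[] = [
  { page: 1, text: "Date Description Amount" },
  { page: 1, text: "01/02 Coffee 3.50" },
  { page: 2, text: "01/03 Groceries 42.10" },
];

// The page 2 row (index 2) falls outside the region.
const end = findRegionEnd(lines, 0);
```

With this input the region covers indices 0 and 1 only, so the page 2 row is left for later stages to misclassify or drop.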
Affected Statements
Any multi-page bank or credit card statement where the transaction table overflows to the next page without repeating the column header row. Some banks repeat headers on each page (works fine), others don't (this bug).
Possible Approaches
- Detect continuation rows on subsequent pages that match the previous page's table schema (same column count, similar x-positions, date/amount content patterns)
- Allow findRegionEnd to cross page boundaries when the next page starts with rows matching the current table's column geometry
- Add a "headerless region" concept that inherits schema from the previous page's table
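The geometry-matching idea behind these approaches could be sketched roughly as follows. This is one possible shape, not a proposed patch: the `Row` type, `matchesSchema`, `findRegionEndAcrossPages`, and the `X_TOLERANCE` value are all hypothetical names and numbers, and the sketch assumes each row already carries the left x-position of its cells.

```typescript
// Hypothetical row shape: page number plus the left x-position of each cell.
interface Row {
  page: number;
  cellX: number[];
}

// Assumed tolerance for x-position drift between pages, in page units.
const X_TOLERANCE = 3;

// A row matches the schema if it has the same column count and each
// cell starts within the tolerance of the corresponding schema column.
function matchesSchema(row: Row, schemaX: number[], tol: number = X_TOLERANCE): boolean {
  if (row.cellX.length !== schemaX.length) return false;
  return row.cellX.every((x, i) => Math.abs(x - (schemaX[i] ?? Infinity)) <= tol);
}

// Extends the region onto the immediately following page as long as
// rows keep matching the column geometry taken from the header row.
// Returns the exclusive end index of the region.
function findRegionEndAcrossPages(rows: Row[], headerIndex: number): number {
  const header = rows[headerIndex];
  if (!header) return headerIndex;
  const schemaX = header.cellX;
  let currentPage = header.page;
  let i = headerIndex + 1;
  for (; i < rows.length; i++) {
    const row = rows[i];
    if (!row) break;
    const pageGap = row.page - currentPage;
    if (pageGap < 0 || pageGap > 1) break; // never skip a page or go backwards
    if (!matchesSchema(row, schemaX)) break; // prose, footer, or a different table
    currentPage = row.page;
  }
  return i;
}

// Illustrative data: header and one row on page 1, two continuation
// rows on page 2, then a single-column footer line.
const rows: Row[] = [
  { page: 1, cellX: [50, 120, 400] },
  { page: 1, cellX: [50, 121, 402] },
  { page: 2, cellX: [51, 120, 401] },
  { page: 2, cellX: [50, 119, 400] },
  { page: 2, cellX: [50] },
];

const regionEnd = findRegionEndAcrossPages(rows, 0);
```

In this sketch the schema comes from the header row's x-positions. If headers and data cells are aligned differently (for example, right-aligned amounts), deriving the schema from the first data rows instead may be more reliable. Content checks such as date and amount patterns, mentioned in the first approach, are left out here.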
Workaround
None. The data is lost during extraction. Users would need to manually verify that the extracted transaction count matches expectations.