Multi-page tables without repeated headers lose continuation rows #2

@AJ

Description

Problem

The PDF extraction pipeline's table detector (tableDetector.ts) ends each table region at the page boundary. If a table spans multiple pages but has a header row only on the first page, the continuation rows on subsequent pages are not detected as part of the table. They are either classified as prose or lost entirely.

Root Cause

findRegionEnd returns when lines[i].page !== headerPage (line 42). This means each table region is confined to a single page. On page 2, there's no header candidate to start a new table region, so the continuation data rows are never captured.
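
The behaviour can be illustrated with a simplified model. This is a hypothetical reconstruction, not the actual code in tableDetector.ts: the Line shape and the isDataRow flag are assumptions made for the example, and only findRegionEnd, lines[i].page, and headerPage come from the description above.

```typescript
// Hypothetical, simplified model of the page-boundary check described above.
// The Line shape and isDataRow flag are assumptions for illustration only.
interface Line {
  page: number;
  text: string;
  isDataRow: boolean;
}

// Returns the exclusive end index of the table region that starts at headerIndex.
function findRegionEnd(lines: Line[], headerIndex: number): number {
  const headerPage = lines[headerIndex].page;
  let i = headerIndex + 1;
  for (; i < lines.length; i++) {
    if (lines[i].page !== headerPage) break; // page boundary ends the region
    if (!lines[i].isDataRow) break;          // non-table content ends the region
  }
  return i;
}

const sampleLines: Line[] = [
  { page: 1, text: "Date  Description  Amount", isDataRow: false }, // header
  { page: 1, text: "01/02  Coffee  3.50", isDataRow: true },
  { page: 2, text: "01/03  Groceries  42.10", isDataRow: true },    // continuation row
];

// The region stops before the page-2 row, so that row is never captured.
const sampleEnd = findRegionEnd(sampleLines, 0);
console.log(sampleEnd); // 2
```

With no header candidate on page 2, nothing in this model would start a new region for the remaining row.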

Affected Statements

Any multi-page bank or credit card statement where the transaction table overflows to the next page without repeating the column header row. Some banks repeat headers on each page (works fine), others don't (this bug).

Possible Approaches

  • Detect continuation rows on subsequent pages that match the previous page's table schema (same column count, similar x-positions, date/amount content patterns)
  • Allow findRegionEnd to cross page boundaries when the next page starts with rows matching the current table's column geometry
  • Add a "headerless region" concept that inherits schema from the previous page's table
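
A minimal sketch of the second approach, under stated assumptions: rows carry a page number and the left x-position of each cell, and a row "matches" the table when it has the same column count and each x-position is within a small tolerance. The Row type, matchesSchema, extendAcrossPages, and the 3-point tolerance are all hypothetical names and values, not part of the existing detector.

```typescript
// Hypothetical row shape: page number plus the left x-position of each cell.
interface Row {
  page: number;
  xs: number[];
}

// Assumed tolerance in PDF points; a real value would need tuning.
const X_TOLERANCE = 3;

// A row matches the schema if it has the same column count and
// every cell starts within the tolerance of the expected x-position.
function matchesSchema(row: Row, schemaXs: number[], tol: number = X_TOLERANCE): boolean {
  if (row.xs.length !== schemaXs.length) return false;
  return row.xs.every((x, c) => Math.abs(x - schemaXs[c]) <= tol);
}

// Given the exclusive end index of a single-page region, keep extending it
// while the following rows are on the same or the next page and match the schema.
function extendAcrossPages(rows: Row[], regionEnd: number, schemaXs: number[]): number {
  let end = regionEnd;
  if (end === 0 || end >= rows.length) return end;
  let page = rows[end - 1].page;
  while (end < rows.length) {
    const row = rows[end];
    if (row.page < page || row.page > page + 1) break; // skipped a page: stop
    if (!matchesSchema(row, schemaXs)) break;          // geometry differs: stop
    page = row.page;
    end++;
  }
  return end;
}

const sampleRows: Row[] = [
  { page: 1, xs: [40, 120, 400] }, // header
  { page: 1, xs: [40, 120, 400] }, // data row
  { page: 2, xs: [41, 119, 401] }, // continuation row, slightly shifted
  { page: 2, xs: [40] },           // prose line after the table
];

// The single-page region ends at index 2; the extension picks up the page-2 row.
const extendedEnd = extendAcrossPages(sampleRows, 2, [40, 120, 400]);
console.log(extendedEnd); // 3
```

This sketch assumes the first line on the next page is already a table row. Real statements often begin a page with a running header or a "page N of M" line, so a practical version would likely need to skip such lines before testing the geometry, and could add the date/amount content checks mentioned in the first approach to reduce false matches.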

Workaround

None. The data is lost during extraction, so the user would need to manually verify that the total transaction count matches expectations.

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
