[Bee] add content to word_cells when "PyPdfiumDocumentBackend" is used in PDF pipeline #3370

@yqliving

Description

Requested feature

To get better PDF processing performance, we use PyPdfiumDocumentBackend for parsing, which is much faster than DoclingParseV4DocumentBackend. However, PyPdfiumDocumentBackend explicitly sets word_cells = [] and has_words = False, so only textline_cells carries content, and that content is line-based. I checked with Duso, which said this is by design. The WDU team, however, needs single-word-based tokens in the final output, hence this feature request.
...

Alternatives

pypdfium2 provides per-character text and bounding boxes. Could docling generate word_cells from them?
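To illustrate the idea, here is a minimal sketch of grouping per-character boxes into word-level cells. It is not docling's or pypdfium2's API: `WordCell`, `merge_boxes`, and `group_chars_into_words` are hypothetical names, the input is assumed to be a reading-order sequence of `(character, (left, bottom, right, top))` pairs, and words are assumed to be separated by whitespace characters only. A real implementation would likely also need to split on large horizontal gaps and line changes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

# Assumed box layout: (left, bottom, right, top) in page coordinates.
BBox = Tuple[float, float, float, float]


@dataclass
class WordCell:
    """Hypothetical word-level cell: the word text plus its enclosing box."""
    text: str
    bbox: BBox


def merge_boxes(a: BBox, b: BBox) -> BBox:
    """Return the smallest box that encloses both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def group_chars_into_words(chars: Iterable[Tuple[str, BBox]]) -> List[WordCell]:
    """Group characters (in reading order) into words, splitting on whitespace."""
    words: List[WordCell] = []
    text = ""
    box: Optional[BBox] = None
    for ch, ch_box in chars:
        if ch.isspace():
            # Whitespace closes the current word, if there is one.
            if text and box is not None:
                words.append(WordCell(text, box))
            text, box = "", None
            continue
        text += ch
        box = ch_box if box is None else merge_boxes(box, ch_box)
    # Flush the last word when the input does not end with whitespace.
    if text and box is not None:
        words.append(WordCell(text, box))
    return words
```

With something like this, the backend could populate word_cells from the character data it already reads, instead of leaving the list empty.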

Metadata


Labels

enhancement: New feature or request
