[Bee] add content to word_cells when "PyPdfiumDocumentBackend" is used in PDF pipeline #3370

### Requested feature
To get better PDF processing performance, we use **PyPdfiumDocumentBackend** for parsing, which is much faster than **doclingparseV4backend**. However, PyPdfiumDocumentBackend explicitly sets `word_cells = []` and `has_words = False`; only `textline_cells` carries content, and that content is line-based. I checked with Duso, and it said this is by design. The WDU team, however, needs single-word-based token generation in the final output, hence this feature request.
...

### Alternatives
pypdfium2 provides per-character text and bounding boxes. Could docling generate `word_cells` from them?
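
To make the idea concrete, here is a rough sketch of how per-character text and boxes could be merged into word-level entries. It is not docling code: `Word`, `group_chars_into_words`, and the `gap_ratio` threshold are hypothetical names I made up for illustration, and I am assuming boxes arrive as `(left, bottom, right, top)` in reading order. How the characters are pulled out of pypdfium2 is left out on purpose.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

# Assumed box layout: (left, bottom, right, top), y axis pointing up.
BBox = Tuple[float, float, float, float]


@dataclass
class Word:
    """Hypothetical word-level cell: merged text plus the union of its char boxes."""
    text: str
    bbox: BBox


def group_chars_into_words(
    chars: Iterable[Tuple[str, BBox]], gap_ratio: float = 0.5
) -> List[Word]:
    """Merge characters (in reading order) into words.

    A word ends at a whitespace character, at a horizontal gap wider than
    gap_ratio * current word height, or when the next character has no
    vertical overlap with the current word (treated as a new line).
    """
    words: List[Word] = []
    text = ""
    box: Optional[BBox] = None

    for ch, (left, bottom, right, top) in chars:
        is_space = ch.isspace()
        if box is not None:
            height = box[3] - box[1]
            gap = left - box[2]
            new_line = bottom >= box[3] or top <= box[1]
            if is_space or new_line or gap > gap_ratio * height:
                words.append(Word(text, box))
                text, box = "", None
        if is_space:
            continue  # whitespace only separates words, it is never stored
        if box is None:
            text, box = ch, (left, bottom, right, top)
        else:
            text += ch
            box = (
                min(box[0], left),
                min(box[1], bottom),
                max(box[2], right),
                max(box[3], top),
            )

    if box is not None:
        words.append(Word(text, box))
    return words


sample = [
    ("H", (0, 0, 5, 10)),
    ("i", (5, 0, 8, 10)),
    (" ", (8, 0, 10, 10)),
    ("y", (10, 0, 15, 10)),
    ("o", (15, 0, 20, 10)),
]
words = group_chars_into_words(sample)
```

The gap and line-break heuristics are deliberately simple; rotated text, right-to-left scripts, and ligatures would need more care, so this is only meant to show that the per-character data is enough to derive word boxes.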

