Requested feature
To get better PDF processing performance, we use PyPdfiumDocumentBackend for parsing, which is much faster than DoclingParseV4DocumentBackend. However, PyPdfiumDocumentBackend explicitly sets word_cells = [] and has_words = False, so only textline_cells carries the content, and that content is line-based. I checked with Duso, and it said this is by design. The WDU team, however, needs single-word tokens in the final output, hence this feature request.
...
Alternatives
pypdfium2 provides per-character text and bounding boxes. Could docling generate word_cells from them?
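To make the idea concrete, here is a minimal sketch of how characters and their bounding boxes could be grouped into words. It is not docling's implementation and does not call pypdfium2: `Char`, `Word`, `group_chars_into_words` and the `gap_ratio` threshold are hypothetical names introduced only for illustration. It assumes characters arrive in reading order with boxes given as (left, bottom, right, top), and it starts a new word on whitespace, on a wide horizontal gap, or when a character no longer overlaps the previous one vertically.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box layout: (left, bottom, right, top) in page coordinates.
BBox = Tuple[float, float, float, float]


@dataclass
class Char:  # hypothetical container for one extracted character
    text: str
    bbox: BBox


@dataclass
class Word:  # hypothetical container for one grouped word
    text: str
    bbox: BBox


def _union(a: BBox, b: BBox) -> BBox:
    """Smallest box that covers both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def group_chars_into_words(chars: List[Char], gap_ratio: float = 0.3) -> List[Word]:
    """Group characters (in reading order) into words.

    gap_ratio is an assumed heuristic: a horizontal gap wider than
    gap_ratio * (height of the previous character) starts a new word.
    """
    words: List[Word] = []
    current: List[Char] = []

    def flush() -> None:
        # Emit the characters collected so far as one word.
        if current:
            bbox = current[0].bbox
            for c in current[1:]:
                bbox = _union(bbox, c.bbox)
            words.append(Word("".join(c.text for c in current), bbox))
            current.clear()

    for ch in chars:
        # Whitespace (or an empty character) always ends the current word.
        if not ch.text or ch.text.isspace():
            flush()
            continue
        if current:
            prev = current[-1]
            height = prev.bbox[3] - prev.bbox[1]
            gap = ch.bbox[0] - prev.bbox[2]
            # Characters share a line if their vertical extents overlap.
            same_line = ch.bbox[1] < prev.bbox[3] and ch.bbox[3] > prev.bbox[1]
            if not same_line or gap > gap_ratio * height:
                flush()
        current.append(ch)

    flush()
    return words
```

Each resulting word carries the union of its character boxes, which is the kind of data word_cells would need. A real implementation would also have to handle rotated text, right-to-left scripts and ligatures, which this sketch ignores.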