Convert-PDFToText - possible bug #51

Description

@PrzemyslawKlys

Reported on LinkedIn, to be verified:

It seems that the function Convert-PDFToText does not work quite correctly. I have to test further, but for the moment (in my environment) it behaves like this:

Assuming that a PDF has multiple pages with text PageText1, PageText2, ..., PageTextN, after running the function I get a result where the text of each page is preceded by all the text from the previous pages, something like "PageText1PageText1PageText2PageText1PageText2PageText3" for a PDF of 3 pages.

It seems that (in my environment) I could fix it by explicitly creating a new text extraction strategy for every call to GetTextFromPage.

So, at line 1754,

[iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor]::GetTextFromPage($ExtractedPage, $iTextExtractionStrategy)

was changed to

[iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor]::GetTextFromPage($ExtractedPage, [iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy]::new())

After this fix, extraction worked as expected.
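
For anyone who wants to check the behaviour outside the module, the change can be sketched as a per-page loop that builds a new strategy on every iteration. This is a minimal sketch, not the module's actual source: it assumes the iText 7 assemblies are already loaded (for example by importing PSWritePDF), and $FilePath, $Reader, $Document, $Strategy and $PageTexts are hypothetical names chosen for illustration. The likely cause of the duplication is that a strategy object keeps the text chunks it has already received, so reusing one instance across pages returns the accumulated text each time.

```powershell
# Sketch only: assumes the iText 7 assemblies are already loaded
# (for example via Import-Module PSWritePDF). $FilePath is a hypothetical input.
$FilePath = 'C:\Temp\Example.pdf'

$Reader   = [iText.Kernel.Pdf.PdfReader]::new($FilePath)
$Document = [iText.Kernel.Pdf.PdfDocument]::new($Reader)
try {
    # Collect one string per page
    $PageTexts = for ($Page = 1; $Page -le $Document.GetNumberOfPages(); $Page++) {
        $ExtractedPage = $Document.GetPage($Page)
        # New strategy per page, so text gathered from earlier pages is not carried over
        $Strategy = [iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy]::new()
        [iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor]::GetTextFromPage($ExtractedPage, $Strategy)
    }
} finally {
    # Release the file handle even if extraction fails
    $Document.Close()
}

$PageTexts
```

Moving the `::new()` call above the loop, so that one $Strategy is shared by all pages, should reproduce the duplicated output described in this report, which makes the two variants easy to compare on the same file.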

Labels: bug (Something isn't working)