4

When extracting text from some PDFs, PDFBox returns gibberish. This is caused by a missing or corrupt Unicode mapping. I see the following warnings on the console. I want to detect this condition in code so that I can flag these PDFs as corrupt.

I'm looking for a solution that is better than parsing logs.

Thanks for your help!

Sample Console Logs:

WARNING: No Unicode mapping for CID+32 (32) in font F6
WARNING: Failed to find a character mapping for 32 in TimesNewRoman,Bold

The post mentioned below discusses the same issue but does not cover how to detect and handle it in code: Issue with reading some unicode characters out of a PDF using PDFBox

Richard Neish
Magpies3
  • 3
    There are a couple different ways to use PDFBox, could you share exactly how you're running it? See [mcve] for help including a good amount of code so that people wanting to answer your question can recreate your issue and better help you. – Davy M Mar 17 '19 at 15:40
  • As @DavyM mentions, please provide some code to show how you use PDFBox and whenever possible, which part of the code fails – Al-un Mar 17 '19 at 16:01
  • try (final PDDocument document = PDDocument.load(new File(sourceFile))){ PDFTextStripper str = new PDFTextStripper(); str.getText(document);//This line throws the warnings .... } catch (IOException e) { System.out.println(e); } – Magpies3 Mar 17 '19 at 16:10
  • Below mentioned post also talks about the same issue but doesn't talk about ways to be able to detect this on code side and handle the same. https://stackoverflow.com/questions/39324398/issue-with-reading-some-unicode-characters-out-of-a-pdf-using-pdfbox – Magpies3 Mar 17 '19 at 16:21
  • 2
    @Magpies3 Information that improves your question, like that code, as well as that additional research should be edited into your question, not placed as comments. That way, all the information to answer your question is located in the question itself. – Davy M Mar 17 '19 at 16:28
  • I have updated the description with the details shared above. Thanks – Magpies3 Mar 17 '19 at 23:09
  • You didn't link to the PDF. I think a solution would be to use the `PrintTextLocations.java` example from the source code download and check whether `text.getUnicode()` is null or empty. – Tilman Hausherr Mar 18 '19 at 08:53
  • For the reason given here: [link](https://stackoverflow.com/questions/37862159/pdf-reading-via-pdfbox-in-java), processTextPosition is not called for text with missing Unicode. I don't have permission to share the file I am working with. Please have a look at the file from the above-mentioned URL, which exhibits the same issue. [Sample File](https://drive.google.com/file/d/0B_Ke2amBgdpedUNwVTR3RVlRTFE/view) – Magpies3 Mar 18 '19 at 13:59
  • 1
    Sorry. Try overriding `showGlyph()`. – Tilman Hausherr Mar 18 '19 at 15:00
  • 2
    Thanks Tilman, this did the trick. I was able to override this method and get the font and code information for the glyphs whose Unicode mappings were missing. – Magpies3 Mar 19 '19 at 14:10

2 Answers

5

A fourth possibility (next to the three given in Aaron Digulla's answer) is to override showGlyph() when extending the PDFTextStripper class:

@Override
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
{
    super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
    if (unicode == null || unicode.isEmpty())
    {
        // No Unicode mapping for this glyph: record font.getName() and code here
    }
}
Lonzak
Tilman Hausherr
2

I see these solutions; all of them are a bit messy.

Solution #1: Install your own filter on the logger. The filter can check the log message and set a thread-local flag. Check the flag after calling getText(). Don't forget to clear the flag afterwards, or your thread-local map will fill up.

You can replace commons logging with something else, like logback, that supports MDC. You could then put the flag in the MDC.
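A minimal sketch of Solution #1 could look like the following. It assumes that PDFBox's commons-logging output ends up in java.util.logging (which the "WARNING:" console format suggests) under logger names starting with "org.apache.pdfbox". The class name UnicodeWarningDetector and its methods are hypothetical. It uses a Handler on the parent logger rather than a Filter, because a java.util.logging filter set on a parent logger is not applied to records logged by its child loggers, while a parent's handlers do receive them.

```java
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical helper: flags the current thread when a missing-Unicode warning is logged. */
final class UnicodeWarningDetector extends Handler {
    // Per-thread flag, so concurrent extractions do not affect each other.
    private static final ThreadLocal<Boolean> SEEN =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Strong reference: the LogManager only holds loggers weakly.
    // Assumption: PDFBox log records arrive under "org.apache.pdfbox.*".
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    /** Call once at startup; calling it again would add a second handler. */
    public static void install() {
        PDFBOX_LOGGER.addHandler(new UnicodeWarningDetector());
    }

    @Override
    public void publish(LogRecord record) {
        String message = record.getMessage();
        // Prefixes taken from the console output quoted in the question.
        if (message != null
                && (message.startsWith("No Unicode mapping")
                    || message.startsWith("Failed to find a character mapping"))) {
            SEEN.set(Boolean.TRUE);
        }
    }

    @Override
    public void flush() { }

    @Override
    public void close() { }

    /** Returns the flag for the current thread and removes it from the thread-local map. */
    public static boolean checkAndReset() {
        boolean seen = SEEN.get();
        SEEN.remove();
        return seen;
    }
}
```

With this in place, one would call UnicodeWarningDetector.install() once, run getText() as before, and call UnicodeWarningDetector.checkAndReset() on the same thread right afterwards. Matching on message text is as fragile as parsing logs, so this is only a little cleaner than what the question wanted to avoid.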

Solution #2: Patch the sources of PDFBox. In the classes PDSimpleFont and PDType0Font add a getter:

public boolean hadEncodingProblems() {
    return !noUnicode.isEmpty();
}

There should be a way to get all fonts after calling getText().

Solution #3: Use reflection to read the field value (kudos to mkl). Note that this can break with new Java versions or when a SecurityManager is installed or the default one is activated.
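The reflection approach can be sketched roughly as below. PrivateFieldReader and StandInFont are hypothetical names; StandInFont merely imitates a font class with a private noUnicode collection so that the example runs without PDFBox. The sketch assumes the real field is a collection, as the isEmpty() call in the getter of Solution #2 suggests.

```java
import java.lang.reflect.Field;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper that inspects a private collection field via reflection. */
final class PrivateFieldReader {
    /** Looks for the field on the given class and then on each superclass. */
    static Field findField(Class<?> type, String name) throws NoSuchFieldException {
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            try {
                return c.getDeclaredField(name);
            } catch (NoSuchFieldException e) {
                // Not declared on this class; continue with the superclass.
            }
        }
        throw new NoSuchFieldException(name);
    }

    /** Returns true if the named field holds a non-empty collection. */
    static boolean hasEntries(Object target, String fieldName) {
        try {
            Field field = findField(target.getClass(), fieldName);
            // May fail under a SecurityManager or strong module encapsulation.
            field.setAccessible(true);
            Object value = field.get(target);
            return value instanceof Collection && !((Collection<?>) value).isEmpty();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read field " + fieldName, e);
        }
    }
}

/** Stand-in for a font class; NOT part of PDFBox. */
final class StandInFont {
    private final Set<Integer> noUnicode = new HashSet<>();

    void recordMissing(int code) {
        noUnicode.add(code);
    }
}
```

Against real fonts one would pass each font object used by the document together with the field name "noUnicode" after getText() has run. Since the field is a private implementation detail, a PDFBox upgrade may rename or remove it without notice.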

Aaron Digulla
  • Yeah, I was also thinking along similar lines but wanted to check if somebody had a suggestion for a cleaner solution. Thanks for your input; I will wait to see if anyone else has a suggestion before I mark this closed. – Magpies3 Mar 18 '19 at 14:14
  • An alternative might be to apply reflection to access those `noUnicode` variables. – mkl Mar 18 '19 at 15:34