Issue with reading some unicode characters out of a PDF using PDFBox

Question

I am new to PDFBox. I am reading a PDF file that is in Hindi.
I am having trouble extracting some Unicode characters from it using PDFBox.
I want to copy the text into Java objects so that I can work with it.

There are a couple of things I tried for reading the file.
1. I tried to use PDFTextStripper to read the text from the document, but it prints garbage values and logs warnings about missing Unicode mappings.

    PDDocument document = PDDocument.load(pathToFile);
    PDFTextStripper s = new PDFTextStripper();
    System.out.println(s.getText(document)); // prints garbage values
    System.out.println(document.getNumberOfPages()); // right output
    PDPageTree pages = document.getPages();
    System.out.println(pages.get(0).getResources().getFontNames()); // prints [COSName{TT1}, COSName{TT3}, COSName{TT8}]

2. I tried to simply extract the contents of the file and write them to another file. To my surprise, it does read some characters (e.g. the text which is selected in the image), but I am not able to read the values which are written in bold.
```
private static void extractTextUse(String pdfFile) throws IOException
{
    ExtractText.main(new String[]{pdfFile, "E:\\try-1.txt"}); 
}
```

I basically want to copy the text into Java objects.

Below are the warnings I get while reading the PDF file in both cases:

Sep 05, 2016 10:00:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+231 (231) in font JCBMGH+Mangal
Sep 05, 2016 10:00:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+232 (232) in font JCBLPH+Mangal,Bold
Sep 05, 2016 10:00:38 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+227 (227) in font JCBLPH+Mangal,Bold
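
To gauge how many glyphs are affected, warnings like the ones above can be grouped by font. This is only a minimal sketch using plain Java regular expressions; `MissingCidSummary` and `summarize` are hypothetical names, not part of PDFBox, and the pattern assumes the log lines keep exactly the format shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): groups "No Unicode mapping" warnings by font.
final class MissingCidSummary {

    // Assumes the warning format shown in the log above, e.g.
    // "WARNING: No Unicode mapping for CID+231 (231) in font JCBMGH+Mangal"
    private static final Pattern WARNING =
            Pattern.compile("No Unicode mapping for CID\\+(\\d+) \\(\\d+\\) in font (\\S+)");

    // Returns font name -> sorted set of CIDs that had no Unicode mapping.
    static Map<String, SortedSet<Integer>> summarize(List<String> logLines) {
        Map<String, SortedSet<Integer>> byFont = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                int cid = Integer.parseInt(m.group(1));
                byFont.computeIfAbsent(m.group(2), k -> new TreeSet<>()).add(cid);
            }
            // Lines that are not mapping warnings (e.g. timestamp lines) are skipped.
        }
        return byFont;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "WARNING: No Unicode mapping for CID+231 (231) in font JCBMGH+Mangal",
                "WARNING: No Unicode mapping for CID+232 (232) in font JCBLPH+Mangal,Bold",
                "WARNING: No Unicode mapping for CID+227 (227) in font JCBLPH+Mangal,Bold");
        summarize(log).forEach((font, cids) -> System.out.println(font + " -> " + cids));
    }
}
```

If many distinct CIDs show up for the fonts actually used on the page, the extracted text for those fonts is unlikely to be usable.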

Because the unicode mapping is not available, see https://stackoverflow.com/questions/37862159/pdf-reading-via-pdfbox-in-java and https://stackoverflow.com/questions/35972788/how-to-read-control-characters-in-a-pdf-using-java . Contact the creator of the file. — Tilman Hausherr, Sep 05 '16 at 08:08
Those lists (voter lists?) are well known for having incomplete or even false internal font information and, therefore, for leading text extractors astray, see also http://stackoverflow.com/a/15566820/1729265 and http://stackoverflow.com/a/30804279/1729265 — mkl, Sep 05 '16 at 10:38
@mkl thanks I get the issue. Also saw some other PDFBOX jira tickets which explained the issue more. — Viraj Nalawade, Sep 06 '16 at 09:07
@please You're welcome! Please delete your question, to avoid orphans. You'll probably have to use OCR. A few people have done this here on stackoverflow, i.e. 1) render with PDFBox (at least 300dpi), and then OCR with tesseract. — Tilman Hausherr, Sep 06 '16 at 09:14
@TilmanHausherr I will delete this question soon. Thank you so much. Can you please guide me on how I can start using OCR for extracting text from this PDF? I am already going through this http://www.mythoughtspot.com/2014/10/23/use-tesseract-ocr-with-pdf-file/comment-page-1/ and hope I am moving in the right direction. — Viraj Nalawade, Sep 08 '16 at 18:31
@VirajNalawade convert: https://stackoverflow.com/questions/23326562/apache-pdfbox-convert-pdf-to-images I can't help with tesseract because I haven't used it myself. Search for it here :-) — Tilman Hausherr, Sep 08 '16 at 18:42
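
For the OCR step discussed in these comments, one possible starting point is to call the `tesseract` command-line tool from Java on page images that have already been rendered (at 300 dpi or more, as suggested above). This is a rough sketch under several assumptions: `tesseract` is installed and on the PATH, its Hindi language data (`hin`) is available, and the class name `TesseractRunner` and the file names are made up for illustration. Rendering the pages to images is not shown here.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs the tesseract CLI on one already-rendered page image.
final class TesseractRunner {

    // Builds the command line: tesseract <image> <outputBase> -l <language>
    // tesseract itself appends ".txt" to outputBase when writing the result.
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return Arrays.asList(
                "tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Starts tesseract, waits for it to finish, and returns its exit code.
    static int run(Path image, Path outputBase, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                .inheritIO() // show tesseract's own messages on the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: TesseractRunner <image> <outputBase>");
            return;
        }
        // "hin" is assumed to be installed as tesseract language data for Hindi.
        int exitCode = run(Paths.get(args[0]), Paths.get(args[1]), "hin");
        System.out.println("tesseract exit code: " + exitCode);
    }
}
```

Running it as `java TesseractRunner page-1.png page-1` (hypothetical file names) would leave the recognized text in `page-1.txt`, which can then be read back into a Java `String` as UTF-8.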
