I am trying to extract English and Hindi text from a PDF file. The English text is extracted correctly, but in the Hindi text some characters are replaced by circles or squares. When I copy the Hindi snippet directly from the PDF file into a Word document, I get the same squares for some characters.

PDFBox Version: 2.0.7

PDF Version: 1.6(Acrobat 7.x)

Security Details (PDF): [screenshot]

Font Details: [screenshot]

I cannot attach the PDF, but here is a snippet of it as displayed in Adobe Acrobat Reader:

[screenshot: PDF File Snippet]

Note: I drew the black bar over part of the snippet because it contains someone's address.

Output of text extracted using PDFBox:

पता: कालकाजी, दि ण िद ी, िद ी - 110019

As the PDFBox output above shows, some of the characters are replaced by circles. The same happens when I manually copy the text from the PDF into a Word document.
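
A rough way to flag this kind of damage programmatically is sketched below. It is only a heuristic and rests on an assumption of mine: in well-formed Unicode Devanagari, a dependent vowel sign (matra) or virama follows a consonant, so one that appears at the start of the text or right after a space suggests that the base glyph was lost. The class name `OrphanedMatraCheck` and its methods are hypothetical helpers, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

/** Heuristic check for Devanagari text that lost base glyphs during extraction. */
final class OrphanedMatraCheck {

    /** True for any code point in the Devanagari block (U+0900..U+097F). */
    static boolean isDevanagari(int codePoint) {
        return codePoint >= 0x0900 && codePoint <= 0x097F;
    }

    /** True for dependent vowel signs and the virama (U+093E..U+094D). */
    static boolean isDependentSign(int codePoint) {
        return codePoint >= 0x093E && codePoint <= 0x094D;
    }

    /**
     * Returns the char indices of dependent signs that do not directly
     * follow another Devanagari code point (assumption: such a sign is orphaned).
     */
    static List<Integer> findOrphans(String text) {
        List<Integer> orphans = new ArrayList<>();
        int previous = -1; // -1 marks "start of text"
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (isDependentSign(codePoint) && !isDevanagari(previous)) {
                orphans.add(i);
            }
            previous = codePoint;
            i += Character.charCount(codePoint);
        }
        return orphans;
    }
}
```

Calling `OrphanedMatraCheck.findOrphans(extractedText)` on each extracted line would give the positions worth inspecting. An empty list does not prove the text is intact, since a missing conjunct in the middle of a word leaves no orphaned sign behind.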

I have also tried Tesseract OCR, but its output is even worse. Are there any other options I can try?

For instance, can PDFBox extract the data as an image rather than as text?

EDIT: I am also getting the following warning:

03:58:38.711 [main] WARN o.a.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+26 (26) in font Lohit-Devanagari

Abhinav P
  • Please edit your question with the responses to: 1) What version are you using 2) What do you get with Adobe Reader? – Tilman Hausherr Sep 20 '17 at 19:26
  • @TilmanHausherr Done! – Abhinav P Sep 20 '17 at 21:01
  • Sorry I see that (2) was already in the initial text. Sadly, if Adobe Reader can't do it, then PDFBox can't do it either. Extracting as an image, just try the ExtractImages command line utility. Or convert the PDF to image. But then you'll still have to do OCR... – Tilman Hausherr Sep 20 '17 at 21:08
  • No, Adobe Reader displays the correct text. But when I extract the text from the PDF using PDFBox, the text is messed up. So is there any way in PDFBox to extract text fields in a PDF as images? That could do the trick for me. – Abhinav P Sep 20 '17 at 21:11
  • You wrote "...are replaced by circles. The same happens when I manually copy from PDF to a word document" so I thought that Adobe Reader can't do text extraction either. Converting a whole page to image see https://stackoverflow.com/questions/23326562/apache-pdfbox-convert-pdf-to-images . There's no API to extract text fields as images, you have to do that as image post processing. – Tilman Hausherr Sep 20 '17 at 21:17
  • Sorry for the misunderstanding. Thank you! – Abhinav P Sep 20 '17 at 21:21
  • @AbhinavP were you able to get the Hindi text with any library? – Brajendra Pandey Sep 26 '17 at 06:16
  • @BrajendraPandey Nope, I tried PDFBox and OCR in Java. Neither worked for me. – Abhinav P Sep 27 '17 at 14:00
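
The image post-processing step mentioned in the comments could look roughly like the sketch below. It assumes the page has already been rendered to a `BufferedImage`, for example with PDFBox 2.0's `PDFRenderer.renderImageWithDPI(pageIndex, dpi)` as in the linked question, and that the region of interest is known in PDF points measured from the top-left corner of the page. PDF user space normally has its origin at the bottom-left, so the caller would have to flip the y coordinate first. `RegionCropper` and the coordinates used with it are hypothetical and not part of PDFBox.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Cuts a rectangular region out of a page image that was rendered at a known DPI. */
final class RegionCropper {

    /** PDF user space uses 72 points per inch by default. */
    static final float POINTS_PER_INCH = 72f;

    /** Converts a rectangle given in PDF points (top-left origin assumed) to pixels. */
    static Rectangle toPixels(float x, float y, float width, float height, float dpi) {
        float scale = dpi / POINTS_PER_INCH;
        return new Rectangle(
                Math.round(x * scale),
                Math.round(y * scale),
                Math.round(width * scale),
                Math.round(height * scale));
    }

    /** Returns the part of the page image inside the region, clipped to the page bounds. */
    static BufferedImage crop(BufferedImage page, Rectangle region) {
        Rectangle pageBounds = new Rectangle(page.getWidth(), page.getHeight());
        Rectangle clipped = region.intersection(pageBounds);
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("Region lies outside the page image");
        }
        return page.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
    }
}
```

With a page rendered at 300 DPI, `RegionCropper.crop(pageImage, RegionCropper.toPixels(72, 36, 144, 18, 300))` would return the strip that starts one inch from the left and half an inch from the top. The cropped image still has to go through OCR to become text.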
