So I am trying to extract English and Hindi text from a PDF file. The English text is extracted properly. But when I try to extract the Hindi Text, some characters are replaced by circle/squares. I copied the Hindi text snippet directly from the PDF File to a Word document and I get the same squares for some characters.
PDFBox Version: 2.0.7
PDF Version: 1.6(Acrobat 7.x)
Font Details:
I cannot attach the PDF, but here is a snippet of the PDF File(Adobe Acrobat Reader).
Note: I have drawn the black bar as it contains the address of someone.
Output of text extracted using PDFBox:
पता: कालकाजी, दि ण िद ी, िद ी - 110019
As you can see from the output of PDFBox text extraction above, some of the characters are replaced by circles. The same happens when I manually copy from PDF to a word document.
I have tried tesseract OCR also, but that is giving an even worse output. I would like to know any other options that I can try?
For instance, extracting the data using PDFBox, not as a text but an image?
EDIT:: Also getting the following warnings.
03:58:38.711 [main] WARN o.a.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+26 (26) in font Lohit-Devanagari