I am new to PDFBox. I am reading a PDF file that is in Hindi,
and I am having trouble reading some Unicode characters out of it using PDFBox.
I want to copy the text into Java objects so that I can work with it.
I have tried a couple of approaches to read the file.
1. I tried to use PDFTextStripper to read the text from the document, but it prints garbage values and warnings about missing Unicode mappings.
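import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.text.PDFTextStripper;

PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
System.out.println(s.getText(document)); // prints garbage values
System.out.println(document.getNumberOfPages()); // correct output
PDPageTree pages = document.getPages();
System.out.println(pages.get(0).getResources().getFontNames()); // prints [COSName{TT1}, COSName{TT3}, COSName{TT8}]
document.close(); // release the file once done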
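To put a number on how much of the extracted text is garbage, one rough option is to measure what fraction of the characters fall in the Devanagari Unicode block. This is only a sketch using the Java standard library. The names `DevanagariCheck` and `devanagariRatio` are hypothetical and not part of PDFBox, and the sample string stands in for whatever `s.getText(document)` returns.

```java
public class DevanagariCheck {

    /**
     * Returns the fraction of non-whitespace code points in the text
     * that belong to the Devanagari Unicode block (0.0 for empty input).
     */
    static double devanagariRatio(String text) {
        long total = 0;
        long devanagari = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace says nothing about the script
            }
            total++;
            if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.DEVANAGARI) {
                devanagari++;
            }
        }
        return total == 0 ? 0.0 : (double) devanagari / total;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a Hindi word followed by Latin letters.
        String sample = "\u0928\u092E\u0938\u094D\u0924\u0947 abc";
        System.out.println(devanagariRatio(sample));
    }
}
```

A ratio close to 1.0 would suggest the text came out as Hindi. A low ratio on a page that is visibly Hindi would be consistent with the missing mappings, although ASCII digits and punctuation in the page also lower the value.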
2. I tried to simply extract the contents of the file and write them to another file. To my surprise, it does read some characters (e.g. the text that is selected in the image), but I am not able to read the values that are written in bold.
private static void extractTextUse(String pdfFile) throws IOException {
    ExtractText.main(new String[]{pdfFile, "E:\\try-1.txt"});
}
Basically, I want to copy the text into Java objects.
Below are the warnings I get while reading the PDF file in both cases:
Sep 05, 2016 10:00:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+231 (231) in font JCBMGH+Mangal
Sep 05, 2016 10:00:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+232 (232) in font JCBLPH+Mangal,Bold
Sep 05, 2016 10:00:38 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+227 (227) in font JCBLPH+Mangal,Bold
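To see which fonts and CIDs are affected, the warnings can be grouped by font. The following is a minimal sketch that only parses log lines in the format shown above using the Java standard library. The class name `MissingMappingSummary` and the method `summarize` are hypothetical, and the regular expression assumes the message format stays exactly as printed here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MissingMappingSummary {

    // Matches e.g. "No Unicode mapping for CID+231 (231) in font JCBMGH+Mangal"
    private static final Pattern WARNING = Pattern.compile(
            "No Unicode mapping for CID\\+(\\d+) \\(\\d+\\) in font (\\S+)");

    /** Groups the CIDs that lack a Unicode mapping by font name. */
    static Map<String, SortedSet<Integer>> summarize(List<String> logLines) {
        Map<String, SortedSet<Integer>> byFont = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                byFont.computeIfAbsent(m.group(2), k -> new TreeSet<>())
                      .add(Integer.parseInt(m.group(1)));
            }
        }
        return byFont;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "WARNING: No Unicode mapping for CID+231 (231) in font JCBMGH+Mangal",
                "WARNING: No Unicode mapping for CID+232 (232) in font JCBLPH+Mangal,Bold",
                "WARNING: No Unicode mapping for CID+227 (227) in font JCBLPH+Mangal,Bold");
        // Prints one line per font with its sorted list of unmapped CIDs.
        summarize(log).forEach((font, cids) -> System.out.println(font + " -> " + cids));
    }
}
```

For the three warnings above, this groups CIDs 227 and 232 under `JCBLPH+Mangal,Bold` and CID 231 under `JCBMGH+Mangal`, so both the regular and the bold font subsets have glyphs without a Unicode mapping.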