Extra symbols when converting PDF to image with PDFBox

Question

I use Apache PDFBox 1.8.9. I have a one-page PDF that contains text, and I want to convert this page to an image. The PDF was created with LibreOffice. I use the following code:

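import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.ImageIOUtil;

// Render every page at 300 DPI and write it out as a PNG (PDFBox 1.8.x API).
PDDocument document = PDDocument.loadNonSeq(new File(filename), null);
List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
int page = 0;
for (PDPage pdPage : pdPages) {
    ++page;
    BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
    ImageIOUtil.writeImage(bim, "png", "/home/file" + "-" + page, 300);
}
document.close();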

The code works and I get a PNG image. The problem is that the image contains a lot of strange extra symbols, which make the text very difficult to read. How can I fix this?

The image I get is this (zoomed in):

[Image: bad conversion]

and this is the same area in a PDF viewer:

[Image: original input PDF]

The full PDF file can be downloaded at https://yadi.sk/i/iX-KJwlhhXMY2

The problem does not lie in the PDF - you don't say as much, but I assume the page shows correctly when opened with a regular PDF viewer. PDFBox seems to miscalculate a few font encodings there. Can you post this particular PDF on a public server for others to look at? — Jongware, Jun 28 '15 at 09:14
@Jongware Yes, in a PDF viewer everything is OK. I have posted the PDF file. — Pavel_K, Jun 28 '15 at 09:36
After inspecting the file: yes, it *is* correct, and my own tool can successfully extract the original text. PDFBox makes a weird mistake in decoding the font: some characters are converted correctly, but for others it seems to **double interpret** the encoding. For instance, the *code* `01` is first decoded as `I` (correct), and then *this* code `49` gets remapped *again*, as `я` (see your first line). This needs to be looked at by a PDFBox expert. — Jongware, Jun 28 '15 at 10:22
Your file will work fine with the unreleased version 2.0, which has a different API. — Tilman Hausherr, Jun 28 '15 at 10:58
@Tilman Hausherr So is this a bug in 1.8.9? And can 2.0 be used in production mode? — Pavel_K, Jun 28 '15 at 11:01
@JimJim2000 Yes, it is a bug in 1.8.9. Using 2.0 in production is, well, risky: you need to watch what's going on in the issues and decide on a version that is good for you. Since about Friday it hasn't been so good, but that will be fixed soon. — Tilman Hausherr, Jun 28 '15 at 11:09
@Tilman Hausherr Could you provide an example of how to get an image in 2.0? I can't find one on the internet. If you post it as an answer, I will accept it and we can close the question. — Pavel_K, Jun 28 '15 at 11:12
@JimJim2000 just go here :-) https://stackoverflow.com/questions/23326562/apache-pdfbox-convert-pdf-to-images — Tilman Hausherr, Jun 28 '15 at 11:18
@TilmanHausherr: so we can close this question as a duplicate? — Jongware, Jun 28 '15 at 11:58
@Jongware it is a duplicate, although not of the one in the link I gave, rather this https://stackoverflow.com/questions/24237313/pdfbox-pdf-to-image-generates-overlapping-text — Tilman Hausherr, Jun 28 '15 at 12:21
@Tilman: I can confirm the same issue with the PDF therein. The full stop, for example, first gets converted correctly from `18` to `.` (the correct Unicode `2E`), and then from that to `V`, the character in *position* `2E`. Good to hear it's resolved in a forthcoming new version! — Jongware, Jun 28 '15 at 12:49
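
The double interpretation described in the comments above can be illustrated with a small stand-alone sketch. It does not use PDFBox and is not the library's actual code. The class `DoubleDecodeDemo`, the table `FONT`, and both methods are hypothetical names; the four table entries are the mappings reported in the comments (`01` to `I`, `18` to `.`, position `2E` holding `V`, position `49` holding `я`).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: models a font whose character codes
// differ from the Unicode values of the characters they depict.
class DoubleDecodeDemo {
    // Assumed table: character code in the PDF -> character shown by the glyph.
    static final Map<Integer, Character> FONT = new HashMap<>();
    static {
        FONT.put(0x01, 'I');       // code 01 depicts 'I' (U+0049)
        FONT.put(0x18, '.');       // code 18 depicts '.' (U+002E)
        FONT.put(0x2E, 'V');       // position 2E happens to hold 'V'
        FONT.put(0x49, '\u044F');  // position 49 happens to hold Cyrillic 'ya'
    }

    // Correct behaviour: look the code up exactly once.
    static char decodeOnce(int code) {
        return FONT.getOrDefault(code, '?');
    }

    // Faulty behaviour: the decoded Unicode value is fed back in
    // as if it were another character code in the same font.
    static char decodeTwice(int code) {
        char first = decodeOnce(code);
        Character second = FONT.get((int) first);
        return second != null ? second : first;
    }

    public static void main(String[] args) {
        int[] codes = {0x01, 0x18};
        for (int code : codes) {
            System.out.printf("code %02X: once=U+%04X twice=U+%04X%n",
                    code, (int) decodeOnce(code), (int) decodeTwice(code));
        }
    }
}
```

In this model a character is only corrupted when its Unicode value collides with another code in the same font, which would be consistent with the observation that some characters convert correctly while others do not.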

0 Answers