I use Apache PDFBox 1.8.9. I have a one-page PDF that contains text, and I want to convert this page to an image. The PDF was created with LibreOffice. I use the following code:

// Load the PDF with the non-sequential parser (PDFBox 1.8.x API).
PDDocument document = PDDocument.loadNonSeq(new File(filename), null);
List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
int page = 0;
for (PDPage pdPage : pdPages) {
    ++page;
    // Render each page as an RGB image at 300 DPI and write it as PNG.
    BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
    ImageIOUtil.writeImage(bim, "png", "/home/file" + "-" + page, 300);
}
document.close();

The code runs and I get a PNG image. The problem is that the image contains many strange extra symbols that make the text very difficult to read. How can I fix this?

The image I get is this (zoomed in):

[image: bad conversion]

and this is the same area in a PDF viewer:

[image: original input pdf]

The full PDF file can be downloaded at https://yadi.sk/i/iX-KJwlhhXMY2

Jongware
Pavel_K
  • The problem does not lie in the PDF - you don't say as much, but I assume the page shows correctly when opened with a regular PDF viewer. PDFBox seems to miscalculate a few font encodings there. Can you post this particular PDF on a public server for others to look at? – Jongware Jun 28 '15 at 09:14
  • @Jongware Yes, in a PDF viewer everything is OK. I have posted the PDF file. – Pavel_K Jun 28 '15 at 09:36
  • After inspecting the file: yes, it *is* correct, and my own tool can successfully extract the original text. PDFBox makes a weird mistake in decoding the font: some characters are converted correctly, but for others it seems to **double interpret** the encoding. For instance, the *code* `01` is first decoded as `I` (correct), and then *this* code `49` gets remapped *again*, as `я` (see your first line). This needs to be looked at by a PDFBox expert. – Jongware Jun 28 '15 at 10:22
  • Your file will work fine with the unreleased version 2.0, which has a different API. – Tilman Hausherr Jun 28 '15 at 10:58
  • @Tilman Hausherr So is this a bug in 1.8.9? And can 2.0 be used in production? – Pavel_K Jun 28 '15 at 11:01
  • @JimJim2000 Yes, it is a bug in 1.8.9. Using 2.0 in production is, well, risky: you need to watch what's going on in the issues and decide on a version that is good for you. Since about Friday it hasn't been so good, but that will be fixed soon. – Tilman Hausherr Jun 28 '15 at 11:09
  • @Tilman Hausherr Could you provide an example of how to get the image in 2.0? I can't find one on the internet. If you post it as an answer, I will accept it and we can close the question. – Pavel_K Jun 28 '15 at 11:12
  • @JimJim2000 just go here :-) https://stackoverflow.com/questions/23326562/apache-pdfbox-convert-pdf-to-images – Tilman Hausherr Jun 28 '15 at 11:18
  • @Tilman Hausherr Thank you! – Pavel_K Jun 28 '15 at 11:19
  • @TilmanHausherr: so we can close this question as a duplicate? – Jongware Jun 28 '15 at 11:58
  • @Jongware It is a duplicate, although not of the one in the link I gave, but rather of this one: https://stackoverflow.com/questions/24237313/pdfbox-pdf-to-image-generates-overlapping-text – Tilman Hausherr Jun 28 '15 at 12:21
  • @Tilman: I can confirm the same issue with the PDF therein. The full stop, for example, first gets converted correctly from `18` to `.` (the correct Unicode `2E`), and then from that to `V`, the character in *position* `2E`. Good to hear it's resolved in a forthcoming new version! – Jongware Jun 28 '15 at 12:49
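
To make the "double interpretation" described in the comments concrete, here is a minimal sketch in plain Java. It is not PDFBox code and does not reproduce how PDFBox 1.8.9 actually decodes fonts. The class name `DoubleDecodeDemo`, the table `FONT_TABLE`, and both method names are made up for illustration; the table only contains the four code-to-character pairs mentioned in the comments (`01` to `I`, `49` to `я`, `18` to `.`, `2E` to `V`).

```java
import java.util.HashMap;
import java.util.Map;

public class DoubleDecodeDemo {
    // Hypothetical font table: character code -> decoded character.
    // Only the four entries mentioned in the comments are modelled.
    static final Map<Integer, Character> FONT_TABLE = new HashMap<>();
    static {
        FONT_TABLE.put(0x01, 'I');       // 'I' is U+0049
        FONT_TABLE.put(0x18, '.');       // '.' is U+002E
        FONT_TABLE.put(0x49, '\u044F');  // Cyrillic small letter ya
        FONT_TABLE.put(0x2E, 'V');
    }

    // Correct behaviour: look the code up exactly once.
    static char decodeOnce(int code) {
        Character c = FONT_TABLE.get(code);
        return c != null ? c : '?';
    }

    // Faulty behaviour: the decoded character's code point is
    // treated as a character code and looked up a second time.
    static char decodeTwice(int code) {
        char first = decodeOnce(code);
        Character second = FONT_TABLE.get((int) first);
        return second != null ? second : first;
    }

    public static void main(String[] args) {
        for (int code : new int[] {0x01, 0x18}) {
            System.out.printf("code %02X: decoded once '%c', decoded twice '%c'%n",
                    code, decodeOnce(code), decodeTwice(code));
        }
    }
}
```

With this table, code `01` decodes once to `I` but twice to `я`, and code `18` decodes once to `.` but twice to `V`, which matches the symptoms reported above.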

0 Answers