Get pixel data from a scanned PDF document in Java

Question

I have some documents that I have digitized with a Xerox scanner into a PDF file. Using Java, I am trying to extract RGB pixel data from it for use in image recognition applications. Developing this from scratch is a bit beyond my level, so I am relying on third-party libraries for the PDF processing.

So far I have tried two different libraries: PdfBox and PdfClown.

With PdfBox, I am trying to use the convertToImage() method to obtain a BufferedImage. With PdfClown I am trying to use the render(page,size) method from the Renderer class to obtain a BufferedImage. In both cases the returned image is blank. All pixels are white [(r,g,b) = (255,255,255)].
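For reference, this is roughly how the "all pixels are white" check can be done on a BufferedImage using only the JDK. It is a minimal sketch, not the exact code from my project: the class name `PixelInspector` and the method `countNonWhitePixels` are made up for illustration, and the small image built in `main` is a stand-in for whatever `convertToImage()` or `render(page,size)` returns.

```java
import java.awt.image.BufferedImage;

final class PixelInspector {

    /** Counts pixels whose (r,g,b) value is not pure white (255,255,255). */
    static long countNonWhitePixels(BufferedImage image) {
        long count = 0;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                // getRGB returns a packed ARGB int in the default sRGB color model
                int argb = image.getRGB(x, y);
                int r = (argb >> 16) & 0xFF;
                int g = (argb >> 8) & 0xFF;
                int b = argb & 0xFF;
                if (r != 255 || g != 255 || b != 255) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for a rendered page: a 4x4 white image with one dark pixel.
        BufferedImage image = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 4; y++) {
            for (int x = 0; x < 4; x++) {
                image.setRGB(x, y, 0xFFFFFF);
            }
        }
        image.setRGB(2, 1, 0x000000);
        System.out.println("Non-white pixels: " + countNonWhitePixels(image));
    }
}
```

A result of zero for a rendered page means the image really is blank, as opposed to the content merely being faint or placed off the visible area.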

I have been able to get non-blank BufferedImages from other PDF documents that don't originate from a scan, so I suspect that the problem is with the format of the scanned document.

Here is a sample PDF file: http://www.filedropper.com/innlevering1

Does anyone know how to solve this? Or can you offer a different approach?

If I tell you that my method works and you come back and tell me it doesn't work in your case, it's a waste of time. So what are you looking for? Troubleshooting the images? Maybe your code has some peculiarity. — gpasch, Feb 29 '16 at 23:14
Please share a sample PDF. (Not all scanners put images into PDFs alike...) — mkl, Mar 01 '16 at 07:20
The PDPage class in PDFBox 2.0 doesn't seem to contain the convertToImage() method. Do you have a suggestion as to how I could do this using the 2.0 version @TilmanHausherr? — Torben, Mar 01 '16 at 08:19
https://stackoverflow.com/questions/23326562/apache-pdfbox-convert-pdf-to-images — Tilman Hausherr, Mar 01 '16 at 08:27
Your file works fine if the jbig2 plugin is used - that is why you got this log output in 1.8: "Can't find an ImageIO plugin to decode the JBIG2 encoded datastream". — Tilman Hausherr, Mar 01 '16 at 08:38
Wow. I started writing an answer focusing on that it would be best to not `convertToImage` or `render` pages with scanned images but instead to *extract* those scanned images for optimum quality. Then I inspected the OP's sample file and saw that the scanner in question did not put the scan as single image into the page but as a low quality background JPEG and numerous high quality JBIG2s of patches where the scanner recognized writing... For such a patchwork file one unfortunately has to render, in particular as the scanner did not recognize all text... — mkl, Mar 01 '16 at 09:14
@Torben can you configure the scanner not to do that *optimization*? If the scan was included as a single image per page, image extraction would provide best material to OCR. — mkl, Mar 01 '16 at 09:20
@TilmanHausherr I'm sorry, but I am quite new to Java programming, so I can't get it to work properly. I downloaded the source files for the Levigo JBIG2 plugin, imported them into my src directory in Eclipse, and wrote an import statement in the code. I still get the "Can't find an ImageIO plugin to decode the JBIG2 encoded datastream." message. What am I doing wrong here? Edit: I am trying to use the 1.8 version. — Torben, Mar 01 '16 at 22:30
@mkl For practical reasons that is not an option right now, so I will have to work with the current documents. — Torben, Mar 01 '16 at 22:30
@Torben You don't need the sources for the levigo plugin, only the .jar file, e.g. https://jbig2-imageio.googlecode.com/svn/maven-repository/com/levigo/jbig2/levigo-jbig2-imageio/1.6.3/levigo-jbig2-imageio-1.6.3.jar . Add it in Eclipse as a library to your project (sorry, I can't tell you how, I use NetBeans). There is no need for a code change, i.e. no import is needed. — Tilman Hausherr, Mar 01 '16 at 22:39
@Torben Here's a video on how to add a library to a project: https://www.youtube.com/watch?v=E1HTwMJhWVA (There are other videos if you don't understand the accent of the speaker, just search for something like "add library in eclipse"). — Tilman Hausherr, Mar 01 '16 at 22:46
@Torben I hope you were able to solve this. If yes, please delete the question or answer it yourself, to avoid "orphans". If no, "ping" me here, or re-ask your question on the PDFBox user mailing list. However, being new to Java is of course a handicap, due to the complexity of the PDF specification. Good luck! — Tilman Hausherr, Mar 04 '16 at 19:11

score 0 · Answer 1 · answered Mar 05 '16 at 21:19

The problem was solved by installing the JBIG2 plugin. Everything works perfectly now. Thanks a lot for the help.
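For anyone hitting the same "Can't find an ImageIO plugin to decode the JBIG2 encoded datastream" message, a quick way to see whether the plugin .jar is actually on the classpath is to ask ImageIO directly. This is only a sketch using standard JDK calls; the class name `Jbig2PluginCheck` and the helper `hasReaderFor` are made up for illustration, and I am assuming the plugin registers its reader under the format name "JBIG2".

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

final class Jbig2PluginCheck {

    /** Returns true if ImageIO currently has at least one reader for the given format name. */
    static boolean hasReaderFor(String formatName) {
        // Pick up any plugin jars that are on the application classpath.
        ImageIO.scanForPlugins();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Assumption: the JBIG2 plugin registers under the name "JBIG2".
        System.out.println("JBIG2 reader available: " + hasReaderFor("JBIG2"));
        // PNG support ships with the JDK, so this one is a sanity check.
        System.out.println("PNG reader available:   " + hasReaderFor("png"));
    }
}
```

If the first line prints false, the .jar has not been added to the project as a library, and rendering will keep producing blank areas where the JBIG2 patches should be.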

Torben