Extract Text from Image in PDF

Question

Assume my user went to a scanner in their office. The scanner is capable of generating a PDF of the scanned document. This is essentially the type of file that I have.

What I want to do is extract the text from this PDF. This is not a "first generation" PDF, in the sense that the text is not embedded in the PDF as text. The text exists only as part of the image that is in the PDF.

Is there functionality in iText or PDFBox that allows this data to be retrieved? I am trying to avoid doing OCR on the image if possible. I was hoping there was something built into iText or PDFBox that does this.

Note that I am not talking about extracting "normal" text from a PDF, as is outlined here: How to get raw text from pdf file using java

Your question might be clearer if you removed the mention of pdf entirely. Essentially you're wanting to read text from an image, if I'm reading this correctly. — cadams, Aug 18 '15 at 19:27
You want to do OCR without doing OCR. PDFBox and iText can only extract text that is stored as vector data. You want to get text that consists of pixels in a raster image. That's OCR. Neither PDFBox, nor iText support OCR. — Bruno Lowagie, Aug 18 '15 at 19:38
@cadams Yes, but on a PDF. I do not want to convert it to an image. It has to be done on the PDF itself. — user489041, Aug 18 '15 at 19:44
@BrunoLowagie I suppose what I meant was I do not want to use a third party library that does OCR. I was hoping that PDFBox or iText can do this. I'm actually fairly sure that they can. I just need to figure out how to plug that functionality into it. — user489041, Aug 18 '15 at 19:45
Right. As far as I'm aware, what you're wanting to do is not possible. However, you can use a Java wrapper for Tesseract like tesjeract or Tess4J, but you will have to convert the PDF to a PNG or TIFF image first, which you seem to be trying to avoid. — cadams, Aug 18 '15 at 19:56
PDFBox has looked into building an extension that allows for OCR software to be plugged-in, but I don't think it has been implemented yet — cadams, Aug 18 '15 at 19:57
@cadams Ok great, thanks for the input. Too bad, looks like I will have to go a third party route. — user489041, Aug 18 '15 at 20:07
@cadams it was done in GSoC2014: https://issues.apache.org/jira/browse/PDFBOX-1912 — Tilman Hausherr, Aug 18 '15 at 20:19
@TilmanHausherr according to that page, it's a work in progress. — cadams, Aug 18 '15 at 20:28
@TilmanHausherr Oh yeah, looking at the comments, looks like there is a plugin for it. I'll post that info in an answer in case it's useful. — cadams, Aug 18 '15 at 20:30
@cadams The project was finished in 2014, but it was never integrated in the official PDFBox release, and it is based on the revision of some time in 2014. — Tilman Hausherr, Aug 18 '15 at 20:39

cadams · Accepted Answer · 2015-08-19T13:32:54.050

Ok, after some looking around, there doesn't seem to be a way to do this specifically with iText or PDFBox, but it looks like PDFBox does have a plugin for third-party software that can accomplish what you need. If that is of interest, links are here and here, sourced from here (from @TilmanHausherr).
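
For what it's worth, the third-party route mentioned in the comments (render each page to an image, then hand that image to an OCR engine) could be structured roughly as below. This is only a sketch: the class and method names (`ScannedPdfOcr`, `ocrPages`) are made up for illustration, and the renderer and OCR engine are passed in as plain functions so the snippet compiles without any extra libraries. It is not something iText or PDFBox provides out of the box.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.IntFunction;

/** Sketch of a render-then-OCR pipeline for scanned PDFs. All names are hypothetical. */
final class ScannedPdfOcr {

    private ScannedPdfOcr() {
        // static helper only
    }

    /**
     * Renders every page to a raster image and passes each image to an OCR step.
     *
     * @param pageCount  number of pages in the document
     * @param renderPage maps a zero-based page index to an image
     *                   (assumption: a PDF library's page renderer is plugged in here)
     * @param recognize  maps an image to the text found in it
     *                   (assumption: an OCR library such as a Tesseract wrapper is plugged in here)
     * @return the recognized text, one entry per page, in page order
     */
    static List<String> ocrPages(int pageCount,
                                 IntFunction<BufferedImage> renderPage,
                                 Function<BufferedImage, String> recognize) {
        if (pageCount < 0) {
            throw new IllegalArgumentException("pageCount must not be negative");
        }
        List<String> pages = new ArrayList<>(pageCount);
        for (int i = 0; i < pageCount; i++) {
            BufferedImage image = renderPage.apply(i);
            String text = recognize.apply(image);
            // Treat "nothing recognized" as an empty page instead of null.
            pages.add(text == null ? "" : text.trim());
        }
        return pages;
    }
}
```

With PDFBox 2.x and Tess4J on the classpath, `renderPage` might wrap something like `new PDFRenderer(document).renderImageWithDPI(i, 300)` and `recognize` might wrap `new Tesseract().doOCR(image)`. Both of those throw checked exceptions, so each lambda would need to catch and rethrow. Note that this still converts each page to an image in memory, it just avoids writing PNG or TIFF files to disk.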

edited Aug 19 '15 at 13:32

answered Aug 18 '15 at 20:34
