
I am using PDFBox to render a page to a BufferedImage. Each document is a scanned sheet of A4 paper. Unfortunately, many of these documents have already been scanned, and the only OCR I have available runs only at scan time. So I use tess4j to sort these documents.

    try (PDDocument inputPDF = PDDocument.load(pdf)) {
        // Render page 0 (the first page) at 200 DPI
        firstPage = new PDFRenderer(inputPDF).renderImageWithDPI(0, 200);
    }

However, rendering this way is pretty slow. I actually need just a small part of the first page of the PDF, so rendering the entire page is pointless. My question is: how can I extract an area of a PDF page as a BufferedImage? For example, an area of 100x100 in the upper right corner.

Thanks :)

  • Is the page scanned as one image? Or as many partial ones? In the former case I would recommend image extraction instead of rendering, which should be much faster. – mkl Apr 22 '17 at 10:13
  • Alternatively, call setCropBox() on the first PDPage, which you get from PDDocument.getPage(0). This may or may not speed things up slightly, but it is worth a try. Some help with the coordinates: y=0 is at the bottom, and an A4 page is about 595 x 842 in PDF coordinates. – Tilman Hausherr Apr 22 '17 at 11:21
  • The page is scanned as one image. – Tomáš Jelínek Apr 22 '17 at 21:24
  • To extract images, choose an answer from here: https://stackoverflow.com/questions/8705163/extract-images-from-pdf-using-pdfbox – Tilman Hausherr Apr 24 '17 at 08:16
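
A minimal sketch of the coordinate arithmetic behind the setCropBox() suggestion above, using only the JDK. CropRegion and its methods are hypothetical helper names, not part of PDFBox. It assumes the page's media box starts at (0, 0) and measures about 595 x 842 points; for a real page those values would more safely come from the PDPage's media box. The resulting x, y, width and height could then presumably be passed to PDFBox's PDRectangle(x, y, width, height) constructor and on to setCropBox() before rendering. Whether that actually speeds up rendering is untested here.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper (not part of PDFBox): computes a crop rectangle in PDF user space. */
final class CropRegion {

    /** Converts a length in pixels at the given DPI to PDF points (72 points per inch). */
    static float pixelsToPoints(float pixels, float dpi) {
        return pixels * 72f / dpi;
    }

    /**
     * Returns the rectangle of the given size that touches the upper right
     * corner of the page. PDF user space has its origin at the lower left,
     * so y is measured upward from the bottom edge of the page.
     * Rectangle2D.Float is used here only as a plain container for x, y, width, height.
     */
    static Rectangle2D.Float upperRight(float pageWidth, float pageHeight,
                                        float regionWidth, float regionHeight) {
        float x = pageWidth - regionWidth;   // distance from the left edge
        float y = pageHeight - regionHeight; // distance from the bottom edge
        return new Rectangle2D.Float(x, y, regionWidth, regionHeight);
    }

    public static void main(String[] args) {
        // A 100 x 100 pixel area at 200 DPI, on an assumed 595 x 842 point A4 page
        float side = pixelsToPoints(100f, 200f);
        Rectangle2D.Float r = upperRight(595f, 842f, side, side);
        System.out.println(r.x + " " + r.y + " " + r.width + " " + r.height);
    }
}
```

Note that the size of the area depends on the render resolution: at the 200 DPI used in the question, 100 pixels correspond to 36 points, so the crop box is much smaller in PDF units than in pixels.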

0 Answers