
I have an image-only PDF file that looks like a scan of a really big page. Preview shows me that it is about 42x30 inches, and 3047x2160 pixels. I guess it was scanned at 72 dpi.

[Screenshots of the Preview inspector: size in inches, size in pixels]

I'm extracting this image with PDFBox by looking for instances of PDImageXObject, similar to https://stackoverflow.com/a/37664125/10026.

However, for this image, PDImageXObject.getWidth() and PDImageXObject.getHeight() give me 16928 and 12000, respectively. When I call PDImageXObject.getImage(), it creates an enormous BufferedImage in memory.

Is there a better way to get the image out of the PDF so that it keeps the original pixel size?
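
For a sense of scale, here is a minimal sketch of the arithmetic behind that memory use. The class and method names are hypothetical, and the bits-per-pixel values are assumptions, since the post does not say which kind of BufferedImage `getImage()` produces for this file.

```java
// Hypothetical helper, not part of PDFBox: estimates the raster size of a
// fully decoded image from its pixel dimensions and bit depth.
final class DecodedImageSize {

    /** Bytes needed for the pixel data, with each row padded to a whole byte. */
    static long rasterBytes(int width, int height, int bitsPerPixel) {
        long bytesPerRow = ((long) width * bitsPerPixel + 7) / 8;
        return bytesPerRow * height;
    }

    public static void main(String[] args) {
        // Dimensions reported by getWidth() and getHeight() in the question.
        int width = 16928;
        int height = 12000;

        // Assumed bit depths; the real one depends on the image in the PDF.
        System.out.println("24-bit RGB: " + rasterBytes(width, height, 24) + " bytes");
        System.out.println("8-bit gray: " + rasterBytes(width, height, 8) + " bytes");
        System.out.println("1-bit:      " + rasterBytes(width, height, 1) + " bytes");
    }
}
```

At 24 bits per pixel this comes to roughly 600 MB, while a 1-bit raster of the same dimensions would need about 25 MB.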

ykaganovich
  • You retrieve the original image resource from the PDF. When such an image resource is added to a page, it is subject to the current transformation matrix, which can apply an arbitrary affine transformation, e.g. a rotation, mirroring, skewing, scaling, ... The resulting transformed image is calculated in the viewer, so you cannot extract it from the PDF. – mkl May 09 '20 at 00:31
  • @mkl Are you saying that Preview is not reporting the correct number of pixels? Well, that's possible, but a 15 MB JPG image ends up being extracted from a 3 MB PDF, which I find highly suspicious. – ykaganovich May 09 '20 at 02:59
  • This is not surprising at all. JPG is not always the best compression. If the image in the PDF was a 1-bit image, then it would be better to save it as a G4-compressed TIFF image with `ImageIOUtil.writeImage()` (you'll need a TIFF writer, e.g. TwelveMonkeys). – Tilman Hausherr May 09 '20 at 04:11
  • *Are you saying that Preview is not reporting the correct number of pixels?* - Preview in your screenshot does not report pixels at all. It reports points. And a point is 1/72 inch. It has nothing to do with pixels. – mkl May 09 '20 at 08:05
  • @TilmanHausherr That makes sense, thanks. Is there a way to tell if the image is a 1-bit image, or, more generally, what would be the best compression for it? I have no way of knowing in advance what's in these PDFs, and I need to feed them to an OCR library. – ykaganovich May 10 '20 at 22:16
  • @mkl Indeed, I was confused about the difference between pixels and points. Thanks for clarifying that pixel for me (jk). – ykaganovich May 10 '20 at 22:18
  • You can see this by opening it with `PDFDebugger` and searching for the image XObject in the page resources dictionary. – Tilman Hausherr May 11 '20 at 08:09
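
The points-versus-pixels distinction from the comments can be turned into a quick resolution check. This is a minimal sketch in plain Java, not PDFBox code, and the class and method names are hypothetical. It assumes that the 3047x2160 figure shown by Preview is the displayed size in points and that 16928x12000 is the stored size of the image in pixels.

```java
// Hypothetical helper: derives the effective scan resolution from the stored
// pixel size of an image and the size at which it is displayed on the page.
final class EffectiveResolution {

    /** A PDF point is 1/72 inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Pixels per inch along one axis, given the displayed length in points. */
    static double dpi(int pixels, double displayedPoints) {
        if (displayedPoints <= 0) {
            throw new IllegalArgumentException("displayedPoints must be positive");
        }
        return pixels * POINTS_PER_INCH / displayedPoints;
    }

    public static void main(String[] args) {
        // Pixel counts from getWidth() and getHeight(); point sizes as shown
        // by Preview (assumed to be rounded to whole points).
        System.out.printf("horizontal: %.1f dpi%n", dpi(16928, 3047));
        System.out.printf("vertical:   %.1f dpi%n", dpi(12000, 2160));
    }
}
```

Under those assumptions both axes work out to about 400 dpi, which would mean the extracted 16928x12000 image is the scan at its original pixel size rather than an enlarged copy of a 72 dpi one.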
