PDFBox extracted image is much bigger than original page

Question

I have an image-only PDF file that looks like a scan of a really big page. Preview shows me that it is about 42x30 inches, and 3047x2160 pixels. I guess it was scanned at 72 dpi resolution.

I'm extracting this image with PDFBox by looking for instances of PDImageXObject, similar to https://stackoverflow.com/a/37664125/10026.

However, for this image, PDImageXObject.getWidth() and PDImageXObject.getHeight() give me 16928 and 12000, respectively. When I call PDImageXObject.getImage(), it creates an enormous BufferedImage in memory.

Is there a better way to get the image out of the PDF so that it keeps the original pixel size?

You retrieve the original image resource from the PDF. When such an image resource is added to a page, it is subject to the current transformation matrix, which can apply an arbitrary affine transformation, e.g. a rotation, mirroring, skewing, scaling,... The resulting transformed image is calculated in the viewer, so you cannot extract it from the PDF. — mkl, May 09 '20 at 00:31
@mkl Are you saying that Preview is not reporting the correct number of pixels? Well, that's possible, but a 15 MB JPG image ends up being extracted from a 3 MB PDF, which I find highly suspicious. — ykaganovich, May 09 '20 at 02:59
This is not surprising at all. JPG is not always the best compression. If the image in the PDF was a 1 bit image, then it would be better to save it as a G4 compressed TIFF image with "ImageIOUtil.writeImage()" (you'll need a TIFF writer, e.g. twelvemonkeys). — Tilman Hausherr, May 09 '20 at 04:11
*Are you saying that Preview is not reporting the correct number of pixels?* - Preview in your screenshot does not report pixels at all. It reports points. And a point is 1/72 inch. It has nothing to do with pixels. — mkl, May 09 '20 at 08:05
@TilmanHausherr That makes sense, thanks. Is there a way to tell if the image is a 1 bit image, or, more generally, what would be the best compression for it? I have no way of telling in advance what's in these PDFs, and I need to feed them to an OCR library. — ykaganovich, May 10 '20 at 22:16
@mkl Indeed, I was confused about the difference between pixels and points. Thanks for clarifying that pixel for me (jk). — ykaganovich, May 10 '20 at 22:18
You can see this by opening it with `PDFDebugger` and searching for the image XObject in the page resources dictionary. — Tilman Hausherr, May 11 '20 at 08:09
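
To make the points-versus-pixels distinction from the comments concrete, here is a small sketch of the arithmetic. It assumes, as mkl says, that the 3047x2160 figure shown by Preview is the page size in points (1/72 inch), and that the image covers the whole page without rotation or skew. The class and method names (`ImageDpi`, `effectiveDpi`) are made up for illustration and are not part of PDFBox.

```java
import java.util.Locale;

public class ImageDpi {
    // A PDF point is 1/72 inch.
    static final double POINTS_PER_INCH = 72.0;

    // Effective resolution of an image that spans `points` on the page
    // and contains `pixels` samples along the same axis.
    static double effectiveDpi(int pixels, double points) {
        if (points <= 0) {
            throw new IllegalArgumentException("points must be positive");
        }
        double inches = points / POINTS_PER_INCH;
        return pixels / inches;
    }

    public static void main(String[] args) {
        // Values from the question: page size in points as reported by Preview,
        // image size in pixels as reported by PDImageXObject.getWidth()/getHeight().
        double horizontal = effectiveDpi(16928, 3047.0);
        double vertical = effectiveDpi(12000, 2160.0);
        System.out.println(String.format(Locale.ROOT,
                "horizontal: %.1f dpi, vertical: %.1f dpi", horizontal, vertical));
    }
}
```

Under those assumptions both axes come out at about 400 dpi, which suggests the 16928x12000 image is simply the original 400 dpi scan of a 42x30 inch page rather than an upscaled copy of a 72 dpi one.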
