Convert PDF files to images with PDFBox

Question

Can someone give me an example of how to use Apache PDFBox to convert a PDF file into separate images (one for each page of the PDF)?

Can it be in only one image? – Mohamed Taboubi, Sep 19 '19 at 15:33

Tilman Hausherr · Answer 1 · 2018-12-08T05:28:42.350

116

Solution for 1.8.* versions:

PDDocument document = PDDocument.loadNonSeq(new File(pdfFilename), null);
List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
int page = 0;
for (PDPage pdPage : pdPages)
{ 
    ++page;
    BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
    ImageIOUtil.writeImage(bim, pdfFilename + "-" + page + ".png", 300);
}
document.close();

Don't forget to read the 1.8 dependencies page before doing your build.

Solution for the 2.0 version:

PDDocument document = PDDocument.load(new File(pdfFilename));
PDFRenderer pdfRenderer = new PDFRenderer(document);
for (int page = 0; page < document.getNumberOfPages(); ++page)
{ 
    BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);

    // suffix in filename will be used as the file format
    ImageIOUtil.writeImage(bim, pdfFilename + "-" + (page+1) + ".png", 300);
}
document.close();

The ImageIOUtil class is in a separate download / artifact (pdfbox-tools). Read the 2.0 dependencies page before doing your build: you'll need extra jar files for PDFs with JBIG2 images, for saving to TIFF images, and for reading encrypted files.

Make sure to use the latest version of whatever JDK version you are using, i.e. if you are using jdk8, then don't use version 1.8.0_5, use 1.8.0_191 or whatever is the latest at the time you're reading. Early versions were very slow.
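
On the comment asking whether the result can be a single image: PDFBox renders one BufferedImage per page, so one option is to collect the rendered pages and stack them yourself. The following is only a sketch using plain java.awt; the class and method names (PageStitcher, stitchVertically) are made up for illustration and are not part of PDFBox.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
final class PageStitcher {

    /** Stacks the given page images top to bottom on a white canvas. */
    static BufferedImage stitchVertically(List<BufferedImage> pages) {
        if (pages.isEmpty()) {
            throw new IllegalArgumentException("no pages to stitch");
        }
        // The canvas is as wide as the widest page and as tall as all pages together.
        int width = 0;
        int height = 0;
        for (BufferedImage page : pages) {
            width = Math.max(width, page.getWidth());
            height += page.getHeight();
        }
        BufferedImage combined = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = combined.createGraphics();
        try {
            // Fill with white so the area beside narrower pages is not black.
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            int y = 0;
            for (BufferedImage page : pages) {
                g.drawImage(page, 0, y, null);
                y += page.getHeight();
            }
        } finally {
            g.dispose();
        }
        return combined;
    }
}
```

In the 2.0 loop above you would add each `bim` to a list instead of writing it out, then write the stitched image once. Keep in mind that all pages are held in memory at the same time, so this probably only suits short documents or a lower DPI.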

edited Dec 08 '18 at 05:28

answered Apr 27 '14 at 18:06

Tilman Hausherr

For future readers of this Q&A you might also want to post a solution for 2.x (unless the API change in that respect has not yet stabilized, that is). – mkl Jan 15 '15 at 11:50
If the PDF has graphics on it this doesn't seem to include them. It may also lose many graphical details http://stackoverflow.com/questions/4523688/pdfbox-problem-with-converting-pdf-page-into-image http://stackoverflow.com/questions/22332791/converting-pdf-to-image-with-proper-formatting – Don Cheadle Feb 18 '15 at 15:01
Note that in order to use ImageIOUtil in 2.0 you will need to add a dependency on pdfbox-tools. – metaforge Dec 10 '15 at 17:33
Great example! Thanks. Two remarks though: (1) in the 2.0 version, `BufferedImage bim = ` is missing in the fifth line; (2) notice that "300" is the zoom level and not (as I assumed) a dpi value or something. It wasn't until I got tons of OutOfMemory exceptions that I thought to check the API! – colouredmirrorball Apr 21 '16 at 21:38
@colouredmirrorball thanks re the code mistake; the 300 is a zoom value that ensures a dpi quality :-) – Tilman Hausherr Apr 22 '16 at 05:07
What exactly is the dependency you have to add for ImageIOUtil when using PDFBox 2.0? – Aeseir Jul 03 '16 at 03:59
@Aeseir you need pdfbox-tools. Plus the levigo jbig2 decoder, and jai_imageio.jar. – Tilman Hausherr Jul 03 '16 at 05:09
Thanks bud, found out they migrated the tools to a different jar. – Aeseir Jul 03 '16 at 05:14
renderImageWithDPI is taking more than 2-3 mins for a few colored documents, I am using openjdk 8.0.202. Please help. – Gentleman May 07 '19 at 08:31
Please create an issue in the pdfbox JIRA issue tracker. Include the file. – Tilman Hausherr May 07 '19 at 09:52
This is happening only in a multithreaded application, although only a single thread is accessing PDFBox. The test application works fine and takes less than 1 second for the same page. – Gentleman May 07 '19 at 12:42
Then I can't tell… try creating a test scenario that reproduces the problem. And then open a JIRA issue. – Tilman Hausherr May 07 '19 at 15:58

score 13 · Answer 2 · answered Aug 30 '19 at 10:07

I tried it today with PDFBox 2.0.15.

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.rendering.*;
import java.awt.image.*;
import java.io.*;
import javax.imageio.*;


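public static void PDFtoJPG (String in, String out) throws Exception
{
    // try-with-resources closes the document even if rendering fails
    try (PDDocument pd = PDDocument.load (new File (in)))
    {
        PDFRenderer pr = new PDFRenderer (pd);
        // page index 0 = first page only, rendered at 300 DPI
        BufferedImage bi = pr.renderImageWithDPI (0, 300);
        ImageIO.write (bi, "JPEG", new File (out));
    }
}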

chris01

works like a charm, just one thing. If the pdf document has more than one page use: pd.getNumberOfPages() to do a loop for every page. – Ralsho Aug 20 '20 at 08:59
This gives me a "Java heap space" error. What would be the reason? – GhostDede Jan 11 '21 at 11:33
@GhostDede Did you play around with the DPI value? Maybe lowering solves your problem. – Spindizzy Apr 14 '21 at 10:11
Yes lowering DPI value solves my problem but I need 500 DPI – GhostDede Apr 28 '21 at 11:22
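
To see why the DPI value matters so much for heap usage, it can help to estimate the size of the rendered image up front. The sketch below assumes 72 points per inch and 4 bytes per pixel (an int-backed RGB image); PDFBox's exact rounding and any extra working memory are not modelled, and the class and method names are made up for this example.

```java
// Hypothetical helper, not a PDFBox class.
final class RenderMemoryEstimate {

    private static final double POINTS_PER_INCH = 72.0;

    /** Rough size in bytes of a page rendered at the given DPI. */
    static long estimateBytes(double widthPt, double heightPt, double dpi) {
        long widthPx = (long) Math.ceil(widthPt / POINTS_PER_INCH * dpi);
        long heightPx = (long) Math.ceil(heightPt / POINTS_PER_INCH * dpi);
        // Assumption: 4 bytes per pixel, as for an int-backed RGB BufferedImage.
        return widthPx * heightPx * 4L;
    }

    public static void main(String[] args) {
        // An A4 page is roughly 595 x 842 points.
        for (int dpi : new int[] {72, 300, 500}) {
            long mib = estimateBytes(595, 842, dpi) / (1024 * 1024);
            System.out.printf("%d DPI: about %d MiB%n", dpi, mib);
        }
    }
}
```

The pixel count grows with the square of the DPI, so going from 300 to 500 DPI needs roughly 2.8 times the memory per page. If 500 DPI is really required, raising the heap limit (-Xmx) or rendering one page at a time and releasing each image before the next are the usual things to try.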

score 2 · Answer 3 · edited Nov 06 '17 at 09:36

Without any extra dependencies, you can just use the PDFToImage class already included in PDFBox (package org.apache.pdfbox.tools, per the Javadoc link below).

Kotlin:

PDFToImage.main(arrayOf<String>("-outputPrefix", "newImgFilenamePrefix", existingPdfFilename))

Other configuration options: https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/tools/PDFToImage.html

edited Nov 06 '17 at 09:36

Tilman Hausherr

answered Nov 05 '17 at 23:44

kittyminky

There will be dependencies. You'll see it when having a PDF with restricted rights or which has JPEG 2000 images or JBIG2 images. – Tilman Hausherr Nov 06 '17 at 09:36
There “will be dependencies” in *that* case. For my use case we are turning pdfs we create into imgs so there isn’t the issue of restricted rights or those different img kinds. – kittyminky Nov 06 '17 at 15:07
Such images are IN the PDFs. Example: https://issues.apache.org/jira/secure/attachment/12865265/jbig2.pdf You're only safe if you know in advance that your input PDFs don't have these types of images inside them. – Tilman Hausherr Nov 07 '17 at 21:54
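
If you are unsure whether the optional decoders discussed in these comments are on your classpath, you can ask ImageIO which readers it knows about. This is only a sketch: the format names "JBIG2" and "JPEG2000" are an assumption about how the plugins register themselves, and ImageIoPluginCheck is a made-up class name.

```java
import javax.imageio.ImageIO;

// Hypothetical helper, not a PDFBox class.
final class ImageIoPluginCheck {

    /** Returns true if some ImageIO reader is registered for the format name. */
    static boolean hasReader(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // Pick up plugins that were added to the classpath.
        ImageIO.scanForPlugins();
        // Assumed format names for the optional JBIG2 and JPEG 2000 plugins.
        for (String format : new String[] {"JBIG2", "JPEG2000"}) {
            System.out.println(format + " reader available: " + hasReader(format));
        }
    }
}
```

A `false` here only tells you that no plugin is registered under that name; it says nothing about whether your particular PDFs contain such images.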

score 2 · Answer 4 · edited Jul 01 '19 at 05:05

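import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PDFtoJPGConverter {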

    public List<File> convertPdfToImage(File file, String destination) throws Exception {

        File destinationFile = new File(destination);

        if (destinationFile.exists()) {
            System.out.println("DESTINATION FOLDER ALREADY EXISTS");
        } else if (destinationFile.mkdirs()) {
            System.out.println("DESTINATION FOLDER CREATED -> " + destinationFile.getAbsolutePath());
        } else {
            System.out.println("DESTINATION FOLDER NOT CREATED!!!");
        }

        List<File> fileList = new ArrayList<File>();
        if (!file.exists()) {
            System.err.println(file.getName() + " FILE DOES NOT EXIST");
            return fileList; // empty list instead of null
        }

        String fileName = file.getName().replace(".pdf", "");
        System.out.println("CONVERTER START.....");

        // try-with-resources closes the document even if rendering fails
        try (PDDocument doc = PDDocument.load(file)) {
            PDFRenderer renderer = new PDFRenderer(doc);
            for (int i = 0; i < doc.getNumberOfPages(); i++) {
                // File(parent, child) inserts the path separator, so the
                // destination does not need a trailing slash
                File fileTemp = new File(destinationFile, fileName + "_" + i + ".jpg"); // jpg or png
                // 200 is the resolution in dots per inch; change it if necessary
                BufferedImage image = renderer.renderImageWithDPI(i, 200);
                ImageIO.write(image, "JPEG", fileTemp); // JPEG or PNG
                fileList.add(fileTemp);
            }
        }
        System.out.println("CONVERTER STOPPED.....");
        System.out.println("IMAGE SAVED AT -> " + destinationFile.getAbsolutePath());
        return fileList;
    }

    public static void main(String[] args) {

        try {
            PDFtoJPGConverter converter = new PDFtoJPGConverter();
            Scanner sc = new Scanner(System.in);
            System.out.print("Enter the destination folder for the images\n");
            // Destination = D:/PPL/;
            String destination = sc.nextLine();

            System.out.print("Enter the PDF file names with their source folder, separated by commas\n");
            String sourcePathWithFileName = sc.nextLine();
            // Source Path = D:/PDF/ant.pdf,D:/PDF/abc.pdf,D:/PDF/xyz.pdf
            // && rather than ||, and isEmpty() rather than != "", which compares references
            if (sourcePathWithFileName != null && !sourcePathWithFileName.isEmpty()) {
                String[] files = sourcePathWithFileName.split(",");
                for (String file : files) {
                    File pdf = new File(file.trim());
                    System.out.println("FILE:>> " + pdf);
                    converter.convertPdfToImage(pdf, destination);
                }
            }

        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

====================================

Here I am using the Apache pdfbox-2.0.8, commons-logging-1.2 and fontbox-2.0.8 libraries.

HAPPY CODING :)
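
One small detail in the code above: `replace(".pdf", "")` removes every occurrence of ".pdf" in the name and is case-sensitive, so "Scan.PDF" keeps its extension. A possible alternative is sketched below; PdfNames, baseName and pageFileName are made-up names for this example and not part of the answer's code or of PDFBox.

```java
// Hypothetical helper, not part of PDFBox.
final class PdfNames {

    /** Strips a trailing ".pdf" (any letter case) and nothing else. */
    static String baseName(String fileName) {
        int start = fileName.length() - 4;
        // regionMatches returns false for a negative offset, so short names are safe.
        if (fileName.regionMatches(true, start, ".pdf", 0, 4)) {
            return fileName.substring(0, start);
        }
        return fileName;
    }

    /** Builds names like "ant_0.jpg", using the same zero-based index as the loop above. */
    static String pageFileName(String pdfFileName, int pageIndex, String extension) {
        return baseName(pdfFileName) + "_" + pageIndex + "." + extension;
    }
}
```

For example, `pageFileName(file.getName(), i, "jpg")` could replace the string concatenation inside the loop.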

score 0 · Answer 5 · answered Oct 03 '20 at 04:41

Here is part of my code to convert a PDF, from a multipart file, to a JPG thumbnail. I'm saving the image as a Base64 string. PDFBox version 2.0.21 was used.

private static String generatePdfThumbnail(byte[] imageInBytesArray) throws IOException {
    // imageInBytesArray holds the PDF bytes; try-with-resources closes the document
    try (PDDocument document = PDDocument.load(imageInBytesArray)) {
        PDFRenderer renderer = new PDFRenderer(document);
        // renderImage(0) renders the first page at the default scale (72 DPI)
        BufferedImage bufferedImage = renderer.renderImage(0);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // ImageIO.write returns false if no writer for the format is found
        boolean foundWriter = ImageIO.write(bufferedImage, "jpg", baos);
        if (!foundWriter) {
            return "";
        }
        return Base64.getEncoder().encodeToString(baos.toByteArray());
    }
}
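
The method above encodes the page at its rendered size. If a genuinely small thumbnail is wanted, the image could be scaled down before encoding. The sketch below uses only the JDK; ThumbnailEncoder, scaleToWidth and toBase64Jpeg are made-up names for illustration, and the bilinear interpolation hint is just one reasonable choice.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import javax.imageio.ImageIO;

// Hypothetical helper, not a PDFBox class.
final class ThumbnailEncoder {

    /** Scales the image down to maxWidth, keeping the aspect ratio; never scales up. */
    static BufferedImage scaleToWidth(BufferedImage source, int maxWidth) {
        if (source.getWidth() <= maxWidth) {
            return source;
        }
        float factor = maxWidth / (float) source.getWidth();
        int height = Math.max(1, Math.round(source.getHeight() * factor));
        BufferedImage scaled = new BufferedImage(maxWidth, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(source, 0, 0, maxWidth, height, null);
        } finally {
            g.dispose();
        }
        return scaled;
    }

    /** Encodes the image as JPEG and returns the bytes as a Base64 string. */
    static String toBase64Jpeg(BufferedImage image) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        if (!ImageIO.write(image, "jpg", baos)) {
            throw new IOException("no JPEG writer available");
        }
        return Base64.getEncoder().encodeToString(baos.toByteArray());
    }
}
```

With this, the rendered page could be passed through `scaleToWidth(bufferedImage, 200)` before encoding; the width of 200 pixels is an arbitrary example value.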
