
I am trying to extract images from a PDF using PDFBox. I followed this post. It worked for some PDFs, but for most others it did not. For example, I am not able to extract the figures in this file.

After some research I found that PDResources.getImages is deprecated, so I am using PDResources.getXObjects() instead. With this, I am not able to extract any image from the PDF and instead get this message on the console:

org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

Now I am stuck and unable to find a solution. Please help if you can.

//////UPDATE AS REPLY ON COMMENTS///

I am using pdfbox-1.8.10

Here is the code:

public void getimg() throws Exception {
    String sourceDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/inputs/Yavaa.pdf";
    String destinationDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/outputs/";
    File oldFile = new File(sourceDir);
    if (!oldFile.exists()) {
        System.err.println("File does not exist");
        return;
    }

    PDDocument document = null;
    try {
        document = PDDocument.load(oldFile);
        List<PDPage> list = document.getDocumentCatalog().getAllPages();
        String fileName = oldFile.getName().replace(".pdf", "_cover");
        int totalImages = 1;
        for (PDPage page : list) {
            PDResources pdResources = page.getResources();
            Map<String, PDXObject> pageImages = pdResources.getXObjects();
            if (pageImages == null) {
                continue;
            }
            for (PDXObject obj : pageImages.values()) {
                // XObjects can also be forms, so check the type before casting.
                if (obj instanceof PDXObjectImage) {
                    PDXObjectImage pdxObjectImage = (PDXObjectImage) obj;
                    // write2file appends the image's file suffix itself.
                    pdxObjectImage.write2file(destinationDir + fileName + "_" + totalImages);
                    totalImages++;
                }
            }
        }
    } catch (Exception e) {
        System.err.println(e.getMessage());
    } finally {
        // Release the file handle even if extraction fails.
        if (document != null) {
            document.close();
        }
    }
}

//// PARTIAL SOLUTION/////

I have solved the problem of the error message and have updated the code in the post accordingly. However, the underlying problem remains: I am still not able to extract the images from a few of the files, such as the one I mentioned in this post. Is there any solution for that?

user3050590
  • Could you post the relevant code please? – lschuetze Jan 25 '16 at 09:40
  • What version are you using? Could you post the full stack trace? Possible cause: not all XObjects are images. Some can be forms. If it happens in your own code, then you should add an "instanceof PDXObjectImage" check. – Tilman Hausherr Jan 25 '16 at 09:50
  • I get an error when accessing your file URL, maybe it is a temporary URL. Try also using findResources() instead of getResources(); try also the ExtractImages command line tool (not documented) https://pdfbox.apache.org/1.8/commandline.html , try also the 2.0 version. – Tilman Hausherr Jan 25 '16 at 11:58
  • One problem with your code (I'm not saying that it is the cause) is that it won't go recursively. Better look at the source of the ExtractImages tool in the source code download. – Tilman Hausherr Jan 25 '16 at 12:01
  • https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractImages.java?view=markup – Tilman Hausherr Jan 25 '16 at 12:05
  • Do also make sure you have all the needed extra libraries, e.g. for JBIG2 and JPEG2000 and TIF (see dependencies). Mention if you get any log output. – Tilman Hausherr Jan 25 '16 at 12:06
  • @Tilman Hausherr I have updated the pdf url. Please have a look. – user3050590 Jan 25 '16 at 12:26
  • Your PDF does not have any embedded images. The graphics on pages 2, 4 and 5 are vector drawings. But don't believe me, use the PDFDebugger of the 2.0 version and look for Resources/XObjects. – Tilman Hausherr Jan 25 '16 at 12:33
  • @Tilman Hausherr, you are right. It does not extract vector graphics. For other files it is able to extract the images. Do you have any suggestion for how I can extract vector graphics from a PDF? Could you also post your comment as an answer? Thanks – user3050590 Jan 25 '16 at 12:41

1 Answer


The first problem with the original code is that an XObject can be either a PDXObjectImage or a PDXObjectForm, so you need to check the instance type before casting. The second problem is that the code doesn't walk PDXObjectForm objects recursively; forms can have resources of their own. The third problem (only in 1.8) is that you used getResources() instead of findResources(); getResources() doesn't check higher levels, so inherited resources are missed.

Code for 1.8 can be found here: https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractImages.java?view=markup

Code for 2.0 can be found here: https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup&sortby=date

(Even these are not always perfect, see this answer)
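To illustrate the second point (walking forms recursively), here is a minimal sketch of the traversal only. It deliberately does not use PDFBox: ImageXObject and FormXObject are hypothetical stand-ins for PDXObjectImage and PDXObjectForm, and the `resources` map stands in for what you would get from a form's own resources in real code. The linked ExtractImages source remains the thing to copy from.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types; these are NOT the PDFBox classes.
final class XObjectWalk {

    /** Stand-in for any XObject (image or form). */
    interface XObject {}

    /** Stand-in for an embedded raster image (PDXObjectImage in 1.8). */
    static final class ImageXObject implements XObject {
        final String name;
        ImageXObject(String name) { this.name = name; }
    }

    /** Stand-in for a form XObject (PDXObjectForm), which has its own resources. */
    static final class FormXObject implements XObject {
        final Map<String, XObject> resources = new LinkedHashMap<>();
    }

    /** Collects every image reachable from the given XObjects, descending into forms. */
    static List<ImageXObject> collectImages(Map<String, XObject> xobjects) {
        List<ImageXObject> found = new ArrayList<>();
        collect(xobjects, found);
        return found;
    }

    private static void collect(Map<String, XObject> xobjects, List<ImageXObject> found) {
        if (xobjects == null) {
            return; // a page or form without XObjects
        }
        for (XObject x : xobjects.values()) {
            if (x instanceof ImageXObject) {
                found.add((ImageXObject) x);
            } else if (x instanceof FormXObject) {
                // Forms can nest images (and other forms), so recurse.
                collect(((FormXObject) x).resources, found);
            }
        }
    }
}
```

A non-recursive loop over the page's XObjects would only see images at the top level; anything placed inside a form would be skipped. The sketch assumes the form graph has no cycles.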

The fourth problem is that your file doesn't have any XObjects at all. All the "graphics" are really vector drawings, which can't be "extracted" like embedded images. All you could do is convert the PDF pages to images, and then mark and cut out what you need.
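The "cut" step needs nothing beyond the JDK once a page has been rendered to a BufferedImage (in 1.8 that would typically come from PDPage.convertToImage()). The sketch below covers only the cropping; the PageCrop class and the pixel coordinates are assumptions for illustration, and you would have to determine the rectangle for each figure yourself.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper for cutting a region out of an already rendered page image.
final class PageCrop {

    /**
     * Returns an independent copy of the rectangle (x, y, width, height),
     * clamped to the bounds of the rendered page.
     */
    static BufferedImage crop(BufferedImage page, int x, int y, int width, int height) {
        int left = Math.max(0, x);
        int top = Math.max(0, y);
        int right = Math.min(page.getWidth(), x + width);
        int bottom = Math.min(page.getHeight(), y + height);
        if (right <= left || bottom <= top) {
            throw new IllegalArgumentException("Crop region lies outside the page");
        }
        int w = right - left;
        int h = bottom - top;
        // Copy the pixels so the result does not share data with the page image.
        BufferedImage copy = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        copy.setRGB(0, 0, w, h, page.getRGB(left, top, w, h, null, 0, w), 0, w);
        return copy;
    }
}
```

The cropped image can then be saved with javax.imageio.ImageIO.write(cropped, "png", file). Note that the result is a raster at whatever resolution the page was rendered, not the original vector drawing.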

Tilman Hausherr