I'm using Apache PDFBox to extract pages from PDF files and I can't find a way to extract content that is unselectable (either text or images). With content that is selectable from within the PDF files there is no problem.

Note that the PDFs in question don't have any restrictions on copying content, at least according to each file's "Document Restrictions Summary": they all have "Content Copying" and "Content Copying for Accessibility" allowed. The same PDF file contains both selectable and unselectable content. As a result, the extracted pages come out with "holes", i.e., they contain only the selectable parts of the PDF. In MS Word, though, if I add the PDFs as objects, the whole content of the PDF pages appears. So I was hoping to do the same with the PDFBox library, or any other Java library for that matter.

Here is the code I'm using to convert PDF pages to images:

private void convertPdfToImage(File pdfFile, int pdfId) throws IOException {
   PDDocument document = PDDocument.loadNonSeq(pdfFile, null);
   try {
       List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
       int pageIndex = 0;
       for (PDPage pdPage : pdPages) {
           BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
           // Include the page index so that pages do not overwrite each other.
           ImageIOUtil.writeImage(bim, TEMP_FILEPATH + pdfId + "-" + pageIndex++ + ".png", 300);
       }
   } finally {
       // Release the document even if rendering fails.
       document.close();
   }
}

Is there a way to extract unselectable content from a PDF with the Apache PDFBox library (or with any similar library)? Or is this not possible at all? And if it isn't, why not?

Any help is much appreciated!

EDIT: I'm using Adobe Reader as PDF viewer and PDFBox v1.8. Here is a sample PDF: https://dl.dropboxusercontent.com/u/2815529/test.pdf

Dr Jorge
  • Have you read this [article](http://www.adobe.com/content/dam/Adobe/en/products/acrobat/pdfs/adobe-acrobat-xi-protect-pdf-file-with-permissions-tutorial-ue.pdf) at adobe? I am pretty sure that you are having troubles with some content that was copy-protected by document's creators. I am also sure there are some ways to bypass that protection, however stackoverflow is not a place where such things should be discussed. – user3707125 Jan 08 '16 at 00:18
  • Nope, that is not the problem; the PDFs in question don't have any restrictions on copying content, at least according to each file's "Document Restrictions Summary". – Dr Jorge Jan 08 '16 at 00:36
  • Checking the permissions was obviously the first thing I did, and only the dumbest person in the world would post an SO question like this one without checking this first. Despite this, I edited my question to point out that I have made that check. – Dr Jorge Jan 08 '16 at 00:42
  • I asked because you didn't mention this fact, and also because half of the questions asked here can be googled in one minute or resolved by reading the documentation for one minute. Nothing personal :) – user3707125 Jan 08 '16 at 00:45
  • I know that I don't have much reputation, but I want to believe that those who ask dumb questions have way, way less rep than I do :P Nevertheless, thanks for your comment. – Dr Jorge Jan 08 '16 at 00:49
  • For images, there's the ExtractImages tool. For text, there's the ExtractText tool in PDFBox (works only on text with proper encoding). What you mention re: word, I suspect you're just adding the PDF as an attachment, and it shows a preview. You can do that too, i.e. just convert the pages to an image with the PDFToImage tool. See https://pdfbox.apache.org/1.8/commandline.html . If you mean something else, please link to a PDF with content that is "unselectable". – Tilman Hausherr Jan 08 '16 at 07:24
  • Please supply a sample PDF. Not selectable text might be anything from pure graphics to text in patterns. – mkl Jan 08 '16 at 07:45
  • @mkl who said anything about text in particular? In my question I talk about unselectable content. In fact, and to be more specific, the problem I have is precisely the opposite: I am not able to select images that are wrapped in text (which is selectable)... – Dr Jorge Jan 10 '16 at 01:16
  • *who said anything about text in particular?* You said "either text or images", and in my remark I focused on the former: text. But to go beyond, non selectable graphics can be anything from graphics in patterns to text (in type 3 fonts a character is drawn as its own content stream and can, therefore, contain anything). Thus, the request remains: Please supply a sample PDF. Furthermore you have not mentioned in which viewer the content was not selectable. This may also make a difference. By default I assume you mean Adobe Reader. – mkl Jan 10 '16 at 10:01
  • By the way, have you any restrictions concerning the format you wish that content to be extracted in? Depending on the content in question that may make a difference, too. If you e.g. want to extract vector graphics as a bitmap, we merely can draw the vector graphics on an image. If on the other hand you want a specific vector format, you might have to look for a library to write that format first. – mkl Jan 10 '16 at 10:07
  • Ideally, you'd have two docs, the same except one exhibits the problem. In any case, a sample file is likely necessary; the "selectable" quality may be a red-herring. If you have access to linux/bsd (or cygwin), you can try utilities like `pdfinfo {file}` to see more detail on pdf properties; or, see if other utilities "just happen" to work, e.g., `pdftotext` (both part of `poppler-utils` on ubuntu). Alternatively, here's a perl script to extract a region of a doc by coordinates (but I've not tried it): http://stackoverflow.com/questions/8986876/extract-a-region-of-a-pdf-page-by-coordinates – michael Jan 10 '16 at 11:30
  • I have just edited my question with the source code and with a link to a sample PDF. Please check it out and thank you all for your attention. – Dr Jorge Jan 10 '16 at 11:48
  • I just had a first look at your sample file. Indeed, in two cases there are images inside patterns which makes them unselectable in Adobe Reader and not extracted by standard text/image extraction parsers. This does not mean that they are not extractable, merely that one has to do a little coding here. I'm not in office today, but I'll have a look at this sometime tomorrow. – mkl Jan 10 '16 at 12:16

1 Answer

The two images in question, the fischer logo in the upper right and the small sketch a bit down, are each drawn by filling a region on the page with a tiling pattern which in turn in its content stream draws the respective image.

Adobe Reader does not allow selecting the contents of patterns, and automatic image extractors often do not walk the Pattern resource tree either.

PDFBox 1.8.10

You can use PDFBox to fairly easily build a pattern image extractor, e.g. for PDFBox 1.8.10:

public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
    List<PDPage> pages = document.getDocumentCatalog().getAllPages();
    if (pages == null)
        return;

    for (int i = 0; i < pages.size(); i++)
    {
        String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
        extractPatternImages(pages.get(i), pageFormat);
    }
}

public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
    PDResources resources = page.getResources();
    if (resources == null)
        return;
    Map<String, PDPatternResources> patterns = resources.getPatterns();

    for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
    {
        String patternFormat = String.format(pageFormat, "-" + patternEntry.getKey() + "%s", "%s");
        extractPatternImages(patternEntry.getValue(), patternFormat);
    }
}

public void extractPatternImages(PDPatternResources pattern, String patternFormat) throws IOException
{
    COSDictionary resourcesDict = (COSDictionary) pattern.getCOSDictionary().getDictionaryObject(COSName.RESOURCES);
    if (resourcesDict == null)
        return;
    PDResources resources = new PDResources(resourcesDict);
    Map<String, PDXObject> xObjects = resources.getXObjects();
    if (xObjects == null)
        return;

    for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
    {
        PDXObject xObject = entry.getValue();
        String xObjectFormat = String.format(patternFormat, "-" + entry.getKey() + "%s", "%s");
        if (xObject instanceof PDXObjectForm)
            extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
        else if (xObject instanceof PDXObjectImage)
            extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
    }
}

public void extractPatternImages(PDXObjectForm form, String imageFormat) throws IOException
{
    PDResources resources = form.getResources();
    if (resources == null)
        return;
    Map<String, PDXObject> xObjects = resources.getXObjects();
    if (xObjects == null)
        return;

    for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
    {
        PDXObject xObject = entry.getValue();
        String xObjectFormat = String.format(imageFormat, "-" + entry.getKey() + "%s", "%s");
        if (xObject instanceof PDXObjectForm)
            extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
        else if (xObject instanceof PDXObjectImage)
            extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
    }

    Map<String, PDPatternResources> patterns = resources.getPatterns();

    for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
    {
        String patternFormat = String.format(imageFormat, "-" + patternEntry.getKey() + "%s", "%s");
        extractPatternImages(patternEntry.getValue(), patternFormat);
    }
}

public void extractPatternImages(PDXObjectImage image, String imageFormat) throws IOException
{
    // Close the stream even if writing fails.
    try (FileOutputStream out = new FileOutputStream(String.format(imageFormat, "", image.getSuffix())))
    {
        image.write2OutputStream(out);
    }
}

(ExtractPatternImages.java)
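
The nested `String.format` calls in the code above thread a file-name template through each level of the resource tree: every level fills the first `%s` with its own segment and re-emits both placeholders for the next level, and the image level finally fills in an empty segment and the file suffix. Here is a library-free sketch of just that mechanism; the class and method names (`PatternFileNames`, `descend`, `finish`) are illustrative and not part of PDFBox:

```java
// Library-free sketch of the file-name chaining used by extractPatternImages.
// The class and method names here are illustrative, not part of PDFBox.
final class PatternFileNames
{
    // Appends one segment and re-emits both placeholders, so the result is
    // again a format string with a "next segment" slot and a suffix slot.
    static String descend(String format, Object segment)
    {
        return String.format(format, "-" + segment + "%s", "%s");
    }

    // Closes the chain: no further segment, and the real file suffix.
    static String finish(String format, String suffix)
    {
        return String.format(format, "", suffix);
    }

    public static void main(String[] args)
    {
        String page = descend("testDrJorge%s.%s", 0);   // testDrJorge-0%s.%s
        String pattern = descend(page, "R15");          // testDrJorge-0-R15%s.%s
        String xObject = descend(pattern, "R14");       // testDrJorge-0-R15-R14%s.%s
        System.out.println(finish(xObject, "png"));     // testDrJorge-0-R15-R14.png
    }
}
```

One caveat of this approach: a segment that itself contains a `%` character would be interpreted as a format specifier at the next level, so such names would need escaping first.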

I applied it to your sample PDF like this

public void testtestDrJorge() throws IOException
{
    try (InputStream resource = getClass().getResourceAsStream("testDrJorge.pdf"))
    {
        PDDocument document = PDDocument.load(resource);
        try
        {
            extractPatternImages(document, "testDrJorge%s.%s");
        }
        finally
        {
            // Release the document even if extraction fails.
            document.close();
        }
    }
}

(ExtractPatternImages.java)

and got two images:

  • testDrJorge-0-R15-R14.png

  • testDrJorge-0-R38-R37.png

The images have lost their red parts. This is most likely due to the fact that PDFBox 1.x.x versions do not properly support the extraction of CMYK images, cf. PDFBOX-2128 (CMYK images are not supported correctly), and your images are in CMYK.
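
For context, a CMYK image stores cyan, magenta, yellow and black components per pixel, so an extractor has to convert them before writing an RGB format such as PNG. The following is a minimal sketch of the textbook conversion without any ICC colour profile. It is neither what PDFBox does nor an attempt to reproduce the behaviour described in PDFBOX-2128; the class name `NaiveCmyk` is made up for illustration.

```java
// Textbook device-CMYK to RGB conversion, ignoring ICC profiles.
// Illustrative only; this is not PDFBox code.
final class NaiveCmyk
{
    // Components are in the range 0.0 (no ink) to 1.0 (full ink).
    // Returns the colour packed as 0xRRGGBB.
    static int toRgb(double c, double m, double y, double k)
    {
        int r = (int) Math.round(255 * (1 - c) * (1 - k));
        int g = (int) Math.round(255 * (1 - m) * (1 - k));
        int b = (int) Math.round(255 * (1 - y) * (1 - k));
        return (r << 16) | (g << 8) | b;
    }

    public static void main(String[] args)
    {
        // Full magenta plus full yellow is pure red in this model.
        System.out.printf("%06X%n", toRgb(0, 1, 1, 0)); // FF0000
    }
}
```

In this model red is produced by the magenta and yellow components together, so a reader that mishandles those channels would be expected to get reds wrong.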

PDFBox 2.0.0 release candidate

I updated the code to PDFBox 2.0.0 (currently available as release candidate only):

public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
    PDPageTree pages = document.getDocumentCatalog().getPages();
    if (pages == null)
        return;

    for (int i = 0; i < pages.getCount(); i++)
    {
        String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
        extractPatternImages(pages.get(i), pageFormat);
    }
}

public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
    PDResources resources = page.getResources();
    if (resources == null)
        return;
    Iterable<COSName> patternNames = resources.getPatternNames();

    for (COSName patternName : patternNames)
    {
        String patternFormat = String.format(pageFormat, "-" + patternName + "%s", "%s");
        extractPatternImages(resources.getPattern(patternName), patternFormat);
    }
}

public void extractPatternImages(PDAbstractPattern pattern, String patternFormat) throws IOException
{
    COSDictionary resourcesDict = (COSDictionary) pattern.getCOSObject().getDictionaryObject(COSName.RESOURCES);
    if (resourcesDict == null)
        return;
    PDResources resources = new PDResources(resourcesDict);
    Iterable<COSName> xObjectNames = resources.getXObjectNames();
    if (xObjectNames == null)
        return;

    for (COSName xObjectName : xObjectNames)
    {
        PDXObject xObject = resources.getXObject(xObjectName);
        String xObjectFormat = String.format(patternFormat, "-" + xObjectName + "%s", "%s");
        if (xObject instanceof PDFormXObject)
            extractPatternImages((PDFormXObject)xObject, xObjectFormat);
        else if (xObject instanceof PDImageXObject)
            extractPatternImages((PDImageXObject)xObject, xObjectFormat);
    }
}

public void extractPatternImages(PDFormXObject form, String imageFormat) throws IOException
{
    PDResources resources = form.getResources();
    if (resources == null)
        return;
    Iterable<COSName> xObjectNames = resources.getXObjectNames();
    if (xObjectNames == null)
        return;

    for (COSName xObjectName : xObjectNames)
    {
        PDXObject xObject = resources.getXObject(xObjectName);
        String xObjectFormat = String.format(imageFormat, "-" + xObjectName + "%s", "%s");
        if (xObject instanceof PDFormXObject)
            extractPatternImages((PDFormXObject)xObject, xObjectFormat);
        else if (xObject instanceof PDImageXObject)
            extractPatternImages((PDImageXObject)xObject, xObjectFormat);
    }

    Iterable<COSName> patternNames = resources.getPatternNames();

    for (COSName patternName : patternNames)
    {
        String patternFormat = String.format(imageFormat, "-" + patternName + "%s", "%s");
        extractPatternImages(resources.getPattern(patternName), patternFormat);
    }
}

public void extractPatternImages(PDImageXObject image, String imageFormat) throws IOException
{
    String filename = String.format(imageFormat, "", image.getSuffix());
    // Close the stream even if writing fails.
    try (FileOutputStream out = new FileOutputStream(filename))
    {
        ImageIOUtil.writeImage(image.getOpaqueImage(), "png", out);
    }
}

and get

  • testDrJorge-0-COSName{R15}-COSName{R14}.png

  • testDrJorge-0-COSName{R38}-COSName{R37}.png

Looks like an improvement... ;)
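
The `COSName{...}` fragments in these file names appear because the `COSName` objects are concatenated into the string directly, which uses their string representation. Using `patternName.getName()` and `xObjectName.getName()` in the extraction code should give the bare names instead. As a library-free alternative, here is a small sketch that cleans up such names afterwards; the class and method names are hypothetical:

```java
// Removes "COSName{...}" wrappers from a generated file name.
// Hypothetical helper for illustration; not part of PDFBox.
final class CosNameCleanup
{
    static String stripCosNameWrappers(String fileName)
    {
        // Keep only the text between the braces of each wrapper.
        return fileName.replaceAll("COSName\\{([^}]*)\\}", "$1");
    }

    public static void main(String[] args)
    {
        System.out.println(stripCosNameWrappers(
                "testDrJorge-0-COSName{R15}-COSName{R14}.png")); // testDrJorge-0-R15-R14.png
    }
}
```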

mkl
  • Just for completeness' sake: if you convert the entire file to RGB, do the images extract correctly? (Quite unexpectedly, my Acrobat Pro throws away the top right logo but does convert the second image to RGB.) – Jongware Jan 11 '16 at 14:24
  • @Jongware Cf. my edit, using PDFBox 2.0.0 both images from patterns extract correctly... ;) – mkl Jan 11 '16 at 14:59