Using PDFBox to remove Optional Content Groups that are not enabled

Question

I'm using Apache PDFBox from Java, and I have a source PDF with multiple optional content groups. I want to export a version of the PDF that includes only the standard content and the optional content groups that were enabled. It is important for my purposes that I preserve any dynamic aspects of the original, so text fields are still text fields, vector images are still vector images, etc. This is required because I ultimately intend to use a PDF form editor that does not know how to handle optional content and would blindly render all of it, so I want to preprocess the source PDF and run the form editor on a less cluttered destination PDF.

I've been searching Google for hints on how to do this, but to no avail. I don't know if I'm just using the wrong search terms, or if this is simply outside of what the PDFBox API was designed for. I rather hope it's not the latter. The info shown here does not seem to work (converting the C# code to Java), because although the PDF I'm trying to import has optional content, there do not seem to be any OC resources when I examine the tokens on each page.

    for(PDPage page:pages) {
        PDResources resources = page.getResources();            
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        Collection tokens = parser.getTokens();
        ...
    }

I'm truly sorry for not having any more code to show what I've tried so far, but I've been poring over the Java API docs for about 8 hours now trying to figure out what I might need to do this, and just haven't been able to figure it out.

What I DO know how to do is add text, lines, and images to a new PDPage, but I do not know how to retrieve that information from a given source page to copy it over, nor how to tell which optional content group such information is part of (if any). I am also not sure how to copy form fields in the source PDF over to the destination, nor how to copy the font information over.

Honestly, if there's a web page out there that my Google searches didn't turn up, I'd be entirely happy to read up more about it, but I am really quite stuck here, and I don't know anyone personally who knows this library.

Please help.

EDIT: Trying what I understand from what was suggested below, I've written a loop to examine each XObject on the page as follows:

PDResources resources = pdPage.getResources();
Iterable<COSName> names = resources.getXObjectNames();
for(COSName name:names) {
    PDXObject xobj = resources.getXObject(name);
    PDFStreamParser parser = new PDFStreamParser(xobj.getStream().toByteArray());
    parser.parse();
    Object [] tokens = parser.getTokens().toArray();
    for(int i = 0;i<tokens.length-1;i++) {
        Object obj = tokens[i];
        if (obj instanceof COSName && obj.equals(COSName.OC)) {
            i++;
            // use a new variable name; redeclaring obj here would not compile
            Object next = tokens[i];
            if (next instanceof COSName) {
                PDPropertyList props = resources.getProperties((COSName) next);
                if (props != null) {
...

However, after an OC key, the next entry in the tokens array is always an Operator tagged as "BMC". Nowhere am I finding any info that I can recognize from the named optional content groups.

This is quite some work... One has to find which OCGs are enabled, then go recursively through all content streams, as possibly explained in the current answer. Don't know about AcroForm. – Tilman Hausherr, Mar 31 '18 at 13:24
Actually, finding which OCGs are enabled is the easy part. I can retrieve the list of optional content groups from the document catalog via `catalog.getOCProperties().getOptionalContentGroups()`. I can then iterate through them and easily see which ones are enabled and which are not. The problem is that I can't figure out how to copy all of the content from one PDF over to a new document when the source content is not inside one of the disabled groups. – markt1964, Mar 31 '18 at 21:01

score 0 · Answer 1 · answered Mar 31 '18 at 10:02

In a page content stream, optional content is enclosed in marked-content sections that start with BDC and end with EMC. You will have to navigate through all of the tokens returned from the parser and remove the "section" from the array. Here is some C# code that was posted a while ago: "How to delete an optional content group alongwith its content from pdf using pdfbox?"

I investigated that (converting to Java) but couldn't get it to work as expected. I managed to remove the content between BDC and EMC and then save the result using the same technique as the sample, but the PDF was corrupted. Perhaps that is down to my lack of C# knowledge (related to Tuples etc.).

Here is what I came up with. As I said, it doesn't work; perhaps you or someone else (mkl, Tilman Hausherr) can spot the flaw.

    // Groovy
    void OCGDelete(PDDocument doc, int pageNum, String OCName) {
      PDPage pdPage = (PDPage) doc.getDocumentCatalog().getPages().get(pageNum);
      PDResources pdResources = pdPage.getResources();
      PDFStreamParser pdParser = new PDFStreamParser(pdPage);
      pdParser.parse()   // the token list is only filled after parsing

      int startIndex = -1
      int ocgStart = 0
      int ocgLength = 0
      def obj
      def prop

      // copy the tokens into a mutable list so that removeAt() is available
      List newTokens = new ArrayList(pdParser.getTokens())

      try {
        for (int index = 0; index < newTokens.length; index++) {
            obj = newTokens[index]
            if (obj instanceof COSName && obj.equals(COSName.OC)) {
                // println "Found COSName at "+index   /// Found Optional Content
                startIndex = index
                index++
                if (index < newTokens.size()) {
                    obj = newTokens[index]
                    if (obj instanceof COSName) {
                        prop = pdResources.getProperties(obj)
                        if (prop != null && prop instanceof PDOptionalContentGroup) {
                            if ((prop.getName()).equals(OCName)) {
                                println "Found the Layer to be deleted"
                                println "prop Name was " + prop.getName()

                                index++

                                if (index < newTokens.size()) {
                                    obj = newTokens[index]

                                    if ((obj.getName()).equals("BDC")) {
                                        ocgStart = index
                                        println("OCG Start " + ocgStart)
                                        ocgLength = -1
                                        index++

                                        while (index < newTokens.size()) {
                                            ocgLength++
                                            obj = newTokens[index]
                                            println " Loop through relevant OCG Tokens " + obj
                                            if (obj instanceof Operator && (obj.getName()).equals("EMC")) {

                                                println "the next obj was " + obj
                                                println "after that " + newTokens[index + 1] + "and then " + newTokens[index + 2]
                                                println("OCG End " + ocgLength++)
                                                break

                                            }

                                            index++
                                        }
                                        if (ocgLength > 0) {
                                            println "End Index was something " + (startIndex + ocgLength)

                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    catch (Exception ex){
        println ex.message
    }

    for (int i = ocgStart; i < ocgStart+ ocgLength; i++){
        newTokens.removeAt(i)
    }


    PDStream newContents = new PDStream(doc);
    OutputStream output = newContents.createOutputStream(COSName.FLATE_DECODE);
    ContentStreamWriter writer = new ContentStreamWriter(output);
    writer.writeTokens(newTokens);
    output.close();
    pdPage.setContents(newContents);

  }
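
One thing to watch in the final removal loop above: it calls `removeAt(i)` while `i` keeps increasing, and each removal shifts the remaining tokens one position to the left, so every other token in the range survives. The following is a minimal sketch of that effect and of one way around it. It uses plain strings in place of PDFBox tokens, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeRemovalSketch {

    /** Mirrors the loop above: the index advances while the list shrinks, so elements are skipped. */
    static List<String> removeWithMovingIndex(List<String> tokens, int start, int length) {
        List<String> copy = new ArrayList<>(tokens);
        for (int i = start; i < start + length && i < copy.size(); i++) {
            copy.remove(i); // after this call, the next token has moved into position i
        }
        return copy;
    }

    /** Removes the contiguous range [start, start + length) in one step. */
    static List<String> removeRange(List<String> tokens, int start, int length) {
        List<String> copy = new ArrayList<>(tokens);
        copy.subList(start, start + length).clear();
        return copy;
    }
}
```

For the simplified token list `q, /OC, /MC0, BDC, re, f, EMC, Q`, removing six tokens starting at index 1 with `removeRange` leaves `q, Q`, while the moving-index variant leaves a damaged stream. Note that the range in this sketch starts at the `/OC` operand and includes `EMC`, so no stray operands or unbalanced `EMC` remain.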

answered Mar 31 '18 at 10:02

PDFDev

Your token removal loop is incorrect: Replace `newTokens.removeAt(i)` by `newTokens.removeAt(ocgStart)`. As it is now, you remove every other element which obviously damages your stream considerably. (I haven't looked at the rest of your code in detail yet but that error just caught my eyes.) – mkl Mar 31 '18 at 11:15
@mkl Thanks. I tried many variations on this theme and none seemed to produce a valid PDF. One of the variations dumped the tokens; they seemed to correspond to the correct BDC/EMC content. Just to check my logic, I used the DeleteText sample to capture the text on a PDPage and tried to put that on a layer. That too produced a corrupted PDF. Won't stop till I have a solution. – PDFDev Mar 31 '18 at 11:50
Marked content has MP, DP, BMC, BDC, EMC, see p. 320 in the PDF specification. – Tilman Hausherr Mar 31 '18 at 13:25
**Update:** Progress of sorts. I revisited the loop and it appears to work (started the ocgLength at zero and removed the println statement that updated the ocgLength). At least on my test sample (from Illustrator with multiple layers). When I open the modified PDFs I'm not alerted to an error or prompted to save them. I've only validated this on a few PDFs, so YMMV - at least it is a start on exit. Still doesn't work on a PDF with Nested Layers - generated by iText. – PDFDev Mar 31 '18 at 14:18
@TilmanHausherr I'm specifically interested in Optional Content Groups so my code is limited to that use case. Once I get this working, I'll expand the feature set based on the project I am currently working on. – PDFDev Mar 31 '18 at 14:30
Thank you, although I had a link to that same question within my own, and ignoring that the above code is not java, but more like javascript, I had also noted that the solution there, and here does not work. The problem I am seeing is that in the `newTokens` array, there are no entries which satisfy the `(obj instanceof COSName && obj.equals(COSName.OC))` condition... although there most definitely is optional content on each page. – markt1964 Mar 31 '18 at 14:33
@PDFDev I've just re-read the section 8.11.3 "Making graphical content optional" of ISO 32000-2, and one issue of your approach becomes obvious: You simply delete everything between **BDC** and **EMC**. But that is too much! Even if they are in some invisible OCG, *graphics state operations, such as setting the colour, transformation matrix, and clipping, shall still be applied. In addition, graphics state side effects that arise from drawing operators shall be applied.* If you remove a layer, therefore, you have to keep the effects on the graphics state! – mkl Mar 31 '18 at 16:13
(Most likely many PDF generators prevent content of separate OCGs influencing each other for the sake of simplicity; thus, you often will have no problem even if you delete everything between **BDC** and **EMC**. But if you want to create a generic implementation for removing certain OCGs, you'll have to be more selective.) – mkl Mar 31 '18 at 16:18
@markt1964 *"there are no entries which satisfy the (obj instanceof COSName && obj.equals(COSName.OC)) condition... although there most definitely is optional content"* - in addition to sections of the page content stream which PDFDev focuses on, whole form XObjects can be made part of an OCG, and in that case one has to look for an **OC** key in the XObject dictionary in question. – mkl Mar 31 '18 at 16:26
@mkl I've added some more details to the problem, having tried what I understand from what you are saying here, but I still cannot figure out how to get at the optional content groups, let alone delete them. – markt1964 Mar 31 '18 at 20:31
@markt1964 Apologies, it is Groovy and not Java (or JavaScript) and I did see you'd mentioned the same article after I'd pressed enter and outside the time to edit - my bad – PDFDev Apr 01 '18 at 09:49
@markt1964 Can you supply a sample file to test against? – mkl Apr 01 '18 at 16:20
@mkl Here's a link to the file I was opening with pdfbox: https://drive.google.com/file/d/1MiBH4Gw1UmXhEfTNeEk6aRpkK4Nzdx4E/view?usp=sharing – markt1964 Apr 01 '18 at 17:06
Your example file is indeed an example of whole XObjects being marked as optional content by **OC** entries in the XObject dictionaries. – mkl Apr 02 '18 at 10:15
Right, but I don't know how to tell which OCG entries belong to which optional content groups that I can get a list of via the document catalog. – markt1964 Apr 09 '18 at 04:12
Marked Content may be nested. This code does not take that into account. Such regions are opened by `BMC` or `BDC` and closed by `EMC` – Christoph Dietze Jun 26 '18 at 13:58
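
The nesting point from the last comment can be illustrated without PDFBox. The sketch below is a simplified model, not PDFBox API: each content stream instruction is a plain string whose last word is the operator, and the class name, method name and predicate are hypothetical. It keeps a depth counter so that a `BMC`/`BDC` opened inside a removed section is paired with its own `EMC` instead of ending the removal early.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class NestedMarkedContentSketch {

    /**
     * Drops every marked-content section whose opening instruction matches,
     * together with any sections nested inside it.
     */
    static List<String> filter(List<String> instructions, Predicate<String> shouldRemove) {
        List<String> kept = new ArrayList<>();
        int suppressDepth = 0; // greater than 0 while inside a section being removed
        for (String instruction : instructions) {
            String[] words = instruction.trim().split("\\s+");
            String operator = words[words.length - 1];
            boolean opens = operator.equals("BMC") || operator.equals("BDC");
            boolean closes = operator.equals("EMC");

            if (opens && (suppressDepth > 0 || shouldRemove.test(instruction))) {
                suppressDepth++; // nested openers are counted too, so their EMC is matched
                continue;
            }
            if (closes && suppressDepth > 0) {
                suppressDepth--; // this EMC belongs to a removed section
                continue;
            }
            if (suppressDepth == 0) {
                kept.add(instruction);
            }
        }
        return kept;
    }
}
```

As mkl points out above, this model drops everything inside a removed section, including graphics state changes, so it is only safe when the removed sections do not influence the content that follows them.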

score 0 · Answer 2 · answered Aug 19 '20 at 17:14

Here's a robust solution for removing marked content blocks (open to feedback if anyone finds anything that isn't working right). You should be able to adjust for OC blocks...

This code properly handles nesting and removal of resources (xobject, graphics state and fonts - easy to add others if needed).

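import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorName;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;

public class MarkedContentRemover {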

    private final MarkedContentMatcher matcher;
    
    /**
     * @param matcher decides which marked-content sections are removed
     */
    public MarkedContentRemover(MarkedContentMatcher matcher) {
        this.matcher = matcher;
    }
    
    public int removeMarkedContent(PDDocument doc, PDPage page) throws IOException {
        ResourceSuppressionTracker resourceSuppressionTracker = new ResourceSuppressionTracker();
        
        PDResources pdResources = page.getResources();

        PDFStreamParser pdParser = new PDFStreamParser(page);
        
        
        PDStream newContents = new PDStream(doc);
        OutputStream newContentOutput = newContents.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter newContentWriter = new ContentStreamWriter(newContentOutput);
        
        List<Object> operands = new ArrayList<>();
        Operator operator = null;
        Object token;
        int suppressDepth = 0;
        boolean resumeOutputOnNextOperator = false;
        int removedCount = 0;
        
        while (true) {

            operands.clear();
            token = pdParser.parseNextToken();
            while(token != null && !(token instanceof Operator)) {
                operands.add(token);
                token = pdParser.parseNextToken();
            }
            operator = (Operator)token;
            
            if (operator == null) break;
            
            if (resumeOutputOnNextOperator) {
                resumeOutputOnNextOperator = false;
                suppressDepth--;
                if (suppressDepth == 0)
                    removedCount++;
            }
            
            if (OperatorName.BEGIN_MARKED_CONTENT_SEQ.equals(operator.getName())
                    || OperatorName.BEGIN_MARKED_CONTENT.equals(operator.getName())) {
                
                COSName contentId = (COSName)operands.get(0);

                final COSDictionary properties;
                if (operands.size() > 1) {
                    Object propsOperand = operands.get(1);
                    
                    if (propsOperand instanceof COSDictionary) {
                        properties = (COSDictionary) propsOperand;
    
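                    } else if (propsOperand instanceof COSName) {
                        // the name refers to the page's /Properties resources; the entry may be missing
                        PDPropertyList propertyList = pdResources.getProperties((COSName)propsOperand);
                        properties = propertyList != null ? propertyList.getCOSObject() : new COSDictionary();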
                    } else {
                        properties = new COSDictionary();
                    }
                } else {
                    properties = new COSDictionary();
                }
                
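                // once inside a suppressed section, count every nested BMC/BDC as well,
                // so that its EMC does not end the suppression early
                if (suppressDepth > 0 || matcher.matches(contentId, properties)) {
                    suppressDepth++;
                }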
                
            }
        
            if (OperatorName.END_MARKED_CONTENT.equals(operator.getName())) {
                if (suppressDepth > 0)
                    resumeOutputOnNextOperator = true;
            }

            else if (OperatorName.SET_GRAPHICS_STATE_PARAMS.equals(operator.getName())) {
                resourceSuppressionTracker.markForOperator(COSName.EXT_G_STATE, operands.get(0), suppressDepth == 0);
            }

            else if (OperatorName.DRAW_OBJECT.equals(operator.getName())) {
                resourceSuppressionTracker.markForOperator(COSName.XOBJECT, operands.get(0), suppressDepth == 0);
            }
            
            else if (OperatorName.SET_FONT_AND_SIZE.equals(operator.getName())) {
                resourceSuppressionTracker.markForOperator(COSName.FONT, operands.get(0), suppressDepth == 0);
            }
            
            

            if (suppressDepth == 0) {
                newContentWriter.writeTokens(operands);
                newContentWriter.writeTokens(operator);
            }

        }
        
        if (resumeOutputOnNextOperator)
            removedCount++;

        

        newContentOutput.close();

        page.setContents(newContents);
        
        resourceSuppressionTracker.updateResources(pdResources);
        
        return removedCount;
    }

    
    private static class ResourceSuppressionTracker{
        // if the boolean is TRUE, then the resource should be removed.  If the boolean is FALSE, the resource should not be removed
        private final Map<COSName, Map<COSName, Boolean>> tracker = new HashMap<>();
        
        public void markForOperator(COSName resourceType, Object resourceNameOperand, boolean preserve) {
            if (!(resourceNameOperand instanceof COSName)) return;
            if (preserve) {
                markForPreservation(resourceType, (COSName)resourceNameOperand);
            } else {
                markForRemoval(resourceType, (COSName)resourceNameOperand);
            }
        }
        
        public void markForRemoval(COSName resourceType, COSName refId) {
            if (!resourceIsPreserved(resourceType, refId)) {
                getResourceTracker(resourceType).put(refId, Boolean.TRUE);
            }
        }

        public void markForPreservation(COSName resourceType, COSName refId) {
            getResourceTracker(resourceType).put(refId, Boolean.FALSE);
        }
        
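        public void updateResources(PDResources pdResources) {
            for (Map.Entry<COSName, Map<COSName, Boolean>> resourceEntry : tracker.entrySet()) {
                // use the dictionary of the tracked resource type (XObject, ExtGState or Font), not always XObject
                COSDictionary typeDictionary = pdResources.getCOSObject().getCOSDictionary(resourceEntry.getKey());
                if (typeDictionary == null) continue;
                for(Map.Entry<COSName, Boolean> refEntry : resourceEntry.getValue().entrySet()) {
                    if (refEntry.getValue().equals(Boolean.TRUE)) {
                        typeDictionary.removeItem(refEntry.getKey());
                    }
                }
            }
        }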
        
        private boolean resourceIsPreserved(COSName resourceType, COSName refId) {
            // FALSE in the tracker means "preserve"; a missing entry has not been marked for preservation
            return Boolean.FALSE.equals(getResourceTracker(resourceType).get(refId));
        }
        
        private Map<COSName, Boolean> getResourceTracker(COSName resourceType){
            if (!tracker.containsKey(resourceType)) {
                tracker.put(resourceType, new HashMap<>());
            }
            
            return tracker.get(resourceType);
            
        }
    }
    
}

Helper class:

public interface MarkedContentMatcher {
    public boolean matches(COSName contentId, COSDictionary props);
}
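
One detail of the `ResourceSuppressionTracker` above is worth isolating: a resource may only be deleted if it is referenced exclusively from removed content, regardless of the order in which the references are seen. The following is a minimal model of that rule using plain strings instead of `COSName`. The class and method names are hypothetical, and unlike the real tracker it does not distinguish resource types.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ResourceUsageSketch {
    // true = referenced by kept content; false = only seen in removed content so far
    private final Map<String, Boolean> usedByKeptContent = new HashMap<>();

    /** Records one reference to a resource, either from kept or from removed content. */
    void record(String resourceName, boolean inKeptContent) {
        // logical OR: a single use in kept content can never be overwritten by a later removal
        usedByKeptContent.merge(resourceName, inKeptContent, Boolean::logicalOr);
    }

    /** Names of the resources that no kept content refers to. */
    Set<String> removable() {
        Set<String> names = new TreeSet<>();
        usedByKeptContent.forEach((name, kept) -> {
            if (!kept) {
                names.add(name);
            }
        });
        return names;
    }
}
```

In this model, an image that is drawn once in kept content and once inside a removed section stays in the resources, and only names that were never used by kept content are reported as removable.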
