Java - Issue with data extraction from PDF (PDFBox - 2.02)

Question

I am trying to extract data from a PDF file that contains several separate tables and convert it to Excel. Based on this link, as my need is more or less the same, I am using the PDFBox jar to do the extraction.

To first test whether I can extract the data from the different tables in the PDF at all, I tried the code below. It does not extract anything and reports "Corrupt object reference"; I don't know what that means.

To rule out a problem with the PDF itself, I ran it through https://online2pdf.com, which successfully converted it to Excel, so I believe the file itself is fine.

I hope the issue is clear. Any input on what needs to be done to extract the data from the PDF would be appreciated.

Error message:

2016-07-21 13:49:11 WARN  BaseParser:682 - Corrupt object reference at offset 6371
2016-07-21 13:49:11 WARN  BaseParser:682 - Corrupt object reference at offset 6373

java.io.IOException: Expected string 'null' but missed at character 'u' at offset 6376
    at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1017)
    at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1000)
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:879)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:651)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:479)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
    at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:136)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
    at main.Test.readPDF(Test.java:170)
    at main.Test.main(Test.java:76)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

Code:

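import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Test {
    public static void main(String[] args) {
        File filePDF = new File("C:\\test.pdf");
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(filePDF)) {
            PDFTextStripper stripper = new PDFTextStripper();
            String content = stripper.getText(document);
            System.out.println(content);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}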

@Setasign Since the PDF document contains financial info, I won't be able to share it. However, you can have a look at the sample, which contains the kind of info I am trying to extract (it can be downloaded from the link as well). The link is -> https://www.dropbox.com/s/g5iorxzvg92ye1i/Sample%20Contract.pdf?raw=1 – iCoder, Jul 21 '16 at 13:05
The file does not produce the "Corrupt object reference" error. However, there is no text extraction result, because the fonts don't have a ToUnicode entry. Try copy & paste with Adobe Reader; it won't work either. I suspect this is done on purpose, to keep people from extracting the data to provide services that the creator of the PDF also offers. – Tilman Hausherr, Jul 21 '16 at 20:41
@TilmanHausherr Agreed, the sample does not produce the error, but the file I have does. Either way, the extraction doesn't work in both cases. No, the creator of the PDF does not provide that service. Maybe the way the PDF was made is not recognised by PDFBox. I just wonder what tool is used by the website cited in my question, as it does the extraction just fine. Maybe I need to see if there are any other open source PDF extractors; any pointers towards that would be helpful. – iCoder, Jul 22 '16 at 04:44
Try icepdf, jpedal and itext; they all have text extraction. Maybe the tool you pointed to does OCR. – Tilman Hausherr, Jul 22 '16 at 05:25
@TilmanHausherr Thank you very much for pointing me towards icepdf. This one looks promising, although the extracted output needs quite a lot of post-processing to get at the info. I downloaded the open source version, as the other two are paid versions only. – iCoder, Jul 22 '16 at 15:35
Unfortunately, even icepdf is not able to extract all the info from the file. Maybe it has something to do with the way the file was created, which is making it hard. I am clueless now on how to get this working. – iCoder, Jul 23 '16 at 09:47

score 0 · Answer 1 · answered Jul 27 '16 at 06:11

I finally found a library (PDFxStream, shipped as a jar) that extracts all the data from the PDF in this case. Although it is a paid product, it is able to extract the complete info, which the other paid ones were not able to extract.

The only catch is that it returns the content as a String, so I would still need to parse this String to pull out the specific info.
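
For that parsing step, here is a minimal sketch using only the Java standard library. It assumes the extracted text keeps one table row per line and separates columns with two or more spaces or a tab; that is an assumption about the layout of the output, not something PDFxStream (or any other extractor) is known to guarantee. The names ExtractedTextParser, toRows and toCsv are made up for illustration, and the sample text in main is invented, not taken from the real document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExtractedTextParser {

    // Assumption: a column boundary is a run of 2+ whitespace characters or a single tab.
    private static final String COLUMN_GAP = "\\s{2,}|\\t";

    /** Splits extracted text into rows of cells, skipping blank lines. */
    static List<List<String>> toRows(String extractedText) {
        List<List<String>> rows = new ArrayList<>();
        // \R matches any line break (\n, \r\n, ...)
        for (String line : extractedText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            rows.add(Arrays.asList(trimmed.split(COLUMN_GAP)));
        }
        return rows;
    }

    /** Renders rows as CSV, quoting cells that contain a comma or a double quote. */
    static String toCsv(List<List<String>> rows) {
        StringBuilder csv = new StringBuilder();
        for (List<String> row : rows) {
            List<String> escaped = new ArrayList<>();
            for (String cell : row) {
                boolean needsQuotes = cell.contains(",") || cell.contains("\"");
                escaped.add(needsQuotes
                        ? "\"" + cell.replace("\"", "\"\"") + "\""
                        : cell);
            }
            csv.append(String.join(",", escaped)).append("\n");
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        // Invented sample standing in for the String returned by the extractor.
        String extracted = "Item  Qty  Price\n\nWidget A  2  1,250.00\n";
        System.out.print(toCsv(toRows(extracted)));
    }
}
```

The CSV output can be opened directly in Excel, which covers the "convert to Excel" part without another dependency. If the real output does not separate columns by whitespace runs, the COLUMN_GAP pattern is the piece to adapt.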

answered Jul 27 '16 at 06:11

iCoder


If you had shared a PDF with which the issue can be reproduced, you might have got a solution based on PDFBox and at the same time PDFBox would have been improved for all. – mkl Jul 27 '16 at 10:09
@mkl I understand your point. But as mentioned in my initial posting, the file I am actually working with contains financial info and hence I cannot share it. I provided a link to the sample, which contains data representative of what I need to extract; if that can be used to enhance PDFBox, that would be wonderful. – iCoder Jul 27 '16 at 10:52
