I am trying to extract data from a PDF file which contains data in separate tables & convert to excel. Based on this link as my need is more or less the same, I am using PDFBOX jar to do the extraction.
To test whether I can first extract the data from different tables in the pdf, tried with the code specified below. But it does not extract & gives an error stating Corrupt object reference, don't know what it means.
To see if there was any issue with the pdf itself, I checked with https://online2pdf.com & it successfully converted the pdf file to excel, so I believe there is no issue with the pdf file.
Hope the issue I face is clear & await inputs on what needs to be done to extract the data from the pdf
Error message:
2016-07-21 13:49:11 WARN BaseParser:682 - Corrupt object reference at offset 6371
2016-07-21 13:49:11 WARN BaseParser:682 - Corrupt object reference at offset 6373
java.io.IOException: Expected string 'null' but missed at character 'u' at offset 6376
at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1017)
at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1000)
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:879)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:651)
at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:479)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:136)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
at main.Test.readPDF(Test.java:170)
at main.Test.main(Test.java:76)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Code :
public static void main(String[] args){
try {
File filePDF = new File("C:\\test.pdf");
PDDocument document = PDDocument.load(filePDF);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
System.out.println(content);
} catch (IOException e) {
e.printStackTrace();
}
}