5

Can anybody help me extract table data using iText or PDFBox? I have a PDF with 1000 pages; my job is to parse the PDF and store the data in a database.

itsvks
  • If you want to try doing that with iText(Sharp), this thread on the iText mailing list may be of interest to you: [parse tabular data in PDF using iTextSharp](http://itext-general.2136553.n4.nabble.com/parse-tabular-data-in-PDF-using-iTextSharp-tt4657013.html). As @mark said in his answer, though, generic solutions are hit and miss. If your 1000 pages have very uniform tables, a specially tailored extraction routine might be the best way to go. – mkl Jan 15 '13 at 09:26
  • Possible duplicate of [Parsing PDF files (especially with tables) with PDFBox](https://stackoverflow.com/questions/3203790/parsing-pdf-files-especially-with-tables-with-pdfbox) – beldaz Oct 15 '17 at 21:21

2 Answers

4

PDFs do not contain any table structure elements unless the file contains additional XML to define the table. Otherwise there is no structure. I wrote a blog article on how to find out whether a file has it.

Some tools like PDFBox will make an effort to guess the table, but it can be hit and miss.

Alexis Pigeon
mark stephens
  • Thanks for replying. Our problem is that we have a PDF file which contains records of examination results, which means the PDF has rows and columns. How do we parse that PDF using PDFBox and store the data in a database? – itsvks Jan 15 '13 at 14:37
  • @user1958037 Have you meanwhile tried to use PDFBox as proposed by mark, or iText along the lines of the mailing list thread I referred to? What problem have you run into? Furthermore, storing data in a database is a different matter altogether; what are your issues there? – mkl Jan 16 '13 at 09:48
1

You can use this code to extract the text of the document as a single string:

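// PDFBox 1.x: PDDocument.load accepts a String path (2.x takes a java.io.File)
PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
document.close(); // release the file once the text has been extracted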

Then you can use Java regular expressions to parse the text row by row and load the values into your Java POJO beans.
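
As a rough illustration of that row-by-row regex step, here is a minimal sketch using only the JDK. The row layout (roll number, name, marks), the `ExamResult` bean, and the `ResultRowParser` class are all assumptions made up for this example; the real pattern has to be adapted to whatever the extracted text of your PDF actually looks like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResultRowParser {

    /** Hypothetical bean for one table row; the field names are assumptions. */
    static final class ExamResult {
        final String rollNumber;
        final String studentName;
        final int marks;

        ExamResult(String rollNumber, String studentName, int marks) {
            this.rollNumber = rollNumber;
            this.studentName = studentName;
            this.marks = marks;
        }
    }

    // Assumed row layout: <digits> <name, may contain spaces> <marks, 1 to 3 digits>
    private static final Pattern ROW =
            Pattern.compile("^(\\d+)\\s+(.+?)\\s+(\\d{1,3})$");

    /** Splits the extracted text into lines and keeps only lines that look like data rows. */
    static List<ExamResult> parse(String content) {
        List<ExamResult> results = new ArrayList<>();
        for (String line : content.split("\\R")) {
            Matcher m = ROW.matcher(line.trim());
            if (m.matches()) {
                results.add(new ExamResult(
                        m.group(1), m.group(2), Integer.parseInt(m.group(3))));
            }
            // Lines that do not match (headers, page footers, blank lines) are skipped.
        }
        return results;
    }
}
```

Calling `ResultRowParser.parse(content)` on the string returned by `getText` gives a list of beans that can then be written to the database, for example with a JDBC batch insert. Silently skipping non-matching lines is a design choice for this sketch; for 1000 pages it may be safer to log them so that rows the pattern misses do not disappear unnoticed.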