
Unable to parse a PDF document as (key, value) pairs. Can anyone please help me parse a PDF file in a structured manner?

I was able to extract the text from the PDF file using the Java code below.

    org.apache.pdfbox.pdmodel.PDDocument doc = null;
    org.apache.pdfbox.text.PDFTextStripper pdfStripper;

    java.io.File pdfFile = new java.io.File(filePathAv);
    try {
        // PDDocument.load is static; load(File) already tries the empty password
        doc = org.apache.pdfbox.pdmodel.PDDocument.load(pdfFile);
        if (doc.isEncrypted()) {
            // Drop the security settings on the in-memory document
            doc.setAllSecurityToBeRemoved(true);
        }
        pdfStripper = new org.apache.pdfbox.text.PDFTextStripper();
        ExtractedText = pdfStripper.getText(doc);
    } catch (Exception e) {
        throw new PRRuntimeException(e);
    } finally {
        if (doc != null) {
            try {
                doc.close();
            } catch (Exception e) {
                throw new PRRuntimeException(e);
            }
        }
    }

If there is a table in the PDF file, can we extract the LHS and RHS separately?

Avinash
  • The PDF contains a table with two columns and many rows. 1) Can we store each column in a separate array, or 2) can we store the rows as key-value pairs? My question is: can PDFBox identify tables in a PDF? – Avinash Sep 01 '19 at 07:51
  • PDF does not know about tables. Basically, it positions text and graphics on the page (e.g. put the word "Hello" in a certain font 2cm from the top and 5cm from the left). To recognize higher-level concepts one has to use heuristics (which sometimes fail). – Henry Sep 01 '19 at 07:59
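
To make the "heuristics" point concrete, here is a minimal sketch of one such heuristic for a two-column table. It assumes you already have each piece of text together with its page coordinates (with PDFBox these could be collected from `TextPosition` objects by overriding `PDFTextStripper.writeString`, which is not shown here). The `Fragment` class, the `yTolerance` and `splitX` parameters, and the sample coordinates are all hypothetical and would need tuning per document; this is an illustration, not something PDFBox provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwoColumnTable {

    /** Hypothetical holder for one piece of text and its position on the page. */
    static final class Fragment {
        final float x;      // distance from the left edge
        final float y;      // distance from the top edge
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Groups fragments into rows (y within yTolerance of the row's first fragment),
     * then splits each row at splitX: text left of it is the key, the rest is the value.
     */
    static Map<String, String> toKeyValue(List<Fragment> fragments, float yTolerance, float splitX) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)); // top to bottom

        List<List<Fragment>> rows = new ArrayList<>();
        float rowY = 0f;
        for (Fragment f : sorted) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>()); // start a new row
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }

        Map<String, String> result = new LinkedHashMap<>(); // keeps row order
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x)); // left to right
            StringBuilder key = new StringBuilder();
            StringBuilder value = new StringBuilder();
            for (Fragment f : row) {
                StringBuilder target = f.x < splitX ? key : value;
                if (target.length() > 0) {
                    target.append(' ');
                }
                target.append(f.text);
            }
            result.put(key.toString(), value.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates standing in for what a PDF library would report
        List<Fragment> page = Arrays.asList(
                new Fragment(50f, 100.0f, "Name"),
                new Fragment(300f, 100.4f, "Alice"),
                new Fragment(50f, 120.0f, "Order"),
                new Fragment(85f, 120.0f, "ID"),
                new Fragment(300f, 119.8f, "42"));
        System.out.println(toKeyValue(page, 2f, 200f)); // prints {Name=Alice, Order ID=42}
    }
}
```

The returned map covers the row-wise key-value case; `keySet()` and `values()` give the two columns separately. As Henry notes, this kind of guess can fail: rows with the same key overwrite each other, and multi-line cells or a shifted column boundary will be split incorrectly.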
