Extract data from pdf

Question

Please don't mark this as a duplicate. I have already been through many Stack Overflow links, but they didn't solve my problem.

What I'm trying to do : I have to extract data from around 150,000 (1,50,000) PDF files.

A sample pdf : All these PDFs are identical in structure and contain data in a tabular format (no images). A snapshot of a PDF looks like this.

What I've done : I have used the pdf2htmlEX terminal command with Node.js to convert each PDF file to HTML.

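var child_process = require('child_process');
var fs = require('fs');          // needed for fs.createWriteStream below
var request = require('request');
var jsdom = require('jsdom');    // needed for jsdom.env below
var spawn = child_process.spawn;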

var url = 'http://url_to_extract_data_from_pdf?Id=' + id;    //id ranges from 1 to 1,50,000
var pdfFileStream = fs.createWriteStream(id + '.pdf');

request(url).pipe(pdfFileStream);

pdfFileStream.on('finish', function () {
    console.log('Pdf file downloaded');

    var pdfToHtml = spawn('pdf2htmlEX', [id + '.pdf']);

    pdfToHtml.on('close', function () {
        console.log('Pdf file converted to html');

        jsdom.env({
            url: "http://localhost:1000/" + id + ".html",    //hard coded url for server -> current server running on localhost:1000
            scripts: ["http://code.jquery.com/jquery.js"],
            done: function (err, window) {

                if(err)
                    console.log(err);

                else {
                    var $ = window.$;

                    //jquery selectors to extract data
                    console.log($(".x14.y30").text().trim());
                    console.log($(".x15.y31").text().trim());
                    console.log($(".x16.y32").text().trim());
                }
            }
        });
    });
});

Converted html file looks like this : Each div carries one class starting with x and one class starting with y, and the combination of the two is unique to that div. For example, there is only one div with both the xf and y10 classes.

Where I'm stuck : Although all the PDFs are identical in format and structure, the generated HTML files are not. So, for example, $(".x14.y30").text() might return one field for pdf - 1 but a different field for pdf - 2. I have also looked for a way to control how classes are assigned when a PDF file is converted to HTML, but in vain. The extracted data then needs to be stored in a tab-separated format.
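For the tab-separated output, a minimal sketch of how one record could be turned into a line might look like this. The helper name toTsvLine and the sample values are made up for illustration, and flattening embedded tabs and line breaks to spaces is an assumption about what the output should look like, not a requirement from the PDFs.

```javascript
// Hypothetical helper: turn one record's field values into a single TSV line.
function toTsvLine(fields) {
    return fields.map(function (field) {
        // A tab or line break inside a value would corrupt the row,
        // so flatten any run of them to a single space and trim the ends.
        return String(field).replace(/[\t\r\n]+/g, ' ').trim();
    }).join('\t');
}

// Made-up sample values standing in for the extracted fields.
var line = toTsvLine(['42', ' John\tDoe ', 'line one\nline two']);
console.log(line);
```

Each line could then be appended to an output file, for example with fs.appendFile, once the extraction for one PDF is done.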

Using this approach is not mandatory. Any better suggestion is welcome.

As the structure of the pdfs is the same, so should be the htmls structure. Thus instead of the class names use the order of the divs, ie I would expect that the first div in each html contains the same field, second div in each html is the same field and so on... — ain, Mar 26 '16 at 11:08
Furthermore, if *all the pdf are identical in format and structure, the html file generated is not*, the difference most likely is caused by some internal difference between the pdfs. Thus, you had better share pdfs with different output. — mkl, Mar 26 '16 at 13:30
Maybe not order but relative order ... like the first div after a div containing the string "date:" — Kevin Brown, Mar 26 '16 at 17:08
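The relative-order idea from the comments above could be sketched roughly as follows. This is only an illustration under assumptions: valueAfterLabel is a hypothetical helper, the sample texts are invented, and the input is assumed to be the text of each innermost div in document order (in the question's code that array might be collected with jQuery's .map() and .get(), but nested container divs would need to be filtered out first).

```javascript
// Hypothetical helper: given the text of every leaf div in document order,
// return the text of the first non-empty div that follows a div
// containing the label (case-insensitive). Returns null if nothing is found.
function valueAfterLabel(divTexts, label) {
    var wanted = label.toLowerCase();
    for (var i = 0; i < divTexts.length; i++) {
        if (divTexts[i].toLowerCase().indexOf(wanted) !== -1) {
            // Skip whitespace-only divs between the label and its value.
            for (var j = i + 1; j < divTexts.length; j++) {
                var text = divTexts[j].trim();
                if (text) {
                    return text;
                }
            }
        }
    }
    return null;
}

// Invented sample data, not taken from the actual PDFs.
var sampleDivs = ['Report', 'Date:', ' ', '26/03/2016', 'Name:', 'John'];
console.log(valueAfterLabel(sampleDivs, 'date:'));   // 26/03/2016
console.log(valueAfterLabel(sampleDivs, 'name:'));   // John
```

Because the lookup keys on the label text rather than on the generated x/y class names, it should not matter if pdf2htmlEX assigns different classes from one file to the next, as long as the labels themselves are stable.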
