3

Please don't mark as duplicate. I have already been through many Stackoverflow links but they didn't solve my problem.

What I'm trying to do : I have to extract data from around 1,50,000 pdf files.

A sample pdf : All these pdf are identical in structure and contains data in a tabular format (No image). A snapshot of pdf looks like this.

enter image description here

What I've done : I have used pdf2htmlEX terminal command with Nodejs to convert the pdf file to html.

var child_process = require('child_process');
var request = require('request');
var spawn = child_process.spawn;

var url = 'http://url_to_extract_data_from_pdf?Id=' + id;    //id ranges from 1 to 1,50,000
var pdfFileStream = fs.createWriteStream(id + '.pdf');

request(url).pipe(pdfFileStream);

pdfFileStream.on('finish', function () {
    console.log('Pdf file downloaded');

    var pdfToHtml = spawn('pdf2htmlEX', [id + '.pdf']);

    pdfToHtml.on('close', function () {
        console.log('Pdf file converted to html');

        jsdom.env({
            url: "http://localhost:1000/" + id + ".html",    //hard coded url for server -> current server running on localhost:1000
            scripts: ["http://code.jquery.com/jquery.js"],
            done: function (err, window) {

                if(err)
                    console.log(err);

                else {
                    var $ = window.$;

                    //jquery selectors to extract data
                    console.log($(".x14.y30").text().trim());
                    console.log($(".x15.y31").text().trim());
                    console.log($(".x16.y32").text().trim());
                }
            }
        });
    });
});

Converted html file looks like this : The combination of class name x followed by a character and y followed by a character was unique for a particular div. For eg. there was only one div with xf and y10 class.

enter image description here

Where I'm stuck : Although all the pdf are identical in format and structure, the html file generated is not. So lets say $(".x14.y30").text() might be giving me something in pdf - 1, it would be giving something else in pdf - 2. I have also looked for some way in which I can modify the way classes were being assigned while a pdf file was converted to html. But all in vain. Extracted data needs to be then stored in a tab separated format.

Using this approach is not mandatory. Any better suggestion is welcome.

Akshay Soam
  • 1,482
  • 3
  • 20
  • 37
  • As the structure of the pdfs is the same, so should be the htmls structure. Thus instead of the class names use the order of the divs, ie I would expect that the first div in each html contains the same field, second div in each html is the same field and so on... – ain Mar 26 '16 at 11:08
  • Furthermore, if *all the pdf are identical in format and structure, the html file generated is not*, the difference most likely is caused by some internal difference between the pdfs. Thus, you had better share pdfs with different output. – mkl Mar 26 '16 at 13:30
  • Maybe not order but relative order ... like the first div after a div containing the string "date:" – Kevin Brown Mar 26 '16 at 17:08

0 Answers0