Please don't mark as duplicate. I have already been through many Stackoverflow links but they didn't solve my problem.
What I'm trying to do : I have to extract data from around 1,50,000 pdf files.
A sample pdf : All these pdf are identical in structure and contains data in a tabular format (No image). A snapshot of pdf looks like this.
What I've done : I have used pdf2htmlEX
terminal command with Nodejs
to convert the pdf file to html.
var child_process = require('child_process');
var request = require('request');
var spawn = child_process.spawn;
var url = 'http://url_to_extract_data_from_pdf?Id=' + id; //id ranges from 1 to 1,50,000
var pdfFileStream = fs.createWriteStream(id + '.pdf');
request(url).pipe(pdfFileStream);
pdfFileStream.on('finish', function () {
console.log('Pdf file downloaded');
var pdfToHtml = spawn('pdf2htmlEX', [id + '.pdf']);
pdfToHtml.on('close', function () {
console.log('Pdf file converted to html');
jsdom.env({
url: "http://localhost:1000/" + id + ".html", //hard coded url for server -> current server running on localhost:1000
scripts: ["http://code.jquery.com/jquery.js"],
done: function (err, window) {
if(err)
console.log(err);
else {
var $ = window.$;
//jquery selectors to extract data
console.log($(".x14.y30").text().trim());
console.log($(".x15.y31").text().trim());
console.log($(".x16.y32").text().trim());
}
}
});
});
});
Converted html file looks like this : The combination of class name x followed by a character and y followed by a character was unique for a particular div. For eg. there was only one div with xf
and y10
class.
Where I'm stuck : Although all the pdf are identical in format and structure, the html file generated is not. So lets say $(".x14.y30").text()
might be giving me something in pdf - 1, it would be giving something else in pdf - 2. I have also looked for some way in which I can modify the way classes were being assigned while a pdf file was converted to html. But all in vain. Extracted data needs to be then stored in a tab separated format.
Using this approach is not mandatory. Any better suggestion is welcome.