Python 3 - Data mining from PDF

Question

I'm working on a project that requires obtaining data from some PDF documents.

Currently I'm using the Foxit toolkit (calling it from the script) to convert the document to txt, and then I iterate through it. I'm pretty happy with it, but $100 is just something I can't afford for such a small project.

I've tested all the free converters that I could find (like xpdf and pdftotext), but they just don't cut it: they mess up the layout so badly that I can't use the words to locate the data.
I've tried some Python modules like pdfminer but they don't seem to work well in Python 3.
I can't get the data before it's converted to PDF because I get them from a phone carrier.

I'm looking for a way of getting the data from the PDF, or a converter that at least follows the newlines properly.

Update: PyPDF2 is not grabbing any text whatsoever from the pdf document.

Have you tried https://pythonhosted.org/PyPDF2/? – dfranca Aug 17 '16 at 11:20
I think not, I'll try it out, thanks. – EndermanAPM Aug 17 '16 at 11:20

score 3 · Answer 1 · edited Aug 17 '16 at 11:55

PyPDF2 seems to be the best one available for Python 3. It's well documented and the API is simple to use.

It can also work with encrypted files, retrieve metadata, merge documents, etc.

A simple use case for extracting the text:

from PyPDF2 import PdfFileReader

# PDFs are binary files, so open them in 'rb' mode.
with open("test.pdf", "rb") as f:
    ipdf = PdfFileReader(f)
    # One string of extracted text per page.
    text = [p.extractText() for p in ipdf.pages]

edited Aug 17 '16 at 11:55

EndermanAPM

answered Aug 17 '16 at 11:40

dfranca

If I don't open it in binary mode, it just crashes. If I do, it returns an empty array. – EndermanAPM Aug 17 '16 at 11:45
The .pages actually contains the objects? – dfranca Aug 17 '16 at 11:51
Yes it does contain them. – EndermanAPM Aug 17 '16 at 11:54
It has to be the format of my pdf, this [example pdf](http://www.publishers.org.uk/_resources/assets/attachment/full/0/2091.pdf) is working fine. – EndermanAPM Aug 17 '16 at 12:00

score 1 · Answer 2 · answered Aug 17 '16 at 11:26

Sadly, I don't believe there is a good free Python PDF converter. However, pdf2htmlEX, although it is not a Python module, works extremely well and gives you much more structured data (HTML) than a simple text file. From there you can use Python tools such as Beautiful Soup to scrape the HTML file.

link - http://coolwanglu.github.io/pdf2htmlEX/

Hope this helps.
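
As a rough sketch of the scraping step, the snippet below pulls one string per text line out of converter-style HTML using only the standard library (Beautiful Soup would work just as well). It assumes each line of text is wrapped in a `<div>` whose class list contains `t`; that markup, the sample HTML and the `TextLineCollector` name are assumptions for illustration, so check them against your actual output file.

```python
from html.parser import HTMLParser


class TextLineCollector(HTMLParser):
    """Collect the text of every <div> whose class list contains "t".

    Assumption: the converter emits one such <div> per line of text.
    """

    def __init__(self):
        super().__init__()
        self.lines = []    # finished lines of text
        self._depth = 0    # > 0 while inside a text-line <div>
        self._parts = []   # text fragments of the current line

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1      # nested <div> inside a text line
        elif "t" in classes:
            self._depth = 1       # a new text line starts here
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.lines.append("".join(self._parts).strip())

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


# Hypothetical sample resembling one converted page with two text lines.
sample = (
    '<div class="pf">'
    '<div class="t m0 x1">Invoice <span class="_ _0"></span>number: 42</div>'
    '<div class="t m0 x2">Total: 10.50</div>'
    '</div>'
)
collector = TextLineCollector()
collector.feed(sample)
collector.close()
lines = collector.lines
print(lines)
```

Once the lines are plain strings, you can search them for the labels that precede the values you need.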

answered Aug 17 '16 at 11:26

T0m

Seems like an interesting approach, I'll give it a try. – EndermanAPM Aug 17 '16 at 11:46
It's crashing with the basic usage to me: [output](https://gist.github.com/EndermanAPM/f61cf23b3d44dace4ecdc1101a474791), maybe because the pdf format? – EndermanAPM Aug 17 '16 at 11:57
Have you tried running it as admin? It looks as though it does not have the privileges to save to the temp directory. – T0m Aug 17 '16 at 17:15
Yeah, you're right, although I find it quite weird that it requires admin privileges to write to the temp folder. Oh well, it does quite a decent job of keeping the format... but I don't know how well I'll be able to scrape the file with [this](https://gist.github.com/EndermanAPM/c51a3e50236be5610f9d0c37b08eae81) kind of output. Edit: Also, using ttfautohint doesn't seem to make any difference. – EndermanAPM Aug 18 '16 at 07:03

score 1 · Answer 3 · answered Aug 17 '16 at 11:26

Here is an example using PyPDF2:

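from PyPDF2 import PdfFileReader

# Open in binary mode; the with block closes the file when done.
with open("FileName", "rb") as pdf_file:
    # strict=False makes the reader more tolerant of malformed PDFs.
    pdf_reader = PdfFileReader(pdf_file, strict=False)
    # One string of extracted text per page.
    data = [page.extractText() for page in pdf_reader.pages]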

More information on PyPDF2 here.
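
Extraction can succeed without raising an error and still return nothing useful, for example when the pages hold scanned images rather than a text layer. The sketch below checks a list of per-page strings for blank pages before you start parsing it. The `pages_with_text` helper and the sample `data` list are hypothetical stand-ins for a real extraction result.

```python
def pages_with_text(page_texts):
    """Return the indexes of pages whose extracted text is not blank."""
    return [i for i, text in enumerate(page_texts) if text and text.strip()]


# Hypothetical extraction result: page 0 has text, pages 1 and 2 came back blank.
data = ["Invoice 42\nTotal: 10.50", "", "  \n"]
found = pages_with_text(data)
if not found:
    print("No text found; the PDF may need OCR or a different extractor.")
else:
    print(f"Text found on {len(found)} of {len(data)} pages: {found}")
```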

answered Aug 17 '16 at 11:26

taufikedys

print(data) should print the text, right? Because at the moment it is returning an empty array. – EndermanAPM Aug 17 '16 at 11:36
It should... Unfortunately, it is known that some PDFs can't be read by PyPDF2, especially those that were not generated properly from text (e.g. scanned text/images). – taufikedys Aug 17 '16 at 11:58
It might be the format, although the PDF must contain real text, or the Foxit toolkit wouldn't be able to extract it. – EndermanAPM Aug 17 '16 at 12:03

score 0 · Answer 4 · answered Aug 20 '16 at 17:03

I had the same problem when I wanted to do some deep inspection of PDFs for security analysis. I had to write my own utility that parses the low-level objects and literals, unpacks streams, etc., so I could get at the "raw data":

https://github.com/opticaliqlusion/pypdf
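
To give an idea of what "unpacking a stream" involves, here is a minimal standard-library sketch that does not use the utility above. PDF streams that use the FlateDecode filter are zlib-compressed, so `zlib.decompress` recovers the raw operators. The sample content stream is made up for illustration, and the regex is a deliberately naive assumption about how the text is stored.

```python
import re
import zlib

# Stand-in for a compressed content stream taken from a PDF object.
# In a real file these bytes sit between the "stream" and "endstream" keywords.
content = b"BT /F1 12 Tf 72 720 Td (Invoice 42) Tj 0 -14 Td (Total: 10.50) Tj ET"
compressed = zlib.compress(content)

# FlateDecode streams are zlib-compressed, so this recovers the raw operators.
raw = zlib.decompress(compressed)

# Naive pass: grab the literal string operand of every Tj (show text) operator.
# It ignores escaped parentheses, hex strings, TJ arrays and font encodings.
shown = [m.decode("latin-1") for m in re.findall(rb"\(([^()]*)\)\s*Tj", raw)]
print(shown)
```

Real documents often use custom font encodings or split a line across many operators, which is one reason text extraction can fail on some files.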

It's not a feature-complete solution, but it is meant to be used in a pure Python context where you can define your own visitors to iterate over all the streams, text, id nodes, etc. in the PDF tree:

class StreamIterator(PdfTreeVisitor):
    """For deflating (not crossing) the streams"""
    def visit_stream(self, node):
        print(node.value)
...
StreamIterator().visit(tree)
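
For readers unfamiliar with the visitor pattern, here is a self-contained sketch of the idea. It is not the library's actual implementation: `Node`, `TreeVisitor` and `StreamCollector` are hypothetical names, and the tree is built by hand instead of being parsed from a PDF.

```python
class Node:
    """Hypothetical tree node: a kind, an optional value and child nodes."""

    def __init__(self, kind, value=None, children=()):
        self.kind = kind
        self.value = value
        self.children = list(children)


class TreeVisitor:
    """Walk a tree depth-first and call visit_<kind> when a subclass defines it."""

    def visit(self, node):
        handler = getattr(self, "visit_" + node.kind, None)
        if handler is not None:
            handler(node)
        for child in node.children:
            self.visit(child)


class StreamCollector(TreeVisitor):
    """Collect the value of every stream node."""

    def __init__(self):
        self.streams = []

    def visit_stream(self, node):
        self.streams.append(node.value)


# Hand-built stand-in for a parsed document tree.
tree = Node("document", children=[
    Node("object", children=[Node("stream", b"BT (Hello) Tj ET")]),
    Node("object", children=[
        Node("text", "metadata"),
        Node("stream", b"BT (World) Tj ET"),
    ]),
])
collector = StreamCollector()
collector.visit(tree)
print(collector.streams)
```

Each subclass only defines handlers for the node kinds it cares about; everything else is walked past silently.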

Anyhow, I don't know if this is the kind of thing you were looking for, but I used it to do some security analysis when looking at suspicious email attachments.

Cheers!
