I'm working on a project that requires obtaining data from some PDF documents.
Currently I'm using the Foxit toolkit (calling it from a script) to convert each document to txt, and then I iterate through the text. I'm pretty happy with it, but $100 is just something I can't afford for such a small project.
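A rough sketch of the workflow described above, assuming the converter is a command-line tool that takes an input path and an output path. The names `converter_cmd`, `convert_to_text`, and `find_values`, as well as the `Total:` label, are hypothetical and only for illustration; this is not the actual Foxit invocation.

```python
import subprocess
from pathlib import Path


def convert_to_text(pdf_path, txt_path, converter_cmd):
    """Run an external PDF-to-text converter and return the resulting text.

    converter_cmd is a placeholder list (executable plus any options);
    the real command line depends on the converter being used.
    """
    subprocess.run([*converter_cmd, str(pdf_path), str(txt_path)], check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")


def find_values(text, label):
    """Return the remainder of every line that starts with the given label."""
    values = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(label):
            values.append(stripped[len(label):].strip())
    return values
```

This only works if the converter keeps each label and its value on the same line, which is why the line breaks in the converted text matter so much.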
I've tested all the free converters that I could find (like xpdf and pdftotext), but they just don't cut it: they mess up the layout so badly that I can't use the words to locate the data. I've tried some Python modules like pdfminer, but they don't seem to work well in Python 3. I can't get the data before it's converted to PDF because I get the documents from a phone carrier.
I'm looking for a way of getting the data from the PDF, or a converter that at least follows the newlines properly.
Update: PyPDF2 is not grabbing any text whatsoever from the PDF document.