Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

Features

  • Written entirely in Python (for version 2.4 or newer).
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support (well, almost).
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Reconstruct the original layout by grouping text chunks.
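
To make the text-location and layout-grouping features above concrete, here is a minimal sketch. It is not PDFMiner's own layout algorithm: `group_into_lines` is a hypothetical helper that merges text chunks into lines when their bottom edges fall within an assumed tolerance, and `lines_per_page` assumes the pdfminer.six fork is installed and reads chunk bounding boxes through its `extract_pages` function.

```python
from typing import Iterable, Iterator, List, Tuple

# (x0, y0, x1, y1, text) in PDF coordinates: origin at the bottom-left of the page.
Chunk = Tuple[float, float, float, float, str]


def group_into_lines(chunks: Iterable[Chunk], y_tolerance: float = 2.0) -> List[str]:
    """Hypothetical helper: merge chunks whose bottom edges (y0) are close.

    Returns lines top to bottom, with the chunks of each line left to right.
    """
    lines: List[List[Chunk]] = []
    # Larger y0 means higher on the page, so sort by descending y0, then by x0.
    for chunk in sorted(chunks, key=lambda c: (-c[1], c[0])):
        if lines and abs(lines[-1][0][1] - chunk[1]) <= y_tolerance:
            lines[-1].append(chunk)
        else:
            lines.append([chunk])
    return [
        " ".join(c[4] for c in sorted(line, key=lambda c: c[0]))
        for line in lines
    ]


def lines_per_page(path: str) -> Iterator[List[str]]:
    """Yield grouped lines for each page (assumes pdfminer.six is installed)."""
    # Imported lazily so the grouping helper works without the library.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBox, LTTextLine

    for page_layout in extract_pages(path):
        chunks: List[Chunk] = []
        for element in page_layout:
            if isinstance(element, LTTextBox):
                for line in element:
                    if isinstance(line, LTTextLine):
                        x0, y0, x1, y1 = line.bbox
                        chunks.append((x0, y0, x1, y1, line.get_text().strip()))
        yield group_into_lines(chunks)


# Usage with hand-written chunks, so no PDF file is needed:
sample = [
    (100.0, 700.0, 150.0, 712.0, "world"),
    (50.0, 700.5, 95.0, 712.0, "Hello"),
    (50.0, 680.0, 120.0, 692.0, "Second line"),
]
print(group_into_lines(sample))  # → ['Hello world', 'Second line']
```

The tolerance of 2.0 points is an arbitrary choice for illustration; documents with mixed font sizes would likely need a value derived from the chunk heights.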

PDFMiner is about 20 times slower than C/C++-based counterparts such as XPdf.

416 questions
96
votes
5 answers

Extracting text from a PDF file using PDFMiner in python?

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python. It looks like PDFMiner updated their API and all the relevant examples I have found contain outdated code (classes and methods have…
DuckPuncher
  • 4,335
  • 4
  • 21
  • 38
73
votes
15 answers

How do I use pdfminer as a library

I am trying to get text data from a pdf using pdfminer. I am able to extract this data to a .txt file successfully with the pdfminer command line tool pdf2txt.py. I currently do this and then use a python script to clean up the .txt file. I would…
jmeich
  • 875
  • 1
  • 7
  • 8
36
votes
2 answers

How to extract text and text coordinates from a PDF file?

I want to extract all the text boxes and text box coordinates from a PDF file with PDFMiner. Many other Stack Overflow posts address how to extract all text in an ordered fashion, but how can I do the intermediate step of getting the text and text…
pnj
  • 1,107
  • 1
  • 9
  • 13
21
votes
4 answers

Pdfminer python 3.5

I have followed a few tutorials, but I am not able to get this code block to run. I made the necessary switches from StringIO to BytesIO (I believe?). I am unsure why 'banana' is printing nothing; I think the errors might be red herrings? Is it…
gary
  • 223
  • 1
  • 2
  • 7
21
votes
1 answer

How does one obtain the location of text in a PDF with PDFMiner?

PDFMiner's documentation says: "PDFMiner allows one to obtain the exact location of text in a page." However, I have not been able to find how to do this. PDFMiner's 'documentation' is rather sparse, so I have not understood how to do this.
technillogue
  • 1,182
  • 3
  • 12
  • 27
20
votes
7 answers

How to unlock a "secured" (read-protected) PDF in Python?

In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages raise…
kramer65
  • 39,074
  • 90
  • 255
  • 436
18
votes
8 answers

How to check if PDF is scanned image or contains text

I have a large number of files; some of them are images scanned into PDF and some are full/partial text PDFs. Is there a way to check these files to ensure that we are only processing files which are scanned images and not those that are full/partial…
Jinu Joseph
  • 322
  • 1
  • 2
  • 16
14
votes
4 answers

PDFminer: PDFTextExtractionNotAllowed Error

I'm trying to extract text from pdfs I've scraped off the internet, but when I attempt to download them I get the error: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages raise…
Tyler Lazoen
  • 141
  • 1
  • 4
13
votes
6 answers

PDFminer: extract text with its font information

I found this question, but it uses the command line, and I do not want to call a Python script from the command line using subprocess and parse HTML files to get the font information. I want to use PDFminer as a library, and I found this question, but they are…
aristotll
  • 6,572
  • 5
  • 30
  • 47
12
votes
1 answer

Highlight text in a PDF with Python

I'm working on a custom search engine for my PDF data corpus. I have a transformation layer which is able to dump PDF content to text (using Apache Tika and GROBID). I have finished the search layers and the view which returns the search results listing. Now,…
Katharsis
  • 159
  • 1
  • 1
  • 8
11
votes
6 answers

Extract hyperlinks from PDF in Python

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it…
Randomly Named User
  • 1,779
  • 7
  • 25
  • 44
10
votes
3 answers

PDF Miner PDFEncryptionError

I'm trying to extract text from PDF files and later try to identify the references. I'm using pdfminer 20140328. With unencrypted files it runs well, but I now have a file where I get: File…
RichieK
  • 384
  • 4
  • 15
9
votes
1 answer

How to use PDFminer.six with python 3?

I want to use pdfminer.six, a tool that can be used with Python 3 for extracting information from PDF documents. The problem is that there is no good documentation at all and no source code example on how to use the tool. I have already tried…
Urvish
  • 507
  • 1
  • 7
  • 16
8
votes
2 answers

Parsing Index page in a PDF text book with Python

I have to extract text from PDF pages as is, with the indentation, into a CSV file. Index page from PDF text book: I should split the text into a class and subclass type hierarchy along with the page numbers. For example in the image, Application…
7
votes
1 answer

python pdfminer converts pdf file into one chunk of string with no spaces between words

I was using the following code mainly taken from DuckPuncher's answer to this post Extracting text from a PDF file using PDFMiner in python? to convert pdfs to text files: def convert_pdf_to_txt(path): rsrcmgr = PDFResourceManager() retstr =…
Yue Zhao
  • 115
  • 1
  • 8