Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

Features

Written entirely in Python. (for version 2.4 or newer)
Parse, analyze, and convert PDF documents.
PDF-1.7 specification support. (well, almost)
CJK languages and vertical writing scripts support.
Various font types (Type1, TrueType, Type3, and CID) support.
Basic encryption (RC4) support.
PDF to HTML conversion (with a sample converter web app).
Outline (TOC) extraction.
Tagged contents extraction.
Reconstruct the original layout by grouping text chunks.

PDFMiner is about 20 times slower than other C/C++-based counterparts such as XPdf.
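
As a rough illustration of the "exact location of text" claim in the description above, here is a minimal sketch. It assumes the pdfminer.six fork (the one that runs on Python 3) rather than the original package described here, and relies on its `extract_pages` helper and `LTTextContainer` layout class. The names `iter_text_boxes` and `reading_order` are hypothetical helpers introduced only for this example, and the sort is a naive ordering, not PDFMiner's own layout analysis.

```python
from typing import Iterable, Iterator, List, Tuple

# (page_number, (x0, y0, x1, y1), text)
TextBox = Tuple[int, Tuple[float, float, float, float], str]


def iter_text_boxes(pdf_path: str) -> Iterator[TextBox]:
    """Yield every text box in the PDF with its page number and bounding box."""
    # Assumption: pdfminer.six is installed. The import is done lazily so that
    # the pure helper below can be used without the library.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text().strip()


def reading_order(boxes: Iterable[TextBox]) -> List[TextBox]:
    """Sort boxes by page, then top to bottom, then left to right.

    PDF coordinates start at the bottom-left corner of the page, so a larger
    y1 value means the box sits higher on the page.
    """
    return sorted(boxes, key=lambda box: (box[0], -box[1][3], box[1][0]))
```

With a real file, something like `reading_order(iter_text_boxes("example.pdf"))` would return the boxes page by page; `example.pdf` is a placeholder path.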

(source)

416 questions

5 answers

Extracting text from a PDF file using PDFMiner in python?

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python. It looks like PDFMiner updated their API and all the relevant examples I have found contain outdated code (classes and methods have…

asked Oct 21 '14 at 18:56

DuckPuncher

15 answers

How do I use pdfminer as a library

I am trying to get text data from a pdf using pdfminer. I am able to extract this data to a .txt file successfully with the pdfminer command line tool pdf2txt.py. I currently do this and then use a python script to clean up the .txt file. I would…

python pdf pdfminer

asked Apr 20 '11 at 03:50

jmeich

2 answers

How to extract text and text coordinates from a PDF file?

I want to extract all the text boxes and text box coordinates from a PDF file with PDFMiner. Many other Stack Overflow posts address how to extract all text in an ordered fashion, but how can I do the intermediate step of getting the text and text…

python pdf pdfminer

asked Apr 06 '14 at 18:31

pnj

4 answers

Pdfminer python 3.5

I have followed a few tutorials, but I am not able to get this code block to run. I did the necessary switches from StringIO to BytesIO (I believe?). I am unsure why 'banana' is printing nothing; I think the errors might be red herrings? is it…

python-3.x pdf text extract pdfminer

asked Oct 04 '16 at 14:24

gary

1 answer

How does one obtain the location of text in a PDF with PDFMiner?

PDFMiner's documentation says: "PDFMiner allows one to obtain the exact location of text in a page." However, I have not been able to find how to do this. PDFMiner's 'documentation' is rather sparse, so I have not understood how to do this.

python pdf position pdfminer

asked Aug 11 '14 at 16:35

technillogue

7 answers

How to unlock a "secured" (read-protected) PDF in Python?

In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages raise…

python pdf pdfminer pdf-scraping

asked Jan 28 '15 at 13:02

kramer65

8 answers

How to check if PDF is scanned image or contains text

I have a large number of files, some of them are scanned images into PDF and some are full/partial text PDF. Is there a way to check these files to ensure that we are only processing files which are scanned images and not those that are full/partial…

python python-3.x pypdf2 pdfminer pdf-extraction

asked Apr 16 '19 at 08:54

Jinu Joseph

4 answers

PDFminer: PDFTextExtractionNotAllowed Error

I'm trying to extract text from pdfs I've scraped off the internet, but when I attempt to download them I get the error: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages raise…

python pdf text nlp pdfminer

asked Oct 11 '16 at 16:18

Tyler Lazoen

6 answers

PDFminer: extract text with its font information

I found this question, but it uses the command line, and I do not want to call a Python script from the command line using subprocess and parse HTML files to get the font information. I want to use PDFminer as a library, and I found this question, but they are…

python text-extraction pdfminer

asked Jan 05 '16 at 07:33

aristotll

1 answer

Highlight text in a PDF with Python

I'm working on a custom search engine for my PDF data corpus. I have a transformation layer which is able to dump PDF content to text (using Apache Tika and GROBID). I have finished the search layers and the view which returns the search results listing. Now,…

python pdf search pypdf pdfminer

asked Oct 27 '16 at 15:18

Katharsis

6 answers

Extract hyperlinks from PDF in Python

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it…

python pdf hyperlink pypdf pdfminer

asked Jan 02 '15 at 15:08

Randomly Named User

3 answers

PDF Miner PDFEncryptionError

I'm trying to extract text from PDF files and later try to identify the references. I'm using pdfminer 20140328. With unencrypted files it runs well, but I now have a file where I get: File…

python pdf encryption pdfminer

asked Dec 18 '15 at 14:19

RichieK

1 answer

How to use PDFminer.six with python 3?

I want to use pdfminer.six, which is a tool that can be used with Python 3 for extracting information from PDF documents. The problem is that there is no good documentation at all and no source code example on how to use the tool. I have already tried…

python-3.x pypdf2 pdfminer

asked Jun 07 '19 at 12:10

Urvish

2 answers

Parsing Index page in a PDF text book with Python

I have to extract text from PDF pages as is, with the indentation, into a CSV file. Index page from PDF text book: I should split the text into a class and subclass type hierarchy along with the page numbers. For example in the image, Application…

python pdfminer pdftotext ner natural-language-processing

asked Mar 03 '18 at 18:35

Aryan

1 answer

python pdfminer converts pdf file into one chunk of string with no spaces between words

I was using the following code mainly taken from DuckPuncher's answer to this post Extracting text from a PDF file using PDFMiner in python? to convert pdfs to text files: def convert_pdf_to_txt(path): rsrcmgr = PDFResourceManager() retstr =…

python-3.x pdfminer

asked Mar 23 '18 at 19:56

Yue Zhao
