
I'm trying to extract text from a large number of PDFs using PDFMiner python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of PDFs:

ipython stack trace:

/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
    331                 break
    332         else:
--> 333             raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    334         if self.catalog.get('Type') is not LITERAL_CATALOG:
    335             if STRICT:

PDFSyntaxError: No /Root object! - Is this really a PDF?

Of course, I immediately checked to see whether or not these PDFs were corrupted, but they can be read just fine.

Is there any way to read these PDFs despite the absence of a root object? I'm not too sure where to go from here.

Many thanks!

Edit:

I tried using PyPDF in an attempt to get some differential diagnostics. The stack trace is below:

In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError                              Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))

/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
    372         self.flattenedPages = None
    373         self.resolvedObjects = {}
--> 374         self.read(stream)
    375         self.stream = stream
    376         self._override_encryption = False

/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
    708             line = self.readNextEndLine(stream)
    709         if line[:5] != "%%EOF":
--> 710             raise utils.PdfReadError, "EOF marker not found"
    711 
    712         # find startxref entry - the location of the xref table


PdfReadError: EOF marker not found

Quonux suggested that PDFMiner might stop parsing after reaching the first %%EOF marker. The PyPDF result would seem to suggest otherwise, but I'm very much clueless. Any thoughts?

Pale Blue Dot
Louis Thibault
  • Maybe PDFMiner stops searching for the Root node after the first %%EOF label, _but_ more nodes can come after that label, so it doesn't find it. Another reason could be that the files are compressed? – Quonux Jul 08 '12 at 16:15
  • @Quonux, And how would I go about testing whether or not this is the case? Is there an option to force PDFMiner to search the entire document for a Root Node? Concerning the possibility of compression, is there a way to check for this? What can be done if the files are compressed? – Louis Thibault Jul 08 '12 at 16:48
  • @Quonux, I've added a stack trace from a similar attempt using pypdf. Does this help narrow down the cause? – Louis Thibault Jul 08 '12 at 17:09
  • Maybe the parser expects an %%EOF label, but none is found. Maybe you can fix it like this: open the "incorrect" file, append "%%EOF\n" in binary mode at the end of the file, close it, then try to parse again. – Quonux Jul 08 '12 at 18:53
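
Both ideas from the comments above can be tested without PDFMiner by looking at the raw bytes. The sketch below is only a heuristic, not a PDF parser: it counts %%EOF markers, looks for a startxref keyword and a /Root key, and (following the last comment) returns a copy of the data with "%%EOF\n" appended when the marker is missing. The function names scan_pdf_markers and with_eof_marker are made up for illustration.

```python
import re


def scan_pdf_markers(data):
    """Return rough structural hints found in the raw bytes of a PDF."""
    return {
        # Junk bytes before the header confuse some parsers.
        "header_offset": data.find(b"%PDF-"),
        # More than one marker usually means incremental updates.
        "eof_markers": data.count(b"%%EOF"),
        "has_startxref": b"startxref" in data,
        # /Root normally appears in the trailer dictionary.
        "has_root": re.search(rb"/Root\b", data) is not None,
        # Object streams (PDF 1.5+) hold compressed indirect objects.
        "has_object_streams": b"/ObjStm" in data,
        "ends_with_eof": data.rstrip().endswith(b"%%EOF"),
    }


def with_eof_marker(data):
    """Return data with a trailing %%EOF line added if it is missing."""
    if data.rstrip().endswith(b"%%EOF"):
        return data
    return data.rstrip(b"\r\n") + b"\n%%EOF\n"
```

To use it on a real file, read the bytes with open(path, "rb") and pass them to scan_pdf_markers. Write the result of with_eof_marker to a new file rather than overwriting the original, since this is a guess at a repair and may not help if the xref table itself is missing.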

5 Answers


The solution with slate is to open the file in 'rb' (read binary) mode.

Because slate depends on PDFMiner and I had the same problem, this should solve your problem too.

import slate
# Raw string keeps the backslashes literal; 'rb' opens the file in binary mode.
fp = open(r'C:\Users\USER\workspace\slate_minner\document1.pdf', 'rb')
doc = slate.PDF(fp)
print(doc)
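
A likely reason binary mode matters: a PDF's xref table and startxref entry store byte offsets, and text mode on Windows rewrites \r\n as \n, so every offset after the first line break shifts. The sketch below imitates that translation on a tiny made-up byte string (not a real PDF) to show the drift. The helper name text_mode_view is hypothetical.

```python
def text_mode_view(data):
    """Mimic Windows text-mode reading, which turns CRLF into LF."""
    return data.replace(b"\r\n", b"\n")


# A tiny stand-in for the start of a PDF; real files are far larger.
raw = b"%PDF-1.4\r\n1 0 obj\r\n<< >>\r\nendobj\r\nxref\r\n0 1\r\n"

# Where the xref keyword really is, i.e. what startxref would record.
offset_in_file = raw.index(b"xref")
# Where a parser finds it after newline translation.
offset_as_text = text_mode_view(raw).index(b"xref")

print(offset_in_file, offset_as_text)
```

The two numbers differ by one byte per line break before the keyword, which is enough for a parser to miss the xref table and, with it, the trailer that holds /Root.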
Carlos Neves

Interesting problem. I did some research.

This is the function that parses the PDF (from PDFMiner's source code):

def set_parser(self, parser):
        "Set the document to use a given PDFParser object."
        if self._parser: return
        self._parser = parser
        # Retrieve the information of each header that was appended
        # (maybe multiple times) at the end of the document.
        self.xrefs = parser.read_xref()
        for xref in self.xrefs:
            trailer = xref.get_trailer()
            if not trailer: continue
            # If there's an encryption info, remember it.
            if 'Encrypt' in trailer:
                #assert not self.encryption
                self.encryption = (list_value(trailer['ID']),
                                   dict_value(trailer['Encrypt']))
            if 'Info' in trailer:
                self.info.append(dict_value(trailer['Info']))
            if 'Root' in trailer:
                #  Every PDF file must have exactly one /Root dictionary.
                self.catalog = dict_value(trailer['Root'])
                break
        else:
            raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
        if self.catalog.get('Type') is not LITERAL_CATALOG:
            if STRICT:
                raise PDFSyntaxError('Catalog not found!')
        return
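
The error in the question comes from the for...else in this function: the else branch runs only when the loop ends without hitting break, meaning none of the trailers returned by read_xref() had a Root key (or there were no trailers at all). A minimal sketch of the same control flow, with plain dicts standing in for PDFMiner's trailer objects and ValueError standing in for PDFSyntaxError:

```python
def find_root(trailers):
    """Return the first Root entry found in a list of trailer dicts."""
    for trailer in trailers:
        if not trailer:
            continue
        if "Root" in trailer:
            root = trailer["Root"]
            break
    else:
        # Reached only when the loop finished without hitting break.
        raise ValueError("No /Root object!")
    return root
```

So find_root([{}, {"Info": 1}]) raises, while find_root([{"Info": 1}, {"Root": "catalog"}]) returns "catalog". In other words, the message says that no usable trailer with a Root key was found, not necessarily that the file has no catalog.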

If the problem were with EOF, a different exception would be raised. Here is another function from the source:

def load(self, parser, debug=0):
        while 1:
            try:
                (pos, line) = parser.nextline()
                if not line.strip(): continue
            except PSEOF:
                raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
            if not line:
                raise PDFNoValidXRef('Premature eof: %r' % parser)
            if line.startswith('trailer'):
                parser.seek(pos)
                break
            f = line.strip().split(' ')
            if len(f) != 2:
                raise PDFNoValidXRef('Trailer not found: %r: line=%r' % (parser, line))
            try:
                (start, nobjs) = map(long, f)
            except ValueError:
                raise PDFNoValidXRef('Invalid line: %r: line=%r' % (parser, line))
            for objid in xrange(start, start+nobjs):
                try:
                    (_, line) = parser.nextline()
                except PSEOF:
                    raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
                f = line.strip().split(' ')
                if len(f) != 3:
                    raise PDFNoValidXRef('Invalid XRef format: %r, line=%r' % (parser, line))
                (pos, genno, use) = f
                if use != 'n': continue
                self.offsets[objid] = (int(genno), long(pos))
        if 1 <= debug:
            print >>sys.stderr, 'xref objects:', self.offsets
        self.load_trailer(parser)
        return
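
To make the format this function expects more concrete, here is a simplified sketch of the same loop that works on a list of text lines instead of a PDFParser. It is not PDFMiner's code: parse_xref_section is a hypothetical name, ValueError stands in for PDFNoValidXRef, and it splits on any whitespace rather than on single spaces.

```python
def parse_xref_section(lines):
    """Parse the lines after 'xref' into {objid: (genno, pos)}."""
    offsets = {}
    it = iter(lines)
    for line in it:
        line = line.strip()
        if not line:
            continue
        if line.startswith("trailer"):
            break
        fields = line.split()
        if len(fields) != 2:
            raise ValueError("Trailer not found: line=%r" % line)
        # Subsection header: first object number and number of entries.
        start, nobjs = map(int, fields)
        for objid in range(start, start + nobjs):
            try:
                entry = next(it)
            except StopIteration:
                raise ValueError("Unexpected EOF - file corrupted?")
            parts = entry.strip().split()
            if len(parts) != 3:
                raise ValueError("Invalid XRef format: line=%r" % entry)
            pos, genno, use = parts
            if use != "n":
                continue  # 'f' marks a free (unused) entry
            offsets[objid] = (int(genno), int(pos))
    else:
        # The loop ran out of lines before reaching 'trailer'.
        raise ValueError("Premature EOF: no trailer line")
    return offsets
```

Fed the lines "0 3", "0000000000 65535 f", "0000000017 00000 n", "0000000081 00000 n" and "trailer", it maps object 1 to offset 17 and object 2 to offset 81. A truncated or shifted table fails in one of the raise branches, which matches the point above that EOF problems surface as a different exception.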

From Wikipedia (on the PDF spec): A PDF file consists primarily of objects, of which there are eight types:

Boolean values, representing true or false
Numbers
Strings
Names
Arrays, ordered collections of objects
Dictionaries, collections of objects indexed by Names
Streams, usually containing large amounts of data
The null object

Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to the objects in the file, and also allows for small changes to be made without rewriting the entire file (incremental update). Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that have large numbers of small indirect objects and is especially useful for Tagged PDF.
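
Since the xref table is found through a byte offset stored after the startxref keyword near the end of the file, one quick check is whether that offset really points at an xref section. The sketch below is a rough heuristic under that assumption, not a validator. The name check_startxref is made up for illustration.

```python
import re


def check_startxref(data):
    """Return (offset, target_ok) for the last startxref in data.

    offset is None when no startxref keyword is present.
    """
    matches = re.findall(rb"startxref\s+(\d+)", data)
    if not matches:
        return None, False
    offset = int(matches[-1])  # the last one belongs to the newest update
    target = data[offset:offset + 32]
    # A classic table begins with 'xref'; an xref stream (PDF 1.5+)
    # begins with an object header such as '12 0 obj'.
    looks_like_table = target.startswith(b"xref")
    looks_like_stream = re.match(rb"\d+\s+\d+\s+obj", target) is not None
    return offset, looks_like_table or looks_like_stream
```

If target_ok is False for the failing files but True for the working ones, the offsets are probably off, for example because bytes were added or removed somewhere before the table.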

I think the problem is that your "damaged" PDF has several 'root elements'.

Possible solution:

You can download the source and add print statements in each place where xref objects are retrieved and where the parser tries to parse those objects. That way you can trace the full sequence of steps leading up to the error.

P.S. I think this is some kind of bug in the product.

Dmitry Zagorulkin
  • Dmitry, thanks for your response. If I understand correctly, you suspect that this is a bug in PDFMiner? I'm surprised because similar behavior is also observed in PyPDF. Or did you mean that the bug is in whichever software created the "broken" PDF? Concerning your solution, do you mean that I should add print lines in the PDFParser object methods wherever they manage an xref object? I'm a little unclear on exactly what I should be doing. Thanks! – Louis Thibault Jul 12 '12 at 12:37
  • Good day. Just take two files (a normal one and a damaged one) and analyse each in a PDF analyser tool. I think the damaged PDF will have an invalid xref structure. After the analysis, try to repair the PDF structure (http://www.w3.org/WAI/GL/WCAG20-TECHS/pdf.html) – Dmitry Zagorulkin Jul 12 '12 at 12:58
  • http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/ is a great set of tools for this kind of task. – Dmitry Zagorulkin Jul 12 '12 at 13:03
  • Dmitry, thanks for the links. I'll give pdftk a shot. Why bother fixing something if the work's already been done ;-) – Louis Thibault Jul 12 '12 at 13:25
  • Give me some feedback if you make a PDF repair tool in Python =) – Dmitry Zagorulkin Jul 12 '12 at 13:53

I had this same problem on Ubuntu. I have a very simple solution: just print the PDF file as a PDF. On Ubuntu:

  1. Open the PDF file using the (Ubuntu) document viewer.

  2. Go to File.

  3. Go to Print.

  4. Choose to print to a file and tick the "pdf" option.

If you want to automate the process, you can use a script to print all your PDF files automatically. A Linux script like this also works:

for f in *.pdfx
do
    lowriter --headless --convert-to pdf "$f"
done

Note that I renamed the original (problematic) PDF files to use the .pdfx extension.
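
If the rest of your pipeline is already in Python, the same loop can be driven from there. This is a rough sketch that only reuses the lowriter flags from the shell script above. The function names and the dry_run switch are my own additions, and it assumes lowriter is on the PATH and writes its output to the working directory.

```python
import glob
import os
import subprocess


def build_convert_command(path):
    """Build the same lowriter call the shell loop makes for one file."""
    return ["lowriter", "--headless", "--convert-to", "pdf", path]


def convert_all(directory, pattern="*.pdfx", dry_run=True):
    """Convert every matching file; with dry_run only return the commands."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    commands = [build_convert_command(os.path.abspath(p)) for p in paths]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True, cwd=directory)
    return commands
```

Calling convert_all(some_dir) first lets you inspect the commands. Pass dry_run=False to actually run them.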

eyllanesc
DanielTheRocketMan

An answer above is right. This error appears only on Windows, and the workaround is to use fp = open(path, 'rb') instead of with open(path, 'rb').

AeroSM

I got this error as well and kept trying fp = open('example','rb')

However, I still got the error the OP described. What I found was that I had a bug in my code where the PDF was still open in another function.
So make sure you don't have the PDF open elsewhere as well.
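
One way to rule this out is to read the file once, close it straight away, and give the parser an in-memory copy, so no two pieces of code share a file handle. A minimal sketch, assuming the parser you use accepts any file-like object. load_pdf_bytes is a hypothetical helper name.

```python
import io


def load_pdf_bytes(path):
    """Read the whole file and close it before any parsing starts."""
    with open(path, "rb") as fp:
        data = fp.read()
    # fp is closed here; the caller gets an independent in-memory stream.
    return io.BytesIO(data)
```

Each caller then gets its own stream positioned at byte 0, which avoids surprises from a shared handle whose read position was moved by earlier code.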

dasvootz