
I have the same data saved both as a GIF image file and as a PDF file, and I want to parse it into HTML or XML. The data is the menu of my university's cafeteria, which means a new version of the file has to be parsed each week! In general, the files contain some header and footer text, with a table full of other data in between. I have read some posts on Stack Overflow and have also made some attempts to parse out the table data as HTML/XML:

PDF

  • PDFBox || iText (Java)
  • Google Docs Import
  • PDF2HTML || PDF2Table

GIF

  • Tesseract-OCR

I have got the best results from parsing the PDF file with PDFBox but, since the menu changes weekly, it is still not reliable enough. The HTML that I receive sometimes contains more, sometimes fewer "paragraphs" (<p>), so I am not able to parse the data precisely enough.

That is why I would like to know if there is another way to do it.

Vilius
  • PDF->text is rarely straightforward. PDF is a document layout language, not a markup language. Depending on the PDF generator's mood that day, it can generate entirely different documents each time. – Marc B Apr 24 '12 at 15:12
  • I see. The only thing that bothers me is that some PDF-to-XLS parsers work quite well. So why isn't there any open-source project capable of parsing a PDF table reliably? – Vilius Apr 24 '12 at 15:36
  • If you can contact the people who write this menu, see what format it is produced in. They might create it in a format that is much easier to extract text from. – halfer Apr 24 '12 at 19:17
  • That was also an option I was thinking of, but there were two problems with it: 1. universities like to hide their information and only make it accessible if they want to, and 2. I was also thinking of finding an approach that would be applicable to more cafeterias than just the one I meant ;) I will just continue with my "trial and error" method! – Vilius Apr 24 '12 at 19:53
  • The sample PDF is located at http://goo.gl/xc8r3. @njzk2: Why should I forget OCR? – Vilius May 05 '12 at 09:04
  • Possible duplicate of [Parsing PDF files (especially with tables) with PDFBox](https://stackoverflow.com/questions/3203790/parsing-pdf-files-especially-with-tables-with-pdfbox) – beldaz Oct 15 '17 at 21:21

8 Answers


Tabula is a pretty good start on a JRuby web interface for extracting CSV/TSV tables from arbitrary PDFs.
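
For scripting rather than the web interface, there is also the tabula-py wrapper around the same engine (my addition, not part of the original answer; it requires a Java runtime, and the call below follows my understanding of its documented API):

import tabula

# read_pdf returns a list of pandas DataFrames, one per detected table
tables = tabula.read_pdf("menu.pdf", pages="all")
tables[0].to_html("menu.html")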

thadk
  • Agreed, the accuracy that I've seen so far is outstanding (it mentions that table headers can still be problematic, but I've had no problems with them so far). I just wish there was an API... – RTF Apr 07 '14 at 21:07
  • Oh, there is. The engine that powers Tabula is tabula-extractor, and you can get it here: https://github.com/jazzido/tabula-extractor - it's written with JRuby, which you'll need, but the instructions are straightforward. – RTF Apr 08 '14 at 19:18
  • An updated list of tools: http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html – thadk May 06 '16 at 18:43

I have implemented my own algorithm (its name is traprange) to parse tabular data in PDF files.

Following are some sample PDF files and results:

  1. Input file: sample-1.pdf, result: sample-1.html
  2. Input file: sample-4.pdf, result: sample-4.html

Visit my project page at traprange, or my article at traprange.
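
For anyone curious about the general idea without reading the project source, here is a rough, hypothetical Python sketch of distance-based row grouping, similar in spirit to what the comment below discusses; the real traprange implementation is in Java and more sophisticated:

# Hypothetical illustration only: group extracted text chunks into rows by
# vertical proximity, then sort each row left-to-right to recover columns.
# `chunks` is assumed to be (x, y, text) tuples from any PDF text extractor.
def group_rows(chunks, y_tolerance=3.0):
    rows = []
    for x, y, text in sorted(chunks, key=lambda c: c[1]):
        if rows and abs(rows[-1][-1][1] - y) <= y_tolerance:
            rows[-1].append((x, y, text))
        else:
            rows.append([(x, y, text)])
    # sort each row by x and keep only the text
    return [[t for _, _, t in sorted(row)] for row in rows]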

Tho
  • Great work on this project! You may want to consider adding support for border-line analysis to separate rows and columns, not just by distance. – Eugene Aug 09 '16 at 07:26

You can use Camelot to extract tables from your PDF and export them to an HTML file; CSV, Excel and JSON are also supported. You can check out the documentation at http://camelot-py.readthedocs.io. It gives more accurate results compared to other open-source table extraction tools and libraries. Here's a comparison.

You can use the following code snippet to go forward with your task:

>>> import camelot
>>> tables = camelot.read_pdf('file.pdf')
>>> type(tables[0].df)
<class 'pandas.core.frame.DataFrame'>
>>> tables[0].to_html('file.html')
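
If the menu's table has no ruling lines, Camelot's stream flavor may work better; per the documentation linked above, pages selects which pages to parse:

>>> tables = camelot.read_pdf('file.pdf', flavor='stream', pages='1-end')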

Disclaimer: I'm the author of the library.

Vinayak Mehta

If you are looking to extract data from tables once a week and you are on Windows, then please check this freeware PDF utility that includes automated table detection and table-to-CSV/XML conversion: PDF Viewer utility.

The utility is free for both commercial and non-commercial usage by non-developers (and there is a separate version for developers who want to automate via the API).

Disclaimer: I work for ByteScout

Eugene
  • The software is awesome, but the price not so much for a person where one dollar is almost 4 in local currency. :( – Jack Aug 07 '16 at 17:22
  • @jack The PDF utility (PDF Multitool) is completely free; did you mean the PDF Extractor SDK? – Eugene Aug 07 '16 at 18:36
  • I just tested the option to convert to HTML; this is by far the best software for that I've ever found. Did you work on this software? I want to use that extraction within a piece of software, so yes, I mean the SDK. – Jack Aug 07 '16 at 22:34
  • @jack is there a way to PM you? – Eugene Aug 09 '16 at 07:22
  • sure, you can email me at jackj33 at google's mail server – Jack Aug 09 '16 at 20:59

I have tried many OCR and text-converter programs, though I believe one should write the PDF-to-text conversion program oneself, as the image is best understood by the person performing the task.

I have also tried Google and many other online (about 900 websites) and offline (about 1,000 programs) products from different companies. Whether you want to extract text via OCR or directly from the PDF, the most accurate program I found is PDFTOHTML. The accuracy rate of PDFTOHTML is about 98%, while Google's online service is about 94% accurate. It is very good software which also preserves the correct formatting of the text, i.e. bold, italic, etc.
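
Assuming this refers to the pdftohtml command-line tool shipped with Poppler, here is a minimal sketch driving it from Python; the -xml switch emits positioned <text> elements, which are easier to post-process into a table than plain HTML:

import subprocess

# writes menu.xml with one <text> element per chunk, including coordinates
subprocess.run(["pdftohtml", "-xml", "menu.pdf", "menu"], check=True)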

Vineet1982
  • You're right about the text recognition ability itself. PDF2HTML provides quite a good result, but it still cannot handle tables within a PDF document - it just cannot recognize their existence. I, though, was searching for a "tool" that can also detect tables and convert them (together with the information in them) to data like HTML or XML. – Vilius May 01 '12 at 22:34
  • Nobody, nobody in the world can extract an OCR'd image into HTML tables or anything similar. Tables are not used for the purpose of displaying text, and if the tables have borders then it might be possible, but quite difficult. One has to deal with two things, OCR and PDF. Nothing is impossible, but this is very difficult. One has to first extract the text and the position of every piece of text from the OCR, and then mark them up as in the PDF. Also try working with PS (Ghostscript), as many printing techniques use it. Changing your GIF image to PS first and then to PDF might give the correct answer. – Vineet1982 May 02 '12 at 03:48

For major templates, Tabula is the best open-source option, while ABBYY PDF editor is a great solution for enterprise-level PDF data extraction and modification. ABBYY works on OCR.

Tabula has two options: automatic table detection, and manual detection by providing coordinates (see the sketch below).
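
A hedged sketch of the manual-coordinates route via tabula-py (area is top, left, bottom, right in PDF points; the numbers are placeholders to adapt to your document):

import tabula

# guess=False disables auto-detection so only the given area is parsed
tables = tabula.read_pdf("menu.pdf", pages=1, guess=False,
                         area=[100, 50, 700, 550])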

Jay Gajjar
  • Although your two answers might be correct, you should post some links to encourage research ;). Also, I think the problem @Vilius is having is a conceptual one. I think it'll be easier to just extract the data from PDF/PNG/GIF into plain text; with that, you can then create HTML/XML from it... and the engine will be better, since it has a lower scope/responsibility. – aemonge Mar 27 '19 at 14:09

Are the tables in the same place each time? If you can find the dimensions of each box, you could use a tool to split the PDF into multiple documents, each of which contains one box, after which you can use whatever tool you want to convert each smaller PDF to HTML (such as the tools mentioned in other answers). Random Google searches pulled up PyPdf, which looks like it might have some useful functions; a rough sketch follows below.

If you aren't able to hard-code the size of the box (or want to apply the approach to multiple menus in different formats), the obvious method to me (I said obvious, not easy) would be edge detection to find where the border of the table is, and then to apply the splitting I talked about before.
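
A minimal sketch of the splitting step with pypdf (the modern successor of the PyPdf mentioned above); the crop-box coordinates are made-up placeholders for wherever the table sits on the page:

from pypdf import PdfReader, PdfWriter

reader = PdfReader("menu.pdf")
page = reader.pages[0]
# restrict the visible area of the page to the (hypothetical) table region
page.cropbox.lower_left = (50, 100)
page.cropbox.upper_right = (550, 700)

writer = PdfWriter()
writer.add_page(page)
with open("table-only.pdf", "wb") as f:
    writer.write(f)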

Ryan Leonard
  • The hardcoded approach is not applicable to my situation. Since there are new menus each week with different numbers of meals, the table structure varies in the size of the table cells... After reading a lot more on SO and from Google, I have actually found a way to detect "data" in images: the Hough transformation. It still does not completely fit my demands. – Vilius May 03 '12 at 15:18
  • @Vilius why doesn't the transformation completely "fit [your] demands"? – Ryan Leonard May 03 '12 at 17:26
  • Since there are different kinds of menus, I would probably need to hardcode a lot of stuff, but I want to make it more generic. So the Hough transformation would be sufficient, but not efficient enough. – Vilius May 05 '12 at 09:03

I recently ran into a similar problem.

An alternative solution I found was to open a PDF document in Adobe and export it to XML. At least with my PDFs it preserved the table information, and I was then able to programmatically work with the XML to generate tabular files like Excel etc.

The other issue I ran into was that Adobe only lets you export one file at a time and I had lots of files. Luckily Adobe also has a merge function. I ended up merging all the files together and then exporting them as one big XML file and working with that file to generate what I needed.
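
For the "programmatically work with the XML" step, here is a hedged sketch with Python's ElementTree; the Table/TR/TD tag names are an assumption about how Acrobat tags the export, so inspect your own XML and adjust:

import xml.etree.ElementTree as ET

root = ET.parse("export.xml").getroot()
for table in root.iter("Table"):
    for tr in table.iter("TR"):
        # join all text inside each cell, whatever child elements it has
        cells = ["".join(td.itertext()).strip() for td in tr.iter("TD")]
        print(cells)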

Shaun Poore