84

I need to compare a large number of PDF files for their visual content. Because the PDF files were created on different platforms and with different versions of the software, there are structural differences. For example:

  • the chunking of text can be different
  • the write order can be different
  • the position can differ by a few pixels

It should compare the content the way a human would, not the internal structure. I want to test for regressions between different versions of the PDF generator that we use.

casperOne
Horcrux7
  • 3
    A partial answer would be to use [pdftotext](http://en.wikipedia.org/wiki/Pdftotext) and compare the text contained. – Sklivvz Sep 28 '08 at 11:05
  • But this will ignore all non-text information like lines, boxes, pictures, charts, etc. I also think that it does not show the visual position of the text, only its structural position. – Horcrux7 Sep 28 '08 at 11:30
  • I agree, it is not a sufficient criterion. On the other hand it is a necessary criterion, therefore it is adequate as a unit test. – Sklivvz Sep 28 '08 at 11:35
  • You can always add a better unit test later! – Sklivvz Sep 28 '08 at 11:36
  • If there are images on pages, and you want a human-like evaluation for those, there's not much you can do but have a human compare those pages, unless you want to work on a whole new project, just as big as your current one, to try it out. – Chris Charabaruk Sep 28 '08 at 11:52
  • Never actually been in your situation before, but I've tried [ExamDiff Pro](http://www.prestosoft.com/edp_examdiffpro.asp) to compare PDFs and it worked for me. – cubex Sep 28 '08 at 11:35
  • I think a bitmap check should work in your case. I use an automation tool to compare 2 images using a bitmap checkpoint – Chanakya Sep 29 '08 at 17:57
  • What an intelligent, \\\*#?`%& decision to close this question as **'not constructive'**! *(Gotta luv it when question-closing-moderators destroy community content which carries tags where these same mods don't have any personal reputation in!)* – Kurt Pfeifle Sep 18 '12 at 21:22
  • Another case of useless closing a question concerning a highly relevant realworld use-case. I wish I knew how to propose a sound reasoning on Meta so this will stop eventually. It just *feels* so wrong every time it happens. – sjas Jan 22 '14 at 14:02
  • related: http://superuser.com/q/46123/35237 – Tobias Kienzler Dec 02 '14 at 10:06
  • There is a FREE library to compare pdf pixel by pixel. Check this blog. http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/ – vins Jun 16 '15 at 23:37
  • You can use the free [Copyleaks Compare Two PDF](https://copyleaks.com/text-compare/compare-pdf-files) tool. You can upload up to 12 files for comparison. Additionally, the comparison is textual, not semantic (Git style). – No1Lives4Ever Jul 26 '20 at 04:57

10 Answers

39

Because no such tool was available, we have written one. You can download the i-net PDF content comparer and use it. I hope this helps others with the same problem. If you have problems with it or have feedback for us, you can contact our support.

Epaga
Horcrux7
  • The advantage of this tool is, that it's neither a pure text comparer nor an image comparer. It compares by structure, checks if the containing elements are "the same" - so your compared PDFs do not have to match 100% but be within a definable similarity. And it's for free. – gamma Oct 14 '10 at 05:22
  • I'd recommend this too! It crashed on a document so I sent it to them. They fixed it! :D I feel great. It can generate images with differences or it can give you a textual report in the console. – Janus Troelsen Jun 10 '11 at 21:09
  • 4
    @gamma Where is that application free? It costs at least 200 USD per year (!). It's only free once for 30 days. That's way too expensive for what I'd do with it. – ygoe Oct 11 '12 at 08:10
  • @LonelyPixel Yep, you're right. Version 1.0 was for free (as of 2010-10-14). We've changed quite a bit on it and it's now a paid tool (2012-10). You can however try it for 30 days without any limitations. It has really gained a lot of new features, stability and reliability. I hope you still have a look at it ;) – gamma Oct 11 '12 at 11:16
  • I too need to compare pdf files - I have come up with a jar using apache pdfbox. Check this http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/ for example & download. – vins Jun 14 '15 at 00:11
  • This is a great tool. Unfortunately, it gets severely distracted by line numbers (I am comparing my author-generated pdf to publisher page proofs that do have line numbers). Could the tool be made to ignore (line) numbers? – bers Jun 22 '17 at 07:49
21

There is actually a diffpdf tool.

http://www.qtrac.eu/diffpdf.html

Its weakness is that it doesn't react well when additions make new text shift partially to a new page. For instance, if old page 4 should be compared to the end of page 5 and the beginning of page 6, you'll need to shift parameters to compare the two slices separately.

jabial
13

I've used a home-baked script which

  • converts all pages on two PDFs to bitmaps
  • colors pages of PDF 1 to red-on-white
  • changes white to transparent on pages of PDF 2
  • overlays each page from PDF 2 on top of the corresponding page from PDF 1
  • runs conversion/coloring and overlaying in parallel on multiple cores

Software used:

  • GhostScript for PDF-to-bitmap conversion
  • ImageMagick for coloring, transparency and overlay
  • inotify for synchronizing parallel processes
  • any PNG-capable image viewer for reviewing the result

Pros:

  • simple implementation
  • all tools used are open source
  • great for finding small differences in layout

Cons:

  • the conversion is slow
  • major differences between PDFs (e.g. pagination) result in a mess
  • bitmaps are not zoomable
  • only works well for black-and-white text and diagrams
  • no easy-to-use GUI

I've been looking for a tool which would do the same on PDF/PostScript level.

Here's how our script invokes the utilities (note that ImageMagick uses GhostScript behind the scenes to do the PDF->PNG conversion):

$ convert -density 150x150 -fill red -opaque black +antialias 1.pdf back%02d.png
$ convert -density 150x150 -transparent white +antialias 2.pdf front%02d.png
$ composite front01.png back01.png result01.png # do this for all pairs of images
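
The "do this for all pairs of images" step can be sketched as a small loop. This is only a minimal sketch, not the original script: `overlay_pages` is a hypothetical helper name, and it assumes the `frontNN.png` / `backNN.png` files produced by the two `convert` calls above sit in one directory and that ImageMagick's `composite` is on the PATH.

```shell
#!/bin/sh
# Sketch: overlay every frontNN.png on the matching backNN.png.
# overlay_pages is a hypothetical helper; the file naming follows the
# convert commands above (front%02d.png / back%02d.png).
overlay_pages() (
    dir=${1:-.}
    cd "$dir" || exit 1
    for front in front*.png; do
        [ -e "$front" ] || continue      # glob matched nothing
        page=${front#front}              # e.g. "01.png"
        back="back$page"
        if [ -e "$back" ]; then
            # same call as in the answer: front page drawn over back page
            composite "$front" "$back" "result$page"
        else
            echo "no counterpart for $front (page counts differ?)" >&2
        fi
    done
)
```

Calling `overlay_pages /path/to/pngs` writes one `resultNN.png` per page pair. Pages that exist in only one PDF are reported on stderr instead of being composited, which at least makes pagination differences visible.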
akaihola
  • 1
    Why not share the full script? – Janus Troelsen May 19 '11 at 20:25
  • 1
    This is what I used for compositing: `for i in $(seq -w 0 05); do /cygdrive/c/Progra~1/ImageMagick-6.6.9-Q8/composite.exe 1-$i.png 2-$i.png result-$i.png; done` – Janus Troelsen May 19 '11 at 21:40
  • Here's a script that doesn't write temporary files to disk and uses Poppler's pdftoppm, which is faster than Ghostscript: https://gist.github.com/brechtm/891de9f72516c1b2cbc1. It outputs one JPG for each page of the PDFs in a `pdfdiff` directory and additionally prints the numbers of the pages which differ between the two PDFs. – Brecht Machiels Mar 31 '16 at 13:47
12

I don't seem to be able to see this here, so here it is: via superuser: How to compare the differences between two PDF files? (answer #229891, by @slestak), there is

https://github.com/vslavik/diff-pdf

(build steps for Ubuntu Natty can be found in get-diff-pdf.sh)

As far as I can see, it basically overlays the text/graphics of each page in the pdf(s), allowing you to easily see if there were any changes...

Cheers!

Community
sdaau
9

We've also used pdftotext (see Sklivvz's answer) to generate ASCII versions of PDFs and wdiff to compare them.

Use pdftotext's -layout switch to enhance readability and get some idea of changes in the layout.

To get nice colored output from wdiff, use this wrapper script:

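#!/bin/bash
# $'...' quoting is a bash feature, so run this with bash rather than plain sh
RED=$'\e[1;31m'
GREEN=$'\e[1;32m'
RESET=$'\e[0m'
# -w/-x wrap deleted words, -y/-z wrap inserted words, -n avoids wrapping across lines
wdiff -w "$RED" -x "$RESET" -y "$GREEN" -z "$RESET" -n "$1" "$2"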
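
For completeness, the two steps (pdftotext with -layout, then wdiff) could be chained roughly like this. It is a sketch under assumptions, not our actual script: `pdf_wdiff` is a made-up helper name, and it relies on `pdftotext`, `wdiff` and `mktemp` being installed.

```shell
#!/bin/sh
# Sketch: word-level diff of the text layer of two PDFs.
# pdf_wdiff is a hypothetical helper name.
pdf_wdiff() (
    tmp=$(mktemp -d) || exit 2
    trap 'rm -rf "$tmp"' EXIT            # clean up the text dumps
    # -layout keeps the physical layout of the page as far as possible
    pdftotext -layout "$1" "$tmp/old.txt" || exit 2
    pdftotext -layout "$2" "$tmp/new.txt" || exit 2
    # wdiff's exit status tells whether the two texts differ
    wdiff -n "$tmp/old.txt" "$tmp/new.txt"
)
```

`pdf_wdiff old.pdf new.pdf` prints the word diff and returns wdiff's exit status, so it can be used directly in a regression test loop. The color options from the wrapper script can be added to the `wdiff` call if the output is read in a terminal.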
akaihola
4

I think your best approach would be to convert the PDF to images at a decent resolution and then do an image compare.

To generate images from PDF you can use Adobe PDF Library or the solution suggested at Best way to convert pdf files to tiff files.

To compare the generated TIFF files I found GNU tiffcmp (for windows part of GnuWin32 tiff) and tiffinfo did a good job. Use tiffcmp -l and count the number of lines of output to find any differences. If you are happy to have a small amount of content change (e.g. anti-aliasing differences) then use tiffinfo to count the total number of pixels and you can then generate a percentage difference value.
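
A rough sketch of that percentage calculation follows. The helper names (`tiff_pixels`, `percent`, `tiff_percent_diff`) are made up for illustration. It assumes single-page, 8-bit grayscale TIFFs, so that each line printed by `tiffcmp -l` corresponds to one differing pixel, and it assumes `tiffinfo` prints a line of the form "Image Width: W Image Length: H". Check both against your libtiff version before relying on the numbers.

```shell
#!/bin/sh
# Sketch: percentage of differing pixels between two TIFFs.
# All function names here are hypothetical helpers.

# Read tiffinfo output on stdin and print width * length of the first image.
# Assumes a line like "  Image Width: 1240 Image Length: 1754".
tiff_pixels() {
    awk '/Image Width:/ { print $3 * $6; exit }'
}

# Print 100 * part / total with two decimals; fail if total is not positive.
percent() {
    awk -v part="$1" -v total="$2" \
        'BEGIN { if (total > 0) printf "%.2f\n", 100 * part / total; else exit 1 }'
}

# Count the lines tiffcmp -l reports and relate them to the image size.
tiff_percent_diff() {
    differing=$(tiffcmp -l "$1" "$2" | wc -l)
    total=$(tiffinfo "$1" | tiff_pixels)
    percent "$differing" "$total"
}
```

`tiff_percent_diff old.tif new.tif` prints a single number that can be compared against a tolerance, for example to accept small anti-aliasing differences while still flagging real changes.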

By the way for anyone doing simple PDF comparison where the structure hasn't changed it is possible to use command line diff and ignore certain patterns, e.g. with GNU diff 2.7:

diff --brief -I xap: -I xapMM: -I /CreationDate -I /BaseFont -I /ID --binary --text

This still has the problem that it doesn't always catch changes in generated font names.
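
To run that diff over a whole batch, something like the following could work. It is only a sketch: `compare_pdf_dirs` and the two-directory layout (same file names in an "old" and a "new" directory) are assumptions for illustration. The `-I` patterns are the ones from the command above; `--binary` is left out here, so add it back if your platform needs it.

```shell
#!/bin/sh
# Sketch: compare same-named PDFs in two directories, ignoring metadata lines.
# compare_pdf_dirs and its directory arguments are hypothetical.
compare_pdf_dirs() (
    old_dir=$1
    new_dir=$2
    status=0
    for old in "$old_dir"/*.pdf; do
        [ -e "$old" ] || continue        # no PDFs in old_dir
        name=$(basename "$old")
        if [ ! -e "$new_dir/$name" ]; then
            echo "MISSING $name"
            status=1
        elif diff --brief --text \
                  -I xap: -I xapMM: -I /CreationDate -I /BaseFont -I /ID \
                  "$old" "$new_dir/$name" >/dev/null; then
            echo "SAME $name"
        else
            echo "DIFF $name"
            status=1
        fi
    done
    exit $status                         # non-zero if anything differed
)
```

`compare_pdf_dirs old/ new/` prints one line per file and returns a non-zero status if any file differs or is missing, which makes it easy to hook into a regression test run.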

Community
danio
1

Our product, PDF Comparator - http://www.premediasystems.com/pdfc.html - will do this quite elegantly and efficiently. It's also not free, and is a Mac OS X only application.

  • This tool compares pixel by pixel, which is very simple. The question asked for a comparison the way a human would do it. – Horcrux7 Aug 05 '10 at 09:07
  • 1
    @Horcrux7: But how else than comparing 'pixel by pixel' do human eyes compare different pages that are similar looking?!? – Kurt Pfeifle Sep 18 '12 at 21:49
  • @KurtPfeifle - I realize this is an old comment...but human beings do **not** compare images on a pixel to pixel basis; the way human beings compare differences in images is pretty complex, but relies heavily on pattern recognition and heuristics. – CBRF23 Aug 18 '15 at 17:28
  • @CBRF23: True, and I'm aware of that -- but all this heuristics in the end still roots in "pixel-by-pixel" comparisons. For some other, higher level heuristics, performed with `ImageMagick`, see some of my other answers: [one](http://stackoverflow.com/a/27047191/359307) -- [two](http://stackoverflow.com/a/27976171/359307) -- [three](http://stackoverflow.com/a/28252818/359307). – Kurt Pfeifle Aug 18 '15 at 17:38
  • @CBRF23: ...and the original poster, (at)Horcrux7 even mentioned "pixels" in his question, and explicitly didn't want "internal structure" of the files compared (even though his comment here again contradicts it). – Kurt Pfeifle Aug 18 '15 at 17:41
  • @KurtPfeifle - nice examples how to use ImageMagick - but I would not compare that to human perception, humans just aren't built for pixel by pixel comparisons. I prove my point: using your wizard example with the four images, pick any two of them and try to identify all the different pixels without using any tools - just your eyes. I guarantee you cannot do it. You may spot some clusters of pixels that are different, but without using tools (e.g. software, or writing utensils) you will not be able to do this. You cannot identify how many pixels there are, let alone all that are different. – CBRF23 Aug 18 '15 at 17:46
  • @KurtPfeifle - I'm not arguing this answer is useful - just refuting your assertion that a pixel by pixel comparison is analogous to how human beings perceive differences in images ;) – CBRF23 Aug 18 '15 at 17:48
  • @CBRF23: You're missing the point. The OP (from *2008*!) asked for a ***tool*** to compare a "large number of PDF files" -- just because he didn't want to have it done by humans themselves. The (good and bad) answers here reflect what people at the time suggested. *(I myself came across this thread only in 2012!)*. --- ***Of course*** I cannot identify, without tools, all pixels that are different! What makes you think I said so? -- If you ***ask for a tool***, you have to base it on pixel-by-pixel comparisons. And even human perception, in the end, is rooted in "pixel-by-pixel" viewing... – Kurt Pfeifle Aug 18 '15 at 17:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/87302/discussion-between-cbrf23-and-kurt-pfeifle). – CBRF23 Aug 18 '15 at 17:56
  • @CBRF23: Sorry, I'm just on my way off + offline.... – Kurt Pfeifle Aug 18 '15 at 17:57
  • @KurtPfeifle - no worries, it's off-topic discussion on a seven year old post - we both have better things to do ;) – CBRF23 Aug 18 '15 at 18:03
1

Based on your needs, a convert to text solution would be the easiest and most direct. I did think the bitmap idea was pretty cool.

user602475
0

Bluebeam PDF software will do this for you.

0

You can batch compare pdf files with Tarkware Pdf Comparer. But it's not free and requires Adobe Acrobat.

erks