check files for equality

Question

What's the most elegant way to check two files for equality in Python? A checksum? Comparing bytes? Assume the files won't be larger than 100-200 MB.

wump · Answer 1 · 2010-11-26T10:19:25.487

9

What about the filecmp module? It can do file comparison in many different ways with different tradeoffs.

And even better, it is part of the standard library:

http://docs.python.org/library/filecmp.html
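Here is a minimal sketch of the content-comparing mode, assuming throwaway temp files (the helper `write_temp` is made up for illustration). With the default `shallow=True`, `filecmp.cmp` takes two files to be equal when their `os.stat()` signatures (type, size, modification time) match, without reading them; passing `shallow=False` makes it compare the contents.

```python
import filecmp
import os
import tempfile

def write_temp(data):
    """Hypothetical helper: write bytes to a new temp file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path

a = write_temp(b"hello world\n")
b = write_temp(b"hello world\n")
c = write_temp(b"hello there\n")  # same length as the others, different bytes

# shallow=False forces a content comparison instead of trusting os.stat() data
same_ab = filecmp.cmp(a, b, shallow=False)
same_ac = filecmp.cmp(a, c, shallow=False)
print(same_ab, same_ac)

for p in (a, b, c):
    os.remove(p)
```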

edited Nov 26 '10 at 10:19

answered Nov 26 '10 at 09:38

wump


Interesting module, but it gives you less than asked in the question: "returning True if [the files] *seem* equal, False otherwise". I take this as meaning that the comparison is approximate. It would be interesting to know how approximate it is. Furthermore, I was not able to find how the comparison can be done "in many different ways with different trade-offs": could you elaborate on this? – Eric O Lebigot Nov 26 '10 at 13:02
Using hashlib to get the MD5 is also 'approximate'. The only way to be sure is to do a byte-by-byte comparison. filecmp supports this if you pass False as the `shallow` parameter. – wump Nov 30 '10 at 08:34

invert · Accepted Answer · 2010-11-30T12:44:46.393

6

Use hashlib to get the MD5 digest of each file, and compare the results.

#!/usr/bin/env python
import hashlib

def filemd5(filename, block_size=2**20):
    """Return the MD5 digest of a file, read in blocks to bound memory use."""
    md5 = hashlib.md5()
    # Open in binary mode: hashlib needs bytes, and text mode can alter the data.
    with open(filename, 'rb') as f:
        while True:
            data = f.read(block_size)
            if not data:
                break
            md5.update(data)
    return md5.digest()

if __name__ == "__main__":
    a = filemd5('/home/neo/todo')
    b = filemd5('/home/neo/todo2')
    print(a == b)

Update: As of Python 2.1 there is a filecmp module that does just what you want, and it has functions to compare directories too. I never knew about this module; I'm still learning Python myself :-)

>>> import filecmp
>>> filecmp.cmp('undoc.rst', 'undoc.rst')
True
>>> filecmp.cmp('undoc.rst', 'index.rst')
False

edited Nov 30 '10 at 12:44

answered Nov 26 '10 at 08:57

invert


What is the purpose of performing an MD5 hash? Why not simply read the two files block by block until one block differs? This would skip the MD5 calculation phase, *and* would be robust against (admittedly unlikely) hash collisions. – Eric O Lebigot Nov 26 '10 at 13:08
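The block-by-block idea from this comment could look roughly like the sketch below. The function name `files_equal` and the 1 MiB default block size are my own choices for illustration, not something from the thread. It stops at the first differing block, so files that differ early are never read in full.

```python
import os

def files_equal(path1, path2, block_size=2**20):
    """Return True if both files have identical contents."""
    # Files of different sizes can never be equal, so skip reading entirely.
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        while True:
            block1 = f1.read(block_size)
            block2 = f2.read(block_size)
            if block1 != block2:
                return False  # first mismatch ends the comparison
            if not block1:
                return True   # both files reached EOF with no mismatch
```

No hashing is involved, so there is no collision risk, and memory use stays bounded by two blocks regardless of file size.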
@EOL you have a valid point there, that will work too. The only advantage I see is in storing the hash plus the file's modification date, and reusing those precalculated values in the future. – invert Nov 30 '10 at 12:23

Joey · Answer 3 · 2010-11-26T08:41:05.737

4

Ok, this might need two separate answers.

If you have many files to compare, go for the checksum and cache the checksum for each file. To be sure, compare matching files byte for byte afterwards.

If you have only two files, go directly for byte comparison because you have to read the file anyway to compute the checksum.

In both cases, use the file size as an early way of checking for inequality.
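For the many-files case, the three steps above (size first, then cached checksum, then a byte-for-byte confirmation) might be sketched like this. The names `file_digest` and `find_duplicates` are invented for the example, and SHA-256 is just one possible checksum.

```python
import filecmp
import hashlib
import os
from collections import defaultdict

def file_digest(path, block_size=2**20):
    """Hash a file in blocks so large files are not loaded at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(paths):
    """Return groups (lists of paths) whose contents are identical."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[file_digest(path)].append(path)
        for candidates in by_digest.values():
            first = candidates[0]
            # Confirm against the first file byte for byte to rule out collisions.
            # Simplification: files that fail this check are dropped, not regrouped.
            confirmed = [first]
            for other in candidates[1:]:
                if filecmp.cmp(first, other, shallow=False):
                    confirmed.append(other)
            if len(confirmed) > 1:
                groups.append(confirmed)
    return groups
```

Each file is hashed at most once, and only files that share a size with some other file are read at all.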

edited Nov 26 '10 at 08:41

answered Nov 26 '10 at 08:35

Joey


Even when comparing multiple files, the checksum might be counter productive. If you just want to check that `a == b == c == d`, then I don't see the point of it. If you want something like `e in (a, b, c, d)`, and you then want to do it with `e, f, g` etc., then I think the checksum starts to pay for itself. – aaronasterling Nov 26 '10 at 08:50
Well, the most common case for comparing multiple files is to find duplicates. At least I've rarely seen the need to make sure that a number of files are all alike. – Joey Nov 26 '10 at 10:07

score 1 · Answer 4 · answered Nov 26 '10 at 09:31

Before attempting any of the other solutions, you might want to call os.path.getsize(...) on both files. If the sizes differ, there is no need to compare bytes or calculate a checksum.

Of course, this only helps if the files are not all of one fixed size.

Example:

import os

def foo(f1, f2):
    # Files of different sizes cannot have equal contents
    if os.path.getsize(f1) != os.path.getsize(f2):
        return False # Or similar

    ... # Checksumming / byte-comparing / whatever

score -2 · Answer 5 · answered Nov 26 '10 at 08:36

I would use a checksum such as MD5 instead of a byte comparison, plus a date check and, depending on your needs, a name check.

SubniC


What does a file's date have to do with its contents? – Joey Nov 26 '10 at 08:37
A checksum is a good solution, I agree, but what do you mean by "check dates"? – illegal-immigrant Nov 26 '10 at 08:39
Don't you have to read both files in to get their checksum anyways? If so, then I think that all the checksum does is add a risk of collision. Edit: Unless you want to compare multiple files as Joey just stated in an answer. – aaronasterling Nov 26 '10 at 08:41
Oops... @taras.roshko is right, the date is less useful. – SubniC Nov 26 '10 at 09:08

score -2 · Answer 6 · answered Nov 26 '10 at 08:40

What about shelling out to cmp?

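import subprocess

# cmp exits with 0 if the files are identical, 1 if they differ, >1 on error.
# The -s flag suppresses cmp's own output so only the exit status matters.
status = subprocess.call(["/usr/bin/cmp", "-s", "file1", "file2"])
if status == 0:
    print("files are same")
elif status == 1:
    print("files differ")
else:
    print("uh oh!")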

Paul Schreiber


Good luck on a Windows system ;-) – Joey Nov 26 '10 at 08:41
Not a cross platform solution... – illegal-immigrant Nov 26 '10 at 08:45
