5

What's the most elegant way to check two files for equality in Python? Checksum? Byte comparison? Assume the files won't be larger than 100-200 MB.

illegal-immigrant

6 Answers

9

What about the filecmp module? It can do file comparison in many different ways with different tradeoffs.

And even better, it is part of the standard library:

http://docs.python.org/library/filecmp.html
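
As a rough sketch of the trade-off: `filecmp.cmp` takes a `shallow` argument. With the default (`shallow=True`), two files whose `os.stat()` signatures (file type, size, modification time) match are reported as equal without being read; with `shallow=False` the contents are compared. The helper name `files_equal` is made up for this example.

```python
import filecmp

def files_equal(path_a, path_b):
    """Return True if both files have identical contents."""
    # shallow=False forces a content comparison instead of trusting
    # matching os.stat() signatures (file type, size, modification time).
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The shallow mode is cheaper, but it can report two different files as equal if they happen to share size and modification time.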

wump
  • Interesting module, but it gives you less than asked in the question: "returning True if [the files] *seem* equal, False otherwise". I take this as meaning that the comparison is approximate. It would be interesting to know how approximate it is. Furthermore, I was not able to find how the comparison can be done "in many different ways with different trade-offs": could you elaborate on this? – Eric O Lebigot Nov 26 '10 at 13:02
  • Using hashlib to get the MD5 is also 'approximate'. The only way to be sure is to do a byte-by-byte comparison. filecmp supports this, by passing False through the `shallow` parameter. – wump Nov 30 '10 at 08:34
6

Use hashlib to get the MD5 of each file, and compare the results.

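#!/usr/bin/env python
import hashlib

def filemd5(filename, block_size=2**20):
    """Return the MD5 digest of a file, read in blocks of block_size bytes."""
    md5 = hashlib.md5()
    # Binary mode so the raw bytes are hashed (required on Python 3).
    with open(filename, 'rb') as f:
        while True:
            data = f.read(block_size)
            if not data:
                break
            md5.update(data)
    return md5.digest()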

if __name__ == "__main__":
    a = filemd5('/home/neo/todo')
    b = filemd5('/home/neo/todo2')
    print(a == b)

Update: As of Python 2.1 there is a filecmp module that does just what you want, and has methods to compare directories too. I never knew about this module, I'm still learning Python myself :-)

>>> import filecmp
>>> filecmp.cmp('undoc.rst', 'undoc.rst')
True
>>> filecmp.cmp('undoc.rst', 'index.rst')
False
invert
  • What is the purpose of performing an MD5 hash? Why not simply read the two files block by block until one block differs? This would skip the MD5 calculation phase, *and* would be robust against (admittedly unlikely) hash collisions. – Eric O Lebigot Nov 26 '10 at 13:08
  • 2
    @EOL you have a valid point there, that will work too. The only advantage I see, is by storing the hash + file modified date, and using those pre-calculated values again in the future. – invert Nov 30 '10 at 12:23
4

Ok, this might need two separate answers.

If you have many files to compare, go for the checksum and cache the checksum for each file. To be sure, compare matching files byte for byte afterwards.
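
A minimal sketch of that many-files workflow, assuming the goal is to find groups of duplicates: bucket by size, then by checksum, then confirm byte for byte. The names `file_digest` and `find_duplicates` are hypothetical, and SHA-256 is swapped in for MD5 here; any `hashlib` algorithm would do.

```python
import filecmp
import hashlib
import os
from collections import defaultdict

def file_digest(path, block_size=2**20):
    """Hash a file in blocks so it never has to fit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def find_duplicates(paths):
    """Return a list of groups (lists) of paths with identical contents."""
    # Step 1: files of different sizes can never be equal.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Step 2: hash each candidate once and bucket by digest.
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[file_digest(path)].append(path)
        # Step 3: confirm byte for byte to rule out hash collisions.
        for candidates in by_digest.values():
            remaining = candidates
            while len(remaining) > 1:
                first, rest = remaining[0], remaining[1:]
                same = [first] + [
                    p for p in rest if filecmp.cmp(first, p, shallow=False)
                ]
                if len(same) > 1:
                    groups.append(same)
                remaining = [p for p in rest if p not in same]
    return groups
```

Each file is hashed at most once, which is where the checksum pays off compared with comparing every pair of files directly.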

If you have only two files, go directly for byte comparison because you have to read the file anyway to compute the checksum.

In both cases, use the file size as an early way of checking for inequality.
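
For the two-file case, the idea can be sketched roughly as follows: check the sizes first, then read both files in blocks and stop at the first difference. The function name `same_contents` and the 1 MiB block size are assumptions made for illustration.

```python
import os

def same_contents(path_a, path_b, block_size=2**20):
    """Compare two files block by block, stopping at the first difference."""
    # Different sizes mean different contents; no need to read anything.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files reached end of file together
                return True
```

Reading in blocks keeps memory use bounded even for files at the upper end of the 100-200 MB range mentioned in the question.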

Joey
  • Even when comparing multiple files, the checksum might be counter productive. If you just want to check that `a == b == c == d`, then I don't see the point of it. If you want something like `e in (a, b, c, d)`, and you then want to do it with `e, f, g` etc., then I think the checksum starts to pay for itself. – aaronasterling Nov 26 '10 at 08:50
  • 1
    Well, the most common case for comparing multiple files is to find duplicates. At least I've rarely seen the need to make sure that a number of files are all alike. – Joey Nov 26 '10 at 10:07
1

Before attempting any of the other solutions, you might want to call os.path.getsize(...) on both files. If the sizes differ, there is no need to compare bytes or calculate a checksum.

Of course, this only helps if the filesize isn't fixed.

Example:

import os

def foo(f1, f2):
    # Files of different sizes cannot be equal, so skip the expensive check.
    if os.path.getsize(f1) != os.path.getsize(f2):
        return False # Or similar

    ... # Checksumming / byte-comparing / whatever
plundra
-2

I would use a checksum (MD5, for example) instead of a byte comparison, plus a date check and, depending on your needs, a name check.

SubniC
-2

What about shelling out to cmp?

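import subprocess

# cmp exits with 0 if the files are identical, 1 if they differ,
# and a value greater than 1 if an error occurred. -s suppresses output.
status = subprocess.call(["/usr/bin/cmp", "-s", "file1", "file2"])
if status == 0:
    print("files are same")
elif status == 1:
    print("files differ")
else:
    print("uh oh!")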
Paul Schreiber