The Python 2 docs for filecmp.cmp() say:

Unless shallow is given and is false, files with identical os.stat() signatures are taken to be equal.

To me this sounds as if two files that are identical except for their os.stat() signatures will be considered unequal. However, that does not seem to be the case, as illustrated by running the following code snippet:

import filecmp
import os
import shutil
import time

with open('test_file_1', 'w') as f:
    f.write('file contents')
shutil.copy('test_file_1', 'test_file_2')
time.sleep(5)  # pause to get a different time-stamp
os.utime('test_file_2', None)  # change copied file's time-stamp

print 'test_file_1:', os.stat('test_file_1')
print 'test_file_2:', os.stat('test_file_2')
print 'filecmp.cmp():', filecmp.cmp('test_file_1', 'test_file_2')

Output:

test_file_1: nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0,
  st_uid=0, st_gid=0, st_size=13L, st_atime=1320719522L, st_mtime=1320720444L, 
  st_ctime=1320719522L)
test_file_2: nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0, 
  st_uid=0, st_gid=0, st_size=13L, st_atime=1320720504L, st_mtime=1320720504L, 
  st_ctime=1320719539L)
filecmp.cmp(): True

As you can see, the two files' time stamps (st_atime, st_mtime, and st_ctime) are clearly not the same, yet filecmp.cmp() indicates that the two files are identical. Am I misunderstanding something, or is there a bug in either filecmp.cmp()'s implementation or its documentation?

Update

The Python 3 documentation has been rephrased. IMHO the new wording is an improvement only in the sense that it better implies that files with different time stamps might still be considered equal even when shallow is true. It currently says:

If shallow is true, files with identical os.stat() signatures are taken to be equal. Otherwise, the contents of the files are compared.

FWIW I think it would have been better to simply have said something like this:

If shallow is true, file content is compared only when os.stat() signatures are unequal.
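
The practical difference between the two modes is easiest to see from the opposite direction: two files with different contents but matching signatures. The following is a minimal sketch, assuming (as CPython's filecmp.py does, to my reading) that the signature consists of file type, size, and modification time. The function name `demo` and the file names are made up for illustration.

```python
import filecmp
import os
import shutil
import tempfile


def demo():
    """Return (shallow_result, deep_result) for two files that have the
    same size and mtime but different contents."""
    tmp = tempfile.mkdtemp()
    try:
        a = os.path.join(tmp, 'a.txt')
        b = os.path.join(tmp, 'b.txt')
        with open(a, 'w') as f:
            f.write('abc')
        with open(b, 'w') as f:
            f.write('xyz')  # same length, different contents
        # Force identical access and modification times on both files.
        os.utime(a, (1000000000, 1000000000))
        os.utime(b, (1000000000, 1000000000))
        shallow = filecmp.cmp(a, b, shallow=True)   # signatures match
        deep = filecmp.cmp(a, b, shallow=False)     # contents are read
        return shallow, deep
    finally:
        shutil.rmtree(tmp)


print(demo())
```

With shallow left true the matching signatures short-circuit the comparison, so the differing contents go unnoticed; with shallow false the contents are read and the difference is found.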

martineau

2 Answers

You're misunderstanding the documentation. Line #2 says:

Unless shallow is given and is false, files with identical os.stat() signatures are taken to be equal.

Files with identical os.stat() signatures are taken to be equal, but the logical inverse is not true: files with unequal os.stat() signatures are not necessarily taken to be unequal. Rather, they may or may not be equal, so the actual file contents are compared. Since the file contents are found to be identical, filecmp.cmp() returns True.

As per the third clause, once it determines that the files are equal, it will cache that result and not bother re-reading the file contents if you ask it to compare the same files again, so long as those files' os.stat structures don't change.
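
The decision order described above can be sketched roughly as follows. This is a simplified paraphrase based on my reading of CPython's filecmp.py, not the library's actual code: the cache is left out, and the names `_signature` and `cmp_sketch` are hypothetical.

```python
import os
import stat


def _signature(path):
    """Build a (file type, size, mtime) tuple from os.stat()."""
    st = os.stat(path)
    return (stat.S_IFMT(st.st_mode), st.st_size, st.st_mtime)


def cmp_sketch(f1, f2, shallow=True):
    """Approximate the order of checks in filecmp.cmp(), without caching."""
    s1, s2 = _signature(f1), _signature(f2)
    # Anything that is not a regular file compares unequal.
    if s1[0] != stat.S_IFREG or s2[0] != stat.S_IFREG:
        return False
    # Shallow shortcut: identical signatures are taken to be equal.
    if shallow and s1 == s2:
        return True
    # Files of different sizes cannot have identical contents.
    if s1[1] != s2[1]:
        return False
    # Otherwise fall back to comparing the contents chunk by chunk.
    with open(f1, 'rb') as a, open(f2, 'rb') as b:
        while True:
            c1 = a.read(8192)
            c2 = b.read(8192)
            if c1 != c2:
                return False
            if not c1:
                return True
```

Note that an unequal signature never produces False on its own unless the sizes differ; a differing mtime only means the shortcut is skipped and the contents get read.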

Adam Rosenfield
  • Yes, I assumed that the logical inverse of line #2 would be true. Thanks. Is there a Python built-in that would consider them unequal or must I roll my own? – martineau Nov 08 '11 at 10:01
  • After locating and examining the code for filecmp.py I can now plainly see what `cmp()` does -- which I think is inconsistent with the documentation. When `shallow` is given as (or defaults to) a true value, it is not being honored, in the sense that the files are still read whenever their `os.stat()`-based signatures differ. Seems to me that if `shallow` is true their contents shouldn't ever be compared and only their signatures considered -- which is what the docs say and the behavior I desire. – martineau Nov 14 '11 at 07:42

It seems that 'rolling your own' is indeed what is required to produce a desirable result. It would simply be nice if the documentation were clear enough to make a casual reader reach that conclusion.

Here's the function I am presently using:

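import os

def cmp_stat_weak(a, b):
    """Return True if a and b have the same size and modification time."""
    sa = os.stat(a)
    sb = os.stat(b)
    # Compare only metadata; file contents are never read.
    return sa.st_size == sb.st_size and sa.st_mtime == sb.st_mtime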
  • Yes, that's the sort of comparison I thought `filecmp.cmp()` would do using its `shallow` option, but how to do it was not my question -- and besides, creating a drop-in replacement for what the function does is a little more involved than this. Also, I don't consider myself a "casual reader" in the sense that my perusal of the document was in any way shallow or superficial; it was quite the opposite. "Typical" would be a better adjective. – martineau Jan 30 '13 at 20:21