
I need to synchronize files from directory A to directory B. I check for files in A and then compare them with the files in B one by one. If a file with the same name is found in B, I check whether the files differ by comparing their sizes. If the sizes are different, I log this and move on to the next file. However, if the sizes are the same, I need to verify whether the contents differ as well. For this, I thought of computing hashes of both files and comparing them. Is this better, or should I compare the files byte by byte? Please also explain why you would choose one method over the other.

I am using C# (.NET 4). I need to preserve all existing files on B while replicating newly added files from A, and to report (and skip) any duplicates.

Thanks.

EDIT: This job will run nightly, and I have the option of storing hashes of files only for directory B; directory A will be populated dynamically, so I cannot pre-hash those files. Also, which hash algorithm is better for this purpose, as I want to avoid hash collisions as well?
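
Here is a minimal sketch of the loop I have in mind (directory paths are placeholders, and `FilesHaveSameContent` stands for whichever content check, hash or byte-by-byte, turns out to be the better choice):

```csharp
using System;
using System.IO;

class SyncSketch
{
    static void Main()
    {
        const string dirA = @"C:\A";   // source, populated dynamically
        const string dirB = @"C:\B";   // target, existing files are preserved

        foreach (var fileA in new DirectoryInfo(dirA).GetFiles())
        {
            string pathB = Path.Combine(dirB, fileA.Name);

            if (!File.Exists(pathB))
            {
                fileA.CopyTo(pathB);                                     // newly added on A: replicate to B
            }
            else if (fileA.Length != new FileInfo(pathB).Length)
            {
                Console.WriteLine("Size differs: " + fileA.Name);        // log and move on
            }
            else if (FilesHaveSameContent(fileA.FullName, pathB))
            {
                Console.WriteLine("Duplicate, skipped: " + fileA.Name);
            }
            else
            {
                Console.WriteLine("Same size, different content: " + fileA.Name);
            }
        }
    }

    // Placeholder: compare by hash, byte-by-byte, or both.
    static bool FilesHaveSameContent(string path1, string path2)
    {
        throw new NotImplementedException();
    }
}
```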

Aaron S

3 Answers


If you need to synchronize files, there's another thing you can compare: the file date (last-modified timestamp). If this differs, the file has most probably changed.
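
For example (a minimal sketch, with placeholder names; equal timestamps do not prove the contents are identical, so treat this as a heuristic only):

```csharp
using System.IO;

static class TimestampCheck
{
    // Differing last-write times suggest the file has changed.
    public static bool LastWriteTimesDiffer(string pathA, string pathB)
    {
        return File.GetLastWriteTimeUtc(pathA) != File.GetLastWriteTimeUtc(pathB);
    }
}
```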

Also, in the vast majority of cases a hash (I'd go for MD5 or SHA-1, not CRC, because of its limited value range and therefore relatively frequent collisions) will be sufficient. And if those hashes are equal, you should do a byte-by-byte compare. Sure, this is an additional step, but it's rarely needed, if at all.
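
A sketch of the hashing step on .NET 4, using SHA-1 from System.Security.Cryptography (the helper name is my own, not a framework API):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class FileHasher
{
    // Streams the file through SHA-1 and returns the hash as a hex string.
    public static string ComputeHash(string path)
    {
        using (var stream = File.OpenRead(path))
        using (var sha1 = new SHA1Managed())
        {
            byte[] hash = sha1.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```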

Actually, you should save the hashes on B so you don't need to recalculate them every time, but you must make sure that the files on B cannot be changed without their hashes being updated.
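
One possible sketch of such a cache (my own layout, a pipe-delimited text file, reusing the hypothetical `FileHasher.ComputeHash` helper from the previous snippet). The last-write time is stored alongside the hash, so an entry is recomputed if the file on B has changed behind the cache's back:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// One line per file in B: "relativePath|lastWriteTicks|hash".
class HashCache
{
    private readonly string _cacheFile;
    private readonly Dictionary<string, Tuple<long, string>> _entries =
        new Dictionary<string, Tuple<long, string>>(StringComparer.OrdinalIgnoreCase);

    public HashCache(string cacheFile)
    {
        _cacheFile = cacheFile;
        if (File.Exists(cacheFile))
        {
            foreach (var parts in File.ReadAllLines(cacheFile).Select(l => l.Split('|')))
                _entries[parts[0]] = Tuple.Create(long.Parse(parts[1]), parts[2]);
        }
    }

    public string GetHash(string rootB, string relativePath)
    {
        string fullPath = Path.Combine(rootB, relativePath);
        long ticks = File.GetLastWriteTimeUtc(fullPath).Ticks;

        Tuple<long, string> cached;
        if (_entries.TryGetValue(relativePath, out cached) && cached.Item1 == ticks)
            return cached.Item2;                            // cache hit, file unchanged

        string hash = FileHasher.ComputeHash(fullPath);     // recompute and remember
        _entries[relativePath] = Tuple.Create(ticks, hash);
        return hash;
    }

    public void Save()
    {
        File.WriteAllLines(_cacheFile,
            _entries.Select(e => e.Key + "|" + e.Value.Item1 + "|" + e.Value.Item2).ToArray());
    }
}
```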

JeffRSon

You already have a hash function here: your hash function maps file -> (filename, filesize). Also, since a directory can contain only one file with a given name, you are guaranteed at most one collision per file per run.

You're asking if you need a better one. Well, I don't know: is performance adequate with the hash function you already have? If it is, you don't need a better hash function.

zmbq

If you use only a hash code to compare two files, then if the hash codes differ you can be sure that the files are different.

But if the hash codes are the same, then you don't know for sure if the files are really the same.

If you use a 32-bit hash code then, for two different files, there is roughly a 1 in 2^32 chance that they will nevertheless produce the same hash code. For a 64-bit hash code, that chance drops to about 1 in 2^64.

Storing the hash codes for all the files on B will make initial comparing much faster, but you then have to decide what to do if two hash codes are the same. Do you take a chance and assume that they are both the same? Or do you go and do a byte-by-byte comparison after you discover two files with the same hash?

Note that if you do a byte-by-byte comparison after you have computed the hash code for a file, you'll end up accessing the file contents twice. This can make using hash codes slower if a significant proportion of the files are the same. As ever, you have to do some timings to see which is faster.

If you can live with the small chance that you'll falsely assume two files to be the same you can avoid the confirming comparison... but I wouldn't like to take that chance myself.

In summary, I would probably just do the comparison each time and not bother with the hashing (other than what you're already doing with comparing the filename and size).

Note that if you find that almost all files that match by filename and size are also identical, then using hashing will almost certainly slow things down.
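
If you do go with plain comparison each time, something along these lines works (a rough sketch; the buffer size is arbitrary, and it assumes you've already established that the file lengths match):

```csharp
using System.IO;

static class FileComparer
{
    // Returns true if the two files have identical contents.
    public static bool ContentsAreEqual(string path1, string path2)
    {
        const int bufferSize = 4096;
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        using (var stream1 = File.OpenRead(path1))
        using (var stream2 = File.OpenRead(path2))
        {
            while (true)
            {
                int read1 = stream1.Read(buffer1, 0, bufferSize);
                int read2 = stream2.Read(buffer2, 0, bufferSize);

                if (read1 != read2)
                    return false;       // shouldn't happen if the lengths matched
                if (read1 == 0)
                    return true;        // both streams exhausted, no difference found

                for (int i = 0; i < read1; i++)
                    if (buffer1[i] != buffer2[i])
                        return false;
            }
        }
    }
}
```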

Matthew Watson