
I want to compare two files in C# and see if they are different. They have the same file names, and they are exactly the same size even when their contents differ. I was just wondering if there is a fast way to do this without having to manually go in and read the files.

Thanks

Toz
  • Cheers guys, lots of good answers. I'll probably use byte-by-byte comparison. I'll explain my situation in more detail: I'm downloading files from a site every 5 mins and checking to see if the file is different from the previously downloaded file. It will be different once a day; when it is, I stop downloading the files. As the comparisons will be the same most of the time, I think byte-by-byte comparison will be best. Thanks again! – Toz Oct 28 '11 at 15:49
  • Lots of opinions on this one, Toz. Be sure to read the comments to make sure you're doing what's best for your use case. Good luck! – Random Oct 28 '11 at 16:03
  • Would have been helpful to know the use case earlier. Anyhow.. you might look into the ETag HTTP header. Let the web server do all the work. – Sam Axe Oct 28 '11 at 16:32
  • @Boo: On *those* points you are 100% correct. – jason Oct 28 '11 at 16:39
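Sam Axe's ETag suggestion could be sketched as follows; the URL is a placeholder, and this assumes the server actually emits an `ETag` header:

```csharp
using System;
using System.Net;

static class EtagCheck
{
    // Pure change-detection logic: a new, non-null tag counts as a change.
    public static bool IsChanged(string etag, ref string lastETag)
    {
        if (etag == null || etag == lastETag)
            return false;   // no ETag support, or unchanged
        lastETag = etag;
        return true;
    }

    // Issues a HEAD request so only headers travel over the wire;
    // download the body only when this returns true.
    public static bool ResourceChanged(string url, ref string lastETag)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return IsChanged(response.Headers["ETag"], ref lastETag);
        }
    }
}
```

This only works when the server supports ETags; if `ResourceChanged` always returns false, fall back to downloading and comparing the files.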

7 Answers


Depending on how far you want to take it, you can have a look at Diff.NET

Here's a simple file comparison function:

// This method accepts two strings that represent the paths of the two
// files to compare. It returns true if the contents of the files
// are the same, and false if they are not.
private bool FileCompare(string file1, string file2)
{
     int file1byte;
     int file2byte;
     FileStream fs1;
     FileStream fs2;

     // Determine if the same file was referenced two times.
     if (file1 == file2)
     {
          // Return true to indicate that the files are the same.
          return true;
     }

     // Open the two files.
     fs1 = new FileStream(file1, FileMode.Open, FileAccess.Read);
     fs2 = new FileStream(file2, FileMode.Open, FileAccess.Read);

     // Check the file sizes. If they are not the same, the files
     // are not the same.
     if (fs1.Length != fs2.Length)
     {
          // Close the file
          fs1.Close();
          fs2.Close();

          // Return false to indicate files are different
          return false;
     }

     // Read and compare a byte from each file until either a
     // non-matching set of bytes is found or until the end of
     // file1 is reached.
     do 
     {
          // Read one byte from each file.
          file1byte = fs1.ReadByte();
          file2byte = fs2.ReadByte();
     }
     while ((file1byte == file2byte) && (file1byte != -1));

     // Close the files.
     fs1.Close();
     fs2.Close();

     // Return the success of the comparison. "file1byte" is 
     // equal to "file2byte" at this point only if the files are 
     // the same.
     return ((file1byte - file2byte) == 0);
}
KeatsPeeks
James Johnson
  • This would not work if the file is changed by only one character. The stream length would be the same, but the content is not the same. This is not valid! – Daniel Peñalba Oct 28 '11 at 15:31
  • This was actually pulled from Microsoft's website. It does an equality comparison, a length comparison, and a byte-by-byte comparison. I think you might be wrong about this one. – James Johnson Oct 28 '11 at 15:40
  • Sorry, I didn't want to sound rude :-) The question said "They have the same file names and they are the exact same size when different". So this approach is error-prone. We develop a version control system, and this kind of implementation could cause a disaster in the SCM database. – Daniel Peñalba Oct 28 '11 at 15:41
  • @Daniel: sorry to resurrect an old post, but would you care to explain what the actual problem with the code is? It seems OK to me: it checks the file size and then goes byte by byte. How could a single-character difference escape the check? Gracias! – argatxa Sep 03 '12 at 10:58
  • I would say the code is correct. – Stabledog Sep 12 '13 at 05:30
  • The code is correct, but you have to scroll down to see the bit that does the byte-by-byte comparison. I assume Daniel didn't scroll down. – Dave Knight Dec 02 '13 at 10:32
  • On this damned OS X you don't get to see the whole code **AND** there is no scrollbar that suggests that there's more code. – Andrei Rînea Apr 03 '15 at 16:43
  • Any reason to use `(file1byte - file2byte) == 0` instead of just `file1byte == file2byte`? – Juan Jan 28 '18 at 18:34

I was just wondering if there is a fast way to do this without having to manually go in and read the file.

Not really.

If the files came with hashes, you could compare those: if the hashes are different, you can conclude the files are different (the same hashes, however, do not mean the files are the same, so you would still have to do a byte-by-byte comparison).

However, hashes use all the bytes in the file, so no matter what, you at some point have to read the files byte for byte. And in fact, just a straight byte by byte comparison will be faster than computing a hash. This is because a hash reads all the bytes just like comparing byte-by-byte does, but hashes do some other computations that add time. Additionally, a byte-by-byte comparison can terminate early on the first pair of non-equal bytes.

Finally, you cannot avoid the need for a byte-by-byte read: if the hashes are equal, that still doesn't prove the files are equal, so in that case you have to compare byte-by-byte anyway.
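To illustrate the early-termination point, here is a sketch of a buffered byte-by-byte comparison that avoids per-byte `ReadByte()` calls and stops at the first difference (the class name and buffer size are illustrative):

```csharp
using System;
using System.IO;

static class FileComparer
{
    // Reads both files in 4 KB chunks and stops at the first mismatch.
    public static bool ContentsEqual(string path1, string path2)
    {
        using (var fs1 = new FileStream(path1, FileMode.Open, FileAccess.Read))
        using (var fs2 = new FileStream(path2, FileMode.Open, FileAccess.Read))
        {
            if (fs1.Length != fs2.Length)
                return false;  // different sizes can never match

            var buf1 = new byte[4096];
            var buf2 = new byte[4096];

            while (true)
            {
                int n1 = ReadFull(fs1, buf1);
                int n2 = ReadFull(fs2, buf2);
                if (n1 != n2)
                    return false;
                if (n1 == 0)
                    return true;   // both streams exhausted, no mismatch found

                for (int i = 0; i < n1; i++)
                    if (buf1[i] != buf2[i])
                        return false;  // early exit on first differing byte
            }
        }
    }

    // Stream.Read may return fewer bytes than requested, so loop until
    // the buffer is full or the stream ends.
    private static int ReadFull(Stream s, byte[] buf)
    {
        int total = 0, n;
        while (total < buf.Length && (n = s.Read(buf, total, buf.Length - total)) > 0)
            total += n;
        return total;
    }
}
```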

jason
  • Could you explain why you would need to compare byte for byte if the hashes are the same? Why would the hashes be the same if the data is different? – scottm Oct 28 '11 at 15:45
  • If you have same hashes you can be quite certain that the files are the same. You're right that you need to compare the files byte by byte to be absolutely sure (and especially if your security depends on this). But some systems like git rely on the fact that two different files with the same hash won't appear inside the system. Of course, this all assumes a good hash, not something like `GetHashCode()`. – svick Oct 28 '11 at 15:46
  • @scottm: Because unequal files can have equal hashes. This is the pigeonhole principle. Let's say we're using md5. md5 produces a 128-bit hash of the file. Therefore, there are 2^128 different hashes. There are way more than 2^128 different files. Therefore, since we are mapping a space with more than 2^128 different values to a space with 2^128 values, there must be collisions. Hashes are not unique signatures. – jason Oct 28 '11 at 15:48
  • @Downvoters: Three downvotes? Wow. – jason Oct 28 '11 at 15:51
  • I didn't downvote this one, I don't know who's doing all that, but I have to disagree about the hashing. It does have a valid use if the hashes are persisted for comparison in the future. I've done this with image files, and the speed is amazing. – Random Oct 28 '11 at 15:53
  • @Random: And I mentioned that you can use hashes to conclude files are unequal when the hashes are unequal, and that it's advantageous when they are precomputed. If they are not precomputed, they are not faster, and if they are equal, we still have to go byte-by-byte. – jason Oct 28 '11 at 15:55
  • @svick: Sure git relies on that ASSUMPTION, but it is prone to a hash collision. It IS a problem for git. – jason Oct 28 '11 at 16:13
  • @Jason, I think it is a potential problem, but it's not a problem in practice. If you know you won't have attackers trying to break the hash, you don't have to worry about collisions. – svick Oct 28 '11 at 16:19
  • @svick: But that's *exactly* one of the purposes for git using SHA-1! It's to detect corruptions to the repository, possibly by malicious attackers. – jason Oct 28 '11 at 16:22

Well, I'm not sure if you can rely on the file write timestamps. If not, your only alternative is comparing the contents of the files.

A simple approach is comparing the files byte by byte, but if you're going to compare a file several times with others, you can calculate the hash of each file and compare the hashes.

The following code snippet shows how you can do it:

    // Requires: using System;
    //           using System.IO;
    //           using System.Security.Cryptography;
    public static string CalcHashCode(string filename)
    {
        FileStream stream = new FileStream(
            filename,
            System.IO.FileMode.Open,
            System.IO.FileAccess.Read,
            System.IO.FileShare.ReadWrite);

        try
        {
            return CalcHashCode(stream);
        }
        finally
        {
            stream.Close();
        }
    }

    public static string CalcHashCode(FileStream file)
    {
        MD5CryptoServiceProvider md5Provider = new MD5CryptoServiceProvider();
        Byte[] hash = md5Provider.ComputeHash(file);
        return Convert.ToBase64String(hash);
    }

If you're going to compare a file with others more than once, you can save the file hash and compare that. For a single comparison, the byte-by-byte comparison is better. You also need to recompute the hash when the file changes, but if you're going to do massive comparisons (more than one), I recommend the hash approach.
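The caching idea could be sketched like this (the in-memory dictionary and invalidation policy are illustrative, not part of the original answer; a real system would persist the hashes):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class HashCache
{
    // Maps a file path to its last computed hash.
    private readonly Dictionary<string, string> cache =
        new Dictionary<string, string>();

    public string GetHash(string path)
    {
        string hash;
        if (!cache.TryGetValue(path, out hash))
        {
            using (var md5 = MD5.Create())
            using (var stream = File.OpenRead(path))
            {
                hash = Convert.ToBase64String(md5.ComputeHash(stream));
            }
            cache[path] = hash;   // compute once, reuse for later comparisons
        }
        return hash;
    }

    // Call this when the file is known to have changed,
    // so the next GetHash recomputes it.
    public void Invalidate(string path)
    {
        cache.Remove(path);
    }
}
```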

Daniel Peñalba
  • No! You **STILL** have to compare byte by byte if the hashes are equal. And if the hashes are unequal, it's faster to just do byte-by-byte because it can terminate early on the first non-equal pair of bytes but hash has to go the whole way through the file! Argh! – jason Oct 28 '11 at 15:41
  • For the record, two separate files, both with the same file name and length, would almost certainly have two different timestamps. If nothing else, it would take at least a couple of milliseconds to write the second copy. – AllenG Oct 28 '11 at 15:42
  • If you have both files available, I think that calculating a hash for both will be actually slower than comparing them directly. – svick Oct 28 '11 at 15:42
  • @svick: Yes. The byte-by-byte can terminate early, hashes still read all of the contents just like byte-by-byte might, and if the hashes are equal, we have to go byte-by-byte anyway. – jason Oct 28 '11 at 15:45
  • @Jason: The idea is to store the file hash somewhere and take advantage of it. Yes, of course, doing a byte-by-byte comparison is better if you're only going to compare one time. – Daniel Peñalba Oct 28 '11 at 15:46
  • @Daniel Peñalba: But you STILL have to go byte-by-byte when the hashes are equal to be 100% certain the files are equal. – jason Oct 28 '11 at 15:50
  • @Jason: This is only a performance discussion. In our case, we compare a disk tree against a remote disk tree. We have precalculated hashes both on disk and remote, so our problem is only comparing a pair of hashes per file. And we also recalculate the hash when the file changes. Without doubt, this is the best way to do it. – Daniel Peñalba Oct 28 '11 at 15:53
  • Sounds like we're talking about whether the hash is a *perfect hash* or not. The MD5 hashing algorithm is *not* perfect, so Jason is right - there is the possibility of hash collision. – Ben Nov 15 '13 at 17:19

If the filenames are the same, and the file sizes are the same, then, no, there is no way to know if they have different content without examining the content.

AllenG

Read the file into a stream, then hash the stream. That should give you a reliable result for comparing.

byte[] fileHash1, fileHash2;

// streamforfile1/streamforfile2 are assumed to be open, readable streams.
using (SHA256Managed sha = new SHA256Managed())
{
    fileHash1 = sha.ComputeHash(streamforfile1);
    fileHash2 = sha.ComputeHash(streamforfile2);
}

for (int i = 0; (i < fileHash1.Length) && (i < fileHash2.Length); i++)
{
    if (fileHash1[i] != fileHash2[i])
    {
        //files are not the same
        break;
    }
}
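As a side note, the two hash arrays can also be compared with LINQ's `SequenceEqual`, which handles the length check and the element loop in one call; a small self-contained sketch:

```csharp
using System;
using System.Linq;

class HashCompare
{
    static void Main()
    {
        byte[] fileHash1 = { 0x01, 0x02, 0x03 };
        byte[] fileHash2 = { 0x01, 0x02, 0x03 };

        // SequenceEqual compares lengths and elements in a single call.
        bool same = fileHash1.SequenceEqual(fileHash2);
        Console.WriteLine(same); // prints True
    }
}
```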
Random

If they are not compiled files, then use a diff tool like KDiff3 or WinMerge. It will highlight where they are different.

http://kdiff3.sourceforge.net/

http://winmerge.org/

Jamie
  • The question is about how to programmatically compare two files in .NET. The asker is writing code in C# and needs the program he's writing to compare two files. He's probably not interested in shelling out or a GUI tool. – binki Jan 29 '16 at 05:50

Pass each file stream through an MD5 hasher and compare the hashes.
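A minimal sketch of this suggestion, assuming both files are accessible by path (the class and method names are placeholders), might look like:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class Md5Compare
{
    // Hashes both files with MD5 and compares the digests.
    // Note: equal digests make equality overwhelmingly likely, but, as the
    // comments below discuss, they are not an absolute proof.
    public static bool SameHash(string path1, string path2)
    {
        using (var md5 = MD5.Create())
        using (var s1 = File.OpenRead(path1))
        using (var s2 = File.OpenRead(path2))
        {
            byte[] h1 = md5.ComputeHash(s1);
            byte[] h2 = md5.ComputeHash(s2);  // ComputeHash resets the algorithm, so reuse is safe
            return h1.SequenceEqual(h2);
        }
    }
}
```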

Sam Axe
  • This is not faster than just comparing byte by byte and you still have to go byte by byte when the hashes are equal! – jason Oct 28 '11 at 15:25
  • It's less work. And the OP expressed a desire to avoid doing the byte comparison themselves. – Sam Axe Oct 28 '11 at 15:27
  • But if the hashes are equal, you STILL have to manually read the files and compare byte by byte to conclude they actually are equal. It is NOT less work. You cannot obviate the need for a byte-by-byte comparison. – jason Oct 28 '11 at 15:29
  • Less programming work. CPUs aren't sentient (yet), so who cares if one has to do any extra work. Modern CPUs are quick enough that you won't notice the extra work unless you're doing a lot of comparisons in a short amount of time. But the OP didn't indicate that was the case. – Sam Axe Oct 28 '11 at 15:33
  • You're not paying attention: you **STILL** have to do the byte by byte comparison if the hashes are equal. Using hashes is not less work, it is **MORE** work because you have to write the byte-by-byte comparison AND code to use the hashing algorithm, and the logic to use byte-by-byte when the hashes are equal. – jason Oct 28 '11 at 15:42
  • No. There is NO reason to do a byte-by-byte comparison by hand if the hashes are equal. Equal hashes (to within statistical probabilities) mean that the files are the same. – Sam Axe Oct 28 '11 at 15:47
  • No, it means they have the same hash. It does NOT "mean that the files are the same." – jason Oct 28 '11 at 15:51
  • Hex codes `d131dd02c5e6eec4693d9a0698aff95c 2fcab58712467eab4004583eb8fb7f89 55ad340609f4b30283e488832571415a 085125e8f7cdc99fd91dbdf280373c5b d8823e3156348f5bae6dacd436c919c6 dd53e2b487da03fd02396306d248cda0 e99f33420f577ee8ce54b67080a80d1e c69821bcb6a8839396f9652b6ff72a70` and `d131dd02c5e6eec4693d9a0698aff95c 2fcab50712467eab4004583eb8fb7f89 55ad340609f4b30283e4888325f1415a 085125e8f7cdc99fd91dbd7280373c5b d8823e3156348f5bae6dacd436c919c6 dd53e23487da03fd02396306d248cda0 e99f33420f577ee8ce54b67080280d1e c69821bcb6a8839396f965ab6ff72a70` have the same md5 hash. They are not equal. – jason Oct 28 '11 at 15:53
  • @Jason, there's also the problem if the files are NOT of equal size. I know the question says the files are, but assuming they may not be, we can eliminate checking that. Computed hashes will be of equal size. – Random Oct 28 '11 at 15:59
  • @L.B: The same problem applies to ANY hashing algorithm. ANY. Hashes take a large space and collapse it to a small space. EVERY hashing algorithm will have collisions, and lots of them. – jason Oct 28 '11 at 16:00
  • @Random: I don't understand what you're saying. – jason Oct 28 '11 at 16:01
  • @Jason, I know, you are right in theory, but most cryptographic applications count on the "uniqueness" of modern hash algorithms. I would do the same for file comparison. – L.B Oct 28 '11 at 16:10
  • @L.B: Comparing two files for differences isn't a cryptographic application. We are not trying to check if two files are *probably* equal, but rather if they *are* equal. – jason Oct 28 '11 at 16:15
  • @Jason: what part of "to within statistical probabilities" is confusing? Of course there are collisions. The likelihood of a collision in a real-world situation is vanishingly small. This isn't a lab. – Sam Axe Oct 28 '11 at 16:24
  • @Boo: Astounding. The OP wants to know if the files are the same, not if they are probably the same. – jason Oct 28 '11 at 16:33