4

I want to recurse through several directories and find duplicate files across any number of directories.

My knee-jerk idea is to have a global hashtable or some other data structure to hold each file I find, then check each subsequent file against that "master" list. I don't think this would be very efficient, though, and the "there's got to be a better way!" feeling keeps ringing in my brain.
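
Roughly what I have in mind, as a sketch only (keying the table on an MD5 content hash and the single hard-coded root directory are just placeholders, not a finished design):

// assumes: using System; using System.Collections.Generic; using System.IO;
// "Master" table: content hash -> every path seen so far with that hash.
var seen = new Dictionary<string, List<string>>();

using (var md5 = System.Security.Cryptography.MD5.Create())
{
    foreach (string path in Directory.EnumerateFiles(@"C:\someRoot", "*", SearchOption.AllDirectories))
    {
        string key;
        using (var stream = File.OpenRead(path))
            key = BitConverter.ToString(md5.ComputeHash(stream));

        List<string> matches;
        if (seen.TryGetValue(key, out matches))
            matches.Add(path);                      // already saw a file with this content hash
        else
            seen[key] = new List<string> { path };  // first time this content is seen
    }
}

This hashes every single file, which is exactly the part that feels wasteful.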

Any advice on a better way to handle this situation would be appreciated.

Nate222
  • sounds pretty efficient to me. – tster May 11 '10 at 22:48
  • To what degree must you look to find duplicate files (name, name/size, name/size/content, content regardless of name)? Is it expected that there will be many duplicate files or will that be the exception? How many files will typically be processed? – Ragoczy May 11 '10 at 22:52
  • I needed a straight name compare and most likely a byte-by-byte comparison (a user-selected method, clearly indicating that the byte comparison will be slower). Also, yes, there are going to be thousands of duplicates. :( – Nate222 May 11 '10 at 23:01
  • for future readers: http://stackoverflow.com/questions/1358510/how-to-compare-2-files-fast-using-net/2637350#2637350 seems to suggest that a byte-by-byte comparison may be faster than hashing – BKSpurgeon Feb 13 '16 at 00:25
  • you could use LINQ: https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/how-to-query-for-duplicate-files-in-a-directory-tree-linq **Updated link** – Richard May 11 '10 at 22:55

5 Answers

15

You could avoid hashing by first comparing file sizes. If you never find files with the same size, you don't have to hash them at all. You only hash a file once you find another file with the same size, and then you hash them both.

That should be significantly faster than blindly hashing every single file, although that two-tiered check is a bit more complicated to implement.
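
For illustration, a minimal two-pass sketch of that idea (the roots array, the variable names, and the choice of MD5 are my own placeholders, not part of the answer):

// assumes: using System; using System.Collections.Generic; using System.IO; using System.Linq;
string[] roots = { @"C:\dirA", @"C:\dirB" };        // placeholder directories to scan

// Pass 1: group every file by its length. No hashing yet.
var bySize = new Dictionary<long, List<string>>();
foreach (string root in roots)
foreach (string path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
{
    long size = new FileInfo(path).Length;
    List<string> sameSize;
    if (!bySize.TryGetValue(size, out sameSize))
        bySize[size] = sameSize = new List<string>();
    sameSize.Add(path);
}

// Pass 2: hash only the files that share a size with at least one other file.
var byHash = new Dictionary<string, List<string>>();
using (var md5 = System.Security.Cryptography.MD5.Create())
{
    foreach (var group in bySize.Values.Where(g => g.Count > 1))
    foreach (string path in group)
    {
        string hash;
        using (var stream = File.OpenRead(path))
            hash = BitConverter.ToString(md5.ComputeHash(stream));

        List<string> dupes;
        if (!byHash.TryGetValue(hash, out dupes))
            byHash[hash] = dupes = new List<string>();
        dupes.Add(path);
    }
}
// Every byHash bucket with more than one entry is a set of probable duplicates.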

John Kugelman
  • Honestly, using proper encapsulation and class design, that wouldn't add very much complexity, I think. – tster May 11 '10 at 23:22
3

I'd suggest keeping multiple in-memory indexes of files.

Create one that indexes all files by file length:

Dictionary<long, List<FileInfo>> IndexBySize;

When you're processing a new file Fu, it's a quick lookup to find all other files that are the same size.

Create another that indexes all files by modification timestamp:

Dictionary<DateTime, List<FileInfo>> IndexByModification;

Given file Fu, you can find all files modified at the same time.

Repeat for each significant file characteristic. You can then use the Intersect() extension method to compare multiple criteria efficiently.

For example:

var matchingFiles
    = IndexBySize[fu.Length].Intersect(IndexByModification[fu.LastWriteTime]);

This would allow you to avoid the byte-by-byte scan until you need to. Then, for files that have been hashed, create another index:

Dictionary<string, List<FileInfo>> IndexByHash;

You might want to calculate multiple hashes at the same time to reduce collisions.
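
As a rough illustration of how these indexes might be populated and queried together (the FileIndex wrapper class and its member names are my own invention, not part of the answer; note that Intersect() compares FileInfo references, so the same instances must be added to every index):

// assumes: using System; using System.Collections.Generic; using System.IO; using System.Linq;
class FileIndex
{
    private readonly Dictionary<long, List<FileInfo>> indexBySize = new Dictionary<long, List<FileInfo>>();
    private readonly Dictionary<DateTime, List<FileInfo>> indexByModification = new Dictionary<DateTime, List<FileInfo>>();

    public void Add(FileInfo file)
    {
        AddTo(indexBySize, file.Length, file);
        AddTo(indexByModification, file.LastWriteTime, file);
    }

    public IEnumerable<FileInfo> FindCandidates(FileInfo fu)
    {
        List<FileInfo> sameSize, sameTime;
        if (!indexBySize.TryGetValue(fu.Length, out sameSize) ||
            !indexByModification.TryGetValue(fu.LastWriteTime, out sameTime))
            return Enumerable.Empty<FileInfo>();

        // Files matching fu on both size and modification time.
        return sameSize.Intersect(sameTime).Where(f => f.FullName != fu.FullName);
    }

    private static void AddTo<TKey>(Dictionary<TKey, List<FileInfo>> index, TKey key, FileInfo file)
    {
        List<FileInfo> bucket;
        if (!index.TryGetValue(key, out bucket))
            index[key] = bucket = new List<FileInfo>();
        bucket.Add(file);
    }
}

Only the candidates returned by FindCandidates would then need hashing or a byte-by-byte comparison.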

Bevan
  • thank you. Two questions: (1) you declare a dictionary at the top of your answer: is IndexBySize the name of the dictionary (i.e. reference) because I cannot quite follow what is being done here. (2) secondly you state: "You might want to calculate multiple hashes at the same time to reduce collisions" --> I don't follow - can you please elaborate on that point? – BKSpurgeon Feb 13 '16 at 11:28
  • Yes, `IndexBySize` is the name of the first dictionary - it allows you to find all the other files you've already seen with a particular size. `IndexByModification` is the name of the second dictionary - allowing you to find files already seen based on their modification timestamp. Both are shortcuts to find potential duplicates for the file currently being considered. – Bevan Feb 15 '16 at 01:28
  • The quick to calculate hashing functions are also likely to have collisions - two files that are *not* the same that *do* have the same hash. There are two ways to address this problem - use a hash function like SHA-256 that is extremely unlikely to give false matches, or use multiple different (but independent) quick hashing functions. – Bevan Feb 15 '16 at 01:31
2

Your approach sounds sane to me. Unless you have a very good reason to believe it will not meet your performance requirements, I'd simply implement it this way and optimize it later if necessary. Remember that "premature optimization is the root of all evil".

Adrian Grigore
1

The best practice, as John Kugelman said, is to first compare the sizes of two files; if they have different sizes, it's obvious that they are not duplicates.

If you find two files with the same size, then for better performance you can compare just the first 500 KB of each file; only if those first 500 KB match do you compare the remaining bytes. This way you don't have to read every byte of a (for example) 500 MB file to hash it, so you save time and boost performance.
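
A sketch of that staged comparison, generalised to a chunked byte-by-byte compare that stops at the first differing block, so a mismatch early in the file (e.g. within the first few hundred KB) ends the comparison without reading the rest (the FileComparer class, method name, and 64 KB buffer size are illustrative assumptions):

// assumes: using System.IO;
static class FileComparer
{
    // Byte-by-byte comparison of two files already known to have the same length.
    public static bool ContentsEqual(string pathA, string pathB)
    {
        const int BufferSize = 64 * 1024;
        byte[] bufA = new byte[BufferSize];
        byte[] bufB = new byte[BufferSize];

        using (FileStream a = File.OpenRead(pathA))
        using (FileStream b = File.OpenRead(pathB))
        {
            while (true)
            {
                // For local files, FileStream.Read fills the buffer except at end of file.
                int readA = a.Read(bufA, 0, BufferSize);
                int readB = b.Read(bufB, 0, BufferSize);

                if (readA != readB)
                    return false;       // lengths differ after all
                if (readA == 0)
                    return true;        // both files exhausted without a mismatch

                for (int i = 0; i < readA; i++)
                    if (bufA[i] != bufB[i])
                        return false;   // first mismatch ends the comparison early
            }
        }
    }
}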

farhad
0

For a byte-by-byte comparison where you're expecting many duplicates, you're likely best off with the method you're already looking at.

If you're really concerned about efficiency and know that duplicates will always have the same filename, then you could start by comparing filenames alone and only hash file contents when you find a duplicate name. That way you'd save the time of hashing files that have no duplicate anywhere in the tree.
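
A possible sketch of that name-first filter (the roots array and the case-insensitive name comparison are assumptions on my part):

// assumes: using System.IO; using System.Linq;
string[] roots = { @"C:\dirA", @"C:\dirB" };   // placeholder directories

var byName = roots
    .SelectMany(root => Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
    .GroupBy(path => Path.GetFileName(path), StringComparer.OrdinalIgnoreCase);

foreach (var group in byName.Where(g => g.Count() > 1))
{
    // Only these files share a name; hash or byte-compare just this group.
}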

Ragoczy