
We have a 150 GB data folder. The files inside it are of any format (doc, jpg, png, txt, etc.). We need to compare the content of every file against every other file and, if duplicates are found, print the list of file paths. To do that, I first stored all the files in an ArrayList<File> and then compared pairs with FileUtils.contentEquals(file1, file2). It works for a small folder, but for this 150 GB data folder it never shows any result. I think storing all the files in an ArrayList first may be the problem, perhaps a JVM heap problem, but I am not sure.

Does anyone have better advice and sample code for handling this amount of data? Please help me.

Lehue
Mostafizur

2 Answers


Calculate the MD5 hash of each file and store it in a HashMap, with the MD5 hash as the key and the file path as the value. When you add a new file to the HashMap, you can easily check whether a file with that MD5 hash already exists.

The chance of a false match is very small, but if you want you can use FileUtils.contentEquals to confirm the match.

e.g.:

void findMatchingFiles(List<String> filepaths)
{
    HashMap<String, String> hashmap = new HashMap<String, String>();
    for (String filepath : filepaths)
    {
        String md5 = getFileMD5(filepath); // see linked answer
        if (hashmap.containsKey(md5))
        {
            String original = hashmap.get(md5);
            String duplicate = filepath;

            // found a match between original and duplicate
        }
        else
        {
            hashmap.put(md5, filepath);
        }
    }
}
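The getFileMD5 helper above is left to a linked answer. One possible sketch (my own, not the linked one) uses java.security.MessageDigest and feeds the file through in fixed-size chunks, so memory use stays constant no matter how large the file is:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileMD5 {
    // Streams the file through MessageDigest in 8 KB chunks,
    // so a 150 GB file never has to fit in memory.
    static String getFileMD5(String filepath)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new FileInputStream(filepath)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        // Convert the 16-byte digest to a 32-character hex string.
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```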

If there are multiple identical files, this will match each of them against the first one, but it will not match all of them against each other. If you want the latter, store a map from the MD5 string to a list of file paths, instead of just to the first one.
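A sketch of that variant (the names groupByMD5 and duplicateSets are mine, and the hashes are passed in precomputed so the grouping logic stands on its own):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateGrouper {
    // Groups file paths by their (precomputed) MD5 hash. Any list
    // with more than one entry is a set of identical files.
    static Map<String, List<String>> groupByMD5(Map<String, String> md5ByPath) {
        Map<String, List<String>> groups = new HashMap<>();
        for (Map.Entry<String, String> e : md5ByPath.entrySet()) {
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                  .add(e.getKey());
        }
        return groups;
    }

    // Keeps only the groups that actually contain duplicates.
    static List<List<String>> duplicateSets(Map<String, String> md5ByPath) {
        List<List<String>> dups = new ArrayList<>();
        for (List<String> paths : groupByMD5(md5ByPath).values()) {
            if (paths.size() > 1) {
                dups.add(paths);
            }
        }
        return dups;
    }
}
```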

samgak

Use a hash table (e.g. Java's HashMap) and store the MD5 hash of each file's contents as the key and the file path as the value. An MD5 hash is 16 bytes regardless of the content size, so it doesn't matter whether your files are 150 GB each or even larger. When you encounter a new file, calculate its MD5 hash and check whether it is already in the table. Lookup and insertion in a hash table are amortized O(1). Besides, MD5 has a very low chance of collision, so to avoid false positives you can compare the file contents in case of a match.
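The fixed digest size is easy to check with java.security.MessageDigest: MD5 always emits 16 bytes, whether it hashed a few bytes or gigabytes. A small demonstration (my own, not part of either answer):

```java
import java.security.MessageDigest;

public class DigestSizeDemo {
    // Returns the length of the MD5 digest for the given input.
    // MD5 always produces a 16-byte digest, regardless of input size,
    // which is why the HashMap key stays small for huge files.
    static int md5DigestLength(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        return md.digest(data).length;
    }
}
```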

Note: I didn't notice while writing that @samgak had already given an elaborate answer. You can use the code snippet in his answer :)

Kaidul