
I want to compute the MD5 of many different files. I am following this answer to do that, but the main problem is that computing the MD5 of the files (there may be hundreds of them) takes a lot of time.

Is there any way to find the MD5 of a file without consuming much time?

Note: The size of a file may be large (up to 300 MB).

This is the code I am using:

import java.io.*;
import java.security.MessageDigest;

public class MD5Checksum {

   public static byte[] createChecksum(String filename) throws Exception {
       MessageDigest complete = MessageDigest.getInstance("MD5");
       byte[] buffer = new byte[1024];
       int numRead;

       // try-with-resources makes sure the stream is also closed if read() throws
       try (InputStream fis = new FileInputStream(filename)) {
           while ((numRead = fis.read(buffer)) != -1) {
               complete.update(buffer, 0, numRead);
           }
       }
       return complete.digest();
   }

   // see this How-to for a faster way to convert
   // a byte array to a HEX string
   public static String getMD5Checksum(String filename) throws Exception {
       byte[] b = createChecksum(filename);
       StringBuilder result = new StringBuilder();

       for (int i = 0; i < b.length; i++) {
           result.append(Integer.toString((b[i] & 0xff) + 0x100, 16).substring(1));
       }
       return result.toString();
   }

   public static void main(String args[]) {
       try {
           System.out.println(getMD5Checksum("apache-tomcat-5.5.17.exe"));
           // output :
           //  0bb2827c5eacf570b6064e24e0e6653b
           // ref :
           //  http://www.apache.org/dist/
           //          tomcat/tomcat-5/v5.5.17/bin
           //              /apache-tomcat-5.5.17.exe.MD5
           //  0bb2827c5eacf570b6064e24e0e6653b *apache-tomcat-5.5.17.exe
       }
       catch (Exception e) {
           e.printStackTrace();
       }
   }
}
Rahulrr2602

2 Answers


You cannot use hashes to determine any similarity of content.
For instance, generating the MD5 of hellostackoverflow1 and hellostackoverflow2 calculates two hashes where none of the characters of the string representation match (7c35[...]85fa vs b283[...]3d19). That's because a hash is calculated based on the binary data of the file, thus two different formats of the same thing - e.g. .txt and a .docx of the same text - have different hashes.
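A quick way to see this for yourself is to hash the two example strings and print the digests (the class name HashDemo below is just for illustration):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HashDemo {
    public static void main(String[] args) throws Exception {
        // Hash two almost identical inputs; the resulting digests
        // share no recognizable similarity.
        for (String s : new String[] {"hellostackoverflow1", "hellostackoverflow2"}) {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            System.out.println(s + " -> " + hex);
        }
    }
}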

But as already noted, some speed might be achieved by using native code, i.e. the NDK. Additionally, if you still want to compare files for exact matches, first compare the size in bytes, and only after that use a hashing algorithm with enough speed and a low risk of collisions. As stated, CRC32 is fine.
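A rough sketch of that approach - compare sizes first, then fall back to a fast checksum - could look like the following. The class and method names (FastCompare, probablySame) and the 64 KB buffer are made up for this example, not a measured recommendation:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class FastCompare {

    // CRC32 of a file, read in 64 KB chunks.
    public static long crc32(File file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = new FileInputStream(file)) {
            int numRead;
            while ((numRead = in.read(buffer)) != -1) {
                crc.update(buffer, 0, numRead);
            }
        }
        return crc.getValue();
    }

    // Cheap size check first; only checksum when the sizes match.
    public static boolean probablySame(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false;
        }
        return crc32(a) == crc32(b);
    }

    public static void main(String[] args) throws IOException {
        // File names here are placeholders.
        System.out.println(probablySame(new File("a.bin"), new File("b.bin")));
    }
}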

Ch4t4r

Hash/CRC calculation takes some time as the file has to be read completely.

The createChecksum code you presented is nearly optimal. The only part that can be tweaked is the read buffer size (I would use a buffer of 2048 bytes or larger). However, this may get you at most a 1-2% speed improvement.
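For illustration only, a variant of createChecksum with a larger buffer might look like this; the class name BufferedMD5, the 64 KB size, and the use of DigestInputStream are assumptions to keep the sketch short, not a measured recommendation:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class BufferedMD5 {

    // Same result as createChecksum above, but with a larger read buffer;
    // DigestInputStream updates the digest as the data is read.
    public static byte[] createChecksum(String filename) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[64 * 1024]; // example size; tune and measure
        try (InputStream in = new DigestInputStream(new FileInputStream(filename), md)) {
            while (in.read(buffer) != -1) {
                // nothing to do here; reading drives the digest
            }
        }
        return md.digest();
    }
}

Since the hashing is usually dominated by disk I/O, the buffer size has limited effect, which matches the 1-2% estimate above.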

If this is still too slow, the only option left is to implement the hashing in C/C++ and use it as a native method. Besides that, there is nothing you can do.

Robert
  • Thank you very much for the answer. Can you please provide an example of how to do that, as I am not very familiar with C/C++? Also, is it fine to use a `crc32` checksum to check whether two files are the same or not? – Rahulrr2602 Jan 13 '18 at 12:26
  • For checking whether two files are the same you may use crc32. By the way, are you checking if the file sizes match before calculating the hashsum? – Ch4t4r Jan 13 '18 at 12:31
  • @Rahulrr2602: Whether to use MD5 or CRC32 is up to you. It depends on your requirements: how likely a collision may occur and what the consequences are. See [this question](https://stackoverflow.com/questions/14210298/probability-of-collision-when-using-a-32-bit-hash) for details. Presenting a native implementation is out of scope if you don't have C experience. Maybe there is an existing library available for Android, but I don't know of one. – Robert Jan 13 '18 at 12:44
  • @Ch4t4r Thanks, but I am not checking the size of the file before computing the MD5. The reason is that I want to check the similarity of the files based on content and not size. Is it possible for two different files to have the same content but be of different formats and hence have different sizes? – Rahulrr2602 Jan 13 '18 at 13:00
  • You cannot use hashes to determine any similarity of content. For instance, generating the MD5 of `hellostackoverflow1` and `hellostackoverflow2` calculates two hashes where none of the characters of the string representation match (7c35[...]85fa vs b283[...]3d19). That's because a hash is calculated based on the binary data of the file, thus two different formats of the same thing - e.g. a .txt and a .docx of the same text - have different hashes. – Ch4t4r Jan 14 '18 at 09:47
  • @Ch4t4r Thank you very much, that really solved my problem. Can you please post an answer so that I can accept it? – Rahulrr2602 Jan 14 '18 at 16:37