
I want to compute the MD5 of many different files. I am following this answer to do that, but the main problem is that computing the MD5 of the files (there may be hundreds of them) takes a lot of time.

Is there any way to find the MD5 of a file without consuming much time?

Note: The size of a file may be large (up to 300 MB).

This is the code I am using:

import java.io.*;
import java.security.MessageDigest;

public class MD5Checksum {

   public static byte[] createChecksum(String filename) throws Exception {
       MessageDigest complete = MessageDigest.getInstance("MD5");
       byte[] buffer = new byte[1024];
       int numRead;

       // try-with-resources makes sure the stream is also closed if read() throws
       try (InputStream fis = new FileInputStream(filename)) {
           while ((numRead = fis.read(buffer)) != -1) {
               complete.update(buffer, 0, numRead);
           }
       }
       return complete.digest();
   }

   // see this How-to for a faster way to convert
   // a byte array to a HEX string
   public static String getMD5Checksum(String filename) throws Exception {
       byte[] b = createChecksum(filename);
       StringBuilder result = new StringBuilder();

       for (int i = 0; i < b.length; i++) {
           result.append(Integer.toString((b[i] & 0xff) + 0x100, 16).substring(1));
       }
       return result.toString();
   }

   public static void main(String args[]) {
       try {
           System.out.println(getMD5Checksum("apache-tomcat-5.5.17.exe"));
           // output :
           //  0bb2827c5eacf570b6064e24e0e6653b
           // ref :
           //  http://www.apache.org/dist/
           //          tomcat/tomcat-5/v5.5.17/bin
           //              /apache-tomcat-5.5.17.exe.MD5
           //  0bb2827c5eacf570b6064e24e0e6653b *apache-tomcat-5.5.17.exe
       }
       catch (Exception e) {
           e.printStackTrace();
       }
   }
}
Rahulrr2602

2 Answers


You cannot use hashes to determine any similarity of content.
For instance, generating the MD5 of hellostackoverflow1 and hellostackoverflow2 calculates two hashes where none of the characters of the string representation match (7c35[...]85fa vs b283[...]3d19). That's because a hash is calculated based on the binary data of the file, thus two different formats of the same thing - e.g. .txt and a .docx of the same text - have different hashes.
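A quick way to see this for yourself is to hash the two example strings and print the digests (the class name HashDemo below is just for illustration):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HashDemo {
    public static void main(String[] args) throws Exception {
        // Hash two almost identical inputs; the resulting digests
        // share no recognizable similarity.
        for (String s : new String[] {"hellostackoverflow1", "hellostackoverflow2"}) {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            System.out.println(s + " -> " + hex);
        }
    }
}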

But as already noted, some speed might be achieved by using native code, i.e. the NDK. Additionally, if you still want to compare files for exact matches, first compare the size in bytes, and only after that use a hashing algorithm with enough speed and a low risk of collisions. As stated, CRC32 is fine.
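A rough sketch of that approach - compare sizes first, then fall back to a fast checksum - could look like the following. The class and method names (FastCompare, probablySame) and the 64 KB buffer are made up for this example, not a measured recommendation:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class FastCompare {

    // CRC32 of a file, read in 64 KB chunks.
    public static long crc32(File file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = new FileInputStream(file)) {
            int numRead;
            while ((numRead = in.read(buffer)) != -1) {
                crc.update(buffer, 0, numRead);
            }
        }
        return crc.getValue();
    }

    // Cheap size check first; only checksum when the sizes match.
    public static boolean probablySame(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false;
        }
        return crc32(a) == crc32(b);
    }

    public static void main(String[] args) throws IOException {
        // File names here are placeholders.
        System.out.println(probablySame(new File("a.bin"), new File("b.bin")));
    }
}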

Ch4t4r

Hash/CRC calculation takes some time as the file has to be read completely.

The createChecksum code you presented is nearly optimal. The only part that can be tweaked is the read buffer size (I would use a buffer of 2048 bytes or larger). However, this may get you at most a 1-2% speed improvement.
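For illustration only, a variant of createChecksum with a larger buffer might look like this; the class name BufferedMD5, the 64 KB size, and the use of DigestInputStream are assumptions to keep the sketch short, not a measured recommendation:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class BufferedMD5 {

    // Same result as createChecksum above, but with a larger read buffer;
    // DigestInputStream updates the digest as the data is read.
    public static byte[] createChecksum(String filename) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[64 * 1024]; // example size; tune and measure
        try (InputStream in = new DigestInputStream(new FileInputStream(filename), md)) {
            while (in.read(buffer) != -1) {
                // nothing to do here; reading drives the digest
            }
        }
        return md.digest();
    }
}

Since the hashing is usually dominated by disk I/O, the buffer size has limited effect, which matches the 1-2% estimate above.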

If this is still too slow, the only option left is to implement the hashing in C/C++ and use it as a native method. Besides that, there is nothing you can do.

Robert
  • Thank you very much for the answer. Can you please provide an example of how to do that, as I am not very familiar with C/C++? Also, is it fine to use a `crc32` checksum to check whether two files are the same or not? – Rahulrr2602 Jan 13 '18 at 12:26
  • For checking whether two files are the same you may use crc32. By the way, are you checking if the file sizes match before calculating the hashsum? – Ch4t4r Jan 13 '18 at 12:31
  • @Rahulrr2602: Whether to use MD5 or CRC32 is up to you. It depends on your requirements: how likely a collision may occur and what the consequences are. See [this question](https://stackoverflow.com/questions/14210298/probability-of-collision-when-using-a-32-bit-hash) for details. Presenting a native implementation is out of scope if you don't have C experience. Maybe there is an existing library available for Android, but I don't know of one. – Robert Jan 13 '18 at 12:44
  • @Ch4t4r Thanks, but I am not checking the size of the file before computing the MD5. The reason is that I want to check the similarity of the files based on content and not size. Is it possible for two different files to have the same content but be of different formats and hence have different sizes? – Rahulrr2602 Jan 13 '18 at 13:00
  • You cannot use hashes to determine any similarity of content. For instance, generating the MD5 of `hellostackoverflow1` and `hellostackoverflow2` calculates two hashes where none of the characters of the string representation match (7c35[...]85fa vs b283[...]3d19). That's because a hash is calculated based on the binary data of the file, thus two different formats of the same thing - e.g. a .txt and a .docx of the same text - have different hashes. – Ch4t4r Jan 14 '18 at 09:47
  • @Ch4t4r Thank you very much, that really solved my problem. Can you please post an answer so that I can accept it? – Rahulrr2602 Jan 14 '18 at 16:37