368

I'm using iTextSharp to read the text from a PDF file. However, there are times I cannot extract text, because the PDF file contains only images. I download the same PDF files every day, and I want to see if the PDF has been modified. If the text and modification date cannot be obtained, is an MD5 checksum the most reliable way to tell if the file has changed?

If it is, some code samples would be appreciated, because I don't have much experience with cryptography.

CodesInChaos
broke

7 Answers

847

It's very simple using System.Security.Cryptography.MD5:

using (var md5 = MD5.Create())
{
    using (var stream = File.OpenRead(filename))
    {
        return md5.ComputeHash(stream);
    }
}

(I believe that actually the MD5 implementation used doesn't need to be disposed, but I'd probably still do so anyway.)

How you compare the results afterwards is up to you; you can convert the byte array to base64 for example, or compare the bytes directly. (Just be aware that arrays don't override Equals. Using base64 is simpler to get right, but slightly less efficient if you're really only interested in comparing the hashes.)
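To illustrate the comparison options mentioned above, here is a minimal sketch. The class name `FileHashComparer` and its method names are made up for this example; the point is only that `byte[].Equals` compares references, so the contents have to be compared explicitly (here with LINQ's `SequenceEqual`) or via a string form such as base64.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper for illustration; not part of any library.
static class FileHashComparer
{
    // Streams the file through MD5, the same approach as the snippet above.
    public static byte[] ComputeMd5(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return md5.ComputeHash(stream);
        }
    }

    // Arrays don't override Equals, so compare the bytes element by element.
    public static bool HashesEqual(byte[] first, byte[] second)
    {
        return first != null && second != null && first.SequenceEqual(second);
    }

    // Alternative: turn the hash into base64 and compare the strings instead.
    public static string ToBase64(byte[] hash)
    {
        return Convert.ToBase64String(hash);
    }
}
```

For the daily-download case, you could store yesterday's hash (as bytes or as a base64 string) and call `HashesEqual` against the hash of today's file.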

If you need to represent the hash as a string, you could convert it to hex using BitConverter:

static string CalculateMD5(string filename)
{
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(filename))
        {
            var hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
Jon Skeet
  • 261
    If you want the "standard" looking md5, you can do: return `BitConverter.ToString(md5.ComputeHash(stream)).Replace("-","").ToLower();` – aquinas May 09 '12 at 16:25
  • @aquinas What would be the preferred format when inserting into a database?. – broke May 09 '12 at 16:29
  • That would be the format I would use. It will give you a format like this: 837a6f4fad381c2a7b909032133ddaf6, which is almost always how you'll see MD5 hashes formatted. – aquinas May 09 '12 at 16:32
  • 80
    MD5 is in System.Security.Cryptography - just to surface the info more. – Hans Apr 17 '13 at 05:06
  • 2
    What about CRC32 instead of MD5? – Kala J Jul 28 '14 at 17:23
  • I am wondering, if I had a database file and I wanted to make sure it wasn't corrupted, can I run a CRC32 checksum on it to check integrity in a similar way you illustrated above? – Kala J Jul 28 '14 at 18:01
  • @KalaJ: Yes, absolutely. – Jon Skeet Jul 28 '14 at 18:01
  • @JonSkeet, for database integrity, would the type of checksum matter in terms of security? Would a CRC32 checksum be appropriate or should I use something like SHA? Does .NET have a built in CRC32 algorithm? Thank you! – Kala J Jul 30 '14 at 13:26
  • 6
    @KalaJ: If you're trying to spot deliberate tampering, then CRC32 is entirely inappropriate. If you're only talking about spotting data transfer failures, it's fine. Personally I'd probably use SHA-256 just out of habit :) I don't know about support for CRC32 in .NET offhand, but you can probably search for it as quickly as I can :) – Jon Skeet Jul 30 '14 at 13:40
  • @JonSkeet, What could be causing something like this? http://stackoverflow.com/questions/25040912/running-sha-checksum-on-db-but-db-mdf-is-being-used-by-another-process – Kala J Jul 30 '14 at 16:29
  • FYI: If you are comparing two streams, the read position must be the same on both streams for the MD5 hashes of identical files to match. Just ran into this issue. – Chris - Haddox Technologies Dec 30 '14 at 20:32
  • It's not quite so simple with text files -- it is all too easy to end up with the "same" file with different line endings on different computers (e.g. from a perforce sync or git pull of a text file with client-specific line ending conversion). This can result in that "same" file having different checksums, which can cause issues, depending on your application. If this is an issue you may need to use TransformBlock and friends to accumulate the hash over the non-end-of-line portion of the file. – yoyo Feb 26 '15 at 20:26
  • 12
    @aquinas I think `.Replace("-", String.Empty)` is a better approach. I went through a one hour debug session because I get wrong results when comparing a user input to the file hash. – fabwu Jan 01 '17 at 13:16
  • @wuethrich44 Are you just objecting to the use of `""` instead of `string.Empty`? – Jon Skeet Jan 01 '17 at 14:41
  • @JonSkeet Yes when I use `""` and compare the hash to another string (user input) it is not equal. I compare the strings with ordinal equals. Do you know why this happening? – fabwu Jan 02 '17 at 10:55
  • @wuethrich44: No, but it wouldn't be due to the use of `""` instead of `string.Empty`. It's absolutely fine to use `""`. I suggest you ask another question with details, if you can still reproduce the problem. – Jon Skeet Jan 02 '17 at 10:58
  • @JonSkeet Ok then I will open a separate question and send you the link. – fabwu Jan 02 '17 at 11:12
  • 7
    @wuethrich44, I think the problem you're having is if you copy/paste the code in aquinas comment verbatim; I happened to notice the same thing. There are two invisible characters--a "zero-width non-joiner" and a Unicode "zero width space"--between the "empty" quotes in the raw HTML. I don't know if it was in the original comment or if SO is to blame here. – Chris Simmons Jan 19 '17 at 16:17
  • If you want the string in the format used by Azure Blob's then the code in this answer might be helpful: http://stackoverflow.com/a/43647643/411428 – Manfred Apr 27 '17 at 03:37
  • To speed it up for large files, it's better to use it with a buffersize, e.g.: using (var stream = new BufferedStream(File.OpenRead(filename), 1048576) – Beetee Aug 09 '17 at 09:04
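Following up on the SHA-256 suggestion in the comments above, here is a minimal sketch of the same streaming approach with `SHA256` swapped in for `MD5`. The class and method names are invented for this example; everything else mirrors the `CalculateMD5` method in the answer.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration; not part of any library.
static class Sha256FileHasher
{
    // Returns the SHA-256 hash of the file as a 64-character lowercase hex string.
    public static string CalculateSha256(string filename)
    {
        using (var sha256 = SHA256.Create())
        using (var stream = File.OpenRead(filename))
        {
            byte[] hash = sha256.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

The result is twice as long as an MD5 hex string, which matters if you size a database column for it.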
71

This is how I do it:

using System.IO;
using System.Security.Cryptography;

public string checkMD5(string filename)
{
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(filename))
        {
            return Encoding.Default.GetString(md5.ComputeHash(stream));
        }
    }
}
Peter Mortensen
BoliBerrys
  • 2
    I upvoted you because more people need to do things like this. – Krythic Jan 08 '16 at 00:12
  • 6
    I think swapping the `using` blocks would be useful, because opening a file is more probably going to fail. Fail early/fast approach saves you the resources needed to create (and destroy) the MD5 instance in such scenarios. Also you can omit the braces of the first `using` and save a level of indentation without losing readability. – Palec Jan 08 '16 at 10:00
  • 14
    This converts the 16 bytes long result to a string of 16 chars, not the expected 32 chars hex value. – NiKiZe Jan 14 '16 at 19:54
  • 3
    This code does not produce the expected result (assumed expectation). Agreeing with @NiKiZe – Nick Jan 15 '16 at 17:31
  • also a reference is missing: using System.Text; – Mohsen Abasi Dec 25 '16 at 09:52
  • 1
    @Palec, do you realise you just optimised your failure case? "When our program errors it returns that error .0000000000001s quicker to the user than before!". Unless its box processing a metric crap ton of requests where smth like this might matter its a really, really low value optimisation. – Quibblesome Jan 07 '19 at 18:12
  • 1
    @Quibblesome, I was just trying to promote the general idea that the order of nesting of using statements matters. Elsewhere, the difference might be significant. Why not practice the habit of detecting failure early? I agree, though, that in this specific snippet, the habit brings almost no benefit. – Palec Jan 08 '19 at 08:00
  • 1
    Unlike Jon Skeet's answer with BitConverter, Encoding.Default.GetString returns nonascii character gibberish for me (running within Unity). – idbrii Jul 24 '19 at 23:24
8

I know this question was already answered, but this is what I use:

using (FileStream fStream = File.OpenRead(filename)) {
    return GetHash<MD5>(fStream);
}

Where GetHash:

public static String GetHash<T>(Stream stream) where T : HashAlgorithm {
    StringBuilder sb = new StringBuilder();

    MethodInfo create = typeof(T).GetMethod("Create", new Type[] {});
    using (T crypt = (T) create.Invoke(null, null)) {
        byte[] hashBytes = crypt.ComputeHash(stream);
        foreach (byte bt in hashBytes) {
            sb.Append(bt.ToString("x2"));
        }
    }
    return sb.ToString();
}

Probably not the best way, but it can be handy.

Badaro Jr.
  • I have made a small change to your GetHash function. I've turned it into an extension method and removed the reflection code. – Leslie Marshall Feb 28 '17 at 18:41
  • 3
    `public static String GetHash<T>(this Stream stream) where T : HashAlgorithm, new() { StringBuilder sb = new StringBuilder(); using (T crypt = new T()) { byte[] hashBytes = crypt.ComputeHash(stream); foreach (byte bt in hashBytes) { sb.Append(bt.ToString("x2")); } } return sb.ToString(); }` – Leslie Marshall Feb 28 '17 at 18:42
  • This actually worked... thank you! I spent far too long looking online for code that would produce a normal 32-character MD5 string. This is a little more complicated than I would prefer, but it definitely works. – Troublesum May 26 '17 at 00:01
  • 1
    @LeslieMarshall if you are going to use it as a extension method then you should reset the stream location rather than leaving it at the end position – MikeT Jul 03 '17 at 11:24
  • I had better luck with @LeslieMarshall's method using `(T) HashAlgorithm.Create(typeof(T).Name)` and removing the `new()` constraint. For my implementation, I also changed it so the parameter is `this byte[] resource` and putting the stream in the method with `using var stream = new MemoryStream(resource)`. You'll then only need to tell the compiler that `crypt` isn't null. – Shenk Jun 30 '20 at 01:53
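As a possible alternative to both the reflection and the `new()` constraint discussed above (`MD5` itself is abstract, so it cannot satisfy `new()`), the caller can pass a factory delegate. This is only a sketch: `StreamHasher` and the `createAlgorithm` parameter are names invented here. It also restores the stream position afterwards when the stream is seekable, as suggested in the comments.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not part of any library.
static class StreamHasher
{
    // Hashes the stream from its current position and returns lowercase hex.
    public static string GetHash(Stream stream, Func<HashAlgorithm> createAlgorithm)
    {
        // Remember where we started so a seekable stream can be rewound afterwards.
        long start = stream.CanSeek ? stream.Position : 0;

        var sb = new StringBuilder();
        using (HashAlgorithm algorithm = createAlgorithm())
        {
            foreach (byte b in algorithm.ComputeHash(stream))
            {
                sb.Append(b.ToString("x2"));
            }
        }

        if (stream.CanSeek)
        {
            stream.Position = start;
        }
        return sb.ToString();
    }
}
```

A call would look like `StreamHasher.GetHash(fStream, () => MD5.Create())`, and the same method works with `() => SHA256.Create()`.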
3

Here is a slightly simpler version that I found. It reads the entire file in one go and only requires a single using directive.

byte[] ComputeHash(string filePath)
{
    using (var md5 = MD5.Create())
    {
        return md5.ComputeHash(File.ReadAllBytes(filePath));
    }
}
Peter Mortensen
Ashley Davis
  • 53
    The downside of using `ReadAllBytes` is that it loads the whole file into a single array. That doesn't work at all for files larger than 2 GiB and puts a lot of pressure on the GC even for medium sized files. Jon's answer is only slightly more complex, but doesn't suffer from these problems. So I prefer his answer over yours. – CodesInChaos Dec 15 '14 at 11:23
  • 1
    Putting the `using`s one after the other, without the first pair of curly braces, as in `using (var md5 = MD5.Create()) using (var stream = File.OpenRead(filename))`, gives you one using per line without unnecessary indentation. – NiKiZe Jan 14 '16 at 19:50
  • 3
    @NiKiZe You can put an entire program on one line and eliminate ALL indentation. You can even use XYZ as variable names! What is the benefit to others? – Derek Johnson Aug 11 '17 at 17:59
  • @DerekJohnson the point I was trying to make was probably that "and only requires a single `using` directive." was not really a good reason to read everything into memory. The more effective approach is to stream in the data into `ComputeHash`, and if possible `using` should only be used, but I can totally understand if you want to avoid the extra level of indentation. – NiKiZe Aug 12 '17 at 18:32
3

I know I am late to the party, but I ran a test before actually implementing the solution.

I tested the built-in MD5 class against md5sum.exe. In my case the built-in class took 13 seconds, whereas md5sum.exe took around 16-18 seconds on every run.

    DateTime current = DateTime.Now;
    string file = @"C:\text.iso"; // a 2.5 GB file
    string output;
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(file))
        {
            byte[] checksum = md5.ComputeHash(stream);
            output = BitConverter.ToString(checksum).Replace("-", String.Empty).ToLower();
            Console.WriteLine("Total seconds : " + (DateTime.Now - current).TotalSeconds.ToString() + " " + output);
        }
    }
Romil Kumar Jain
2

And if you need to calculate the MD5 to see whether it matches the MD5 of an Azure blob, then this SO question and answer might be helpful: MD5 hash of blob uploaded on Azure doesnt match with same file on local machine

Community
Manfred
  • If you think that the answer is not great, then downvoting is fine. However, leaving a comment describing the reasons for the downvote would help to improve answers over time. By leaving a comment with suggestions for improving an answer you can better contribute to Stack Overflow. Thanks! – Manfred Sep 26 '19 at 22:51
0

For dynamically generated PDFs, the creation and modification dates will always be different.

You have to remove them or set them to a constant value.

Then generate the MD5 hash of each file and compare the hashes.

You can use PdfStamper to remove or update the dates.

Khalil