24

With Python 2.7, the following code computes the MD5 hexdigest of the contents of a file.

(EDIT: well, not really, as the answers have shown; I just thought it did.)

import hashlib

def md5sum(filename):
    f = open(filename, mode='rb')
    d = hashlib.md5()
    for buf in f.read(128):
        d.update(buf)
    return d.hexdigest()

Now if I run that code using Python 3, it raises a TypeError exception:

    d.update(buf)
TypeError: object supporting the buffer API required

I figured out that I could make that code run with both Python 2 and Python 3 by changing it to:

def md5sum(filename):
    f = open(filename, mode='r')
    d = hashlib.md5()
    for buf in f.read(128):
        d.update(buf.encode())
    return d.hexdigest()

Now I still wonder why the original code stopped working. It seems that when the file is opened in binary mode, iterating over the data it returns yields integers instead of strings encoded as bytes (I say that because type(buf) returns int). Is this behavior explained somewhere?
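
For illustration, here is the kind of quick check I mean. It is only a sketch in the interpreter, no real file needed; the byte string stands in for what f.read() returns in 'rb' mode:

buf = b'abcd'        # stands in for data read in 'rb' mode
print(type(buf))     # <class 'bytes'> on Python 3
print(type(buf[0]))  # <class 'int'> on Python 3, <type 'str'> on Python 2
print(list(buf))     # [97, 98, 99, 100] on Python 3, ['a', 'b', 'c', 'd'] on Python 2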

kriss
  • related: http://stackoverflow.com/q/4949162/ – jfs Oct 27 '11 at 15:41
  • Would it be faster if you did larger reads, closer to the filesystem's file block size? (for instance 1024 bytes on Linux ext3 and 4096 bytes or more on Windows NTFS) – rakslice Aug 01 '13 at 18:23
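
A rough way to try out rakslice's suggestion about larger reads is to time the same loop with different chunk sizes. This is only a sketch, and 'big.bin' is a placeholder for any large local file:

import hashlib
import time

def md5sum_chunked(filename, chunksize):
    d = hashlib.md5()
    with open(filename, mode='rb') as f:
        for buf in iter(lambda: f.read(chunksize), b''):
            d.update(buf)
    return d.hexdigest()

for chunksize in (128, 4096, 65536):
    start = time.time()
    md5sum_chunked('big.bin', chunksize)  # 'big.bin' is a placeholder name
    print(chunksize, time.time() - start)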

3 Answers

37

I think you wanted the for-loop to make successive calls to f.read(128). That can be done using iter() and functools.partial():

import hashlib
from functools import partial

def md5sum(filename):
    with open(filename, mode='rb') as f:
        d = hashlib.md5()
        for buf in iter(partial(f.read, 128), b''):
            d.update(buf)
    return d.hexdigest()

print(md5sum('utils.py'))
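
The two-argument form of iter() keeps calling its first argument and yields each result until a call returns the sentinel given as the second argument, and partial(f.read, 128) simply turns f.read(128) into a zero-argument callable. A standalone sketch of the same pattern with an in-memory stream:

from functools import partial
from io import BytesIO

f = BytesIO(b'abcdefgh')
reader = partial(f.read, 3)      # a callable equivalent to calling f.read(3)
for chunk in iter(reader, b''):  # call reader() repeatedly until it returns b''
    print(chunk)                 # b'abc', b'def', b'gh'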
Raymond Hettinger
  • Yes, that's exactly what I was trying to do. I finally achieved that with a less elegant solution than yours using a generator. – kriss Oct 20 '11 at 00:13
  • This leaks the file handle on some Python implementations. You should at least call `close`. – phihag Oct 20 '11 at 00:29
  • I've added a `with` statement to close the file properly. – jfs Oct 20 '11 at 02:09
  • @phihag: is there really a Python implementation where the automatic close actually *leaks* file handles? I thought it merely delayed the releasing of these file handles until garbage collection? – kriss Oct 20 '11 at 07:05
  • but the with statement is indeed nice anyway – kriss Oct 20 '11 at 07:12
  • @kriss Oops, you're right - close gets called eventually, even on Jython. However, that's only the case if you don't have an exception stacktrace lying around in `sys.exc_info` (for example if a `read` failed), so it's good form to call `close` or use the `with` statement. – phihag Oct 20 '11 at 08:45
  • @J.F.Sebastian Adding the with-statement "improved" the code at the expense of obfuscating the answer to the OP's question. A lot of people get confused or distracted by with-statement semantics, so it doesn't belong in an answer addressing iteration fundamentals. People who get hung-up on "leaking file handles" are wasting their time on something that almost never matters in real code. The with-statement is nice, but automatic file closing is a separate topic that isn't worth the distraction from an otherwise clear answer to a basic question about reading files in chunks. – Raymond Hettinger Sep 04 '12 at 22:51
  • @RaymondHettinger: if you don't like it, just revert the change. I considered it too minor a change to discuss. Though I strongly disagree with your reasoning: public code should follow best practices *especially* if it is aimed at beginners. If best practices are too hard to follow (though I don't think that is the case) for such a common task, then the language should change. – jfs Sep 05 '12 at 00:59
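
For readers following the close()/with discussion above: without the with statement, the explicit equivalent is a try/finally block. A sketch of the same function with manual closing:

import hashlib
from functools import partial

def md5sum(filename):
    f = open(filename, mode='rb')
    try:
        d = hashlib.md5()
        for buf in iter(partial(f.read, 128), b''):
            d.update(buf)
        return d.hexdigest()
    finally:
        f.close()  # runs even if read() or update() raises
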
10

for buf in f.read(128):
    d.update(buf)

... updates the hash sequentially with each of the first 128 byte values of the file. Since iterating over a bytes object produces int objects, you get the following calls, which cause the error you encountered in Python 3:

d.update(97)
d.update(98)
d.update(99)
d.update(100)

which is not what you want.
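
You can reproduce that failure directly, without any file involved (a minimal Python 3 check):

import hashlib

d = hashlib.md5()
d.update(b'abcd')  # fine: update() accepts a bytes-like object
try:
    d.update(97)   # an int, as produced by iterating over bytes
except TypeError as exc:
    print(exc)     # object supporting the buffer API required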

Instead, you want:

def md5sum(filename):
    with open(filename, mode='rb') as f:
        d = hashlib.md5()
        while True:
            buf = f.read(4096)  # 128 is smaller than the typical filesystem block
            if not buf:
                break
            d.update(buf)
        return d.hexdigest()
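
A quick sanity check for a version like this (a sketch; 'utils.py' is just an example file, as in the other answer) is to compare the chunked digest against hashing the whole file in one go:

import hashlib

def md5sum_whole(filename):
    with open(filename, mode='rb') as f:
        return hashlib.md5(f.read()).hexdigest()

print(md5sum('utils.py') == md5sum_whole('utils.py'))  # True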
phihag
  • This will eat the whole RAM if you open a huge file. That's why we buffer. – Umur Kontacı Oct 19 '11 at 23:46
  • @fastreload Already added that ;). Since the original solution didn't even work for files with >128 bytes, I don't think memory is an issue, but I added a buffered read anyway. – phihag Oct 19 '11 at 23:49
  • Well done then, yet the OP claimed that he could use his code in Python 2.x and it stopped working on 3.x. And I remember I made a 1-byte buffer for calculating the md5 of a 3 GB iso file for benchmarking and it did not fail. My bet is, Python 2.7 has a failsafe mechanism so that whatever the user input is, the minimum buffer size does not go below a certain level. What do you say? – Umur Kontacı Oct 19 '11 at 23:53
  • @fastreload The code didn't crash in Python 2 since iterating over a `str` produced `str`. The result was still wrong for files larger than 128 bytes. Sure, you can adjust the buffer size as you want (unless you have a fast SSD, the CPU will get bored anyway, and good OSs preload the next bytes of the file). Python 2.7 definitely has no such failsafe mechanism; that would violate the contract of `read`. The OP just did not compare the results of the script with the canonical `md5sum`'s, or the results of the script on two files with identical first 128 bytes. – phihag Oct 19 '11 at 23:57
  • yes, my original code is indeed broken (but not yet in the wild). I just didn't test it on large files with the same beginning. I should have guessed there was a real problem as it was running way too fast. – kriss Oct 20 '11 at 00:16
  • This answer is incorrect when it says, "iterating over a bytes produces str objs". list(b'abc') --> [97, 98, 99] – Raymond Hettinger Oct 20 '11 at 23:32
  • @RaymondHettinger Oops, stupid me. Tested it in 2.7 and was surprised to get `str`s - duh. Fixed. – phihag Oct 21 '11 at 07:39
1

I finally changed my code to the version below (which I find easy to understand) after asking the question. But I will probably change it to the version suggested by Raymond Hettinger using functools.partial.

import hashlib

def chunks(filename, chunksize):
    f = open(filename, mode='rb')
    buf = "Let's go"             # any non-empty sentinel so the loop runs at least once
    while len(buf):
        buf = f.read(chunksize)  # returns b'' at end of file, which ends the loop
        yield buf

def md5sum(filename):
    d = hashlib.md5()
    for buf in chunks(filename, 128):
        d.update(buf)            # the final empty chunk is a harmless no-op
    return d.hexdigest()
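
For reference, the generator by itself yields the file in pieces followed by one final empty chunk (a sketch; 'utils.py' is again just an example filename):

for piece in chunks('utils.py', 128):
    print(len(piece))  # 128, 128, ..., then the remainder, and finally 0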
kriss
  • This will now work even if the file length is not a multiple of chunksize: read will in fact return a shorter buffer in the last read. The termination is given by an empty buffer, which is why the "not buf" condition is in the example code above (which works). – Mapio Oct 20 '11 at 05:18
  • @Mapio: there is indeed a kind of bug in my code, but not at all where you say. The file length is irrelevant. The code above works provided there is no partial read returning an incomplete buffer. If a partial read occurs, it will stop too soon (but taking the partial buffer into account). A partial read may occur in some cases, say if the program receives a managed interrupt signal while reading, then continues reading after returning from the interruption. – kriss Oct 20 '11 at 07:00
  • Well, in the above comment, when speaking of "code above" I'm referring to the old version. The current one is now working (even if it's not the best possible solution). – kriss Oct 20 '11 at 11:47