194

I have used hashlib (which replaces md5 in Python 2.6/3.0), and it worked fine if I opened a file and put its content into the hashlib.md5() function.

The problem is with very big files whose sizes could exceed the available RAM.

How to get the MD5 hash of a file without loading the whole file to memory?

Chris
JustRegisterMe
  • 22
    I would rephrase: "How to get the MD5 hash of a file without loading the whole file to memory?" – XTL Feb 24 '12 at 12:29

13 Answers

223

You need to read the file in chunks of suitable size:

import hashlib

def md5_for_file(f, block_size=2**20):
    # f must be a file object opened in binary ('rb') mode
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()

NOTE: Make sure you open your file in binary mode by passing 'rb' to open() - otherwise you will get the wrong result.
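For example, a minimal sketch (with a made-up file name) of why binary mode matters:

import hashlib

md5 = hashlib.md5()
with open("example.bin", "rb") as f:  # 'rb' makes read() return bytes, which update() accepts
    md5.update(f.read())

With mode 'r' (text mode), read() returns str and md5.update() raises a TypeError on Python 3; on Python 2 under Windows, newline translation would silently produce the wrong digest instead.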

So to do the whole lot in one method, use something like:

import hashlib
import os

def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest()

The update above was based on the comments provided by Frerich Raabe. I tested this and found it to be correct on my Python 2.7.2 Windows installation.

I cross-checked the results using the 'jacksum' tool.

jacksum -a md5 <filename>

http://www.jonelo.de/java/jacksum/

TheDoctor
  • 29
    What's important to notice is that the file which is passed to this function must be opened in binary mode, i.e. by passing `rb` to the `open` function. – Frerich Raabe Jul 21 '11 at 13:02
  • 11
    This is a simple addition, but using `hexdigest` instead of `digest` will produce a hexadecimal hash that "looks" like most examples of hashes. – tchaymore Oct 16 '11 at 02:26
  • Shouldn't it be `if len(data) < block_size: break`? – Erik Kaplun Nov 02 '12 at 10:35
  • 2
    Erik, no, why would it be? The goal is to feed all bytes to MD5, until the end of the file. Getting a partial block does not mean all the bytes should not be fed to the checksum. –  Nov 02 '12 at 20:12
  • @FrerichRaabe: Thanks. I always forget that and then my code blows up on Windows machines. – Harvey Apr 12 '13 at 13:58
  • Mandatory to get the right hashsum: Reset the position where to read from! I.e. by adding `f.seek(0)` to the first line of the first algorithm. Otherwise the start location is unknown and may be in the middle of the file. – user2084795 Jun 15 '15 at 18:39
  • 2
    @user2084795 `open` __always__ opens a fresh file handle with the position set to the start of the file, _(unless you open a file for append)._ – Steve Barnes Jul 05 '17 at 09:15
163

Break the file into 8192-byte chunks (or some other multiple of MD5's 64-byte block size) and feed them to MD5 consecutively using update().

This takes advantage of the fact that MD5 processes its input in 64-byte blocks (8192 is 64×128). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory.
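For Python versions before 3.8, an equivalent sketch using the two-argument iter() form (the same pattern appears in other answers below; the file name is a placeholder):

import hashlib

file_hash = hashlib.md5()
with open("your_filename.txt", "rb") as f:
    # read() returns b'' at EOF, which stops the iterator
    for chunk in iter(lambda: f.read(8192), b""):
        file_hash.update(chunk)
print(file_hash.hexdigest())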

In Python 3.8+ you can do

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)
print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes
Boris
Yuval Adam
  • 82
    You can just as effectively use a block size of any multiple of 128 (say 8192, 32768, etc.) and that will be much faster than reading 128 bytes at a time. – jmanning2k Jul 15 '09 at 15:09
  • 41
    Thanks jmanning2k for this important note; a test on a 184 MB file takes (0m9.230s, 0m2.547s, 0m2.429s) using (128, 8192, 32768). I will use 8192, as the higher value gives no noticeable effect. – JustRegisterMe Jul 17 '09 at 19:33
  • If you can, you should use [`hashlib.blake2b`](https://docs.python.org/3/library/hashlib.html#blake2) instead of `md5`. Unlike MD5, [BLAKE2](https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE2) is secure, and it's even faster. – Boris Nov 22 '19 at 11:59
  • 3
    @Boris, you can't actually say that BLAKE2 is secure. All you can say is that it hasn't been broken yet. – vy32 Apr 08 '20 at 14:57
  • @vy32 you can't say it's definitely going to be broken either. We'll see in 100 years, but it's at least better than MD5 which is definitely insecure. – Boris Apr 08 '20 at 15:50
  • @Boris, I didn't mean to imply to you that it's going to be broken. All we know is that it hasn't been broken yet. MD5 being "broken" is a funny thing. It's still not susceptible to the second pre-image attack, which is why it's still widely used in digital forensics. – vy32 Apr 08 '20 at 17:17
113

Below I've incorporated the suggestions from the comments. Thank you all!

Python < 3.8

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename,'rb') as f: 
        for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''): 
            h.update(chunk)
    return h.digest()

Python 3.8 and above

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename,'rb') as f: 
        while chunk := f.read(chunk_num_blocks*h.block_size): 
            h.update(chunk)
    return h.digest()

original post

If you care about a more Pythonic (no 'while True') way of reading the file, check this code:

import hashlib

def checksum_md5(filename):
    md5 = hashlib.md5()
    with open(filename,'rb') as f: 
        for chunk in iter(lambda: f.read(8192), b''): 
            md5.update(chunk)
    return md5.digest()

Note that the iter() function needs an empty byte string for the returned iterator to halt at EOF, since read() returns b'' (not just '').

Piotr Czapla
  • 18
    Better still, use something like `128*md5.block_size` instead of `8192`. – mrkj Jan 06 '11 at 22:51
  • 1
    mrkj: I think it's more important to pick your read block size based on your disk and then to ensure that it's a multiple of `md5.block_size`. – Harvey Apr 12 '13 at 14:10
  • 6
    the `b''` syntax was new to me. Explained [here](http://stackoverflow.com/a/6269785/1174169). – cod3monk3y Feb 18 '14 at 05:19
  • @Harvey Is there any rule of thumb for common disk block sizes? Or otherwise any recommendations for determining optimal block size? – ThorSummoner Mar 16 '15 at 05:23
  • 1
    @ThorSummoner: Not really, but from my work finding optimum block sizes for flash memory, I'd suggest just picking a number like 32k, or something easily divisible by 4, 8, or 16k. For example, if your block size is 8k, reading 32k will be 4 reads at the correct block size. If it's 16k, then 2. But in each case, we're good because we happen to be reading an integer multiple of the block size. – Harvey Mar 16 '15 at 14:21
  • 1
    "while True" is quite pythonic. – Jürgen A. Erhard Dec 16 '15 at 09:07
  • In Python 3.8+ you can just do `while chunk := f.read(8192):` – Boris Nov 22 '19 at 11:52
  • is there a better way we can read the small chunks in parallel and get the hash, for better processing time/ performance if the file size is too big? – Soumya Jul 25 '20 at 22:21
  • Is there an automatic way to set the block-size for ones disk? – Tom de Geus Nov 30 '20 at 09:38
51

Here's my version of @Piotr Czapla's method:

import hashlib

def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()
Luqmaan
Nathan Feger
30

Using multiple comments/answers in this thread, here is my solution:

import hashlib
def md5_for_file(path, block_size=256*128, hr=False):
    '''
    Block size directly depends on the block size of your filesystem
    to avoid performance issues.
    The filesystem block is assumed to be 4096 octets (default NTFS),
    so the default block_size of 256*128 = 32768 octets is a whole multiple of it.
    '''
    md5 = hashlib.md5()
    with open(path,'rb') as f: 
        for chunk in iter(lambda: f.read(block_size), b''): 
            md5.update(chunk)
    if hr:
        return md5.hexdigest()
    return md5.digest()
  • This is "pythonic"
  • This is a function
  • It avoids implicit values: always prefer explicit ones.
  • It allows (very important) performance optimizations

And finally,

- This has been built by a community; thanks all for your advice and ideas.

Bastien Semene
  • 3
    One suggestion: make your md5 object an optional parameter of the function to allow alternate hashing functions, such as sha256 to easily replace MD5. I'll propose this as an edit, as well. – Hawkwing Aug 15 '13 at 19:41
  • 1
    also: digest is not human-readable. hexdigest() allows a more understandable, commonly recognizable output as well as easier exchange of the hash – Hawkwing Aug 15 '13 at 19:51
  • Other hash formats are out of the scope of the question, but the suggestion is relevant for a more generic function. I added a "human readable" option according to your 2nd suggestion. – Bastien Semene Aug 27 '13 at 08:17
  • Can you elaborate on how 'hr' is functioning here? – EnemyBagJones Mar 23 '18 at 18:19
  • @EnemyBagJones 'hr' stands for human readable. It returns a string of 32 char length hexadecimal digits: https://docs.python.org/2/library/md5.html#md5.md5.hexdigest – Bastien Semene Mar 27 '18 at 09:46
9

A Python 2/3 portable solution

To calculate a checksum (md5, sha1, etc.), you must open the file in binary mode, because you'll hash byte values.

To be py27/py3 portable, you ought to use the io package, like this:

import hashlib
import io


def md5sum(src):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        content = fd.read()
        md5.update(content)
    return md5

If your files are big, you may prefer to read the file by chunks to avoid storing the whole file content in memory:

def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
    return md5

The trick here is to use the iter() function with a sentinel (the empty byte string b'').

The iterator created in this case will call the object [here, the lambda function] with no arguments for each call to its next() method; if the value returned is equal to the sentinel, StopIteration will be raised, otherwise the value will be returned.
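As a small illustration of that two-argument iter() form (using an in-memory buffer in place of a real file):

import io

fd = io.BytesIO(b"abcdefghij")  # stands in for a file opened in binary mode
for chunk in iter(lambda: fd.read(4), b''):
    print(chunk)  # prints b'abcd', b'efgh', b'ij'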

If your files are really big, you may also need to display progress information. You can do that by calling a callback function which prints or logs the amount of calculated bytes:

def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
    calculated = 0
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
            calculated += len(chunk)
            callback(calculated)
    return md5
Laurent LAPORTE
4

A remix of Bastien Semene's code that takes Hawkwing's comment about a generic hashing function into consideration...

def hash_for_file(path, algorithm=hashlib.algorithms[0], block_size=256*128, human_readable=True):
    """
    Block size directly depends on the block size of your filesystem
    to avoid performance issues.
    The filesystem block is assumed to be 4096 octets (default NTFS),
    so the default block_size of 256*128 = 32768 octets is a whole multiple of it.

    Linux Ext4 block size
    sudo tune2fs -l /dev/sda5 | grep -i 'block size'
    > Block size:               4096

    Input:
        path: a path
        algorithm: an algorithm in hashlib.algorithms
                   ATM: ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
        block_size: a multiple of 128 corresponding to the block size of your filesystem
        human_readable: switch between digest() or hexdigest() output, default hexdigest()
    Output:
        hash
    """
    if algorithm not in hashlib.algorithms:
        raise NameError('The algorithm "{algorithm}" you specified is '
                        'not a member of "hashlib.algorithms"'.format(algorithm=algorithm))

    hash_algo = hashlib.new(algorithm)  # According to the hashlib documentation, using new()
                                        # will be slower than calling the named
                                        # constructors, e.g. hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            hash_algo.update(chunk)
    if human_readable:
        file_hash = hash_algo.hexdigest()
    else:
        file_hash = hash_algo.digest()
    return file_hash
Richard
0

You can't get its MD5 without reading the full content, but you can use the update() function to process the file's content block by block:
m.update(a); m.update(b) is equivalent to m.update(a+b)
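A minimal sketch of that equivalence (the data is made up):

import hashlib

a, b = b"hello ", b"world"

whole = hashlib.md5()
whole.update(a + b)

parts = hashlib.md5()
parts.update(a)
parts.update(b)

assert whole.hexdigest() == parts.hexdigest()  # both give the same digest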

sunqiang
0

I think the following code is more pythonic:

from hashlib import md5

def get_md5(fname):
    m = md5()
    with open(fname, 'rb') as fp:
        for chunk in fp:  # iterating a binary file yields newline-delimited chunks
            m.update(chunk)
    return m.hexdigest()
Waket Zheng
0

I don't like loops. Based on @Nathan Feger:

import functools
import hashlib

md5 = hashlib.md5()
with open(filename, 'rb') as f:  # `filename` is assumed to be defined elsewhere
    functools.reduce(lambda _, c: md5.update(c), iter(lambda: f.read(md5.block_size * 128), b''), None)
md5.hexdigest()
Sebastian Wagner
  • What possible reason is there to replace a simple and clear loop with a functools.reduce abberation containing multiple lambdas? I'm not sure if there's any convention on programming this hasn't broken. – Naltharial May 14 '19 at 16:44
  • My main problem was that `hashlib`'s API doesn't really play well with the rest of Python. For example, take `shutil.copyfileobj`, which just barely fails to work (a sketch of that idea follows below). My next idea was `fold` (aka `reduce`), which folds iterables together into single objects, like e.g. a hash. `hashlib` doesn't provide operators, which makes this a bit cumbersome. Nevertheless, we're folding an iterable here. – Sebastian Wagner May 14 '19 at 18:46
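As a sketch of the shutil.copyfileobj idea from the comment above (the wrapper class and file name are made up), adapting a hash object to the writer interface copyfileobj expects:

import hashlib
import shutil

class HashWriter:
    """Hypothetical adapter: exposes write() so copyfileobj can feed a hash."""
    def __init__(self, h):
        self.h = h
    def write(self, data):
        self.h.update(data)

md5 = hashlib.md5()
with open("your_filename.txt", "rb") as f:
    shutil.copyfileobj(f, HashWriter(md5), length=8192)  # copies in 8192-byte chunks
print(md5.hexdigest())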
-1

Implementation of the accepted answer for Django:

import hashlib
from django.db import models


class MyModel(models.Model):
    file = models.FileField()  # any field based on django.core.files.File

    def get_hash(self):
        hash = hashlib.md5()
        for chunk in self.file.chunks(chunk_size=8192):
            hash.update(chunk)
        return hash.hexdigest()
lampslave
-3
import hashlib

opened = open('/home/parrot/pass.txt', 'r')
lines = opened.readlines()  # note: this hashes each line separately, not the whole file
for i in lines:
    strip1 = i.strip('\n')
    hash_object = hashlib.md5(strip1.encode())
    hash2 = hash_object.hexdigest()
    print(hash2)
WhiZTiM
  • 1
    please, format the code in the answer, and read this section before giving answers: http://stackoverflow.com/help/how-to-answer – Farside Jul 17 '16 at 21:53
  • 1
    This will not work correctly as it is reading the file in text mode line by line then messing with it and printing the md5 of each stripped, encoded, line! – Steve Barnes Jul 05 '17 at 09:18
-4

I'm not sure that there isn't a bit too much fussing around here. I recently had problems with md5 and files stored as blobs on MySQL so I experimented with various file sizes and the straightforward Python approach, viz:

FileHash = hashlib.md5(FileData).hexdigest()  # FileData holds the file's entire contents as bytes

I could detect no noticeable performance difference with a range of file sizes from 2 KB to 20 MB, and therefore saw no need to 'chunk' the hashing. Anyway, if Linux has to go to disk, it will probably do it at least as well as the average programmer's ability to keep it from doing so. As it happened, the problem had nothing to do with md5. If you're using MySQL, don't forget the md5() and sha1() functions already there.

user2099484
  • 2
    This is not answering the question and 20 MB is hardly considered a *very big file* that may not fit into RAM as discussed here. – Chris May 04 '15 at 12:03