112

I want Python to read to the EOF so I can get an appropriate hash, whether it is SHA1 or MD5. Please help. Here is what I have so far:

import hashlib

inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()

md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()

sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()

print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed
user3358300
  • and what is the problem? – isedev Feb 27 '14 at 02:54
  • I want it to be able to hash a file. I need it to read until the EOF, whatever the file size may be. – user3358300 Feb 27 '14 at 03:00
  • that is exactly what `file.read()` does - read the entire file. – isedev Feb 27 '14 at 03:01
  • The documentation for the `read()` method says? – Ignacio Vazquez-Abrams Feb 27 '14 at 03:01
  • You should go through "what is hashing?". – Sharif Mamun Feb 27 '14 at 03:04
  • With the code I have it reads and hashes the file but I verified it and the hash given by my program is wrong. I have read on here in similar cases that it must go through a loop in order to read the whole file but I can't figure out how to make it work for my code. Take this one for example: http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python?rq=1 – user3358300 Feb 27 '14 at 03:09
  • @user3358300 you may want to take a look at the code I've shown in my answer below. I think it may help. – Randall Hunt Feb 27 '14 at 04:18
  • How can I get the SHA256 hash of a large file in Python2 that will match the ones provided in ASC files? – user742864 Apr 09 '20 at 23:45
  • https://www.quickprogrammingtips.com/python/how-to-calculate-sha256-hash-of-a-file-in-python.html ???? – user742864 Apr 10 '20 at 02:35
  • SHA1 should not be used anymore because it has been proven to be possible to [generate multiple files with the same SHA1 hash](https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html). SHA256 and SHA3 are considered far more secure. – user9811991 Dec 05 '20 at 02:18

6 Answers

165

TL;DR: read in buffered chunks so you don't use tons of memory.

We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don't want this bad boy to churn through 2 GB of RAM for a 2 gigabyte file, so, as pasztorpisti points out, we've got to deal with those bigger files in chunks!

import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # let's read stuff in 64 KiB chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

What we've done is update our hashes of this bad boy in 64 KiB chunks as we go, using hashlib's handy dandy update method. This way we use a lot less memory than the 2 GB it would take to hash the whole thing at once!

You can test this with:

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile
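
mkfile is macOS-specific; on Linux, a rough equivalent of the same check (a sketch assuming GNU coreutils, and that the script above is saved as hashes.py) would be:

$ truncate -s 2G bigfile          # or: dd if=/dev/zero of=bigfile bs=1M count=2048
$ python hashes.py bigfile
$ md5sum bigfile
$ sha1sum bigfile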

Hope that helps!

Also, all of this is outlined in the linked question on the right-hand side: Get MD5 hash of big files in Python


Addendum!

In general, when writing Python it helps to get into the habit of following PEP 8. For example, in Python variables are typically underscore_separated, not camelCased. But that's just style, and no one really cares about those things except people who have to read bad style... which might be you reading this code years from now.
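
As a purely hypothetical illustration, here are the question's camelCased names rewritten in PEP 8 style (naming only; for large files you would still want the chunked loop above):

import hashlib

# Naming illustration: snake_case instead of camelCase.
# This still reads the whole file at once, so prefer the chunked loop for big files.
input_file = input("Enter the name of the file: ")
with open(input_file, 'rb') as opened_file:
    file_data = opened_file.read()

md5_hashed = hashlib.md5(file_data).hexdigest()
sha1_hashed = hashlib.sha1(file_data).hexdigest()

print("MD5: {0}".format(md5_hashed))
print("SHA1: {0}".format(sha1_hashed))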

Randall Hunt
  • @ranman Hello, I couldn't get the {0}".format(sha1.hexdigest()) part. Why do we use it instead of just using sha1.hexdigest() ? – Belial Jul 08 '15 at 14:25
  • @Belial What wasn't working? I was mainly just using that to differentiate between the two hashes... – Randall Hunt Sep 11 '15 at 22:47
  • @ranman Everything is working, I just never used this and haven't seen it in the literature. "{0}".format() ... unknown to me. :) – Belial Sep 12 '15 at 11:26
  • How should I choose `BUF_SIZE`? – Martin Thoma Aug 08 '17 at 15:09
  • @ranman If you had n files, what would be the run time? I'm curious how the buffer size affects it. – TheRealFakeNews Nov 05 '17 at 19:50
  • AFAIK the asymptotic (big-O style) runtime is not different for N files when using buffers vs. when not using buffers. The real runtime may indeed be different though. It can take longer to allocate larger buffers, but allocating a buffer also has a fixed constant cost of asking the operating system to do something for you. You'd have to experiment to find something optimal. It might be worth it to have one thread going through and getting the file sizes and setting up an optimal buffer size map as you're iterating through your files. Beware premature optimizations though! – Randall Hunt Nov 06 '17 at 07:18
  • This doesn't generate the same results as the `shasum` binaries. The other answer listed below (the one using memoryview) is compatible with other hashing tools. – Robert Hafner Jan 31 '19 at 18:53
  • @tedivm Sure? Tested it with Python2/3 and got the same results compared to sha1sum and md5sum – Murmel Sep 19 '19 at 09:07
  • @RandallHunt What about using the hash's block size as buffer size, like [Mitar](https://stackoverflow.com/a/55542529/1885518) does? – Murmel Sep 19 '19 at 09:58
  • The original version of this answer was written in 2014 so it's very possible there's a better way of doing things now. I'd just add that benchmarking is probably the most effective method - the open buffer size, filesystem buffer size, and algorithm buffer size are likely all different and simply reading the block size of the hashing algo may not be the most efficient method. If someone tries it all out I'm happy to update the answer. – Randall Hunt Sep 25 '19 at 09:00
  • SHA1 should not be used anymore because it has been proven to be possible to [generate multiple files with the same SHA1 hash](https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html). SHA256 and SHA3 are considered far more secure. – user9811991 Dec 05 '20 at 02:18
78

For the correct and efficient computation of the hash value of a file (in Python 3):

  • Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
  • Don't read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
  • Eliminate double buffering, i.e. don't use buffered IO, because we already use an optimal block size.
  • Use readinto() to avoid buffer churning.

Example:

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    # buffering=0: no Python-level buffering, we read straight into our own buffer
    with open(filename, 'rb', buffering=0) as f:
        # readinto() fills the preallocated buffer and returns the number of bytes
        # read; iter(..., 0) stops the loop when that count is 0, i.e. at EOF
        for n in iter(lambda: f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()
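
A minimal usage sketch (the command-line wrapper is just an assumption about how you would call it); the output should match the first column of `sha256sum` for the same file:

if __name__ == '__main__':
    import sys
    # e.g. python3 sha256sum.py some_big_file.iso
    print(sha256sum(sys.argv[1]))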
maxschlepzig
  • How do you know what is an optimal block size? – Mitar Mar 02 '18 at 05:45
  • @Mitar, a lower bound is the maximum of the physical block size (traditionally 512 bytes, or 4 KiB with newer disks) and the system's page size (4 KiB on many systems; other common choices: 8 KiB and 64 KiB). Then you basically do some benchmarking and/or look at published [benchmark results and related work](http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/ioblksize.h;h=ed2f4a9c4d77462f357353eb73ee4306c28b37f1;hb=HEAD#l23) (e.g. check what current rsync/GNU cp/... use). – maxschlepzig Mar 02 '18 at 20:31
  • Would [`resource.getpagesize`](https://docs.python.org/2/library/resource.html#resource.getpagesize) be of any use here, if we wanted to try to optimize it somewhat dynamically? And what about [`mmap`](https://docs.python.org/3/library/mmap.html)? – jpmc26 May 14 '18 at 17:40
  • @jpmc26, getpagesize() isn't that useful here - common values are 4 KiB or 8 KiB, something in that range, i.e. something much smaller than the 128 KiB - 128 KiB is generally a good choice. mmap doesn't help much in our use case, as we sequentially read the complete file from front to back. mmap has advantages when the access pattern is more random-access-like, if pages are accessed more than once, and/or if the mmap simplifies read buffer management. – maxschlepzig May 15 '18 at 08:13
  • Unlike the "top voted" answer this answer actually provides the same results as the `shasum` function. – Robert Hafner Jan 31 '19 at 18:47
  • @tedivm, that's probably because this answer is using sha256 while Randall's answer uses sha1 and md5, which are the hashing algorithms specified by the OP. Try comparing their results to `sha1sum` and `md5sum`. – Kyle A Mar 09 '19 at 00:52
  • I benchmarked both the solution of (1) @Randall Hunt and (2) yours (in this order, which matters due to the file cache) with a file of around 116 GB and the sha1 algorithm. Solution 1 was modified to use a buffer of 20 * 4096 (PAGE_SIZE) and a buffering parameter of 0. In solution 2 only the algorithm was changed (sha256 -> sha1). Result: (1) 3m37.137s (2) 3m30.003s. The native sha1sum in binary mode: 3m31.395s – bioinfornatics Jul 19 '19 at 09:55
  • This might be a good solution for specific use cases (equally sized files, time to do benchmarking), but it is missing a note about `open()` already using buffering on its own, which might be the best option for a general-purpose implementation. See [Mitar's answer](https://stackoverflow.com/a/55542529/1885518) for more. – Murmel Sep 19 '19 at 09:53
  • @Murmel what do you mean with 'equally sized files'? This answer is a general purpose solution. If you call `open()` with `buffering=0` it doesn't do any buffering. Mitar's answer implements buffer churning. – maxschlepzig Sep 19 '19 at 17:02
24

I would propose simply:

def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

All the other answers here seem to complicate things too much. Python is already buffering when reading (in an ideal manner, or you configure that buffering if you have more information about the underlying storage), so it is better to read in chunks the hash function finds ideal, which makes it faster or at least less CPU-intensive to compute the hash. So instead of disabling buffering and trying to emulate it yourself, use Python's buffering and control what you should be controlling: what the consumer of your data finds ideal, the hash block size.
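
As a sketch of that last point (the explicit 1 MiB buffer below is only an assumed value, not a recommendation; benchmark for your own storage):

import hashlib

def get_digest_tuned(file_path, buffering=1024 * 1024):
    # Same loop as above, but with an explicit buffer size handed to open(),
    # for when you do know something about the underlying storage.
    h = hashlib.sha256()
    with open(file_path, 'rb', buffering=buffering) as file:
        while True:
            chunk = file.read(h.block_size)  # still read in hash-block-sized chunks
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()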

Mitar
  • Perfect answer, but it would be nice if you backed your statements up with the related docs: [Python3 - open()](https://docs.python.org/3/library/functions.html#open) and [Python2 - open()](https://docs.python.org/2/library/functions.html#open). Even minding the difference between the two, Python 3's approach is more sophisticated. Nevertheless, I really appreciated the consumer-centric perspective! – Murmel Sep 19 '19 at 09:28
  • `hash.block_size` is documented just as the 'internal block size of the hash algorithm'. Hashlib **doesn't** find it _ideal_. Nothing in the package documentation suggests that `update()` prefers `hash.block_size` sized input. It doesn't use less CPU if you call it like that. Your `file.read()` call leads to many unnecessary object creations and superfluous copies from the file buffer to your new chunk bytes object. – maxschlepzig Sep 19 '19 at 17:15
  • Hashes update their state in `block_size` chunks. If you are not providing them in those chunks, they have to buffer and wait for enough data to appear, or split given data into chunks internally. So, you can just handle that on the outside and then you simplify what happens internally. I find this ideal. See for example: https://stackoverflow.com/a/51335622/252025 – Mitar Sep 19 '19 at 21:04
  • The `block_size` is much smaller than any useful read size. Also, any useful block and read sizes are powers of two. Thus, the read size is divisible by the block size for all reads except possibly the last one. For example, the sha256 block size is 64 bytes. That means that `update()` is able to directly process the input without any buffering up to any multiple of `block_size`. Thus, only if the last read isn't divisible by the block size it has to buffer up to 63 bytes, once. Hence, your last comment is incorrect and doesn't support the claims you are making in your answer. – maxschlepzig Nov 05 '19 at 20:43
  • The point is that one does not have to optimize buffering because it is already done by Python when reading. So you just have to decide on the amount of looping you want to do when hashing over that existing buffer. – Mitar Nov 06 '19 at 04:44
7

I have programmed a module which is able to hash big files with different algorithms.

pip3 install py_essentials

Use the module like this:

from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")
phyyyl
  • Is it cross-platform (Linux + Win)? Is it working with Python3? Also is it still maintained? – Basj Nov 07 '20 at 17:28
  • Yes, it is cross-platform and will still work. Also, the other stuff in the package works fine. But I will no longer maintain this package of personal experiments, because it was just a learning exercise for me as a developer. – phyyyl Nov 14 '20 at 22:22
5

Here is a Python 3, POSIX solution (not Windows!) that uses mmap to map the file into memory.

import hashlib
import mmap

def sha256sum(filename):
    h  = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()
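
If you also need this on Windows, a sketch of the same idea using the cross-platform access parameter instead of the POSIX-only prot (the function name here is just for illustration):

import hashlib
import mmap

def sha256sum_portable(filename):
    # access=mmap.ACCESS_READ is accepted on both POSIX and Windows,
    # unlike prot=mmap.PROT_READ, which is POSIX-only.
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            h.update(mm)
    return h.hexdigest()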
Antti Haapala
  • Naive question ... what is the advantage of using `mmap` in this scenario? – Jonathan B. Sep 28 '20 at 17:42
  • @JonathanB. most methods needlessly create `bytes` objects in memory, and call `read` too many or too few times. This will map the file directly into the virtual memory, and hash it from there - the operating system can map the file contents directly from the buffer cache into the reading process. This means this could be faster by a significant factor over [this one](https://stackoverflow.com/a/22058673/918959) – Antti Haapala Sep 28 '20 at 18:15
  • @JonathanB. I did the test and the difference is not that significant in *this* case, we're talking about ~15 % over the naive method. – Antti Haapala Sep 28 '20 at 18:26
  • I benchmarked this vs the read-chunk-by-chunk method. This method took 3 GB of memory for hashing a 3 GB file, while maxschlepzig's answer took 12 MB. They both took roughly the same amount of time on my Ubuntu box. – Seperman Mar 17 '21 at 18:40
  • @Seperman you're measuring the RAM usage incorrectly. The memory is still available, the pages are mapped from the buffer cache. – Antti Haapala Mar 17 '21 at 19:09
  • @AnttiHaapala That makes sense. How do you recommend I measure the RAM usage of the process on Linux to see the mmap usage vs physical memory usage? – Seperman Mar 17 '21 at 22:39
  • For example when I look at Htop, these are some numbers I see: VIRT: 2884M, RES: 2122M. From my understanding RES is the physical RAM that is used. – Seperman Mar 17 '21 at 23:42
  • @Seperman well yes, that is a more appropriate number than VIRT - but I added `os.system("free")` in between at several points there and the "available" memory doesn't decrease. – Antti Haapala Mar 18 '21 at 04:34
-2
import hashlib

# Note: this hashes the string the user types in, not the contents of a file.
user = input("Enter ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()

# Write the hex digest to a file...
with open("encrypted.txt", "w") as e:
    print(h2, file=e)

# ...then read it back and print it.
with open("encrypted.txt", "r") as e:
    p = e.readline().strip()
    print(p)
Ome Mishra