
I have a large file which I will upload in chunks using Python. Each chunk will be ~4MB and the file could be quite large. I would like to calculate, as efficiently as possible, an MD5 value for each of the chunks as well as an MD5 for the entire file. I fully understand how to calculate MD5 from the hashlib reference docs and other Stack Overflow questions on efficiently calculating MD5 values for large files.

The easiest solution I see is to have a hashlib.md5() instance for each chunk and one for the total data. However, this means effectively running the MD5 algorithm twice over the full data and doing a bunch of digesting. I can optimize this ever so slightly by calling copy() on the first hashlib.md5() instance after it processes the first chunk, but after that point I don't see how to do this more effectively.
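
A minimal sketch of what I mean (the function name and 4MB chunk size are just for illustration, not part of any real upload API):

import hashlib

def md5_per_chunk_and_total(path, chunk_size=4 * 1024 * 1024):
    total = hashlib.md5()      # hash over the entire file
    chunk_digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk_digests.append(hashlib.md5(chunk).hexdigest())  # per-chunk MD5
            total.update(chunk)                                   # whole-file MD5
    return chunk_digests, total.hexdigest()

Every byte goes through MD5 twice here (once for its chunk, once for the total), and the copy() trick only saves redoing the very first chunk.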

Is there a better way I can basically combine the MD5 values for each chunk into a total MD5 for the full file using Python?

  • Have you profiled your code to see if this is actually any sort of bottleneck? I assume the disk access cost will dwarf the md5sum cost – Joran Beasley Feb 26 '16 at 00:32
  • It will, many things will dwarf the cost. But disk cost is not something I can change, and MD5 is. Between doing and not doing MD5 we do see a difference. If there's a way to optimize this, why not consider it? – Emily Gerner Feb 26 '16 at 23:06
  • Bah, fair point... but no, there's no way to combine MD5s to get the same MD5 the entire corpus returns – Joran Beasley Feb 26 '16 at 23:08
  • Figured that might be the case but I wasn't familiar enough with the MD5 algorithm itself to know if there was some way to hash components. Oh well. – Emily Gerner Feb 29 '16 at 17:57

1 Answer


You can modify the answer in the other thread you linked:

import hashlib
import os

def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()  # running hash over the entire file
    with open(os.path.join(rootdir, filename), "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)                          # feed the whole-file hash
            yield hashlib.md5(buf).hexdigest()     # yield this chunk's MD5
    yield m.hexdigest()                            # finally, the whole-file MD5

This keeps a running MD5 total for the whole file as it iterates, so you at least only read the file contents once.

(Note that you would call it like this:)

md5s = list(generate_file_md5("/path/", "file.txt", blocksize=2**22))  # ~4MB chunks
md5s[-1]   # the whole-file checksum
md5s[:-1]  # the per-chunk md5s
  • Going to give this a +1 since I appreciate the code as best effort (i.e., it doesn't do the impossible of reusing the chunk md5s, but is probably the best we can do) and I think it's valuable to have on Stack Overflow. Then for posterity I'm going to mark my own question as a duplicate since I effectively found an answer that tells me what I want to do is not possible. :) – Emily Gerner Mar 02 '16 at 02:16