Hash algorithm for dynamic growing/streaming data?

Question

Are there any hash algorithms that let you continue hashing from a known hash digest? For example, the client uploads a chunk of a file to ServerA, and I can get an md5 sum of the uploaded content. Then the client uploads the rest of the file to ServerB. Can I transfer the internal md5 state to ServerB and finish the hashing there?

Years ago I found a cool black-magic hack for md5 on comp.lang.python, but it uses ctypes against a specific version of md5.so or _md5.dll, so it is not portable across Python interpreter versions or to other programming languages. Besides, the md5 module has been deprecated in the Python standard library since 2.5, so I need a more general solution.

What's more, can the state of the hashing be stored in the hex digest itself? (So I can continue hashing a stream of data with an existing hash digest, not a dirty internal hack.)

score 2 · Answer 1 · answered May 03 '11 at 06:43

This is theoretically possible (the MD5 state so far should contain everything you need to continue), but it looks like the normal APIs don't expose what you need. If a CRC would suffice, this will probably be a lot easier, since CRCs are more commonly used for "streaming" cases like yours. See here:

binascii.crc32(data[, crc])

crc32() accepts an optional crc input which is the checksum to continue from.
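As a minimal sketch of that optional argument, the snippet below feeds the running checksum back into `binascii.crc32()` for each chunk. The helper name `crc32_of_chunks` is hypothetical and used only for illustration; the running value is an ordinary integer, so it could be sent from one server to another.

```python
import binascii

def crc32_of_chunks(chunks):
    """Compute a CRC-32 over an iterable of byte chunks, one chunk at a time."""
    crc = 0  # 0 is the documented starting value when no previous CRC is given
    for chunk in chunks:
        # Pass the checksum so far as the second argument to continue from it.
        crc = binascii.crc32(chunk, crc)
    return crc & 0xFFFFFFFF  # mask keeps the result unsigned on every Python version

# "ServerA" hashes the first chunk and hands over the integer checksum ...
partial = binascii.crc32(b"hello, ")
# ... and "ServerB" continues from that value with the remaining data.
resumed = binascii.crc32(b"world!", partial)

assert resumed == binascii.crc32(b"hello, world!")
```

Keep in mind that a CRC only detects accidental corruption; it is not a cryptographic hash.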

Hope that helps.

abbot · Accepted Answer · 2011-05-03T11:52:39.263

Not from the known digest, but from the known state. You can use a pure Python MD5 implementation and save its state. Here is an example using _md5.py from PyPy:

import _md5

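def md5_getstate(md):
    # Copy the mutable lists so later updates to md do not alter the saved state.
    return (md.A, md.B, md.C, md.D, md.count + [], md.input + [], md.length)

def md5_continue(state):
    # Create a fresh hasher and overwrite its internals with the saved state.
    md = _md5.new()
    (md.A, md.B, md.C, md.D, md.count, md.input, md.length) = state
    return md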

m1 = _md5.new()
m1.update("hello, ")
state = md5_getstate(m1)
m2 = md5_continue(state)
m2.update("world!")
print m2.hexdigest()

m = _md5.new()
m.update("hello, world!")
print m.hexdigest()
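To show what such a saved state consists of without depending on PyPy's _md5.py, here is a small self-contained Python 3 sketch written from the RFC 1321 description of MD5. It is not PyPy's code: the class name `ResumableMD5` and its `getstate()` method are hypothetical names chosen for this illustration, and the implementation is meant for clarity rather than speed. The state is just the four chaining words, the bytes not yet forming a full 64-byte block, and the total length.

```python
import hashlib
import math
import struct

# Per-step rotation amounts and sine-derived constants from RFC 1321.
_S = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
_K = [int(abs(math.sin(i + 1)) * 2**32) & 0xFFFFFFFF for i in range(64)]

def _rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

class ResumableMD5:
    def __init__(self, state=None):
        # state = (A, B, C, D, unprocessed tail bytes, total byte count)
        self.h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476]
        self.tail = b""
        self.length = 0
        if state is not None:
            a, b, c, d, self.tail, self.length = state
            self.h = [a, b, c, d]

    def getstate(self):
        # Plain ints and bytes, so the tuple is easy to serialise and send elsewhere.
        return (*self.h, self.tail, self.length)

    def _compress(self, block):
        m = struct.unpack("<16I", block)
        a, b, c, d = self.h
        for i in range(64):
            if i < 16:
                f, g = (b & c) | (~b & d), i
            elif i < 32:
                f, g = (d & b) | (~d & c), (5 * i + 1) % 16
            elif i < 48:
                f, g = b ^ c ^ d, (3 * i + 5) % 16
            else:
                f, g = c ^ (b | ~d), (7 * i) % 16
            f = (f + a + _K[i] + m[g]) & 0xFFFFFFFF
            a, d, c, b = d, c, b, (b + _rotl(f, _S[i])) & 0xFFFFFFFF
        self.h = [(x + y) & 0xFFFFFFFF for x, y in zip(self.h, (a, b, c, d))]

    def update(self, data):
        self.length += len(data)
        buf = self.tail + data
        full = len(buf) - len(buf) % 64
        for off in range(0, full, 64):
            self._compress(buf[off:off + 64])
        self.tail = buf[full:]  # keep the incomplete block for the next update
        return self

    def hexdigest(self):
        # Pad and finalise on a copy so this object can keep accepting updates.
        tmp = ResumableMD5(self.getstate())
        pad = b"\x80" + b"\x00" * ((55 - self.length) % 64)
        tmp.update(pad + struct.pack("<Q", (self.length * 8) & 0xFFFFFFFFFFFFFFFF))
        return struct.pack("<4I", *tmp.h).hex()

first = ResumableMD5().update(b"hello, ")
state = first.getstate()
second = ResumableMD5(state).update(b"world!")
assert second.hexdigest() == hashlib.md5(b"hello, world!").hexdigest()
```

Because the padding and length are only mixed in when the digest is finalised, the digest itself is not a resumable state for the original message, which is why the state has to be captured separately.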

As e.dan noted, you can also use almost any checksumming algorithm (CRC, Adler, Fletcher), but they only protect you against random errors, not against intentional data modification.

EDIT: of course, you can also re-implement the ctypes-based serialization method from the thread you referenced in a more portable way (without magic constants). I believe this should be version- and architecture-independent (tested on Python 2.4-2.7, on both i386 and x86_64):

# based on idea from http://groups.google.com/group/comp.lang.python/msg/b1c5bb87a3ff5e34

try:
    import _md5 as md5
except ImportError:
    # python 2.4
    import md5
import ctypes

def md5_getstate(md):
    if type(md) is not md5.MD5Type:
        raise TypeError('not an MD5Type instance')
    # Read the raw bytes of the C struct that follow the generic object header.
    return ctypes.string_at(id(md) + object.__basicsize__,
                            md5.MD5Type.__basicsize__ - object.__basicsize__)

def md5_continue(state):
    md = md5.new()
    assert len(state) == md5.MD5Type.__basicsize__ - object.__basicsize__, \
           'invalid state'
    # Overwrite the fresh object's internal struct with the saved bytes.
    ctypes.memmove(id(md) + object.__basicsize__,
                   ctypes.c_char_p(state),
                   len(state))
    return md

m1 = md5.new()
m1.update("hello, ")
state = md5_getstate(m1)
m2 = md5_continue(state)
m2.update("world!")
print m2.hexdigest()

m = md5.new()
m.update("hello, world!")
print m.hexdigest()

This code is not Python 3 compatible, since Python 3 does not have an _md5/md5 module.

Unfortunately, hashlib's openssl_md5 implementation is not suitable for such hacks, since the OpenSSL EVP API does not provide any calls to reliably serialize EVP_MD_CTX objects.

Pypy's pure python MD5 implementation is cool. But how about openssl_md5 shipped with standard CPython? — est, May 03 '11 at 07:27
@est, you can't reliably do this for openssl_md5, because openssl itself does not provide an API for EVP_MD_CTX serialization, so any implementation will be openssl version-dependent. But you can still make a hack for _md5 module from CPython. I will add it to my answer. — abbot, May 03 '11 at 11:32
That's awesome, man! But why can't we use `ctypes` to call `libopenssl.so` and get `EVP_MD_CTX` directly? — est, May 04 '11 at 07:28
libssl.so won't help, because, as I've said, it does not provide an API for EVP_MD_CTX serialization, only for in-memory copying. So you can't use the documented API to "store" an EVP_MD_CTX in a byte array and "restore" it somewhere else; you can only copy it within your running process. And for copying you can just use the copy() method provided by Python's hashlib API. Of course, you can hack libssl.so and serialize the EVP_MD_CTX manually, but that will be OpenSSL version-dependent. — abbot, May 04 '11 at 09:30
Thanks. It looks like this does indeed need some hack: http://stackoverflow.com/questions/5880456/expose-hashlib-pyd-internals-for-evp-md-ctx — est, May 04 '11 at 10:52
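
For reference, the in-process copying mentioned in the comments above can be sketched with hashlib's documented copy() method. This only duplicates the hasher inside the running process; it does not produce bytes that could be sent to another server. The variable names are illustrative.

```python
import hashlib

# Hash the shared prefix once.
base = hashlib.md5(b"hello, ")

# copy() clones the internal state; the two objects then evolve independently.
fork = base.copy()
fork.update(b"world!")
base.update(b"there!")

assert fork.hexdigest() == hashlib.md5(b"hello, world!").hexdigest()
assert base.hexdigest() == hashlib.md5(b"hello, there!").hexdigest()
```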

score 1 · Answer 3 · answered Jul 12 '17 at 18:37

I was facing this problem too, and found no existing solution, so I wrote a library that uses ctypes to deconstruct the OpenSSL data structure holding the hasher state: https://github.com/kislyuk/rehash. Example:

import pickle, rehash
hasher = rehash.sha256(b"foo")
state = pickle.dumps(hasher)

hasher2 = pickle.loads(state)
hasher2.update(b"bar")

assert hasher2.hexdigest() == rehash.sha256(b"foobar").hexdigest()
