How to access millions of bits for hashing

Question

I'm doing MD5 hashing on an executable. I've used a python script to read binary from the executable into a text file, but if I were to read in this constructed file to a C program, I would be handling MBs of data, as the ones and zeroes are being treated as chars, taking 8 bits for each 1 bit number. Would it be possible to read these in as single bits each? How badly would a program perform if I made, say, a 10MB array to hold all the characters I might need for the length of the binary conversion and padding for the hash? If this is unthinkable, would there be a better way to manipulate the data?

First and foremost, don't use MD5 -- there's just no reasonable excuse for using it in this day and age. — Jerry Coffin, Mar 29 '13 at 18:29
It's a first step for research I'm doing. The most important thing is familiarizing myself with hashing. We'll be switching to a better hashing algorithm afterwards. — Dolphiniac, Mar 29 '13 at 18:30
Doesn't really make sense -- basically all you're doing (or all you've mentioned, anyway) is hashing. So basically, you're talking about doing something, then throwing it out completely and starting over from day one. And that'd be doubly true if you're starting from a string of 1's and 0's as characters, as you seem to be describing. — Jerry Coffin, Mar 29 '13 at 18:33
@Dolphiniac are you saying that the file you're reading in uses one byte for each bit? Like a text file with only the characters `0` and `1`? — Drew Dormann, Mar 29 '13 at 18:34
You've mentioned handling MBs of data and wondered how the program would perform... if you have a lot of data that is initially in binary, and you want it to perform well, then don't convert it to text, handle it in binary. — amdn, Mar 29 '13 at 18:37
@Jerry, that's what we've been instructed to do. There's no use dwelling on why. @Drew, yes. The text file consists of binary in `char` form. — Dolphiniac, Mar 29 '13 at 18:37
@amdn, can a C program do that? Read in the binary straight from the executable? That would be great, if so. — Dolphiniac, Mar 29 '13 at 18:38
Yes it can... an executable is "just" a file... with execute permission. — amdn, Mar 29 '13 at 18:39
@Dolphiniac If you mean read in the executable itself, then yes, that's what would happen if you open the executable file and read it. — nos, Mar 29 '13 at 18:39
Okay, then how would you do this? Standard `FILE*` to start with? What about handling the binary? Would an array be the best way to store the bits? How should I format structures to fit the data? — Dolphiniac, Mar 29 '13 at 18:43
Here's one idea of a basic starting point: http://codereview.stackexchange.com/questions/13288/code-for-sha-256 — Jerry Coffin, Mar 29 '13 at 18:44
Personally I like storing my bits in *bytes*. Call me crazy. — WhozCraig, Mar 29 '13 at 18:45
@JerryCoffin haha, that's an *awesome* analogy... mouthfuls of sawdust. I'm dying here. — Nik Bougalis, Mar 29 '13 at 19:13

user123 · Accepted Answer · 2013-03-29T19:47:21.737

Since you tagged the question C and C++, I'll go for C.

Would it be possible to read these in as single bits each?

Yes, just read 8 characters at a time from the text file and combine those 1s and 0s into a single byte. You don't need to make a 10MB array for this.

First, read 8 characters from the file. Convert each one from its character value ('0' or '1') to the integer 0 or 1, then shift and OR the eight values together, most significant bit first, to make a new byte.

unsigned char bits[8];
while (fread(bits, 1, 8, file) == 8) {
    for (unsigned int i = 0; i < 8; i++) {
        bits[i] -= '0';
    }

    unsigned char byte = (bits[0] << 7) | (bits[1] << 6) |
                (bits[2] << 5) | (bits[3] << 4) |
                (bits[4] << 3) | (bits[5] << 2) |
                (bits[6] << 1) | (bits[7]     );

    /* update MD5 Hash here */
}

Then, you would update your MD5 hash with the newly read byte.
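The loop above trusts that every character is a `0` or a `1`. A text file written by a script may also contain a trailing newline or other stray characters, so a checked version of the packing step can be useful. The following is only a sketch; `pack_bits` is a hypothetical helper name, not something from the original answer or from any library.

```c
#include <stddef.h>

/* Packs eight '0'/'1' characters (most significant bit first) into one byte.
 * Returns 0 on success, -1 if any character is not '0' or '1'.
 * pack_bits is a hypothetical helper name used for illustration only. */
int pack_bits(const char text[8], unsigned char *out)
{
    unsigned char byte = 0;
    for (size_t i = 0; i < 8; i++) {
        if (text[i] != '0' && text[i] != '1') {
            return -1;  /* reject newlines, spaces, or other stray input */
        }
        /* Shift the bits collected so far left and append the new one. */
        byte = (unsigned char)((byte << 1) | (text[i] - '0'));
    }
    *out = byte;
    return 0;
}
```

For example, calling `pack_bits("01000001", &b)` stores 0x41 in `b`, while input containing any other character is rejected instead of silently producing a wrong byte.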

Edit: Since a typical MD5 implementation would have to break the input into chunks of 512 bits before processing, you can get rid of that overhead in the implementation itself (not recommended though), and just read 512 bits (64 bytes) from the file and update the hash afterwards directly.

unsigned char buffer[64];
unsigned char bits[8];
unsigned int index = 0;

while (fread(bits, 1, 8, file) == 8) {
    for (unsigned int i = 0; i < 8; i++) {
        bits[i] -= '0';
    }

    buffer[index++] = (bits[0] << 7) | (bits[1] << 6) |
                      (bits[2] << 5) | (bits[3] << 4) |
                      (bits[4] << 3) | (bits[5] << 2) |
                      (bits[6] << 1) | (bits[7]     );

    if (index == 64) {
        index = 0;
        /* update MD5 hash with 64 byte buffer */
    }
}

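/* Handle the final partial block, if any; the file rarely holds an exact multiple of 512 bits. */
/* Caution: zero fill alone is not MD5 padding, which appends a 1 bit, then zeros, then a 64-bit message length. */
/* With a library's update/finalize API, pass only the `index` leftover bytes and let finalize do the padding. */
if (index != 0) {
    while (index != 64) {
        buffer[index++] = 0;
    }
    /* update MD5 hash with the zero-filled buffer (only valid if your own code adds the real padding). */
}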
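The zero fill above is not what MD5 itself specifies. As noted in the comments, MD5 padding starts with a single 1 bit and reserves the last 64 bits for the message length. If you are writing the block processing yourself rather than calling a library's finalize function, the last block(s) could be built roughly as follows. This is a sketch under the assumption that you track the total message length in bytes; `md5_pad_tail` is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/* Builds the final padded block(s) of an MD5 message (padding per RFC 1321).
 * tail      : the last tail_len (< 64) message bytes not yet hashed;
 *             must be a valid pointer even when tail_len is 0
 * total_len : length of the whole message in bytes
 * out       : receives 64 or 128 bytes
 * Returns the number of bytes written to out.
 * md5_pad_tail is a hypothetical helper name used for illustration only. */
size_t md5_pad_tail(const unsigned char *tail, size_t tail_len,
                    uint64_t total_len, unsigned char out[128])
{
    /* One block is enough only if the 0x80 byte and the 8-byte length
     * both fit after the tail; otherwise a second block is needed. */
    size_t padded_len = (tail_len < 56) ? 64 : 128;
    uint64_t bit_len = total_len * 8;

    memset(out, 0, padded_len);
    memcpy(out, tail, tail_len);
    out[tail_len] = 0x80;  /* a single 1 bit followed by zero bits */

    /* MD5 stores the length in bits, least significant byte first. */
    for (size_t i = 0; i < 8; i++) {
        out[padded_len - 8 + i] = (unsigned char)(bit_len >> (8 * i));
    }
    return padded_len;
}
```

The returned 64 or 128 bytes would then be fed through the same 64-byte block routine as the rest of the data. When a hashing library is used instead, its finalize call normally performs this step for you.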

edited Mar 29 '13 at 19:47

answered Mar 29 '13 at 18:48

user123


Personally I'd shoot for (8 * N) chars, where N is the block size of the underlying hash algorithm. SHA-1, for example has a 512-bit block size, SHA-2's (224,256) is likewise 512 bits, SHA-2(384/512) is 1024 bits, etc... – WhozCraig Mar 29 '13 at 19:00
That's a pretty good idea. In this case, he'd have to read 64 bytes before updating the hash. This way, he won't need the overhead of processing the input into 512-bit blocks. – user123 Mar 29 '13 at 19:06
He wouldn't *have* to, but any self-respecting hash algorithm implementation is just going to sit on the data until the block size is full or the finalizer is fired. For SHA-1, for example, reading in 8*512 chars, xlating them to bytes, then submitting the block would likely at least assist in reducing the number of hash-api-calls. Of course, taking that to its extreme if memory was reasonable you could just stream the data through a filter that fills a std::vector with the real bytes, and once you're done with your bit-file, send the vector in a single hash+finalize. – WhozCraig Mar 29 '13 at 19:10
Yes, that _would_ optimize things, but I was more concerned about memory usage when I was typing my answer :/ – user123 Mar 29 '13 at 19:12
May I ask what will happen in the above edited algorithm when there is not a 512-byte chunk remaining in the file? Also, how would an EOF affect it? – Dolphiniac Mar 29 '13 at 19:33
Oh, I forgot about padding. If the number of bits wasn't exactly a multiple of 512, you would have to pad the buffer with 0s before updating the hash again. (I'll update the code in the second case) – user123 Mar 29 '13 at 19:44
Thanks. I'll need to make some changes (padding starts with `1` bit and leaves 64 bits for length of message) but overall, great answer. – Dolphiniac Mar 29 '13 at 19:56
@Magtheridon96 You don't pad the block, just submit the last chunk of however many bytes you have left, then finalize(). The API will take care of the rest. The number of bits *better* be a multiple of 8, however. – WhozCraig Mar 29 '13 at 23:24
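Putting the comments from both threads together: the text-file detour can be skipped by opening the executable itself in binary mode, reading it in block-sized chunks, and submitting whatever is left at the end before finalizing. Below is a minimal sketch of that reading loop. The names `chunk_fn`, `feed_file`, and `count_bytes` are hypothetical; the callback stands in for whatever update function your hash implementation provides.

```c
#include <stdio.h>

/* Hypothetical callback type standing in for a hash library's update call. */
typedef void (*chunk_fn)(void *ctx, const unsigned char *data, size_t len);

/* Reads `path` in binary mode and hands its contents to `update` in chunks
 * of at most 64 bytes. Returns 0 on success, -1 on open or read failure. */
int feed_file(const char *path, chunk_fn update, void *ctx)
{
    FILE *file = fopen(path, "rb");  /* "rb": raw bytes, no newline translation */
    if (file == NULL) {
        return -1;
    }

    unsigned char buffer[64];
    size_t got;
    while ((got = fread(buffer, 1, sizeof buffer, file)) > 0) {
        update(ctx, buffer, got);    /* the last chunk may be shorter than 64 */
    }

    int failed = ferror(file);
    fclose(file);
    return failed ? -1 : 0;
}

/* Example callback: only totals the bytes seen. A real one would pass
 * `data` and `len` to the hash implementation's update function. */
void count_bytes(void *ctx, const unsigned char *data, size_t len)
{
    (void)data;
    *(size_t *)ctx += len;
}
```

Memory use stays at one 64-byte buffer regardless of file size, which addresses the original worry about a 10MB array. After `feed_file` returns successfully, the caller would invoke the hash implementation's finalize step.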
