
Non-cryptographic hashes such as MurmurHash3 and xxHash are designed almost exclusively for hash tables, yet they appear to perform comparably to (and even better than) CRC-32, Adler-32, and Fletcher-32. Non-crypto hashes are often faster than CRC-32 and produce more "random" output, similar to slower cryptographic hashes (MD5, SHA). Despite this, I only ever see CRC-32 or MD5 recommended for data-integrity/checksum purposes.

In the table below, I tested 32-bit checksum/CRC/hash functions to determine how well they detect small differences in data:

[Table of results: collision counts and min/max per-bit probabilities for each hash/checksum, omitted]

The results in each cell mean: A) the number of collisions found, and B) the minimum and maximum probability that each of the 32 output bits is set to 1. To pass test B, the min and max should be as close to 50 as possible. Anything under 45 or over 55 indicates bias.
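The two measurements can be sketched as follows. This is an illustrative harness, not the one used for the table: `zlib.crc32` stands in for any 32-bit hash under test, and runs of null bytes approximate a "NullBytes"-style input set.

```python
import zlib

def evaluate(hash32, inputs):
    outputs = [hash32(m) & 0xFFFFFFFF for m in inputs]
    # Test A: inputs whose output was already produced by another input.
    collisions = len(outputs) - len(set(outputs))
    # Test B: for each of the 32 output bits, the percentage of outputs
    # in which that bit is 1; only the min and max are reported.
    probs = [100 * sum((h >> b) & 1 for h in outputs) / len(outputs)
             for b in range(32)]
    return collisions, min(probs), max(probs)

inputs = [b"\x00" * n for n in range(1, 10001)]  # NullBytes-style inputs
collisions, lo, hi = evaluate(zlib.crc32, inputs)
print(collisions, round(lo, 1), round(hi, 1))
```

A well-behaved 32-bit check should report zero collisions here (distinct-length zero runs are far shorter than the CRC-32 polynomial's period) and min/max percentages hovering near 50.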


Looking at the table, MurmurHash3 and Jenkins lookup2 compare favorably to CRC-32 (which actually fails one test). They are also well-distributed. DJB2 and FNV1a pass collision tests but aren't well distributed. Fletcher32 and Adler32 struggle with the NullBytes and 8RandBytes tests.

So then my question is, compared to other checksums, how suitable are 'non-cryptographic hashes' for detecting errors or differences in files? Is there any reason a CRC-32/Adler-32/CRC-64 might outperform any decent 32-bit/64-bit hash?

bryc
  • https://eklitzke.org/crcs-vs-hash-functions – Eugene Sh. Feb 09 '18 at 20:07
  • For error detection, you want a high likelihood that flipping any bit in the input produces a different output. Ideally, you want comparably high likelihood for combinations of two or more bit flips. The tests you've performed do not seem to address that. The per-bit probability of that bit being 1 in the result ignores the possibility of correlations. – John Bollinger Feb 09 '18 at 20:45
  • Concerning "can't we design anything better/faster?" is `simphash()` faster? – chux - Reinstate Monica Feb 09 '18 at 20:46
  • @chux simphash is quite fast due to its simplicity, at least in JavaScript: https://jsperf.com/hashperftest/1. But I imagine algorithms which read more than one byte at a time can be faster in native C/C++. – bryc Feb 09 '18 at 20:50
  • For example, consider this algorithm: (1) set the result to zero; (2) for each byte of input, if the parity of the most-significant bit is the same as the parity of the byte's position in the input, then flip all the bits of the result. If I understand the tests correctly, that will produce near-ideal results in *all* of them (and it could be made very fast!), but it would be extremely ineffective for error detection. – John Bollinger Feb 09 '18 at 20:51
  • @JohnBollinger I'm not entirely sure how that test works. I did something similar though: In these tests, I tallied all the set bits (1) for each of the 32 output bits, and computed the average over the entire test values, which would give 32 percentage values such as (0, 32, 40, 60, 80, 100) of how often the bit is set to 1. In the table, I show just the min and max value. In good hashes, these hover around the 50% mark of a large sample size. Edit: right, so it is not the best way to test this. – bryc Feb 09 '18 at 20:58
  • Yes. Good hashes will perform well in your tests, but hashes that perform well in your tests are not necessarily good. – John Bollinger Feb 09 '18 at 21:04
  • I don't understand your Bytes1to255 test. Can you expound on that? – Mark Adler Apr 16 '18 at 18:10
  • @MarkAdler Starting at index 0, it increments each byte from 1-255 and moves to the next. e.g. `{255}, {255, 1}, {255, 255, 1...}` 72 times. Which produces `72 * 255` unique input messages in 1-72 bytes. The idea is that the sum of the bytes is always higher than the previous message (which is why it produces no collisions for a sum checksum). Also, **RollingBit** test is simply toggling **bit 0** of each byte of a fixed **10240**-byte array e.g. `{1, 0, 0...} -> {0, 0, 1..}`. Only one bit is set at any time. There's a typo there, so worth mentioning. – bryc Apr 16 '18 at 21:05
  • Your Bytes1to255 test tickles a particular property of a CRC whose register is initialized to all 1's and is then fed all 1's, i.e. your sequence of 255's. The collisions you count are _not_ representative of the average behavior of the CRC. The 254 collisions occur after only _five_ bytes (you don't need to go out to 72 to see them). All of the collisions are of the form CRC(3*255+n) == CRC(4*255+~n), where by "*" I mean repeat that many, and by "+" I mean concatenate. "~" means bit-wise inverse. – Mark Adler Apr 17 '18 at 01:59
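
The parity algorithm John Bollinger describes above can be made concrete; this sketch (an illustration, not code from the thread) shows why per-bit statistics alone can't certify a hash:

```python
import os

def parity_check32(data: bytes) -> int:
    """Adversarial 'checksum': flip all 32 result bits whenever a byte's
    most-significant bit matches the parity of its position."""
    r = 0
    for i, b in enumerate(data):
        if (b >> 7) == (i & 1):
            r ^= 0xFFFFFFFF
    return r

# On random input, each output bit is 1 roughly 50% of the time, so the
# function passes a per-bit balance test. Yet it only ever emits two
# distinct values, making it useless for error detection.
outputs = [parity_check32(os.urandom(64)) for _ in range(1000)]
print(sorted(set(outputs)))  # a subset of [0, 4294967295]
```

All 32 output bits are perfectly correlated, which the min/max-per-bit statistic cannot see.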

1 Answer


Is there any reason this function would be inferior to CRC-32 or Adler-32 for detecting errors in data?

Yes, for certain kinds of error characteristics. A CRC can be designed to very effectively detect small numbers of bit errors in a packet, as you might expect on an actual communications or storage channel. That's what it's designed for.
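For instance, any CRC whose polynomial has more than one term is guaranteed to detect every single-bit error. A quick empirical check with Python's `zlib` (an illustration, not part of the original answer):

```python
import zlib

msg = bytearray(b"an example packet payload for the check")
original = zlib.crc32(msg)

undetected = 0
for i in range(len(msg)):
    for bit in range(8):
        msg[i] ^= 1 << bit          # inject a single-bit error
        if zlib.crc32(msg) == original:
            undetected += 1
        msg[i] ^= 1 << bit          # restore the byte
print(undetected)  # 0: every single-bit flip changes the CRC
```

A general-purpose hash usually behaves well here too, but only the CRC carries a mathematical guarantee for these error patterns.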

For large numbers of errors, any 32-bit check that fills the 32 bits and does a reasonably good job of being sensitive to all of the bits of the packet will work about as well as any other. So yours would be as good as a CRC-32, and a smidge better than an Adler-32. (The Adler-32 deliberately does not use all possible 32-bit values, so it has a slightly higher false-positive rate than 32-bit checks that use all possible values.)
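The gap in Adler-32's output space is easy to verify: both 16-bit halves of the checksum are residues modulo 65521, so any value with either half in 65521..65535 can never occur. A sketch using Python's `zlib` (illustrative, not from the original answer):

```python
import os
import zlib

MOD = 65521  # largest prime below 2**16; both Adler-32 halves are reduced mod this
for _ in range(1000):
    chk = zlib.adler32(os.urandom(128))
    s1, s2 = chk & 0xFFFF, chk >> 16
    assert s1 < MOD and s2 < MOD  # 15 of the 65536 values per half are unreachable

# Reachable fraction of the 32-bit space: (65521/65536)**2
print(round((MOD / 65536) ** 2 * 100, 2))  # 99.95
```

The shortfall is tiny (about 0.05% of the space), which is why the answer calls the difference a smidge.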

By the way, looking a little more at your algorithm, it does not distribute over all 32-bit values until you have many bytes of input. So your check would not be as good as any other 32-bit check on a large number of errors until you have covered the possible 32-bit values of the check.

Mark Adler
  • And I guess we can accept you as an authority on Adler-32 design decisions! :-) – John Bollinger Feb 09 '18 at 21:08
  • So if I understand how CRC works, it will _always_ detect the errors it was designed to detect. Meaning that, per the mathematical proofs, those errors can never cause a collision: certain burst errors within `n` bits of the polynomial. Hash functions lack this guarantee, and simply 'avoid' collisions with sufficient mixing functions. – bryc Feb 09 '18 at 23:48
  • The guarantees are for specific message lengths and numbers of error bits. Look at [Koopman's work](https://users.ece.cmu.edu/~koopman/crc/) for examples. – Mark Adler Feb 10 '18 at 00:41
  • Testing is done for the specific error characteristics and packet sizes of interest. You simply generate many, many random errors that match your characteristics and see how often your check value is unchanged by the errors. – Mark Adler Feb 10 '18 at 00:43