I've been testing the randomness of generated values in PHP, and have been considering a 32-bit value, written as hexadecimal, to represent a unique state within a given time frame.

I wrote this simple test script:

$checks = [];
$i = 0;

while (true) {
    $hash = hash('crc32b', openssl_random_pseudo_bytes(4));

    echo $hash . PHP_EOL;

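    // Strict comparison: loose == treats numeric-looking strings such as "000001e2" and "00000100" as equal.
    if (in_array($hash, $checks, true)) {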
        echo 'Copy: ' . $i . PHP_EOL;
        break;
    }

    $i++;

    $checks[] = $hash;
}

Surprisingly (to me), this script produces a duplicate in fewer than 100,000 iterations, and sometimes in as few as 1,000.

My question is, am I doing something wrong here? Out of 4 billion possibilities, duplicates this frequent seem too unlikely.

Flosculus

1 Answer

No, this is not surprising, and there is nothing wrong with the random number generator. This is the birthday problem. With just 23 people in a room, the probability that at least two of them share a birthday is about 50%. This is perhaps counter-intuitive until you realize that there are 253 possible pairings of 23 people, so you get 253 shots at two people having the same birthday.
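
The 23-person figure can be checked directly. The following is a minimal sketch, assuming birthdays are uniform over 365 days; `collisionProbability()` and `pairCount()` are hypothetical helper names chosen for illustration.

```php
<?php
// Hypothetical helpers for illustration; assumes every outcome is equally likely.

// Exact probability that at least two of $n uniform draws from $space values coincide.
function collisionProbability(int $n, float $space): float
{
    $allDistinct = 1.0;
    for ($k = 1; $k < $n; $k++) {
        // The (k+1)-th draw must avoid the k values already taken.
        $allDistinct *= 1.0 - $k / $space;
    }
    return 1.0 - $allDistinct;
}

// Number of distinct pairs among $n items: n(n-1)/2.
function pairCount(int $n): int
{
    return intdiv($n * ($n - 1), 2);
}

printf(
    "pairs: %d, P(shared birthday): %.4f\n",
    pairCount(23),
    collisionProbability(23, 365)
);
```

For 23 people this reports 253 pairs and a probability of roughly 0.507.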

You are doing the same thing here. You are not looking to see when you hit a particular 32-bit value. Instead you are looking for a match between any two values you have created so far, which gives you a lot more chances. If you consider step 100,000, you have a 1 in 43,000 chance of matching one of the numbers you have created so far, as opposed to a 1 in 4,300,000,000 chance of matching a particular number. In the run up to 100,000, you have added up a lot of those chances.
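
To see how those per-step chances add up, here is a rough sketch that tracks both the chance of a match at a given step and the cumulative chance of having seen at least one match by then. It assumes uniform, independent 32-bit values; the checkpoints and variable names are arbitrary choices for illustration.

```php
<?php
// Sketch: how the per-step chance accumulates in a 32-bit space.
// Assumes uniform, independent values; checkpoints are arbitrary.
$space = 2 ** 32;                            // 4,294,967,296 possible values
$checkpoints = [1000, 10000, 50000, 100000];
$allDistinct = 1.0;                          // probability of no duplicate so far
$cumulative = [];

for ($step = 2; $step <= 100000; $step++) {
    // Before this draw, $step - 1 distinct values are already stored.
    $stepChance = ($step - 1) / $space;
    $allDistinct *= 1.0 - $stepChance;

    if (in_array($step, $checkpoints, true)) {
        $cumulative[$step] = 1.0 - $allDistinct;
        printf(
            "step %6d: about 1 in %.0f at this step, %.4f%% cumulative\n",
            $step,
            1 / $stepChance,
            100 * $cumulative[$step]
        );
    }
}
```

Under these assumptions the cumulative chance of at least one duplicate by step 100,000 comes out at roughly 69%, even though the chance at that single step is only about 1 in 43,000.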

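See this answer here on Stack Overflow for the calculation for a 32-bit value. You need only about 93,000 values (roughly √(2 · 2^32)) before the expected number of matching pairs reaches one, and the chance of at least one match already passes 50% at about 77,000 values.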
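
The scale of these numbers follows from the standard birthday-bound approximations, which all grow with the square root of the size of the space. This is a sketch using those approximations rather than exact values, with variable names chosen for illustration.

```php
<?php
// Birthday-bound approximations for a space of N = 2^32 equally likely values.
$space = 2 ** 32;

// Draws at which the expected number of matching pairs, n(n-1)/(2N), reaches one.
$onePair = sqrt(2 * $space);               // ≈ 92,682

// Draws at which the chance of at least one match reaches 50%.
$median = sqrt(2 * $space * log(2));       // ≈ 77,163

// Approximate mean number of draws until the first match.
$mean = sqrt(M_PI * $space / 2);           // ≈ 82,137

printf("one expected pair: %s\n", number_format($onePair));
printf("50%% chance:        %s\n", number_format($median));
printf("mean first match:  %s\n", number_format($mean));
```

All three land within a factor of about 1.5 of 2^16 = 65,536, which is why a 32-bit value starts colliding after tens of thousands of draws rather than billions.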

By the way, the use of a CRC-32 on the four-byte random value has no bearing here. The result would be the same with or without it. All you're doing is mapping each 32-bit number uniquely (one-to-one and onto) to another 32-bit number.
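
Checking all 2^32 four-byte inputs exhaustively would be slow in PHP, so the sketch below checks a smaller analogue instead: every possible two-byte input. It illustrates that CRC-32 does not merge short inputs, but it is not a proof of the four-byte case. The variable names are chosen for illustration.

```php
<?php
// Smaller analogue: confirm that no two 2-byte inputs share a CRC-32.
// This is an illustration only; it does not test 4-byte inputs.
$seen = [];

for ($i = 0; $i < 65536; $i++) {
    $input = pack('n', $i);                // 2 bytes, big-endian
    $seen[hash('crc32b', $input)] = true;  // same hash call as the test script
}

$distinct = count($seen);
printf("%d inputs, %d distinct CRC-32 values\n", 65536, $distinct);
```

If the mapping merged any two inputs, the distinct count would be lower than the input count.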

Mark Adler
  • I know, I shouldn't have mentioned the algorithm in the title. It is just how I represent the bytes as something readable. Luckily, I don't use this method with unique indexes when storing to a DB, instead I am able to compare timestamped records sequentially, so it is only ever a comparison of 2 values. I was trying to gauge the limitations of using 32bit values, and this explains it perfectly, thanks. – Flosculus May 01 '15 at 16:26
  • Note that this is why cryptographic hashes require double the output compared with block ciphers to be considered secure. With ciphers you don't have to worry about *collisions* as these identical values are called - but for hash algorithms you do. – Maarten Bodewes May 01 '15 at 21:24