
If I want to generate a checksum for some product attributes concatenated with a custom string, and use this checksum later to see if any attribute of the same product changed over time, is PHP's crc32 function suitable?

For example, let's say I have a product with the following attributes:

color: red
size: xl

I am trying to get the checksum for this product by creating the string red||xl and then running the crc32 function on it. If, later on, the size is stored differently, or the product gets a new attribute, I want to detect this difference through a change in the product's checksum.

The bottom line is: am I safe using the crc32 function for this, or should I opt for a slower but more secure hashing algorithm with fewer collisions?
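For reference, the approach described above might be sketched like this (the attribute names and delimiter are taken from the example; sorting by key is an added assumption, to keep the checksum stable regardless of attribute order):

```php
<?php
// Hypothetical product attributes, as in the example above.
$attributes = ['color' => 'red', 'size' => 'xl'];

// Sort by key so that attribute order never changes the checksum.
ksort($attributes);

// Concatenate values with the "||" delimiter described above.
$input = implode('||', $attributes);   // "red||xl"

// crc32() returns a 32-bit integer checksum of the string.
$checksum = crc32($input);
```

Comparing the stored checksum against a freshly computed one would then flag any product whose attributes changed.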

Adam Baranyai
  • Does this answer your question? [Probability of collision when using a 32 bit hash](https://stackoverflow.com/questions/14210298/probability-of-collision-when-using-a-32-bit-hash) – iainn Dec 30 '20 at 21:27
  • Short version: it depends how many inputs you're going to have. But the difference in computational effort between crc32 and a more complex hashing function (even md5 would provide a huge amount of benefit, for a non-secure application) is probably negligible. Something with a 128-bit output is going to be *drastically* more reliable. – iainn Dec 30 '20 at 21:28

1 Answer


Define "suitable".

I see little reason to not use a cryptographic hash here. If you have a) millions of strings, or b) malicious users who would like to trick you into thinking the attributes are the same, you should use a cryptographic hash. SHA-256 should be readily available. Feel free to use only 128 bits of it if you need to save the space. You'll still have a negligible probability of collision.
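In PHP, truncating a SHA-256 digest to 128 bits might look like this (the input string is taken from the question's example):

```php
<?php
$input = 'red||xl';

// Full SHA-256 digest as a 64-character hex string (256 bits).
$full = hash('sha256', $input);

// Keep only the first 32 hex characters (128 bits) to save space;
// the collision probability remains negligible at this size.
$truncated = substr($full, 0, 32);
```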

Update:

Based on the comments, a cryptographic hash would not be of benefit. I would recommend XXH128 as an extremely fast hash that would effectively eliminate the possibility of collision.
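Assuming a recent enough PHP (the hash extension added the xxHash family, including xxh128, in PHP 8.1), this could be as simple as:

```php
<?php
$input = 'red||xl';

// xxh128 produces a 128-bit digest as a 32-character hex string,
// at a fraction of the cost of a cryptographic hash.
$checksum = hash('xxh128', $input);
```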

Mark Adler
  • I am not worried about the space (but indeed that could also cause a problem), I am trying to save time. At the moment, I am talking about potentially hashing ~3 million products, 12 times a day, and while this is still not so big of a time, considering all the other things that should run on the server, and the possibility of later on having 10 or 100 times the number of items in the database, I am trying to opt for the fastest hashing algorithm that still suits my needs. Users won't be able to input data here, only administrators, so tricking the system is not a problem imo. – Adam Baranyai Dec 31 '20 at 14:06