2

I am using a hash algorithm to create a primary key for a database table. I use the SHA-1 algorithm which is more than fine for my purposes. The database even ships an implementation for SHA-1. The function computing the hash is returning a hex value as 40 characters. Therefore I am storing the hex characters in a char(40) column.

The table will have lots of rows, >= 200 Mio. rows which is why I am looking for less data intensive ways of storing the hash. 40 characters times ~200 Mio. rows will require some GB of storage... Since hex is base16 I thought I could try to store it in base 256 in hope to reduce the amount of characters needed to around 20 characters. Do you have tips or papers on implementations of compression with base 256?

MaNa
  • 41
  • 4
  • 2
    I think this is a programming question, which is off-topic. The object you're trying to store is a 160-bit number and you should just store it as a 160-bit number, instead of storing its hex digits as a string. (Analogously, if you were trying to store a 5-digit decimal number, you'd store it in an `int`, not in a `char[5]`.) – David Richerby Feb 09 '17 at 12:57
  • @DavidRicherby I cannot store it a `int` since I need a collision free key and the output of the hashing function is `char(40)`. I believe that storing a hex string to an int would require more bytes since the number would be very large. –  Feb 09 '17 at 14:43
  • Sure, an `int` won't hold a 160-bit number. But storing the data in some kind of integer format will use less space, not more. will let you store 8 bits per byte, whereas storing it as a string only gives you four bits of useful data per byte (and less than that if the underlying system is using a 16-bit character set). – David Richerby Feb 09 '17 at 14:52

2 Answers2

2
  • Store it as a blob: storing 8 bits of data per character instead of 4 is a 2x compression (you need some way to convert it though),
  • Cut off some characters: you have 160 bits, but 128 bits is enough for unique keys even if the universe ends, and for most purposes 80 bits would even be enough (you don't need cryptographic protection). If you have an anti-collision algorithm, use 36 or 40 bits is enough.
Mark Jeronimus
  • 8,223
  • 2
  • 32
  • 44
  • But keep things in perspective: is 20 bytes saved significant compared to the typical total row size? – TripeHound Feb 09 '17 at 14:27
  • Sadly, I cannot store it as BLOB. I have CHAR, VARCHAR, DECIMAL, DATE, TIMESTAMP, BOOLEAN and GEOMETRY. –  Feb 09 '17 at 14:49
1

A SHA-1 value is 20 bytes. All the bits in these 20 bytes are significant, there's no way to compress them. By storing the bytes in their hexadecimal notation, you're wasting half the space — it takes exactly two hexadecimal digits to store a byte. So you can't compress the underlying value, but you can use a better encoding than hexadecimal.

Storing as a blob is the right answer. That's base 256. You're storing each byte as that byte with no encoding that would create some overhead. Wasted space: 0.

If for some reason you can't do that and you need to use a printable string, then you can do better than hexadecimal by using a more compact encoding. With hexadecimal, the storage requirement is twice the minimum (assuming that each character is stored as one byte). You can use Base64 to bring the storage requirements to 4 characters per 3 bytes, i.e. you would need 28 characters to store the value. In fact, given that you know that the length is 20 bytes and not 21, the base64 encoding will always end with a =, so you only need to store 27 characters and restore the trailing = before decoding.

You could improve the encoding further by using more characters. Base64 uses 64 code points out of the available 256 byte values. ASCII (the de facto portable) has 95 printable characters (including space), but there's no common “base95” encoding, you'd have to roll your own. Base85 is an intermediate choice, it does get some use in practice, and lets you store the 20-byte value in 25 printable ASCII characters.

Community
  • 1
  • 1
Gilles 'SO- stop being evil'
  • 92,660
  • 35
  • 189
  • 229