
Is there a hashing method that can take a string of any length and produce a sub-10-character hash? I want to produce reasonably unique IDs, but based on message contents rather than randomly.

I can live with constraining the messages to integer values, though, if arbitrary-length strings are impossible. However, the hash must not be similar for two consecutive integers, in that case.

rath3r

10 Answers

92

You can use any commonly available hash algorithm (e.g. SHA-1), which will give you a slightly longer result than what you need. Simply truncate the result to the desired length, which may be good enough.

For example, in Python:

>>> import hashlib
>>> hash = hashlib.sha1("my message".encode("UTF-8")).hexdigest()
>>> hash
'104ab42f1193c336aa2cf08a2c946d5c6fd0fcdb'
>>> hash[:10]
'104ab42f11'
Greg Hewgill
  • Hm, I wasn't aware SHA hexdigests could be truncated. –  Dec 30 '10 at 23:47
  • 5
    Any reasonable hash function can be truncated. – President James K. Polk Dec 31 '10 at 00:48
  • 105
    wouldn't this rise the risk of collision to a much higher extent? – Gabriel Sanmartin Apr 30 '13 at 09:39
  • 151
    @erasmospunk: encoding with base64 does nothing for collision resistance, since if `hash(a)` collides with `hash(b)` then `base64(hash(a))` also collides with `base64(hash(b))`. – Greg Hewgill Nov 12 '13 at 18:37
  • 62
  • 62
    @GregHewgill you are right, but we are not speaking about the original hash algorithm colliding (yes, `sha1` collides but this is another story). If you have a 10 characters hash you get higher entropy if it is encoded with `base64` vs `base16` (or hex). How much higher? With `base16` you get 4 bits of information per character, with `base64` this figure is 6 bits/char. Totally, a 10 char "hex" hash will have 40 bits of entropy while a base64 one has 60 bits. So it is _slightly_ more resistant, sorry if I was not super clear. – John L. Jegutanis Nov 13 '13 at 14:35
  • 23
    @erasmospunk: Oh I see what you mean, yes if you have a limited fixed size for your result then you can pack more significant bits in with base64 encoding vs. hex encoding. – Greg Hewgill Nov 13 '13 at 18:46
  • 1
    Or you can not encode as ASCII at all and use the raw bytes if you're that concerned about space. –  Sep 05 '18 at 06:28
47

If you don't need an algorithm that's strong against intentional modification, I've found an algorithm called adler32 that produces pretty short (~8 character) results. Choose it from the dropdown here to try it out:

http://www.sha1-online.com/
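For use inside a program rather than on that site, Adler-32 is also available in Python's standard library via the `zlib` module; a minimal sketch (the `"my message"` input is just an example):

```python
import zlib

# Adler-32 checksum of the message bytes; zlib.adler32 is in the stdlib.
checksum = zlib.adler32(b"my message")

# Format as zero-padded hex: always exactly 8 characters.
short_id = format(checksum, "08x")
print(short_id)
```

The same call with `zlib.crc32` gives a CRC-32 checksum of the same 8-character length.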

B T
  • 2
    it's very old, not very reliable. – Mascarpone Jun 19 '18 at 10:37
  • 5
    @Mascarpone "not very reliable" - source? It has limitations, if you know them it doesn't matter how old it is. – B T Jun 22 '18 at 07:13
  • https://en.wikipedia.org/wiki/Adler-32 , it's better to choose a more modern algorithm, which has fewer weaknesses and maybe uses a sponge function – Mascarpone Jun 22 '18 at 12:28
  • 13
    @Mascarpone "fewer weaknesses" - again, *what* weaknesses? Why do you think this algorithm isn't 100% perfect for the OP's usage? – B T Jul 12 '18 at 17:14
  • 3
    @Mascarpone The OP doesn't say that they want a crypto-grade hash. OTOH, Adler32 is a checksum, not a hash, so it may not be suitable, depending on what the OP is actually doing with it. – PM 2Ring Aug 23 '18 at 17:35
  • 2
    There is one caveat to Adler32, quoting [Wikipedia](https://en.wikipedia.org/wiki/Adler-32): *Adler-32 has a weakness for short messages with few hundred bytes, because the checksums for these messages have a poor coverage of the 32 available bits.* – Basil Bourque Nov 24 '18 at 20:44
  • adler16 or adler8 also exist which can yield even smaller results – justin.m.chase Oct 21 '19 at 18:57
15

You need to hash the contents to come up with a digest. There are many hashes available, but 10 characters is pretty small for the result set. Way back, people used CRC-32, which produces a 32-bit hash (8 hex characters). There is also CRC-64, which produces a 64-bit hash. MD5, which produces a 128-bit hash (16 bytes, or 32 hex characters), is considered broken for cryptographic purposes because two messages can be found which have the same hash. It should go without saying that any time you create a 16-byte digest out of an arbitrary-length message you're going to end up with duplicates. The shorter the digest, the greater the risk of collisions.

However, your concern that the hash not be similar for two consecutive messages (whether integers or not) should be true with all hashes. Even a single bit change in the original message should produce a vastly different resulting digest.

So, using something like CRC-64 (and base-64'ing the result) should get you in the neighborhood you're looking for.
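CRC-64 isn't in Python's standard library, but the same idea can be sketched with CRC-32: checksum the message, then base64 the raw bytes instead of hex-encoding them. This is an illustrative sketch, not the answer's exact recipe:

```python
import base64
import struct
import zlib

message = b"my message"  # example input

# CRC-32 gives a 32-bit integer; pack it into 4 big-endian bytes.
crc = zlib.crc32(message)
packed = struct.pack(">I", crc)

# base64 the raw bytes and drop the '=' padding.
short_id = base64.urlsafe_b64encode(packed).rstrip(b"=").decode()
print(short_id)  # 6 characters for a 32-bit checksum
```

With a real CRC-64 (e.g. from a third-party package), the packed value would be 8 bytes and the padded-stripped base64 result 11 characters, still under the 10-ish character budget the question asks about.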

John
  • 1
    Does CRC'ing a SHA-1 hash and then base-64'ing the result make the resulting ID more resistant to collision? –  Dec 30 '10 at 23:58
  • 6
    "However, your concern that the hash not be similar for two consecutive messages [...] should be true with all hashes." -- That's not necessarily true. For example, for hash functions which are used for clustering or clone detection, the exact opposite is true, actually: you *want* similar documents to yield similar (or even the same) hash values. A well-known example of a hash algorithm that is *specifically* designed to yield identical values for similar input is Soundex. – Jörg W Mittag Dec 31 '10 at 01:08
  • I am using the hashes for authenticating the signature of the message. So basically, for a known message and specified signature, the hash must be correct. I don't care if there would be a small percentage of false positives, though. It's totally acceptable. I currently use the truncated SHA-512 hash compressed with base62 (something I whipped up quickly) for convenience. –  Jan 02 '11 at 23:20
  • @JörgWMittag Excellent point on SoundEx. I stand corrected. Not _all_ hashes have the same characteristics. – John Aug 03 '13 at 04:09
13

Just summarizing an answer that was helpful to me (noting @erasmospunk's comment about using base-64 encoding). My goal was to have a short string that was mostly unique...

I'm no expert, so please correct this if it has any glaring errors (in Python again like the accepted answer):

import base64
import hashlib
import uuid

unique_id = uuid.uuid4()
# unique_id = UUID('8da617a7-0bd6-4cce-ae49-5d31f2a5a35f')

hash = hashlib.sha1(str(unique_id).encode("UTF-8"))
# hash.hexdigest() = '882efb0f24a03938e5898aa6b69df2038a2c3f0e'

result = base64.b64encode(hash.digest())
# result = b'iC77DySgOTjliYqmtp3yA4osPw4='

The result here is using more than just hex characters (what you'd get if you used hash.hexdigest()) so it's less likely to have a collision (that is, should be safer to truncate than a hex digest).

Note: Using UUID4 (random). See http://en.wikipedia.org/wiki/Universally_unique_identifier for the other types.
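For the question's sub-10-character goal, the base64 digest can then be truncated; a short sketch reusing the accepted answer's `"my message"` example (`urlsafe_b64encode` avoids `/` and `+` in case the ID ends up in a URL or filename):

```python
import base64
import hashlib

# SHA-1 digest as raw bytes, then base64, then truncate.
digest = hashlib.sha1(b"my message").digest()
short_id = base64.urlsafe_b64encode(digest)[:10].decode()
print(short_id)  # 10 characters carrying 60 bits of the hash
```

A 10-character hex truncation keeps only 40 bits, so the base64 form packs half again as many bits into the same length.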

JJ Geewax
11

If you need a "sub-10-character hash", you could use the Fletcher-32 algorithm, which produces an 8-hex-character (32-bit) hash, or CRC-32 or Adler-32.

CRC-32 is roughly 20%–100% slower than Adler-32.

Fletcher-32 is slightly more reliable than Adler-32. It has a lower computational cost than the Adler checksum: Fletcher vs Adler comparison.

A sample program with a few Fletcher implementations is given below:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h> // for uint32_t

    uint32_t fletcher32_1(const uint16_t *data, size_t len)
    {
            uint32_t c0, c1;
            unsigned int i;

            for (c0 = c1 = 0; len >= 360; len -= 360) {
                    for (i = 0; i < 360; ++i) {
                            c0 = c0 + *data++;
                            c1 = c1 + c0;
                    }
                    c0 = c0 % 65535;
                    c1 = c1 % 65535;
            }
            for (i = 0; i < len; ++i) {
                    c0 = c0 + *data++;
                    c1 = c1 + c0;
            }
            c0 = c0 % 65535;
            c1 = c1 % 65535;
            return (c1 << 16 | c0);
    }

    uint32_t fletcher32_2(const uint16_t *data, size_t l)
    {
        uint32_t sum1 = 0xffff, sum2 = 0xffff;

        while (l) {
            unsigned tlen = l > 359 ? 359 : l;
            l -= tlen;
            do {
                sum2 += sum1 += *data++;
            } while (--tlen);
            sum1 = (sum1 & 0xffff) + (sum1 >> 16);
            sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        }
        /* Second reduction step to reduce sums to 16 bits */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        return (sum2 << 16) | sum1;
    }

    int main(void)
    {
        const char *str1 = "abcde";
        const char *str2 = "abcdef";

        // Round up to a whole number of 16-bit words; '\0' pads odd lengths.
        size_t len1 = (strlen(str1) + 1) / 2;
        size_t len2 = (strlen(str2) + 1) / 2;

        uint32_t f1 = fletcher32_1((const uint16_t *)str1, len1);
        uint32_t f2 = fletcher32_2((const uint16_t *)str1, len1);

        printf("%u %X \n",   f1, f1);
        printf("%u %X \n\n", f2, f2);

        f1 = fletcher32_1((const uint16_t *)str2, len2);
        f2 = fletcher32_2((const uint16_t *)str2, len2);

        printf("%u %X \n", f1, f1);
        printf("%u %X \n", f2, f2);

        return 0;
    }

Output:

4031760169 F04FC729
4031760169 F04FC729

1448095018 56502D2A
1448095018 56502D2A

Agrees with Test vectors:

"abcde"  -> 4031760169 (0xF04FC729)
"abcdef" -> 1448095018 (0x56502D2A)

Adler-32 has a weakness for short messages of a few hundred bytes, because the checksums for these messages have poor coverage of the 32 available bits. Check this:

The Adler32 algorithm is not complex enough to compete with comparable checksums.

sg7
10

You can use the hashlib library for Python. The shake_128 and shake_256 algorithms provide variable length hashes. Here's some working code (Python3):

>>> import hashlib
>>> my_string = 'hello shake'
>>> hashlib.shake_256(my_string.encode()).hexdigest(5)
'34177f6a0a'

Notice that with a length parameter x (5 in the example) the function returns a hash value of 2x hexadecimal characters (x bytes).

feran
7

You could use an existing hash algorithm that produces something short, like MD5 (128 bits) or SHA1 (160). Then you can shorten that further by XORing sections of the digest with other sections. This will increase the chance of collisions, but not as bad as simply truncating the digest.

Also, you could include the length of the original data as part of the result to make it more unique. For example, XORing the first half of an MD5 digest with the second half would result in 64 bits. Add 32 bits for the length of the data (or lower if you know that length will always fit into fewer bits). That would result in a 96-bit (12-byte) result that you could then turn into a 24-character hex string. Alternately, you could use base 64 encoding to make it even shorter.
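A minimal sketch of that folding idea, in Python to match the accepted answer (the `"my message"` input and big-endian 4-byte length are illustrative choices, not part of this answer's text):

```python
import hashlib
import struct

message = b"my message"

# 16-byte MD5 digest, folded by XORing the two 8-byte halves.
digest = hashlib.md5(message).digest()
folded = bytes(a ^ b for a, b in zip(digest[:8], digest[8:]))

# Append the message length as 4 big-endian bytes: 12 bytes total.
result = folded + struct.pack(">I", len(message))
print(result.hex())  # 24 hex characters
```

base64-encoding `result` instead of hex would shorten the 12 bytes to 16 characters.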

dynamichael
7

Simply run this in a terminal (on macOS or Linux):

crc32 <(echo "some string")

8 characters long.
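Roughly the same thing can be done inside a script with the standard library's `zlib.crc32`; note that the shell version checksums the trailing newline that `echo` appends, so the newline is included here for parity:

```python
import zlib

# CRC-32 of the same bytes the shell pipeline sees (echo adds '\n').
print(format(zlib.crc32(b"some string\n"), "08x"))  # 8 hex characters
```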

sgon00
5

It is now 2019 and there are better options. Namely, xxhash.

~ echo test | xxhsum                                                           
2d7f1808da1fa63c  stdin
sorbet
0

I needed something along the lines of a simple string reduction function recently. Basically, the code looked something like this (C/C++ code ahead):

size_t ReduceString(char *Dest, size_t DestSize, const char *Src, size_t SrcSize, bool Normalize)
{
    size_t x, x2 = 0, y, z = 0;

    memset(Dest, 0, DestSize);

    for (x = 0; x < SrcSize; x++)
    {
        Dest[x2] = (char)(((unsigned int)(unsigned char)Dest[x2]) * 37 + ((unsigned int)(unsigned char)Src[x]));
        x2++;

        if (x2 == DestSize - 1)
        {
            x2 = 0;
            z++;
        }
    }

    // Normalize the alphabet if it looped.
    if (z && Normalize)
    {
        unsigned char TempChr;
        y = (z > 1 ? DestSize - 1 : x2);
        for (x = 1; x < y; x++)
        {
            TempChr = ((unsigned char)Dest[x]) & 0x3F;

            if (TempChr < 10)  TempChr += '0';
            else if (TempChr < 36)  TempChr = TempChr - 10 + 'A';
            else if (TempChr < 62)  TempChr = TempChr - 36 + 'a';
            else if (TempChr == 62)  TempChr = '_';
            else  TempChr = '-';

            Dest[x] = (char)TempChr;
        }
    }

    return (SrcSize < DestSize ? SrcSize : DestSize);
}

It probably has more collisions than might be desired, but it isn't intended for use as a cryptographic hash function. You might try various multipliers (e.g. change the 37 to another prime number) if you get too many collisions. One of the interesting features of this snippet is that when Src is shorter than Dest, Dest ends up with the input string as-is (0 * 37 + value = value). If you want something "readable" at the end of the process, Normalize will adjust the transformed bytes at the cost of increasing collisions.

Source:

https://github.com/cubiclesoft/cross-platform-cpp/blob/master/sync/sync_util.cpp

CubicleSoft
  • std::hash doesn't solve certain use-cases (e.g. avoiding dragging in the bloaty std:: templates when just a few extra lines of code will suffice). There's nothing silly here. It was carefully thought out to deal with major limitations in Mac OSX. I didn't want an integer. For that, I could've used djb2 and still avoided using std:: templates. – CubicleSoft Nov 24 '16 at 14:18
  • This still sounds silly. Why would you *ever* use a `DestSize` greater than 4 (32 bits) when the hash itself is so crappy? If you wanted the collision resistance provided by an output larger than an int, you'd use SHA. – Navin Nov 24 '16 at 16:57
  • Look, it's not really a traditional hash. It has useful properties where the user can declare string size in places where there is extremely limited buffer space on certain OSes (e.g. Mac OSX) AND the result has to fit within the limited domain of real filenames AND they don't want to just truncate the name because that WOULD cause collisions (but shorter strings are left alone). A cryptographic hash is not always the right answer and std::hash is also not always the right answer. – CubicleSoft Nov 24 '16 at 20:04