Unexpected collision with std::hash

Question

I know that hashing an infinite number of strings into a 32-bit int must produce collisions, but I expect a hash function to have a reasonably good distribution.

Isn't it weird that these 2 strings have the same hash?

size_t hash0 = std::hash<std::string>()("generated_id_0");
size_t hash1 = std::hash<std::string>()("generated_id_1");
//hash0 == hash1

I know I can use boost::hash<std::string> or others, but I want to know what is wrong with std::hash. Am I using it wrong? Shouldn't I somehow "seed" it?

@relaxxx: MSVC10 will probably be the last to provide a full c++11 implementation (if they ever will). if you want a working implementation, the most complete one is clang. you can also try the more popular gcc. — Dani, Nov 01 '11 at 15:35
The old standard didn't define specific specializations for types that are usually pointers, but the newer standard requires specializations for things like char*, std::string, etc. I was just saying to @Dani that there is a specialization implemented in VS2010 for std::string. — Joe, Nov 01 '11 at 16:04
@Dani: as much as I like clang, I think gcc is a bit ahead on C++11 features, though again, it's hard to say for sure as neither fully implement what the other covers. — Matthieu M., Nov 01 '11 at 17:57
@MatthieuM.: If you look at implementation status pages, you see that clang is far ahead. — Dani, Nov 02 '11 at 00:12
@Dani: Where did you get that strange idea ? Comparing [Clang 3.0](http://clang.llvm.org/cxx_status.html) with [GCC 4.7](http://gcc.gnu.org/gcc-4.7/cxx0x_status.html), out of 13 items that differentiate them, Clang implement 4 and gcc the other 9. (Clang: n2439, n2258, n2341 and n1986; gcc: n2672, n2927, n2764, n2235, n2170, n2765, n2253, n2544 and n2427) — Matthieu M., Nov 02 '11 at 07:51

score 27 · Accepted Answer · answered Nov 01 '11 at 16:24

There's nothing wrong with your usage of std::hash. The problem is that the specialization std::hash<std::string> provided by the standard library implementation bundled with Visual Studio 2010 only takes a subset of the string's characters to determine the hash value (presumably for performance reasons). Coincidentally the last character of a string with 14 characters is not part of this set, which is why both strings yield the same hash value.
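To make the effect concrete, here is a minimal sketch of a hash that samples only some characters. It is not the actual xfunctional source: the names sampled_hash and full_hash are hypothetical, and the stride of 1 + size / 10 is an assumption borrowed from the description in another answer on this page.

```cpp
#include <cstddef>
#include <string>

// Illustrative only, NOT the MSVC implementation: visits every
// `stride`-th character, where the stride grows with the string length.
std::size_t sampled_hash(const std::string& s)
{
    std::size_t result = 2166136261U;
    const std::size_t stride = 1 + s.size() / 10;  // assumed stride rule
    for (std::size_t i = 0; i < s.size(); i += stride) {
        result = (16777619U * result) ^ static_cast<unsigned char>(s[i]);
    }
    return result;
}

// Same mixing step, but every character contributes to the result.
std::size_t full_hash(const std::string& s)
{
    std::size_t result = 2166136261U;
    for (std::size_t i = 0; i < s.size(); ++i) {
        result = (16777619U * result) ^ static_cast<unsigned char>(s[i]);
    }
    return result;
}
```

For a 14-character string the stride is 2, so only the characters at indices 0, 2, ..., 12 are read. The last character (index 13) never enters the computation, so sampled_hash returns the same value for "generated_id_0" and "generated_id_1", while full_hash does not.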

As far as I know this behaviour is in conformance with the standard, which demands only that multiple calls to the hash function with the same argument must always return the same value. However, the probability of a hash collision should be minimal. The VS2010 implementation fulfills the mandatory part, yet fails to account for the optional one.
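As an extreme illustration of the gap between the mandatory and the optional part, the sketch below uses a hash that returns the same value for every key. ConstantHash and all_in_one_bucket are hypothetical names made up for this example. The container still behaves correctly, because equality comparison resolves the collisions, but every key lands in the same bucket.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Deterministic (same argument, same result) but with the worst
// possible distribution: every key collides.
struct ConstantHash
{
    std::size_t operator()(const std::string&) const { return 42; }
};

// Returns true if all stored keys share a single bucket.
bool all_in_one_bucket(const std::unordered_set<std::string, ConstantHash>& keys)
{
    if (keys.empty()) {
        return true;
    }
    const std::size_t first = keys.bucket(*keys.begin());
    for (const std::string& key : keys) {
        if (keys.bucket(key) != first) {
            return false;
        }
    }
    return true;
}
```

Lookups in such a set remain correct but degrade to a linear scan of one bucket, which is the performance cost of ignoring the distribution recommendation.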

For details, see the implementation in the header file xfunctional (starting at line 869 in my copy) and §17.6.3.4 of the C++ standard (latest public draft).

If you absolutely need a better hash function for strings, you should implement it yourself. It's actually not that hard.

Thank you, that is the answer I was looking for! — relaxxx, Nov 01 '11 at 16:37

score 10 · Answer 2 · edited Nov 01 '11 at 18:30

The exact hash algorithm isn't specified by the standard, so the results will vary. The algorithm used by VC10 doesn't seem to take all of the characters into account if the string is longer than 10 characters; it advances with an increment of 1 + s.size() / 10. This is legal, albeit from a QoI point of view, rather disappointing; such hash codes are known to perform very poorly for some typical sets of data (like URLs). I'd strongly suggest you replace it with either a FNV hash or one based on a Mersenne prime:

FNV hash:

struct hash
{
    size_t operator()( std::string const& s ) const
    {
        size_t result = 2166136261U ;
        std::string::const_iterator end = s.end() ;
        for ( std::string::const_iterator iter = s.begin() ;
              iter != end ;
              ++ iter ) {
            result = (16777619 * result)
                    ^ static_cast< unsigned char >( *iter ) ;
        }
        return result ;
    }
};

Mersenne prime hash:

struct hash
{
    size_t operator()( std::string const& s ) const
    {
        size_t result = 2166136261U ;
        std::string::const_iterator end = s.end() ;
        for ( std::string::const_iterator iter = s.begin() ;
              iter != end ;
              ++ iter ) {
            result = 127 * result
                   + static_cast< unsigned char >( *iter ) ;
        }
        return result ;
    }
};

(The FNV hash is supposedly better, but the Mersenne prime hash will be faster on a lot of machines, because multiplying by 127 is often significantly faster than multiplying by 16777619.)
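On platforms where size_t is 64 bits wide, the FNV hash can use the 64-bit FNV parameters instead. The following is a sketch under that assumption; the name Fnv1Hash64 is hypothetical, and it keeps the same multiply-then-XOR order as the FNV code above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// FNV-1 with the 64-bit offset basis and prime.
struct Fnv1Hash64
{
    std::size_t operator()(const std::string& s) const
    {
        std::uint64_t result = 14695981039346656037ULL;  // 64-bit offset basis
        for (unsigned char c : s) {
            result = (1099511628211ULL * result) ^ c;    // 64-bit FNV prime
        }
        // Truncates on platforms where size_t is narrower than 64 bits.
        return static_cast<std::size_t>(result);
    }
};
```

Such a functor can be passed as the hash template argument of the unordered containers, for example std::unordered_map<std::string, int, Fnv1Hash64>.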

Thank you very much, I wish I could accept more than one correct answer :) — relaxxx, Nov 01 '11 at 17:34
@relaxxx: of late, CityHash and MurmurHash seem to be getting quite popular too. You might also want to give them a try. — Matthieu M., Nov 01 '11 at 18:00
@MatthieuM. I'll have to look into them if I get the chance. I did extensive measurements, with 20 or so popular hashes, but that was about 20 years ago. These two were the winners then, but obviously, things could easily have changed since then. — James Kanze, Nov 01 '11 at 18:22
Note that the 32-bit FNV Hash parameters are used in the above example, for the 64-bit ones see http://www.isthe.com/chongo/tech/comp/fnv/index.html#FNV-param — dalle, Mar 18 '14 at 18:08
Sometimes I think Microsoft get their stuff made by interns. — v.oddou, Jun 28 '15 at 13:14
JFYI: The latest Visual Studio (2017, not sure about the 2015) has FNV hash for std::hash. — Elias Daler, Nov 21 '16 at 18:42

score 3 · Answer 3 · answered Nov 01 '11 at 15:37

You should likely get different hash values. I get different hash values (GCC 4.5):

hashtest.cpp

#include <string>
#include <iostream>
#include <functional>
int main(int argc, char** argv)
{
    size_t hash0 = std::hash<std::string>()("generated_id_0");
    size_t hash1 = std::hash<std::string>()("generated_id_1");
    std::cout << hash0 << (hash0 == hash1 ? " == " : " != ") << hash1 << "\n";
    return 0;
}

Output

# g++ hashtest.cpp -o hashtest -std=gnu++0x
# ./hashtest
16797002355621538189 != 16797001256109909978

he is using MSVC, unfortunately for him :) — Matthieu M., Nov 01 '11 at 17:59

score 2 · Answer 4 · answered Nov 01 '11 at 15:26

You do not seed a hash function; at most you can salt its input.

The function is used in the right way and this collision could be just fortuitous.

You cannot tell whether the hashing function is not evenly distributed unless you perform a massive test with random keys.
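A rough way to run such a check is to hash many keys and count how many distinct hash values come out. The sketch below is only a starting point: count_distinct_hashes is a hypothetical helper, and it uses sequential keys of the form from the question rather than random ones.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hashes key_count keys of the form "generated_id_<i>" with the given
// hasher and returns how many distinct hash values were produced.
template <typename Hasher>
std::size_t count_distinct_hashes(Hasher hasher, std::size_t key_count)
{
    std::unordered_set<std::size_t> seen;
    for (std::size_t i = 0; i < key_count; ++i) {
        seen.insert(hasher("generated_id_" + std::to_string(i)));
    }
    return seen.size();
}
```

Calling count_distinct_hashes(std::hash<std::string>(), 100000) (with <functional> included) and comparing the result with the key count gives a first impression: a value far below the key count points to a poorly distributed hash for this kind of input.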

score 0 · Answer 5 · answered Nov 01 '11 at 15:26

The TR1 hash function and the newest standard define proper specializations for types such as strings. When I run this code using std::tr1::hash (g++ 4.1.2), I get different hash values for these two strings.

Answered by Joe.
