3

The page linked below mentions a chance of collisions, but I am trying to use std::hash to find duplicate entries:

http://www.cplusplus.com/reference/functional/hash/

I am using std::hash<std::string> and storing the return value in a std::unordered_set. If emplace fails, I mark the string as a duplicate.

BlueTune
Build Succeeded
  • It depends. What do you want to do? – JHBonarius Dec 07 '19 at 18:11
  • I want to generate a hash value for each of several strings. If the hash function returns the same value twice, I mark the string as a duplicate. – Build Succeeded Dec 07 '19 at 18:13
  • 9
    You're taking an arbitrary number of characters (bytes), which can be hundreds of bits long, and reducing that to a 32- or 64- bit integer. Yes, there **will** be duplicate hashes for different strings. You can check the hash first; if it matches, then compare the strings. – 1201ProgramAlarm Dec 07 '19 at 18:14
  • https://stackoverflow.com/questions/7968674/unexpected-collision-with-stdhash https://stackoverflow.com/questions/51145320/does-stdhash-give-same-result-for-same-input-for-different-compiled-builds-and I am not sure whether this is still a problem in C++17. – Build Succeeded Dec 07 '19 at 18:19
  • 1
    C++ version can’t change the fact that you can’t have infinite different hashes of finite length. It’s math. – Sami Kuhmonen Dec 07 '19 at 19:49
  • You can use fuzzing to test it out, and locate string patterns to avoid for duplicate hashes. – Michaël Roy Dec 08 '19 at 02:19
  • 1
    ... you can also expect these tests to run a very long time. – Michaël Roy Dec 08 '19 at 02:34

4 Answers

4

Hashes are generally functions from a large space of values into a small space of values, e.g. from the space of all strings to 64-bit integers. There are a lot more strings than 64-bit integers, so obviously multiple strings can have the same hash. A good hash function is such that there's no simple rule relating strings with the same hash value.

So, when we want to use hashes to find duplicate strings (or duplicate anything), it's always a two-phase process (at least):

  1. Look for strings with identical hash (i.e. locate the "hash bucket" for your string)
  2. Do a character-by-character comparison of your string with other strings having the same hash.

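std::unordered_set<std::string> already does both steps for you, whatever the implementation specifics, so it is redundant to hash the strings yourself and then store the results in an std::unordered_set. Worse, a set of hash values performs only step 1: two different strings that happen to share a hash would be wrongly reported as duplicates.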
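As a minimal sketch of letting the container do the work, the strings themselves can be stored in the set. The function name `find_duplicates` is made up for illustration; `insert` and its returned pair are standard.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns every repeated occurrence found in `inputs`,
// in the order encountered.
std::vector<std::string> find_duplicates(const std::vector<std::string>& inputs) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> duplicates;
    for (const auto& s : inputs) {
        // insert() returns {iterator, bool}. The bool is false only when an
        // equal string is already stored, not merely one with an equal hash.
        if (!seen.insert(s).second) {
            duplicates.push_back(s);
        }
    }
    return duplicates;
}
```

Because the set compares the stored strings with `operator==` after hashing, a hash collision between two different strings does not produce a false duplicate here.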

Finally, note that there are other features one could use for initial duplicate screening - or for searching among the same-hash values. For example, string length: Before comparing two strings character-by-character, you check their lengths (which you should be able to access without actually iterating the strings); different lengths -> non-equal strings.
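The two phases and the length screen can be sketched by hand as follows. This is only an illustration: `weak_hash` and `seen_before` are hypothetical names, and the hash is deliberately poor (a sum of bytes) so that collisions such as "ab" and "ba" are easy to trigger.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Deliberately weak hash, for illustration only: the sum of all bytes.
// Any two strings that are permutations of each other collide.
std::size_t weak_hash(const std::string& s) {
    std::size_t h = 0;
    for (unsigned char c : s) {
        h += c;
    }
    return h;
}

using Buckets = std::unordered_map<std::size_t, std::vector<std::string>>;

// Returns true if `s` was recorded earlier; otherwise records it and returns false.
bool seen_before(Buckets& buckets, const std::string& s) {
    // Phase 1: find the strings that share this hash value.
    std::vector<std::string>& bucket = buckets[weak_hash(s)];
    for (const std::string& candidate : bucket) {
        // Cheap screen: different lengths mean the strings cannot be equal.
        if (candidate.size() != s.size()) {
            continue;
        }
        // Phase 2: full character-by-character comparison.
        if (candidate == s) {
            return true;
        }
    }
    bucket.push_back(s);
    return false;
}
```

The explicit length check is shown only to make the idea visible; `std::string`'s `operator==` typically compares sizes first anyway.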

einpoklum
2

Yes, it is possible for two different strings to share the same hash. Simply put, imagine the hash is stored in an 8-bit type (unsigned char). That gives 2^8 = 256 possible values, so arbitrary inputs can only ever produce 256 distinct hashes.
Since you can definitely create more than 256 different strings, there is no way the hash would be unique for all possible strings.

On a typical 64-bit platform std::size_t is a 64-bit type, so using it to store the hash value gives you 2^64 possible hashes. That is vastly more than 256, but it is still not enough to distinguish between all the strings you could create.

You just can't store an entire book in only 64 bits.
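The counting argument can be checked with a small sketch. It assumes a made-up 8-bit hash, `hash8`, that keeps only the low 8 bits of `std::hash<std::string>`, and feeds it the 257 distinct strings "0" through "256". With only 256 possible hash values, at least two of those strings must collide; which two depends on the standard library implementation.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical 8-bit hash: the low 8 bits of std::hash's result.
std::uint8_t hash8(const std::string& s) {
    return static_cast<std::uint8_t>(std::hash<std::string>{}(s) & 0xFF);
}

// Hashes the 257 distinct strings "0".."256" and returns the first pair of
// different strings that share an 8-bit hash value.
std::pair<std::string, std::string> first_collision() {
    std::unordered_map<std::uint8_t, std::string> first_with_hash;
    for (int i = 0; i <= 256; ++i) {
        const std::string s = std::to_string(i);
        // emplace() fails when this hash value was already produced
        // by an earlier, different string.
        auto result = first_with_hash.emplace(hash8(s), s);
        if (!result.second) {
            return {result.first->second, s};
        }
    }
    return {};  // not reachable: 257 strings cannot fit in 256 hash values
}
```

The same argument applies to a 64-bit hash; it just takes far more than 257 strings before a collision is guaranteed.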

ProXicT
1

Yes it can return the same result for different strings. This is a natural consequence of reducing an infinite range of possibilities to a single 64-bit number.

So-called "perfect hash functions" do exist: they are built for a specific, known set of inputs and return a unique result for each input in that set. The guarantee holds only for that known set, though. An unknown input from outside might still produce a matching hash number. That possibility can be reduced by using a Bloom filter.

However, at some point, with all these hash calculations, the program would have been better off doing simple string comparisons in an unsorted linear array. Who cares if the operation is O(1)+C if C is ridiculously big?

Zan Lynx
0

Yes, std::hash can return the same result for different std::string values. How the buckets are created also differs from one compiler's standard library to another.

An implementation-specific discussion can be found at the link: hashing and rehashing for std::unordered_set

Build Succeeded