Hash functions generally map a large space of values into a small one, e.g. the space of all strings into 64-bit integers. There are far more strings than 64-bit integers, so inevitably multiple strings will share the same hash. With a good hash function, there's no simple rule relating strings that have the same hash value.
So, when we want to use hashes to find duplicate strings (or duplicate anything), it's always (at least) a two-phase process:
- Look for strings with an identical hash (i.e. locate the "hash bucket" for your string).
- Do a character-by-character comparison of your string with the other strings having the same hash.
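The two phases above can be sketched by hand roughly as follows. This is only an illustration, not how any particular library implements it: the function name seen_before and the buckets map are hypothetical, and the map is keyed directly by the std::hash value just to make the "bucket" idea visible.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: returns true if `s` was seen before, otherwise records it.
// `buckets` maps a hash value to every distinct string seen so far with that hash.
bool seen_before(std::unordered_map<std::size_t, std::vector<std::string>>& buckets,
                 const std::string& s)
{
    // Phase 1: locate the bucket of strings sharing this hash value.
    std::vector<std::string>& bucket = buckets[std::hash<std::string>{}(s)];

    // Phase 2: an equal hash only makes a string a candidate;
    // confirm with a full comparison.
    for (const std::string& candidate : bucket) {
        if (candidate == s) {
            return true;
        }
    }

    // No equal string in the bucket: remember this one for later lookups.
    bucket.push_back(s);
    return false;
}
```

Skipping phase 2 would report two different strings as duplicates whenever their hashes happen to collide.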
std::unordered_set does this - and never mind the specifics. Note that it does this for you, so it's redundant for you to hash yourself, then store the result in an std::unordered_set.
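As a minimal sketch of letting the container do both phases, the strings themselves (not their hashes) go into the set. The helper name find_duplicates is made up for this example, and the second set exists only so that each duplicate is reported once.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: collect every string that occurs more than once in
// `input`, in order of its first repeated occurrence.
std::vector<std::string> find_duplicates(const std::vector<std::string>& input)
{
    std::unordered_set<std::string> seen;      // every distinct string so far
    std::unordered_set<std::string> reported;  // duplicates already emitted
    std::vector<std::string> duplicates;

    for (const std::string& s : input) {
        // insert() hashes s, finds its bucket and compares for equality;
        // .second is false when an equal string is already stored.
        if (!seen.insert(s).second && reported.insert(s).second) {
            duplicates.push_back(s);
        }
    }
    return duplicates;
}
```

For the input a, b, a, c, b, a this returns a followed by b.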
Finally, note that there are other features one could use for initial duplicate screening - or for searching among the same-hash values. For example, string length: before comparing two strings character-by-character, you check their lengths (which you should be able to access without actually iterating over the strings); different lengths -> non-equal strings.
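A small sketch of that length pre-check, with a hypothetical function name same_string. It is written out by hand only to show the idea; the equality operator of std::string is typically implemented with a similar shortcut, so in practice a == b is usually all you need.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: equality test with a cheap length pre-check.
bool same_string(const std::string& a, const std::string& b)
{
    // size() is constant time for std::string, so strings of different
    // lengths are rejected without looking at any character.
    if (a.size() != b.size()) {
        return false;
    }

    // Same length: fall back to the character-by-character comparison.
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) {
            return false;
        }
    }
    return true;
}
```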