What is a minimal hash function for a pair of ints that has low chance of collisions?

Question

This is what I have so far:

struct pairhash {
public:
  inline std::size_t operator()(const std::pair<int, int> &c) const
  {
     int x = c.first;
     int y = c.second;
     return ((x+y)*(x+y+1)/2 + y); // Cantor's enumeration of pairs
  }
};

I need to use this as a hash function so that I can place the pair of ints in an unordered_set like this:

std::unordered_set< std::pair<int, int>,  pairhash> mySet;

EDIT: Forgot to get the coords from the pair. Updated the code.

EDIT: Removed the template code - added it by mistake.

EDIT: Changed the function based on another similar answer on SO, related to Cantor's enumeration of pairs: hash function providing unique uint from an integer coordinate pair

EDIT: Collision free is not a requirement (Thanks Petr).

I have a hunch that if `size_t` is an `int` there is no such function. — Peter - Reinstate Monica, Jun 20 '16 at 09:31
It all depends on the domain of your input w.r.t. the max/min value of `std::size_t` on your system, I believe. — Hiura, Jun 20 '16 at 09:32
I'm also unsure why the function is a template. Did you mean to have `T` and `U` as the types of the pair elements? — Peter - Reinstate Monica, Jun 20 '16 at 09:33
To be honest, I don't know why we need to return size_t. I've just been looking at code where other people wrote hash functions and it didn't occur to me until you pointed it out. — Rahul Iyer, Jun 20 '16 at 09:33
@PeterA.Schneider that was a blunder on my part - forgot to remove it. Removed it now. Sorry.... :) — Rahul Iyer, Jun 20 '16 at 09:34
@Hiura Can you explain? Are you talking about the problem if I use two really large ints, approaching the max size of an int? For my use case it is unlikely to happen, but at the same time I'm not sure how I will handle it. — Rahul Iyer, Jun 20 '16 at 09:36
Generally speaking, the number of bits in the hash type needs to be the sum of the bits of the element types in order to be collision-free (if both element values can be arbitrary bit patterns). The hash function can then be a trivial concatenation of the bits. You would have trouble with the hash table size though. — Peter - Reinstate Monica, Jun 20 '16 at 09:36
@Petr I don't know what the unordered_set would do if there is a collision. Just playing it safe. — Rahul Iyer, Jun 20 '16 at 09:41
@PeterA.Schneider I'm out of my depth here. But then what can I do if size_t is an int ? — Rahul Iyer, Jun 20 '16 at 09:45
By domain I mean the range of values. So if you have small numbers it should work fine. But as soon as your numbers grow a bit, the result will not fit in an `int` or `size_t`. And don't forget that overflow on `int`s is UB, especially with optimisations turned on. — Hiura, Jun 20 '16 at 09:51
@Hiura I didn't know that (about optimisations). I guess I need to scrap this function and try again - but this seems like a very common scenario since there are many questions on SO about people trying to create a hash function for a pair , but the actual functions aren't perfect.... — Rahul Iyer, Jun 20 '16 at 09:53
IIRC, over all possible distributions, every hash function is equally good. IOW, for every hash function there are good and bad distributions. — MSalters, Jun 21 '16 at 11:13
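
The bit-concatenation idea raised in the comments above can be sketched as follows. This is only an illustration: the name `PackedPairHash` is made up here, and the sketch assumes `int` is at most 32 bits wide. It is collision-free only when `std::size_t` is 64 bits; on a 32-bit `std::size_t` the upper half is simply truncated.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical functor: packs both ints into one 64-bit value.
struct PackedPairHash {
    std::size_t operator()(const std::pair<int, int>& p) const noexcept {
        static_assert(sizeof(int) <= 4, "sketch assumes int has at most 32 bits");
        // Go through uint32_t so negative values wrap in a well-defined way.
        const std::uint64_t hi = static_cast<std::uint32_t>(p.first);
        const std::uint64_t lo = static_cast<std::uint32_t>(p.second);
        // Injective when size_t has 64 bits; truncated (and lossy) otherwise.
        return static_cast<std::size_t>((hi << 32) | lo);
    }
};
```

It would be used as `std::unordered_set<std::pair<int, int>, PackedPairHash>`.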

Petr · Answer 1 · 2016-06-20T09:51:57.963

4

You don't need the hash function to be collision free, if you intend to use it with unordered_set (as well as with most other containers and algorithms). Moreover, the general concept of hash tables and hash functions is that they allow collisions, they just expect collisions to be rare.

cppreference says about the requirements for hashing:

For two different parameters k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) should be very small, approaching 1.0/std::numeric_limits<size_t>::max().
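
To make the point about collisions concrete, here is a small sketch with a deliberately useless hash. `ConstantHash` and `collisionDemo` are names invented for this illustration. Every key lands in the same bucket, yet the set still behaves correctly because it falls back to `operator==`; only the performance suffers.

```cpp
#include <cstddef>
#include <unordered_set>
#include <utility>

// Deliberately terrible hash (illustration only): every key collides.
struct ConstantHash {
    std::size_t operator()(const std::pair<int, int>&) const noexcept {
        return 0;
    }
};

// Inserts two distinct pairs and one duplicate, then reports the set size.
inline std::size_t collisionDemo() {
    std::unordered_set<std::pair<int, int>, ConstantHash> s;
    s.insert({1, 2});
    s.insert({2, 1});  // same hash as {1, 2}, but a different key: kept
    s.insert({1, 2});  // true duplicate: rejected via operator==
    return s.size();   // 2
}
```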

edited Jun 20 '16 at 09:51

answered Jun 20 '16 at 09:42

Petr

But what will happen in the unordered_set if there is a collision ? I guess this goes into the working of unordered_set, but if it didn't matter, then why do we have to implement the hash function ? Couldn't they have used some other generic way by reflecting on the user-defined data type etc... – Rahul Iyer Jun 20 '16 at 09:44
@John, you should read up on hash tables for a full answer. Even [Wikipedia](https://en.wikipedia.org/wiki/Hash_table) seems to give a good introduction. In short, if hashes are equal, `unordered_set` will compare the elements using the standard equality operator. But we want this to happen rarely, to avoid doing many comparisons. – Petr Jun 20 '16 at 09:47
If you have some idea of how much data you are going to store, you can perhaps set the bucket count to a reasonably high value. There would still be a chance of collisions, but it should be very small. – Arunmu Jun 20 '16 at 09:48
@John, and for "generic way" — for most standard types the hash functions do exist in standard library (and I guess that for `std::pair` you don't need to write your own hash function, you can use the standard one). But for general case for your own classes only you know how to calculate hash. – Petr Jun 20 '16 at 09:49
I know a little bit about hash tables, but I don't know how to implement one for an unordered_set, since each hash runs independently, and we don't store any results etc. – Rahul Iyer Jun 20 '16 at 09:52
@John, what do you mean by each hash running independently and the results not being stored? You supply a hash-calculating function to `unordered_set`. The set itself will call this function whenever it needs to, will store the results internally if needed, and so on. – Petr Jun 20 '16 at 09:53
@Petr - thats what I mean - The set will call the function - so its hard to tell if there is a collision or not, unless you can mathematically prove there won't be a collision. – Rahul Iyer Jun 20 '16 at 09:55
@John, once again — you do not need to completely avoid collisions. It's ok if your function generates collisions from time to time. – Petr Jun 20 '16 at 09:56
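
The bucket-count suggestion from the comments above could look roughly like this. `DemoPairHash` and `makeReservedSet` are placeholder names for this sketch, and the hash itself is just a stand-in so that the example compiles on its own. `reserve` is the standard `std::unordered_set` member that sizes the table for an expected number of elements.

```cpp
#include <cstddef>
#include <unordered_set>
#include <utility>

// Placeholder hash so the sketch is self-contained (hypothetical name).
struct DemoPairHash {
    std::size_t operator()(const std::pair<int, int>& p) const noexcept {
        return static_cast<std::size_t>(p.first) * 31u
             + static_cast<std::size_t>(p.second);
    }
};

// Builds an empty set that already has room for `expected` elements.
inline std::unordered_set<std::pair<int, int>, DemoPairHash>
makeReservedSet(std::size_t expected) {
    std::unordered_set<std::pair<int, int>, DemoPairHash> s;
    // Allocates enough buckets that `expected` elements fit
    // without exceeding max_load_factor().
    s.reserve(expected);
    return s;
}
```

More buckets reduce the number of keys that share a bucket, but two keys with identical hash values still end up together no matter how large the table is.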

score 2 · Answer 2 · answered Jun 20 '16 at 10:30

Update

Posted the answer and saw that the question was already updated. Learned the established name for the hashing approach proposed in my answer below.

Generally speaking, such a function does not exist if 2*sizeof(int) > sizeof(size_t). However, assuming that you will not utilize the full range of the int type, you can attempt constructing a hash function that is free of collisions for sufficiently small values of your 2 integers. Assuming non-negative values for both a and b, I can propose the following function:

size_t hashRangeStart(size_t n)
{
    return n*(n+1)/2; // == 1 + 2 + ... + n
}

size_t intPairHash(int a, int b)
{
    return hashRangeStart(a+b)+a;
}

The idea behind this approach is quite simple:

pairs of integers {a, b} adding up to the same value n=a+b produce a contiguous range of hashes, i.e. intPairHash(a, b) == intPairHash(0, a+b) + a.
hash ranges for adjacent sums n and n+1 abut, i.e. intPairHash(0, a+1) == intPairHash(a, 0) + 1.

Extending this approach to signed values should not be too difficult.
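
One possible way to do that extension, offered as a sketch rather than as the answer's own method: first fold each signed value onto the non-negative integers with a zigzag mapping (0, -1, 1, -2, 2, ... becomes 0, 1, 2, 3, 4, ...), then apply the same triangular-number formula. The names `zigzag` and `signedPairHash` are made up for this example. The result is collision-free only while the intermediate product does not wrap and the value fits in `std::size_t`, which holds for inputs of small magnitude.

```cpp
#include <cstddef>
#include <cstdint>

// Map a signed value to an unsigned one:
// 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
inline std::uint64_t zigzag(int v) {
    const std::int64_t w = v;  // widen first so the negation cannot overflow
    return w >= 0 ? static_cast<std::uint64_t>(w) * 2
                  : static_cast<std::uint64_t>(-(w + 1)) * 2 + 1;
}

// Same shape as intPairHash above, applied to the folded values.
inline std::size_t signedPairHash(int a, int b) {
    const std::uint64_t x = zigzag(a);
    const std::uint64_t y = zigzag(b);
    const std::uint64_t n = x + y;
    // Unsigned arithmetic wraps instead of being undefined behaviour,
    // so very large inputs merely collide.
    return static_cast<std::size_t>(n * (n + 1) / 2 + x);
}
```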

score 1 · Answer 3 · answered Oct 13 '16 at 18:49

A simple way to hash two integers is to use Knuth's multiplicative hash:

size_t hash2(int i1, int i2)
{
    size_t ret = i1;
    ret *= 2654435761U;
    return ret ^ i2;
}
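
To plug this into the `unordered_set` from the question, the function can be wrapped in a small functor. The sketch below repeats `hash2` so that it compiles on its own; the names `PairHash` and `PairSet` are invented for the example.

```cpp
#include <cstddef>
#include <unordered_set>
#include <utility>

// Same arithmetic as hash2 above, with the conversions made explicit.
inline std::size_t hash2(int i1, int i2) {
    std::size_t ret = static_cast<std::size_t>(i1);
    ret *= 2654435761U;  // constant commonly used for multiplicative hashing
    return ret ^ static_cast<std::size_t>(i2);
}

// Hypothetical adapter so the function can serve as the set's hasher.
struct PairHash {
    std::size_t operator()(const std::pair<int, int>& p) const noexcept {
        return hash2(p.first, p.second);
    }
};

using PairSet = std::unordered_set<std::pair<int, int>, PairHash>;
```

A `PairSet` then accepts `insert({3, 4})` and `count({3, 4})` like any other `unordered_set`.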

answered Oct 13 '16 at 18:49

David Schwartz
