
I'm looking for a pairing function f: ZxZ -> Z, with the following characteristics:

  • It doesn't need to be reversible. I only need it to be injective (different pairs map to different integers); I never need to compute the pair back.
  • It is defined over Z (signed integers)
  • It is efficiently computable

At the moment, I'm using f(x,y) = x + (max(x)-min(x)+1) * y

It works, I'm just wondering whether there could be another function that uses the result space more efficiently, considering that:

  • x,y are signed integers of up to 64 bits
  • f(x,y) is an integer of at most 64 bits
  • whether f(x,y) fits in 64 bits is easy to determine

I do know that this means I cannot map all x,y combinations without the result overflowing. I'm happy enough with being able to establish whether the result fits in 64 bits or not. For this reason, the ideal mapping function would use the available 64 bits as efficiently as possible.
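For illustration, the current formula combined with the "does it fit?" check might look like the following minimal sketch. The names `RangePairing`, `TryPair`, `minX` and `maxX` are hypothetical, and the sketch assumes that x always lies in [minX, maxX]. It relies on C#'s `checked` arithmetic and is conservative: it reports failure whenever any intermediate step overflows.

```csharp
using System;

public static class RangePairing
{
    // Computes f(x, y) = x + (maxX - minX + 1) * y in checked arithmetic.
    // Assumes minX <= x <= maxX; that is what makes the mapping injective.
    // Returns false (and result = 0) when any intermediate step leaves the long range.
    public static bool TryPair(long x, long y, long minX, long maxX, out long result)
    {
        try
        {
            checked
            {
                long width = maxX - minX + 1; // number of distinct x values
                result = x + width * y;
            }
            return true;
        }
        catch (OverflowException)
        {
            result = 0;
            return false;
        }
    }
}
```

With x in [0, 9], for example, `TryPair(4, 5, 0, 9, out var r)` succeeds with `r == 54`, while a very large y is rejected instead of silently wrapping around.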

Any tip?

cornuz
  • Harold, as I said, I know it can't exist for all values. But that depends on the values, not on the data types. E.g. f(4,5) can still be done, even when 4 and 5 are stored as 64-bit integers. It's easy to check, depending on the function used, for overflows (in that case I wouldn't use the mapping). I was just wondering whether relaxing the reversibility requirement could bring any benefit in terms of space usage – cornuz Mar 04 '12 at 20:36
  • You do realize that there are `(2^(2^128))^64` different functions that fulfill your requirements? p.s. not making up a big number - this is the number of functions from 128 bits to 64 bits. – amit Mar 04 '12 at 20:52
  • How about `((x + y)*(x + y) + x - y)/2` then, as long as it doesn't overflow anyway. – harold Mar 04 '12 at 21:10

2 Answers


CRC polynomials are fast to compute and have great diffusion. I am sure you will find libraries for your favorite language. Concatenate both integers into 128 bits and calculate the CRC.

Keep in mind that you cannot map 128 bits into 64 bits without collisions.

andand
Luka Rahne
  • Thanks for your tip. Collisions are not acceptable; I need to detect whether any of the given input values would overflow 64 bits and, if so, take different actions. – cornuz Mar 06 '12 at 15:05
  • And what is the distribution of the input? Which input values are most likely? – Luka Rahne Mar 06 '12 at 16:18
  • The algorithm I'm aiming at is motivated by and would be most useful in information retrieval applications, with `(x,y)` typically being `(term,doc)`. Both would be unsigned numeric identifiers, with term having a [Zipfian](http://en.wikipedia.org/wiki/Zipf's_law) distribution (few terms are very frequent). However, I cannot really assume any distribution nor unsigned numbers, as this is meant to be part of general relational processing. – cornuz Mar 06 '12 at 16:45
  • Again, no assumption can be really made, but in the motivating scenario the number of distinct values for both x and y would be a few millions. But the numeric identifiers would not necessarily be contiguous – cornuz Mar 06 '12 at 16:55

To encode two 64-bit integers into a single unique number, there are 2^64 * 2^64 = 2^128 possible combinations of inputs, so by the Pigeonhole Principle the output must be able to take at least 2^128 distinct values. In other words, you need a capacity of 128 bits to hold all the possible outputs.


> I know it can't exist for all values. But that depends on the values, not on the data types. E.g. f(4,5) can still be done, even when 4 and 5 are stored as 64bit integers. It's easy to check, depending on the function used, for overflows (in that case I wouldn't use the mapping).

That said, as you say, you could cap the range of values accepted for your 64-bit inputs. The output can then be a 64-bit signed or unsigned integer.

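With a signed output, an implementation in C#:

public static long GetHashCode(long a, long b)
{
    // Only inputs within the 32-bit signed range can be paired without overflow.
    if (a < int.MinValue || a > int.MaxValue || b < int.MinValue || b > int.MaxValue)
        throw new ArgumentOutOfRangeException();

    // Map each signed input to a non-negative value: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    var A = (ulong)(a >= 0 ? 2 * a : -2 * a - 1);
    var B = (ulong)(b >= 0 ? 2 * b : -2 * b - 1);
    // Pair A and B, then halve so that the result fits in the non-negative half of long.
    var C = (long)((A >= B ? A * A + A + B : A + B * B) / 2);
    // The sign of the output records whether a and b have the same sign.
    return a < 0 && b < 0 || a >= 0 && b >= 0 ? C : -C - 1;
}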

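With an unsigned output, an implementation in C#:

public static ulong GetHashCode(long a, long b)
{
    // Only inputs within the 32-bit signed range can be paired without overflow.
    if (a < int.MinValue || a > int.MaxValue || b < int.MinValue || b > int.MaxValue)
        throw new ArgumentOutOfRangeException();

    // Map each signed input to a non-negative value: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    var A = (ulong)(a >= 0 ? 2 * a : -2 * a - 1);
    var B = (ulong)(b >= 0 ? 2 * b : -2 * b - 1);
    // Pair A and B; the largest possible result is exactly ulong.MaxValue.
    return A >= B ? A * A + A + B : A + B * B;
}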

The unsigned implementation will be slightly faster because it performs fewer calculations. The lower and upper bounds of the inputs that can be uniquely paired are int.MinValue (-2147483648) and int.MaxValue (2147483647). The original function is taken from here. The Elegant Pairing function mentioned in the link is the most space-efficient possible, since it maps to every single point in the available space. For more on similar methods, see Mapping two integers to one, in a unique and deterministic way.
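The claim that the pairing reaches every point of the available space can be checked by brute force on a small range. The following is only a sketch: `PairingCheck`, `Pair` and `IsDenseBijection` are hypothetical names, `Pair` repeats the arithmetic of the unsigned version above without the range guard, and the check is meant for small `n` only.

```csharp
using System;
using System.Collections.Generic;

public static class PairingCheck
{
    // Same arithmetic as the unsigned version above, without the range guard.
    public static ulong Pair(long a, long b)
    {
        var A = (ulong)(a >= 0 ? 2 * a : -2 * a - 1);
        var B = (ulong)(b >= 0 ? 2 * b : -2 * b - 1);
        return A >= B ? A * A + A + B : A + B * B;
    }

    // Returns true when every pair in [-n, n-1] x [-n, n-1] maps to a distinct
    // value and all values fall in 0 .. (2n)^2 - 1, i.e. the range has no gaps.
    public static bool IsDenseBijection(int n)
    {
        var seen = new HashSet<ulong>();
        ulong limit = (ulong)(2 * n) * (ulong)(2 * n);
        for (long a = -n; a < n; a++)
        {
            for (long b = -n; b < n; b++)
            {
                ulong z = Pair(a, b);
                // HashSet.Add returns false when z was already produced (a collision).
                if (z >= limit || !seen.Add(z)) return false;
            }
        }
        return seen.Count == 4 * n * n;
    }
}
```

A call such as `PairingCheck.IsDenseBijection(8)` exercises 256 input pairs. The same reasoning, scaled up to the full 32-bit input range, is why the output exactly fills the 64-bit unsigned range.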

Community
nawfal