24

The problem in general: I have a big 2D point space, sparsely populated with dots. Think of it as a big white canvas sprinkled with black dots. I have to iterate over and search through these dots a lot. The canvas (point space) can be huge, bordering on the limits of int, and its size is unknown before points are set in it.

That brought me to the idea of hashing:

Ideal: I need a hash function that takes a 2D point and returns a unique uint32, so that no collisions can occur. You can assume that the number of dots on the canvas is easily countable by uint32.

IMPORTANT: It is impossible to know the size of the canvas beforehand (it may even change), so things like

canvaswidth * y + x

are sadly out of the question.

I also tried a very naive

abs(x) + abs(y)

but that produces too many collisions.

Compromise: A hash function that provides keys with a very low probability of collision.

Any ideas anybody? Thanks for any help.

Best regards, Andreas T.

Edit: I had to change something in the question text: I changed the assumption "able to count the number of points of the canvas with uint32" into "able to count the dots on the canvas (or the number of coordinate pairs to store) with uint32". My original question didn't make much sense, because I would have had a sqrt(max(uint32)) x sqrt(max(uint32)) sized canvas, which is uniquely representable by a 16-bit shift and OR.

I hope this is OK, since all answers still make the most sense with the updated assumptions.

Sorry for that.

AndreasT

11 Answers

34

Cantor's enumeration of pairs

   n = ((x + y)*(x + y + 1)/2) + y

might be interesting, as it's closest to your original canvaswidth * y + x but will work for any non-negative x or y. But for a real-world int32 hash, rather than a mapping of pairs of integers to integers, you're probably better off with a bit manipulation such as Bob Jenkins' mix, calling it with x, y and a salt.
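
For illustration, the pairing above can be sketched in C. This is only a sketch under assumptions that are not in the answer: the name cantor_pair is made up, the inputs are taken to be non-negative, and the result is widened to 64 bits because the value grows roughly like (x + y)^2 / 2. It stays exact as long as x + y is below 2^32; a real table would still have to reduce the 64-bit value to a 32-bit index, which is where collisions come back in.

```c
#include <stdint.h>

/* Cantor pairing for non-negative inputs (hypothetical helper name).
 * Assumption: x + y < 2^32, so every intermediate value fits in 64 bits.
 * Beyond that the unsigned arithmetic wraps and uniqueness is lost. */
uint64_t cantor_pair(uint32_t x, uint32_t y)
{
    uint64_t s = (uint64_t)x + (uint64_t)y;
    /* s * (s + 1) is a product of consecutive integers, hence even,
     * so the division by 2 is exact. */
    return s * (s + 1) / 2 + y;
}
```

For example, cantor_pair(3, 4) is 7 * 8 / 2 + 4 = 32, while cantor_pair(4, 3) is 31, so the mapping is not symmetric in x and y.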

Pete Kirkham
  • 2
    +1. My answer is faster to execute, but I tip my hat to your excellent answer! :-) – Jason Cohen Mar 25 '09 at 18:42
  • Thanks, this hash has the advantage that it works much better in practice than (y<<16)^x. The problem is that when you take it modulo some number to find out the real index in the hashmap, the upper bits are just discarded and you get a lot of collisions. – martinus Apr 08 '09 at 04:54
16

A hash function that is GUARANTEED collision-free is not a hash function :)

Instead of using a hash function, you could consider using binary space partition trees (BSPs) or XY-trees (closely related).

If you want to hash two uint32's into one uint32, do not use things like Y & 0xFFFF because that discards half of the bits. Do something like

(x * 0x1f1f1f1f) ^ y

(you need to transform one of the variables first to make sure the hash function is not commutative)
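
As a rough sketch in C, assuming the coordinates have already been converted to uint32_t (a signed int can simply be cast); the function name hash_mul_xor is made up for illustration:

```c
#include <stdint.h>

/* Multiply x by an odd constant (a bijection modulo 2^32), then mix in y.
 * The multiplication makes the function non-commutative, so (x, y) and
 * (y, x) generally hash to different values. */
uint32_t hash_mul_xor(uint32_t x, uint32_t y)
{
    return (uint32_t)(x * 0x1f1f1f1fu) ^ y;
}
```

Because the multiplication is a bijection and XOR with a fixed value is one too, two points that share an x but differ in y (or share a y but differ in x) never collide; collisions only happen between points that differ in both coordinates.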

Antti Huima
  • Hi, a hash function, to my knowledge, only has to provide a surjective mapping from a larger space to a smaller (hash) space. Your function looks interesting; unfortunately I do not understand how the ones help here. It looks like it runs at least 8 bits of x into the wall (?) – AndreasT Mar 25 '09 at 17:49
  • "hash function to my knowledge only has to provide a surjective mapping from a larger space to a smaller (hash) space". I assume this is a reply to "a hash function that is GUARANTEED collision-free is not a hash function". Both right - for proof, see http://en.wikipedia.org/wiki/Counting_argument – user9876 Mar 25 '09 at 18:17
  • 1
    x * 0x1f1f1f1f does not run any bits of x into the wall; multiplying by an odd number modulo 2**32 is a bijective mapping! – Antti Huima Mar 25 '09 at 20:00
5

Like Emil, but handles 16-bit overflows in x in a way that produces fewer collisions, and takes fewer instructions to compute:

hash = ( y << 16 ) ^ x;
Jason Cohen
  • 1
    Thanks, I just wanted to comment on why I didn't like this, but when I read this and operated my brain for a change, I understood that under the above assumption I cannot have a bigger canvas than sqrt(MAX_UINT) to the power of 2, and that's uniquely storable in 2x16 bits... – AndreasT Mar 25 '09 at 17:30
  • 1
    This is nicely reversible if you know the original x and y are 16 bit. The only catch is that since the sign bit is treated specially, you have to be a little careful when a negative x is a possibility. I guess exact implementation will depend on how your language treats sign bits, and how it defines `>>` and `<<`, e.g. `y = hash>>16; x=(hash<<16)>>16; if (x<0) { y = y^-1 }`. – starwed Dec 03 '12 at 02:43
2

Your "ideal" is impossible.

You want a mapping (x, y) -> i where x, y, and i are all 32-bit quantities, which is guaranteed not to generate duplicate values of i.

Here's why: suppose there is a function hash() so that hash(x, y) gives a different integer value for every pair. There are 2^32 (about 4 billion) values for x, and 2^32 values of y. So hash(x, y) needs 2^64 (about 18 million trillion) possible results. But there are only 2^32 possible values in a 32-bit int, so the result of hash() won't fit in a 32-bit int.

See also http://en.wikipedia.org/wiki/Counting_argument

Generally, you should always design your data structures to deal with collisions. (Unless your hashes are very long (at least 128 bit), very good (use cryptographic hash functions), and you're feeling lucky).

user9876
  • Yeah thx. I am aware of that. My problem is, though, that the number of dots on the canvas is very low compared to the canvas size, so I thought there should be some way to stay collision-free. Some ingenious encoding trick, or something. – AndreasT Apr 07 '09 at 13:06
1

You can do

a >= b ? a * a + a + b : a + b * b

taken from here.

That works for points with non-negative coordinates. If your coordinates can be negative too, then you will have to do:

A = a >= 0 ? 2 * a : -2 * a - 1;
B = b >= 0 ? 2 * b : -2 * b - 1;
A >= B ? A * A + A + B : A + B * B;
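
The signed variant above can be sketched in C as follows. The names fold_sign and szudzik_pair_signed are made up, and the choice of a 64-bit result for 32-bit inputs is an assumption for illustration, since the formula needs roughly twice the input width to stay collision-free:

```c
#include <stdint.h>

/* Fold a signed value onto the non-negative integers:
 * 0, -1, 1, -2, 2, ...  ->  0, 1, 2, 3, 4, ...
 * Widening to 64 bits first keeps INT32_MIN from overflowing. */
static uint64_t fold_sign(int32_t v)
{
    return v >= 0 ? 2 * (uint64_t)v
                  : 2 * (uint64_t)(-(int64_t)v) - 1;
}

/* Pairing function from the answer, applied to the folded values.
 * With 32-bit inputs the largest result is exactly 2^64 - 1, so the
 * 64-bit return type never overflows. */
uint64_t szudzik_pair_signed(int32_t a, int32_t b)
{
    uint64_t A = fold_sign(a);
    uint64_t B = fold_sign(b);
    return A >= B ? A * A + A + B : A + B * B;
}
```

Small inputs give small outputs: the nine points with coordinates in -1..1 map to nine distinct values below 9.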

But to restrict the output to uint you will have to keep an upper bound on your inputs, and if so, it turns out that you know the bounds. In other words, in programming it's impractical to write a function without having an idea of the integer type your inputs and output can be, and if so there will definitely be a lower bound and an upper bound for every integer type.

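// needs: using System;
public uint GetHashCode(int a, int b)
{
    // Both inputs must fit in 16 unsigned bits for the result to be unique.
    if (a > ushort.MaxValue || b > ushort.MaxValue ||
        a < ushort.MinValue || b < ushort.MinValue)
    {
        throw new ArgumentOutOfRangeException();
    }

    // Multiply by 2^16 (the number of distinct ushort values) so that a and b cannot overlap.
    return (uint)a * (ushort.MaxValue + 1u) + (uint)b; //very good space/speed efficiency
    //or whatever your function is.
}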

If you want the output to be strictly uint for an unknown range of inputs, then there will be some number of collisions, depending on that range. What I would suggest is a function that is allowed to overflow, but unchecked. Emil's solution is great; in C#:

return unchecked((uint)((a & 0xffff) << 16 | (b & 0xffff))); 

See "Mapping two integers to one, in a unique and deterministic way" for a plethora of options.

nawfal
1

Depending on your use case, it might be possible to use a Quadtree and replace points with the string of branch names. It is actually a sparse representation for points, and it will need a custom Quadtree structure that extends the canvas by adding branches when you add points off the canvas, but it avoids collisions and you'll have benefits like quick nearest-neighbor searches.

Emre Sahin
1

If you can do a = ((y & 0xffff) << 16) | (x & 0xffff) then you could afterward apply a reversible 32-bit mix to a, such as Thomas Wang's

/* requires #include <stdint.h> */
uint32_t hash(uint32_t a)
{
    a = (a ^ 61) ^ (a >> 16);
    a = a + (a << 3);
    a = a ^ (a >> 4);
    a = a * 0x27d4eb2d;
    a = a ^ (a >> 15);
    return a;
}

That way you get a random-looking result rather than high bits from one dimension and low bits from the other.

  • The solution given here http://stackoverflow.com/a/3880895/661933 is slightly better as far as distribution goes. – nawfal Dec 27 '12 at 08:38
1

You can recursively divide your XY plane into cells, then divide these cells into sub-cells, etc.

Gustavo Niemeyer invented the Geohash geocoding system in 2008.

Amazon's open source Geo Library computes the hash for any longitude-latitude coordinate. The resulting Geohash value is a 63-bit number. The probability of collision depends on the hash's resolution: if two objects are closer than the intrinsic resolution, the calculated hash will be identical.
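
The recursive cell subdivision can be illustrated on plain integer coordinates with a Z-order (Morton) key, which uses the same bit-interleaving idea that Geohash applies to longitude and latitude. This is not the Geohash algorithm or the Amazon library's API, only a sketch; the names spread_bits and morton_key are made up, and unsigned 32-bit coordinates with a 64-bit key are assumptions:

```c
#include <stdint.h>

/* Spread the 32 bits of v so that bit i moves to bit 2*i of a 64-bit
 * word, leaving the odd bit positions zero. */
static uint64_t spread_bits(uint32_t v)
{
    uint64_t x = v;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x << 2))  & 0x3333333333333333ULL;
    x = (x | (x << 1))  & 0x5555555555555555ULL;
    return x;
}

/* Interleave x (even bits) and y (odd bits). Each pair of key bits,
 * read from the top down, selects one of four sub-cells at the next
 * level of subdivision. */
uint64_t morton_key(uint32_t x, uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}
```

At full resolution the 64-bit key is unique per point; dropping low-order bit pairs coarsens the cell, which is when nearby points start to share a key.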


Read more:

https://en.wikipedia.org/wiki/Geohash
https://aws.amazon.com/fr/blogs/mobile/geo-library-for-amazon-dynamodb-part-1-table-structure/
https://github.com/awslabs/dynamodb-geo

rjobidon
1

Perhaps?

hash = ((y & 0xFFFF) << 16) | (x & 0xFFFF);

Works as long as x and y can be stored as 16-bit integers. No idea how many collisions this causes for larger integers, though. One idea might be to still use this scheme but combine it with a compression scheme, such as reducing each coordinate modulo 2^16.
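
A sketch of this scheme in C, including the reverse mapping, assuming both coordinates really do fit in 16 signed bits. The names pack16 and unpack16 are made up for illustration:

```c
#include <stdint.h>

/* Pack two 16-bit coordinates into one 32-bit key: y in the high half,
 * x in the low half. Going through uint16_t masks off the sign
 * extension, so a negative x cannot disturb the bits of y. */
uint32_t pack16(int16_t x, int16_t y)
{
    return ((uint32_t)(uint16_t)y << 16) | (uint16_t)x;
}

/* Recover the coordinates. Converting a uint16_t above 32767 back to
 * int16_t is implementation-defined before C23; on the usual
 * two's-complement targets it yields the original negative value. */
void unpack16(uint32_t key, int16_t *x, int16_t *y)
{
    *x = (int16_t)(uint16_t)(key & 0xFFFFu);
    *y = (int16_t)(uint16_t)(key >> 16);
}
```

Within the 16-bit range this is a bijection, so there are no collisions at all; outside it, only the low 16 bits of each coordinate survive.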

Emil H
0

If you're already using a language or platform in which all objects (even primitive ones like integers) have built-in hash functions (Java platform languages like Java, .NET platform languages like C#, and others like Python, Ruby, etc.), you may use the built-in hash values as a building block and add your own "hashing flavor" to the mix. Like:

// C# code snippet
public class SomeVerySimplePoint {

    public int X;
    public int Y;

    public override int GetHashCode() {
        return ( Y.GetHashCode() << 16 ) ^ X.GetHashCode();
    }

}

It may also be handy to have test cases, such as a predefined million-point set, to run against each candidate hash algorithm and compare aspects like computation time, memory required, key collision count, and edge cases (very large or very small values).

underscore
0

The Fibonacci hash works very well for integer pairs.

Multiplier (32-bit): 0x9E3779B9

For other word sizes w, use 2^w / phi = 2^w * (sqrt(5)-1)/2, rounded to an odd integer.

a1 + a2*multiplier

This will give very different values for pairs that are close together.

I do not know about the result with all pairs.
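
A minimal C sketch of the formula above for 32-bit words. The names FIB_MULT_32 and fib_pair_hash are made up, and taking the coordinates as uint32_t (casting signed values) is an assumption:

```c
#include <stdint.h>

/* 2^32 / phi, rounded to an odd integer (the multiplier given above). */
#define FIB_MULT_32 0x9E3779B9u

/* a1 + a2 * multiplier, with the usual wraparound modulo 2^32. */
uint32_t fib_pair_hash(uint32_t a1, uint32_t a2)
{
    return a1 + a2 * FIB_MULT_32;
}
```

Stepping a2 by one moves the result by about 0.618 * 2^32, so vertically adjacent points land far apart, while stepping a1 by one only moves it by one.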

lkreinitz