
Given n integer IDs, I wish to map every possible set of up to k IDs to a constant value. What I'm looking for is a way to translate sets (e.g. {1, 5}, {1, 3, 5} and {1, 2, 3, 4, 5, 6, 7}) to unique values.

Guarantees:

  • n < 100 and k < 10 (again: set sizes will range in [1, k]).
  • The order of IDs doesn't matter: {1, 5} == {5, 1}.
  • All combinations are possible, but some may be excluded.
  • All sets and values are constant and made only once. No deletes or inserts, no value updates.
  • Once generated, the only operations taking place will be look-ups.
  • Look-ups will be frequent and one-directional (given set, look up value).
  • There is no need to sort (or otherwise organize) the values.

Additionally, it would be nice (but not obligatory) if "neighboring" sets (drop one ID, add one ID, swap one ID, etc.) were easy to reach, as well as "all sets that include at least this set".

Any ideas?

Gaminic
  • Are the values inside a set unique? If so, use Zobrist hashing, or the product of the i-th primes. – wildplasser Nov 07 '12 at 16:20
  • If I understood the question correctly, there are nchoosek(99, 9) unique sets, which requires about 2^40 unique integers... – Aki Suihkonen Nov 07 '12 at 16:27
  • @Aki: Yes, filtering will be one of the main hurdles. The combination (9, 99) is a very unlikely scenario; higher values of 'k' will only appear for lower values of 'n' and vice versa. – Gaminic Nov 07 '12 at 16:37
  • Apparently I was underestimating... 99 over 9 is just one of the possible sets... there are about 1000 more, so the actual number of sets is more like 2^50. – Aki Suihkonen Nov 07 '12 at 16:41
  • @Aki: 2^44 for (10, 100), roughly. Again: filtering is a major requirement for this, for more reasons than the hashing alone. – Gaminic Nov 07 '12 at 16:51

3 Answers


Enumerate using the product of primes.

  • a -> 2
  • b -> 3
  • c -> 5
  • d -> 7
  • et cetera

Now hash(ab) := 6 and hash(abc) := 30.

And a nice side effect is that, if "ab" is a subset of "abc", then:

hash(abc) % hash(ab) == 0

and

hash(abc) / hash(ab) == hash(c)

The bad news: you might run into overflow. The 100th prime is 541, and 64 bits cannot accommodate 541**10. Overflow will not stop it from working as a hash function (although wrapped values are no longer guaranteed to be unique); only the subset trick will fail to work. The same method has been applied to anagrams.
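A minimal sketch of the prime-product idea, assuming ids run from 0 to 99 and using Python, whose integers never overflow (so the subset test stays valid here, unlike with a fixed 64-bit type). The names `first_primes`, `PRIMES`, `prime_hash` and `is_subset` are made up for this illustration.

```python
from math import prod


def first_primes(count):
    """Return the first `count` primes by trial division (fine for 100)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


# Assumption: id i is assigned PRIMES[i], for ids 0..99.
PRIMES = first_primes(100)


def prime_hash(ids):
    """Product of the primes assigned to the ids; order does not matter."""
    return prod(PRIMES[i] for i in ids)


def is_subset(small_hash, big_hash):
    """True if the set behind small_hash is contained in the one behind big_hash."""
    return big_hash % small_hash == 0


ab = prime_hash({0, 1})      # 2 * 3
abc = prime_hash({0, 1, 2})  # 2 * 3 * 5

# Neighboring sets: multiply to add an id, divide to remove one.
ab_plus_c = ab * PRIMES[2]
abc_minus_c = abc // PRIMES[2]
```

With a fixed-width integer type the same code would need a wider type or a check before multiplying, since the product of a handful of three-digit primes quickly exceeds 2**64.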

The other option is Zobrist hashing. It is analogous to the primes method, but instead of primes you use a fixed table of (random) numbers, and instead of multiplying you use XOR. For a fixed small set like yours (it needs << ~70 bits), it might be possible to tune the Zobrist table to avoid collisions entirely (yielding a perfect hash).
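A rough sketch of Zobrist hashing for this case, assuming ids 0 to 99 and a table of 64-bit random numbers generated from a fixed seed. The names `ZOBRIST` and `zobrist_hash` are illustrative, and no tuning for collision-freeness is attempted here.

```python
import random
from functools import reduce
from operator import xor

# Fixed seed so the table (and therefore every hash) is reproducible.
_rng = random.Random(12345)

# Assumption: one random 64-bit number per id, for ids 0..99.
ZOBRIST = [_rng.getrandbits(64) for _ in range(100)]


def zobrist_hash(ids):
    """XOR of the table entries of all ids; order does not matter."""
    return reduce(xor, (ZOBRIST[i] for i in ids), 0)


h15 = zobrist_hash({1, 5})
h135 = zobrist_hash({1, 3, 5})

# XOR is its own inverse, so the same operation adds or removes an id.
h15_plus_3 = h15 ^ ZOBRIST[3]
h135_minus_3 = h135 ^ ZOBRIST[3]
```

Unlike the prime product, the XOR value always fits in 64 bits, but it gives no subset test, and collisions are possible unless the table is checked against the actual sets.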

And the final (and simplest) way is to use a 100-bit bitmap and treat that as the hash value (maybe after taking it modulo the table size).
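As an illustration of the bitmap variant, here is a sketch that assumes ids 0 to 99 and uses a Python integer as the 100-bit bitmap, with a plain dict as the lookup table. The helper names (`bitmap`, `includes`, `values_by_mask`) and the stored values are hypothetical.

```python
def bitmap(ids):
    """Set bit i for every id i; the resulting integer identifies the set."""
    mask = 0
    for i in ids:
        mask |= 1 << i
    return mask


def includes(big_mask, small_mask):
    """True if every id in small_mask is also present in big_mask."""
    return (big_mask & small_mask) == small_mask


# The bitmap itself is the key; the dict does the hashing.
values_by_mask = {
    bitmap({1, 5}): "A",
    bitmap({1, 3, 5}): "B",
}

value = values_by_mask[bitmap({5, 1})]

# "All sets that include at least this set": a linear scan over the keys.
wanted = bitmap({1})
supersets = [v for m, v in values_by_mask.items() if includes(m, wanted)]

# Neighbor: toggling bit 3 turns {1, 5} into {1, 3, 5}.
neighbor = bitmap({1, 5}) ^ (1 << 3)
```

In a language without big integers the same idea would need two 64-bit words (or a bitset type) per key.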

And a totally unrelated method is to just build a decision tree on the bits of the bitmap (the tree would have a maximal depth of k). A related structure is a k-d tree on bit values.
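One way to read the decision-tree idea is as a trie keyed on the sorted ids, which also has depth at most k. That is a swapped-in interpretation rather than necessarily what was meant above, and the names `trie_insert`, `trie_lookup` and `_VALUE` are invented for this sketch.

```python
# Sentinel key under which a node stores the value of the set ending there.
_VALUE = object()


def trie_insert(root, ids, value):
    """Walk/create one level per id (in sorted order) and store the value."""
    node = root
    for i in sorted(ids):
        node = node.setdefault(i, {})
    node[_VALUE] = value


def trie_lookup(root, ids):
    """Return the stored value for the set, or None if it was never inserted."""
    node = root
    for i in sorted(ids):
        node = node.get(i)
        if node is None:
            return None
    return node.get(_VALUE)


root = {}
trie_insert(root, {1, 5}, "A")
trie_insert(root, {1, 3, 5}, "B")

found = trie_lookup(root, {5, 1})
missing = trie_lookup(root, {1, 3})
```

Each lookup touches at most k nodes, and excluded combinations simply never get a node.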

wildplasser
  • Very elegant solution. I'll solve the overflow problem for higher ranges of 'n' and 'k' later, as they are a problem for more reasons than the hashing alone. – Gaminic Nov 07 '12 at 16:42
  • I once used the primes thing for detecting anagrams, and I did not run into overflow (using a test set of 100K (Dutch) words). I did sort the letters by frequency, such that {e,t,n} get the lower primes, and {q,x,y} get the higher ones. BTW: the 26th prime is 101. – wildplasser Nov 07 '12 at 16:46
  • Am I correct to assume I can travel between hashes of neighboring sets by dividing by prime[X] (remove id X), multiplying by prime[X] (add id X) and combinations of both? – Gaminic Nov 07 '12 at 16:46
  • Yes, unless you get struck by overflow. (I don't think modular division works the same after the top bits have fallen off.) – wildplasser Nov 07 '12 at 16:48

Maybe not the best solution, but you can do the following:

  1. Sort the set from lowest to highest with a simple IntegerComparator
  2. Append each item of the set to a String

So if you have {2,5,9,4}, the first step gives {2,4,5,9} and the second gives "2459".

This way you will get a unique String from a unique set. If you really need to map them to an integer value, you can hash the string after that.
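A small sketch of the sort-and-concatenate idea in Python. It assumes a comma separator between items, which is not part of the steps above but keeps {1, 2} and {12} from both turning into "12". The name `string_key` is illustrative.

```python
def string_key(ids):
    """Sort the ids and join them with a separator to get a canonical string."""
    return ",".join(str(i) for i in sorted(ids))


key = string_key({2, 5, 9, 4})

# If an integer is really needed, the string can be hashed afterwards.
# Note: Python's built-in hash() of a str varies between interpreter runs.
int_key = hash(key)
```

Zero-padding every id to two digits would work equally well as a separator for n < 100.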

A second way I can think of is to store them in a Java Set and use that Set as the key in a HashMap.
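For readers not on Java, a rough Python equivalent of "set as map key" is a `frozenset` used as a dict key. The table contents below are placeholder values, and `values_by_set` is a made-up name.

```python
# frozenset is hashable and order-independent, so it can key a dict directly.
values_by_set = {
    frozenset({1, 5}): "A",
    frozenset({1, 3, 5}): "B",
}

# Look-ups build the same kind of key from the query ids, in any order.
value = values_by_set[frozenset([5, 1])]

# Excluded combinations are simply absent from the table.
absent = values_by_set.get(frozenset({2, 4}))
```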

Rafael T
  • Requires zero-filling to distinguish {1, 2} from {12}. I'm not using Java, but I'll look at the implementation of both and see if it's an option. – Gaminic Nov 07 '12 at 16:56

Calculate a 'diff' from each set:

{1, 6, 87, 89} = {1, 5, 81, 2, 0, 0, ...}
{1, 2, 3, 4} = {1, 1, 1, 1, 0, 0, 0, 0, ...}

Then binary encode each number with a variable length encoding and concatenate the bits.

It's hard to compare the sets (except for the first few equal bits), but because there can't be many large intervals in a set, all possible values just might fit into 64 bits. (slack of 16 bits at least...)
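A sketch of the diff-and-encode idea, assuming ids start at 1 and using Elias gamma as the variable-length code. The answer does not name a specific code, so that choice and the helper names (`deltas`, `elias_gamma`, `encode`) are assumptions. The sketch does not check that the result fits in 64 bits; that depends on the code chosen.

```python
def deltas(ids):
    """Differences between consecutive sorted ids; the first is taken from 0."""
    result = []
    previous = 0
    for i in sorted(ids):
        result.append(i - previous)
        previous = i
    return result


def elias_gamma(n):
    """Elias gamma code of n >= 1: (len - 1) zeros, then n in binary."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode(ids):
    """Concatenate the gamma codes of the deltas into a single integer."""
    bits = "1"  # leading sentinel bit so that leading zeros are not lost
    for d in deltas(ids):
        bits += elias_gamma(d)
    return int(bits, 2)


small_gaps = encode({1, 2, 3, 4})
large_gaps = encode({1, 6, 87, 89})
```

Because the gamma code is a prefix code, different delta sequences give different bit strings, so distinct sets map to distinct integers.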

Aki Suihkonen