1

I've come across an interesting problem which I would love to get some input on.

I have a program that generates sets of numbers (based on some predefined conditions). Each set contains up to 6 integers in the range 1 to 100, and the numbers do not have to be unique.

I would like to somehow store every set that is created so that I can quickly check if a certain set with the exact same numbers (order doesn't matter) has previously been generated.

Speed is a priority in this case, as there might be up to 100k sets stored before the program stops (maybe more, but most of the time probably fewer)! Would anyone have any recommendations as to what data structures I should use and how I should approach this problem?

What I have currently is this:

Sort each set before storing it into a HashSet of Strings. The string is simply each number in the sorted set with some separator.

For example, the set {4, 23, 67, 67, 71} would get encoded as the string "4-23-67-67-71" and stored into the HashSet. Then for every new set generated, sort it, encode it and check if it exists in the HashSet.
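The approach described above could be sketched in Java roughly as follows. The class name `SeenSets` and the methods `encode` and `addIfNew` are made-up names for illustration, not something from the original program.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers every set seen so far, ignoring order.
final class SeenSets {
    private final Set<String> seen = new HashSet<>();

    // Builds the canonical key: sort a copy, then join the numbers with '-'.
    static String encode(int[] numbers) {
        int[] sorted = numbers.clone();
        Arrays.sort(sorted);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < sorted.length; i++) {
            if (i > 0) {
                sb.append('-');
            }
            sb.append(sorted[i]);
        }
        return sb.toString();
    }

    // Returns true if this set was not seen before (and records it).
    boolean addIfNew(int[] numbers) {
        return seen.add(encode(numbers));
    }
}
```

`Set.add` already reports whether the element was new, so a separate `contains` call before inserting is not needed.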

Thanks!

Mick

3 Answers

2

if you break it into pieces it seems to me that

  • creating a set (generate 6 numbers, sort, stringify) runs in O(1), since a set never has more than 6 numbers
  • checking if this string exists in the hashset is O(1)
  • inserting into the hashset is O(1)

you do this n times, which gives you O(n). this is already optimal as you have to touch every element once anyway :)

you might run into problems depending on the range of your random numbers. e.g. assume you generate only numbers between one and one, then there's obviously only one possible outcome ("1-1-1-1-1-1") and you'll have only collisions from there on. however, as long as the number of possible sequences is much larger than the number of elements you generate i don't see a problem.
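To get a feel for how large the space of possible outcomes is, here is a small sketch that counts them, assuming the question's parameters (values 1 to 100, sets of 1 to 6 numbers, duplicates allowed, order ignored). The class and method names are hypothetical. The count uses the standard formula for multisets: there are C(v + k - 1, k) multisets of size k over v distinct values.

```java
// Hypothetical helper for counting order-independent sets with repeats.
final class MultisetCount {

    // C(n, k), computed iteratively; every intermediate division is exact.
    static long binomial(int n, int k) {
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    // Number of non-empty multisets with 1..maxSize elements
    // drawn from `values` distinct values.
    static long countMultisets(int values, int maxSize) {
        long total = 0;
        for (int k = 1; k <= maxSize; k++) {
            total += binomial(values + k - 1, k);
        }
        return total;
    }
}
```

Under those assumptions, `MultisetCount.countMultisets(100, 6)` gives 1,705,904,745, so roughly 1.7 billion possible sets against about 100k generated ones.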

one tip: if you know the number of generated elements beforehand it would be wise to initialize the hashset with a matching initial capacity (i.e. new HashSet<String>( 100000 ) ), so it doesn't have to grow repeatedly while you insert.
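A small sketch of that tip, assuming Java's default load factor of 0.75: a HashSet grows once its size passes roughly capacity times load factor, so dividing the expected element count by 0.75 leaves room for all of them. The helper names below are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for pre-sizing a HashSet.
final class HashSetSizing {

    // Initial capacity that should hold `expectedElements` without a resize,
    // assuming the default load factor of 0.75.
    static int capacityFor(int expectedElements) {
        return (int) Math.ceil(expectedElements / 0.75);
    }

    // Creates an empty set sized for the expected number of encoded sets.
    static Set<String> newSetFor(int expectedElements) {
        return new HashSet<>(capacityFor(expectedElements));
    }
}
```

Whether this makes a measurable difference for 100k elements is something to profile rather than assume.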

p.s. now with other answers popping up i'd like to note that while there may be room for improvement on a microscopic level (i.e. using language specific tricks), your overall approach can't be improved.

kritzikratzi
  • Thanks Kritz, good to know I was on the right track. I'll definitely try it out now and see how it performs. The bit I was most worried about was the sorting, even if every set only has up to 6 elements. – Mick Jul 14 '12 at 15:07
  • well, it's only six numbers so it should be very fast. but if this turns out to be a troublemaker you can take a look at this stackoverflow question: http://stackoverflow.com/questions/1866031/generating-sorted-random-ints-without-the-sort-on . – kritzikratzi Jul 14 '12 at 15:18
2
  1. Create a class SetOfIntegers
  2. Implement a hashCode() method that will generate reasonably unique hash values
  3. Use HashMap to store your elements like put(hashValue,instance)
  4. Use containsKey(hashValue) to check if the same hashValue is already present

This way you will avoid sorting and conversion/formatting of your sets.
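One possible shape for such a class is sketched below. It deviates from steps 3 and 4 in one respect, which is an assumption on my part: a lookup keyed on the hash value alone would report two different sets as duplicates whenever their hashes happen to collide, so this sketch also overrides equals() and stores the instances directly in a HashSet. To stay order-independent without sorting, it keeps a count per value, assuming values are limited to 1..100 as in the question.

```java
import java.util.Arrays;

// Sketch of a multiset key; only the class name comes from the answer above.
final class SetOfIntegers {
    // counts[v] = how often value v occurs; index 0 is unused.
    private final byte[] counts = new byte[101];

    SetOfIntegers(int... numbers) {
        for (int n : numbers) {
            if (n < 1 || n > 100) {
                throw new IllegalArgumentException("out of range: " + n);
            }
            counts[n]++;
        }
    }

    // Two instances are equal when every value occurs equally often.
    @Override
    public boolean equals(Object other) {
        return other instanceof SetOfIntegers
                && Arrays.equals(counts, ((SetOfIntegers) other).counts);
    }

    // Derived from the counts, so it does not depend on input order.
    @Override
    public int hashCode() {
        return Arrays.hashCode(counts);
    }
}
```

With this in place, `seen.add(new SetOfIntegers(4, 23, 67, 67, 71))` on a `HashSet<SetOfIntegers>` returns false if an equal set was stored earlier. The trade-off is about 100 bytes per stored set and a hash over 101 entries, so it is not guaranteed to beat sorting six numbers.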

mazaneicha
  • Thanks Mazaneicha, I'll have a look into hashCode(), I've never actually overridden this function before. Would you have any tips for defining a hash function that, given the scope of my question, would not have collisions? – Mick Jul 14 '12 at 15:04
  • As always, StackOverflow is your friend :) Check out this post http://stackoverflow.com/questions/27581/overriding-equals-and-hashcode-in-java Good luck! – mazaneicha Jul 14 '12 at 15:10
2

Just use a java.util.BitSet for each set, adding integers to it with the set(int bitIndex) method, so you don't have to sort anything. Before adding a new BitSet, check a HashMap for an already existing equal BitSet. It will be really fast. Don't ever sort the values and use toString for that purpose if speed is important.

  • Thanks Christophe, that's a very interesting solution. I had a quick look at the BitSet javadocs and I have a question. Can this method account for having duplicate values? It seems like a great solution for unique sets. – Mick Jul 14 '12 at 15:02
  • Yes, sorry, I completely missed the "that do not have to be unique" part (and didn't check the sample values for the duplicated 67), so it won't work for you. Then use `int[]` arrays with the `java.util.Arrays.sort(array)` and `java.util.Arrays.equals(array1, array2)` static methods; it's the next fastest/easiest thing to do. – Christophe Bouchon Jul 14 '12 at 17:07
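A minimal sketch of the `int[]` suggestion from the last comment. One detail to be aware of: a plain `int[]` uses identity-based equals() and hashCode(), so it cannot be used directly as a HashSet element. The wrapper class `SortedKey` below is a hypothetical name that delegates to `Arrays.equals` and `Arrays.hashCode`.

```java
import java.util.Arrays;

// Hypothetical wrapper so a sorted int[] can serve as a HashSet element.
final class SortedKey {
    private final int[] sorted;

    SortedKey(int... numbers) {
        // Copy first so the caller's array is left untouched, then sort.
        sorted = numbers.clone();
        Arrays.sort(sorted);
    }

    // Content-based comparison of the sorted arrays.
    @Override
    public boolean equals(Object other) {
        return other instanceof SortedKey
                && Arrays.equals(sorted, ((SortedKey) other).sorted);
    }

    // Content-based hash, consistent with equals.
    @Override
    public int hashCode() {
        return Arrays.hashCode(sorted);
    }
}
```

Compared with the string encoding from the question, this skips building a String per set, though whether that matters at 100k sets is best checked by measuring.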