
Suppose we have m ordered sets and we want to find their intersection.

Which data structures should we use for the ordered sets and which algorithm would be the most efficient?

Related question: Algorithm for N-way merge

The literature on this appears to be huge, so a better question may be: which good implementations exist?

Community
  • It depends. Of course you can limit the search to the min/max of the set, but that only narrows the left/right boundaries of the tree to be searched. "Crawling up" a tree is rather hard (you'll need a stack), and you still need to look for an exact match for each element of the set. That could be cheap if the set is dense (e.g. a range), but in theory it is still log(N) per element, with N the size of the bracketed subtree. IOW: YMMV. – wildplasser Nov 26 '12 at 23:32
  • When you do a binary search, the time complexity is log N. If we keep the previous comparisons for the next number we will, as you say, limit the search to a bracket. Remembering those comparisons requires log N space per ordered set. This is as far as I have gotten so far. – Apostolis Xekoukoulotakis Nov 27 '12 at 00:15
  • The complexity is LOG(N) *per element*; exactly the same as for a tree or a subtree. – wildplasser Nov 27 '12 at 00:17
  • Why not mask them if the domain isn't too big? That is linear in the size of the domain (or in the range of the union of the sets), at the cost of memory usage. – ashley Nov 27 '12 at 23:28
  • Yes, I think that's a good answer if the domain is small. – Apostolis Xekoukoulotakis Nov 27 '12 at 23:50

2 Answers

1

You can build a binary tree whose nodes keep a link to their parent and implement the classic merge-style algorithm for intersection/union:

  1. Set iterA to the left-most (smallest) node of the tree (i.e. descend along left children until there is no left child).
  2. Set iterB to the first (smallest) element of the ordered set (the first slot if it is implemented as an ordered array, or the left-most node if it is a tree).
  3. Compare the items pointed to by iterA and iterB:
    • If iterA's item is lower: yield it as a union item and advance iterA.
    • If they are equal: yield the item as both a union item and an intersection item, and advance both iterA and iterB.
    • If iterA's item is greater: yield iterB's item as a union item and advance iterB.
  4. Repeat until one of the iterators is unable to advance.
  5. Yield the remaining items reachable from the other iterator as union items.
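
The steps above can be sketched as follows. This is a minimal illustration that assumes both inputs are sorted, duplicate-free Python lists rather than a tree and an iterator; the function name `merge_walk` is made up for this example.

```python
def merge_walk(a, b):
    """Walk two sorted, duplicate-free lists in a single pass.

    Returns (union, intersection) as two sorted lists.
    """
    union, intersection = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            # Lower: the item belongs to the union only; advance iterA.
            union.append(a[i])
            i += 1
        elif a[i] == b[j]:
            # Equal: the item belongs to both results; advance both iterators.
            union.append(a[i])
            intersection.append(a[i])
            i += 1
            j += 1
        else:
            # Greater: b's item belongs to the union only; advance iterB.
            union.append(b[j])
            j += 1
    # One side is exhausted; the rest of the other side is union-only.
    union.extend(a[i:])
    union.extend(b[j:])
    return union, intersection
```

Each loop iteration advances at least one index, so the walk takes at most len(a) + len(b) steps.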

To advance the binary tree iterator:

  • If the current node has a right child, descend to it and then keep descending to the left child while possible. Yield that item.
  • Otherwise, ascend to the parent, repeating for as long as we arrive from a right child. The first parent reached from its left child is the next item; yield it.
  • If we climb past the root this way, all items of the tree have already been yielded (end of collection).
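
A possible sketch of this iterator advance, assuming a plain unbalanced binary search tree with parent links. The `Node` class and the helpers `insert`, `leftmost`, `successor`, and `in_order` are hypothetical names introduced only for this illustration.

```python
class Node:
    def __init__(self, key, parent=None):
        self.key = key
        self.parent = parent
        self.left = None
        self.right = None


def insert(root, key):
    """Plain (unbalanced) BST insert that maintains parent links; returns the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key, node)
                break
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key, node)
                break
            node = node.right
        else:
            break  # key already present
    return root


def leftmost(node):
    """Descend along left children as far as possible."""
    while node.left is not None:
        node = node.left
    return node


def successor(node):
    """Return the next node in sorted order, or None at the end."""
    if node.right is not None:
        return leftmost(node.right)
    # Climb while we arrive from a right child.
    while node.parent is not None and node is node.parent.right:
        node = node.parent
    return node.parent  # None once we have climbed past the root


def in_order(root):
    """Yield all keys in ascending order without recursion or a stack."""
    node = leftmost(root) if root is not None else None
    while node is not None:
        yield node.key
        node = successor(node)
```

The parent link is what removes the need for an explicit stack: the climb in `successor` replays the path that a recursive traversal would have kept on its call stack.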

Update: If you know that your ordered set (walked by iterB) is much smaller than the tree, you can use a slightly more sophisticated algorithm for the intersection:

  1. Initially set iterB to the beginning of the ordered set (lowest value).
  2. Set iterA to the node that is the minimum upper bound of the value at iterB (the smallest node that is not less than it).
  3. Compare the items pointed to by iterA and iterB:
    • If they are equal: yield the item as part of the intersection.
  4. Advance iterB to the next value.
  5. Advance iterA to the minimum upper bound of the value at iterB, starting from the current iterA.
  6. Repeat until iterB has passed all items of the ordered set.

Advancing to the minimum upper bound from a given node works as follows:

  • If the value of the current node is less than the target:
    • Look for an upper bound by following the right children from the current node.
    • If even the right-most node of that branch is lower than the target: ascend to the parent for as long as we arrive from a right child, and restart from that node.
    • Otherwise, continue from the node where the upper bound was found.
    • Find the first node along its left children whose value is less than the target.
      • If none is found: the left-most node of that branch is the minimum upper bound.
      • Otherwise restart from that node (more precisely, use a sub-algorithm that walks over left-most and right-most nodes, narrowing the bounds).

The main idea of the bound search is to narrow the upper and lower bounds ("-" marks ignored subtrees, "..." marks the new search range):

for B < X < A
    U
   / \-
  L
-/ \...

for A < X < B
  L
-/ \
    U
.../ \-
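
The small-set variant (repeatedly seeking the minimum upper bound, starting from the current position) can be illustrated with a simplification: here a sorted Python list stands in for the tree, and the standard `bisect.bisect_left` with its `lo` argument plays the role of "advance to the minimum upper bound from the current node". This is not the tree walk described above, only the same idea on an array; the function name `intersect_small_into_large` is made up for this sketch.

```python
from bisect import bisect_left


def intersect_small_into_large(small, large):
    """Intersect a short sorted list with a much longer sorted list.

    For each item of `small`, binary-search its minimum upper bound in
    `large`, resuming from the position found for the previous item.
    """
    result = []
    pos = 0  # plays the role of iterA
    for x in small:  # x plays the role of the item at iterB
        pos = bisect_left(large, x, pos)  # first index with large[pos] >= x
        if pos == len(large):
            break  # nothing in `large` is >= x, so nothing later can match
        if large[pos] == x:
            result.append(x)
    return result
```

With k items in the small set and N in the large one, this performs k binary searches, each over the part of the large set not yet skipped, instead of a full N + k step merge.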
ony
  • If I understand correctly, you are proposing to search all terms linearly. That forces us to search all terms, i.e. O(N^2) complexity. – Apostolis Xekoukoulotakis Nov 27 '12 at 15:17
  • No. We walk over both collections together in a single pass. We do not walk over the whole second collection for each element of the first. That is actually how [std::set_intersection](http://www.cplusplus.com/reference/algorithm/set_intersection/) works. And that is O(N) complexity, since in the worst case we take (N+M) steps (if we union disjoint sets). – ony Nov 27 '12 at 15:52
  • You are right, it is O(N) complexity. I hope I made it clear that I am only trying to find the intersection. I use the term join because it is used in databases. But I have something else in mind: using binary search to skip numbers is one idea. Check this out: https://issues.apache.org/jira/browse/LUCENE-866 – Apostolis Xekoukoulotakis Nov 27 '12 at 16:18
0

This is only a sketch: please help me improve it.

This solution is based on using binary search to limit the search to n/2^i elements of each set, together with efficient data structures that remember those comparisons for the next number.

The first thing to note is that a balanced binary tree is good at binary search only when the search interval closely matches that of a (sub)tree.

The other two structures that support binary search are the array and the skip list. The array is inefficient for insertion and deletion, so the skip list seems the best choice.

We will need m arrays of size 64, one per set, each holding the elements of its set that were compared during the binary search, in the order in which the comparisons were made.

We will also need a doubly linked list into which all elements, from all sets, that were used in the binary searches are inserted. Using a skip list here reduces the number of comparisons even further.

The basic idea is this:

  1. Search for the minimum element in each set with a binary search.
  2. At each binary search step, add the newly compared element to the array of its set and to the doubly linked list.
  3. Either there is a common minimum element or there is not.
  4. Remove the smallest element from the doubly linked list. New searches start from the previous element in the set's binary search array and use half the distance used before. The earlier binary search elements in the arrays limit the new search to the smallest known interval.
  5. Go back to step 1.
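
As a rough starting point, here is a much simpler relative of this idea for m sets. It assumes each set is a sorted, duplicate-free Python list, and it only remembers the last position reached in each set (the lower end of the bracket) rather than the full history of comparisons; there are no skip lists or per-set comparison arrays. The function name `intersect_sorted` is made up for this sketch.

```python
from bisect import bisect_left


def intersect_sorted(sets):
    """Intersect m sorted, duplicate-free lists using binary-search skipping."""
    if not sets or not sets[0]:
        return []
    pos = [0] * len(sets)  # last position reached in each set
    result = []
    candidate = sets[0][0]
    while True:
        changed = False
        for k, s in enumerate(sets):
            # Skip ahead to the first element >= candidate, never moving back.
            pos[k] = bisect_left(s, candidate, pos[k])
            if pos[k] == len(s):
                return result  # this set is exhausted; no more matches
            if s[pos[k]] != candidate:
                # Candidate is missing here; the larger value takes over.
                candidate = s[pos[k]]
                changed = True
        if not changed:
            # Every set sits on the candidate: it is in the intersection.
            result.append(candidate)
            pos[0] += 1
            if pos[0] == len(sets[0]):
                return result
            candidate = sets[0][pos[0]]
```

For example, `intersect_sorted([[1, 3, 5, 7, 9, 11], [3, 4, 5, 9, 10], [0, 3, 9, 12]])` returns `[3, 9]`. The candidate only ever grows, so each set is scanned left to right once, with binary search used to jump over runs of non-matching elements.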