11

I'm calculating the intersection of two sets of sorted numbers in a time-critical part of my application. This calculation is the biggest bottleneck in the whole application, so I need to speed it up.

I've tried a bunch of simple options and am currently using this:

foreach (var index in firstSet)
{
    if (secondSet.BinarySearch(index) < 0)
        continue;

    //do stuff
}

Both firstSet and secondSet are of type List<int>.

I've also tried using LINQ:

var intersection = firstSet.Where(t => secondSet.BinarySearch(t) >= 0).ToList();

and then looping through intersection.

But as both of these sets are sorted I feel there's a better way to do it. Note that I can't remove items from sets to make them smaller. Both sets usually consist of about 50 items each.

Please help me, guys, as I don't have a lot of time to get this thing done. Thanks.

NOTE: I'm doing this about 5.3 million times. So every microsecond counts.

gligoran
  • Decided to create the **UNION** question, considering that's what I implemented: http://stackoverflow.com/questions/7165152/c-fastest-union-of-2-sets-of-sorted-values – Jonathan Dickinson Aug 23 '11 at 17:35

5 Answers

27

If you have two sets which are both sorted, you can implement a faster intersection than anything provided out of the box with LINQ.

Basically, keep two IEnumerator<T> cursors open, one for each set. At any point, advance whichever has the smaller value. If they match at any point, advance them both, and so on until you reach the end of either iterator.

The nice thing about this is that you only need to iterate over each set once, and you can do it in O(1) memory.

Here's a sample implementation - untested, but it does compile :) It assumes that both of the incoming sequences are duplicate-free and sorted according to the comparer provided (pass in Comparer<T>.Default for the natural ordering):

(There's more text at the end of the answer!)

static IEnumerable<T> IntersectSorted<T>(this IEnumerable<T> sequence1,
    IEnumerable<T> sequence2,
    IComparer<T> comparer)
{
    using (var cursor1 = sequence1.GetEnumerator())
    using (var cursor2 = sequence2.GetEnumerator())
    {
        // If either sequence is empty, the intersection is empty.
        if (!cursor1.MoveNext() || !cursor2.MoveNext())
        {
            yield break;
        }
        var value1 = cursor1.Current;
        var value2 = cursor2.Current;

        while (true)
        {
            int comparison = comparer.Compare(value1, value2);
            if (comparison < 0)
            {
                // value1 is smaller: advance the first sequence.
                if (!cursor1.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
            }
            else if (comparison > 0)
            {
                // value2 is smaller: advance the second sequence.
                if (!cursor2.MoveNext())
                {
                    yield break;
                }
                value2 = cursor2.Current;
            }
            else
            {
                // Match: yield it and advance both sequences.
                yield return value1;
                if (!cursor1.MoveNext() || !cursor2.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
                value2 = cursor2.Current;
            }
        }
    }
}
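
For example, a quick usage sketch (assuming the method above sits in a static class, as extension methods require; firstSet and secondSet are the question's sorted List<int> values):

foreach (var index in firstSet.IntersectSorted(secondSet, Comparer<int>.Default))
{
    //do stuff
}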

EDIT: As noted in the comments, in some cases you may have one input which is much larger than the other, in which case you could potentially save a lot of time using a binary search for each element from the smaller set within the larger set. This requires random access to the larger set, however (it's just a prerequisite of binary search). You can even make it slightly better than a naive binary search by using the match from the previous result to give a lower bound for the next search. So suppose you were looking for the values 1000, 2000 and 3000 in a set with every integer from 0 to 19,999. In the first iteration, you'd need to look across the whole set - your starting lower/upper indexes would be 0 and 19,999 respectively. After you'd found a match at index 1000, however, the next step (where you're looking for 2000) can start with a lower bound of index 1001. As you progress, the range in which you need to search gradually narrows. Whether this is worth the extra implementation cost is a different matter, however.
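
To illustrate, here's a minimal sketch of that narrowing idea, assuming sorted, duplicate-free List<int> inputs; the method name and shape are mine, not anything from the framework. It relies on the List<T>.BinarySearch(index, count, item, comparer) overload, which searches only a slice of the list, and on the fact that a failed search returns the bitwise complement of the insertion point:

static IEnumerable<int> IntersectBySearch(List<int> small, List<int> large)
{
    int lowerBound = 0; // narrows after every search, hit or miss
    foreach (var item in small)
    {
        // Search only the not-yet-excluded tail of the large list.
        int pos = large.BinarySearch(lowerBound, large.Count - lowerBound,
                                     item, Comparer<int>.Default);
        if (pos >= 0)
        {
            yield return item;
            lowerBound = pos + 1; // later items can only appear after this match
        }
        else
        {
            lowerBound = ~pos; // later items can't be before the insertion point
        }
    }
}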

Jon Skeet
  • This is really similar to the mergesort algorithm. – Gabe Aug 23 '11 at 16:58
  • It doesn't return the true intersection; you are assuming the lists have the same length and that one (or both) isn't empty. – Jonathan Dickinson Aug 23 '11 at 17:16
  • NOOOO, my belief in Jon Skeet has been shaken, as I believe there is a bug. Inputting the sequences {1,2} and {0,2} returns the sequence {1,2} but should only return {2}. The minor bug is the line `int value2 = cursor1.Current;` Of course the `1` should be a `2` in that line. – JBSnorro Aug 23 '11 at 17:20
  • @Jonathan: I don't think I'm assuming either of those. If either is empty, it will stop immediately due to the first calls to `MoveNext()`. After that, each iteration only advances one cursor unless there's a match, in which case it advances both. Try it! – Jon Skeet Aug 23 '11 at 17:20
  • @JBSnorro: Fixed, thanks. I did say it was untested :) Does the rest look okay? – Jon Skeet Aug 23 '11 at 17:21
  • @Jon, yes, the rest seems perfect. How nice of you to ask for my confirmation. – JBSnorro Aug 23 '11 at 17:23
  • Thank you very much Jon. This improves the speed quite a bit. – gligoran Aug 23 '11 at 18:58
  • Would I be correct in saying that a binary search to the next adjacent element would be faster when the count of the lower set is less than log2(N) of the other? – Spencer Rose Feb 05 '14 at 04:11
  • @SpencerRose: I don't really follow you, I'm afraid - but I suspect not. – Jon Skeet Feb 05 '14 at 06:46
  • The case I am trying to solve for is, say, 3 items in one list and 80,000 in the other. There must be a point where iterating over the entire 80k collection is slower than binary searching the larger list for what's in the smaller list. – Spencer Rose Feb 05 '14 at 23:18
  • @SpencerRose: Ah, I see what you mean. Sorry, I was confused by the "next adjacent element" part. Yes, in that case it would be faster to do a binary search - assuming you have random access to the larger collection. (My solution here works with arbitrary sequences, but in many cases you *would* have random access.) In fact, with random access you can do better than just doing a naive binary search each time. Will edit. – Jon Skeet Feb 06 '14 at 06:52
  • @JonSkeet Limiting the resultant searches would help a lot! I will add that to my solution :) cheers – Spencer Rose Feb 06 '14 at 08:03
8

Since both lists are sorted, you can arrive at the solution by iterating over them at most once (you may also get to skip part of one list, depending on the actual values they contain).

This solution keeps a "pointer" to the part of each list we have not yet examined, and compares the first unexamined number of each list. If one is smaller than the other, the pointer of the list it belongs to is incremented to point to the next number. If they are equal, the number is added to the intersection result and both pointers are incremented.

var firstCount = firstSet.Count;
var secondCount = secondSet.Count;
int firstIndex = 0, secondIndex = 0;
var intersection = new List<int>();

while (firstIndex < firstCount && secondIndex < secondCount)
{
    var comp = firstSet[firstIndex].CompareTo(secondSet[secondIndex]);
    if (comp < 0) {
        ++firstIndex;
    }
    else if (comp > 0) {
        ++secondIndex;
    }
    else {
        intersection.Add(firstSet[firstIndex]);
        ++firstIndex;
        ++secondIndex;
    }
}

The above is a textbook C-style approach to solving this particular problem, and given the simplicity of the code I would be surprised to see a faster solution.

Jon
  • Yup, this is basically the non-streaming version of my approach - although you're assuming that `CompareTo` always returns -1, 0 or 1, rather than the conditions being "less than 0", 0, and "more than 0". – Jon Skeet Aug 23 '11 at 17:22
  • @JonSkeet: True. C-style bug in there as well. :) Fixed it, thank you for spotting it. – Jon Aug 23 '11 at 17:25
  • +1 for a good approach. Thanks. Yours was a bit easier to understand and would be great if I wanted to fuse it with other code. One can, for example, do some kind of processing instead of adding the value to the intersection list. – gligoran Aug 23 '11 at 23:13
5

You're using a rather inefficient LINQ method for this sort of task; you should opt for Intersect as a starting point.

var intersection = firstSet.Intersect(secondSet);

Try this. If you measure it for performance and still find it unwieldy, cry for further help (or perhaps follow Jon Skeet's approach).

Anthony Pegram
  • I tried that before, but if I remember correctly it performed worse than the two versions above. Anyway, I think Jon Skeet's solution is as fast as it gets. – gligoran Aug 23 '11 at 19:00
  • @gligoran, that's curious. My own test of the Where/Binary search vs. Intersect observed a tremendous gain in performance for large data sets, but I'll leave it to you to recall your own observations using your own real data. – Anthony Pegram Aug 23 '11 at 19:10
  • I do have a large dataset, but the intersection occurs deep inside the algorithm, so I always intersect two sets of about 50 elements, and I have to do it about 5.3 million times. AFAIK Intersect does not require the sets to be sorted, so it would do 2500 passes of the double loop, whereas Jon Skeet's version does it in 50 if I'm correct. The Where/Binary approach would probably need 50*log_2(50), which is just under 300. But I may be wrong. – gligoran Aug 23 '11 at 23:10
  • @gligoran, `Intersect` would be the equivalent of loading the first set into a HashSet (1 pass), and then checking it for the existence of each element of the second set (1 pass). Lookups in the HashSet would have O(1) complexity. It should be *relatively* optimal, but not quite as optimal as Jon's approach in this instance where you are dealing with already-sorted data. – Anthony Pegram Aug 23 '11 at 23:16
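
To make that concrete, here's a rough sketch of the strategy Anthony describes in that last comment; it's an illustration of the HashSet approach, not the actual source of Enumerable.Intersect:

static IEnumerable<T> IntersectViaHashSet<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    // One pass to build the set; lookups are O(1) on average afterwards.
    var set = new HashSet<T>(second);
    foreach (var item in first)
    {
        // Remove (rather than Contains) yields each distinct value at most once,
        // matching Intersect's set semantics.
        if (set.Remove(item))
        {
            yield return item;
        }
    }
}
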
2

I was using Jon's approach, but needed to execute this intersect hundreds of thousands of times for a bulk operation on very large sets and needed more performance. The case I was running into was heavily imbalanced list sizes (e.g. 5 and 80,000 items), and I wanted to avoid iterating the entire large list.

I found that detecting the imbalance and switching to an alternate algorithm gave me huge benefits on specific data sets:

public static IEnumerable<T> IntersectSorted<T>(this List<T> sequence1,
        List<T> sequence2,
        IComparer<T> comparer)
{
    List<T> smallList = null;
    List<T> largeList = null;

    // Detect a heavy imbalance: if one list has fewer items than
    // log2 of the other's size, binary-search the large list instead
    // of walking both lists.
    if (sequence1.Count < Math.Log(sequence2.Count, 2))
    {
        smallList = sequence1;
        largeList = sequence2;
    }
    else if (sequence2.Count < Math.Log(sequence1.Count, 2))
    {
        smallList = sequence2;
        largeList = sequence1;
    }

    if (smallList != null)
    {
        foreach (var item in smallList)
        {
            if (largeList.BinarySearch(item, comparer) >= 0)
            {
                yield return item;
            }
        }
    }
    else
    {
        //Use Jon's method
    }
}

I am still unsure about the exact point at which you break even; I need to do some more testing.
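
One way to find that break-even point empirically is a quick Stopwatch harness along these lines (a rough sketch; the sizes, seed and iteration count here are arbitrary, and you'd swap in each implementation in turn):

var rng = new Random(42);
var small = Enumerable.Range(0, 5).Select(_ => rng.Next(1000000))
    .Distinct().OrderBy(x => x).ToList();
var large = Enumerable.Range(0, 80000).Select(_ => rng.Next(1000000))
    .Distinct().OrderBy(x => x).ToList();

var sw = Stopwatch.StartNew();
for (int i = 0; i < 100000; i++)
{
    // Count() forces the lazy iterator to run to completion.
    small.IntersectSorted(large, Comparer<int>.Default).Count();
}
sw.Stop();
Console.WriteLine("{0} ms for 100,000 intersections", sw.ElapsedMilliseconds);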

Spencer Rose
  • I think it's a reasonable approach, however: you're calling `Count()` 4 times on `IEnumerable`s, which could be quite expensive if they are not a `List` or similar, not to mention multiple enumeration; this could be improved by making the second `if` an `else if`; and `BinarySearch()` is not available for `IEnumerable`, so this only works if `largeList` is a `List` or an `Array` (either change its type or cast it when necessary). – elnigno Dec 09 '14 at 14:25
  • Updated to use lists instead – Spencer Rose Dec 11 '14 at 00:47
0

try

firstSet.Intersect(secondSet).ToList()

or

firstSet.Join(secondSet, o => o, id => id, (o, id) => o)

burning_LEGION
Yahia