68

In .NET 4.0+, a class SortedSet<T> has a method called GetViewBetween(l, r), which returns an interface view on a tree part containing all the values between the two specified. Given that SortedSet<T> is implemented as a red-black tree, I naturally expect it to run in O(log N) time. The similar method in C++ is std::set::lower_bound/upper_bound, in Java it's TreeSet.headSet/tailSet, and they are logarithmic.

However, that is not true. The following code runs in 32 sec, whereas the equivalent O(log N) version of GetViewBetween would make this code run in 1-2 sec.

var s = new SortedSet<int>();
int n = 100000;
var rand = new Random(1000000007);
int sum = 0;
for (int i = 0; i < n; ++i) {
    s.Add(rand.Next());
    if (rand.Next() % 2 == 0) {
        int l = rand.Next(int.MaxValue / 2 - 10);
        int r = l + rand.Next(int.MaxValue / 2 - 10);
        var t = s.GetViewBetween(l, r);
        sum += t.Min;
    }
}
Console.WriteLine(sum);

I decompiled System.dll using dotPeek and here's what I got:

public TreeSubSet(SortedSet<T> Underlying, T Min, T Max, bool lowerBoundActive, bool upperBoundActive)
    : base(Underlying.Comparer)
{
    this.underlying = Underlying;
    this.min = Min;
    this.max = Max;
    this.lBoundActive = lowerBoundActive;
    this.uBoundActive = upperBoundActive;
    this.root = this.underlying.FindRange(this.min, this.max, this.lBoundActive, this.uBoundActive);
    this.count = 0;
    this.version = -1;
    this.VersionCheckImpl();
}

internal SortedSet<T>.Node FindRange(T from, T to, bool lowerBoundActive, bool upperBoundActive)
{
  SortedSet<T>.Node node = this.root;
  while (node != null)
  {
    if (lowerBoundActive && this.comparer.Compare(from, node.Item) > 0)
    {
      node = node.Right;
    }
    else
    {
      if (!upperBoundActive || this.comparer.Compare(to, node.Item) >= 0)
        return node;
      node = node.Left;
    }
  }
  return (SortedSet<T>.Node) null;
}

private void VersionCheckImpl()
{
    if (this.version == this.underlying.version)
      return;
    this.root = this.underlying.FindRange(this.min, this.max, this.lBoundActive, this.uBoundActive);
    this.version = this.underlying.version;
    this.count = 0;
    base.InOrderTreeWalk((TreeWalkPredicate<T>) (n =>
    {
      SortedSet<T>.TreeSubSet temp_31 = this;
      int temp_34 = temp_31.count + 1;
      temp_31.count = temp_34;
      return true;
    }));
}

So, FindRange is obviously O(log N), but after that we call VersionCheckImpl... which does a linear-time traversal of the found subtree only for recounting its nodes!

  1. Why would you need to do that traversal all the time?
  2. Why .NET does not contain a O(log N) method for splitting a tree based on key, like C++ or Java? It is really helpful in lots of situations.
Skiminok
  • 2,521
  • 1
  • 21
  • 29
  • 16
    Well, you are right, it is VersionCheckImpl() that ruins it. The check is quite sacred in .NET collection classes, I can't think of a better way. You'll get O(log n) as long as you *use* the subset, with checking in place, creating it is however O(n). You can post to connect.microsoft.com to point this out and get a view from the insiders. Odds are however high that they'll close it as "By design". – Hans Passant Mar 24 '12 at 12:57
  • 3
    What an outrageous blunder in the BCL. The GetRange method is supposed to be more efficient than a linear filter! – usr Mar 26 '12 at 22:10
  • 2
    @Hans For my curiosity, what is the point of `VersionCheckImpl`? –  Mar 28 '12 at 18:50
  • 23
    It is a tragic consequence of the requirement that [`SortedSet.Count` be O(1)](http://msdn.microsoft.com/en-us/library/dd382345.aspx). – Raymond Chen Mar 29 '12 at 04:10
  • @RaymondChen What's wrong with Java approach? Whatever their implementation is, `GetViewBetween` is derivable in logarithmic time from 2 calls of `headSet` and `tailSet`. All other methods retain consistency, and `Count` retrieval is still constant-time. – Skiminok Mar 29 '12 at 22:21
  • 1
    Nothing's wrong with the Java approach. (Though I see no requirement that Collection.size be O(1) [in the documentation](http://docs.oracle.com/javase/6/docs/api/java/util/Collection.html#size()).) – Raymond Chen Mar 29 '12 at 23:10
  • 3
    I just want to point out that it isn't quite as bad as linear filter. If I understand the code correctly, only the range is linearly traversed not the whole set. – hwiechers Mar 30 '12 at 05:46
  • 2
    That `TreeSubSet` code is just horrible. The constructor calls `FindRange` twice for no good reason. As @hwiechers said, the O(n) traversal is only based on the subset. Still 3 times as many traversals as needed... – leppie Mar 30 '12 at 08:25
  • 5
    Checking in from 2017 here, the dotnet core implementation of SortedSet.GetViewBetween(...) is a O(log(n)) implementation. Ofc. plus the elements you need to extract. – Kristoffer la Cour Mar 24 '17 at 13:27

1 Answers1

20

about the version field

UPDATE1:

In my memory, a lot of(maybe all?) collections in BCL have the field version.

First,about foreach:

according to this msdn link

The foreach statement repeats a group of embedded statements for each element in an array or an object collection. The foreach statement is used to iterate through the collection to get the desired information, but should not be used to change the contents of the collection to avoid unpredictable side effects.

In many other collections, version is protected the data is not modified during the foreach

For example, HashTable's MoveNext():

public virtual bool MoveNext()
{
    if (this.version != this.hashtable.version)
    {
        throw new InvalidOperationException(Environment.GetResourceString("InvalidOperation_EnumFailedVersion"));
    }
    ..........
}

But in the in the SortedSet<T>'s MoveNext() method:

public bool MoveNext()
{
    this.tree.VersionCheck();
    if (this.version != this.tree.version)
    {
        ThrowHelper.ThrowInvalidOperationException(ExceptionResource.InvalidOperation_EnumFailedVersion);
    }       
    ....
}

UPDATE2:

But the O(N) loop maybe not only for version but also for the Count property.

Because the MSDN of GetViewBetween said:

This method returns a view of the range of elements that fall between lowerValue and upperValue, as defined by the comparer .... You can make changes in both the view and in the underlying SortedSet(Of T).

So for every update it should be sync the count field (key and value are already same). To make sure the Count is correct

There were two policies to reach the target:

  1. Microsoft's
  2. Mono's

First.MS's,in their code, they sacrifice the GetViewBetween()'s performance and win the Count Property's performance.

VersionCheckImpl() is one way to sync the Count property.

Second,Mono. In mono's code,GetViewBetween() is Faster, but in their GetCount()method:

internal override int GetCount ()
{
    int count = 0;
    using (var e = set.tree.GetSuffixEnumerator (lower)) {
        while (e.MoveNext () && set.helper.Compare (upper, e.Current) >= 0)
            ++count;
    }
    return count;
}

It is always an O(N) operation!

svick
  • 214,528
  • 47
  • 357
  • 477
llj098
  • 1,364
  • 10
  • 13
  • Wouldn't augmenting each tree node with a 'subtree count' make much more sense? I've built a Red Black tree this way with little extra effort. The count value is updated on insertion and removal, (including any affected ancestor nodes). This approach provides O(log n) sorted indexing into the tree, eg tree[0] returns the smallest node, tree[tree.Count-1] returns the largest. I didn't provide a 'view' concept however, so maybe this complicates it? – Ash Feb 06 '13 at 06:22