5

In an app that I'm writing I have two potentially large sets of data I need to map against each other. One is a List returned from a web service and one is a DataTable. I need to take the ANSI (or ISO) number for each item in the list and find the row of the DataTable containing that ANSI number and then do stuff with it.

Since DataTable.Select is pretty slow and I would have to do that for each item in the List, I experimented with faster alternatives. Keep in mind that there is no database for the DataTable object. So I can't leverage any SQL capabilities or anything like that.

I thought the fastest way might be to create a dictionary keyed by a KeyValuePair ("A" plus the ANSI number, or "I" plus the ISO number). The value would be the rest of the row. Creating that dictionary would obviously take a little processing time, but then I could leverage the extremely fast lookup times of the dictionary to find each row I need and add the rows back to the table afterwards. So within the foreach loop over the list I would only have a complexity of O(1) with the dictionary instead of O(n) or whatever DataTable.Select has.

To my surprise it turned out the dictionary was incredibly slow. I couldn't figure out why until I found out that using a string (just ANSI number) instead of a KeyValuePair increased the performance dramatically. I'm talking hundreds of times faster. How on Earth is that possible? Here is how I test:

I generate a List that simulates the output from the web service. I create a dictionary based on that list with a key (either string or KeyValuePair) and the DataRow as value. I go through a foreach loop for that list and search each item in that list in my dictionary and then assign a value to the DataRow that is returned. That's it.

If I use a KeyValuePair as the dictionary key, it takes seconds for 1,000 items; if I modify the dictionary to take only a string as the key, it takes milliseconds for 10,000 items. FYI: I designed the test so that there would always be hits, so all keys are always found.

Here is the block of code for which I'm measuring the time:

foreach(ProductList.Products item in pList.Output.Products)
{
   //KeyValuePair<string, string> kv = new KeyValuePair<string, string>("A", item.Ansi);
   DataRow row = dict[item.Ansi];
   for (int i = 0; i < 10; i++)
   {
      row["Material"] = item.Material + "a"; //Do stuff just for debugging
   }
   hits++;
}
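
For completeness, here is roughly how I build the two dictionary variants (a simplified, self-contained sketch; the column names here are stand-ins for my real ones):

```csharp
using System;
using System.Collections.Generic;
using System.Data;

class DictBuildSketch
{
    static void Main()
    {
        // Stand-in for the real DataTable; "Ansi"/"Material" are assumed names
        var dt = new DataTable();
        dt.Columns.Add("Ansi", typeof(string));
        dt.Columns.Add("Material", typeof(string));
        dt.Rows.Add("0001", "Steel");
        dt.Rows.Add("0002", "Copper");

        // Slow variant: KeyValuePair key ("A" marks an ANSI number)
        var dictKvp = new Dictionary<KeyValuePair<string, string>, DataRow>();
        foreach (DataRow r in dt.Rows)
            dictKvp[new KeyValuePair<string, string>("A", (string)r["Ansi"])] = r;

        // Fast variant: plain string key
        var dict = new Dictionary<string, DataRow>();
        foreach (DataRow r in dt.Rows)
            dict[(string)r["Ansi"]] = r;

        Console.WriteLine(dict["0002"]["Material"]); // prints Copper
    }
}
```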

So how on Earth is it possible that the execution time suddenly becomes hundreds of times longer if I use a Dictionary&lt;KeyValuePair&lt;string, string&gt;, DataRow&gt; instead of a Dictionary&lt;string, DataRow&gt;?

croxy
  • 3,641
  • 8
  • 25
  • 42
user2696330
  • 111
  • 11
  • 2
    People here on SO used to be less critical. I bet the person who downvoted this question had not read it before hit the button of disapproval. What the hell people, stop doing this for god's sake... – Marcelo De Zen Nov 13 '15 at 15:33
  • Different questions, but the same answers, please see http://stackoverflow.com/a/251619/18797 – Binary Worrier Nov 13 '15 at 15:36

2 Answers

11

KeyValuePair&lt;TKey, TValue&gt; doesn't override the GetHashCode() method - it inherits the default ValueType.GetHashCode(). As an implementation detail, that default typically derives the hash from the first field only, and in your case the first field is always "A". This means that the only way to meaningfully organize the dictionary is gone: every key lands in the same bucket, and you're left with an inefficient linear search.

This shouldn't be surprising, since it's not what KeyValuePair<TKey, TValue> is designed for - it's an internal structure used by the dictionary, not a key. There's no requirement for .NET objects to be useful keys, and returning 0 from all GetHashCode() calls is perfectly valid.
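
You can see this for yourself with a quick sketch. The exact hash values are an implementation detail of the runtime, but on the desktop CLR two pairs that share the same first field typically report the same hash even though they are not equal:

```csharp
using System;
using System.Collections.Generic;

class HashDemo
{
    static void Main()
    {
        var a = new KeyValuePair<string, string>("A", "0001");
        var b = new KeyValuePair<string, string>("A", "0002");

        // ValueType.GetHashCode typically only considers the first
        // field here, so both pairs tend to hash identically
        Console.WriteLine(a.GetHashCode() == b.GetHashCode()); // often True - a collision
        Console.WriteLine(a.Equals(b)); // False
    }
}
```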

If you don't want to define your own structure, use Tuple. For any kind of persistence, though, I would really just create my own structure.
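
For example, a minimal custom key struct might look like this (a sketch - the names are made up; adjust the hashing to whatever reduces collisions for your data):

```csharp
using System;
using System.Collections.Generic;

struct ProductKey : IEquatable<ProductKey>
{
    public readonly string Scheme; // "A" for ANSI, "I" for ISO
    public readonly string Number;

    public ProductKey(string scheme, string number)
    {
        Scheme = scheme;
        Number = number;
    }

    public bool Equals(ProductKey other)
    {
        return Scheme == other.Scheme && Number == other.Number;
    }

    public override bool Equals(object obj)
    {
        return obj is ProductKey && Equals((ProductKey)obj);
    }

    public override int GetHashCode()
    {
        // The number is the most unique part, so hash mainly on that
        return Number.GetHashCode() * 31 + Scheme.GetHashCode();
    }
}

class Demo
{
    static void Main()
    {
        var dict = new Dictionary<ProductKey, string>();
        dict[new ProductKey("A", "0001")] = "some row";
        Console.WriteLine(dict.ContainsKey(new ProductKey("A", "0001"))); // True
    }
}
```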

As a side-note, DataTable.Select is actually pretty fast for what it's designed for - filtering data for output. It's not really designed for being called hundreds of times in a loop, though - the overhead dominates. This assumes that you have proper indices, of course. In your case, I think the indices are regenerated every time you call Select, which is a bit slow :)
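
(If you do want to stay with the DataTable itself, setting a primary key once and then using DataTable.Rows.Find gives you an indexed lookup without Select's per-call overhead - a sketch, with an assumed column name:)

```csharp
using System;
using System.Data;

class FindDemo
{
    static void Main()
    {
        var dt = new DataTable();
        dt.Columns.Add("Ansi", typeof(string));
        dt.Columns.Add("Material", typeof(string));
        dt.Rows.Add("0001", "Steel");

        // Build the index once; lookups afterwards use it
        dt.PrimaryKey = new[] { dt.Columns["Ansi"] };

        DataRow row = dt.Rows.Find("0001");
        Console.WriteLine(row["Material"]); // prints Steel
    }
}
```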

Luaan
  • 57,516
  • 7
  • 84
  • 100
  • 1
    Tuple also tends to hash poorly. – paparazzo Nov 13 '15 at 15:35
  • 1
    Indeed, create your own structure, and hash it (often simply returning the hash of the most unique property is sufficient). – Binary Worrier Nov 13 '15 at 15:37
  • @Frisbee [citation needed]. Sure, there are better ways to hash if you know what biases are useful for you, but for unstructured data, `Tuple` works quite well. – Luaan Nov 13 '15 at 15:38
  • I am not making this up. I have had some situations where tuple performed very poorly. It was a comment. Take it or leave it. – paparazzo Nov 13 '15 at 15:41
  • 1
    @Frisbee The poor performance on `Tuple` is because when used with value types it will box them on every `GetHashCode` and `Equals` invocation. Some genius decided to use `EqualityComparer<object>.Default` instead of `EqualityComparer<T>.Default`. See http://stackoverflow.com/questions/21084412/net-tuple-and-equals-performance – Lukazoid Nov 13 '15 at 15:45
  • 1
    @Lukazoid I am not making this up. Tuple can hash poorly. http://stackoverflow.com/questions/12657348/new-keyvaluepairuint32-uint32i-j-gethashcode-high-rate-of-duplicates – paparazzo Nov 13 '15 at 15:46
  • @Frisbee Oh you mean the hash distribution? I was only commenting on the poor performance of Tuple, I have no idea about the distribution – Lukazoid Nov 13 '15 at 15:51
  • Thanks for the fast answer. It was so quick that Stack Overflow wouldn't even let me accept it as an answer until I waited. Anyway, is there any kind of pair that does hash well? From what I have heard, Tuple and KVP won't work, and I suspect there won't be any "pair" type that hashes well. So I guess I'm stuck using something like "A;Ansi number" as a string and then splitting the string later? Or is there something better? – user2696330 Nov 14 '15 at 09:27
  • @user2696330 You can always make your own type :) You probably know what's the best approach you want - depending on your keys, it might be perfectly fine to only return a hashcode of the string, or just the number - pick whatever will reduce collisions the most. There's not a lot of pain in having four collisions per key, you just don't want hundreds or thousands :) – Luaan Nov 14 '15 at 10:37
  • @user2696330 - what if you had two separate Dictionaries? You can put the same value object into one keyed by Ansi value as a string and one by ISO. – Bobson Nov 15 '15 at 03:58
  • @Bobson Dictionaries don't have trivial overhead - it might work if the top-level dictionary only has a few items, but it's definitely not a good solution when both keys are sparse. – Luaan Nov 15 '15 at 11:18
  • @Lukazoid Thanks for the warning, but the difference seems negligible. As you can see in my example, I have that little for loop that I go through 10 times to simulate actual work being done after the key is found. With that in place, string and Tuple perform very close to each other. I think I'm just gonna go with that. – user2696330 Nov 16 '15 at 10:32
  • @Frisbee Your comment made my day! Take it or leave it. :-) – jpaugh Nov 19 '15 at 00:14
1

You are probably getting a high number of hash collisions with KeyValuePair. You can test this by inspecting the GetHashCode values of your keys.

The linked question is about Tuple, but I highly suspect the same thing is going on with KeyValuePair: gethashcode-high-rate-of-duplicates. I would mark this as a duplicate, but you may have something else going on.

In that link Microsoft recommends not using value types as keys. GetHashCode for KeyValuePair is inherited from ValueType.
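
A quick sketch of such a test - count the distinct hash codes over keys shaped like yours; if the count is far below the key count, you have your answer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CollisionCheck
{
    static void Main()
    {
        // Simulated keys in the same shape as the question's
        var keys = Enumerable.Range(0, 1000)
            .Select(i => new KeyValuePair<string, string>("A", i.ToString()))
            .ToList();

        int distinctHashes = keys.Select(k => k.GetHashCode()).Distinct().Count();
        Console.WriteLine(distinctHashes + " distinct hashes for " + keys.Count + " keys");
    }
}
```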

paparazzo
  • 42,665
  • 20
  • 93
  • 158