75

Given a data structure specification, such as a purely functional map with known complexity bounds, one has to pick between several implementations. There is some folklore on how to pick the right one: for example, red-black trees are considered generally faster, but AVL trees perform better on workloads with many lookups.

  1. Is there a systematic presentation (published paper) of this knowledge (as relates to sets/maps)? Ideally I would like to see statistical analysis performed on actual software. It might conclude, for example, that there are N typical kinds of map usage, and list the input probability distribution for each.

  2. Are there systematic benchmarks that test map and set performance on different distributions of inputs?

  3. Are there implementations that use adaptive algorithms to change representation depending on actual usage?

t0yv0
  • 20
    Have you had a look at Okasaki's book, [*Purely Functional Data Structures?*](http://www.amazon.com/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504) – Robert Harvey Apr 05 '13 at 16:46
  • 3
    @RobertHarvey, yes, I have a copy. It has excellent material on designing PFDS and doing complexity analysis. It also has some hints to practitioners (folklore referred to above). I am looking for more empirical data though, and/or statistical analysis of real usage patterns. – t0yv0 Apr 05 '13 at 16:49
  • 7
    It seems like an interesting question, but I'm not sure it's on-topic here. It's extremely localized, in the sense that you're basically hoping that someone stumbles in here that just so happens to know about a paper that talks about this (it's a `canihazresourcez` question, in other words, a crowd-sourcing internet search). In any case, a cursory Google Search comes up with this: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.35.9196 – Robert Harvey Apr 05 '13 at 16:53
  • 1
    @RobertHarvey that is the only reference I found as well. I considered posting on cstheory but this is not exactly a theory question, theory typically only goes as far as complexity bounds. Do you think it should be moved there? Another site perhaps? – t0yv0 Apr 05 '13 at 16:56
  • 4
    See http://meta.stackexchange.com/questions/128158/where-should-looking-for-resources-questions-go – Robert Harvey Apr 05 '13 at 16:57
  • 5
    Also this: http://journals.cambridge.org/action/displayAbstract;jsessionid=C7E3C7816ECBE70E8FAED76A377DA528.journals?fromPage=online&aid=83365 – Robert Harvey Apr 05 '13 at 17:00
  • 2
    Concerning policy, here is a similar question on theory: http://cstheory.stackexchange.com/questions/1539/whats-new-in-purely-functional-data-structures-since-okasaki. Somehow no one flagged it as "canihazresources". I do not see any deep difference. – t0yv0 Apr 05 '13 at 17:10
  • That's a canonical question, a resource for resources. It's there so that people don't keep asking the same question over and over again. Canonical questions are not really comparable to `canihazresourcez`; every site has slightly different rules, and even Stack Overflow has [a few of these](http://stackoverflow.com/questions/388242/the-definitive-c-book-guide-and-list). – Robert Harvey Apr 05 '13 at 17:12
  • 1
    Well, usually you go with something that runs within the constraints you have. If you have a lot of memory, you might favor a faster algorithm over one that runs slower but uses less memory at a time. When I was worried about efficiency in C#, I found that the efficiency of the C5 library for C# is well documented. http://www.itu.dk/research/c5/ might be a good starting place if you are interested. – bean5 Oct 14 '13 at 06:21
  • 1
    __Purely Functional Data Structures__ is Okasaki's dissertation; it's online [here](http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf), but it's a great book to have in dead-tree form. – WeaponsGrade Oct 29 '13 at 19:19
  • The answer to all of these optimisation kinds of problems is always the same. Test and measure. Theoretical knowledge of why things are faster means nothing in the face of a timer telling you otherwise. All kinds of software and hardware things can interfere and ruin your performance - from language details to cache utilisation – jheriko Nov 27 '13 at 03:22

1 Answer

5

These are basically research topics, and the results are generally published as conclusions, with the underlying statistical data left out. You can still run a statistical analysis on your own data, though.

For benchmarks, it is better to go through the implementation details of the candidates, since those determine how each one behaves on a given workload.
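If no published benchmark matches your usage, a small harness of your own is one way to get numbers. Below is a minimal sketch in Python, not a rigorous benchmark. The persistent unbalanced BST is only a stand-in for whichever purely functional map you want to test, and the three key distributions, the lookup ratios, and all function names (`make_keys`, `run_workload`, `benchmark`) are assumptions made for illustration, not categories taken from any paper.

```python
import random
import time
from collections import namedtuple

# Stand-in implementation: a persistent (path-copying) unbalanced BST.
# Replace insert/lookup with the map implementation you actually want to test.
Node = namedtuple("Node", "key value left right")

def insert(root, key, value):
    """Return a new tree containing key -> value; the old tree is unchanged."""
    path = []  # (ancestor, went_left) pairs from the root down
    node = root
    while node is not None and node.key != key:
        go_left = key < node.key
        path.append((node, go_left))
        node = node.left if go_left else node.right
    new = Node(key, value, None, None) if node is None else node._replace(value=value)
    # Rebuild only the nodes on the search path, bottom-up.
    for parent, go_left in reversed(path):
        new = parent._replace(left=new) if go_left else parent._replace(right=new)
    return new

def lookup(node, key):
    """Return the value stored under key, or None if the key is absent."""
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None

def make_keys(distribution, n, rng):
    """Generate n keys; the distribution names are illustrative assumptions."""
    if distribution == "sequential":
        return list(range(n))
    if distribution == "uniform":
        return [rng.randrange(n) for _ in range(n)]
    if distribution == "skewed":
        # Heavy-tailed: a few small keys occur far more often than the rest.
        return [min(int(rng.paretovariate(1.0)), n - 1) for _ in range(n)]
    raise ValueError(distribution)

def run_workload(keys, lookup_ratio, rng):
    """Apply a random mix of lookups and inserts; return (seconds, hits)."""
    tree, hits = None, 0
    start = time.perf_counter()
    for k in keys:
        if rng.random() < lookup_ratio:
            hits += lookup(tree, k) is not None
        else:
            tree = insert(tree, k, k)
    return time.perf_counter() - start, hits

def benchmark(n=500, seed=0):
    """Time every (distribution, lookup ratio) pair; return a dict of seconds."""
    results = {}
    for dist in ("sequential", "uniform", "skewed"):
        for ratio in (0.1, 0.9):
            rng = random.Random(seed)  # same seed so runs are repeatable
            keys = make_keys(dist, n, rng)
            elapsed, _ = run_workload(keys, ratio, rng)
            results[(dist, ratio)] = elapsed
    return results

if __name__ == "__main__":
    for (dist, ratio), secs in benchmark().items():
        print(f"{dist:10s} lookups={ratio:.0%} {secs * 1e3:8.2f} ms")
```

The point of the sketch is the shape of the experiment: fix the implementation, vary the key distribution and the lookup/insert mix, and compare timings. For real conclusions you would record key traces from your own application instead of synthetic distributions, and repeat each run several times.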

The third part of the question is quite subjective, and the intended usage may not be known at the time of implementation. That said, languages like Perl do their best to provide highly optimized implementations of every operation.

The following might help: Purely Functional Data Structures by Chris Okasaki, http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf

Sanjay Verma