27

I understand that I should not optimize every single spot of my program, so please consider this question to be "academic".

I have at most 100 strings and an integer number for each of them, something like this:

MSFT 1
DELL 2
HP   4
....
ABC  58

This set is preinitialized, which means that once created it never changes. After the set is initialized I use it pretty intensively, so it would be nice to have fast lookup. The strings are pretty short, at most 30 characters. The mapped int is also limited, between 1 and 100.

At least, knowing that the strings are preinitialized and never change, it should be possible to "find" a hash function that results in a "one bucket, one item" mapping, but there are probably other hacks.

One optimization I can imagine: I could read the first character only. For example, if "DELL" is the only string starting with "D" and I receive something like "D***", then I do not even need to read the rest of the string! It's obviously "DELL". Such a lookup should be significantly faster than a "hashmap lookup". (Here I assumed that we only receive symbols that are in the map, but that is not always the case.)

Are there any ready-to-use or easy-to-implement solutions for my problem? I'm using C++ and Boost.

upd: I've checked and found that for my exchange the limit for a ticker is 12 characters, not 30 as mentioned above. However, other exchanges may allow slightly longer symbols, so it's interesting to have an algorithm that keeps working on tickers up to 20 characters long.

Oleg Vazhnev
  • 3
    *"...I should not optimize every single spot of my program..."* Of course you should, but only if you can afford. – Mark Garcia Apr 22 '13 at 07:07
  • 2
    The simplest approach would be to get rid of string processing and use integers instead. E.g. putting all strings in a vector and use the index internally. Only in cases where you really need the string (e.g. to display it on the screen) you take the string from the vector. – ogni42 Apr 22 '13 at 07:13
  • "...should not optimize every single spot of my program..." - Happy line. Consider the future we all face when "Tailoring thread behavior *to a particular runtime environment* is often overlooked in multithreaded programs." (said by Intel in 2005 on "Developing Multithreaded Applications...") – SChepurin Apr 22 '13 at 07:47
  • look up [prefix trie](http://en.wikipedia.org/wiki/Trie) – ratchet freak Apr 22 '13 at 08:46
  • @ogni42 Of course, but I receive the "string" from a third party, so I need to map this string to an int because in my program I'm using only ints. – Oleg Vazhnev Apr 22 '13 at 10:07
  • When you're really squeezing out the last microsecond, it starts to matter how you receive those strings. Do you have the string _length_ available in O(1) ? There's no need to compare `"DELL"` and `"DELLX"`, since their lengths differ. And when you only compare strings of the same known length, you can simplify the loop condition. – MSalters Apr 22 '13 at 11:17
  • Off topic, but assuming stock symbols never change can get you into trouble depending on how they're used in your data model. One high profile example is Santander changing their symbol from 'STD' to 'SAN' because the combination of being a Spanish Bank and the colloquial meaning of 'STD' was a bit too much. – Chuu Apr 22 '13 at 12:43
  • @Chuu A stock symbol never changes during a session; between sessions a stock symbol may change. – Oleg Vazhnev Apr 22 '13 at 13:06

7 Answers

36

A hashtable[1] is in principle the fastest way.

You could however compile a Perfect Hash Function given the fact that you know the full domain ahead of time.

With a perfect hash, there need not be a collision, so you can store the hash table in a linear array!

With proper tweaking you can then

  • fit all of the hash elements in a limited space, making direct addressing a potential option
  • have a reverse lookup in O(1)

The 'old-school' tool for generating perfect hash functions is gperf(1). The Wikipedia article lists more resources on the subject.

Because of all the debate I ran a demo:

I downloaded the NASDAQ ticker symbols, took 100 random samples from that set, and applied gperf as follows:

gperf -e ' \015' -L C++ -7 -C -E -k '*,1,$' -m 100 selection > perfhash.cpp

This results in a MAX_HASH_VALUE of 157 and a direct string lookup table with as many items. Here is just the hash function, for demonstration purposes:

inline unsigned int Perfect_Hash::hash (register const char *str, register unsigned int len) {
  static const unsigned char asso_values[] = {
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156,  64,  40,   1,  62,   1,
       41,  18,  47,   0,   1,  11,  10,  57,  21,   7,
       14,  13,  24,   3,  33,  89,  11,   0,  19,   5,
       12,   0, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156, 156,
      156, 156, 156, 156, 156, 156, 156, 156, 156
    };
  register int hval = len;

  switch (hval) {
      default: hval += asso_values[(unsigned char)str[4]];   /*FALLTHROUGH*/
      case 4:  hval += asso_values[(unsigned char)str[3]];   /*FALLTHROUGH*/
      case 3:  hval += asso_values[(unsigned char)str[2]+1]; /*FALLTHROUGH*/
      case 2:  hval += asso_values[(unsigned char)str[1]];   /*FALLTHROUGH*/
      case 1:  hval += asso_values[(unsigned char)str[0]];   break;
  }
  return hval;
}

It really doesn't get much more efficient. Do have a look at the full source at github: https://gist.github.com/sehe/5433535

Mind you, this is a perfect hash, so there will be no collisions between the known keys.
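
For illustration only, here is a rough sketch of how the generated code could be wired to the int mapping from the question. It assumes the gperf output above has been included, and uses the in_word_set() membership test that gperf emits alongside hash(); the ticker_ids value array and the lookup helper are made up for this example and are not part of the generated code:

// Hypothetical value table, indexed by the perfect hash value.
// It has MAX_HASH_VALUE + 1 slots; slots without a keyword hold -1.
static const int ticker_ids[158] = { /* filled once from the keyword list */ };

int lookup(const char *sym, unsigned int len) {
    // in_word_set() rejects unknown symbols that happen to hash into range.
    if (Perfect_Hash::in_word_set(sym, len) == 0)
        return -1;                                    // not a known ticker
    return ticker_ids[Perfect_Hash::hash(sym, len)];  // collision-free for known keys
}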


Q. [...] it's obviosly "DELL". Such lookup must be significantly faster than "hashmap lookup".

A: If you use a simple std::map the net effect is prefix search (because lexicographical string comparison shortcuts on the first character mismatch). The same thing goes for binary search in a sorted container.


[1] PS. For 100 strings, a sorted array of strings with std::search or std::lower_bound could potentially be as fast or faster due to the improved locality of reference. Consult your profile results to see whether this applies.
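
As a sketch of that footnote (the names and sample data are illustrative, not from the question's real feed), a sorted array plus std::lower_bound could look like this:

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Built once at startup and kept sorted by symbol; never modified afterwards.
static const std::vector<std::pair<std::string, int>> table = {
    {"ABC", 58}, {"DELL", 2}, {"HP", 4}, {"MSFT", 1}
};

int lookup(const std::string &sym) {
    auto it = std::lower_bound(table.begin(), table.end(), sym,
        [](const std::pair<std::string, int> &e, const std::string &s) {
            return e.first < s;   // compare only the symbol part
        });
    return (it != table.end() && it->first == sym) ? it->second : -1;
}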

sehe
  • The more I think of it, my point about cache locality is probably gonna win here. We are talking 100*short strings, this would likely fit in ~800 bytes. Linear search might beat a hash table there... See also [SSO](http://stackoverflow.com/questions/10315041/meaning-of-acronym-sso-in-the-context-of-stdstring) – sehe Apr 22 '13 at 07:42
  • 1
    Searching through 100 short strings is by no means a cheap task and even the worst of cache misses shouldn't be *that* bad. Assuming the hash function is cheap, I don't think you can do much better than that. Both hash tables and binary trees will suffer from cache misses if the data doesn't fit into cache. I figure binary trees have it worse because there's memory overhead for each node. – Mysticial Apr 22 '13 at 07:46
  • 2
    An alternate approach (if you know the set of words ahead of time) is to keep all the words sorted in an array. Then you binary search it for the index. That might even be faster than a hash table. – Mysticial Apr 22 '13 at 07:49
  • @Mysticial Thanks for confirming the hunch I expressed as my PS. Your experience with CPU characteristics here weighs in for me. (When using `gperf`, the hash function will come out very lightweight. It has options to further tune the code generated, IIRC) – sehe Apr 22 '13 at 07:51
  • -1 "a hashtable is in principle the fastest way" is just wrong for small N, and here N is <= 100. Direct indexing may be possible, or direct indexing and ad-hoc disambiguation. I hope gperf would generate equivalent code, but then not a hash table in any meaningful sense. Your advice re binary search of sorted array is good, but for keys known at compile time and consistently short (per question), storing textual data directly in the elements is better than std::string's likely heap indirection - compromising the Locality of Reference benefit you cite. – Tony Delroy Apr 22 '13 at 07:52
  • 6
    My own measurements show that even the standard `std::map` will beat most typical hashing algorithms for less than about 200 elements. `std::vector` with `std::lower_bound` could be even less, but note that if you put `std::string` in the table, you may lose the locality advantage, at least partially, because the implementation of `std::string` may (and probably does) have an indirection and dynamically allocated memory. It might actually pay to use something like `struct { char key[30]; int value }`. – James Kanze Apr 22 '13 at 07:52
  • @TonyD I can confirm that gperf will result in equivalent code (it is basically all about minimum decision trees). And, yes, if you choose a small hash-value domain you can easily store in an array, like I tried to suggest. The important things here are: "perfect hash" with ***use profiler***. – sehe Apr 22 '13 at 07:54
  • @JamesKanze looking at the sample strings from the OP, I'd trust [SSO](http://stackoverflow.com/questions/1466073/how-is-stdstring-implemented) before doing any premature manual optimization. – sehe Apr 22 '13 at 07:55
  • @sehe: "important things [...] perfect hash with use profiler" - you could have a perfect hash function that was overly expensive to calculate; _perfect_ doesn't mean it's fast - it means there are no bucket collisions. Anyway, direct indexing is ideal, a very simple hash function likely second (e.g. if you can create a perfect hash from an XOR or bitshift or two), sorted binary lookup third. If you are happy with implementation defined behaviour and have it, relying on SSO is convenient. – Tony Delroy Apr 22 '13 at 08:11
  • `a hashtable is in principle the fastest way.` => wrong, Trie is faster. – Roee Gavirel Apr 22 '13 at 08:16
  • 3
    @Roee Is it? Show me a benchmark for OP’s particular data set. – Konrad Rudolph Apr 22 '13 at 08:17
  • @KonradRudolph - We actually had a task for that at school (a few years ago). I'll see if I still have the code. – Roee Gavirel Apr 22 '13 at 08:20
  • 3
    I included a sample of a perfect hash function including a 157-element reverse-lookup array (using direct indexing) for that (based on 100 randomly selected ticker symbols). I included the instructions on how I used `gperf` to get this. Hope this helps. – sehe Apr 22 '13 at 09:32
  • @sehe SSO isn't widely used. And he said strings up to 30 characters; the only implementation of SSO I've seen uses 8 as a cut-off. (But you're right that the only way to know, one way or another, is to profile with real data.) – James Kanze Apr 22 '13 at 09:44
  • The 30-character limit is theoretical and I think it's actually even lower, probably about 12-15. 99% of symbols are about 3-7 characters. Sorry for not mentioning it in the original question. – Oleg Vazhnev Apr 22 '13 at 10:30
  • @JamesKanze you might be interested in facebook's folly library. It has a 23 Byte SSO string class. – Stephan Dollberg Apr 22 '13 at 10:39
19

Small addition to sehe’s post:

If you use a simple std::map the net effect is prefix search (because lexicographical string comparison shortcuts on the first character mismatch). The same thing goes for binary search in a sorted container.

You can harness the prefix search to be much more efficient. The problem with both std::map and naive binary search is that they will read the same prefix redundantly for each individual comparison, making the overall search O(m log n) where m is the length of the search string.

This is the reason why a hashmap outcompetes these two methods for large sets. However, there is a data structure which does not perform redundant prefix comparisons, and in fact needs to compare each prefix exactly once: a prefix (search) tree, more commonly known as a trie. Looking up a single string of length m is feasible in O(m), the same asymptotic runtime you get for a hash table with perfect hashing.

Whether a trie or a (direct lookup) hash table with perfect hashing is more efficient for your purpose is a question of profiling.
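
As a rough illustration of the trie idea (not from the answer; it assumes plain A-Z symbols, so real tickers containing digits or dots would need a wider alphabet):

#include <array>
#include <memory>
#include <string>

// Minimal uncompressed trie mapping symbols to ints.
struct TrieNode {
    int value = -1;                                    // -1 means no symbol ends here
    std::array<std::unique_ptr<TrieNode>, 26> next{};  // one child slot per letter 'A'..'Z'
};

void insert(TrieNode &root, const std::string &key, int value) {
    TrieNode *node = &root;
    for (char c : key) {
        auto &slot = node->next[c - 'A'];
        if (!slot) slot = std::make_unique<TrieNode>();
        node = slot.get();
    }
    node->value = value;
}

int lookup(const TrieNode &root, const std::string &key) {
    const TrieNode *node = &root;
    for (char c : key) {
        const auto &slot = node->next[c - 'A'];
        if (!slot) return -1;    // bail out at the first character that cannot match
        node = slot.get();
    }
    return node->value;          // O(m) overall, m = length of the key
}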

Konrad Rudolph
  • This post related to tries might also be interesting: http://stackoverflow.com/a/1280686/893693 – Stephan Dollberg Apr 22 '13 at 08:20
  • Just so we can do some informed comparisons, I have included a demonstration of using `gperf` to hash 100 ticker symbols into my answer, just now. – sehe Apr 22 '13 at 09:33
  • 1
    Often you can avoid the brunt of the inefficiency by simply changing the comparison function to sort by string length first. Of course, then the strings are no longer sorted lexicographically, but this doesn't matter if all you do are lookups/inserts. – Cameron Jun 30 '15 at 13:03
  • @Cameron Thanks, that’s really good advice in general. Unfortunately, many scenarios that require a very efficient string lookup deal with many strings of approximately the same length, such that this strategy won’t help much (or at all; for example, many genomic applications require extremely fast lookup of millions of strings of equal length). – Konrad Rudolph Jun 30 '15 at 14:04
1

Yes!

A hash function must go over your whole string to build the hash value, while a trie, as explained in the linked [Wiki: Trie] article, only needs to follow a path in a linked structure without any extra calculations. And if it is a compressed trie, as explained at the end of that page, it also takes into account the case where a prefix is unique to one word (the "DELL" case you spoke about). The preprocessing cost is a little higher, but it gives the best performance at run time.

Some more advantages:
1. If the string you are looking for doesn't exist, you know that at the first character which differs from the existing strings (no need to continue the calculation).
2. Once implemented, adding more strings to the trie is straightforward.

Roee Gavirel
0

Well, you could store the strings in a binary tree and search there. While this has O(log n) theoretical performance, it may be a lot faster in practice if you only have a few keys that are really long and that already differ in the first few characters.

I.e. when comparing keys is cheaper than computing the hash function.

Furthermore, there are CPU caching effects and such that may (or may not) be beneficial.

However, with a fairly cheap hash function, the hash table will be hard to beat.
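
For comparison, the straightforward tree-based variant is just a std::map (a red-black tree in typical implementations); the sample data below is illustrative:

#include <map>
#include <string>

static const std::map<std::string, int> ids = {
    {"MSFT", 1}, {"DELL", 2}, {"HP", 4}
};

int lookup(const std::string &sym) {
    auto it = ids.find(sym);                    // O(log n) string comparisons,
    return it != ids.end() ? it->second : -1;   // each cut short at the first mismatch
}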

Has QUIT--Anony-Mousse
0

The standard hash map, as well as the perfect hash function mentioned above, suffers from the relatively slow execution of the hash function itself. The sketched perfect hash function, for example, performs up to 5 random accesses into an array.

It makes sense to measure or calculate the speed of the hash function and of the string comparisons, assuming the lookup consists of one hash function evaluation, one table lookup, and a linear search through a (linked) list containing the strings and their indexes in order to resolve hash collisions. In many cases it is better to use a simpler but faster hash function and accept more string comparisons than to use a better but slower hash function and have fewer (standard hashmap) or even only one (perfect hash) comparison.
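
As an illustration of that trade-off (everything below is a made-up sketch, not code from this answer): hash only the first character and the length, keep the handful of colliding entries in the same small bucket, and resolve them with plain strcmp.

#include <cstring>
#include <vector>

struct Entry { const char *key; int value; };

static std::vector<Entry> buckets[128];               // 128 tiny buckets

inline unsigned cheap_hash(const char *s, unsigned len) {
    return ((unsigned char)s[0] + 7u * len) & 127u;   // two loads, no loop over the string
}

void add(const char *key, int value) {
    buckets[cheap_hash(key, std::strlen(key))].push_back({key, value});
}

int lookup(const char *key) {
    unsigned len = std::strlen(key);
    for (const Entry &e : buckets[cheap_hash(key, len)])
        if (std::strcmp(e.key, key) == 0)             // only a few comparisons expected
            return e.value;
    return -1;
}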

You will find a discussion of the related theme "switch on string" on my site, as well as a bunch of solutions using a common test bed with macros, as free C/C++ sources that solve the problem at runtime. I'm also thinking about a precompiler.

Sirrida
0

(Yet) Another small addition to sehe's answer:

Apart from perfect hash functions, there is also the notion of a minimal perfect hash function, and correspondingly the CMPH (C Minimal Perfect Hashing) library. It is almost the same as gperf, except that:

gperf is a bit different, since it was conceived to create very fast perfect hash functions for small sets of keys and CMPH Library was conceived to create minimal perfect hash functions for very large sets of keys

The CMPH Library encapsulates the newest and more efficient algorithms in an easy-to-use, production-quality, fast API. The library was designed to work with big entries that cannot fit in the main memory. It has been used successfully for constructing minimal perfect hash functions for sets with more than 100 million of keys, and we intend to expand this number to the order of billion of keys

source: http://cmph.sourceforge.net/

Lu4
-2

If the strings are known at compile-time you can just use an enumeration:

enum
{
  Str1,
  Str2
};

const char *Strings[] = {
  "Str1",
  "Str2"
};

Using some macro tricks you can remove the redundancy of re-creating the table in two locations (using file inclusion and #undef).
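
For instance, one common form of that trick is an X-macro (illustrative only; the answer merely hints at "macro tricks"):

// The list of symbols lives in exactly one place.
#define STRING_TABLE(X) \
    X(MSFT)             \
    X(DELL)             \
    X(HP)

#define AS_ENUM(name)   name,
#define AS_STRING(name) #name,

enum { STRING_TABLE(AS_ENUM) StringCount };           // MSFT, DELL, HP, StringCount

const char *Strings[] = { STRING_TABLE(AS_STRING) };   // "MSFT", "DELL", "HP"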

Then lookup can be achieved as fast as indexing an array:

const char *string = Strings[Str1]; // set to "Str1"

This would have optimal lookup time and locality of reference.

RandyGaul