17

According to the wikipedia entry on Rabin-Karp string matching algorithm, it can be used to look for several different patterns in a string at the same time while still maintaining linear complexity. It is clear that this is easily done when all the patterns are of the same length, but I still don't get how we can preserve O(n) complexity when searching for patterns with differing length simultaneously. Can someone please shed some light on this?

Edit (December 2011):

The wikipedia article has since been updated and no longer claims to match multiple patterns of differing length in O(n).

templatetypedef
  • 328,018
  • 92
  • 813
  • 992
MAK
  • 24,585
  • 9
  • 50
  • 82
  • It isn't the exact answer since it only deals with searching for one string at a time, not multiple, but there is some possibly useful information (under the heading 'Karp Rabin') that may help you at: http://www-igm.univ-mlv.fr/~lecroq/string/index.html – Jonathan Leffler Aug 23 '09 at 11:07
  • The wikipedia article claims it can find multiple patterns in O(n) time. – MAK Aug 23 '09 at 11:25

2 Answers2

5

I'm not sure if this is the correct answer, but anyway:

While constructing the hash value, we can check for a match in the set of string hashes. Aka, the current hash value. The hash function/code is usually implemented as a loop and inside that loop we can insert our quick look up.

Of course, we must pick m to have the maximum string length from the set of strings.

Update: From Wikipedia,

[...]
for i from 1 to n-m+1
         if hs ∈ hsubs
             if s[i..i+m-1] = a substring with hash hs
                 return i
         hs := hash(s[i+1..i+m]) // <---- calculating current hash
[...]

We calculate current hash in m steps. On each step there is a temporary hash value that we can look up ( O(1) complexity ) in the set of hashes. All hashes will have the same size, ie 32 bit.

Update 2: an amortized (average) O(n) time complexity ?

Above I said that m must have the maximum string length. It turns out that we can exploit the opposite.
With hashing for shifting substring search and a fixed m size we can achieve O(n) complexity.

If we have variable length strings we can set m to the minimum string length. Additionally, in the set of hashes we don't associate a hash with the whole string but with the first m-characters of it.
Now, while searching the text we check if the current hash is in the hash set and we examine the associated strings for a match.

This technique will increase the false alarms but on average it has O(n) time complexity.

Nick Dandoulakis
  • 40,422
  • 14
  • 98
  • 136
  • Could you please elaborate? As far as I can understand, you are suggesting keeping multiple hashes (one for each length of pattern) and using those to query a hashtable/BST. But, doesn't computing more than a constant number if hashes each iteration make the complexity more than linear? – MAK Aug 23 '09 at 10:49
  • Thanks for explaining. But that is exactly the source of my confusion. If we compute the current hash value in m steps, our overall complexity is no longer linear. It becomes O(n*m) (n is the length of the string, m is length of the longest pattern). – MAK Aug 23 '09 at 11:21
  • @MAK, O(n) is feasible if all strings are `m` size and we calc the hash as mentioned in section *Use of hashing for shifting substring search*. But for variable length strings, IMO, O(nm) is the best we can do with Rabin-Karp algorithm. Maybe that Wiki entry isn't clear. – Nick Dandoulakis Aug 23 '09 at 11:58
  • I'm beginning to think that the wikipedia entry is simply wrong on this. It claims the complexity should be O(n+k). Thanks again. – MAK Aug 23 '09 at 12:17
  • 2
    @MAK, see update2. If I come up with a better solution, I'll post a new update. – Nick Dandoulakis Aug 23 '09 at 17:46
  • @Nick D., Thank you very much. That solution is probably the correct O(n) algorithm (or very close). But suppose the smallest pattern length is 1 (i.e. single character pattern). That would still cause a very high rate false alarms if the input alphabet is small (e.g. 'A'-'Z'), probably enough to cause near O(nm) performance in practice. But thanks, this is a brilliant solution nonetheless. – MAK Aug 24 '09 at 16:11
  • @Nick D., Had a talk about this with a friend, he came up with an input case that causes this approach to fail. Try your algorithm against this: pattern 1: blah pattern 2: aaaabcd Text to search: aaaaaaaaaaa.........aa (all 'a's) That would still cause O(nm) performance. – MAK Aug 24 '09 at 16:28
  • I'm aware of such cases. That kind of data show off the worst part of the algorithm. There is probably a way to minimize the false alarms but I haven't figured it out, yet. – Nick Dandoulakis Aug 24 '09 at 16:52
  • note that even quicksort in the worst case makes Θ(n2) comparisons ;) (excuses, excuses, excuses) – Nick Dandoulakis Aug 24 '09 at 17:16
0

It's because the hash values of the substrings are related mathematically. Computing the hash H(S,j) (the hash of the characters starting from the jth position of string S) takes O(m) time on a string of length m. But once you have that, computing H(S, j+1) can be done in constant time, because H(S, j+1) can be expressed as a function of H(S, j).

O(m) + O(1) => O(m), i.e. linear time.

Here's a link where this is described in more detail (see e.g. the section "What makes Rabin-Karp fast?")

ire_and_curses
  • 64,177
  • 22
  • 110
  • 139
  • I get why Rabin-Karp is fast. I've used to before to find single patterns in a string. I'm trying to figure out how it can be used to find multiple patterns in a string simultaneously in O(n) time (as opposed to O(n*k) if you search for k patterns one by one). – MAK Aug 23 '09 at 10:42
  • @MAK: Sorry, I misunderstood your question. Isn't the answer to that at the bottom of the wikipedia article? "In contrast, the variant Rabin-Karp above can find all k patterns in O(n+k) time in expectation, because a hash table checks whether a substring hash equals any of the pattern hashes in O(1) time." Creating the hash is O(k). Looking for a match in a hash table is an O(1) operation. If any match, you win. – ire_and_curses Aug 23 '09 at 11:04