21

I read http://swtch.com/~rsc/regexp/regexp1.html and in it the author says that in order to have backreferences in regexs, one needs backtracking when matching, and that makes the worst-case complexity exponential. But I don't see exactly why backreferences introduce the need for backtracking. Can someone explain why, and perhaps provide an example (regex and input)?

Andy Lester
  • 81,480
  • 12
  • 93
  • 144
oskarkv
  • 2,290
  • 1
  • 16
  • 29
  • 1
    The article kind of answers that right there: a regex with backrefs is not a regular expression, by its formal definition. Although this doesn't answer why such a fast algorithm can't be made for a regex with backrefs. – Qtax Jun 19 '12 at 13:12

4 Answers

21

To get directly at your question, you should make a short study of the Chomsky Hierarchy. This is an old and beautiful way of organizing formal languages in sets of increasing complexity. The lowest rung of the hierarchy is the Regular Languages. You might guess - and you'd be right - that the RL's are exactly those that can be represented with "pure" regular expressions: Those with only the alphabet, empty string, concatenation, alternation |, and Kleene star * (look Ma, no back references). A classic theorem of formal language theory - Kleene's Theorem - is that DFAs, NFAs (as described in the article you cited), and regular expressions all have exactly the same power to represent and recognize languages. Thompson's construction given in the article is a part of the theorem's proof.

Every RL is also a context-free language (CFL). But there are infinitely many CFLs that aren't regular. A feature that can exist in CFLs that makes them too complex to be regular is balanced pairs of things: parentheses, begin-end blocks, etc. Nearly all programming languages are CFLs. CFLs can be efficiently recognized by what's called a pushdown automaton. This is essentially an NFA with a stack glued on. The stack grows to be as big as needed, so it's no longer a finite automaton. Parsers of real programming languages are nearly all variations on pushdown automata.

Consider the regex with backreference

^(b*a)\1$

In words, this represents strings of length 2n for some n, where the n'th and 2n'th characters are a and all other characters are b. This is a perfect example of a CFL that's not regular. You can rigorously prove this with another cool formal language tool called the pumping lemma.
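You can see this concretely with Python's re module (which, being a backtracking matcher, happily accepts the backreference):

```python
import re

# ^(b*a)\1$ accepts b^(n-1) a b^(n-1) a: the backreference \1
# must repeat exactly what the group captured.
pat = re.compile(r'^(b*a)\1$')

print(bool(pat.match('bbabba')))  # b^2 a b^2 a -> True
print(bool(pat.match('aa')))      # n = 1      -> True
print(bool(pat.match('bbaba')))   # halves differ -> False
```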

This is exactly why back references cause problems! They allow "regular expressions" that represent languages that aren't regular. Therefore there is no NFA or DFA that can ever recognize them.

But wait, it's even worse than I've made it out to be so far. Consider

^(b*a)\1\1$

We now have a string of length 3n where the n'th, 2n'th, and 3n'th elements are a and all others are b. There is another flavor of the pumping lemma that allows a proof that this language is even too complex to be a CFL! No pushdown automaton can recognize this one.
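Continuing the same Python sketch, the doubled backreference demands three identical blocks:

```python
import re

# ^(b*a)\1\1$ accepts b^(n-1) a repeated three times.
pat3 = re.compile(r'^(b*a)\1\1$')

print(bool(pat3.match('bababa')))  # (ba)(ba)(ba) -> True
print(bool(pat3.match('babab')))   # not three equal blocks -> False
```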

Back references allow these supercharged regexes to represent languages two rungs up the Chomsky Hierarchy from the Regular Languages: the Context Sensitive Languages. Roughly speaking, the only known way to recognize a CSL is to check it against all strings in the language of the same length (at least if P != NP, but that's true for all practical purposes and a different story altogether). The number of such strings is exponential in the length of the one you're matching.

This is why a searching (backtracking) regex matcher is needed. You can be very clever in the way you design the search. But there will always be some input that drives it to take exponential time.

So I agree with the author of the paper you cited. It's possible to write perfectly innocent looking regexes with no back refs that will be efficiently recognized for nearly all inputs, but where there exists some input that causes a Perl or Java or Python regex matcher - because it is a backtracking search - to require millions of years to complete the match. This is crazy. You can have a script that's correct and works fine for years and then locks up one day merely because it stumbled onto one of the bad inputs. Suppose the regex is buried in the message parser of the navigation system in the airplane you're riding...
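This blow-up is easy to reproduce. Here's a minimal sketch in Python (whose re module is a backtracking matcher); the pattern (a+)+b is a standard textbook pathological case, not one taken from the article, and the timings are only meant to show the trend:

```python
import re
import time

# (a+)+b tried against a run of a's with no final 'b' forces the
# backtracking engine to try exponentially many ways to split the
# a's between the inner + and the outer + before giving up.
evil = re.compile(r'(a+)+b')

for n in (10, 15, 20):
    s = 'a' * n
    t0 = time.perf_counter()
    assert evil.match(s) is None   # no 'b', so it can never match
    print(n, round(time.perf_counter() - t0, 4))
```

Each extra handful of a's multiplies the running time; push n into the 30s and the match attempt effectively never finishes.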

Edit

By request, I'll sketch how the pumping lemma can be used to prove the language a^k b a^k b is not regular. Here a^k is a shorthand for a repeated k times. The PL says that there must exist a positive integer N such that every string in a regular language of length at least N can be written R S T, where |R S| <= N and S is not empty, such that R S^i T is also in the language for every natural number i.

Proof of the PL depends on the fact that every regular language corresponds to some DFA. An accepted input to this DFA longer than its number of states (which is the N in the lemma) must cause it to "loop": to repeat a state. Call this state X. The machine consumes some string R to get from the start to X, then S to loop back to X, then T to get to an accepting state. Well, adding extra copies of S (or else deleting S) in the input corresponds only to a different number of "loops" from X back to X. Consequently, the new string with additional (or deleted) copies of S will also be accepted.

Since every RL must satisfy the PL, a proof that a language is not regular proceeds by showing that it contradicts the PL. For our language, this is not hard. Suppose you are trying to convince me the language L = a^k b a^k b satisfies the PL. If it does, you must be able to give me some value of N (see above): the number of states in a hypothetical DFA that recognizes L. At that point, I will say, "Okay Mr. Regular Guy, consider the string B = a^N b a^N b." If L is regular, B must cause this DFA (no matter what it looks like) to loop within the first N characters, which must be all a's! So the loop (string S above) consists of all a's, too. With this I can immediately show that your claim about L being regular is false. I just choose to go around the loop a second time. This will cause your hypothetical DFA to accept a new string a^M b a^N b, where M > N because I added a's to its first half. Ouch! This new string is not in L, so the PL is not true after all. Since I can do this trick every time, no matter what N you provide, the PL cannot hold for L, and L cannot be regular after all.
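The contradiction can be checked mechanically for any concrete N. A small Python sketch (in_lang is an illustrative helper I'm introducing here, testing membership in a^k b a^k b; it is not part of any library):

```python
def in_lang(s):
    """Membership test for L = a^k b a^k b, k >= 0."""
    if s.count('b') != 2 or not s.endswith('b'):
        return False
    first, second = s[:-1].split('b')
    return set(first + second) <= {'a'} and len(first) == len(second)

N = 6                            # pretend this is the DFA's state count
B = 'a' * N + 'b' + 'a' * N + 'b'
assert in_lang(B)

# Any loop within the first N characters consists only of a's.
# Pumping it once more yields a^(N+i) b a^N b, which is not in L.
for i in range(1, N + 1):
    pumped = 'a' * i + B         # one extra trip around a length-i loop
    assert not in_lang(pumped)
print('PL contradiction verified for N =', N)
```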

Since it's not regular, Kleene's theorem tells us there is no DFA, NFA, nor "pure" regex that describes it.

The proof that back refs allow languages that aren't even context free has a very similar ring but needs background on pushdown automata that I'm not going to give here. Google will provide.

NB: Both of these fall short of proving that back refs make recognition NP-complete. They merely say, in a very rigorous way, that back refs add real complexity to pure regular expressions. They allow languages that can't be recognized with any machine having finite memory, nor with any machine that adds only an unbounded LIFO (stack) memory. I will leave the NP-completeness proof to others.

Gene
  • 42,664
  • 4
  • 51
  • 82
    So the answer here to the question *"why?"* is *"because it's not a regular expression"*, which by itself doesn't add much. A proof of why such an expression no longer represents a regular language would be of value. – Qtax Jun 29 '12 at 12:45
10

NFAs and DFAs are finite automata, aka finite-state machines, which are "abstract machine[s] that can be in one of a finite number of states"[1]. Note the finite number of states.

The fast NFA/DFA algorithms discussed in the linked article, Regular Expression Matching Can Be Simple And Fast, are fast because they can work with a finite number of states (independent of input length) as described in the article.

Introducing backreferences makes the number of states (almost) "infinite" (in the worst case about 256^n, where n is the length of the input). The number of states grows because every possible value of every backreference becomes a state of the automaton.

Thus using a finite-state machine is no longer fitting/possible, and backtracking algorithms have to be used instead.
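To illustrate the growth (a rough back-of-the-envelope sketch, not a formal construction): for a pattern like ([ab]+)...\1, an automaton tracking the capture would need a state for every value the group could hold, i.e. every nonempty {a,b}-string up to the input length n:

```python
def possible_captures(n):
    """Count nonempty strings over {a, b} of length at most n,
    i.e. candidate values for the ([ab]+) capture: 2^(n+1) - 2."""
    return sum(2 ** k for k in range(1, n + 1))

for n in (4, 8, 16):
    print(n, possible_captures(n))   # 30, 510, 131070
```

Already for modest inputs the state count dwarfs any fixed finite-state machine.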

Qtax
  • 31,392
  • 7
  • 73
  • 111
  • I can only add that it's possible to build a regex engine using DFA that may allow backreferences... if this engine will switch to NFA when faced with such task. ) At least Jeffrey Friedl talks about two examples of using such approach - POSIX grep and Tcl regex parser - in his [wonderful book](http://books.google.com.ua/books?id=sshKXlr32-AC&pg=PA150). – raina77ow Jun 22 '12 at 14:56
    Note that if the number of values of the backref is limited, you could construct an NFA and use a non-backtracking algorithm. For example `([ab])[ab]+\1+` can be matched with an NFA. But you can't construct an NFA for `([ab]+)[ab]+\1+` because there are infinitely many possible values (thus states) of the capturing group. – Qtax Jun 22 '12 at 15:02
  • You probably meant DFA, not NFA in your previous comment? ) – raina77ow Jun 22 '12 at 15:25
  • @raina, NFA can be converted to DFA, so it doesn't really matter here (afaics?), altho using "DFA" would be more clear of my intentions. :-p – Qtax Jun 22 '12 at 15:29
  • Well, I just did it with JavaScript: I've said `var rex = /([ab]+)[ab]+\1+/`, and `rex.test('aabaa')`, and it was good (actually it was `true`, but `good` is better). So it does matter here, I suppose. ) – raina77ow Jun 22 '12 at 15:53
  • @raina, how is this relevant? JS uses a backtracking algorithm anyway, so you can use any kind of backrefs. – Qtax Jun 22 '12 at 15:59
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/12917/discussion-between-qtax-and-raina77ow) – Qtax Jun 22 '12 at 15:59
  • @Qtax: What do you mean? The regex you pasted will never match anything which is the actual correct behavior. That the first [ab] loop matches all the "ab" and the second fails to match because the first consumed all of them is nothing strange. It is easy to express regular expressions that are auto-logic in the sense that we already know they will never match by analyzing them. "[^a]+[^a]a" is another never-matching regex. To know that we should match "all ab except the last one that the other ab should match" requires a context aware language because we said "the last one". – Hannes Landeholm Oct 31 '12 at 06:07
    @Hannes, that's not how regex and quantifiers work. All possibilities are examined until a match is found. Regex flavors like PCRE/Perl/.NET/Java use backtracking to achieve this. – Qtax Nov 02 '12 at 23:51
    @Qtax Then they are not truly implementing regular expressions, since the term "regular expression" has a formal meaning in linguistics theory which excludes context-aware grammar that requires backtracking to implement. – Hannes Landeholm Mar 22 '13 at 16:24
4

There are some excellent examples in this tutorial:
http://www.regular-expressions.info/brackets.html

The particular case you will be interested in is shown in 'Backtracking Into Capturing Groups': it explains how a partial match can be given up several times before one is finally found that matches the whole regex. It's also worth noting that this can lead to unexpected matches.
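A tiny Python illustration of backtracking into a capturing group (the behavior is the same in most backtracking flavors):

```python
import re

# (\d+) greedily grabs all five digits; the trailing \d then fails,
# so the engine backtracks INTO the group, which gives back one
# digit. The capture ends up as '1234', not '12345'.
m = re.match(r'(\d+)\d', '12345')
print(m.group(0), m.group(1))   # 12345 1234
```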

Joanna Derks
  • 3,893
  • 3
  • 22
  • 31
2

A very interesting document: Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions, which supports back-references and counted occurrences efficiently with a modified NFA.