55

I'm not new to using regular expressions, and I understand the basic theory they're based on--finite state machines.

I'm not so good at algorithmic analysis, though, and don't understand how a regex compares to, say, a basic linear search. I'm asking because on the surface it seems like a linear array search (if the regex is simple).

Where could I go to learn more about implementing a regex engine?

avgvstvs
  • 6
    Because of backtracking, regex engines can exhibit [catastrophic time complexity](http://www.regular-expressions.info/catastrophic.html). I'm not sure where 'catastrophic' falls in `O()` notation :) but it sure isn't linear. – sarnold May 05 '11 at 02:58
  • 2
    Why is this being voted to close? – ocodo May 05 '11 at 04:39
  • possible duplicate of [Complexity of Regex substitution](http://stackoverflow.com/questions/21669/complexity-of-regex-substitution) – Kevin Jul 18 '14 at 18:32
  • While the original question in hindsight was too broad, I was originally asking for say, a side-by-side comparison of a linear search vs one of the NFA searches. The linked question was not really a duplicate. – avgvstvs Oct 27 '14 at 15:35

3 Answers

62

This is one of the most popular outlines: [Regular Expression Matching Can Be Simple And Fast](http://swtch.com/~rsc/regexp/regexp1.html). Running a DFA-compiled regular expression against a string is indeed O(n), but building the DFA can require up to O(2^m) construction time/space (where m = regular expression size).
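
To make the O(2^m) construction cost concrete, here is a minimal Python sketch of the subset construction applied to a textbook worst case: the language "the k-th symbol from the end is `a`", i.e. `(a|b)*a(a|b){k-1}`. The function names (`nfa_kth_from_end`, `subset_construction`, `dfa_state_count`) and the dictionary encoding of the NFA are assumptions made for this illustration, not taken from any library or from the linked article.

```python
from collections import deque


def nfa_kth_from_end(k):
    """NFA over {a, b} accepting strings whose k-th symbol from the end is 'a'.

    States are 0..k; state 0 is the start, state k is the only accepting state.
    """
    # State 0 loops on everything and may "guess" that this 'a' is the k-th from the end.
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    # States 1..k-1 just count the remaining symbols.
    for i in range(1, k):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta, 0, {k}


def subset_construction(delta, start, alphabet="ab"):
    """Return every DFA state (a frozenset of NFA states) reachable from the start."""
    start_set = frozenset([start])
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            # The DFA successor is the union of the NFA successors of each member.
            nxt = frozenset(t for s in current for t in delta.get((s, ch), ()))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def dfa_state_count(k):
    delta, start, _accepting = nfa_kth_from_end(k)
    return len(subset_construction(delta, start))


if __name__ == "__main__":
    for k in range(1, 9):
        print(k + 1, "NFA states ->", dfa_state_count(k), "DFA states")
```

The NFA has k + 1 states, while the reachable DFA has 2^k, because the DFA has to remember which of the last k symbols were `a`. Most everyday patterns are nowhere near this bad; the exponential bound is a worst case.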

porges
  • 1
    This comparison is fantastic... and gives me the inspiration to start looking into a way to implement a better regex library for Java... a tall goal, yes, but the current engine is so ugly. – avgvstvs May 05 '11 at 22:07
  • 1
    The time complexity is O(n), but please note, when performing a partial match on a string, you need roughly m*n steps, because if the regex engine can't match the pattern in the first character, it must try again starting with a 2nd, 3rd character and so on until it finds a matching sequence. – Calmarius Dec 21 '11 at 21:39
  • 1
    @Calmarius That does not make sense to me. Partial matching the expression A is just matching the expression .*(A).* and collecting the group. – jobermark Jul 29 '20 at 19:58
  • 2
    @jobermark I wrote that comment 9 years ago, when I didn't understand regex. – Calmarius Jul 30 '20 at 20:51
11

Are you familiar with the term Deterministic/Non-Deterministic Finite Automata?

Real regular expressions (by "real" I mean regexes that recognize regular languages, not the regexes with backreferences and other extensions that almost every programming language includes) can be converted into a DFA or an NFA, and both can be implemented mechanically in a programming language (an NFA can always be converted into a DFA).

What you have to do is:

  1. Find a way to convert a regex into an automaton
  2. Implement the recognition of the automaton in the programming language of your preference

That way, given a regex, you can convert it to a DFA and run it to see whether it matches a given text.

This can be implemented in O(n) because a DFA never moves backward over its input (unlike a Turing machine): it reads each character once and then either accepts the string or rejects it. This assumes you don't need to take overlapping matches into account; otherwise you will have to go back and start matching again.
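
As a rough illustration of step 2, here is a minimal table-driven DFA matcher in Python. The DFA is written by hand for the regex `ab*c`; the names `dfa_match` and `AB_STAR_C` and the dictionary encoding are assumptions for this sketch, and the regex-to-automaton conversion (step 1) is not shown.

```python
def dfa_match(transitions, start, accepting, text):
    """Run a DFA over the whole text: one table lookup per character, never going back."""
    state = start
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            # No transition defined: the DFA is in a dead state and can never accept.
            return False
    return state in accepting


# Hand-built DFA for the regex ab*c (hypothetical example).
# State 0: start, state 1: saw 'a' followed by zero or more 'b', state 2: saw the final 'c'.
AB_STAR_C = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}

if __name__ == "__main__":
    for sample in ["ac", "abbbc", "abca", ""]:
        print(repr(sample), dfa_match(AB_STAR_C, 0, {2}, sample))
```

The loop body does constant work per input character, which is where the O(n) bound comes from. Note that this checks whether the entire string matches; searching for a match anywhere inside a longer text needs a slightly different automaton or extra bookkeeping.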

Oscar Mederos
  • Just a note: Lookahead and lookbehind don't add power to regular expressions, they're still regular. – porges May 05 '11 at 03:49
  • 1
    In fact, when you say "real" you mean "theoretical". `:)` – Kobi May 05 '11 at 06:59
  • Also, do you mean "back-referencing" instead of "backtracking"? – Kobi May 05 '11 at 07:00
  • I've learned a little bit about DFA and NFA in a programming languages class. (Precursor to compiler construction.) I was thinking it would be `O(n)`. – avgvstvs May 05 '11 at 22:18
5

A classic regular expression can be implemented in a way that is fast in practice but has really bad worst-case behaviour (the standard backtracking matcher) or in a way that has guaranteed reasonable worst-case behaviour (simulating the automaton directly and tracking every state it could be in). The standard matcher can be extended to support lots of extra matching constructs and flags, which make use of the fact that it is basically a backtracking search.

Examples of the standard approach are everywhere (e.g. built into Perl). There is an example that claims good worst case behaviour at http://code.google.com/p/re2/ - in fact it is even better than I expected in the worst case, so they may have found an extra trick or two.

If you are at all interested in this, or care about writing programs that can be made to lock up solid given pathological inputs, read http://swtch.com/~rsc/regexp/regexp1.html.
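
To see the difference between the two strategies, the following Python sketch counts the work done by a toy backtracking matcher and by a state-set (NFA-style) simulation on a pathological pattern of the shape `a?a?...a?aa...a` (n optional `a`s followed by n mandatory ones) against the text `a` repeated n times. The token representation (a list of `(char, optional)` pairs) and all function names are assumptions made for this illustration; real engines are far more elaborate.

```python
def backtrack(tokens, text, counter, i=0, j=0):
    """Toy backtracking matcher; counter[0] counts recursive calls."""
    counter[0] += 1
    if i == len(tokens):
        return j == len(text)
    ch, optional = tokens[i]
    # Greedy: first try to consume a character with this token...
    if j < len(text) and text[j] == ch and backtrack(tokens, text, counter, i + 1, j + 1):
        return True
    # ...and only if that fails, try skipping an optional token.
    if optional:
        return backtrack(tokens, text, counter, i + 1, j)
    return False


def closure(tokens, states):
    """Add every position reachable by skipping optional tokens."""
    out = set()
    stack = list(states)
    while stack:
        i = stack.pop()
        if i in out:
            continue
        out.add(i)
        if i < len(tokens) and tokens[i][1]:
            stack.append(i + 1)
    return out


def simulate(tokens, text, counter):
    """Track the set of all possible positions; counter[0] counts state visits."""
    current = closure(tokens, {0})
    for c in text:
        nxt = set()
        for i in current:
            counter[0] += 1
            if i < len(tokens) and tokens[i][0] == c:
                nxt.add(i + 1)
        current = closure(tokens, nxt)
    return len(tokens) in current


def pathological(n):
    """Tokens for a?{n}a{n} and the text 'a' * n."""
    return [("a", True)] * n + [("a", False)] * n, "a" * n


if __name__ == "__main__":
    for n in (4, 8, 12, 16):
        tokens, text = pathological(n)
        slow, fast = [0], [0]
        backtrack(tokens, text, slow)
        simulate(tokens, text, fast)
        print(n, "backtracking calls:", slow[0], "state visits:", fast[0])
```

Both matchers agree on the answer, but the backtracking call count grows exponentially with n, while the state-set simulation visits at most (number of positions) × (text length) states. The price of the second approach is that features such as backreferences do not fit into it.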

mcdowella
  • I AM interested in this... one area of research that I'm interested in is techniques for input-validation for security. – avgvstvs May 05 '11 at 22:20
  • FYI: RE2 works with a guaranteed worst-case behavior because it does not support back-references. Note that supporting back-references requires us to solve an NP-hard problem in the worst case. See: https://perl.plover.com/NPC/NPC-3SAT.html – Bill Province Feb 09 '19 at 22:11