5

I set up a complex regex to extract data from a page of text. For some reason the order of the alternation is not what I expect. A simple example would be:

((13th|(Executive |Residential)|((\w+) ){1,3})Floor)

Put simply I am trying to either get a floor number, a known named floor and, as a back-up, I capture 1-3 unknown words followed by floor just in case to review later (I in fact use a groupname to identify this but didn't want to confuse the issue)

The issue is if the string is

on the 13th Floor

I don't get 13th Floor I get on the 13th Floor which seems to indicate it is matching the 3rd alternation. I'd have expected it to match 13th Floor. I set this up specifically (or so I thought) to prioritize the types of matches and leave the vague ones for last only if the others are missed. I guess they weren't kidding when they said Regex is greedy but I am unclear how to set this up to be 'greedy' and behave the way I want.

Unihedron
  • 10,251
  • 13
  • 53
  • 66
user3649739
  • 1,717
  • 2
  • 14
  • 26

2 Answers2

4

Well an automaton is worth a 1000 words:

Regular expression visualization

play with it

Your problem is that you're using a greedy \w+ sub-regex in your alternation. Because as @rigderunner is stating in his comment, a NFA is matching the longest leftmost substring, the \w+ will always match anything that comes before Floor, whether it is a series of words, or 13th or Executive or Residential or the three of them. The parenthesis are not changing how the alternation behaves.

So the worst case scenario it matches that you don't want it to match is:

xxxx yyyy zzz tttt Floor

The problem with your regex is that you expect to do something that actual regular expressions can't do: you're expecting it to match words if the alternatives did not work out. Because a regular language can't keep track of status, regular regex can't express this.

I'm actually not sure if using some kind of look ahead could help you do this in one regex, and even if you can, you'll end up with a very complicated, unreadable and maybe even not efficient regex.

So you may prefer to use two regex instead, and get the groups from the second regex in case the first one failed:

((13th|Executive|Residential) +Floor)

and if there's no match

((\w+ +){1:3}Floor)

N.B.: to avoid repeating myself, please have a look at that other answer where I give a list of interesting resources lecturing on regex and NFA. That will help you understand how regex actually works.

Community
  • 1
  • 1
zmo
  • 22,917
  • 4
  • 48
  • 82
  • 1
    No, the greediness of the `\w+` (or the `{1,3}`) quantifiers is not the problem. Its the fact that an NFA regex engine matches the _longest leftmost_ substring. As long as there are three words preceeding `floor`, the other two options will never get a chance to match regardless of the greediness/laziness of any of the quantifiers. – ridgerunner May 21 '14 at 12:42
  • err... thing is that `\w+ ` will always be matching the longest leftmost substring! – zmo May 21 '14 at 12:44
  • Well that makes sense then. It would be nice if there were some way to instruct the expression itself to stop on the first match left to right. But it seems as if you are saying the best solution is to in fact run the non-conflicting expressions (# vs explicit names) in one pass and then if not found run the greedy one in it's own pass. In fact just re-tested, if I make the w+ a single word then it won't over-ride the others. So I could have one pass that will ge #, known word, or unknown word, and if that fails have a single catch-all after. – user3649739 May 21 '14 at 13:16
  • `\w+` greedily matches a single word only. In this case it makes no difference to the overall match if were made to be lazy. Same goes for the `{1,3}` quantifier. The regex engine must try all possibilities before giving up and even if the all the quantifiers are lazy, the last alternative will always ba able match before the other alternatives. – ridgerunner May 21 '14 at 13:19
  • I don't get why you keep trying to find problems where there is no problems. The OP said he wants to get up to three words in case his alternative match does not work, so the second regex matches between one and three words separated by at least one space. – zmo May 21 '14 at 13:24
3

First, here is your regex in free-spacing mode:

tidied = re.compile(r"""
    (                   # $1: ...
      (                 # $2: One ... from 3 alternatives.
        13th            # Either a1of3.
      | (               # Or a2of3 $3: One ... from 2 alternatives.
          Executive[ ]  # Either a1of2.
        | Residential   # Or a2of2.
        )               # End $3: One ... from 2 alternatives.
      | (               # Or a3of3 $4: Last match from 1 to 3 ...
          (\w+)         # $5: ...
          [ ]           #
        ){1,3}          # End $4: Last match from 1 to 3 ...
      )                 # End $2: One ... from 3 alternatives.
      Floor             #
    )                   # End $1: ...
    """, re.VERBOSE)

Note that the above pattern has extra parentheses that are having no effect. Here is a simplified expression which is functionally equivalent:

tidied = re.compile(r"""
    (               # $1: One ... from 4 alternatives.
      13th          # Either a1of4.
    | Executive[ ]  # Or a2of4.
    | Residential   # Or a3of4.
    | (             # Or a4of4 $2: Last match from 1 to 3 ...
        (\w+)       # $3: ...
        [ ]         #
      ){1,3}        # End $2: Last match from 1 to 3 ...
    )               # End $1: One ... from 4 alternatives.
    Floor           #
    """, re.VERBOSE)

Longest leftmost matches

There are effectively four grouped alternatives preceding the required word: Floor. The first three alternatives are each one word only, but the fourth option matches three words. An NFA regex engine works from left to right and always attempts to find the longest leftmost match. In this case, as the regex marches through the regex one character at a time, it tests all four options at each character position. Since the fourth option can always match two words ahead of the other three, it will always match first (assuming there are three words preceding Floor in the given text.). If there are NOT three words preceding Floor, then one of the first three alternatives can match.

Note also that there is no required space following the 13th and Residential alternatives, thus it will only ever match when presented text having the concatenated text: ResidentialFloor or 13thFloor.

zmo
  • 22,917
  • 4
  • 48
  • 82
ridgerunner
  • 30,685
  • 4
  • 51
  • 68