4

I'm a little confused about greedy vs. lazy quantifiers in regular expressions. It should be really simple, and it feels like I'm missing something obvious.

I've simplified my problem as much as I can to make it clear. Consider the following string and regex pattern.

string:
aaxxxb

pattern:
(?<=a)(.*?)(?=b)

result:
axxx

what I expected:
xxx

This result is what I would expect from using .* instead of .*? so what am I missing?

Likewise, using a.*?b gives me aaxxxb. Why is this? Shouldn't a lazy quantifier (like .*?) match as few characters as possible?

user1277327
  • FYI the exact same question was [already asked here](http://stackoverflow.com/questions/23151710/pattern-matcher-using-greedy-and-reluctant/23151788) a few days ago (answers are offered but none marked as accepted, so not flagging this as duplicate) – Robin Apr 19 '14 at 22:17
  • Also, from the [SO regex FAQ](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075), you can see differences between lazy and greedy [discussed more in depth](http://stackoverflow.com/questions/3075130/difference-between-and-for-regex/3075532#3075532) – Robin Apr 19 '14 at 22:24
  • possible duplicate of [Greedy vs. Reluctant vs. Possessive Quantifiers](http://stackoverflow.com/questions/5319840/greedy-vs-reluctant-vs-possessive-quantifiers) – james.garriss Jul 01 '14 at 17:43

2 Answers

6

You are missing the fact that a regex engine works from left to right, position by position, and succeeds as soon as it finds a match at the current position.

In your example, the first position where the pattern succeeds is at the second "a".

Laziness works only on the right side: it controls how far the match extends, not where it starts.
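A minimal sketch of this in Python's `re` module (the longer test string `aaxxxbyyb` is my own addition, chosen so that greedy and lazy give different results). Both quantifiers start the match at the same position; only the right end differs:

```python
import re

# Hypothetical test string: the original "aaxxxb" with a second "b" appended,
# so that greedy and lazy have different places to stop.
text = "aaxxxbyyb"

lazy = re.search(r"(?<=a)(.*?)(?=b)", text)
greedy = re.search(r"(?<=a)(.*)(?=b)", text)

# Both matches begin at index 1 (the second "a"); only the end moves.
print(lazy.span(), lazy.group())      # (1, 5) axxx
print(greedy.span(), greedy.group())  # (1, 8) axxxbyy
```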

If you want to obtain "xxx", a better way is to use a negated character class [^ab]* instead of .*?
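For illustration, here is how the negated character class behaves in Python's `re` module on the string from the question. Because `[^ab]*` cannot consume an "a", the attempt at the second "a" fails and the engine moves on to the next position:

```python
import re

text = "aaxxxb"

# Lazy dot: the match starts at the second "a", which the dot is allowed to consume.
with_lazy_dot = re.search(r"(?<=a)(.*?)(?=b)", text).group()

# Negated class: "a" and "b" are excluded, so the match can only start after the last "a".
with_negated_class = re.search(r"(?<=a)[^ab]*(?=b)", text).group()

# The same idea without lookarounds.
plain_lazy = re.search(r"a.*?b", text).group()
plain_negated = re.search(r"a[^ab]*b", text).group()

print(with_lazy_dot)       # axxx
print(with_negated_class)  # xxx
print(plain_lazy)          # aaxxxb
print(plain_negated)       # axxxb
```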

Note: not exactly related to the subject, but good to know: in case of alternation, a DFA regex engine will try to get the longest match, whereas an NFA engine gives you the first alternative that succeeds.
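A small sketch of the NFA side of that note, using Python's `re` module, which is a backtracking engine. The pattern and input here are made up for illustration. The DFA behaviour is not demonstrated, since Python's standard library does not ship such an engine:

```python
import re

# Alternatives are tried left to right; the first one that succeeds wins,
# even though the second alternative would give a longer match.
first_wins = re.search(r"a|ab", "ab").group()

# Reordering the alternatives changes the result.
reordered = re.search(r"ab|a", "ab").group()

print(first_wins)  # a
print(reordered)   # ab
```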

dirkk
Casimir et Hippolyte
  • 2
    Always a pleasure to read you. As you say, grasping that "a regex engine works from left to right, position by position" is a common stumbling block... And an even bigger source of confusion is that we are working left to right *both* on the string and on the pattern. When we speak of left-to-right I wish there was a way to express that easily. BTW I know you know this, but .NET has a right-to-left option! :) – zx81 Apr 19 '14 at 22:32
  • @zx81: Thanks! Indeed, do not forget that part of the world reads and writes from right to left and that .NET gives this possibility for regexes. – Casimir et Hippolyte Apr 19 '14 at 22:36
2

user1277327, the (?<=a) part of your pattern means "preceded by an 'a'". When the regex engine starts on your string aaxxxb, the first "a" doesn't fulfill the assertion of that lookbehind (nothing precedes it), but the second "a" does. Fine, but can the engine match that "a"? Yes, the dot in your .*? allows the engine to match this "a". The lazy modifier ? tells the dot star to eat up only as many characters as necessary until we are able to match what comes next. What comes next is a lookahead asserting that the next character is a "b". So the engine eats up the second "a" and then the three x characters. The total match is axxx.
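To make that walk-through concrete, here is a rough sketch in Python that tries the pattern at each start position in turn. The helper name `attempts_by_position` is mine, and the loop only imitates what `re.search` does internally. It relies on the `pos` argument of a compiled pattern's `match` method, which anchors the attempt at that index while still letting the lookbehind see earlier characters:

```python
import re

pattern = re.compile(r"(?<=a)(.*?)(?=b)")
text = "aaxxxb"

def attempts_by_position(pattern, text):
    """Return what the pattern would match if anchored at each start index."""
    results = []
    for pos in range(len(text) + 1):
        m = pattern.match(text, pos)  # attempt anchored at pos
        results.append(m.group() if m else None)
    return results

attempts = attempts_by_position(pattern, text)
print(attempts)
# [None, 'axxx', 'xxx', None, None, None, None]

# The engine stops at the first position that succeeds (index 1),
# so it never reaches index 2, where "xxx" would have matched.
print(pattern.search(text).group())  # axxx
```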

If you are finding greed / laziness confusing, you may want to have a look at the levels of regex greed. The accompanying tutorial on lookarounds may also help.

zx81