4

I've been looking into regex lately and figured out that the ? operator makes the *, +, or ? lazy. My question is: how does it do that? Is *?, for example, a special operator, or does the ? have an effect on the *? In other words, does regex recognize *? as one operator in itself, or as the two separate operators * and ?? If *? is recognized as two separate operators, how does the ? affect the * to make it lazy? If ? means that the * is optional, shouldn't that mean the * doesn't have to exist at all? If so, then in an expression like .*?, wouldn't regex just match either separate letters or the whole string, instead of the shortest string? Please explain, I'm desperate to understand. Many thanks.

Uriel Katz
  • `?` is short for `{0,1}` – NINCOMPOOP Jul 01 '13 at 08:23
  • 1
    @TheNewIdiot By itself, yes. But after a * or + then it has a different effect. I do wonder, however, if any regex engines recognise a `{n,m}?` syntax? – PP. Jul 01 '13 at 08:26
  • @PP It _seems_ to be working that way [regex101](http://www.regex101.com/r/mF4eC3), which I think is based on PHP. – Jerry Jul 01 '13 at 08:29
  • So would `*{0,1}` make the `*` operator lazy? If so, how? If the minimum is 0 then `.*{0,1}` = `.`, but that's not the case. – Uriel Katz Jul 01 '13 at 08:31
  • 2
    @Uriel, The New Idiot was talking about the `?` *quantifier* which is equivalent to the `{0,1}` quantifier, just as `*` is equivalent to `{0,}`, etc. You cannot chain quantifiers like that; it's a syntax error. What `?` in `*?` seems to be is either a quantifier modifier or (which is what most documentation seems to say), `*?` is just another quantifier that works like `*` except for laziness. – Joey Jul 01 '13 at 08:37
  • @Joey I can't imagine `?` being a quantifier modifier. It would mean in `*?` that the `*` doesn't have to exist. And this in searching `` with regex `<.*?>` would mean that either `*` existed or `*` didn't exist, which would mean `<.*?>` = `<.*>` or `<.>`. In the search this would default to `<.*>`, which would mean `` would be a valid match. – Uriel Katz Jul 01 '13 at 08:42
  • If what you're wondering is not so much "how does the syntax work" but rather "how does one implement regular expression matchers", see http://swtch.com/~rsc/regexp/ (which itself is just a long list of links to various papers). It's easy to code most RE engines if you don't need them to perform well, but it gets seriously hard if you want them to be fast. – torek Jul 01 '13 at 09:05
  • @UrielKatz plain and simple: `*?` is parsed as a separate operator. The `?` here has nothing to do with making something optional. – Martin Ender Jul 01 '13 at 10:20
  • 1
    @PP, I think all important engines that provide non-greedy quantifiers, provide `{n,m}?` as well. [See flavor comparison](http://www.regular-expressions.info/refflavors.html) – Martin Ender Jul 01 '13 at 10:21

4 Answers

14

? can mean a lot of different things in different contexts.

  • Following a normal regex token (a character, a shorthand, a character class, a group...), it means "Match the previous item 0-1 times".
  • Following a quantifier like ?, *, +, {n,m}, it takes on a different meaning: "Make the previous quantifier lazy instead of greedy" (if greedy is the default; that can be changed, though - for example in PHP, the /U modifier makes all quantifiers lazy by default, so the additional ? makes them greedy).
  • Right after an opening parenthesis, it marks the start of a special construct, for example

    a) (?s): mode modifiers ("turn on dotall mode")
    b) (?:...): make the group non-capturing
    c) (?=...) or (?!...): lookahead assertion
    d) (?<=...) or (?<!...): lookbehind assertion
    e) (?>...): atomic group
    f) (?<foo>...): named capturing group
    g) (?#comment): inline comments, ignored by the regex engine
    h) (?(?=if)then|else): conditionals

and others. Not all constructs are available in all regex flavors.

  • Within a character class ([?]), it simply matches a verbatim ?.
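
The four contexts above can be tried out in a few lines. This is only a minimal sketch using Python's `re` module (my choice of flavor for illustration; the sample strings and variable names are made up, and other engines may differ in the details):

```python
import re

# 1) After a normal token: "match the previous item 0 or 1 times".
optional = [m.group(0) for m in re.finditer(r"colou?r", "color colour")]

# 2) After a quantifier: make that quantifier lazy instead of greedy.
greedy = re.search(r"<.+>", "<a><b>").group(0)    # grabs as much as it can
lazy = re.search(r"<.+?>", "<a><b>").group(0)     # stops at the first '>'

# `??` is a lazy `?`: it prefers zero repetitions if the match still succeeds.
lazy_optional = re.search(r"ab??", "ab").group(0)

# `{n,m}?` is a lazy `{n,m}`: it prefers n repetitions.
lazy_range = re.search(r"a{2,4}?", "aaaa").group(0)

# 3) Right after an opening parenthesis: a special construct.
#    (?:...) groups without capturing, so only (c) shows up in groups().
non_capturing = re.search(r"(?:ab)+(c)", "ababc").groups()

# 4) Inside a character class: a verbatim question mark.
literal = re.findall(r"[?]", "what? why?")

print(optional, greedy, lazy, lazy_optional, lazy_range, non_capturing, literal)
```

In every case the `?` is the same character, but the parser decides what it means from what comes directly before it.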
Tim Pietzcker
  • 297,146
  • 54
  • 452
  • 522
  • This answer contains what I think is the key to the OP's confusion. Specifically, it's up to whoever writes the regular expression recognizer (library routines in C and Python, language constructs in Perl, and so on) to decide how to interpret the question-mark. The answer is different in different implementations—some older regex libraries, for instance, have no special meaning at all for `?`. If the implementor provides "lazy" `.*?` matching, then `?` makes `.*` lazy by whatever means the implementor implemented. If the OP is asking "how do I implement regex", that's ... a big topic. – torek Jul 01 '13 at 09:00
  • 1
    Ah yes, the `??` is a non-greedy `?`! – PP. Jul 01 '13 at 12:38
5

I think a little history will make it easier to understand. When Larry Wall wanted to grow regex syntax to support new features, his options were severely limited. He couldn't just decree (for example) that % is now a metacharacter that supports new feature "XYZ". That would break the millions of existing regexes that happened to use % to match a literal percent sign.

What he could do is take an already-defined metacharacter and use it in such a way that its original function wouldn't make sense. For example, any regex that contained two quantifiers in a row would be invalid, so it was safe to say a ? after another quantifier now turns it into a reluctant quantifier (a much better name than "lazy" IMO; non-greedy is good too). So the answer to your question is that ? doesn't modify the *; *? is a single entity: a reluctant quantifier. The same is true of the + in possessive quantifiers (*+, {0,2}+ etc.).
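
You can see the "two quantifiers in a row are invalid" point in a current engine. A small sketch, assuming Python's `re` module as the test bed (the `compiles` helper is hypothetical, written just for this check):

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the engine accepts the pattern, False on a syntax error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Two ordinary quantifiers stacked on each other are rejected...
stacked_star = compiles(r"a**")

# ...but a quantifier followed by `?` is accepted, because the parser
# treats `*?` as one reluctant quantifier rather than `*` then `?`.
reluctant_star = compiles(r"a*?")

print(stacked_star, reluctant_star)
```

That is why the `?` slot after a quantifier was free to be given a new meaning without breaking any existing valid pattern.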

A similar process occurred with group syntax. It would never make sense to have a quantifier after an unescaped opening parenthesis, so it was safe to say (? now marks the beginning of a special group construct. But the question mark alone would only support one new feature, so the ? itself has to be followed by at least one more character to indicate which kind of group it is ((?:...), (?<!...), etc.). Again, the (?: is a single entity: the opening delimiter of a non-capturing group.

I don't know offhand why he used the question mark both times. I do know Perl 6 Rules (a bottom-up rewrite of Perl 5 regexes) has done away with all that crap and uses an infinitely more sensible syntax.

Alan Moore
3

Imagine you have the following text:

BAAAAAAAAD

The following regexes will match:

/B(A+)/ => 'BAAAAAAAA'
/B(A+?)/ => 'BA'
/B(A*)/ => 'BAAAAAAAA'
/B(A*?)/ => 'B'

The addition of the "?" to the + and * operators makes them "lazy" - i.e. they will match the absolute minimum required for the expression to be true. Whereas by default the * and + operators are "greedy" and try to match AS MUCH AS POSSIBLE for the expression to be true.

Remember + means "one or more" so the minimum will be "one if possible, more if absolutely necessary" whereas the maximum will be "all if possible, one if absolutely necessary".

And * means "zero or more" so the minimum will be "nothing if possible, more if absolutely necessary" whereas the maximum will be "all if possible, zero if absolutely necessary".
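
The same four patterns can be checked in Python (my choice of language for the sketch; the variable names are made up), plus one extra pattern of my own showing the "more if absolutely necessary" part:

```python
import re

text = "BAAAAAAAAD"

# group(0) is the whole match, as in the list above.
greedy_plus = re.search(r"B(A+)", text).group(0)   # B plus every A
lazy_plus = re.search(r"B(A+?)", text).group(0)    # B plus a single A
greedy_star = re.search(r"B(A*)", text).group(0)   # B plus every A
lazy_star = re.search(r"B(A*?)", text).group(0)    # just B

# A lazy quantifier still expands when the rest of the pattern needs it:
# here A*? has to consume every A so that the final D can match.
forced = re.search(r"B(A*?)D", text).group(0)

print(greedy_plus, lazy_plus, greedy_star, lazy_star, forced)
```

So lazy does not mean "match nothing"; it means "try the shortest option first and grow only when the rest of the pattern fails".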

PP.
  • 1
    Thanks for the answer, but this is not what I am asking. I am asking how does the addition of the `?` to the `+` and `*` operators make them lazy. – Uriel Katz Jul 01 '13 at 08:27
  • 1
    @UrielKatz Your question does not make sense. This is how it is; you don't need to know why, because it's simply the syntax to specify a lazy pattern. It's like you're asking us `Why do we use \n instead of newline`. The answer is simple: it was specified from the beginning by some people that the syntax to jump a line is `\n` and not newline.
    – HamZa Jul 01 '13 at 08:42
0

This very much depends on the implementation, I guess. But since every quantifier I am aware of can be modified with ? it might be reasonable to implement it that way.

Joey
  • So then the question stands: how does `?` modify `*`? As I was explaining to Joey, in searching with the regex `<.*?>`, `?` would mean that either `*` existed or `*` didn't exist, which would mean `<.*?>` = `<.*>` or `<.>`. In the search this would default to `<.*>`, which would mean `` would be a valid match. – Uriel Katz Jul 01 '13 at 08:48