25

EDIT: I selected ridgerunner's answer as it contained the information needed to solve the problem. But I also felt like adding a fully fleshed-out solution to the specific question in case someone else wants to fully understand the example too. You will find it somewhere below.

This question is about clarifying the behavior of PHP's regex engine for recursive expressions. (If you have ideas for how to properly match the strings below without using recursive PHP regex, that's very cool, but that's not the question.)

a(?:(?R)|a?)a

This is a simple expression that aims to match the character "a" or nothing, nested in one or multiple nests of the character "a". For instance, aa, aaa, aaaa, aaaaa. You don't need to use recursion for this:

aa*a

would work great. But the point is to use recursion.

Here is a piece of code you can run to test my failing pattern:

<?php
$tries=array('a','aa','aaa','aaaa','aaaaa','aaaaaa');
$regex='#a(?:(?R)|a?)a#';
foreach ($tries as $try) {
echo $try." : ";
if (preg_match($regex,$try,$hit)) echo $hit[0]."<br />";
else echo 'no match<br />';
}
?>

In the pattern, two "a"s are framing an alternation. In the alternation, we either match a recursion of the whole pattern (two "a"s framing an alternation), or the character "a", optionally empty.

In my mind, for "aaaa", this should match "aaaa".

But here is the output:

a : no match
aa : aa
aaa : aaa
aaaa : aaa
aaaaa : aaaaa
aaaaaa : aaa

Can someone explain what is happening on the fourth and sixth lines of output (the ones for "aaaa" and "aaaaaa")? I have tried tracing the path that I imagine the engine must be taking, but I must be imagining it wrong. Why is the engine returning "aaa" as a match for "aaaa"? What makes it so eager? I must be imagining the matching tree in the wrong order.

I realise that

#(?:a|a(?R)a)*#

kind of works, but my question is why the other pattern doesn't.

Thanks heaps!

zx81
  • Add some anchors (like `^`, `$` or `\b`) and it should do what you want. I'm guessing that PCRE is doing some optimization that affect the results when it's not anchored. In Perl this pattern always matches full length for all lengths over 1. – Qtax Dec 09 '11 at 06:44
  • @Qtax, I should have added that I had also tried it with anchors. No joy there. EG, `$regex='#^(a(?:(?1)|a?)a)#';` In this expression, the (?1) is a recursive statement that accesses the expression (or match if Wiseguy is correct) in the first parenthesis, therefore excluding the caret anchor. – zx81 Dec 09 '11 at 06:58
  • This question has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Control Verbs and Recursion". – aliteralmind Apr 10 '14 at 01:10

4 Answers

13

Excellent (and difficult) question!

First, with the PCRE regex engine, (?R) behaves like an atomic group (unlike Perl). Once it matches (or fails to match), whatever happened inside the recursive call is final, and all backtracking breadcrumbs saved within the recursive call are discarded. However, the regex engine does save what was matched by the whole (?R) expression, and it can give that text back and try the other alternative to achieve an overall match. To describe what is happening, let's change your example slightly so that it is easier to talk about and keep track of what is being matched at each step. Instead of aaaa as the subject text, let's use abcd. And let's change the regex from '#a(?:(?R)|a?)a#' to '#.(?:(?R)|.?).#'. The regex engine's matching behavior is the same.

Matching regex: /.(?:(?R)|.?)./ to: "abcd"

answer = r'''
Step Depth Regex          Subject  Comment
1    0     .(?:(?R)|.?).  abcd     Dot matches "a". Advance pointers.
           ^              ^
2    0     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 1).
                 ^         ^
3    1     .(?:(?R)|.?).  abcd     Dot matches "b". Advance pointers.
           ^               ^
4    1     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 2).
                 ^          ^
5    2     .(?:(?R)|.?).  abcd     Dot matches "c". Advance pointers.
           ^                ^
6    2     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 3).
                 ^           ^
7    3     .(?:(?R)|.?).  abcd     Dot matches "d". Advance pointers.
           ^                 ^
8    3     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 4).
                 ^            ^
9    4     .(?:(?R)|.?).  abcd     Dot fails to match end of string.
           ^                  ^    DEPTH 4 (?R) FAILS. Return to step 8 depth 3.
                                   Give back text consumed by depth 4 (?R) = ""
10   3     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches EOS.
                    ^         ^    Advance regex pointer.
11   3     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                       ^      ^    DEPTH 3 (?R) FAILS. Return to step 6 depth 2
                                   Give back text consumed by depth 3 (?R) = "d"
12   2     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches "d".
                    ^        ^     Advance pointers.
13   2     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                       ^      ^    Backtrack to step 12 depth 2
14   2     .(?:(?R)|.?).  abcd     Match zero "d" (give it back).
                    ^        ^     Advance regex pointer.
15   2     .(?:(?R)|.?).  abcd     Dot matches "d". Advance pointers.
                       ^     ^     DEPTH 2 (?R) SUCCEEDS.
                                   Return to step 4 depth 1
16   1     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                       ^      ^    Backtrack to try other alternative. Give back
                                    text consumed by depth 2 (?R) = "cd"
17   1     .(?:(?R)|.?).  abcd     Optional dot matches "c". Advance pointers.
                    ^       ^      
18   1     .(?:(?R)|.?).  abcd     Required dot matches "d". Advance pointers.
                       ^     ^     DEPTH 1 (?R) SUCCEEDS.
                                   Return to step 2 depth 0
19   0     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                       ^      ^    Backtrack to try other alternative. Give back
                                    text consumed by depth 1 (?R) = "bcd"
20   0     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches "b".
                    ^      ^       Advance pointers.
21   0     .(?:(?R)|.?).  abcd     Dot matches "c". Advance pointers.
                       ^    ^      SUCCESSFUL MATCH of "abc"
'''

There is nothing wrong with the regex engine. The correct match is abc (or aaa for the original question.) A similar (albeit much longer) sequence of steps can be made for the other longer result string in question.
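To check the trace above without depending on how a particular PCRE build treats (?R), here is a minimal sketch that wraps the recursion in an explicit atomic group, (?>(?R)), as suggested in the comments on this answer. The helper name firstMatch and the two pattern variables are only illustrative. The assumption is that the explicit atomic group reproduces the behavior traced here; as far as I know, newer PCRE2 releases no longer treat recursion as atomic by default, so the plain (?R) form may give different results on a recent PHP.

```php
<?php
// Sketch: pin the recursive call to atomic behavior with an explicit (?>...)
// group, so the result does not depend on the installed PCRE's default.
// firstMatch() is a hypothetical helper, not part of PHP.
function firstMatch(string $regex, string $subject): ?string
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($regex, $subject, $hit) === 1 ? $hit[0] : null;
}

$atomicDots = '#.(?:(?>(?R))|.?).#';   // the "abcd" variant used in the trace
$atomicAs   = '#a(?:(?>(?R))|a?)a#';   // the pattern from the question

echo 'abcd : ', var_export(firstMatch($atomicDots, 'abcd'), true), PHP_EOL;
echo 'aaaa : ', var_export(firstMatch($atomicAs, 'aaaa'), true), PHP_EOL;
```

If the atomic reading is right, the two lines should report abc and aaa, which is what the 21-step trace arrives at.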

ridgerunner
  • "The (?R) behaves like an atomic group", not in Perl it doesn't. Never thought that PCRE's implementation of `(?R)` would differ so dramatically. – Qtax Dec 09 '11 at 07:35
  • @Qtax I found this [here](http://www.tin.org/bin/man.cgi?section=3&topic=PCRE): "_it is a 'subroutine' call_" and "_A recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never reentered, even if it contains untried alternatives and there is a subsequent matching failure._" – Wiseguy Dec 09 '11 at 07:46
  • 3
    @Wiseguy, even more from [the PCRE manual](http://www.pcre.org/pcre.txt): *Subpatterns that are called as subroutines (whether or not recursively) are always treated as atomic groups in PCRE. This is like Python, but unlike Perl.* – Qtax Dec 09 '11 at 07:49
  • @Qtax "unlike Perl." Jackpot. Also, I like your official link better. :-) – Wiseguy Dec 09 '11 at 07:52
  • @Qtax - thanks for the heads up. I've modified the "atomic" text in my answer to refer to PCRE and not Perl. Good and relevant point. – ridgerunner Dec 09 '11 at 08:03
  • @ridgerunner: Your step 11 is wrong. A failure to match here should backtrack to the `.?` and try a zero-width match instead. I believe this may well be what the regex engine is doing, but it is equivalent to `/...?../` failing to match `"abcd"`. Did you get your information from actual regex engine trace, or is it conjecture? – Borodin Dec 09 '11 at 11:39
  • @Qtax et al: I am certain that (?R) behaves atomically in Perl as well. What makes you think otherwise? All the citations refer to PCRE and there is nothing to say that Perl is different. – Borodin Dec 09 '11 at 11:43
  • I take it back - the Perl documentation says this: `Note that this pattern does not behave the same way as the equivalent PCRE or Python construct of the same form. In Perl you can backtrack into a recursed group, in PCRE and Python the recursed into group is treated as atomic.` But that leaves me wondering whether that explains the PCRE engine not backtracking to let `.?` match a zero-length string instead? – Borodin Dec 09 '11 at 11:56
  • @Borodin, the atomic non-backtracking behavior explains everything. PCRE's `a(?:(?R)|a?)a` is equivalent to Perl's `a(?:(?>(?R))|a?)a`, but not to Perl's `a(?:(?R)|a?)a` which, as I said before, matches all lengths above 1. – Qtax Dec 09 '11 at 13:14
  • @Borodin - No, step 11 is correct. The zero width option of the `.?` expression was already used in the previous step 10, (where my comment says: _Optional dot matches EOS_). At that point the subject string pointer was already at the end of the string and there were no chars to be consumed. The zero-width option is the only one that can match at that point. – ridgerunner Dec 09 '11 at 18:52
  • @ridgerunner, this is exactly the kind of answer I was hoping for: precisely tracing the path of the regex engine. I will try to apply your answer to my expression and one more that I have in mind, and see if I can make it work. Thanks heaps. Do you have a tool that produces the regex engine path you posted, or did you do it all by hand? Such a tool would be a huge help right now. Wishing you a fun weekend. – zx81 Dec 09 '11 at 21:37
  • Or is that a "noisy" / debug mode in php itself? – zx81 Dec 09 '11 at 22:45
  • 1
    @playful: RegexBuddy has a debugger but unfortunately, it does not (yet) support recursive expressions. So to get a better handle on this (and other) recursive expressions under PHP, I utilize the: `preg_replace_callback()` function and call it recursively and keep track of the recursion depth with a static variable and print out match info at each depth level. I designed a [recursive BBCode parser](http://jmrware.com/articles/2011/fluxbb_jmr_dev/viewtopic.php?id=3) for the FluxBB forum project, which uses a similar recursive expression technique, so I've done this before. – ridgerunner Dec 10 '11 at 18:30
  • @ridgerunner Yes! I love RB. Jan says recursion is coming in a future version. But for debugging I find your output much more handsome than RB's debug. I ended up tracing the expression in a spreadsheet! I wish there was a "noisy mode" that could output the match tree in a nice format. Thanks for the tip about how you are using preg_replace_callback(). – zx81 Dec 10 '11 at 21:37
12

IMPORTANT: This describes recursive regex in PHP (which uses the PCRE library). Recursive regex works a bit differently in Perl itself.

Note: This is explained in the order you can conceptualize it. The regex engine does it backward of this; it dives down to the base case and works its way back.

Since your outer as are explicitly there, it will match an a between two as, or a previous recursion's match of the entire pattern between two as. As a result, it will only match odd numbers of as (middle one plus multiples of two).

At length of three, aaa is the current recursion's matching pattern, so on the fourth recursion it's looking for an a between two as (i.e., aaa) or the previous recursion's matched pattern between two as (i.e., a+aaa+a). Obviously it can't match five as when the string isn't that long, so the longest match it can make is three.

Similar deal with a length of six, as it can only match the "default" aaa or the previous recursion's match surrounded by as (i.e., a+aaaaa+a).


However, it does not match all odd lengths.

Since you're matching recursively, you can only match the literal aaa or a+(prev recurs match)+a. Each successive match will therefore always be two as longer than the previous match, or it will punt and fall back to aaa.

At a length of seven (matching against aaaaaaa), the previous recursion's match was the fallback aaa. So this time, even though there are seven as, it will only match three (aaa) or five (a+aaa+a).


When looping to longer lengths (80 in this example), look at the pattern (showing only the match, not the input):

no match
aa
aaa
aaa
aaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa

What's going on here? Well, I'll tell you! :-)

When a recursive match would be one character longer than the input string, it punts back to aaa, as we've seen. In every iteration after that, the pattern starts over, matching two more characters than the previous match. Every iteration, the length of the input increases by one, but the length of the match increases by two. When the match size finally catches back up and surpasses the length of the input string, it punts back to aaa. And so on.

Alternatively viewed, here we can see how many characters longer the input is compared to the match length in each iteration:

(input len.)  -  (match len.)  =  (difference)

 1   -    0   =    1
 2   -    2   =    0
 3   -    3   =    0
 4   -    3   =    1
 5   -    5   =    0
 6   -    3   =    3
 7   -    5   =    2
 8   -    7   =    1
 9   -    9   =    0
10   -    3   =    7
11   -    5   =    6
12   -    7   =    5
13   -    9   =    4
14   -   11   =    3
15   -   13   =    2
16   -   15   =    1
17   -   17   =    0
18   -    3   =   15
19   -    5   =   14
20   -    7   =   13
21   -    9   =   12
22   -   11   =   11
23   -   13   =   10
24   -   15   =    9
25   -   17   =    8
26   -   19   =    7
27   -   21   =    6
28   -   23   =    5
29   -   25   =    4
30   -   27   =    3
31   -   29   =    2
32   -   31   =    1
33   -   33   =    0
34   -    3   =   31
35   -    5   =   30
36   -    7   =   29
37   -    9   =   28
38   -   11   =   27
39   -   13   =   26
40   -   15   =   25
41   -   17   =   24
42   -   19   =   23
43   -   21   =   22
44   -   23   =   21
45   -   25   =   20
46   -   27   =   19
47   -   29   =   18
48   -   31   =   17
49   -   33   =   16
50   -   35   =   15
51   -   37   =   14
52   -   39   =   13
53   -   41   =   12
54   -   43   =   11
55   -   45   =   10
56   -   47   =    9
57   -   49   =    8
58   -   51   =    7
59   -   53   =    6
60   -   55   =    5
61   -   57   =    4
62   -   59   =    3
63   -   61   =    2
64   -   63   =    1
65   -   65   =    0
66   -    3   =   63
67   -    5   =   62
68   -    7   =   61
69   -    9   =   60
70   -   11   =   59
71   -   13   =   58
72   -   15   =   57
73   -   17   =   56
74   -   19   =   55
75   -   21   =   54
76   -   23   =   53
77   -   25   =   52
78   -   27   =   51
79   -   29   =   50
80   -   31   =   49

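For reasons that should now make sense, the difference drops to zero at input lengths that are one more than a power of 2 (2, 3, 5, 9, 17, 33, 65), and the match falls back to aaa on the very next length.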
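The table above can be regenerated without running any regex at all. The sketch below is a small model, under the assumption that the recursive call is atomic: the match for a string of n "a"s is either two characters longer than the match for n - 1 (when that still fits), or it falls back to aaa (or aa for a two-character input). The function name matchLength is made up for this illustration, and the model is not PCRE's implementation.

```php
<?php
// Model of the atomic-recursion behavior (an assumption-based sketch).
// matchLength($n) is the length matched at the start of a string of $n "a"s,
// or 0 when there is no match.
function matchLength(int $n): int
{
    $prev = 0; // result for a string one character shorter (the recursive call)
    for ($len = 1; $len <= $n; $len++) {
        if ($prev > 0 && $prev + 2 <= $len) {
            $cur = $prev + 2; // a + (recursive match) + a still fits
        } elseif ($len >= 3) {
            $cur = 3;         // recursion result discarded: a, a?, a
        } elseif ($len === 2) {
            $cur = 2;         // a, empty a?, a
        } else {
            $cur = 0;         // a single "a" cannot match
        }
        $prev = $cur;
    }
    return $prev;
}

// Print input length, match length and difference, as in the table above.
for ($n = 1; $n <= 20; $n++) {
    $m = matchLength($n);
    printf("%2d   -   %2d   =   %2d\n", $n, $m, $n - $m);
}
```

The loop keeps only the previous result because each depth of the recursion sees the same string minus its leading "a".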


Stepping through by hand

I've slightly simplified the original pattern for this example. Remember this. We will come back to it.

a((?R)|a)a

What the author Jeffrey Friedl means by "the (?R) construct makes a recursive reference to the entire regular expression" is that the regex engine will substitute the entire pattern in place of (?R) as many times as possible.

a((?R)|a)a                    # this

a((a((?R)|a)a)|a)a            # becomes this

a((a((a((?R)|a)a)|a)a)|a)a    # becomes this

# and so on...

When tracing this by hand, you could work from the inside out. In (?R)|a, a is your base case. So we'll start with that.

a(a)a

If that matches the input string, take that match (aaa) back to the original expression and put it in place of (?R).

a(aaa|a)a

If the input string is matched with our recursive value, substitute that match (aaaaa) back into the original expression to recurse again.

a(aaaaa|a)a

Repeat until you can't match your input using the result of the previous recursion.

Example
Input: aaaaaa
Regex: a((?R)|a)a

Start at base case, aaa.
Does the input match with this value? Yes: aaa
Recurse by putting aaa in the original expression:

a(aaa|a)a

Does the input match with our recursive value? Yes: aaaaa
Recurse by putting aaaaa in the original expression:

a(aaaaa|a)a

Does the input match with our recursive value? No: aaaaaaa

Then we stop here. The above expression could be rewritten (for simplicity) as:

aaaaaaa|aaa

Since it doesn't match aaaaaaa, it must match aaa. We're done, aaa is the final result.

Wiseguy
  • @Wiseguy, now starting to work my way through your beautifully detailed answer. Thank you! I may have misunderstood how the (?R) functions. You say it substitutes the previous match to form the new pattern. I thought that it substituted the previous **expression** to form the new pattern. In _Mastering Regular Expressions_, Jeffrey Friedl says that the sequence (?R) "means recursively apply the entire expression at this point". Is that wrong then? Also, in your first paragraph, you say there must be an "a" between the nests. But what about the optional "a?" on the right side of the alternation? – zx81 Dec 09 '11 at 05:47
  • -1 for being misleading. Regex recursion doesn't work like that. `(?R)` matches the entire *pattern* again, not the previous match that was found. – Borodin Dec 09 '11 at 05:53
  • That phrase sounds accurate to me. The "entire expression at this point" is all that has been matched by this point. Effectively, when it matches `aaa`, it takes that match and substitutes it in place of `(?R)` and runs again to see if that will match. If it does match (in this case it would be `aaaaa`), it takes _that_ match in place of `(?R)` and runs again, and so on. – Wiseguy Dec 09 '11 at 05:59
  • @Borodin That's what I meant at the beginning by "previous match of the entire pattern". See my comment just now. How do you suggest I rephrase? – Wiseguy Dec 09 '11 at 06:00
  • @Borodin For the sake of clarity, I just changed "previous match" to "previous iteration's match". Do you think that sufficiently fixes the ambiguity? – Wiseguy Dec 09 '11 at 06:06
  • @Wiseguy: If you are correct, it sounds to me like Jeffrey Friedl would have to be wrong. He also says "The (?R) construct makes a recursive reference to the entire regular expression". That would be a very roundabout way to say "the previous match". It sounds like he is talking about an expression, not a match. I may be reading it wrong. In any case, if you are right, it's a different world of recursive regex for me!!! – zx81 Dec 09 '11 at 06:11
  • @Wiseguy: I think you started off fine referring to a match of the entire pattern, but your use of 'iteration' rather than 'recursion' is confusing. I was finally convinced that you had the wrong end of the stick when you wrote about 'a+(prev iter match)+a' when it is *later* recursions that must match between the bracketing 'a's. – Borodin Dec 09 '11 at 06:52
  • 1
    @Borodin Oh, ok. I'm explaining backward in the way you might solve it by hand - starting at base case, using result in next execution. Programmatically, it dives all the way down to the base case then works its way back up. Is that the confusion? (Also, I'll change the "iteration" terminology to "recursion".) – Wiseguy Dec 09 '11 at 06:56
  • @playful I've added further explanation. – Wiseguy Dec 09 '11 at 07:24
  • 2
    @playful *Mastering Regular Expressions* was written on Perls regex, use Perl to experiment. ;-) – Qtax Dec 09 '11 at 07:38
  • @Wiseguy, you are clearly onto something. I have traced your logic up to a string of ten "a"s and it works. But then I apply it to a more complex expression and something breaks in the logic (as I understand it). I will go have a look at ridgerunner's answer---really needing to follow the path of the regex engine. – zx81 Dec 09 '11 at 21:26
4

Okay, I finally have it.

I awarded the correct answer to ridgerunner as he put me on the path to the solution, but I also wanted to write a full answer to the specific question in case someone else wants to fully understand the example too.

First the solution, then some notes.

A. Solution

Here is a summary of the steps followed by the engine. The steps should be read from top to bottom. They are not numbered. The recursion depth is shown in the left column, going up from zero to four and back down to zero. For convenience, the expression is shown at the top right. For ease of readability, the "a"s being matched are shown at their place in the string (which is shown at the very top).

        STRING    EXPRESSION
        a a a a   a(?:(?R)|a?)a

Depth   Match     Token
    0   a         first a from depth 0. Next step in the expression: depth 1.
    1     a       first a from depth 1. Next step in the expression: depth 2. 
    2       a     first a from depth 2. Next step in the expression: depth 3.  
    3         a   first a from depth 3. Next step in the expression: depth 4.  
    4             depth 4 fails to match anything. Back to depth 3 @ alternation.
    3             depth 3 fails to match rest of expression, back to depth 2
    2       a a   depth 2 completes as a/empty/a, back to depth 1
    1     a[a a]  a/[depth 2]a fails to complete, discard depth 2, back to alternation
    1     a       first a from depth 1
    1     a a     a from alternation
    1     a a a   depth 1 completes, back to depth 0
    0   a[a a a]  depth 0 fails to complete, discard depth 1, back to alternation
    0   a         first a from depth 0
    0   a a       a from alternation
    0   a a a     expression ends with successful match   
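The summary above can be reproduced with a small hand-rolled simulator. This is only a sketch of the pattern a(?:(?R)|a?)a under the assumption that a recursive call is atomic (its first result is kept or discarded whole, never re-entered); the function name matchAtomic and the log format are invented for this example, and none of it is PCRE's actual code.

```php
<?php
// Simulates a(?:(?R)|a?)a with an atomic recursive call (assumption).
// Returns the end offset of the match starting at $pos, or null on failure.
function matchAtomic(string $s, int $pos, int $depth, array &$log): ?int
{
    $isA = function (int $i) use ($s): bool {
        return $i < strlen($s) && $s[$i] === 'a';
    };

    if (!$isA($pos)) {
        $log[] = "depth $depth: first a fails at offset $pos";
        return null;
    }
    $log[] = "depth $depth: first a matches at offset $pos";

    // Alternative 1: recurse. Only the first result is ever considered.
    $end = matchAtomic($s, $pos + 1, $depth + 1, $log);
    if ($end !== null && $isA($end)) {
        $log[] = "depth $depth: completes around depth " . ($depth + 1);
        return $end + 1;
    }
    if ($end !== null) {
        $log[] = "depth $depth: no closing a, discard depth " . ($depth + 1);
    }

    // Alternative 2: a? taken greedily first, then as the empty string.
    foreach ([1, 0] as $optional) {
        $closing = $pos + 1 + $optional;
        if (($optional === 0 || $isA($pos + 1)) && $isA($closing)) {
            $log[] = "depth $depth: completes via a? ($optional char)";
            return $closing + 1;
        }
    }
    $log[] = "depth $depth: fails";
    return null;
}

$log = [];
$end = matchAtomic('aaaa', 0, 0, $log);
echo implode(PHP_EOL, $log), PHP_EOL;
echo 'match: ', $end === null ? 'none' : substr('aaaa', 0, $end), PHP_EOL;
```

The simulator only tries a match starting at offset 0, which is enough for strings made entirely of "a"s; a real engine would also retry from later offsets.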

B. Notes

1. The source of confusion


Here is what was counter-intuitive about it for me.

We are trying to match a a a a

I assumed that depth 0 of the recursion would match as a - - a and that depth 1 would match as - a a -

But in fact depth 1 first matches as - a a a

So depth 0 has nowhere to go to finish the match:

a [D1: a a a] 

...then what? We are out of characters but the expression is not over.

So depth 1 is discarded. Note that depth 1 is not attempted again by giving back characters, which would lead us to a different depth 1 match of - a a -

That's because recursive matches are atomic. Once a depth matches, it's all or nothing, you keep it all or you discard it all.

Once depth 1 is discarded, depth 0 moves on to the other side of the alternation, and returns the match: a a a

2. The source of clarity


What helped me the most was the example that ridgerunner gave. In his example, he showed how to trace the path of the engine, which is exactly what I wanted to understand.

Following this method, I traced the full path of the engine for our specific example. As I have it, the path is 25 steps long, so it is considerably longer than the summary above. But the summary is accurate to the path I traced.

Big Thanks to everyone else who contributed, in particular Wiseguy for a very intriguing presentation. I still wonder if somehow I might be missing something and Wiseguy's answer might amount to the same!

zx81
-5

After a lot of experimentation I think the PHP regex engine is broken. The exact same code under Perl works fine and matches all of your strings from beginning to end as I would expect.

Recursive regexes are hard on the imagination, but it looks to me as if /a(?:(?R)|a?)a/ should match aaaa as an a..a pair containing a second a..a pair, after which a second recursion fails and the alternate /a?/ matches instead as a null string.
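For comparison, here is a sketch of what the matching looks like when the engine is allowed to backtrack into the recursion, which is the Perl behavior described in the comments on this page. It is a model written with PHP generators, not Perl's engine, and the names matchEnds and firstBacktrackingMatch are made up for the illustration. Each call yields every possible end offset in the order a backtracking engine would try them, so the caller can ask the recursion for a shorter match when the closing "a" fails.

```php
<?php
// Model of a(?:(?R)|a?)a where backtracking INTO the recursion is allowed
// (assumption: this mirrors the Perl behavior reported in the comments).
// Yields every end offset for a match starting at $pos, in try order.
function matchEnds(string $s, int $pos): Generator
{
    $isA = function (int $i) use ($s): bool {
        return $i < strlen($s) && $s[$i] === 'a';
    };
    if (!$isA($pos)) {
        return; // the opening a fails, so there is nothing to yield
    }
    // Alternative 1: every result of the recursion is a candidate.
    foreach (matchEnds($s, $pos + 1) as $end) {
        if ($isA($end)) {
            yield $end + 1;
        }
    }
    // Alternative 2: a? greedy first, then empty.
    if ($isA($pos + 1) && $isA($pos + 2)) {
        yield $pos + 3;
    }
    if ($isA($pos + 1)) {
        yield $pos + 2;
    }
}

// First match starting at offset 0 (later start offsets are not tried).
function firstBacktrackingMatch(string $s): ?string
{
    foreach (matchEnds($s, 0) as $end) {
        return substr($s, 0, $end);
    }
    return null;
}

foreach (['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa'] as $try) {
    echo $try, ' : ', firstBacktrackingMatch($try) ?? 'no match', PHP_EOL;
}
```

Under this model every string of two or more "a"s is matched in full, because the outer level can reject the recursion's first, too-long result and ask for the next one instead of throwing the whole recursion away.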

Borodin
  • Indeed something is wrong with PCRE (or how PHP is using it). http://ideone.com/TibRz should obviously match all cases, but doesn't. – Qtax Dec 09 '11 at 07:30
  • @Borodin, thank you for testing this out on Perl. The plot thickens. And "Yes" to what you said about aaaa... That was how I wrote it and the results surprise me. I've had a hard time focusing on the PHP manual page for this topic before, but I think I'll give it another go today. :) Warmest wishes http://php.net/manual/en/regexp.reference.recursive.php – zx81 Dec 09 '11 at 07:31
  • 5
    No, the regex engine is not broken. It is matching precisely what it is being asked to match. See my answer for a step-by-step walkthrough of what is happening. – ridgerunner Dec 09 '11 at 07:42