IMPORTANT: This describes recursive regex in PHP (which uses the PCRE library). Recursive regex works a bit differently in Perl itself.
Note: This is explained in the order you can conceptualize it. The regex engine works in the opposite order: it dives down to the base case and works its way back up.
Since your outer `a`s are explicitly there, it will match an `a` between two `a`s, or a previous recursion's match of the entire pattern between two `a`s. As a result, it will only match odd numbers of `a`s (middle one plus multiples of two).
At a length of three, `aaa` is the current recursion's matching pattern, so on the fourth recursion it's looking for an `a` between two `a`s (i.e., `aaa`) or the previous recursion's matched pattern between two `a`s (i.e., `a`+`aaa`+`a`). Obviously it can't match five `a`s when the string isn't that long, so the longest match it can make is three.
Similar deal with a length of six, as it can only match the "default" `aaa` or the previous recursion's match surrounded by `a`s (i.e., `a`+`aaaaa`+`a`).
However, it does not match all odd lengths.
Since you're matching recursively, you can only match the literal `aaa` or `a`+(prev recurs match)+`a`. Each successive match will therefore always be two `a`s longer than the previous match, or it will punt and fall back to `aaa`.
At a length of seven (matching against `aaaaaaa`), the previous recursion's match was the fallback `aaa`. So this time, even though there are seven `a`s, it will only match three (`aaa`) or five (`a`+`aaa`+`a`).
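The "two longer than the previous match, or fall back to `aaa`" rule can be sketched in a few lines of Python. This is only a simulation of the rule as described here, not a call into PCRE; the function name `simulate_match_lengths` is made up for illustration, and the seed values for input lengths 1 to 3 are taken from the results listed in this answer.

```python
def simulate_match_lengths(max_len):
    """Match length for each input length 1..max_len under the rule above.

    This models the described behaviour only; it does not run PCRE.
    """
    # Lengths 1-3 are seeded from the observed results:
    # no match, then "aa", then "aaa".
    lengths = [0, 2, 3][:max_len]
    for input_len in range(4, max_len + 1):
        candidate = lengths[-1] + 2  # previous match wrapped in two more a's
        # Take the longer match if it fits, otherwise fall back to "aaa".
        lengths.append(candidate if candidate <= input_len else 3)
    return lengths


for input_len, match_len in enumerate(simulate_match_lengths(10), start=1):
    print(input_len, "a" * match_len or "no match")
```

Each printed line pairs an input length with the match the rule predicts for it.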
When looping to longer lengths (80 in this example), look at the pattern (showing only the match, not the input):
no match
aa
aaa
aaa
aaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
What's going on here? Well, I'll tell you! :-)
When a recursive match would be one character longer than the input string, it punts back to `aaa`, as we've seen. In every iteration after that, the pattern starts over, matching two more characters than the previous match. Every iteration, the length of the input increases by one, but the length of the match increases by two. When the match size finally catches back up and surpasses the length of the input string, it punts back to `aaa`. And so on.
Alternatively viewed, here we can see how many characters longer the input is compared to the match length in each iteration:
(input len.) - (match len.) = (difference)
1 - 0 = 1
2 - 2 = 0
3 - 3 = 0
4 - 3 = 1
5 - 5 = 0
6 - 3 = 3
7 - 5 = 2
8 - 7 = 1
9 - 9 = 0
10 - 3 = 7
11 - 5 = 6
12 - 7 = 5
13 - 9 = 4
14 - 11 = 3
15 - 13 = 2
16 - 15 = 1
17 - 17 = 0
18 - 3 = 15
19 - 5 = 14
20 - 7 = 13
21 - 9 = 12
22 - 11 = 11
23 - 13 = 10
24 - 15 = 9
25 - 17 = 8
26 - 19 = 7
27 - 21 = 6
28 - 23 = 5
29 - 25 = 4
30 - 27 = 3
31 - 29 = 2
32 - 31 = 1
33 - 33 = 0
34 - 3 = 31
35 - 5 = 30
36 - 7 = 29
37 - 9 = 28
38 - 11 = 27
39 - 13 = 26
40 - 15 = 25
41 - 17 = 24
42 - 19 = 23
43 - 21 = 22
44 - 23 = 21
45 - 25 = 20
46 - 27 = 19
47 - 29 = 18
48 - 31 = 17
49 - 33 = 16
50 - 35 = 15
51 - 37 = 14
52 - 39 = 13
53 - 41 = 12
54 - 43 = 11
55 - 45 = 10
56 - 47 = 9
57 - 49 = 8
58 - 51 = 7
59 - 53 = 6
60 - 55 = 5
61 - 57 = 4
62 - 59 = 3
63 - 61 = 2
64 - 63 = 1
65 - 65 = 0
66 - 3 = 63
67 - 5 = 62
68 - 7 = 61
69 - 9 = 60
70 - 11 = 59
71 - 13 = 58
72 - 15 = 57
73 - 17 = 56
74 - 19 = 55
75 - 21 = 54
76 - 23 = 53
77 - 25 = 52
78 - 27 = 51
79 - 29 = 50
80 - 31 = 49
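For reasons that should now make sense, this is tied to powers of 2: the match covers the whole input at lengths 3, 5, 9, 17, 33, and 65 (that is, 2^n + 1), and it falls back to `aaa` at the very next length.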
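The table also suggests a closed form for the match length. The sketch below is read off the numbers above rather than taken from PCRE's documentation, so treat it as an observation about this table; the name `match_length` is hypothetical.

```python
def match_length(input_len):
    """Closed-form match length read off the table above (an observation, not PCRE)."""
    if input_len < 2:
        return 0  # a single "a" does not match
    if input_len == 2:
        return 2  # "aa"
    # Largest length of the form 2**k + 1 that is <= input_len;
    # the table shows a full match at each of those lengths.
    full = (1 << ((input_len - 1).bit_length() - 1)) + 1
    if input_len == full:
        return input_len
    # After the reset to 3, the match grows by 2 per extra input character.
    return 3 + 2 * (input_len - full - 1)


# Input lengths (up to 80) where the difference column is 0.
full_matches = [n for n in range(1, 81) if match_length(n) == n]
print(full_matches)  # [2, 3, 5, 9, 17, 33, 65]
```

Subtracting `match_length(n)` from `n` reproduces the difference column, for example 80 - 31 = 49.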
Stepping through by hand
I've slightly simplified the original pattern for this example. Remember this. We will come back to it.
a((?R)|a)a
What the author, Jeffrey Friedl, means by "the (?R) construct makes a recursive reference to the entire regular expression" is that the regex engine will substitute the entire pattern in place of `(?R)` as many times as possible.
a((?R)|a)a # this
a((a((?R)|a)a)|a)a # becomes this
a((a((a((?R)|a)a)|a)a)|a)a # becomes this
# and so on...
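The same expansion can be produced mechanically with plain string substitution. This is a small illustration of the textual view only, assuming the simplified pattern above; `expand` is a made-up helper, and a real engine does not rewrite the pattern text like this.

```python
def expand(pattern, depth):
    """Textually substitute the whole pattern for (?R), `depth` times."""
    expanded = pattern
    for _ in range(depth):
        # Wrap the substituted copy in parentheses, as in the listing above.
        expanded = expanded.replace("(?R)", "(" + pattern + ")")
    return expanded


for depth in range(3):
    print(expand("a((?R)|a)a", depth))
```

With depths 0, 1, and 2 this prints the three lines of the listing above.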
When tracing this by hand, you could work from the inside out. In `(?R)|a`, `a` is your base case. So we'll start with that.
a(a)a
If that matches the input string, take that match (`aaa`) back to the original expression and put it in place of `(?R)`.
a(aaa|a)a
If the input string is matched with our recursive value, substitute that match (`aaaaa`) back into the original expression to recurse again.
a(aaaaa|a)a
Repeat until you can't match your input using the result of the previous recursion.
Example
Input: aaaaaa
Regex: a((?R)|a)a
Start at base case, `aaa`.
Does the input match with this value? Yes: `aaa`
Recurse by putting `aaa` in the original expression:
a(aaa|a)a
Does the input match with our recursive value? Yes: `aaaaa`
Recurse by putting `aaaaa` in the original expression:
a(aaaaa|a)a
Does the input match with our recursive value? No: `aaaaaaa`
Then we stop here. The above expression could be rewritten (for simplicity) as:
aaaaaaa|aaa
Since it doesn't match `aaaaaaa`, it must match `aaa`. We're done: `aaa` is the final result.
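The sequence of candidates tried in this example can be listed with a short Python sketch. It only enumerates the hand-trace candidates and whether each one fits the input; it does not model PCRE's actual backtracking or pick the final result, and `trace_candidates` is a hypothetical helper name.

```python
def trace_candidates(text):
    """List the candidates the hand trace tries and whether each one fits."""
    steps = []
    candidate = "aaa"  # base case: a(a)a
    while True:
        fits = text.startswith(candidate)
        steps.append((candidate, fits))
        if not fits:
            return steps  # stop at the first candidate that does not fit
        candidate = "a" + candidate + "a"  # a(<previous match>)a


for candidate, fits in trace_candidates("aaaaaa"):
    print(candidate, "yes" if fits else "no")
```

For the six-character input this lists `aaa` and `aaaaa` as fitting and `aaaaaaa` as the first candidate that does not, which is where the trace above stops.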