I understand that the pattern r'([a-z]+)\1+' searches for a repeated multi-character pattern in the search string, but I do not understand why, in case k2, the answer isn't 'aaaaa' (5 'a'):

import re
k1 = re.search(r'([a-z]+)\1+', 'aaaa')
k2 = re.search(r'([a-z]+)\1+', 'aaaaa')
k3 = re.search(r'([a-z]+)\1+', 'aaaaaa')
print(k1)  # <_sre.SRE_Match object; span=(0, 4), match='aaaa'>
print(k2)  # <_sre.SRE_Match object; span=(0, 4), match='aaaa'>
print(k3)  # <_sre.SRE_Match object; span=(0, 6), match='aaaaaa'>

Python 3.6.1

Ivaylo Strandjev
Maxim Andreev
  • Explain step by step what your current reasoning is. Otherwise it's impossible to say what is really tripping you up. – Mad Physicist Feb 06 '18 at 16:32
    https://regex101.com/ is a good site for working through RegEx (I'm not affiliated!) – AlG Feb 06 '18 at 16:33
    It has to be an even number of characters because you repeat a thing twice. It's twice because `+` is greedy, and the first one has precedence. – Mad Physicist Feb 06 '18 at 16:34
    The length of `(x)x` cannot be odd, no matter how many `x`s you have. – Jongware Feb 06 '18 at 16:34
  • It's because `+` is greedy. It's trying to match as much as possible. Either change it to `([a-z]+?)\1+` or use anchors `([a-z]+)\1+\b` – ctwheels Feb 06 '18 at 16:34
  • Just an FYI, this `([a-z]+)` will match the exact amount and content that this `\1` does. Therefore adding a `+` quantifier to `\1` does not match more than group 1 matched, it's impossible. This also results in an even match. If you change it to `([a-z])\1+` it will match 2 or more of what was captured in group 1. –  Feb 06 '18 at 17:16
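
The alternatives suggested in the comments can be compared side by side. This is only a minimal sketch on the five-a string from the question; the dictionary keys are labels made up for this example, and the variants are not interchangeable in general (for instance, `([a-z])\1+` only detects single-character repeats).

```python
import re

s = 'aaaaa'

# Patterns taken from the question and the comments; the labels are illustrative only.
patterns = {
    'greedy group':   r'([a-z]+)\1+',    # original pattern
    'lazy group':     r'([a-z]+?)\1+',   # Group 1 takes as little as possible
    'word boundary':  r'([a-z]+)\1+\b',  # match must end at a word boundary
    'one-char group': r'([a-z])\1+',     # Group 1 is always a single char
}

results = {name: re.search(p, s).group(0) for name, p in patterns.items()}
for name, matched in results.items():
    print(f'{name:15} -> {matched}')
# Only the original greedy pattern stops at 'aaaa'; the other three match 'aaaaa'.
```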

2 Answers

6

Because + is greedy.

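What happens is that ([a-z]+) first matches 'aaaaa', then backtracks one character at a time until \1+ can match, and stops there. 'aa' is the first (longest) value of ([a-z]+) that lets \1 match, so Group 1 is 'aa', \1+ matches one more 'aa', and the overall match is 'aaaa'. The fifth 'a' is left over because a second copy of 'aa' does not fit.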
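
One way to see this is to print what Group 1 captured next to the whole match. A small sketch using the three strings from the question (the `report` dictionary is just a name chosen for this example):

```python
import re

pattern = re.compile(r'([a-z]+)\1+')

report = {}
for s in ('aaaa', 'aaaaa', 'aaaaaa'):
    m = pattern.search(s)
    # group(1) is what ([a-z]+) kept after backtracking, group(0) is the full match
    report[s] = (m.group(1), m.group(0))
    print(f'{s!r}: Group 1 = {m.group(1)!r}, whole match = {m.group(0)!r}')
# 'aaaa':   Group 1 = 'aa',  whole match = 'aaaa'
# 'aaaaa':  Group 1 = 'aa',  whole match = 'aaaa'
# 'aaaaaa': Group 1 = 'aaa', whole match = 'aaaaaa'
```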

C_Elegans
2

The key notion here is backtracking. Whenever a pattern contains quantified subpatterns of variable length, the regex engine may match a string in several ways. Once a part of the regex after the quantified part fails to match, the engine can backtrack, i.e. give back a char held by the quantified subpattern and retry the subsequent subpatterns.

Have a look at the bigger picture:

[image]

Let's see how shorter strings match before moving on to the longer examples...

First, why is a single a not matched? Because the string must contain at least 2 chars: [a-z]+ and \1+ each need to match at least 1 char.

aa is matched: ([a-z]+) grabs the whole string first, then backtracks to leave some text for the \1+ pattern, which matches the second a, so there is a match.

The three-a string aaa matches as a whole: ([a-z]+) grabs the whole string first, then backtracks to leave some text for the \1+ pattern. Note that the capturing group can only hold one a here: with aa in the group, \1+ fails because only a single a remains. With a in Group 1, \1+ matches the remaining two as, giving a match of three as.
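
These three short cases can be checked directly. A quick sketch (the `short` dictionary is only a name used here); note that `re.search` returns `None` when nothing matches, so the single-a case needs a guard:

```python
import re

pattern = re.compile(r'([a-z]+)\1+')

short = {}
for s in ('a', 'aa', 'aaa'):
    m = pattern.search(s)
    # Store (Group 1, whole match), or None when there is no match at all
    short[s] = None if m is None else (m.group(1), m.group(0))
    print(s, '->', short[s])
# a   -> None
# aa  -> ('a', 'aa')
# aaa -> ('a', 'aaa')
```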

Now, on to the examples in the question.

The aaaa string matches in its entirety in much the same way as aa did: the capturing group grabs the whole aaaa at first, then backtracks since \1+ must also "find" some text, and the regex engine tries aaa in Group 1. However, \1+ then needs aaa and only one a is left, so backtracking goes on, and when there are two as in Group 1, the quantified backreference matches the last two as.

And the k2 case now:

The aaaaa string matches like this:

  • aaaaa is grabbed and placed into Group 1 by the ([a-z]+) part
  • \1+ finds no text left to match, so the engine backtracks: thanks to the + quantifier, the part before \1+ can match a shorter text
  • aaaa is tried (= placed into Group 1), to no avail, since \1 needs aaaa but only one a is left before the end of the string
  • aaa is tried, again to no avail (\1 needs aaa, but only two as are left)
  • aa is put into Group 1, and \1 matches the third and fourth as. \1+ cannot repeat again because only one a remains, so the match ends there with aaaa.
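
The steps above can be sketched as a small hand-written loop. This is not how the `re` module is implemented; `simulate` is a hypothetical helper that only mimics the order in which Group 1 candidates are tried, and it assumes the input consists of lowercase letters and that the match starts at index 0.

```python
import re

def simulate(s):
    # Hypothetical helper: try Group 1 candidates from longest to shortest,
    # the same order a greedy ([a-z]+) gives characters back.
    for size in range(len(s), 0, -1):
        group = s[:size]
        end = size
        # Backreference with +: consume as many further copies of Group 1 as fit
        while s.startswith(group, end):
            end += size
        if end > size:  # at least one copy matched, so the engine stops here
            return group, s[:end]
    return None  # no candidate left room for a copy of itself

print(simulate('aaaaa'))  # ('aa', 'aaaa')

# Compare with the real engine on runs of 1 to 7 a's
for n in range(1, 8):
    s = 'a' * n
    m = re.search(r'([a-z]+)\1+', s)
    real = (m.group(1), m.group(0)) if m else None
    print(n, simulate(s), real)
```

For these inputs the loop and `re.search` agree, which matches the description: the first Group 1 length that leaves room for one copy wins, and any leftover shorter than Group 1 is dropped.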

Here is a sample scheme of how the string is matched:

[image]

The last a cannot be matched:

[image]

Wiktor Stribiżew
    Thanks for thorough explanation. – Maxim Andreev Feb 07 '18 at 08:19
  • You are welcome. Just FYI: the red arrows in the images denote a backtracking step. The green highlighted text is the part of the pattern that is currently tried. The blue selection is the text that matched the currently tried subpattern. – Wiktor Stribiżew Feb 07 '18 at 08:25