
As I understand it, \2 represents the contents of group 2.

So the expression r'(\w*)(\w)\2' should return the contents of group 2 (i.e. whatever \w matched), but when I use a word that contains a repeated character, it returns the repeated character instead. Example:

re.search(r'(\w*)(\w)\2','finally').group(2) -> 'l'
re.search(r'(\w*)(\w)\2','finallyy').group(2) ->'y'

In the first example the output is 'l' instead of 'y'.

Can anyone tell me what exactly \2 means in a regular expression, and where my understanding is wrong?

  • It thus means that the character in the second group is repeated. So this matches string where a character is repeated *two times* (or more). – Willem Van Onsem Oct 02 '18 at 18:29
  • `\2` means the second capturing group - `(\w)`. The given pattern searches for a string followed by repeated letters. So, `\1` matches `fina` and `\2` matches `ll`. – vrintle Oct 02 '18 at 18:37

1 Answer

This is a "reference" to the second capture group. It thus means that the content in the second capture group is repeated.

For example with this regex, 'finally' and 'finallyy' are matched as:

(\w*) (\w) \2    <rest>
fina   l   l     y
finall y   y

Since the Kleene star is greedy, it will consume as many characters as possible while still allowing the rest of the pattern to match.
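The breakdown above can be checked with a short script. This is only a sketch: the helper name `groups` and the lazy variant `(\w*?)` are introduced here for comparison and are not part of the question.

```python
import re

# Pattern from the question: greedy prefix, one word character, then a
# backreference that must repeat exactly that character.
GREEDY = re.compile(r'(\w*)(\w)\2')
# Lazy variant (an assumption added for comparison): the prefix is kept
# as short as possible instead of as long as possible.
LAZY = re.compile(r'(\w*?)(\w)\2')

def groups(pattern, word):
    """Return (group 1, group 2, whole match) for the first match, or None."""
    m = pattern.search(word)
    return (m.group(1), m.group(2), m.group(0)) if m else None

print(groups(GREEDY, 'finally'))   # ('fina', 'l', 'finall')
print(groups(GREEDY, 'finallyy'))  # ('finall', 'y', 'finallyy')
print(groups(LAZY, 'finallyy'))    # ('fina', 'l', 'finall')
```

In 'finally' the only doubled character is 'l', so group 2 has to be 'l'. In 'finallyy' both 'll' and 'yy' qualify; the greedy prefix pushes the match to the last pair, while the lazy one stops at the first.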

So in short: if the second capture group matched foo, then \2 has to match the literal text foo at its position as well. It repeats the text that was captured, not the sub-pattern.
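A small illustration of that point, using a made-up pattern `(foo|bar)\1` that does not appear in the question: the backreference accepts only the text the group actually captured, even though the group itself could have matched something else.

```python
import re

# Hypothetical pattern for illustration: group 1 may match 'foo' or 'bar',
# and \1 must then repeat whichever of the two was captured.
REPEATED = re.compile(r'(foo|bar)\1')

def is_repeated(text):
    """True if the whole text is 'foo' or 'bar' written twice in a row."""
    return REPEATED.fullmatch(text) is not None

print(is_repeated('foofoo'))  # True
print(is_repeated('barbar'))  # True
print(is_repeated('foobar'))  # False: \1 needs 'foo' here, not 'bar'
```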

Strictly speaking, such constructs are not always regular expressions (at least not in the strict mathematical sense): regular expressions can only match regular languages, and regular languages must be recognizable by a finite state machine. If the referenced group can match an arbitrary number of characters (for example with (\w+)\1), then one cannot encode this in a finite state machine.
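As a rough sketch of what (\w+)\1 accepts in Python's `re` module (the helper name `repeated_half` is made up for this example): a string matches in full only if it is some word written twice, with no limit on the length of that word, which is what a fixed number of states cannot keep track of.

```python
import re

# Pattern from the paragraph above: a word of any length, then the same word again.
SQUARE = re.compile(r'(\w+)\1')

def repeated_half(text):
    """Return the repeated half if text is some word written twice, else None."""
    m = SQUARE.fullmatch(text)
    return m.group(1) if m else None

print(repeated_half('abcabc'))  # 'abc'
print(repeated_half('aaaa'))    # 'aa'
print(repeated_half('abcabd'))  # None
```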

Willem Van Onsem