What exactly is re.findall('(?=(b))','bbbb') doing? It returns ['b', 'b', 'b', 'b'], but I expected ['b', 'b', 'b'], since it should only return a 'b' if it sees another 'b' ahead?

Thanks!

Edit: It seems that re.findall('b(?=(b))','bbbb') returns ['b', 'b', 'b'] like I would expect, but I am still confused as to what re.findall('(?=(b))','bbbb') does.

Edit 2: Got it! Thank you for the responses.

  • It starts at the first index of the input string and runs until the last index. If you do `re.findall('(?=(bb))','bbbb')`, the output is `['bb', 'bb', 'bb']`. – Rohit-Pandey Sep 30 '18 at 08:58
  • I've reopened this question because it's not asking [how to find overlapping matches](https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches), and [Reference - What does this regex mean?](//stackoverflow.com/q/22937618) is a useless duplicate target that should never be used as a duplicate target, ever. – Aran-Fey Sep 30 '18 at 12:34

3 Answers

The problem is that the capturing group is inside the lookahead.

To do what you want you have to capture the letter, then use a lookahead that doesn't capture:

re.findall('(b)(?=b)','bbbb')

result:

['b', 'b', 'b']
Jean-François Fabre

You have a zero-length match there, and you have a capturing group. When the regular expression for re.findall has a capturing group, the resulting list will be what's been captured in those capturing groups (if anything).
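A minimal sketch of that behaviour, comparing a pattern without a capturing group to the one from the question (the variable names are only for illustration):

```python
import re

text = "bbbb"

# No capturing group: findall returns the full text of each match.
whole_matches = re.findall(r"b(?=b)", text)

# One capturing group: findall returns what the group captured instead.
# The overall match is empty here (a lookahead consumes nothing),
# but the group inside the lookahead still captures a "b".
group_matches = re.findall(r"(?=(b))", text)

print(whole_matches)  # ['b', 'b', 'b']
print(group_matches)  # ['b', 'b', 'b', 'b']
```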

Your regex matches at four positions: before the first b (the start of the string), before the second b, before the third b, and before the fourth b. Here's a diagram, where | marks the position being tried (spaces added for illustration):

 b b b b
|         captures the next b, passes

 b b b b
  |       captures the next b, passes

 b b b b
    |     captures the next b, passes

 b b b b
      |   captures the next b, passes

 b b b b
        | lookahead fails, match fails
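The same walk-through can be reproduced with `re.finditer`, which exposes where each match starts and ends. This is just an illustrative sketch; the `positions` name is not from the question:

```python
import re

# Each match is zero-length (start == end), yet group 1 holds the "b"
# that the lookahead saw just after that position.
positions = [
    (m.start(), m.end(), m.group(1))
    for m in re.finditer(r"(?=(b))", "bbbb")
]

for start, end, captured in positions:
    print(f"match at {start}..{end}, captured {captured!r}")

# Index 4 (the end of the string) is absent: nothing follows it,
# so the lookahead fails there.
```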

If you don't want a capturing group and only want to match the zero-length positions, use (?: instead of ( to make the group non-capturing:

(?=(?:b))

(though the resulting list will be composed of empty strings and won't be very useful)
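A quick check of what the non-capturing version returns; since the group no longer does anything, plain `(?=b)` should behave the same way:

```python
import re

# With no capturing group, findall falls back to the whole match,
# and each whole match is the empty string.
non_capturing = re.findall(r"(?=(?:b))", "bbbb")
plain_lookahead = re.findall(r"(?=b)", "bbbb")

print(non_capturing)    # ['', '', '', '']
print(plain_lookahead)  # ['', '', '', '']
```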

CertainPerformance

A positive lookahead (?=...) asserts a position without consuming any characters. Here the assertion succeeds 4 times, because there are 4 positions that are followed by a b. Inside the assertion you capture that b in the capturing group (b), and findall returns the contents of the group.

If you want b returned three times and you don't need the group, you could match b and add a lookahead that asserts that the next character is a b:

print(re.findall('b(?=b)','bbbb'))

The fourth bird