re.Pattern.findall works wrong

Question

I am trying to find all matches of a pattern in a string with pattern.findall, but it only partly works.

code

# -*- coding: UTF-8 -*-
import re

regex = r"(19|20|21)\d{2}"
text = "1912 2013 2134"

def main():
    pattern = re.compile(regex)
    # print(...) with a single argument behaves the same on Python 2 and 3
    print(pattern.findall(text))

if __name__ == '__main__':
    main()

and it prints:

['19', '20', '21']

Shouldn't it print ['1912', '2013', '2134']?

score 3 · Accepted Answer · edited May 23 '17 at 11:59


Quoting from the re.findall docs,

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

Since your original regex has exactly one capturing group, (19|20|21), findall returns only what that group captured for each match. You can see the effect by adding a second group, like this

regex = r"(19|20|21)(\d{2})"

Now we have two capturing groups, (19|20|21) and (\d{2}), so the result is a list of tuples:

[('19', '12'), ('20', '13'), ('21', '34')]

To get the whole match instead, use a non-capturing group, like this

regex = r"(?:19|20|21)\d{2}"

which gives the following output

['1912', '2013', '2134']
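
As a minimal sketch of the rule quoted from the docs, the three patterns above can be run side by side on the question's sample text. The dictionary name `patterns` and its labels are only illustrative, and the snippet uses Python 3 syntax rather than the Python 2 print statement from the question.

```python
import re

text = "1912 2013 2134"

# Illustrative labels; only the regexes themselves come from the answer above.
patterns = {
    "one capturing group": r"(19|20|21)\d{2}",
    "two capturing groups": r"(19|20|21)(\d{2})",
    "non-capturing group": r"(?:19|20|21)\d{2}",
}

# findall returns group contents when groups exist, whole matches otherwise.
results = {label: re.findall(regex, text) for label, regex in patterns.items()}

for label, found in results.items():
    print(label, "->", found)
```

With one group you get a list of strings, with two groups a list of tuples, and with no capturing group the full matches.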

answered Apr 21 '14 at 06:08 by thefourtheye · edited May 23 '17 at 11:59 by Community

@whatout I have linked another question where the non-capturing group is beautifully explained. Please check that once. – thefourtheye Apr 21 '14 at 06:26
I get the idea: if the pattern contains groups, `findall` returns only what the groups captured and drops the rest of the match. So if I try `((19|20|21)\d{2})`, the result will be `[('1912', '19'), ('2013', '20'), ('2134', '21')]`. – ssj Apr 21 '14 at 06:43

score 1 · Answer 2 · answered Apr 21 '14 at 06:11

It's working correctly: findall returns the contents of the capturing group (19|20|21), so you only get 19, 20 and 21.

You need a non-capturing group, so change it to (?:19|20|21), as described in the documentation.

Source: https://docs.python.org/2/howto/regex.html#non-capturing-and-named-groups

score 1 · Answer 3 · edited Apr 21 '14 at 06:27


Round brackets indicate capturing groups. In your regex, the group captures only the two-digit prefix, which is 19, 20 or 21, and that is what findall returns.

Perhaps you need this regex:

r'19\d{2}|20\d{2}|21\d{2}'

This matches 19 followed by two digits, 20 followed by two digits, or 21 followed by two digits. Because the pattern has no groups, findall returns the whole match.

Demo:

In [1]: import re
In [2]: regex = r'19\d{2}|20\d{2}|21\d{2}'
In [3]: text = "1912 2013 2134"
In [4]: pattern = re.compile(regex)
In [5]: pattern.findall(text)
Out[5]: ['1912', '2013', '2134']

answered Apr 21 '14 at 06:13 by shaktimaan · edited Apr 21 '14 at 06:27 by glglgl

You have only two elements. So, no need to use range. – thefourtheye Apr 21 '14 at 06:19
@thefourtheye Agreed. It is extraneous. – shaktimaan Apr 21 '14 at 06:19

Wouldn't that also match things starting with 10, 11, 29, etc? – Leigh Apr 21 '14 at 06:20

score 0 · Answer 4 · answered Apr 21 '14 at 06:20

Another alternative could be to refrain from findall() and instead do

print([m.group(0) for m in pattern.finditer(text)])

finditer() gives you an iterable producing Match objects. They can be queried about the properties of each match.

The other solutions make more elegant use of what regexps can do, but this one is more flexible, because it does not rely on findall's implicit rule about which groups are returned.
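
A small sketch of what querying the Match objects can look like, keeping the question's original pattern with its capturing group in place. The variable names `full`, `prefixes` and `spans` are just illustrative, and the syntax is Python 3.

```python
import re

# The question's original pattern, capturing group included.
pattern = re.compile(r"(19|20|21)\d{2}")
text = "1912 2013 2134"

matches = list(pattern.finditer(text))

full = [m.group(0) for m in matches]      # the whole match, regardless of groups
prefixes = [m.group(1) for m in matches]  # contents of the first capturing group
spans = [m.span() for m in matches]       # (start, end) offsets in the text

print(full)
print(prefixes)
print(spans)
```

Here group(0) always gives the full match, so the pattern does not need to be rewritten, and the group contents and positions remain available when they are wanted.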
