1

I am trying to find every match of a pattern in a string with pattern.findall, but it only partly works.

code

# -*- coding: UTF-8 -*-
import re
import pprint
regex = r"(19|20|21)\d{2}"
text = "1912 2013 2134"
def main():
    pattern = re.compile(regex)
    print(pattern.findall(text))

if __name__ == '__main__':
    main()

and it prints:

['19', '20', '21']

Shouldn't it print ['1912', '2013', '2134']?

Josh Crozier
ssj

4 Answers

3

Quoting from the re.findall docs,

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

Since your original regex has exactly one capturing group, (19|20|21), findall returned only the text captured by that group. You can see the effect by adding a second group:

regex = r"(19|20|21)(\d{2})"

Now there are two capturing groups, (19|20|21) and (\d{2}), so the result is a list of tuples:

[('19', '12'), ('20', '13'), ('21', '34')]

To fix this, use a non-capturing group, like this

regex = r"(?:19|20|21)\d{2}"

which gives the following output

['1912', '2013', '2134']
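
The three cases described in the quoted documentation can be compared side by side. This is only a minimal sketch that reuses the text from the question; the variable names are illustrative, and print() is called with a single argument so it should behave the same on Python 2 and Python 3.

```python
import re

text = "1912 2013 2134"

# No capturing group: findall returns each whole match.
no_group = re.findall(r"(?:19|20|21)\d{2}", text)

# One capturing group: findall returns only that group's text.
one_group = re.findall(r"(19|20|21)\d{2}", text)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"(19|20|21)(\d{2})", text)

print(no_group)
print(one_group)
print(two_groups)
```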
thefourtheye
  • @whatout I have linked another question where the non-capturing group is beautifully explained. Please check that once. – thefourtheye Apr 21 '14 at 06:26
  • I get the idea: if the pattern contains groups, `findall` returns only what the groups captured and drops the rest of the match. So if I try `((19|20|21)\d{2})`, the result will be `[('1912', '19'), ('2013', '20'), ('2134', '21')]` – ssj Apr 21 '14 at 06:43
1

It's working correctly: the pattern has a single capturing group, (19|20|21), so only 19, 20 and 21 are captured and returned.

You need a non-capturing group instead. Change it to (?:19|20|21), as described in the documentation.

Source: https://docs.python.org/2/howto/regex.html#non-capturing-and-named-groups
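
One way to check which kind of group a pattern contains is the groups attribute of a compiled pattern, which counts capturing groups only. The sketch below is illustrative (the variable names are not from the question) and also shows that both patterns match the same spans of the text; only what findall reports differs.

```python
import re

text = "1912 2013 2134"

capturing = re.compile(r"(19|20|21)\d{2}")
non_capturing = re.compile(r"(?:19|20|21)\d{2}")

# .groups counts capturing groups; (?:...) does not contribute to it.
print(capturing.groups)
print(non_capturing.groups)

# Both patterns match the same positions in the text.
spans_capturing = [m.span() for m in capturing.finditer(text)]
spans_non_capturing = [m.span() for m in non_capturing.finditer(text)]
print(spans_capturing == spans_non_capturing)
```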

Leigh
1

Round brackets indicate capturing groups. Your regex does match the four-digit numbers, but findall returns only what the group captured, which is the two-digit prefix 19, 20 or 21.

Perhaps you need this regex:

r'19\d{2}|20\d{2}|21\d{2}'

This matches 19, 20 or 21 followed by two digits. Because the pattern contains no group at all, findall returns each whole match.

Demo:

In [1]: import re
In [2]: regex = r'19\d{2}|20\d{2}|21\d{2}'
In [3]: text = "1912 2013 2134"
In [4]: pattern = re.compile(regex)
In [5]: pattern.findall(text)
Out[5]: ['1912', '2013', '2134']
glglgl
shaktimaan
0

Another alternative could be to refrain from findall() and instead do

print([i.group(0) for i in pattern.finditer(text)])

finditer() gives you an iterator that yields match objects, and each one can be queried for the properties of its match.

The other solutions make more elegant use of what regexps can do, but this one is more flexible because it does not rely on findall's implicit rule about which groups are returned.
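
As a rough illustration of querying those properties, the sketch below keeps the original capturing pattern from the question and reads the whole match, the captured prefix and the position from each match object. The list name matches is only illustrative.

```python
import re

pattern = re.compile(r"(19|20|21)\d{2}")
text = "1912 2013 2134"

matches = []
for m in pattern.finditer(text):
    # group(0) is the whole match, group(1) the first capturing group,
    # start()/end() give the position of the match in the text.
    matches.append((m.group(0), m.group(1), m.start(), m.end()))

print(matches)
```

Here the capturing group stays in the pattern, yet the full four-digit match is still available through group(0).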

glglgl