12

I ran into a small problem using Python Regex.

Suppose this is the input:

(zyx)bc

What I'm trying to achieve is to capture whatever is between parentheses as a single match, and each character outside the parentheses as an individual match. The desired result would be along the lines of:

['zyx','b','c']

The order of matches should be kept.

I've tried obtaining this with Python 3.3, but can't seem to figure out the correct Regex. So far I have:

matches = findall(r'\((.*?)\)|\w', '(zyx)bc')

print(matches) yields the following:

['zyx','','']

Any ideas what I'm doing wrong?

Unihedron
Julian Laval
  • It was just a sample input. The regex should be able to differentiate between different cases, be they for example (ab)(bc)(ca), abc, (abc)(abc)(abc), or (zyx)bc, etc whilst recognizing which chars are within parentheses and which are not. – Julian Laval Jan 06 '13 at 13:03

5 Answers

15

From the documentation of re.findall:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

While your regexp matches the string three times, the (.*?) group does not take part in the last two matches, so findall returns an empty string for each of them. If you want the output of the other half of the regexp, you can add a second group:

>>> re.findall(r'\((.*?)\)|(\w)', '(zyx)bc')
[('zyx', ''), ('', 'b'), ('', 'c')]

Alternatively, you could remove all the groups to get a simple list of strings again:

>>> re.findall(r'\(.*?\)|\w', '(zyx)bc')
['(zyx)', 'b', 'c']

You would need to manually remove the parentheses though.
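
To see the rule from the documentation in action, here is a small sketch that runs the same input through three variants of the pattern. The variable names are only illustrative, and the third variant (a non-capturing group) is an extra not mentioned above; it is included to show that grouping alone does not change what findall returns, only capturing does.

```python
import re

text = '(zyx)bc'

# No capturing group: each item is the whole match.
no_group = re.findall(r'\(.*?\)|\w', text)

# One capturing group: each item is that group's text,
# or '' when the other alternative matched.
one_group = re.findall(r'\((.*?)\)|\w', text)

# Non-capturing group (?:...): grouping without affecting findall's output.
non_capturing = re.findall(r'\((?:.*?)\)|\w', text)

print(no_group)       # ['(zyx)', 'b', 'c']
print(one_group)      # ['zyx', '', '']
print(non_capturing)  # ['(zyx)', 'b', 'c']
```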

James Henstridge
  • FYI: Thanks for the answer. To remove the parentheses: matches = [match.strip('()') for match in findall(r'\(.*?\)|\w', case)] – Julian Laval Jan 06 '13 at 13:46
4

Other answers have shown you how to get the result you need, but with the extra step of manually removing the parentheses. If you use lookarounds in your regex, you won't need to strip the parentheses manually:

>>> import re
>>> s = '(zyx)bc'
>>> print (re.findall(r'(?<=\()\w+(?=\))|\w', s))
['zyx', 'b', 'c']

Explained:

(?<=\() // lookbehind: preceded by a left parenthesis
\w+     // one or more word characters, followed by:
(?=\))  // lookahead: a right parenthesis
|       // OR
\w      // any single word character
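
One thing worth checking with this pattern is what happens when the parenthesized part contains something other than word characters. The sketch below uses a made-up second input, '(a b)c', purely for illustration: because \w+ cannot cross the space, the first alternative fails and the characters fall through to the single-character branch.

```python
import re

pattern = r'(?<=\()\w+(?=\))|\w'

# Word characters only inside the parentheses: grouped as one match.
simple = re.findall(pattern, '(zyx)bc')

# A space inside the parentheses: \w+ stops at the space, the lookahead
# for ')' fails, and each letter is matched on its own by the \w branch.
with_space = re.findall(pattern, '(a b)c')

print(simple)      # ['zyx', 'b', 'c']
print(with_space)  # ['a', 'b', 'c']
```

If the parenthesized content may contain arbitrary characters, one of the group-based patterns is probably the safer choice.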
alan
3

Let's take a look at our output using re.DEBUG.

branch 
  literal 40 
  subpattern 1 
    min_repeat 0 65535 
      any None 
  literal 41 
or
  in 
    category category_word

Ouch, there's only one subpattern in there, and when a pattern contains a subpattern, re.findall returns only what that subpattern captured!

a = re.findall(r'\((.*?)\)|(.)', '(zyx)bc',re.DEBUG); a
[('zyx', ''), ('', 'b'), ('', 'c')]
branch 
  literal 40 
  subpattern 1 
    min_repeat 0 65535 
      any None 
  literal 41 
or
  subpattern 2 
    any None

Better. :)

Now we just have to make this into the format you want.

[i[0] if i[0] != '' else i[1] for i in a]
['zyx', 'b', 'c']
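
As a possible variation on the two steps above, the matching and the clean-up can be folded into one pass with re.finditer, which yields match objects instead of strings or tuples. This is only a sketch: the helper name split_tokens is made up for illustration, and it keeps \w for the outside characters as in the question.

```python
import re

def split_tokens(s):
    """Return parenthesized runs as one item, other word chars one by one."""
    result = []
    for m in re.finditer(r'\((.*?)\)|\w', s):
        # group(1) is None when the \w branch matched; in that case
        # fall back to the whole match, group(0).
        if m.group(1) is not None:
            result.append(m.group(1))
        else:
            result.append(m.group(0))
    return result

# Sample inputs taken from the comment under the question.
for case in ['(zyx)bc', '(ab)(bc)(ca)', 'abc', '(abc)(abc)(abc)']:
    print(case, split_tokens(case))
```

Testing for None rather than for an empty string means an empty pair of parentheses, '()', would still come out as an empty item instead of being confused with the other branch.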
Fredrick Brennan
2

The docs mention treating groups specially, so don't put a group around the parenthesized pattern, and you'll get everything, but you'll need to remove the parens from the matched data yourself:

>>> re.findall(r'\(.+?\)|\w', '(zyx)bc')
['(zyx)', 'b', 'c']

or use more groups, then process the resulting tuples to get the strings you seek:

>>> [''.join(t) for t in re.findall(r'\((.+?)\)|(\w)', '(zyx)bc')]
['zyx', 'b', 'c']
Ned Batchelder
1
In [108]: strs="(zyx)bc"

In [109]: re.findall(r"\(\w+\)|\w",strs)
Out[109]: ['(zyx)', 'b', 'c']

In [110]: [x.strip("()") for x in re.findall(r"\(\w+\)|\w",strs)]
Out[110]: ['zyx', 'b', 'c']
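
strip("()") is safe with this pattern because \w+ can never contain a parenthesis. If the pattern is loosened to \(.*?\), as in some of the other answers, strip can remove more than the outer pair. The sketch below, with a hypothetical helper name and a made-up input '((x)y', slices off exactly one character from each end instead.

```python
import re

def split_tokens_sliced(s):
    """Drop exactly one '(' and one ')' from each parenthesized match."""
    out = []
    for tok in re.findall(r'\(.*?\)|\w', s):
        if tok.startswith('('):
            out.append(tok[1:-1])  # remove only the outer pair
        else:
            out.append(tok)
    return out

print(split_tokens_sliced('(zyx)bc'))  # ['zyx', 'b', 'c']
print(split_tokens_sliced('((x)y'))    # ['(x', 'y']
print('((x)'.strip('()'))              # 'x': strip removed both leading parens
```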
Ashwini Chaudhary