Regular Expression conflict due to user inputs

Question

I would like to split a string into 2 groups, based on a regular expression. The string has basically the following structure:

some text (data1 | data2 | data3 | data4)

I've used a simple regular expression as follows:

re.match(r"^(?P<title>.*)\((?P<data>.*)\)$", s)

It works fine as long as the string contains no other parentheses that conflict with the regular expression.

But if there are parentheses inside one of the groups, the result is not what I expect:

>>> import re
>>> def process_string1(s):
...    r = re.match(r"^(?P<title>.*?)\((?P<data>.*)\)$", s)
...    return r.groups()
...
>>> def process_string2(s):
...    r = re.match(r"^(?P<title>.*)\((?P<data>.*)\)$", s)
...    return r.groups()
...
>>> s = "this is an example (detail) (data1 | data2 | data3 | data4)"
>>> print(process_string1(s))
('this is an example ', 'detail) (data1 | data2 | data3 | data4')      # Wrong
>>> print(process_string2(s))
('this is an example (detail) ', 'data1 | data2 | data3 | data4')      # Good
>>> s = "this is another example (data1 (detail) | data2 | data3 | data4)"
>>> print(process_string1(s))
('this is another example ', 'data1 (detail) | data2 | data3 | data4') # Good
>>> print(process_string2(s))
('this is another example (data1 ', 'detail) | data2 | data3 | data4') # Wrong

Can you please help me?

Python regular expressions are "greedy" by default. They grab the longest string that satisfies the expression. The first part will grab '(' if there are other '(' in the string. Maybe you should change '.*' with '[^(]*'? — swstephe, Dec 09 '14 at 16:11
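
For reference, here is a small sketch of how the suggested '[^(]*' title compares with the greedy '.*' on the first example string from the question. The helper names split_greedy and split_negated are made up for illustration; the patterns are otherwise the ones from the question.

```python
import re

def split_greedy(s):
    # Greedy title: '.*' backtracks only as far as the LAST '(' that still matches.
    return re.match(r"^(?P<title>.*)\((?P<data>.*)\)$", s).groups()

def split_negated(s):
    # Negated class: the title can never contain '(', so it stops at the FIRST one.
    return re.match(r"^(?P<title>[^(]*)\((?P<data>.*)\)$", s).groups()

s = "this is an example (detail) (data1 | data2 | data3 | data4)"
print(split_greedy(s))   # ('this is an example (detail) ', 'data1 | data2 | data3 | data4')
print(split_negated(s))  # ('this is an example ', 'detail) (data1 | data2 | data3 | data4')
```

On this input the negated class gives the same split as the lazy '.*?' version: it handles parentheses nested in the data group, but a parenthesised detail in the title still ends up in the data group.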

score 1 · Answer 1 · edited May 23 '17 at 12:28

In short, regex is not the right tool for matching recursive or nested structures like yours. You are asking a regex to match:

something (something (something) something)

which is recursive, as the innermost something can itself be something (something) something. Regex is not a suitable tool for this; you should use a parser instead.
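
As a rough illustration of the parser idea, the sketch below scans backwards from the final ')' and counts nesting depth to find the '(' that opens the trailing group. The function name split_title_data is hypothetical, and the sketch assumes the string ends with ')' and that the parentheses inside the trailing group are balanced.

```python
def split_title_data(s):
    """Split 'title (data)' at the '(' that matches the final ')'."""
    s = s.rstrip()
    if not s.endswith(")"):
        raise ValueError("expected the string to end with ')'")
    depth = 0
    # Walk backwards from the final ')' and track how deeply we are nested.
    for i in range(len(s) - 1, -1, -1):
        if s[i] == ")":
            depth += 1
        elif s[i] == "(":
            depth -= 1
            if depth == 0:
                # s[i] opens the trailing group; everything before it is the title.
                return s[:i], s[i + 1:-1]
    raise ValueError("unbalanced parentheses")

print(split_title_data("this is an example (detail) (data1 | data2 | data3 | data4)"))
# ('this is an example (detail) ', 'data1 | data2 | data3 | data4')
print(split_title_data("this is another example (data1 (detail) | data2 | data3 | data4)"))
# ('this is another example ', 'data1 (detail) | data2 | data3 | data4')
```

Because the scan only looks at the trailing group, parentheses in the title do not affect where the split happens.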

score 1 · Answer 2 · answered Dec 09 '14 at 20:22

Many flavors of regex support recursion, or nested constructs. Python's engine doesn't currently support it, but a replacement module is in the works, and it does support recursion:

https://pypi.python.org/pypi/regex

answered Dec 09 '14 at 20:22

Brian Stephens


score 0 · Answer 3 · answered Dec 09 '14 at 17:18

I have finally adapted my code as follows:

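>>> import re
>>> def process_string(s):
...    # Try the greedy title first; it keeps a "(detail)" inside the title.
...    r = re.match(r"^(?P<title>.*)\((?P<data>.*)\)$", s)
...    # An unclosed "(" in the title means the split landed inside the data group.
...    if '(' in r.group('title') and ')' not in r.group('title'):
...        r = re.match(r"^(?P<title>.*?)\((?P<data>.*)\)$", s)
...    return r.groups()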

Which produces the result I was expecting:

>>> print(process_string("this is an example (detail) (data1 | data2 | data3 | data4)"))
('this is an example (detail) ', 'data1 | data2 | data3 | data4')

>>> print(process_string("this is an example (data1 (detail) | data2 | data3 | data4)"))
('this is an example ', 'data1 (detail) | data2 | data3 | data4')
