Matching Nested Structures With Regular Expressions in Python

Question

I seem to remember that Regular Expressions in DotNet have a special mechanism that allows for the correct matching of nested structures, like the grouping in "( (a ( ( c ) b ) ) ( d ) e )".

What is the python equivalent of this feature? Can this be achieved using regular expressions with some workaround? (Though it seems to be the sort of problem that current implementations of regex aren't designed for)

score 22 · Answer 1 · answered Jul 08 '09 at 23:10

22

Regular expressions cannot parse nested structures. Nested structures are not regular, by definition. They cannot be constructed by a regular grammar, and they cannot be parsed by a finite state automaton (a regular expression can be seen as a shorthand notation for an FSA).

Today's "regex" engines sometimes support some limited "nesting" constructs, but from a technical standpoint, they shouldn't be called "regular" anymore.

answered Jul 08 '09 at 23:10

Svante

46,788
11
77
118

7

+1 for this important piece of information. It should be noted that adding nesting support is NOT harmless. One of the cool things about true regex engines is that need no extra memory while processing, except a constant amount to store the state machine and a variable to remember the current state. Another is the running speed, which I think is linear to the length of the input. Adding nesting support messes up both of these benefits. – harms Jul 09 '09 at 15:20
1

@harms: Python regexes are already non-regular (they support backreferences) and can demonstrate exponential behavior [`re.match(r'(A+)*B', 'A'*n)`](http://www.rexegg.com/regex-explosive-quantifiers.html). `R = r'[^()]*' \n for _ in range(expr.count('(')): R = f'(?:{R}|\({R}\))+' \n re.fullmatch(R, expr)`. Here's `O(n**2)` algorithm: `is_prime = lambda n: not re.fullmatch(r'1?|(11+?)\1+', '1'*n)`. Though adding the recursion support would make regexes even larger problem than they are now (but "we are all consenting adults here"). – jfs Nov 12 '16 at 16:58

ars · Accepted Answer · 2009-07-09T15:19:28.807

20

You can't do this generally using Python regular expressions. (.NET regular expressions have been extended with "balancing groups" which is what allows nested matches.)

However, PyParsing is a very nice package for this type of thing:

from pyparsing import nestedExpr

data = "( (a ( ( c ) b ) ) ( d ) e )"
print nestedExpr().parseString(data).asList()

The output is:

[[['a', [['c'], 'b']], ['d'], 'e']]

More on PyParsing:

http://pyparsing.wikispaces.com/Documentation

edited Jul 09 '09 at 15:19

answered Jul 08 '09 at 23:14

ars

106,073
21
135
131

1

Pyparsing is no longer hosted on wikispaces.com. Go to https://github.com/pyparsing/pyparsing – PaulMcG Aug 27 '18 at 12:59

score 14 · Answer 3 · edited May 23 '17 at 12:10

Edit: falsetru's nested parser, which I've slightly modified to accept arbitrary regex patterns to specify delimiters and item separators, is faster and simpler than my original re.Scanner solution:

import re

def parse_nested(text, left=r'[(]', right=r'[)]', sep=r','):
    """ https://stackoverflow.com/a/17141899/190597 (falsetru) """
    pat = r'({}|{}|{})'.format(left, right, sep)
    tokens = re.split(pat, text)
    stack = [[]]
    for x in tokens:
        if not x or re.match(sep, x):
            continue
        if re.match(left, x):
            # Nest a new list inside the current list
            current = []
            stack[-1].append(current)
            stack.append(current)
        elif re.match(right, x):
            stack.pop()
            if not stack:
                raise ValueError('error: opening bracket is missing')
        else:
            stack[-1].append(x)
    if len(stack) > 1:
        print(stack)
        raise ValueError('error: closing bracket is missing')
    return stack.pop()

text = "a {{c1::group {{c2::containing::HINT}} a few}} {{c3::words}} or three"

print(parse_nested(text, r'\s*{{', r'}}\s*'))

yields

['a', ['c1::group', ['c2::containing::HINT'], 'a few'], ['c3::words'], 'or three']

Nested structures can not be matched with Python regex alone, but it is remarkably easy to build a basic parser (which can handle nested structures) using re.Scanner:

import re

class Node(list):
    def __init__(self, parent=None):
        self.parent = parent

class NestedParser(object):
    def __init__(self, left='\(', right='\)'):
        self.scanner = re.Scanner([
            (left, self.left),
            (right, self.right),
            (r"\s+", None),
            (".+?(?=(%s|%s|$))" % (right, left), self.other),
        ])
        self.result = Node()
        self.current = self.result

    def parse(self, content):
        self.scanner.scan(content)
        return self.result

    def left(self, scanner, token):
        new = Node(self.current)
        self.current.append(new)
        self.current = new

    def right(self, scanner, token):
        self.current = self.current.parent

    def other(self, scanner, token):
        self.current.append(token.strip())

It can be used like this:

p = NestedParser()
print(p.parse("((a+b)*(c-d))"))
# [[['a+b'], '*', ['c-d']]]

p = NestedParser()
print(p.parse("( (a ( ( c ) b ) ) ( d ) e )"))
# [[['a', [['c'], 'b']], ['d'], 'e']]

By default NestedParser matches nested parentheses. You can pass other regex to match other nested patterns, such as brackets, []. For example,

p = NestedParser('\[', '\]')
result = (p.parse("Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet"))
# ['Lorem ipsum dolor sit amet', ['@a xxx yyy', ['@b xxx yyy', ['@c xxx yyy']]],
# 'lorem ipsum sit amet']

p = NestedParser('<foo>', '</foo>')
print(p.parse("<foo>BAR<foo>BAZ</foo></foo>"))
# [['BAR', ['BAZ']]]

Of course, pyparsing can do a whole lot more than the above code can. But for this single purpose, the above NestedParser is about 5x faster for small strings:

In [27]: import pyparsing as pp

In [28]: data = "( (a ( ( c ) b ) ) ( d ) e )"    

In [32]: %timeit pp.nestedExpr().parseString(data).asList()
1000 loops, best of 3: 1.09 ms per loop

In [33]: %timeit NestedParser().parse(data)
1000 loops, best of 3: 234 us per loop

and around 28x faster for larger strings:

In [44]: %timeit pp.nestedExpr().parseString('({})'.format(data*10000)).asList()
1 loops, best of 3: 8.27 s per loop

In [45]: %timeit NestedParser().parse('({})'.format(data*10000))
1 loops, best of 3: 297 ms per loop

This is a brilliant hint! Haven't even heard about re.Scanner. This would totally answer my question from yesterday. [here](http://stackoverflow.com/questions/14712046/regex-to-extract-nested-patterns). If it wouldn't be closed, I would choose this answer... Thank's again for letting me know! — hetsch, Feb 06 '13 at 21:07
I am struggling with the same problem as the original poster, but your solution has one aspect that keeps me from being able to use it: it seems to remove all whitespace characters from the ends of the results. How can I make it so that the spaces at the end of any listed strings are preserved? — DocBuckets, Jun 16 '14 at 02:30
@DocBuckets: Could you give an example of the input and desired output? Is only whitespace at the end of the string to be preserved, or whitespace everywhere? And is the whitespace after the last closing parethesis? — unutbu, Jun 16 '14 at 09:40
@unutbu I am parsing `string = "a {{c1::group {{c2::containing::HINT}} a few}} {{c3::words}} or three"` based on the NestedParser code above and `{{`, `}}` as my delimiters. The code does not preserve the space between "group " and "{{c2::" for example. — DocBuckets, Jun 16 '14 at 17:54
@DocBuckets: I've edited my answer to show an alternative method which (I think) parses `string` as desired. — unutbu, Jun 16 '14 at 18:35
@unutbu fantastic! It does just what I'd hoped. It's also quite simple and easy for me to dissect and understand. Many thanks. — DocBuckets, Jun 16 '14 at 18:39
@unutbu I saw your answer on re.scanner. Im not too sure if it would be applicable in my issue here https://stackoverflow.com/questions/58915263/extract-sentences-in-nested-parentheses-using-python as well....? — , Nov 18 '19 at 13:52

score 2 · Answer 4 · answered Jul 08 '09 at 16:38

2

Python doesn't support recursion in regular expressions. So equivalents to .NET's balancing groups or PCRE regex in Perl are not immediately possible in Python.

Like you said yourself: this is NOT a problem you should really be solving with a single regex.

answered Jul 08 '09 at 16:38

Il-Bhima

10,298
1
42
50

PCRE supports recursive patterns using the (?R) directive among other things. Python might have supported an older PCRE but not the more recent versions. http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions – Il-Bhima Jul 09 '09 at 07:27

score 1 · Answer 5 · answered Jul 08 '09 at 16:34

1

I'd recommend removing the nesting from the regex itself, looping through the results and performing regexes on that.

answered Jul 08 '09 at 16:34

edef

733
4
13

score 0 · Answer 6 · answered Jul 08 '09 at 16:57

Are you talking about recursion? It's not clear from your question. An example:

ActivePython 2.6.1.1 (ActiveState Software Inc.) based on
Python 2.6.1 (r261:67515, Dec  5 2008, 13:58:38) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> p = re.compile(r"((\w+((\d+)[.;]))(\s+)\d)")
>>> m = p.match("Fred99. \t9")
>>> m
<_sre.SRE_Match object at 0x00454F80>
>>> m.groups()
('Fred99. \t9', 'Fred99.', '9.', '9', ' \t')

This shows matching of nested groups. The numbering of groups depends on the order in which their opening parenthesis occurs in the pattern.

Matching Nested Structures With Regular Expressions in Python

6 Answers6

Linked

Related