Recursive pattern in regex

Question

This is closely related to "Regular Expression to match outer brackets"; however, I specifically want to know how, or whether, it's possible to do this using regex's recursive pattern. I have yet to find a Python example using this strategy, so I think this ought to be a useful question!

I've seen some claims that recursive patterns can be used to match balanced parentheses, but no examples using Python's regex package (note: re does not support recursive patterns; you need to use regex).

One claim is that the syntax is b(?:m|(?R))*e where:

b is what begins the construct, m is what can occur in the middle of the construct, and e is what can occur at the end of the construct

I want to extract matches for the outer braces in the following:

"{1, {2, 3}} {4, 5}"
["1, {2, 3}", "4, 5"]  # desired

Note that it is easy to do the same for the inner braces:

re.findall(r"{([^{}]*)}", "{1, {2, 3}} {4, 5}")
['2, 3', '4, 5']

(In my example I was using finditer (over match objects), see here.)

So I had hoped that the following, or some variation, would work:

regex.findall(r"{(:[^{}]*|?R)}", "{1, {2, 3}} {4, 5}")
regex.findall(r"({(:[^{}]*|?R)})", "{1, {2, 3}} {4, 5}")
regex.findall(r"({(:.*|(?R))*})", "{1, {2, 3}} {4, 5}")
regex.findall(r"({(:.*)|(?R)*})", "{1, {2, 3}} {4, 5}")
regex.findall(r"({(:[^{}])|(?R)})", "{1, {2, 3}} {4, 5}")

but I'm scuppered by either [] or error: too much backtracking.

Is it possible to extract match objects for the outer braces using regex's recursion?

Obviously, I run the risk of being shot down with:

don't parse html with regex
do this with pyparse
write a proper lexer & parser e.g. using ply

I want to emphasise that this is about how to use the recursive pattern (which, if my understanding is correct, takes us outside of regular language parsing, so it may actually be possible!). If it can be done, this ought to be a cleaner solution.

Thank you, I've never reliably known how to do this type of recursion in PCRE. Knowing about `(?R)` and `b(?:m|(?R))*e` is a great trick that I had never seen so plainly spelled out :) — Sam, Oct 15 '14 at 15:21

Casimir et Hippolyte · Accepted Answer · 2017-05-17T20:28:54.867

58

The pattern is:

{((?>[^{}]+|(?R))*)}

You can see this works for your example:

regex.findall("{((?>[^{}]+|(?R))*)}", "{1, {2, 3}} {4, 5}")
# ['1, {2, 3}', '4, 5']
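
Since the question asks for match objects rather than plain strings, here is a minimal sketch of the same pattern used with finditer. It assumes the third-party regex package is installed; the name OUTER_BRACES is only illustrative.

```python
import regex  # third-party "regex" package from PyPI, not the stdlib "re"

# Same pattern as above, compiled once (the constant name is illustrative).
OUTER_BRACES = regex.compile(r"{((?>[^{}]+|(?R))*)}")

text = "{1, {2, 3}} {4, 5}"
matches = list(OUTER_BRACES.finditer(text))

for m in matches:
    # span() is the position of the whole outer construct, braces included;
    # group(1) is the content between the outer braces.
    print(m.span(), repr(m.group(0)), repr(m.group(1)))
# (0, 11) '{1, {2, 3}}' '1, {2, 3}'
# (12, 18) '{4, 5}' '4, 5'
```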

Explanation:

The m part needs to exclude the brackets. An atomic group is needed if you want both to allow a quantifier on [^{}] and to repeat the group without catastrophic backtracking. To be clearer: if the last closing curly bracket is missing, the regex engine will backtrack atomic group by atomic group instead of character by character. To drive the point home, you can make the quantifier possessive, like this: {((?>[^{}]+|(?R))*+)} (or {((?:[^{}]+|(?R))*+)}, since the atomic group is then no longer needed).

The atomic group (?>....) and the possessive quantifiers ?+, *+, ++ are two sides of the same feature. This feature prevents the regex engine from backtracking inside the group of characters, which becomes an "atom" (something you can't divide into smaller parts).

The basic examples are the following two patterns that always fail for the string aaaaaaaaaab:

(?>a+)ab
a++ab

that is:

regex.match("a++ab", "aaaaaaaaaab")
regex.match("(?>a+)ab", "aaaaaaaaaab")

When you use (?:a+) or a+, the regex engine (by default) records, in advance, all backtracking positions for all characters. But when you use an atomic group or a possessive quantifier, these backtracking positions are no longer recorded (except for the beginning of the group). So when backtracking occurs, the last "a" character can't be given back. Only the entire group can be given back.
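
To make the contrast concrete, this small sketch runs the plain greedy form next to the two forms above on the same string. It assumes the third-party regex package; the variable names are only illustrative.

```python
import regex  # third-party "regex" package from PyPI

subject = "aaaaaaaaaab"

# Plain greedy quantifier: a+ first takes every "a", then gives one back
# so that the trailing "ab" can match.
greedy = regex.match(r"a+ab", subject)

# Possessive quantifier and atomic group: once every "a" has been consumed,
# none can be given back, so the literal "a" that follows never matches.
possessive = regex.match(r"a++ab", subject)
atomic = regex.match(r"(?>a+)ab", subject)

print(greedy.group(0))  # aaaaaaaaaab
print(possessive)       # None
print(atomic)           # None
```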

[EDIT]: the pattern can be written in a more efficient way if you use an "unrolled" subpattern to describe the content between brackets:

{([^{}]*+(?:(?R)[^{}]*)*+)}
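
As a quick sanity check (not a benchmark), the sketch below compares the unrolled pattern with the alternation-based one from the start of this answer on a few balanced inputs. It assumes the third-party regex package; the sample strings other than the question's are made up for illustration.

```python
import regex  # third-party "regex" package from PyPI

alternation = regex.compile(r"{((?>[^{}]+|(?R))*)}")
unrolled = regex.compile(r"{([^{}]*+(?:(?R)[^{}]*)*+)}")

# The first sample is from the question; the others are illustrative.
samples = ["{1, {2, 3}} {4, 5}", "{}", "{a{b{c}d}e} x {f}"]

# Both patterns should extract the same outer contents for every sample.
same = all(alternation.findall(s) == unrolled.findall(s) for s in samples)
print(same)

print(unrolled.findall(samples[0]))
# ['1, {2, 3}', '4, 5']
```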

edited May 17 '17 at 20:28

answered Oct 15 '14 at 15:15

Casimir et Hippolyte

3

You're joking! So a + not a *, oh my goodness it's so obvious in retrospect!! Fantastic. – Andy Hayden Oct 15 '14 at 15:17
2

@AndyHayden: `?>` is an [atomic group](http://www.regular-expressions.info/atomic.html), which he explains, and `?:` is a [non-capturing group](http://stackoverflow.com/questions/3512471/non-capturing-group). Not sure if I've seen `??`. – Sam Oct 15 '14 at 15:35
Just in case this is another trivial one, it seems so similar, can you use a similar pattern to regex.split on spaces *outside* of braces? e.g. '{1 {2 3}} foo {4 5}' becomes ['{1 {2 3}}', 'foo', '{4 5}']. I can do it by getting the positions of those spaces (with regex.finditer) but it's messy + inefficient! @Sam – Andy Hayden Oct 15 '14 at 18:34
3

@AndyHayden: No you can't since the regex module doesn't have features like backtracking control verbs (from Perl and PHP) that allow something like this: `$res = preg_split('~({(?>[^{}]+|(?1))*})(*SKIP)(*FAIL)|\s+~', $str);`. All you can do is to use this kind of pattern with findall/iter: `r'({(?>[^{}]+|(?1))*})|[^\s{]+'` or something similar. – Casimir et Hippolyte Oct 15 '14 at 18:41
How would this need to be edited so it works for () and not {}? – Jared Smith Apr 08 '16 at 03:26
@JaredSmith replacing `{` with `\(` (and `}` with `\)`) should do it. – Andy Hayden Jun 13 '16 at 04:39
@CasimiretHippolyte just saw your edit, why is the unrolled subpattern more efficient? Would be great to have some explanation :) – Andy Hayden Jun 13 '16 at 04:40
1

@AndyHayden: Because an alternation has a cost, in particular when it is repeated. Put simply, when you write `(A+|B)*`, for each repetition there's a chance that the first branch tested, `A+`, fails. But if you write `A*(BA*)*`, there's no alternation any more: `A*` succeeds *(even if there aren't any `A`s)* and is always followed by zero or more `BA*` *(which always succeeds too)*. The only backtracking step occurs when there are no more `B`s and the regex engine tries to repeat `BA*` one more time, but it's quickly done. – Casimir et Hippolyte Jun 13 '16 at 14:04
1

@AndyHayden: about my previous comment (15/10/2014 - 18:41), the python regex module now supports these backtracking control verbs (`(*SKIP)` and `(*FAIL)` or `(*F)`). – Casimir et Hippolyte Jun 13 '16 at 14:06
?R pattern doesn't work when using Python's regex class. Works in PHP. – jsa Aug 05 '18 at 12:16
@jsa: we are speaking about the [pypi regex module](https://pypi.org/project/regex/), not the re module. Read the second paragraph of the question. – Casimir et Hippolyte Aug 05 '18 at 12:26
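
For the splitting question raised in the comments above, here is a minimal sketch of the findall/finditer approach suggested there. It assumes the third-party regex package; the names TOKEN and split_outside_braces are hypothetical. It uses finditer and group(0) because findall would return only the capture group, which is empty for tokens such as foo.

```python
import regex  # third-party "regex" package from PyPI

# Either a balanced {...} block (group 1, recursing into itself via (?1)),
# or a run of characters that are neither whitespace nor an opening brace.
TOKEN = regex.compile(r"({(?>[^{}]+|(?1))*})|[^\s{]+")

def split_outside_braces(text):
    """Return the tokens of text, treating each balanced {...} block as one token."""
    return [m.group(0) for m in TOKEN.finditer(text)]

print(split_outside_braces("{1 {2 3}} foo {4 5}"))
# ['{1 {2 3}}', 'foo', '{4 5}']
```

This collects the tokens rather than splitting on the separators, so whitespace between tokens is simply skipped.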

score 10 · Answer 2 · answered Oct 15 '14 at 15:17

I was able to do this no problem with the b(?:m|(?R))*e syntax:

{((?:[^{}]|(?R))*)}

Demo

I think the key from what you were attempting is that the repetition doesn't go on m, but the entire (?:m|(?R)) group. This is what allows the recursion with the (?R) reference.
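
As an illustration of that point, the sketch below builds the b(?:m|(?R))*e shape for an arbitrary pair of delimiters, with the repetition on the whole group. It assumes the third-party regex package; balanced_pattern is a hypothetical helper, not part of the library.

```python
import regex  # third-party "regex" package from PyPI

def balanced_pattern(open_char, close_char):
    """Build b(?:m|(?R))*e for one pair of single-character delimiters."""
    b = regex.escape(open_char)   # what begins the construct
    e = regex.escape(close_char)  # what ends the construct
    m = "[^" + b + e + "]"        # middle: any character that is not a delimiter
    # The * applies to the whole (?:m|(?R)) group, not just to m.
    return regex.compile(b + "((?:" + m + "|(?R))*)" + e)

braces = balanced_pattern("{", "}")
parens = balanced_pattern("(", ")")

print(braces.findall("{1, {2, 3}} {4, 5}"))
# ['1, {2, 3}', '4, 5']
print(parens.findall("(1, (2, 3)) (4, 5)"))
# ['1, (2, 3)', '4, 5']
```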

Sam

2

It fails for Python implementation on regex101 – hjpotter92 Oct 13 '15 at 07:42
7

@hjpotter92 This is only available in the `regex` package not in std lib's `re` module. – Andy Hayden Jan 26 '17 at 23:47
is there a solution for this in the standard re module? – roocell Jan 12 '21 at 00:49
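
The thread does not answer this last question. As the question notes, re does not support recursive patterns, so one possible fallback (an assumption on my part, not something proposed by the answerers) is to drop the regex entirely and track nesting depth by hand. The function name outer_brace_contents is hypothetical.

```python
def outer_brace_contents(text, open_char="{", close_char="}"):
    """Return the contents of each outermost open_char...close_char pair."""
    results = []
    depth = 0     # current nesting level
    start = None  # index just after the outermost opening delimiter
    for i, ch in enumerate(text):
        if ch == open_char:
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == close_char and depth > 0:
            depth -= 1
            if depth == 0:
                results.append(text[start:i])
    return results

print(outer_brace_contents("{1, {2, 3}} {4, 5}"))
# ['1, {2, 3}', '4, 5']
```

Note that on unbalanced input this sketch does not behave like the recursive pattern: an outer brace that is never closed yields no result at all, even if it contains balanced inner pairs.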
