Regex to find texts between nested parenthesis

Question

I have a very long string that contains nested braces, and I want to extract one pattern from it.

String_Text:

some random texts......
........................
........................
{{info .................
.....texts..............
...{{ some text }}...... // nested parenthesis 1
........................
...{{ some text }}...... // nested parenthesis 2
........................
}} // End of topmost parenthesis
........................
..again some random text
........................
........................ // can also contain {{  }}
......End of string.

I want to extract all the text between the topmost parenthesis i.e.

Extracted_string:

info .................
.....texts..............
...{{ some text }}...... // nested parenthesis 1
........................
...{{ some text }}...... // nested parenthesis 2
........................

Pattern:

1.) starts with { and can be followed by any number of {.

2.) After that there can be any number of white space.

3.) The first word after that is surely info.

4.) Extract everything until this bracket is closed.

What I have tried so far:

re.findall(r'\{+[^\S\r\n]*info\s*(.*(?:\r?\n.*)*)\}+', String_Text)

I know this is wrong, because it matches up to the last } in the whole string. Can someone help me extract the text between those brackets? TIA
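The over-matching described above can be reproduced with a short sketch. The sample string `s` is a made-up, shortened stand-in for the real text, not the actual data:

```python
import re

# Hypothetical sample, shaped like the string in the question.
s = (
    "some random text\n"
    "{{info first line\n"
    "...{{ some text }}...\n"
    "}}\n"
    "again some random text {{ later }} tail\n"
)

# The attempted pattern: the greedy (.*(?:\r?\n.*)*) part swallows every
# remaining line and only backtracks as far as the last '}' in the string.
attempt = r'\{+[^\S\r\n]*info\s*(.*(?:\r?\n.*)*)\}+'
captured = re.findall(attempt, s)[0]
print(captured)
```

The captured text runs past the closing }} of the info block and into the later {{ later }} pair, which is the behaviour that needs fixing.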

Can you leverage the context here and match up to the first `}}` that are on a separate line? Like `re.findall(r'(?sm)^{{[^\S\r\n]*info\s*(.*?)^}}$', s)`? — Wiktor Stribiżew, Aug 22 '18 at 11:21
Wow! This works. Thanks a lot! can you explain me though how it worked? — Gopal Chitalia, Aug 22 '18 at 11:41

score 3 · Answer 1 · answered Aug 22 '18 at 11:24

You need to use a recursive approach:

{
    ((?:[^{}]|(?R))*)
}

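Recursion with (?R) is not supported by the built-in re module, only by the newer third-party regex module; see a demo on regex101.com. The pattern is laid out over several lines for readability, so it needs the verbose flag.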

answered Aug 22 '18 at 11:24

Jan

This might overmatch since OP only expects to extract the contents of `{{info...}}` substrings. And you cannot just add `info` after the first `{` in `{((?:[^{}]|(?R))*)}`. – Wiktor Stribiżew Aug 22 '18 at 11:45
Hey! thanks a lot. It does work but it matches all the {{ }} strings, I didn't want that. I even upvoted your answer but I found the answer by Wiktor to be the most suitable, hence I accepted it. – Gopal Chitalia Aug 22 '18 at 11:45

score 1 · Accepted Answer · answered Aug 22 '18 at 11:44

A workaround is a pattern that matches a line starting with {{info and then matches any 0+ chars, as few as possible, up to a line that contains only }}:

re.findall(r'(?sm)^{{[^\S\r\n]*info\s*(.*?)^}}$', s)

See the regex demo.

Details

(?sm) - re.DOTALL (now, . matches a newline) and re.MULTILINE (^ now matches line start and $ matches line end positions) flags
^ - start of a line
{{ - a {{ substring
[^\S\r\n]* - 0+ horizontal whitespaces
info - a substring
\s* - 0+ whitespaces
(.*?) - Group 1: any 0+ chars, as few as possible
^}}$ - start of a line, }} and end of the line.
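As a minimal sketch of the pattern in use (the sample string `s` is hypothetical and much shorter than the real text), assuming the closing }} of the info block sits alone at the start of its own line:

```python
import re

# Hypothetical sample text, modelled on the layout in the question.
s = """some random text
{{info first line
...{{ some text }}... nested 1
...{{ some text }}... nested 2
}}
again some random text {{ other }}
End of string.
"""

# Lazy match from a line starting with '{{info' to the first line
# that consists of '}}' only.
pattern = re.compile(r'(?sm)^{{[^\S\r\n]*info\s*(.*?)^}}$')
matches = pattern.findall(s)
print(matches[0])
```

The nested {{ some text }} pairs do not end the match because they never start a line. If the real data can contain a }} alone on a line inside the info block, this workaround stops too early and a brace-counting approach is safer.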

score 0 · Answer 3 · answered Aug 22 '18 at 11:35

This answer explains how to do this with recursion (for round brackets, but it is easily adaptable). Personally, however, I would just write it with a while loop:

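b = 1                      # current brace depth
i = si = s.index('{')      # si: index of the first opening brace
i += 1
while b:                   # scan until that first brace is closed again
    if s[i] == '{': b += 1
    elif s[i] == '}': b -= 1
    i += 1

ss = s[si:i]               # the balanced block, braces included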

With your string defined as s, this gives the substring ss:

>>> print(ss)
{{info .................
.....texts..............
...{{ some text }}...... // nested parenthesis 1
........................
...{{ some text }}...... // nested parenthesis 2
........................
}}
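The loop above starts at the first { in the string, which is not necessarily the info block. A possible variation, sketched here under the question's stated rules (one or more {, optional horizontal whitespace, then the word info), locates each such opener with a regex first and then counts braces. The function name `extract_info_blocks` and the sample string are made up for illustration:

```python
import re

def extract_info_blocks(text):
    """Return every balanced block that opens with '{...{ info' (hypothetical helper)."""
    blocks = []
    # One or more '{', optional horizontal whitespace, then the word 'info'.
    opener = re.compile(r'(\{+)[^\S\r\n]*info\b')
    pos = 0
    while True:
        m = opener.search(text, pos)
        if m is None:
            break
        depth = len(m.group(1))   # braces that still have to be closed
        i = m.end(1)              # first character after the opening braces
        while i < len(text) and depth:
            if text[i] == '{':
                depth += 1
            elif text[i] == '}':
                depth -= 1
            i += 1
        if depth:                 # unbalanced input: stop rather than guess
            break
        blocks.append(text[m.start():i])
        pos = i                   # continue searching after this block
    return blocks

sample = "intro {{ skip }} {{info a {{ x }} b }} tail {{ y }}"
blocks = extract_info_blocks(sample)
print(blocks)
```

Like ss above, each returned block keeps its outer braces. Unlike the plain loop, unbalanced input ends the search instead of raising an IndexError.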
