I have a very long string that contains nested braces, and I want to extract a pattern from it.

String_Text:

some random texts......
........................
........................
{{info .................
.....texts..............
...{{ some text }}...... // nested parenthesis 1
........................
...{{ some text }}...... // nested parenthesis 2
........................
}} // End of topmost parenthesis
........................
..again some random text
........................
........................ // can also contain {{  }}
......End of string.

I want to extract all the text between the topmost pair of braces, i.e.:

Extracted_string:

info .................
.....texts..............
...{{ some text }}...... // nested parenthesis 1
........................
...{{ some text }}...... // nested parenthesis 2
........................

Pattern:

1.) starts with { and can be followed by any number of {.

2.) After that there can be any number of white space.

3.) The first word after that is always info.

4.) Extract everything until this bracket is closed.

What I have tried so far:

re.findall(r'\{+[^\S\r\n]*info\s*(.*(?:\r?\n.*)*)\}+', String_Text)

I know this is wrong: the greedy group runs on to the last } in the whole string instead of stopping at the matching closing bracket. Can someone help me extract the text between those brackets? TIA
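
To make the overmatch concrete, here is a minimal sketch that runs the attempted pattern against a shortened sample. The `sample` string is hypothetical and made up for illustration only.

```python
import re

# Hypothetical single-line sample: one info block followed by an unrelated {{ }} pair.
sample = "a {{info x {{ y }} z }} tail {{ other }} end"

# The attempted pattern, with the subject string passed as the second argument.
attempt = re.findall(r'\{+[^\S\r\n]*info\s*(.*(?:\r?\n.*)*)\}+', sample)

# The greedy group first consumes everything, then backtracks only as far as
# the last '}' in the string, so the capture spills past the info block.
print(attempt)
```

The captured text keeps going through `tail {{ other }`, which is why a greedy `.*` cannot be used to find the matching closing brace.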

Gopal Chitalia

3 Answers


You need to use a recursive approach:

{
    ((?:[^{}]|(?R))*)
}

Recursion with (?R) is only supported by the newer third-party regex module, not by the built-in re module; see a demo on regex101.com.

Jan
  • This might overmatch since OP only expects to extract the contents of `{{info...}}` substrings. And you cannot just add `info` after the first `{` in `{((?:[^{}]|(?R))*)}`. – Wiktor Stribiżew Aug 22 '18 at 11:45
  • Hey! thanks a lot. It does work but it matches all the {{ }} strings, I didn't want that. I even upvoted your answer but I found the answer by Wiktor to be the most suitable, hence I accepted it. – Gopal Chitalia Aug 22 '18 at 11:45

A workaround is a pattern that matches a line starting with {{info and then matches any 0+ chars, as few as possible, up to the first line that contains only }}:

re.findall(r'(?sm)^{{[^\S\r\n]*info\s*(.*?)^}}$', s)

See the regex demo.

Details

  • (?sm) - re.DOTALL (now, . matches a newline) and re.MULTILINE (^ now matches line start and $ matches line end positions) flags
  • ^ - start of a line
  • {{ - a {{ substring
  • [^\S\r\n]* - 0+ horizontal whitespaces
  • info - a substring
  • \s* - 0+ whitespaces
  • (.*?) - Group 1: any 0+ chars, as few as possible
  • ^}}$ - start of a line, }} and end of the line.
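
As a minimal sketch of how this pattern behaves, the snippet below wraps it in a helper and runs it on a small sample. The names `INFO_BLOCK`, `extract_info_blocks`, and `sample` are hypothetical, and the sample assumes the closing }} sits alone on its own line, as the pattern requires.

```python
import re

# Same pattern as above, with the braces escaped for readability.
INFO_BLOCK = re.compile(r'(?sm)^\{\{[^\S\r\n]*info\s*(.*?)^\}\}$')


def extract_info_blocks(text):
    """Return the body of every {{info ... }} block closed by a line holding only }}."""
    return INFO_BLOCK.findall(text)


# Hypothetical sample; the closing }} of the info block is alone on its line.
sample = (
    "some random text\n"
    "{{info first line\n"
    "more text\n"
    "...{{ some text }}...\n"
    "...{{ other text }}...\n"
    "}}\n"
    "trailing text {{ unrelated }}\n"
)

blocks = extract_info_blocks(sample)
print(blocks[0])
```

Note that the word info and the whitespace after it are consumed outside Group 1, so the captured text starts at the first non-whitespace character that follows. The pattern relies on layout rather than on counting braces, so a nested }} that happens to sit alone on a line would end the match early.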
Wiktor Stribiżew

This answer explains how to do this with recursion (for round brackets, but easily adapted). Personally, I would just write a while loop that tracks the brace depth:

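b = 1                    # current brace depth
i = si = s.index('{')    # si: index of the first opening brace
i += 1
while b:                 # scan until that first brace is balanced again
    if s[i] == '{':
        b += 1
    elif s[i] == '}':
        b -= 1
    i += 1

ss = s[si:i]             # substring from the first '{' to its matching '}'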

With your string assigned to s, this gives the substring ss:

>>> print(ss)
{{info .................
.....texts..............
...{{ some text }}...... // nested parenthesis 1
........................
...{{ some text }}...... // nested parenthesis 2
........................
}}
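
The loop above starts at the first { in the string and raises an IndexError if the braces never balance. A minimal sketch of a variant that first locates the {{info opener and returns None on unbalanced input could look like this; the names `INFO_OPENER` and `extract_balanced` are hypothetical, and the opener regex is an assumption based on the pattern described in the question.

```python
import re

# Assumed opener: one or more '{', optional horizontal whitespace, then the word info.
INFO_OPENER = re.compile(r'\{+[^\S\r\n]*info\b')


def extract_balanced(text):
    """Return the first {..info...} block with balanced braces, or None if absent."""
    m = INFO_OPENER.search(text)
    if m is None:
        return None
    depth = 0
    for i in range(m.start(), len(text)):
        ch = text[i]
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                # Braces opened at m.start() are all closed here.
                return text[m.start():i + 1]
    return None  # ran off the end: braces never balanced


sample = "pre {{ skip }} {{info a {{ b }} c }} post {{ d }}"
print(extract_balanced(sample))
```

Like the loop above, this returns the block including its outer braces; slicing them off is left to the caller.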
Joe Iddon