
I would like to use a regex in Python that matches something like r'{.*?}', or in plain English, curly braces along with everything inside them. (I am only concerned with the outermost braces and would like to ignore inner ones.)

The problem with this expression is that with nested curly braces you don't get everything, because the match stops at the first closing brace even if it is not the matching one.

Note that r'{.*}' is not a solution, because there is more than one group of matching outer curly braces in the text being parsed.
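To make the two failure modes concrete, here is a small sketch using the built-in `re` module. The sample string `text` is made up for illustration and is not from my real input.

```python
import re

# Hypothetical sample: one nested group followed by a separate group.
text = "a {x {y} z} b {w} c"

# Lazy quantifier: stops at the first closing brace, so the nested group is cut short.
lazy = re.findall(r"\{.*?\}", text)

# Greedy quantifier: runs to the last closing brace, merging the two separate groups.
greedy = re.findall(r"\{.*\}", text)

print(lazy)    # ['{x {y}', '{w}']
print(greedy)  # ['{x {y} z} b {w}']
```

Neither result is the desired one, which would be `{x {y} z}` and `{w}` as two separate matches.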

For example if the text was:

struct my_struct{
    double d;
    struct {int i; char c;} s;
};

I would want the expression to match:

{
    double d;
    struct {int i; char c;} s;
}

Any pointers on how to account for nested curly braces would be appreciated. Note that I am looking for a pattern in which whitespace does not carry any meaning. Answers that say don't use regex, don't use Python, or go about it in a different way are not useful, as this is part of a larger regex.

HashBr0wn
  • I don't think you can do this with regexes only, unfortunately. You may need a real parser. – AKX Aug 05 '20 at 13:31
  • "Solutions that say don't use regex...are not useful..." -- well, what if regex simply isn't capable of what you're asking? – Fred Larson Aug 05 '20 at 13:35
  • I am doubtful that regex is simply not capable. If it really is impossible a proof or a source to such a proof would be nice – HashBr0wn Aug 05 '20 at 13:37
  • @HashBr0wn I suggest reading this answer: https://stackoverflow.com/a/46334384/10785975 – Daweo Aug 05 '20 at 13:46
  • Python's regex engine has a lot more to it than a standard finite state machine, just as is the case with many high level languages. However other answers in that post may be helpful – HashBr0wn Aug 05 '20 at 13:50

1 Answer


Well, you can use the newer regex module (a third-party package, not the built-in re), which supports recursion via (?R), to match balanced braces:

\{(?:[^{}]+|(?R))+\}

In Python this could be

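import regex as re  # third-party "regex" module, not the built-in re

# (?R) recurses into the whole pattern, so an inner brace group is consumed as a unit
rx = re.compile(r'\{(?:[^{}]+|(?R))+\}')

for match in rx.finditer(your_data_as_string):
    print(match.group(0))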

See a demo on regex101.com.

As noted by others, you might want to consider other approaches though (namely some sort of parser). Trying to analyze source code with regular expressions tends to get messy quickly.
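If installing the third-party module is not an option, one possible workaround is a pattern that is nested to a fixed depth, which the built-in `re` module can handle. This is only a sketch and a swapped-in technique, not the recursive approach above: the helper name `nested_braces_pattern` and its `max_depth` parameter are hypothetical, and the pattern only works when the input never nests deeper than the depth you pick.

```python
import re

def nested_braces_pattern(max_depth: int) -> str:
    """Build a pattern for brace groups nested at most max_depth levels (hypothetical helper)."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    # Innermost level: a brace pair with no braces inside.
    pattern = r"\{[^{}]*\}"
    for _ in range(max_depth - 1):
        # Each outer level allows non-brace characters or one complete inner group.
        pattern = r"\{(?:[^{}]|" + pattern + r")*\}"
    return pattern

# Sample input taken from the question.
text = """struct my_struct{
    double d;
    struct {int i; char c;} s;
};"""

rx = re.compile(nested_braces_pattern(3))
matches = [m.group(0) for m in rx.finditer(text)]

# With too small a depth, the inner group is found instead of the outermost one.
too_shallow = re.findall(nested_braces_pattern(1), "{a{b}c}")

print(matches[0])
print(too_shallow)  # ['{b}']
```

The trade-off is that the depth limit is baked into the pattern: input nested deeper than `max_depth` yields inner groups rather than the outermost one, whereas the recursive `(?R)` version has no such limit.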

Jan
  • Thank you. This regex is simply to isolate units of code that I am going to parse with a real parser. It's just one line that I needed for that purpose – HashBr0wn Aug 05 '20 at 14:02
  • @HashBr0wn: You're welcome. Note, though, that this is only possible with the newer `regex` module and, as said, only with balanced constructs. – Jan Aug 05 '20 at 14:03