Use class content inside REGEX

Question

I want to parse a nested structure like this one in MATLAB:

structure NAME_PART_1
    Some content

    block NAME_PART_2
        Some other content
    end NAME_PART_2

    block NAME_PART_3
        subblock NAME_PART_4
            Some content++
        end NAME_PART_4
    end NAME_PART_3

end NAME_PART_1

structure           
    NAME_PART_5

end        NAME_PART_5

First, I would like to extract the content of each structure. That part should be easy, because a structure's content always lies between "structure NAME" and "end NAME".

So I would like to use a regex, but I don't know in advance what the structure name will be.

I wrote my regex like this:

\bstructure\s+([\w.-]*)((?:\s|.)*)\bend\b\s+XXXX

But I don't know what to put in place of "XXXX" so that it refers to the content of the first capture group of this regex. Is that even possible?

This content looks nested to me, in which case pure regex might not be the best solution. You may want to write a parser here. — Tim Biegeleisen, Feb 17 '19 at 03:21
You can reference the first matched group with `\1`. See [this documentation](https://www.mathworks.com/help/matlab/matlab_prog/regular-expressions.html#btrvwd4) for more info. — Adam, Feb 17 '19 at 03:21
You can do something like this, mate: https://regex101.com/r/wR8VQD/1/ — Code Maniac, Feb 17 '19 at 03:27
@graille Does that solve your problem completely? If so, I'll make an answer for it. — Adam, Feb 17 '19 at 04:25

score 1 · Answer 1 · answered Feb 17 '19 at 06:05

Try this Regex:

structure\s+([\w.-]+)\s*((?:(?!end\s+\1)[\s\S])*)end\s+\1

Explanation:

structure - matches structure
\s+ - matches 1+ occurrences of a white-space
([\w.-]+) - matches 1+ occurrences of either a word character or a . or a -. This sub-match which contains the structure name is captured in Group 1.
\s* - matches 0+ occurrences of a white-space
((?:(?!end\s+\1)[\s\S])*) - Tempered Greedy Token - matches 0+ occurrences of any character [\s\S], each one only if the sequence end, 1+ white-spaces and the Group 1 contents \1 (i.e. the structure name) does not start at that position. This sub-match, which contains the contents of the structure, is captured in Group 2.
end\s+\1 - matches the word end followed by 1+ white-spaces followed by Structure Name contained in Group 1 \1.
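
A minimal sketch of this pattern in use. It is written with Python's re module rather than MATLAB, purely so it can be run as-is; the sample text is a shortened version of the one in the question, and the helper name extract_structures is made up for illustration.

```python
import re

# Shortened sample modeled on the question's input (an assumption for this sketch).
TEXT = (
    "structure NAME_PART_1\n"
    "    Some content\n"
    "\n"
    "    block NAME_PART_2\n"
    "        Some other content\n"
    "    end NAME_PART_2\n"
    "\n"
    "end NAME_PART_1\n"
    "\n"
    "structure\n"
    "    NAME_PART_5\n"
    "\n"
    "end        NAME_PART_5\n"
)

# Group 1 is the structure name; group 2 is everything up to "end <name>".
# The lookahead (?!end\s+\1) is checked before each consumed character,
# so the body stops at the first "end" followed by the captured name.
PATTERN = re.compile(r"structure\s+([\w.-]+)\s*((?:(?!end\s+\1)[\s\S])*)end\s+\1")


def extract_structures(text):
    """Return a list of (name, body) pairs, one per structure found."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]


for name, body in extract_structures(TEXT):
    print(name, repr(body))
```

Because \s* sits in front of Group 2, the captured body starts at the first non-whitespace character after the name, and it is empty for a structure such as NAME_PART_5 that has no content.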

The fourth bird · Answer 2 · 2019-02-17T15:03:01.680

Apart from using the backreference \1 to refer to what was captured in group 1, you might replace the alternation in the capturing group ((?:\s|.)*) with a pattern that matches a newline followed by 0+ characters, repeated and captured as a whole: ((?:\n.*)+)

You might also omit the word boundary after end (end\b\s+), since 1+ whitespace characters must follow end anyway, and instead add a word boundary after \1 so that the name is not matched as part of a longer word.

\bstructure\s+([\w.-]+)((?:\n.*)+)\bend\s+\1\b

Explanation

\bstructure\s+ Match structure followed by 1+ whitespace chars
([\w.-]+) Capture in group 1 any of the listed chars, repeated 1+ times
( Capturing group
- (?:\n.*)+ Match newline followed by 0+ times any char except a newline
) Close capturing group
\bend Match end
\s+\1\b Match 1+ times a whitespace char followed by a backreference to group 1 and end with a word boundary.
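
A minimal sketch of this second pattern, again in Python's re module rather than MATLAB so that it runs as written. The sample text is shortened from the question and the helper name extract_structures is hypothetical.

```python
import re

# Shortened sample modeled on the question's input (an assumption for this sketch).
TEXT = (
    "structure NAME_PART_1\n"
    "    Some content\n"
    "\n"
    "    block NAME_PART_2\n"
    "        Some other content\n"
    "    end NAME_PART_2\n"
    "\n"
    "end NAME_PART_1\n"
    "\n"
    "structure\n"
    "    NAME_PART_5\n"
    "\n"
    "end        NAME_PART_5\n"
)

# Group 1 is the structure name; group 2 is the run of lines after it.
# "." does not match a newline here, so (?:\n.*)+ consumes whole lines
# and then backtracks until "end <name>" followed by a word boundary fits.
PATTERN = re.compile(r"\bstructure\s+([\w.-]+)((?:\n.*)+)\bend\s+\1\b")


def extract_structures(text):
    """Return a list of (name, body) pairs, one per structure found."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]


for name, body in extract_structures(TEXT):
    print(name, repr(body))

# The trailing \b rejects a closing name that merely starts with group 1.
print(PATTERN.search("structure A\n  x\nend AB"))
```

Two things to keep in mind. The line repetition is greedy, so if two structures share the same name, the match runs from the first "structure NAME" to the last "end NAME". And when porting back to MATLAB, the word-boundary syntax may need adapting: as far as I know, MATLAB's regexp uses \< and \> for word boundaries and treats \b as a backspace character.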
