
I want to parse a nested structure like this one in MATLAB:

structure NAME_PART_1
    Some content

    block NAME_PART_2
        Some other content
    end NAME_PART_2

    block NAME_PART_3
        subblock NAME_PART_4
            Some content++
        end NAME_PART_4
    end NAME_PART_3

end NAME_PART_1

structure           
    NAME_PART_5

end        NAME_PART_5

First, I would like to extract the content of each structure. That part is straightforward, because a structure's content always lies between "structure NAME" and "end NAME".

I would like to use a regex for this, but I don't know in advance what the structure name will be.

So far I have written this:

\bstructure\s+([\w.-]*)((?:\s|.)*)\bend\b\s+XXXX

But I don't know what to put in place of "XXXX" so that it refers to whatever the first capture group of this regex matched. Is that even possible?

41686d6564
graille

2 Answers


Try this Regex:

structure\s+([\w.-]+)\s*((?:(?!end\s+\1)[\s\S])*)end\s+\1


Explanation:

  • structure - matches structure
  • \s+ - matches 1+ occurrences of a white-space
  • ([\w.-]+) - matches 1+ occurrences of either a word character or a . or a -. This sub-match which contains the structure name is captured in Group 1.
  • \s* - matches 0+ occurrences of a white-space
  • ((?:(?!end\s+\1)[\s\S])*) - Tempered greedy token: matches 0+ occurrences of any character [\s\S], as long as the sequence end followed by the Group 1 contents \1 (i.e. the structure name) does not start at that position. This sub-match is captured in Group 2, which contains the contents of the structure.
  • end\s+\1 - matches the word end followed by 1+ white-spaces followed by Structure Name contained in Group 1 \1.
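As a rough check of how this pattern behaves on the sample from the question, here is a minimal sketch. It uses Python's re module rather than MATLAB purely so that it can be run as-is; whether MATLAB's regexp accepts the same pattern unchanged is an assumption to verify. The names SAMPLE, PATTERN and extract_structures are made up for illustration.

```python
import re

# Sample input from the question, built line by line so that the
# irregular whitespace after "structure" and "end" is preserved.
SAMPLE = "\n".join([
    "structure NAME_PART_1",
    "    Some content",
    "",
    "    block NAME_PART_2",
    "        Some other content",
    "    end NAME_PART_2",
    "",
    "    block NAME_PART_3",
    "        subblock NAME_PART_4",
    "            Some content++",
    "        end NAME_PART_4",
    "    end NAME_PART_3",
    "",
    "end NAME_PART_1",
    "",
    "structure           ",
    "    NAME_PART_5",
    "",
    "end        NAME_PART_5",
])

# The pattern from this answer: \1 refers back to the captured name,
# both inside the negative lookahead and in the closing "end NAME".
PATTERN = re.compile(
    r"structure\s+([\w.-]+)\s*((?:(?!end\s+\1)[\s\S])*)end\s+\1"
)


def extract_structures(text):
    """Return a list of (name, body) pairs, one per top-level structure."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]


structures = extract_structures(SAMPLE)
for name, body in structures:
    print(name, "->", len(body), "characters of content")
```

On this input the sketch finds NAME_PART_1 and NAME_PART_5. Because \s* consumes the whitespace after the name, the first body starts directly at "Some content", and the body of NAME_PART_5 is empty.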
Gurmanjot Singh

Apart from using a backreference \1 to refer to what was captured, you might replace the alternation in the capturing group ((?:\s|.)*) with a pattern that matches a newline followed by 0+ characters, repeated and captured as a whole: ((?:\n.*)+)

You might also omit the word boundary after end in end\b\s+, since 1+ whitespace characters must follow end anyway, and instead add a word boundary at the very end so that \1 cannot match as part of a longer name.

\bstructure\s+([\w.-]+)((?:\n.*)+)\bend\s+\1\b


Explanation

  • \bstructure\s+ Match structure followed by 1+ whitespace chars
  • ([\w.-]+) Capture in group 1 the structure name: 1+ repetitions of any of the listed chars
  • ( Capturing group
    • (?:\n.*)+ Match newline followed by 0+ times any char except a newline
  • ) Close capturing group
  • \bend Match end
  • \s+\1\b Match 1+ times a whitespace char followed by a backreference to group 1 and end with a word boundary.
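The following sketch tries this pattern on a shortened version of the question's sample and also illustrates what the trailing \b buys you. It is written in Python only so that it can be run directly; that MATLAB's regexp treats the pattern identically is an assumption. SAMPLE, TRICKY, WITH_BOUNDARY and WITHOUT_BOUNDARY are illustrative names, not part of the original answer.

```python
import re

# Abbreviated version of the sample from the question.
SAMPLE = "\n".join([
    "structure NAME_PART_1",
    "    Some content",
    "    block NAME_PART_2",
    "        Some other content",
    "    end NAME_PART_2",
    "end NAME_PART_1",
    "",
    "structure           ",
    "    NAME_PART_5",
    "",
    "end        NAME_PART_5",
])

# Pattern from this answer, with the trailing word boundary.
WITH_BOUNDARY = re.compile(r"\bstructure\s+([\w.-]+)((?:\n.*)+)\bend\s+\1\b")
# Same pattern without the trailing \b, for comparison only.
WITHOUT_BOUNDARY = re.compile(r"\bstructure\s+([\w.-]+)((?:\n.*)+)\bend\s+\1")

# Each match yields the structure name (group 1) and its raw body (group 2).
found = [(m.group(1), m.group(2)) for m in WITH_BOUNDARY.finditer(SAMPLE)]
for name, body in found:
    print(name, repr(body.strip()))

# Hypothetical input where the only "end" line names a longer identifier.
TRICKY = "structure A\n  x\nend AB\n"
print(WITH_BOUNDARY.search(TRICKY))     # no match: "AB" is not "A"
print(WITHOUT_BOUNDARY.search(TRICKY))  # matches up to "end A" inside "end AB"
```

Unlike the tempered-token variant, group 2 here keeps the leading newline and indentation, so the sketch strips the body before printing it.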
The fourth bird