Get Outer "Pair" when Nested

Question

I'm using the regex <@(.+?)@> to match patterns such as:

<@set:template default.spt @>

It works fine, but I've run into situations where I needed to nest the pattern, such as this:

<@set:template <@get:oldtemplate @> @>

Instead of getting the parent pair (<@ and @>) I get the following:

<@set:template <@get:oldtemplate @>
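
For reference, the behaviour can be reproduced with Python's built-in `re` module. This is a minimal sketch and the variable names are only illustrative. The lazy `.+?` stops at the first `@>` it reaches, and switching to a greedy `.+` is not a fix either, because it runs on to the last `@>` in the text:

```python
import re

nested = "<@set:template <@get:oldtemplate @> @>"

# Lazy quantifier: stops at the first "@>", which belongs to the inner tag.
lazy = re.search(r"<@(.+?)@>", nested)
print(lazy.group(0))

# Greedy quantifier: runs to the last "@>", so two sibling tags are
# swallowed as if they were one.
siblings = "<@a @> and <@b @>"
greedy = re.search(r"<@(.+)@>", siblings)
print(greedy.group(0))
```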

I don't want the child one; I just want the outermost parent in all nested situations. How do I fix my regex so that it does this? I figure I could do it if I knew how to require that every <@ inside the parent has a matching @> inside the parent as well, but I have no idea how to enforce that.

You need `regex` package to do this. The default `re` package cannot handle arbitrary nesting level. — nhahtdh, May 16 '13 at 20:05
The answer to [this question](http://stackoverflow.com/questions/1656859/how-can-a-recursive-regexp-be-implemented-in-python) should solve your problem as well. — rpkamp, May 16 '13 at 20:10

Tobia · Accepted Answer · 2013-05-16T20:23:29.210

What you describe is a "non-regular language". It cannot be parsed with a regexp.

OK, if you are willing to put a limit on the nesting level, technically you can do it with a regexp. But it will be ugly.

Here is how to parse your tags for a few (increasing) maximum nesting depths, provided you can forbid @ characters inside your tags:

no nesting: <@[^@]+@>
up to 1:    <@[^@]+(<@[^@]+@>)?[^@]*@>
up to 2:    <@[^@]+(<@[^@]+(<@[^@]+@>)?[^@]*@>)?[^@]*@>
up to 3:    <@[^@]+(<@[^@]+(<@[^@]+(<@[^@]+@>)?[^@]*@>)?[^@]*@>)?[^@]*@>
...
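
A quick way to try these against the examples from the question, sketched with Python's standard `re` module. The names `DEPTH0`, `DEPTH1`, `DEPTH2`, `first_match` and the three-level sample string are made up for illustration:

```python
import re

DEPTH0 = r"<@[^@]+@>"
DEPTH1 = r"<@[^@]+(<@[^@]+@>)?[^@]*@>"
DEPTH2 = r"<@[^@]+(<@[^@]+(<@[^@]+@>)?[^@]*@>)?[^@]*@>"

flat = "<@set:template default.spt @>"
nested = "<@set:template <@get:oldtemplate @> @>"
deeper = "<@a <@b <@c @> @> @>"  # hypothetical three-level input


def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# A pattern only captures the outermost tag when its depth limit is
# at least the nesting depth of the input.
for name, pattern in [("depth 0", DEPTH0), ("depth 1", DEPTH1), ("depth 2", DEPTH2)]:
    for text in (flat, nested, deeper):
        print(name, "|", text, "->", first_match(pattern, text))
```

When the input nests deeper than the pattern allows, the search does not fail; it quietly settles on an inner tag instead, which is easy to miss.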

If you cannot forbid lone @'s in your tags, you will have to replace every instance of [^@] with something like this: (?:[^<@]|<[^@]|@[^>]).
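
A small sketch of the difference, in Python. The e-mail style sample tag is an invented input, not something from the question:

```python
import re

STRICT = r"<@[^@]+@>"                         # no @ allowed inside the tag
LONE_AT = r"<@(?:[^<@]|<[^@]|@[^>])+@>"       # lone @ and lone < allowed

tag = "<@mail:user@example.com @>"            # hypothetical tag with a lone @

print(re.search(STRICT, tag))                 # None: the inner @ blocks the match
print(re.search(LONE_AT, tag).group(0))       # the whole tag

# Caveat: "@[^>]" consumes the character after the @ as well, so an @
# sitting directly before the closing "@>" prevents the match.
print(re.search(LONE_AT, "<@a@@>"))           # None
```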

Just think about that, and then think about extending your regex to handle nesting up to depth 10.

Here, I will do it for you:

<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[
^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<
[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@
[^>])+(<@(?:[^<@]|<[^@]|@[^>])+@>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>]
)*@>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@
>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@>)?
(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@>
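
Nobody should type that by hand. Here is a sketch in Python that generates the same family of patterns by wrapping the previous level in one more optional layer per depth step. `bounded_pattern`, its parameters and the `deep` test string are hypothetical names of my own:

```python
import re

LONE_AT = r"(?:[^<@]|<[^@]|@[^>])"  # text class that tolerates lone @ and <


def bounded_pattern(depth, text=r"[^@]"):
    """Build the regex for tags nested at most `depth` levels deep."""
    pattern = f"<@{text}+@>"                       # innermost: no nesting
    for _ in range(depth):
        # Each step allows one optional, shallower tag inside a new outer tag.
        pattern = f"<@{text}+({pattern})?{text}*@>"
    return pattern


print(bounded_pattern(1))                          # the "up to 1" pattern
depth10 = bounded_pattern(10, LONE_AT)
print(len(depth10))                                # length grows linearly with depth

# Eleven tags inside one another is nesting depth 10.
deep = "<@x " * 11 + " @>" * 11
print(bool(re.fullmatch(depth10, deep)))
```

The generated pattern still has a hard ceiling: one more level of nesting in the input and it no longer matches the outermost pair.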

What I hope my answer shows is that regexps are not the right tool for parsing a language. A traditional lexer (tokenizer) and parser combination will do a much better job, be significantly faster, and handle arbitrarily deep nesting.
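
To make that concrete, here is a minimal sketch of the tokenizer-plus-depth-counter idea in Python. It is not a full parser: `outermost_tags` is a hypothetical helper, and silently ignoring a stray `@>` or an unclosed `<@` is an assumption of mine, not something the question specifies.

```python
import re

# The "lexer": the only tokens that matter are the two delimiters.
TOKEN = re.compile(r"<@|@>")


def outermost_tags(text):
    """Return every outermost <@ ... @> span in `text`, at any nesting depth."""
    spans = []
    depth = 0
    start = None
    for m in TOKEN.finditer(text):
        if m.group() == "<@":
            if depth == 0:
                start = m.start()          # remember where the outer tag opens
            depth += 1
        elif depth > 0:                    # "@>" closing an open tag
            depth -= 1
            if depth == 0:
                spans.append(text[start:m.end()])
        # A "@>" seen at depth 0 is unbalanced and is skipped (assumption).
    return spans


sample = "x <@set:template <@get:oldtemplate @> @> y <@a @>"
print(outermost_tags(sample))
```

The counter is the piece a plain regexp lacks: it tracks how many `<@` are still open, so the depth is limited only by the input.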

It is possible to allow `@` and `>` while not consuming the end tag with `(?:(?!@>).)*`. Gotta love the end result. — nhahtdh, May 16 '13 at 20:56

score 1 · Answer 2 · edited May 23 '17 at 12:11

I don't think you can do this with a regular expression; see the answer to this question, which asks a similar thing. Regexes aren't powerful enough to deal with arbitrary levels of nesting. If you will only ever have 2 levels of nesting, it should be possible, but regexes may still not be the best tool for the job.

answered May 16 '13 at 20:01

codebox
