
I'm using the regex <@(.+?)@> to match patterns such as:

<@set:template default.spt @>

It works fine, but I've run into situations where I needed to nest the pattern, such as this:

<@set:template <@get:oldtemplate @> @>

Instead of getting the parent pair (<@ and @>) I get the following:

<@set:template <@get:oldtemplate @>

I don't want the child tag; I just want the outermost parent in all nested situations. How do I fix my regex so that it does this? I figure I could do it if I knew how to require that every <@ inside the parent has a matching @>, but I have no idea how to enforce that.

Casimir et Hippolyte
Freesnöw

2 Answers


What you describe is a "non-regular language". It cannot be parsed with a regexp.

OK, if you are willing to put a limit on the nesting level, you can technically do it with a regexp. But it will be ugly.

Here is how to parse your tags for a few (increasing) maximum nesting depths, provided you can forbid @ characters inside the tag text:

no nesting: <@[^@]+@>
up to 1:    <@[^@]+(<@[^@]+@>)?[^@]*@>
up to 2:    <@[^@]+(<@[^@]+(<@[^@]+@>)?[^@]*@>)?[^@]*@>
up to 3:    <@[^@]+(<@[^@]+(<@[^@]+(<@[^@]+@>)?[^@]*@>)?[^@]*@>)?[^@]*@>
...
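
The patterns above follow a simple rule: each level wraps the previous one as an optional inner tag. As a rough sketch of that rule in Python (the function name `nested_tag_pattern` is made up for illustration, and it uses non-capturing groups where the patterns above use capturing ones), the pattern for any fixed depth could be generated instead of typed by hand:

```python
import re

def nested_tag_pattern(max_depth: int) -> str:
    """Build the bounded-nesting pattern by wrapping the previous level.

    Hypothetical helper; assumes no '@' appears inside the tag text.
    """
    pattern = r"<@[^@]+@>"  # depth 0: no nesting allowed
    for _ in range(max_depth):
        # Each extra level permits one optional inner tag, then more text.
        pattern = r"<@[^@]+(?:" + pattern + r")?[^@]*@>"
    return pattern

text = "<@set:template <@get:oldtemplate @> @>"

# With no nesting allowed, only the inner tag can match.
inner = re.search(nested_tag_pattern(0), text).group(0)

# Allowing one level of nesting matches the whole outer tag.
outer = re.search(nested_tag_pattern(1), text).group(0)

print(inner)
print(outer)
```

This only automates building the pattern. It still fails on anything nested deeper than `max_depth`, which is the limitation this answer is about.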

If you cannot forbid lone @'s in your tags, you will have to replace every instance of [^@] with something like this: (?:[^<@]|<[^@]|@[^>]).

Just think about that, and then think about extending your regex to handle up to 10 levels of nesting.

Here, I will do it for you:

<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[
^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<
[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@
[^>])+(<@(?:[^<@]|<[^@]|@[^>])+@>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>]
)*@>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@
>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@>)?
(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@>

What I hope my answer shows is that regexps are not the right tool for parsing a nested language. A traditional lexer (tokenizer) and parser combination will do a much better job, will be significantly faster, and will handle arbitrarily deep nesting.
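
To give an idea of what the non-regex route might look like, here is a minimal hand-rolled scanner in Python. It is a sketch, not a full lexer/parser: the function name `outermost_tags` is invented for this example, and it assumes that unclosed tags and stray `@>` markers can simply be ignored. It keeps a depth counter and records a span each time the depth returns to zero:

```python
from typing import List

def outermost_tags(text: str) -> List[str]:
    """Return every top-level <@ ... @> span, nested content included.

    Hypothetical helper: unclosed tags and stray '@>' are silently skipped.
    """
    spans = []
    depth = 0   # how many '<@' are currently open
    start = 0   # index where the current top-level tag began
    i = 0
    while i < len(text):
        if text.startswith("<@", i):
            if depth == 0:
                start = i
            depth += 1
            i += 2
        elif depth > 0 and text.startswith("@>", i):
            depth -= 1
            i += 2
            if depth == 0:
                # Back at top level: the outermost tag is complete.
                spans.append(text[start:i])
        else:
            i += 1
    return spans

print(outermost_tags("<@set:template <@get:oldtemplate @> @>"))
```

Because the counter is just an integer, the nesting depth is unbounded and the input is read in a single left-to-right pass. A real implementation would probably also want to report unbalanced tags instead of dropping them.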

Tobia
  • It is possible to allow `@` and `>` while not consuming the end tag with `(?:(?!@>).)*`. Gotta love the end result. – nhahtdh May 16 '13 at 20:56

I don't think you can do this with a regular expression; see the answer to this question, which asks a similar thing. Regexes aren't powerful enough to deal with arbitrary levels of nesting. If you will only ever have 2 levels of nesting, it should be possible, but a regex may still not be the best tool for the job.

Community
codebox