
I have a regular expression that grabs BBcode tags. It works great except for a minor glitch.

Here is the current expression:

\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]

Here is some text it successfully matches against and the groups it builds:

[url=http://www.google.com]Go to google![/url]
1: url
2: http://www.google.com
3: Go to google!

[img]http://www.somesite.com/someimage.jpg[/img]
1: img
2: NULL
3: http://www.somesite.com/someimage.jpg

[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]
1: quote
2: NULL
3: [quote]first nested quote[/quote][quote]second nested quote[/quote]

All of this is great. I can handle nested tags by running the 3rd match group against the same regex and recursively handling every tag nested inside it. The problem is with the example using the [quote] tags. Notice that the 3rd match group is a set of two sibling quote tags, so when we run the regex against it we would expect two matches. However, we get one match, like this:

[quote]first nested quote[/quote][quote]second nested quote[/quote]
1: quote
2: NULL
3: first nested quote[/quote][quote]second nested quote
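For anyone who wants to reproduce this, here is a minimal sketch. It assumes Python's `re` module rather than .NET (the pattern uses nothing engine-specific), and the variable names are only illustrative. It shows the greedy `(.+)` running through to the last closing tag:

```python
import re

# The original pattern from the question; note the greedy (.+) in group 3.
GREEDY = re.compile(r"\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]")

siblings = "[quote]first nested quote[/quote][quote]second nested quote[/quote]"

# Collect group 3 of every match found in the sibling input.
greedy_contents = [m.group(3) for m in GREEDY.finditer(siblings)]
print(greedy_contents)
# → ['first nested quote[/quote][quote]second nested quote']
```

One match comes back instead of two, because `(.+)` consumes as much as it can and only gives characters back until the last `[/quote]` fits.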

Ahhhh! That's not what we wanted at all. The greedy (.+) runs all the way to the last [/quote] instead of stopping at the first one. There is a fairly simple way to fix it: I modify the regex from this:

\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]

To this:

\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]

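By replacing the plain . with ((?!\[/\1\]).), each character of the 3rd match group is consumed only if the closing BBcode tag does not start at that position, so the 3rd match group can never contain the closing tag. So now this works, and we get two matches: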

[quote]first nested quote[/quote][quote]second nested quote[/quote]

[quote]first nested quote[/quote]
1: quote
2: NULL
3: first nested quote

[quote]second nested quote[/quote]
1: quote
2: NULL
3: second nested quote
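A minimal sketch of the modified pattern on the same sibling input, again assuming Python's `re` in place of .NET (the negative lookahead and the backreference behave the same way here); the names are illustrative:

```python
import re

# The modified pattern: every character in group 3 is guarded by (?!\[/\1\]).
TEMPERED = re.compile(
    r"\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]"
)

siblings = "[quote]first nested quote[/quote][quote]second nested quote[/quote]"

# Group 3 is the tag content; group 4 is only the last guarded character.
tempered_contents = [m.group(3) for m in TEMPERED.finditer(siblings)]
print(tempered_contents)
# → ['first nested quote', 'second nested quote']
```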

I was happy that fixed it, but now we have another problem. This new regex fails on the first one where we nest the two quote tags under one larger quote tag. We get two matches instead of one:

[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]

[quote][quote]first nested quote[/quote]
1: quote
2: NULL
3: [quote]first nested quote

[quote]second nested quote[/quote]
1: quote
2: NULL
3: second nested quote

The first match is all wrong and the second match, while well-formed, is not a desired match. We wanted one big match with the 3rd match group being the two nested quote tags, like when we used the first expression.
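The failure on the nested input can be reproduced with a short sketch (Python's `re` is assumed instead of .NET, and the names are illustrative). The guarded group stops at the first `[/quote]` it sees, whatever the nesting depth, so the outer opening tag gets paired with the first inner closing tag:

```python
import re

# The modified pattern from the question, unchanged.
TEMPERED = re.compile(
    r"\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]"
)

nested = ("[quote][quote]first nested quote[/quote]"
          "[quote]second nested quote[/quote][/quote]")

# Whole-match text of every match found in the nested input.
nested_matches = [m.group(0) for m in TEMPERED.finditer(nested)]
print(nested_matches)
# → ['[quote][quote]first nested quote[/quote]', '[quote]second nested quote[/quote]']
```

Neither pattern keeps track of how many tags are currently open, which is the missing piece.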

Any suggestions? If I can just cross this gap I should have a fairly powerful BBcode expression.

halfer
Chev

1 Answer


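Using balancing groups (a .NET regex feature) you can construct a regex like this. It is written in free-spacing form, so compile it with RegexOptions.IgnorePatternWhitespace: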

(?>
  \[ (?<tag>[^][/=\s]+) \s*
  (?: = \s* (?<val>[^][]*) \s*)?
  ]
)

(?<content>
  (?>
    \[(?<innertag>[^][/=\s]+)[^][]*]
    |
    \[/(?<-innertag>\k<innertag>)]
    |
    [^][]+
  )*
  (?(innertag)(?!))
)

\[/\k<tag>]

Simplified according to Kobi's example.
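Balancing groups are specific to .NET, so the regex above will not run in most other engines. As a rough illustration of the bookkeeping it performs (push on an inner opening tag, pop on the matching closing tag, require an empty stack at the end), here is a sketch in Python with an explicit stack. This is a different technique swapped in for illustration, not a translation of the regex; `TOKEN` and `top_level_tags` are hypothetical names, and the choice to discard everything open when a closing tag does not match is an assumption of this sketch.

```python
import re

# One token: an opening tag (optional =value), a closing tag, or plain text.
TOKEN = re.compile(
    r"\[(?P<open>[^\]\[/=\s]+)\s*(?:=\s*(?P<val>[^\]\[]*))?\]"
    r"|\[/(?P<close>[^\]\[/=\s]+)\]"
    r"|[^\]\[]+"
)

def top_level_tags(text):
    """Return (tag, value, content) for each balanced top-level tag pair."""
    results = []
    stack = []  # entries: (tag name, value, index just past the opening tag)
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            pos += 1  # stray bracket: skip a single character
            continue
        if m.group("open") is not None:
            stack.append((m.group("open"), m.group("val"), m.end()))
        elif m.group("close") is not None:
            if stack and stack[-1][0] == m.group("close"):
                tag, val, content_start = stack.pop()
                if not stack:  # depth is back to zero: a complete top-level tag
                    results.append((tag, val, text[content_start:m.start()]))
            else:
                stack.clear()  # assumption: a mismatched close resets the state
        pos = m.end()
    return results

nested = ("[quote][quote]first nested quote[/quote]"
          "[quote]second nested quote[/quote][/quote]")
print(top_level_tags(nested))
# → [('quote', None, '[quote]first nested quote[/quote][quote]second nested quote[/quote]')]
```

Calling `top_level_tags` again on the returned content yields the two inner quote tags, which is the recursive handling the question describes.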


In the following:

[foo=bar]baz[/foo]
[b]foo[/b]
[i][i][foo=bar]baz[/foo]foo[/i][/i]
[i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]
[quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]bar[quote]baz[/quote][/quote]

It finds these matches:

  • [foo=bar]baz[/foo]
  • [b]foo[/b]
  • [i][i][foo=bar]baz[/foo]foo[/i][/i]
  • [i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]
  • [quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]bar[quote]baz[/quote][/quote]

Full example at http://ideone.com/uULOs

(Old version http://ideone.com/AXzxW)

Qtax
  • Nice. You can simplify it quite a bit and eliminate some nested quantifiers: http://ideone.com/AXzxW . If you don't mind capturing spaces and words between the tags, you can simplify it even further: http://ideone.com/82lLX . Yeah, I like balancing groups. – Kobi Aug 12 '11 at 10:06
  • What's your first link? You pasted my original link. Your second example (even though it doesn't capture the required parts) is a nice way of structuring it. And yeah, balancing groups are a nice feature. – Qtax Aug 12 '11 at 15:35
  • Oh, sorry. It should have been this one: http://ideone.com/cvlNM As for the second one - admittedly there's room for improvement. A simple `(?=\[)` at the start is enough to capture only tags, and it's also possible to add some groups to capture names and contents of other tags - .NET allows fine-tuning of its captures. – Kobi Aug 12 '11 at 16:29
  • Kobi, +1, much better way of writing it. I'll incorporate that in the answer if you don't mind. :-) – Qtax Aug 12 '11 at 16:54
  • Great answer. Works perfect. How do you guys learn all this stuff? If I want to be a .NET regex guru where should I go? – Chev Aug 12 '11 at 19:24
  • @Alex, read through the manual and learn about every feature. Although I think the only thing special to .NET regex is these balancing groups (stacking of captures), along with *the lack* of some features. I saw that Kobi has some good balancing group examples on his blog that you can check out. I guess "Mastering Regular Expressions" could be a good read too. – Qtax Aug 12 '11 at 21:07
  • @Qtax, what do you mean by "read through the manual"? What manual? – Chev Aug 13 '11 at 03:36
  • @Alex, I was referring to the [.NET regex documentation](http://msdn.microsoft.com/en-us/library/hs600312.aspx). – Qtax Aug 13 '11 at 12:24