I am currently writing a BBCode parsing engine and have run into a problem that I can't figure out on my own.

I have hit a problem exactly like this one: Apache / PHP on Windows crashes with regular expression

That is, if I run something like the example below, Apache crashes because the recursion count reaches 690 (1MB memory limit for PCRE):

$txt = '[b]'.str_repeat('a', 338).'[/b]';  // if I change repeat count to lower value it's ok
$regex = '#\[(?P<attributes>(?P<tag>[a-z0-9_]*?)(?:=.*?|\s.*?|))](?P<content>(?:[^[]|\[(?!/?(?P=tag)])|(?R))+?)\[/(?P=tag)]#mi';

echo preg_replace_callback($regex, function($matches) { return $matches['content']; }, $txt);

So I need to somehow reduce the use of * and + in my regex, but that's where I'm out of ideas, so I thought maybe you could suggest something.

Other approaches to parsing BBCode (ones that can handle nested tags) are welcome. However, I would rather not use an already built class or anything like that. I like to do things on my own!

I have also looked into PECL and Pear HTML_BBCodeParser. But I don't want my application to be dependent on extensions. More likely, I will write a script that checks for the extension and, if it doesn't exist, falls back to the BBCode parser I'm trying to write here.

Sorry if my descriptions are unclear, I'm not a pro at English ^^

EDIT. So the regex explained:

\[(?P<attributes>(?P<tag>[a-z0-9_]*?)(?:=.*?|\s.*?|))]

This is my opening tag. I have used named groups: 'tag' identifies the tag name and 'attributes' identifies the tag's attributes. Think of the tag name as an attribute as well. So what is happening here? I try to match a tag name; once it is matched, I try to match anything after an = sign or anything after \s (whitespace) until the closing ] of the tag is reached.

(?P<content>(?:[^[]|\[(?!/?(?P=tag)])|(?R))+?)

Now here I am trying to match the content. This is the tricky part. I match any character that is not [, or a [ that does not begin my own opening or ending tag, or a whole nested tag via recursion (?R), and I tell the regex engine to keep doing so until...

\[/(?P=tag)]

... the ending tag is found.
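
As suggested in the comments, the same pattern can be spread over several lines. This is only a sketch, assuming the x (extended) flag and a ~ delimiter instead of # (because # begins a comment in extended mode); the whitespace and comments are the only other changes, so the recursion problem itself is untouched:

```php
<?php
// Same pattern as above, laid out with the x flag so each part has a comment.
$regex = '~
    \[                               # opening bracket of the start tag
    (?P<attributes>
        (?P<tag>[a-z0-9_]*?)         # tag name, matched lazily
        (?:=.*?|\s.*?|)              # "=value", whitespace plus attributes, or nothing
    )
    ]                                # closing bracket of the start tag
    (?P<content>
        (?:
            [^[]                     # any character that is not an opening bracket
          | \[(?!/?(?P=tag)])        # a bracket that neither reopens nor closes this tag
          | (?R)                     # a complete nested tag (recursion)
        )+?
    )
    \[/(?P=tag)]                     # the matching end tag
~xmi';

// Short inputs work; the crash only shows up with long content.
echo preg_replace_callback($regex, function ($matches) {
    return $matches['content'];
}, '[b]hello[/b]'); // prints: hello
```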

Paul
  • "I have also looked into PECL and Pear HTML_BBCodeParser. But i don't want my application to be dependant on extensions" -- I would argue that this option is far more preferable to reinventing the wheel. – Jonathan Fingland Aug 31 '10 at 20:54
  • `I like to do things on my own!` - why is that? Do you write your own regular expression engine as well? Or your own php interpreter/runtime? – VolkerK Aug 31 '10 at 20:56
  • Btw: You might want to spread your regular expression code out onto multiple lines and explain the parts with comments. I _guess_ that can improve your chances of getting help. – VolkerK Aug 31 '10 at 21:08
  • Thanks for the tip, VolkerK. I didn't mean exactly that by "I like to do things on my own!" Oh well... Let's forget it. I have explained the code, hope it is ok now. – Paul Aug 31 '10 at 21:31
  • Can you give an example of a string for which you run into the limit? – Artefacto Aug 31 '10 at 22:09
  • Artefacto: $txt = '[b]'.str_repeat('a', 338).'[/b]'; 338 'a' chars inside [b][/b] tag. – Paul Aug 31 '10 at 22:15

2 Answers

Your regex, especially the zero-width assertions (lookarounds), causes the regex engine to backtrack catastrophically. Moral of the story: regex shouldn't be used to parse languages that are not regular. If you have nested structures, that's not a regular language.
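
For what it's worth, runaway backtracking can be made to fail softly instead of running away. A rough sketch, assuming PHP's documented `pcre.backtrack_limit` setting and `preg_last_error()`; the pattern is a deliberately pathological stand-in, not the one from the question, and switching off `pcre.jit` is an assumption made so that the interpreter limit is the one that applies:

```php
<?php
// Assumption: JIT off, so the backtrack limit below governs the match.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '1000');

// Nested quantifiers over text that can never satisfy the final [!?]:
// the engine tries a huge number of ways to split the input before giving up.
$result = preg_match('/(?:\D+|<\d+>)*[!?]/', 'foobar foobar foobar');

if ($result === false && preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "Backtrack limit was exhausted\n";
}
```

This only reports the failure, it doesn't make the pattern any better. `pcre.recursion_limit` is the related setting for recursion depth.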

In fact, I think BBCode is evil. BBCode is a markup language invented by lazy programmers who didn't want to filter HTML the proper way. As a result, we now have a loose "standard" that's hard to implement. Filter your HTML the right way:

http://htmlpurifier.org/

NullUserException
  • Hmm... Maybe you are right about using regex for such a matter. Well, HTML instead of BBCode would be great, but people are used to BBCode and it has become a kind of standard by now, so you can't throw it out. – Paul Aug 31 '10 at 21:48

I was going to suggest a BBCodeParser...

I have also looked into PECL and Pear HTML_BBCodeParser. But i don't want my application to be dependant on extensions

I find that to be very strange. Why reinvent the wheel? One of the principles of good software-engineering is DRY (Don't Repeat Yourself). You're trying to solve a problem that has already been solved.

I like to do things on my own!

That's not bad in and of itself, but there are times when you are better off using a tried and true solution; one that is better tested and more robust than your own (as you're finding out). That way you will spend time on the problem you actually want to solve instead of solving a problem that has already been solved. Don't fall into the trap of reinventing the wheel. :)

My suggestion (and solution) to you is to use a BBCode parser.

EDIT

Another thing is that you're parsing something that is HTML-like. Things of that nature don't lend themselves easily to being parsed by regular expressions.
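
Just to illustrate the difference (not as a replacement for a tested parser), here is a minimal sketch of how a parser typically handles nesting: a flat, non-recursive regex only splits the text into tokens, and a stack keeps track of the open tags. The function name `bbcode_to_html`, the tag whitelist and the HTML mapping are all made up for this example, and attributes such as `[url=...]` are not handled:

```php
<?php
// Hypothetical helper for illustration; whitelist and HTML mapping are assumptions.
function bbcode_to_html(string $text): string
{
    $allowed = ['b' => 'strong', 'i' => 'em', 'u' => 'u'];

    // Flat pattern, no recursion: cut the input into tags and plain text.
    $tokens = preg_split('~(\[/?[a-z0-9_]+\])~i', $text, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $stack = [];  // names of the tags that are currently open
    $out = '';

    foreach ($tokens as $token) {
        if (preg_match('~^\[(/?)([a-z0-9_]+)\]$~iD', $token, $m)) {
            $name = strtolower($m[2]);
            if (isset($allowed[$name])) {
                if ($m[1] === '') {
                    // Opening tag: remember it and emit the HTML start tag.
                    $stack[] = $name;
                    $out .= '<' . $allowed[$name] . '>';
                    continue;
                }
                if ($stack && end($stack) === $name) {
                    // Closing tag that matches the innermost open tag.
                    array_pop($stack);
                    $out .= '</' . $allowed[$name] . '>';
                    continue;
                }
            }
        }
        // Plain text, unknown tags and mismatched closers are output escaped.
        $out .= htmlspecialchars($token, ENT_QUOTES, 'UTF-8');
    }

    // Close whatever is still open at the end of the input.
    while ($stack) {
        $out .= '</' . $allowed[array_pop($stack)] . '>';
    }

    return $out;
}

echo bbcode_to_html('[b]x[i]y[/i]z[/b]'); // prints: <strong>x<em>y</em>z</strong>
```

Because the work per token is constant, the length of the content between the tags no longer matters; the 338-character example from the question is just one long text token here.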

Vivin Paliath
  • Well, I was talking about the server-side extensions that php.net suggests (I mean ones that have to be installed by the hosting company or server admin). It is always better to have a standalone app that you can just upload to the host and it is ready to use. – Paul Aug 31 '10 at 22:00
  • This should be a comment since it doesn't answer the OP question. – Artefacto Aug 31 '10 at 22:03
  • I guess you missed this part: "My suggestion (and solution) to you is to use a BBCode parser.", and the part after the edit. – Vivin Paliath Aug 31 '10 at 22:04
  • @Paul you can always ask the hosting company or server-admin to include that particular extension. There are many developer-friendly hosting-solutions. If that was not the case, you would have to rewrite every single extension! – Vivin Paliath Aug 31 '10 at 22:21