
I'm trying to build a stable system that lets users paste any mixture of BBCode and HTML into an input, which I will then sanitize and strip as I want.

The content is copied from forums, and the problem is that they all seem to use different markup. Some display more than one,
some use a self-closing br tag. Others use [URL =], and others just use [URL]URL[/URL], etc.

So far, I use HTML Purifier to strip everything except img tags.

HTML Purifier doesn't (as far as I can see) remove BBCode. So, given a string like this:

[URL=http://awebsite.com]My Link [IMG]imagelink.png[/IMG][/URL]

How can I remove the URL tags and leave just the IMG tags?

I want to remove everything that belongs to the URL tag: the URL given and the link text as well, which may prove difficult.

So far I have gotten quite far by converting [IMG] tags etc. using regex, which works, but I feel there are too many variants to hardcode.

Any suggestions for a more efficient (or simply workable) way to remove the URL tags?

Lovelock

2 Answers


Option 1

If you just want to remove tags such as [URL=http://awebsite.com] and [/URL], leaving the content inside, the regex is simple:

Search: \[/?URL[^\]]*\]

Replace: Empty string

In JavaScript

replaced = string.replace(/\[\/?URL[^\]]*\]/g, "");

In PHP

$replaced = preg_replace('%\[/?URL[^\]]*\]%', '', $str);

Option 2: Also removing content such as "My Link"

Here, we'll replace the content following [URL...] that is not another tag.

Search: \[URL[^\]]*\][^\[\]]*|\[/URL[^\]]*\]

Replace: Empty string

JavaScript:

replaced = string.replace(/\[URL[^\]]*\][^\[\]]*|\[\/URL[^\]]*\]/g, "");

PHP:

$replaced = preg_replace('%\[URL[^\]]*\][^\[\]]*|\[/URL[^\]]*\]%', '', $str);
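
As a quick check, here is a minimal sketch that runs both options against the sample string from the question. The function names `stripUrlTags` and `stripUrlTagsAndText` are hypothetical, chosen only for illustration; the patterns are the ones shown above.

```javascript
// Sample input taken from the question.
const sample = "[URL=http://awebsite.com]My Link [IMG]imagelink.png[/IMG][/URL]";

// Option 1: drop only the [URL...] and [/URL] tags, keep what is between them.
function stripUrlTags(input) {
  return input.replace(/\[\/?URL[^\]]*\]/g, "");
}

// Option 2: also drop the plain text that directly follows the opening tag.
function stripUrlTagsAndText(input) {
  return input.replace(/\[URL[^\]]*\][^\[\]]*|\[\/URL[^\]]*\]/g, "");
}

console.log(stripUrlTags(sample));        // My Link [IMG]imagelink.png[/IMG]
console.log(stripUrlTagsAndText(sample)); // [IMG]imagelink.png[/IMG]
```

Note that Option 2 only removes text up to the next bracket. Link text that comes after a nested tag, as in `[URL=x][IMG]a.png[/IMG] caption[/URL]`, is left in place.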
zx81
  • Amazing! Works perfectly! Didn't think it would be possible, but a bit of regex magic and it works. Really need to have a good learn of creating regex. Last question, @zx81: is there a way to make the URL part of the regex case-insensitive? Sometimes people use lowercase url tags. I could use two different preg_replace calls, one for lowercase and one for uppercase, but that seems silly if there's a way to do it case-insensitively. Thanks! – Lovelock Jun 18 '14 at 08:16
  • `Really need to have a good learn of creating regex.` Well, if you're starting to study more regex and are interested in collecting cool techniques, maybe you'd like to look at this question about a very common problem, [(matching... except)](http://stackoverflow.com/q/23589174/), or save it for later. I had a lot of fun writing the answer. :) – zx81 Jun 18 '14 at 09:14
  • I'll take a look! So much in my learning list bookmarks right now :) Have upvoted too! – Lovelock Jun 18 '14 at 09:18
  • Thanks, see you next time. :) – zx81 Jun 18 '14 at 09:19
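
The case-sensitivity question in the comments above can likely be handled with the regex case-insensitive flag rather than two separate replacements. The following is a minimal sketch, not part of the original answer; the function name `stripUrlTagsAnyCase` is hypothetical. In JavaScript the flag is `i` (combined here with `g`); in PHP the equivalent is to append the `i` modifier after the closing delimiter of the pattern, e.g. `'%...%i'`.

```javascript
// Same pattern as Option 2 of the answer, with the "i" flag added so that
// [URL], [url], [Url] and so on are all matched.
function stripUrlTagsAnyCase(input) {
  return input.replace(/\[URL[^\]]*\][^\[\]]*|\[\/URL[^\]]*\]/gi, "");
}

const mixedCase = "[url=http://awebsite.com]My Link [img]imagelink.png[/img][/Url]";
console.log(stripUrlTagsAnyCase(mixedCase)); // [img]imagelink.png[/img]
```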

A solution could be to extract only IMG tags using regex:

$pattern ="#\[IMG\](https?://[-\w\.]+(:\d+)?/[\w/_\.]*(\?\S+?)?)?\[\/IMG\]#";
$str = "[URL=http://awebsite.com]My Link [IMG]http://google.com/imagelink.png[/IMG][/URL]";
preg_match($pattern, $str, $matches);
print_r($matches);

Result:

Array
(
    [0] => [IMG]http://google.com/imagelink.png[/IMG]
    [1] => http://google.com/imagelink.png
)
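
Two caveats about the snippet above: `preg_match` returns only the first match, and the pattern expects an absolute http(s) URL, so a relative name such as `imagelink.png` from the question would not be captured. A minimal sketch of a looser variant follows, written in JavaScript. The helper name `extractImages` is hypothetical, and accepting any bracket-free content between the tags is an assumption rather than part of the original answer. In PHP, `preg_match_all` is the function that returns every match.

```javascript
// Collect the contents of every [IMG]...[/IMG] pair, in any letter case.
// Assumption: image references never contain "[" or "]".
function extractImages(input) {
  const pattern = /\[IMG\]([^\[\]]*)\[\/IMG\]/gi;
  return Array.from(input.matchAll(pattern), (m) => m[1].trim());
}

const post =
  "[URL=http://awebsite.com]My Link [IMG]http://google.com/imagelink.png[/IMG][/URL] [img]imagelink.png[/img]";
console.log(extractImages(post));
// [ 'http://google.com/imagelink.png', 'imagelink.png' ]
```

Because this version accepts anything between the tags, the extracted values should still be validated or sanitized before they are turned into img elements.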
Cristian