6

The strip_tags() documentation tells us that all tags except those listed in the second parameter are stripped. That is the opposite of what the name suggests; it should have been named strip_all_tags_except().

Let's set the name aside and get to what I want to ask. I want to remove only the tags I list in the second parameter, i.e. I want the following call to strip the <iframe>, <script>, <style>, <embed> and <object> tags and allow all others.

my_strip_tags($data,'<iframe><script><style><embed><object>');

That is the opposite of what strip_tags() does.

How do I make this happen?

hakre
Tabrez Ahmed
  • Did you read the comments on the documentation page you linked to, that has at least 2 examples of how to do what you're asking? – Wooble Mar 20 '12 at 14:28
  • If you want to sanitize user input and allow some HTML tags, you should consider a library that does this, because strip_tags() (or a similar function that lets you choose which tags to allow) won't filter attributes, and those can also be used for XSS injection. – mishu Mar 20 '12 at 14:30
  • I fail to see how the name is wrong: `$stripped = strip_tags($html);` strips all the tags, as advertised. The exempted tags option is just that... an **OPTION**. If you don't use it, then the function strips everything. – Marc B Mar 20 '12 at 14:31
  • You might want to update anything you've used my answer in - I found a big security hole. It's patched now, though. – Ry- Jun 24 '12 at 02:31

4 Answers

3

It shouldn't happen at all.

strip_tags() is only safe when used without the second (allowed tags) parameter. Otherwise every tag you allow is a potential XSS vector, because its attributes pass through untouched.

As a matter of fact, your concern should be not only tags but also attributes. So, use some sort of HTML purifier instead.
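To illustrate the attribute problem, here is a minimal sketch. The sample markup is made up for the example; the point is only that strip_tags() leaves attributes on allowed tags alone.

```php
<?php
// Hypothetical user input: an allowed tag carrying an event-handler attribute.
$input = '<a href="#" onclick="alert(1)">click</a><script>alert(2)</script>';

// Allow only <a>. The <script> tags are stripped (their text content stays),
// but the onclick attribute on the allowed <a> tag is kept as-is.
$output = strip_tags($input, '<a>');

echo $output, "\n";
```

The onclick handler is still present in the result, which is why a tag whitelist alone does not make the markup safe.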

Ry-
Your Common Sense
3

Updated 2012-06-23 to fix a major security flaw.

Here's a class from another project that should do what you're looking for:

final class Filter {
    private function __construct() {}

    const SafeTags = 'a abbr acronym address b bdo big blockquote br caption center cite code col colgroup dd del dfn dir div dl dt em font h1 h2 h3 h4 h5 h6 hr i img ins kbd legend li ol p pre q s samp small span strike strong sub sup table tbody td tfoot th thead tr tt u ul var article aside figure footer header nav section rp rt ruby dialog hgroup mark time';
    const SafeAttributes = 'href src title alt type rowspan colspan lang';
    const URLAttributes  = 'href src';

    public static function HTML($html) {
        # Get array representations of the safe tags and attributes:
        $safeTags = explode(' ', self::SafeTags);
        $safeAttributes = explode(' ', self::SafeAttributes);
        $urlAttributes = explode(' ', self::URLAttributes);

        # Parse the HTML into a document object:
        $dom = new DOMDocument();
        $dom->loadHTML('<div>' . $html . '</div>');

        # Loop through all of the nodes:
        $stack = new SplStack();
        $stack->push($dom->documentElement);

        while($stack->count() > 0) {
            # Get the next element for processing:
            $element = $stack->pop();

            # Add all the element's child nodes to the stack:
            foreach($element->childNodes as $child) {
                if($child instanceof DOMElement) {
                    $stack->push($child);
                }
            }

            # And now, we do the filtering:
            if(!in_array(strtolower($element->nodeName), $safeTags)) {
                # It's not a safe tag; unwrap it:
                while($element->hasChildNodes()) {
                    $element->parentNode->insertBefore($element->firstChild, $element);
                }

                # Finally, delete the offending element:
                $element->parentNode->removeChild($element);
            } else {
                # The tag is safe; now filter its attributes:
                for($i = 0; $i < $element->attributes->length; $i++) {
                    $attribute = $element->attributes->item($i);
                    $name = strtolower($attribute->name);

                    if(!in_array($name, $safeAttributes) || (in_array($name, $urlAttributes) && substr($attribute->value, 0, 7) !== 'http://')) {
                        # Found an unsafe attribute; remove it:
                        $element->removeAttribute($attribute->name);
                        $i--;
                    }
                }
            }
        }

        # Finally, return the safe HTML, minus the DOCTYPE, <html> and <body>:
        $html  = $dom->saveHTML();
        $start = strpos($html, '<div>');
        $end   = strrpos($html, '</div>');

        return substr($html, $start + 5, $end - $start - 5);
    }
}
Ry-
  • This function can strip tags, but keeps the text that is inside those tags, which may not be what you want. – Dylan Jun 23 '12 at 13:52
  • @Dylan: Well, like I said, it's for another project - and that was exactly what I wanted, personally. (If you're worried about the contents of ` – Ry- Jun 23 '12 at 18:08
  • @minitech: Would love to see having you a github account linked on your profile page ;) – hakre Jun 24 '12 at 08:58
  • This doesn't play well with UTF-8 foreign characters. I get entities based on scrambled ISO-8859-1 characters. – tim May 12 '13 at 00:11
  • @tim: Don't use PHP if you're even remotely interested in that stuff :P – Ry- May 12 '13 at 01:06
1

I usually work with the htmLawed library; you can use it to filter, secure and sanitize HTML:

http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/more.htm

0

I think the strip_tags() functionality matches its name. It's all a matter of perspective. :-) Without a second parameter, it strips all tags. The second parameter provides exceptions to the basic functionality.
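A quick sketch of the two modes, using a made-up sample string:

```php
<?php
$html = '<p>Hello <b>world</b></p>';

// No second parameter: every tag is stripped.
$all = strip_tags($html);

// Second parameter: the listed tags are the exceptions and are kept.
$some = strip_tags($html, '<b>');

echo $all, "\n";   // Hello world
echo $some, "\n";  // Hello <b>world</b>
```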

What you want seems to be strip_some_tags().

What about just doing it with a regexp?

function strip_some_tags($input, $taglist) {
  $output=$input;
  foreach ($taglist as $thistag) {
    if (preg_match('/^[a-z]+$/i', $thistag)) {
      $patterns=array(
        '/' . "<".$thistag."\/?>" . '/',
        '/' . "<\/".$thistag.">" . '/'
      );
    } else
    if (preg_match('/^<[a-z]+>$/i', $thistag)) {
      $patterns=array(
        '/' . str_replace('>', "\/?>", $thistag) . '/',  # optional slash, so <tag> and <tag/> both match
        '/' . str_replace('<', "<\/?", $thistag) . '/'
      );
    }
    else {
      $patterns=array();
    }
    $output=preg_replace($patterns, "", $output);
  }
  return $output;
}

$to_strip=array( "iframe", "script", "style", "embed", "object" );

$sampletext="Testing. <object>Am I an object?</object>\n";

print strip_some_tags($sampletext, $to_strip);

Returns:

Testing. Am I an object?

Of course, this just strips the tags, not the stuff between them. Is that what you want? You didn't specify in your question.
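If you do want the contents gone as well, a rough sketch along the same regexp lines could look like the following. The function name strip_elements() is made up for this example, and it assumes each element has a matching closing tag and is not nested inside another element of the same name. It is not a sanitizer; for untrusted input, use a proper HTML filtering library.

```php
<?php
// Hypothetical helper: removes each listed element together with its contents.
// Assumes well-formed, non-nested elements; not suitable as an XSS filter.
function strip_elements($input, array $tags) {
    foreach ($tags as $tag) {
        $name = preg_quote($tag, '#');
        // <tag ...> then the shortest possible run of anything, then </tag>.
        // i = case-insensitive, s = let "." match newlines too.
        $pattern = '#<' . $name . '\b[^>]*>.*?</' . $name . '\s*>#is';
        $input = preg_replace($pattern, '', $input);
    }
    return $input;
}

$sample = "Testing. <object>Am I an object?</object>\n";
$result = strip_elements($sample, array('iframe', 'script', 'style', 'embed', 'object'));

print $result;  // prints "Testing. " followed by a newline
```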

ghoti
  • A wise man on another question said that regexes are perfectly usable on HTML as long as you don't have to ask how. :-) – ghoti Mar 20 '12 at 14:47
  • @ghoti: You mean... http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491 – BoltClock Mar 20 '12 at 14:49
  • @BoltClock - that's the one, yes. – ghoti Mar 20 '12 at 14:52
  • Thanks for replying. Yes, I just want to strip the tags. Stripping the contents is also something I will need in the future. – Tabrez Ahmed Mar 20 '12 at 14:54