I'm trying to extract non-HTML tags (like <!This TAG>) from strings. I use the regular expression below to extract the tags:

$Tags = preg_split('/(<![^>]*[^\/]>)/i', $Content, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

The problem is that all HTML comment tags (like <!-- This One -->) are extracted as well.

I can use a trick like the example below to skip comment tags, but any non-HTML tags inside a comment are still extracted!

foreach($Tags as $key => $value) {
    if(mb_substr($value, 0, 4) == '<!--')
        continue;
    $CheckTag = mb_substr($value, 0, 2);
    if($CheckTag == '<!') {
        //...
    }
}

For example:

<!--<p>some text here.</p>--> => Works.

<!-- <!Tag1><!Tag2><!Tag3> --> => Does not work! (Tag2 and Tag3 are extracted)

I'm looking for a better regular expression that skips the entire content between <!-- and -->. Thanks for any tips.

For a better perspective, this is the original function:

public function extractFakeTags($Content) {
        $Tags = preg_split('/(<![^>]*[^\/]>)/i', $Content, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
        $FakeTags = array();
        $Content = $Tags;
        foreach($Tags as $key => $current) {
            if(mb_substr($current, 0, 4) == '<!--')
                continue;
            $TagBegin = mb_substr($current, 0, 2);
            if($TagBegin == '<!') {
                $TagLength = mb_strlen($current);
                $TagEnd = mb_substr($current, ($TagLength-1), 1);
                if($TagEnd=='>') {
                    $TagName = mb_substr($current, 2, ($TagLength-3));
                    if (array_key_exists($TagName, $FakeTags)) {
                        array_push($FakeTags[$TagName], $key);
                    }
                    else {
                        $FakeTags[$TagName] = array($key);
                    }
                    $Content[$key] = NULL;
                }
            }
        }
        return $FakeTags;
    }
DarkMaze
  • Mistake #1: using a regex. You should be using [DOM](http://php.net/dom). Though, since you're dealing with non-HTML "tags", htmlpurifier would probably be a better choice. – Marc B Feb 02 '15 at 19:36
  • This code is part of an HTML parser engine, and I don't want to use any third-party class for security reasons. DOM is not an optimal way to parse many big strings (processing time issue). – DarkMaze Feb 02 '15 at 19:41
  • What security reasons (out of curiosity)? – Jay Blanchard Feb 02 '15 at 19:47
  • Any intangible or unwanted bugs, or any problems introduced by future updates! This simple code has worked for me so far, but I'm not an expert in regex, so a simple change to this regex could solve my problem instead of getting me into a bigger one! – DarkMaze Feb 02 '15 at 19:56
  • The purpose of this code is to split a custom HTML template into an array, for example: `
    ` to `[0] => '
    ', [tag1] => NULL, [2] => '
    `
    – DarkMaze Feb 02 '15 at 20:09
  • I found a solution: first remove any comments with this regex: `` then extract the non-HTML tags with: `(<![^>]*[^\/]>)` – DarkMaze Feb 02 '15 at 20:31
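
The two-step approach described in the last comment can be sketched roughly as follows. The comment-stripping regex did not survive in the comment as shown here, so the pattern `/<!--.*?-->/s` and the sample string are assumptions for illustration, not necessarily what the asker used.

```php
<?php
// Hypothetical sample: a multi-line comment hiding two fake tags, plus one real fake tag.
$content = "x<!-- <!Tag1>\n<!Tag2> -->y<!Tag3>z";

// Step 1: strip HTML comments. Assumed pattern; the s modifier lets "." match newlines.
$withoutComments = preg_replace('/<!--.*?-->/s', '', $content);

// Step 2: split on the remaining <!...> tags, using the pattern from the question.
$parts = preg_split(
    '/(<![^>]*[^\/]>)/i',
    $withoutComments,
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

print_r($parts);
```

Note that this variant drops the comments from the resulting pieces entirely, which may or may not be acceptable if the template has to be reassembled later.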

1 Answer

I'm looking for better regular expression to skip entire content between <!-- to -->

To skip something, use (*SKIP)(*F). Put a pattern such as <!--(?s:.*?)-->(*SKIP)(*F)| in front of your existing pattern:

/<!--(?s:.*?)-->(*SKIP)(*F)|(<![^>]*[^\/]>)/i

I didn't modify your actual regex. Regex101 is good for testing; also see the Regex FAQ :)
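
As a minimal sketch of how the pattern above behaves with the `preg_split` call from the question (the sample string is made up for illustration):

```php
<?php
// Hypothetical sample: two fake tags inside a comment, one outside.
$content = 'a<!-- <!Tag1><!Tag2> -->b<!Tag3>c';

// The first alternative consumes a whole comment and then fails on purpose;
// (*SKIP) makes the engine resume after the comment instead of retrying inside it.
// Only <!...> tags outside comments are matched by the second alternative.
$pattern = '/<!--(?s:.*?)-->(*SKIP)(*F)|(<![^>]*[^\/]>)/i';

$parts = preg_split(
    $pattern,
    $content,
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

print_r($parts);
```

With this input the comment stays embedded in the first piece (`a<!-- <!Tag1><!Tag2> -->b`), and only `<!Tag3>` is returned as a delimiter, followed by `c`.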

Jonny 5
  • Thanks, that's what I'm looking for! – DarkMaze Feb 02 '15 at 20:36
  • Is this a better solution for skipping comments: `(*SKIP)(*F)|(<![^>]*[^\/]>)`, according to [this link](http://stackoverflow.com/questions/19676024/using-regular-expression-remove-html-comments-from-content)? – DarkMaze Feb 02 '15 at 20:41
  • @DarkMaze I used the `s` [flag](http://php.net/manual/en/reference.pcre.pattern.modifiers.php) ...`(?s:`... to make the dot also match newlines. Here it's the same as adding `s` at the end ...`/is` and changing the first part to `(*SKIP)(*F)` – Jonny 5 Feb 02 '15 at 20:44
  • @DarkMaze All these variants do the same thing: they match as few characters as possible (including newlines) between `` without capturing. 1.) `//` 2.) `//` 3.) `//` 4.) `//s` 5.) `//Us` 6.) `//` They are just different notations. Yours possibly needs a few more steps because of the alternation. – Jonny 5 Feb 02 '15 at 21:03