Extract non-HTML tags with a regular expression in PHP

Question

I'm trying to extract non-HTML tags (like <!This TAG>) from strings. I use the regular expression below to extract the tags:

$Tags = preg_split('/(<![^>]*[^\/]>)/i', $Content, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

The problem is that all HTML comment tags (like <!-- comment -->) are extracted as well.

I can use a trick like the example below to skip comment tags, but any non-HTML tags inside a comment are still extracted!

foreach($Tags as $key => $value) {
    if(mb_substr($value, 0, 4) == '<!--')
        continue;
    $CheckTag = mb_substr($value, 0, 2);
    if($CheckTag == '<!') {
        //...
    }
}

For example:

 => Works.

 => Does not work! (Tag2 and Tag3 are extracted)

I'm looking for a better regular expression that skips the entire content between <!-- and -->. Thanks for any tips.

For a better perspective this is the original function:

public function extractFakeTags($Content) {
        $Tags = preg_split('/(<![^>]*[^\/]>)/i', $Content, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
        $FakeTags = array();
        $Content = $Tags;
        foreach($Tags as $key => $current) {
            if(mb_substr($current, 0, 4) == '<!--')
                continue;
            $TagBegin = mb_substr($current, 0, 2);
            if($TagBegin == '<!') {
                $TagLength = mb_strlen($current);
                $TagEnd = mb_substr($current, ($TagLength-1), 1);
                if($TagEnd=='>') {
                    $TagName = mb_substr($current, 2, ($TagLength-3));
                    if (array_key_exists($TagName, $FakeTags)) {
                        array_push($FakeTags[$TagName], $key);
                    }
                    else {
                        $FakeTags[$TagName] = array($key);
                    }
                    $Content[$key] = NULL;
                }
            }
        }
        return $FakeTags;
    }

mistake #1: using a regex. you should be using [DOM](http://php.net/dom). Though, since you're dealing with non-html "tags", then probably htmlpurifier would be a better choice. — Marc B, Feb 02 '15 at 19:36
This code is part of an HTML parser engine and I don't want to use any third-party class, for security reasons. DOM is not an optimal way to parse many big strings (processing time issue). — DarkMaze, Feb 02 '15 at 19:41
Any hidden or unwanted bugs, or any problem in their future updates! This simple code works for me so far, but I'm not an expert in regex, so a simple change to this regex could solve my problem instead of landing me in a bigger one! — DarkMaze, Feb 02 '15 at 19:56
The purpose of this code is to split a custom HTML Template to an array, for example: `
` to `[0] => '
', [tag1] => NULL, [2] => '
` — DarkMaze, Feb 02 '15 at 20:09
I found a solution: First remove any comments with this regex: `` Then extract non-HTML tags with: `(<![^>]*[^\/]>)` — DarkMaze, Feb 02 '15 at 20:31

score 1 · Accepted Answer · edited May 23 '17 at 11:57

"I'm looking for better regular expression to skip entire content between <!-- and -->"

To skip something, use (*SKIP)(*F). Put the part you want skipped, followed by (*SKIP)(*F)|, before your pattern:

/<!--(?s:.*?)-->(*SKIP)(*F)|(<![^>]*[^\/]>)/i

I didn't modify your actual regex. Regex101 is good for testing; also see the Regex FAQ :)
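As a minimal sketch of how this pattern behaves with preg_split and preg_match_all, assuming a made-up sample string (the input and the variable names below are hypothetical, not taken from the question):

```php
<?php
// Hypothetical sample input: two fake tags outside a comment, two inside one.
$content = '<p>Hi</p><!Tag1><!-- <!Tag2> <!Tag3> --><div><!Tag4></div>';

// The first alternative consumes a whole comment and then fails on purpose;
// (*SKIP) makes the engine resume after the comment, so only the second
// alternative (the original pattern) can produce a successful match.
$pattern = '/<!--(?s:.*?)-->(*SKIP)(*F)|(<![^>]*[^\/]>)/i';

// Same flags as in the question: keep the delimiters, drop empty pieces.
$parts = preg_split($pattern, $content, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

// Collect only the fake tags (capture group 1) with the same pattern.
preg_match_all($pattern, $content, $matches);
$tags = $matches[1];

print_r($parts);
print_r($tags);
```

With this input, only <!Tag1> and <!Tag4> should come back as tags. The comment is not removed; it stays inside an ordinary text piece that still begins with <!, so the <!-- check in the loop from the question remains useful.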

answered Feb 02 '15 at 20:30

Jonny 5

thanx, that's what i'm looking for! – DarkMaze Feb 02 '15 at 20:36
Is it a better solution to skip comments with `(*SKIP)(*F)|(<![^>]*[^\/]>)`, according to [this link](http://stackoverflow.com/questions/19676024/using-regular-expression-remove-html-comments-from-content)? – DarkMaze Feb 02 '15 at 20:41
@DarkMaze I used the `s` [flag](http://php.net/manual/en/reference.pcre.pattern.modifiers.php) inline, ...`(?s:`..., to make the dot also match newlines. Here it's the same as adding `s` at the end, ...`/is`, and changing the first part to `<!--.*?-->(*SKIP)(*F)` – Jonny 5 Feb 02 '15 at 20:44
@DarkMaze All these variants do the same: matching as few characters as possible (including newlines) between `<!--` and `-->` without capturing. 1.) `//` 2.) `//` 3.) `//` 4.) `//s` 5.) `//Us` 6.) `//` They are just different notations. Yours possibly needs a few more steps because of the alternation. – Jonny 5 Feb 02 '15 at 21:03
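
The point about equivalent notations can be checked with a small sketch. The three variants below are assumptions chosen for illustration (the inline `(?s:...)` group, the trailing `s` flag, and a `[\s\S]` class), not necessarily the ones listed in the comment, and the sample string is made up:

```php
<?php
// Hypothetical input with a multi-line comment that hides one fake tag.
$content = "a<!-- line1\nline2 <!Hidden> -->b<!Shown>";

// Three assumed notations for "any character, including newlines, lazily".
$variants = [
    'inline (?s:...)' => '/<!--(?s:.*?)-->(*SKIP)(*F)|(<![^>]*[^\/]>)/i',
    'trailing s flag' => '/<!--.*?-->(*SKIP)(*F)|(<![^>]*[^\/]>)/is',
    'class [\s\S]'    => '/<!--[\s\S]*?-->(*SKIP)(*F)|(<![^>]*[^\/]>)/i',
];

$results = [];
foreach ($variants as $label => $pattern) {
    preg_match_all($pattern, $content, $m);
    $results[$label] = $m[1]; // fake tags found outside comments
}

// Contrast: without dot-matches-newline the comment alternative cannot span
// the line break, so the second alternative matches from "<!--" instead.
preg_match_all('/<!--.*?-->(*SKIP)(*F)|(<![^>]*[^\/]>)/i', $content, $m);
$withoutS = $m[1];

print_r($results);
print_r($withoutS);
```

All three variants should report only <!Shown>, while the contrast pattern also picks up a bogus match that starts at the comment opener.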
