regex - match not in tag

Question

This should be easy, but somehow I can't figure it out. I have an HTML snippet like this one:

<p style="padding:0 10 20 30; margin: 1 2 3 4 ">This is 201 some 20 text 1 <b>30</b> with some numbers 30 20</p> ...

I need to match the numbers 1, 20, and 30 (only those) and replace them with links. Obviously, I do not want to replace numbers inside a tag.

The output should be:

<p style="padding:0 10 20 30; margin: 1 2 3 4 ">This is 201 some <a href="#20">20</a> text <a href="#1">1</a> <b><a href="#30">30</a></b> with some numbers <a href="#30">30</a> <a href="#20">20</a></p> ...

This is what I have:

$text = '<p style="padding:0 10 20 30; margin: 1 2 3 4 ">This is 201 some 20 text 1 <b>30</b> with some numbers 30 20</p> ...';

$pat[]  = '/(?<=\>)([^<]*)([^0-9\:])(1|20|30)([^0-9])/s';
$repl[] = '$1$2<a href="#$3" class="p2">$3</a>$4';
echo preg_replace($pat, $repl, $text);

It works, but it matches only one number at a time, and I do not want to run it in a loop.

Any ideas?

--

I see the point of using an HTML parser; however, this seems like something that can be done with a regexp, especially since there is no standard library for parsing HTML in PHP and I'm not sure I want to import a third-party HTML parser just for this task. Would anyone attempt to fix my regex?

--

I managed to write a regexp that works in my case. If anyone is interested:

$pat[]  = '/>(([^<]*)(([^0-9\:]))|())(1|20|30)(?(?=[<]+?)(?!<\/a>)|(([^0-9\<])([^<]*)<(?!\/a>)))/sU';
$repl[] = '>$1<a href="#$6" class="p22">$6</a>$7';

I know very well that this can easily be accomplished with an HTML parser, but I do not want to include third-party parsers in my software.

Regards, Philia

score 1 · Answer 1 · edited Oct 21 '18 at 10:38

Regular expressions are meant to parse regular languages, that is, those that can be recognized by finite automata. HTML is not a regular language. Parsing HTML with regular expressions is the Cthulhu way: see "Parsing Html The Cthulhu Way".

edited Oct 21 '18 at 10:38

Cœur

answered Dec 02 '09 at 20:46

Alex Weinstein

score 1 · Answer 2 · answered Dec 02 '09 at 20:46

It is really simple: extract only the text with an HTML parser, then use regular expressions on that.
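
A minimal sketch of that approach, assuming PHP's bundled DOM extension (DOMDocument and DOMXPath) is an acceptable parser for the asker. The function name linkNumbers and the wrapper id "wrap" are made up for illustration and are not part of the original post. The regular expression only ever sees text nodes, so attribute values such as the style padding are never touched, and every number in a text node is handled in one pass.

```php
<?php
// Hypothetical helper: wraps standalone 1, 20, or 30 in links,
// looking only at text nodes (never at tag names or attributes).
function linkNumbers(string $html): string
{
    libxml_use_internal_errors(true);      // fragments can trigger parser warnings
    $doc = new DOMDocument();
    // Assumption: wrap the fragment in a known container so it can be
    // located and serialized again afterwards.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy the node list to an array first, because the loop modifies the tree.
    // Text that is already inside a link is skipped.
    $textNodes = iterator_to_array(
        $xpath->query('//div[@id="wrap"]//text()[not(ancestor::a)]')
    );

    foreach ($textNodes as $node) {
        // Odd-indexed parts are the captured numbers, even-indexed parts are plain text.
        $parts = preg_split(
            '/(?<![0-9])(1|20|30)(?![0-9])/',
            $node->nodeValue,
            -1,
            PREG_SPLIT_DELIM_CAPTURE
        );
        if (count($parts) === 1) {
            continue;                      // no target number in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $link = $doc->createElement('a');
                $link->setAttribute('href', '#' . $part);
                $link->appendChild($doc->createTextNode($part));
                $fragment->appendChild($link);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper, not the added html/body shell.
    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$text = '<p style="padding:0 10 20 30; margin: 1 2 3 4 ">This is 201 some 20 text 1 <b>30</b> with some numbers 30 20</p> ...';
echo linkNumbers($text), PHP_EOL;
```

The parser may normalize the markup slightly when it serializes it again, so the output is not guaranteed to be byte-identical to the input outside the inserted links.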

answered Dec 02 '09 at 20:46

Svante

score 0 · Answer 3 · edited May 23 '17 at 12:19

HTML should not be parsed with regex because it's not a regular language. You might be able to do it for properly formed XHTML, but I wouldn't recommend it. See the most upvoted answer on SO.
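
For the well-formed case mentioned above, a regex-only version can be sketched as follows. It is an illustration under loud assumptions, not a general solution: it assumes no ">" appears inside attribute values and that the input has no comments, scripts, or CDATA sections containing "<". The function name linkNumbersRegexOnly is hypothetical. The input is split into alternating text and tag pieces, and the replacement runs only on the text pieces, so all numbers are handled without an explicit retry loop.

```php
<?php
// Hypothetical helper: regex-only variant for simple, well-formed markup.
function linkNumbersRegexOnly(string $html): string
{
    // The capture group keeps the tags in the result:
    // even indexes hold text, odd indexes hold tags.
    $pieces = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $insideLink = false;
    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 1) {
            // A tag: leave it unchanged, but track whether a link is open
            // so that existing link text is not linked a second time.
            if (preg_match('/^<a\b/i', $piece)) {
                $insideLink = true;
            } elseif (preg_match('/^<\/a\s*>/i', $piece)) {
                $insideLink = false;
            }
            continue;
        }
        if (!$insideLink) {
            // Lookarounds ensure the number is not part of a longer number (e.g. 201).
            $pieces[$i] = preg_replace(
                '/(?<![0-9])(1|20|30)(?![0-9])/',
                '<a href="#$1">$1</a>',
                $piece
            );
        }
    }
    return implode('', $pieces);
}

$text = '<p style="padding:0 10 20 30; margin: 1 2 3 4 ">This is 201 some 20 text 1 <b>30</b> with some numbers 30 20</p> ...';
echo linkNumbersRegexOnly($text), PHP_EOL;
```

Anything that breaks the stated assumptions, such as a ">" inside an attribute value, will make the split misidentify tag boundaries, which is the reason a parser is usually recommended instead.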

edited May 23 '17 at 12:19

Community

answered Dec 02 '09 at 21:05

Malfist
