This should be easy, but somehow I can't figure it out. I have an HTML snippet like this one:

<p style="padding:0 10 20 30; margin: 1 2 3 4 ">This is 201 some 20 text 1 <b>30</b> with some numbers 30 20</p> ...

I need to match the numbers 1, 20, and 30 (only those, so not the 20 in 201) and replace them with links. Obviously, I do not want to replace numbers inside tags, such as the values in the style attribute.

The output should be:

<p style="padding:0 10 20 30; margin: 1 2 3 4 ">This is 201 some <a href="#20">20</a> text <a href="#1">1</a> <b><a href="#30">30</a></b> with some numbers <a href="#30">30</a> <a href="#20">20</a></p> ...

This is what I have:

$text = '<p style="padding:0 10 20 30; margin: 1 2 3 4 ">This is 201 some 20 text 1 <b>30</b> with some numbers 30 20</p> ...';

$pat[]  = '/(?<=\>)([^<]*)([^0-9\:])(1|20|30)([^0-9])/s';
$repl[] = '$1$2<a href="#$3" class="p2">$3</a>$4';
echo preg_replace($pat, $repl, $text);

It works, but it matches only one number at a time, and I do not want to run it in a loop.

Any ideas?

--

I see the point of using an HTML parser; however, this seems like something that can be done with a regexp, especially since there is no standard library for parsing HTML in PHP, and I'm not sure I want to import a third-party HTML parser just for this task. Any attempts at fixing my regex?
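
One possible direction, as a minimal sketch rather than a definitive fix: instead of consuming the surrounding text (which is why the original pattern can only match one number per text run), match each number on its own and use a negative lookahead to reject matches that sit inside a tag. The function name linkNumbers is hypothetical, and the pattern assumes that no literal ">" appears in the visible text and that the numbers are not already wrapped in links.

```php
<?php
// Sketch only: link the whole numbers 1, 20 and 30 in a single pass,
// skipping anything that sits inside a tag. linkNumbers is a made-up name.
function linkNumbers(string $html): string
{
    // \b(1|20|30)\b : match 1, 20 or 30 only as whole numbers (not the "20" in "201")
    // (?![^<>]*>)   : reject the match if the next angle bracket is ">",
    //                 which means the number is inside a tag (e.g. an attribute)
    $pattern = '/\b(1|20|30)\b(?![^<>]*>)/';

    // preg_replace() replaces every match in one call, so no loop is needed.
    return preg_replace($pattern, '<a href="#$1">$1</a>', $html);
}

$text = '<p style="padding:0 10 20 30; margin: 1 2 3 4 ">This is 201 some 20 text 1 <b>30</b> with some numbers 30 20</p> ...';
echo linkNumbers($text), "\n";
```

On the sample snippet this gives the output requested above. It remains a heuristic: a ">" in the text, numbers inside an existing <a> element, or script and style content would need extra handling.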

--

I managed to write a regexp that works in my case, if anyone is interested:

$pat[] = '/>(([^<]*)(([^0-9\:]))|())(1|20|30)(?(?=[<]+?)(?!<\/a>)|(([^0-9\<])([^<]*)<(?!\/a>)))/sU';
$repl[] = '>$1<a href="#$6" class="p22">$6</a>$7';

I know very well that this can be easily accomplished with an HTML parser, but I do not want to include third-party parsers in my software.

Regards, Philia

3 Answers

Regular expressions are meant to parse regular languages, that is, those that can be described by finite automata. HTML is not a regular language. Parsing HTML with regular expressions is the Cthulhu way; see "Parsing Html The Cthulhu Way".

Cœur
Alex Weinstein

It is really simple: extract only the text with an HTML parser, then use regular expressions on that.
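
A rough sketch of that approach, assuming PHP's bundled DOM extension (DOMDocument and DOMXPath) is available, so no third-party parser is needed. The function name linkNumbersDom and the wrapper id "__wrap" are made up for illustration, and the input is assumed to be plain ASCII (other encodings need extra care with loadHTML).

```php
<?php
// Sketch only: visit the text nodes with the DOM extension and wrap the
// whole numbers 1, 20 and 30 in links. Names here are illustrative.
function linkNumbersDom(string $html): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    // Wrap the fragment in a known container so it can be serialised back
    // without the <html>/<body> scaffolding that loadHTML() adds.
    $doc->loadHTML('<div id="__wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $wrap  = $xpath->query('//div[@id="__wrap"]')->item(0);

    // Collect the text nodes first, because the tree is modified below.
    // Text that is already inside a link is skipped.
    $textNodes = [];
    foreach ($xpath->query('.//text()[not(ancestor::a)]', $wrap) as $node) {
        $textNodes[] = $node;
    }

    foreach ($textNodes as $node) {
        // Split on the target numbers and keep them: odd indexes are numbers.
        $parts = preg_split('/\b(1|20|30)\b/', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) === 1) {
            continue; // nothing to link in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', '#' . $part);
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialise only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkNumbersDom('<p style="padding:0 10 20 30; margin: 1 2 3 4 ">This is 201 some 20 text 1 <b>30</b> with some numbers 30 20</p> ...'), "\n";
```

Because the regular expression only ever sees text node content, attribute values such as the style declaration are never candidates for replacement. The serialised markup may differ cosmetically from the input, since the parser normalises what it reads.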

Svante

HTML should not be parsed with regex because it's not a regular language. You might be able to do it with properly formed XHTML, but I wouldn't recommend it. See the most upvoted answer on SO.

Community
Malfist