10

I am running preg_replace on an HTML page. My pattern is meant to wrap certain words in a surrounding tag. However, my regular expression sometimes modifies the HTML tags themselves. For example, when I try to replace this text:

<a href="example.com" alt="yasar home page">yasar</a>

so that yasar reads <span class="selected-word">yasar</span>, my regular expression also replaces yasar in the alt attribute of the anchor tag. The preg_replace() call I am currently using looks like this:

preg_replace("/(asf|gfd|oyws)/", '<span class=something>${1}</span>',$target);

How can I write a regular expression that doesn't match anything inside an HTML tag?

yasar
  • @MarcB for once, I think this is a valid regex problem. It's hard to easily do what the OP wants with a DOM parser. He just needs to know how not to match words that are within quotes. – Alex Turpin Oct 25 '11 at 15:39
  • @Xeon: still a bad idea. Use dom/xpath to get the textnodes, then manipulate them individually. It's the only 100% reliable method to make sure you're dealing only with "relevant" text and not some wonky subchunk of a badly formed tag that matched. – Marc B Oct 25 '11 at 15:47
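
The DOM/XPath route suggested in the comment above could look roughly like the sketch below. The function name, the wrapper div and its id are illustrative assumptions, and the code assumes PHP 7.4+ with the DOM extension and ASCII or otherwise correctly declared input encoding. It visits text nodes only, so attribute values such as alt are never touched.

```php
<?php
// Hypothetical helper (not from the thread): wraps words found in text nodes only.
function wrapWordsInTextNodes(string $html, array $words, string $class): string
{
    if ($words === []) {
        return $html;
    }

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
    // Wrap the fragment in a known container so it can be extracted again later.
    $doc->loadHTML('<div id="wrap-root">' . $html . '</div>');
    libxml_clear_errors();

    // One capture group, so preg_split returns text and matches alternately.
    $quoted  = array_map(fn (string $w): string => preg_quote($w, '~'), $words);
    $pattern = '~(' . implode('|', $quoted) . ')~';

    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//text()')) as $textNode) {
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;                    // no word in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {          // odd indexes hold the matched words
                $span = $doc->createElement('span');
                $span->setAttribute('class', $class);
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of the wrapper, not the html/body libxml adds.
    $root = $xpath->query('//div[@id="wrap-root"]')->item(0);
    $out  = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = '<a href="example.com" alt="yasar home page">yasar</a>';
echo wrapWordsInTextNodes($html, ['yasar'], 'selected-word'), PHP_EOL;
```

This is more code than a single preg_replace, but it does not depend on how the tags happen to be written.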

4 Answers

20

You can use an assertion for that: you just have to ensure that the searched words occur somewhere after a >, or before any <. The latter test is easier to accomplish, because lookahead assertions can be variable-length (in PCRE, lookbehind assertions cannot):

/(asf|foo|barr)(?=[^>]*(<|$))/

See also http://www.regular-expressions.info/lookaround.html for a nice explanation of that assertion syntax.
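
As a minimal sketch, here is that lookahead applied to the anchor from the question. The word list and the variable names are assumptions chosen for illustration. A word is accepted only if a < (or the end of the input) can be reached without crossing a >, which is not the case for words inside an open tag.

```php
<?php
// Sample input taken from the question; the word list is made up.
$html = '<a href="example.com" alt="yasar home page">yasar</a>';

// [^>]* may run up to the next "<" or the end of the string, but never across
// a ">". Inside a tag the next bracket is ">", so the lookahead fails there.
$pattern = '/(yasar|home)(?=[^>]*(<|$))/';

$result = preg_replace($pattern, '<span class="selected-word">$1</span>', $html);

echo $result, PHP_EOL;
// <a href="example.com" alt="yasar home page"><span class="selected-word">yasar</span></a>
```

Note that text containing a literal > before the next tag (for example "yasar > home <b>x</b>") would not be matched either, since the lookahead only looks at brackets.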

mario
  • Yada yada, silly bobince answer... -- Yes, that's not quite correct. This regex only works for XML/XHTML, and only without CDATA edge cases etc. But even in real-world HTML you don't see angle brackets in attributes. So, workable as basic solution. – mario Oct 25 '11 at 15:49
  • I am getting `Compilation failed: lookbehind assertion is not fixed length at offset 27` when trying to run your regexp. Maybe you missed something? – yasar Oct 25 '11 at 15:55
  • Try again. Code edited since. (There was a `?<=` where the `?=` should have been.) – mario Oct 25 '11 at 15:56
  • I don't have any idea how this worked because I am new to the concept of lookaheads, but it worked. Thanks :) – yasar Oct 25 '11 at 15:58
  • seems like it only change one string multiple times. for example, I have a string "....AAA...BBB...AAA...", only AAA will be changed, but not BBB. why is that? – likeforex.com Jul 21 '12 at 02:29
  • mario, how to get in touch with u? i want to use ur solution, but know almost nothing about it – likeforex.com Jul 22 '12 at 18:24
  • 1
    @likeforex.com: We don't do personal support here, and SO is not a forum; discussing a different topic in between is not provided for. Especially if inquiries are that vague. ("What have you tried?"). I have no clue what you want. -- For help see also [Open source RegexBuddy alternatives](http://stackoverflow.com/questions/89718/is-there) and [Online regex testing](http://stackoverflow.com/questions/32282/regex-testing) for some helpful tools, or [RegExp.info](http://regular-expressions.info/) for a nicer tutorial. – mario Jul 22 '12 at 18:28
  • that's cool. the thing is when I tried to use your solution. it can only replace one string with a link, while a glossary can have a few different links. – likeforex.com Jul 22 '12 at 18:36
  • for example i have a text string like AAA BBB CCC DDD EEE F G H, i want to replace to AAA BBB CCC DDD EEE F G H. your solution only works for one link! Is that right? – likeforex.com Jul 22 '12 at 18:39
  • @likeforex.com: Yes it only works for one link. No, it cannot be fixed. – mario Jul 22 '12 at 18:41
  • this is what i have. the terms are from a loop and tried to replace whatever in the loop but not inside a link. preg_replace("~($term)(?=[^>]*( – likeforex.com Jul 22 '12 at 18:42
  • is it possible to look for anything not inside ... in the lookaround part? – likeforex.com Jul 22 '12 at 18:44
9

Yasar, resurrecting this question because it had another solution that wasn't mentioned.

Instead of just checking that the next tag character is an opening tag, this solution skips all <full tags>.

With all the disclaimers about using regex to parse html, here is the regex:

<[^>]*>(*SKIP)(*F)|word1|word2|word3

Here is a demo. In code, it looks like this:

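// Sample input: word2 appears once inside the <a ...> tag and once as text.
$target = "word1 <a skip this word2 >word2 again</a> word3";
// Whole tags are matched and discarded first, so only words outside tags remain.
$regex = "~<[^>]*>(*SKIP)(*F)|word1|word2|word3~";
// \0 stands for the whole match.
$repl = '<span class="">\0</span>';
$new = preg_replace($regex, $repl, $target);
echo htmlentities($new);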

Here is an online demo of this code.
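
For a word list that is only known at run time, the same pattern can be built in one go, so that all terms are replaced in a single pass. The helper below is a hypothetical sketch (its name and parameters are not from the answer) and assumes PHP 7.4+ for the arrow functions.

```php
<?php
// Hypothetical helper: wraps each listed word in a span, skipping anything
// that sits inside a complete <...> tag.
function wrapWordsOutsideTags(string $html, array $words, string $class): string
{
    if ($words === []) {
        return $html;   // an empty alternation would match everywhere
    }

    // Escape each word so regex metacharacters in it are treated literally.
    $quoted = array_map(fn (string $w): string => preg_quote($w, '~'), $words);

    // <[^>]*> consumes a whole tag; (*SKIP)(*F) then throws that match away
    // and resumes after the tag, leaving only words outside tags to match.
    $pattern = '~<[^>]*>(*SKIP)(*F)|' . implode('|', $quoted) . '~';

    return preg_replace_callback(
        $pattern,
        fn (array $m): string =>
            '<span class="' . htmlspecialchars($class) . '">' . $m[0] . '</span>',
        $html
    );
}

$html = 'yasar > home <a href="example.com" alt="yasar home page">yasar</a>';
echo wrapWordsOutsideTags($html, ['yasar', 'home'], 'selected-word'), PHP_EOL;
```

In this sample the words before the stray > are still wrapped, because only complete tags are skipped; a check that relies on the next bracket character would pass over them.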

Reference

  1. How to match pattern except in situations s1, s2, s3
  2. How to match a pattern unless...
zx81
0

This might be the kind of thing that you're after: http://snipplr.com/view/3618/ In general, though, I'd advise against this kind of approach. A better alternative is to strip out all HTML tags and rely on BBCode instead, such as:

[b]bold text[/b] [i]italic text[/i]

However I appreciate that this might not work well with what you're trying to do.

Another option may be HTML Purifier, see: http://htmlpurifier.org/

0

Off the top of my head, this should work:

echo preg_replace("/<(.*)>(.*)<\/(.*)>/i","<$1><span class=\"some-class\">$2</span></$3>",$target);

But, I don't know how safe this would be. I am just presenting a possibility :)

bosniamaj