I want to replace all characters matching a pattern in an HTML document, except those inside HTML tags. How can I do this with a regex in Perl or sed?

Example: replace every "a" with "b", but not when the "a" is inside an HTML tag such as <a href="aaa">.

Andy Lester
Tom
  • Use an HTML parser. See http://stackoverflow.com/q/1732348/372239 – Toto Nov 28 '13 at 11:56
  • You *really must* use an HTML parser. You may find something that you *think* works, but it will break later on and you will have *no idea* where to look to find the bug. – Borodin Nov 28 '13 at 14:04
  • With sed, there are so many exceptions to deal with that the result would be very hard to manage (and understand). – NeronLeVelu Nov 28 '13 at 14:48

2 Answers

As pointed out in the comments, an HTML parser is the ideal solution for your problem. However, if for whatever reason you do want to use a regex, the following negative lookahead will work on simple, well-formed markup:

a(?![^<]*>)

Working example on RegExr and the same for input.

And in Perl:

$var = "salut <a href='a.html'></a> ah ha <a href='about.asp' /> animal";
#        ^     ^       ^         ^  ^   ^  ^       ^     ^       ^   ^
$var =~ s/a(?![^<]*>)/b/g;
print $var;

Output:

sblut <a href='a.html'></a> bh hb <a href='about.asp' /> bnimbl
 ^                          ^   ^                        ^   ^
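To make the behaviour of the lookahead concrete, here is a minimal sketch that wraps the same substitution in a helper. The name `replace_outside_tags` is hypothetical and not part of the answer. The second call illustrates a limitation: a literal `>` in ordinary text makes the preceding "a" look as if it were inside a tag, so it is left unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): replace "a" with "b" unless the
# "a" is followed by a ">" with no "<" in between, i.e. it looks tag-internal.
sub replace_outside_tags {
    my ($html) = @_;
    (my $out = $html) =~ s/a(?![^<]*>)/b/g;
    return $out;
}

print replace_outside_tags(q{a cat <a href="aaa">map</a>}), "\n";
# prints: b cbt <a href="aaa">mbp</a>

print replace_outside_tags(q{a > b and <i>a</i>}), "\n";
# prints: a > b bnd <i>b</i>   (the first "a" is missed because of the bare ">")
```

For input that may contain unescaped `>` in text, this pattern alone is probably not enough.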
OGHaza
  • Might I direct you to this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Vector Gorgoth Nov 28 '13 at 17:17
  • @VectorGorgoth Thanks for the downvote. This answer is correct given the OP asked for a regex. I have seen that link posted many times, I do not need reminding. For simple applications I see no reason not to use a regex; the question does not concern the structure of the HTML. If the question was: I want to replace all `A`s in a sentence but only when not enclosed in brackets `"hallo [not this a] but this a"` would you still have downvoted me? Because the problem being solved is exactly the same. Feel free to post a working solution that uses an HTML parser and I will delete this answer. – OGHaza Nov 28 '13 at 17:27
  • Using regexes to parse HTML is dangerous and stupid. You'll notice all the replies to the question in the comments amount to, "DON'T USE REGEXES". – Vector Gorgoth Nov 28 '13 at 17:31
  • @VectorGorgoth I have provided a solution to a problem. If you ever have this problem yourself I advise you to use an HTML parser; if the OP wants a quick and dirty one-liner (literally 17 characters), my solution 100% fulfils his specified needs. I have gone ahead and added a disclaimer to the top of my post, I hope this satisfies you. – OGHaza Nov 28 '13 at 18:25
  • Okay, okay. Undownvoted. I still think it's better to avoid giving "quick and dirty" solutions because in my experience NO "one-off" solution has any guarantee of remaining that way. – Vector Gorgoth Nov 29 '13 at 20:52
Resurrecting this ancient question because it had a simple solution that wasn't mentioned.

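With all the disclaimers about using regex to parse HTML, here is a simple way to do it: match whole tags first and leave them alone, and capture the target character only when it appears outside a tag.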

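#!/usr/bin/perl
use strict;
use warnings;

# Match a tag first (left alone); otherwise capture a bare "a" in group 1.
my $regex   = qr/<[^>]*|(a)/;
my $subject = 'aig arother <a href="aaa">';
# If group 1 matched, emit "b"; otherwise put the matched tag text back unchanged.
(my $replaced = $subject) =~ s/$regex/defined $1 ? 'b' : $&/eg;
print "$replaced\n";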

See this live demo
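The same "match what you want to skip, capture what you want to change" idea can be sketched for an arbitrary pattern and replacement. The helper name `replace_in_text` and its parameters are hypothetical, introduced only for illustration. Because the tag branch is tried first at each position, a bare `>` in text does not confuse it, although a bare `<` in text would still be treated as the start of a tag.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $pattern with $replacement outside of tags.
# Branch 1 consumes a tag (put back unchanged); branch 2 captures the target in $1.
sub replace_in_text {
    my ($html, $pattern, $replacement) = @_;
    (my $out = $html) =~ s/<[^>]*|($pattern)/defined $1 ? $replacement : $&/eg;
    return $out;
}

print replace_in_text(q{cat <img alt="cat"> cat}, qr/cat/, 'dog'), "\n";
# prints: dog <img alt="cat"> dog

print replace_in_text(q{a > b}, qr/a/, 'b'), "\n";
# prints: b > b
```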

Reference

How to match pattern except in situations s1, s2, s3

How to match a pattern unless...

Community
zx81