Replace characters in an HTML document that match a regex, except those inside tags

Question

I want to replace all characters matching a pattern in an HTML document, except those inside HTML tags. How do you do this with a regex using Perl or sed?

Example: replace all "a" with "b" but not if "a" is in an HTML tag like <a href="aaa">.

Use an HTML parser. See http://stackoverflow.com/q/1732348/372239 – Toto, Nov 28 '13 at 11:56
You *really must* use an HTML parser. You may find something that you *think* works, but it will break later on and you will have *no idea* where to look to find the bug. – Borodin, Nov 28 '13 at 14:04
With sed, you would have so many exceptions to deal with that it would be very hard to manage (and understand). – NeronLeVelu, Nov 28 '13 at 14:48

OGHaza · Answer 1 · 2013-11-28T18:28:17.427

2

As pointed out in the comments, an HTML parser is the ideal solution for your problem. However, if for whatever reason you do want to use a regex, the following will work:

a(?![^<]*>)

The negative lookahead rejects any a that is followed by a > with no < in between, which is the situation of an a sitting inside a tag.

Working example on RegExr and the same for input.

And in Perl:

$var = "salut <a href='a.html'></a> ah ha <a href='about.asp' /> animal";
#        ^     ^       ^         ^  ^   ^  ^       ^     ^       ^   ^
$var =~ s/a(?![^<]*>)/b/g;
print $var;

Output:

sblut <a href='a.html'></a> bh hb <a href='about.asp' /> bnimbl
 ^                          ^   ^                        ^   ^
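
The warnings in the comments can be made concrete. The following is only a sketch, assuming a hypothetical helper name (replace_outside_tags) and made-up sample strings; it wraps the same substitution and runs it on one well-formed input and two inputs where the lookahead is fooled.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer) wrapping the same
# substitution: replace "a" unless a ">" follows with no "<" in between.
sub replace_outside_tags {
    my ($html) = @_;
    (my $out = $html) =~ s/a(?![^<]*>)/b/g;
    return $out;
}

# Made-up inputs chosen to stress the lookahead.
my @samples = (
    "a cat <a href='a.html'>a</a>",      # well-formed: only text is changed
    "a > b and a cat",                   # bare ">" in text: first "a" is skipped
    "<img alt='a < b' src='a.png'> a",   # "<" inside an attribute value
);

print replace_outside_tags($_), "\n" for @samples;
```

The first sample comes out as b cbt <a href='a.html'>b</a>, which is the intended result. The second leaves its leading a untouched because the bare > makes it look like tag content, and the third rewrites alt to blt because the < inside the attribute value hides the closing > from the lookahead.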

edited Nov 28 '13 at 18:28

answered Nov 28 '13 at 12:07

OGHaza


Might I direct you to this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Vector Gorgoth Nov 28 '13 at 17:17

@VectorGorgoth Thanks for the downvote. This answer is correct given the OP asked for a regex. I have seen that link posted many times, I do not need reminding. For simple applications I see no reason not to use a regex; the question does not concern the structure of the HTML. If the question was: I want to replace all `A`s in a sentence but only when not enclosed in brackets `"hallo [not this a] but this a"`, would you still have downvoted me? Because the problem being solved is exactly the same. Feel free to post a working solution that uses an HTML parser and I will delete this answer. – OGHaza Nov 28 '13 at 17:27
Using regexes to parse HTML is dangerous and stupid. You'll notice all the replies to the question in the comments amount to, "DON'T USE REGEXES". – Vector Gorgoth Nov 28 '13 at 17:31

@VectorGorgoth I have provided a solution to a problem. If you ever have this problem yourself I advise you to use an HTML parser; if the OP wants a quick and dirty one-liner (literally 17 characters), my solution 100% fulfils his specified needs. I have gone ahead and added a disclaimer to the top of my post, I hope this satisfies you. – OGHaza Nov 28 '13 at 18:25
Okay, okay. Undownvoted. I still think it's better to avoid giving "quick and dirty" solutions because in my experience NO "one-off" solution has any guarantee of remaining that way. – Vector Gorgoth Nov 29 '13 at 20:52

score 0 · Answer 2 · edited May 23 '17 at 10:25

Resurrecting this ancient question because it had a simple solution that wasn't mentioned.

With all the disclaimers about using regex to parse HTML, here is a simple way to do it.

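#!/usr/bin/perl
use strict;
use warnings;

# Left alternative swallows a tag; right alternative captures an "a" outside tags.
my $regex   = '<[^>]*|(a)';
my $subject = 'aig arother <a href="aaa">';
# Replace only when group 1 matched; otherwise put the matched tag text back as-is.
(my $replaced = $subject) =~ s/$regex/defined $1 ? "b" : $&/eg;
print $replaced . "\n";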

See this live demo
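
The idea is that the left alternative consumes each tag first, so the capturing right alternative can only ever match in the text between tags. A minimal sketch of the same trick as a reusable routine follows; the helper name replace_in_text and its parameters are hypothetical, and it assumes the caller's pattern has no capture groups of its own, because group 1 is what distinguishes a text match from a tag.

```perl
use strict;
use warnings;

# Hypothetical helper (name and signature are illustrative only).
# Replaces matches of $pattern with $replacement, but only outside
# anything that starts with "<" and runs up to the next ">".
sub replace_in_text {
    my ($html, $pattern, $replacement) = @_;
    (my $out = $html) =~ s{<[^>]*|($pattern)}{
        # Group 1 is defined only when the right alternative matched.
        defined $1 ? $replacement : $&
    }eg;
    return $out;
}

print replace_in_text('aig arother <a href="aaa">a</a>', qr/a/, 'b'), "\n";
```

For this input the sketch prints big brother <a href="aaa">b</a>. Like the original, it treats every < as the start of a tag, so a literal < in the text would shield whatever follows it, up to the next >, from replacement.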

Reference

How to match pattern except in situations s1, s2, s3

How to match a pattern unless...
