RegEx: Matching a specific string that is not inside an HTML tag

Question

<tag value='botafogo'> botafogo is the best </tag>

I need to match only the botafogo in the text (botafogo is the best), not the 'botafogo' in the attribute value.

My program automatically "annotates" terms in plain text, turning:

botafogo is the best 

into

<team attr='best'>botafogo</team> is the best

But when I then "replace all" occurrences of the word "best", I have a big problem: the attribute value gets annotated too.

<team attr='<adjective>best</adjective>'>botafogo</team> is the <adjective>best</adjective>

P.S.: I'm using Java.

This can't be done reliably. Good luck coming up with a regex that even reliably matches a single HTML tag, much less things *not* in one. — Matchu, Mar 03 '10 at 02:40
**DO _NOT_ PARSE HTML USING Regular Expressions!** http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — SLaks, Mar 03 '10 at 02:42
Can you tell us more about the context where you need this functionality? What language you're using, where you get the input HTML from, etc? — polygenelubricants, Mar 03 '10 at 02:52

score 5 · Answer 1 · answered Mar 03 '10 at 02:40

The best way to accomplish this is to NOT use regular expressions and to use a proper HTML parser instead. HTML is not a regular language, and doing this with regular expressions will be tedious, hard to maintain, and more than likely still error-prone.

HTML parsers, on the other hand, are well suited to the job. Many of them are mature and reliable, and they take care of all the little details for you, which makes your life much easier.
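As a rough illustration of the parser approach in Java (the asker's language), here is a minimal sketch. It assumes the annotated text is well-formed XML, so the JDK's built-in DOM parser can read it; real-world HTML would need a dedicated HTML parser instead. The class name TextNodeAnnotator, the method names, and the synthetic <root> wrapper are all made up for this example. The idea is that attribute values are never text nodes, so replacing only inside text nodes leaves attr='best' alone.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper: annotates a word only where it occurs in text nodes.
class TextNodeAnnotator {

    // Recursively collects text nodes; attribute values are not child nodes,
    // so they are never visited here.
    static void collectTextNodes(Node node, List<Text> out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                out.add((Text) child);
            } else {
                collectTextNodes(child, out);
            }
        }
    }

    // Splits the text node around each occurrence of `word` and wraps the
    // occurrence in a new <tagName> element.
    static void wrap(Document doc, Text text, String word, String tagName) {
        int idx;
        while ((idx = text.getData().indexOf(word)) >= 0) {
            Text match = text.splitText(idx);            // starts at the word
            Text rest = match.splitText(word.length());  // everything after it
            Element wrapper = doc.createElement(tagName);
            match.getParentNode().replaceChild(wrapper, match);
            wrapper.appendChild(match);
            text = rest;                                 // keep scanning the tail
        }
    }

    static String annotate(String fragment, String word, String tagName) {
        try {
            // Assumption: the fragment is well-formed XML once wrapped in a root.
            String xml = "<root>" + fragment + "</root>";
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Collect first, then modify, so the tree is not changed mid-walk.
            List<Text> texts = new ArrayList<>();
            collectTextNodes(doc.getDocumentElement(), texts);
            for (Text t : texts) {
                wrap(doc, t, word, tagName);
            }

            Transformer tf = TransformerFactory.newInstance().newTransformer();
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter sw = new StringWriter();
            tf.transform(new DOMSource(doc), new StreamResult(sw));

            // Strip the synthetic root again (assumes a non-empty fragment).
            String out = sw.toString().trim();
            return out.substring("<root>".length(), out.length() - "</root>".length());
        } catch (Exception e) {
            throw new IllegalArgumentException("fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String annotated = "<team attr='best'>botafogo</team> is the best";
        System.out.println(annotate(annotated, "best", "adjective"));
    }
}
```

With the input from the question, only the trailing "best" ends up wrapped in <adjective>; the one inside attr is untouched. Note that the serializer may rewrite attribute quotes (single to double), so the output is equivalent markup rather than a character-for-character copy.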

polygenelubricants

*"While you can hack around these problems with more and more regular expression cleverness, you eventually paint yourself into a corner with complexity. Regular expressions don't truly understand the code that they are colorizing-- but parsers do."* -- http://www.codinghorror.com/blog/2005/04/parsing-beyond-regex.html – John K Mar 03 '10 at 02:42

score 4 · Answer 2 · answered Mar 03 '10 at 02:40

Have you considered using DOM functions instead of regex?

document.getElementsByTagName('tag')[0].innerHTML.match('botafogo')

YOU

score 1 · Answer 3 · answered Mar 03 '10 at 02:42

An HTML parser is best; parse the document, then cycle through the text contents. (See the other answers.)

If you're in PHP, a quick solution is to run strip_tags() on the content to remove the HTML first. Whether that works depends on what you're doing: if you're replacing, stripping first is not an option; if you're just matching, content that is not part of a match can be removed without concern.

Matchu

My program automatically "annotates" terms in plain text: botafogo is the best becomes <team attr='best'>botafogo</team> is the best, and when I "replace all" on the word "best", I have a big problem... <team attr='<adjective>best</adjective>'>botafogo</team> is the <adjective>best</adjective> – celsowm Mar 03 '10 at 02:54
Well. No good stripping, then. But I'll leave the answer for reference. – Matchu Mar 03 '10 at 03:00

score 0 · Answer 4 · answered Mar 03 '10 at 03:26

@OP, in your favourite language, split on </tag>, then split each piece on > and take the last part. For example, in Python:

>>> s="<tag value='botafogo'> botafogo is the best </tag>"
>>> for item in s.split("</tag>"):
...  if "<tag" in item:
...      print(item.split(">")[-1])
...
 botafogo is the best

No regex needed

score 0 · Answer 5 · answered Jan 16 '12 at 07:12

I was just looking for a solution to the same task, and created one that seems to do the job.

Negative lookahead is the key. To make sure the match is not inside a tag, look ahead from the match and check that no closing angle bracket appears before the next opening one. Suppose we want to find the word "needle":

#needle(?![^<]+>)#i

My case is in PHP, and looks something like this:

function filter_highlighter($content) {
    $patterns = array(
        '#needle(?![^<]+>)#i',
        '#<b>Need</b>le#',
        '#<strong>Need</strong>le#'
    );
    $replacement = '<span class="highlighted">Need</span>le';
    $content = preg_replace( $patterns, $replacement, $content);
    return $content;
}

So far it works.
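Since the question is about Java, here is a minimal sketch of the same lookahead idea with java.util.regex. The class name LookaheadAnnotator and the annotate method are made up for this example; the pattern is the one above, with Pattern.CASE_INSENSITIVE standing in for the #...#i flag.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the negative-lookahead heuristic.
class LookaheadAnnotator {

    // Wraps each occurrence of `word` in <tagName>...</tagName>, skipping
    // occurrences followed by a '>' with no '<' in between, i.e. occurrences
    // that appear to sit inside a tag. Assumes tagName contains no '$' or '\'.
    static String annotate(String text, String word, String tagName) {
        Pattern p = Pattern.compile(
                Pattern.quote(word) + "(?![^<]+>)", Pattern.CASE_INSENSITIVE);
        // $0 re-inserts the matched text, preserving its original case.
        String replacement = "<" + tagName + ">$0</" + tagName + ">";
        return p.matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String annotated = "<team attr='best'>botafogo</team> is the best";
        System.out.println(annotate(annotated, "best", "adjective"));
        // prints: <team attr='best'>botafogo</team> is the <adjective>best</adjective>
    }
}
```

This is a heuristic, not a parser, so the warnings in the comments above still apply: a literal > in the plain text or a < inside an attribute value will fool it. Also, because the pattern uses [^<]+ (at least one character), a word immediately followed by > (such as a tag name) is still matched; [^<]* would cover that case as well.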
