4
<tag value='botafogo'> botafogo is the best </tag>

I need to match only the botafogo in the text (... is the best), not the 'botafogo' in the attribute value.

My program automatically "annotates" terms in plain text, turning:

botafogo is the best 

into

<team attr='best'>botafogo</team> is the best 

and when I "replace all" occurrences of the word "best", I have a big problem: the word inside the attribute value gets annotated too.

<team attr='<adjective>best</adjective>'>botafogo</team> is the <adjective>best</adjective>
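The failure above can be reproduced in a few lines. This is only a sketch in Python (the question is about Java, where a plain `String.replace` does the same literal replace-all); the variable names are made up for illustration.

```python
# Text after the first annotation pass (taken from the question).
annotated = "<team attr='best'>botafogo</team> is the best"

# A plain "replace all" has no idea which occurrences sit inside markup,
# so it also rewrites the attribute value.
naive = annotated.replace("best", "<adjective>best</adjective>")

print(naive)
```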

P.S.: I'm using Java.

celsowm
  • 2
    This can't be done reliably. Good luck coming up with a regex that even reliably matches a single HTML tag, much less things *not* in one. – Matchu Mar 03 '10 at 02:40
  • 4
    **DO _NOT_ PARSE HTML USING Regular Expressions!** http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – SLaks Mar 03 '10 at 02:42
  • Can you tell us more about the context where you need this functionality? What language you're using, where you get the input HTML from, etc? – polygenelubricants Mar 03 '10 at 02:52

5 Answers

5

The best way to accomplish this is NOT to use regular expressions but a proper HTML parser. HTML is not a regular language, and a regex-based solution will be tedious to write, hard to maintain, and more than likely still wrong in some cases.

HTML parsers, on the other hand, are well suited to the job. Many of them are mature and reliable; they separate markup from text for you, so you can apply your replacement to the text only.
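As a rough illustration of the parser approach, here is a minimal sketch using Python's standard `html.parser` module (the asker is on Java, where the same idea applies with whatever HTML parser is available). The class name `TextOnlyAnnotator` and the helper `annotate` are hypothetical names chosen for this example. The sketch re-emits tags as written and applies the replacement to text nodes only; it assumes reasonably well-formed input and, to stay short, silently drops comments and doctype declarations.

```python
import re
from html.parser import HTMLParser


class TextOnlyAnnotator(HTMLParser):
    """Hypothetical helper: copies markup through and annotates text nodes only."""

    def __init__(self, word, tag):
        # convert_charrefs=False so entities reach the handlers below unchanged
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(r"\b%s\b" % re.escape(word))
        self.tag = tag
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it was written
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_data(self, data):
        # Only text nodes arrive here, so attribute values are never touched
        wrap = lambda m: "<%s>%s</%s>" % (self.tag, m.group(0), self.tag)
        self.out.append(self.pattern.sub(wrap, data))


def annotate(html, word, tag):
    parser = TextOnlyAnnotator(word, tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


result = annotate("<team attr='best'>botafogo</team> is the best", "best", "adjective")
print(result)
```

The attribute value is left alone because the parser never hands attribute text to `handle_data`.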

polygenelubricants
  • 1
    *"While you can hack around these problems with more and more regular expression cleverness, you eventually paint yourself into a corner with complexity. Regular expressions don't truly understand the code that they are colorizing-- but parsers do."* -- http://www.codinghorror.com/blog/2005/04/parsing-beyond-regex.html – John K Mar 03 '10 at 02:42
4

Have you considered using DOM functions instead of a regex?

document.getElementsByTagName('tag')[0].innerHTML.match('botafogo')
YOU
1

An HTML parser is best: parse the document, then loop over the text nodes. (See the other answers.)

If you're in PHP, a quick solution is to run strip_tags() on the content to remove the HTML first. Whether that works depends on what you're doing: if you're replacing, stripping first is not an option; if you're only matching, anything that is not part of a match can be removed without concern.

Matchu
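  • My program automatically "annotates" terms in plain text: botafogo is the best becomes <team attr='best'>botafogo</team> is the best, and when I "replace all" occurrences of the word "best", I have a big problem: <team attr='<adjective>best</adjective>'>botafogo</team> is the <adjective>best</adjective> – celsowm Mar 03 '10 at 02:54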
  • Well. No good stripping, then. But I'll leave the answer for reference. – Matchu Mar 03 '10 at 03:00
0

@OP, in your favourite language, split on </tag>, then split each piece on > and keep the last part. For example, in Python:

>>> s = "<tag value='botafogo'> botafogo is the best </tag>"
>>> for item in s.split("</tag>"):
...     if "<tag" in item:
...         # everything after the last '>' is the text inside the tag
...         print(item.split(">")[-1])
...
 botafogo is the best

No regex needed

ghostdog74
0

I was just looking for a solution to the same task, and created one that seems to do the job.

Negative lookahead is the key. To make sure the match is not inside a tag, look ahead and assert that no closing angle bracket appears before the next opening one. Suppose we want to find the word "needle":

#needle(?![^<]+>)#i

My case is in PHP, and looks something like this:

function filter_highlighter($content) {
    $patterns = array(
        '#needle(?![^<]+>)#i',
        '#<b>Need</b>le#',
        '#<strong>Need</strong>le#'
    );
    $replacement = '<span class="highlighted">Need</span>le';
    $content = preg_replace( $patterns, $replacement, $content);
    return $content;
}

So far it works.
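Applied to the example from the question, the same lookahead can be sketched in Python as below (Java's regex engine also supports negative lookahead). The function name `annotate_outside_tags` is made up for this illustration. It is a heuristic rather than a parser: it assumes every `>` in the input closes a tag, and because the pattern uses `[^<]+`, a word immediately followed by `>` (for example an unquoted attribute value at the end of a tag) would not be excluded.

```python
import re


def annotate_outside_tags(text, word, tag):
    # (?![^<]+>) rejects a match when a '>' is reached before any '<',
    # i.e. when the word appears to sit inside a tag.
    pattern = re.compile(re.escape(word) + r"(?![^<]+>)", re.IGNORECASE)
    return pattern.sub(lambda m: "<{0}>{1}</{0}>".format(tag, m.group(0)), text)


fixed = annotate_outside_tags(
    "<team attr='best'>botafogo</team> is the best", "best", "adjective"
)
print(fixed)
```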

Serge