I was reading regular-expressions.info examples to try to learn more regex patterns.

The first example, "Grabbing HTML Tags", discusses a regex that matches the opening and closing pair of a specific HTML tag.

<TAG\b[^>]*>(.*?)</TAG>

I'm a little confused here. Why is \b[^>]* added to the above regex pattern, when the same thing can seemingly be achieved with the simpler pattern below:

<TAG>(.*?)</TAG>

Why is this extra part of the pattern used? Does it help performance?

Braj
  • This link may be relevant (or at least interesting): http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – McLovin Jul 12 '14 at 02:35
  • I got it. I was just looking for some examples that explain about the use of this extra regex pattern. Thanks to all. – Braj Jul 12 '14 at 02:37

4 Answers

  • That's in order to match things like <a href=...> stuff </a>, as opposed to a simple <b> stuff </b> where your option would work.
  • The \b boundary is needed to avoid matching a longer tag name that merely starts with TAG, such as <attribute ...> stuff </a> when TAG is a
  • The lazy quantifier .*? between the opening and closing tags is needed, as opposed to [^<]*, because between the opening and closing tags you might have another tag (for instance <b>)
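
To make the three points above concrete, here is a minimal Python sketch. It assumes Python's re module, uses a as a stand-in for TAG, and the sample string is made up for illustration; other regex flavors should behave similarly for these patterns.

```python
import re

# "a" stands in for TAG; the HTML snippet is an invented example.
html = '<a href="x.html">see <b>this</b></a>'

with_attrs = re.compile(r"<a\b[^>]*>(.*?)</a>")  # allows attributes
bare_only = re.compile(r"<a>(.*?)</a>")          # opening tag must be exactly <a>

m1 = with_attrs.search(html)
m2 = bare_only.search(html)
print(m1.group(1))  # see <b>this</b>
print(m2)           # None, because the opening tag carries an attribute

# Using [^<]* instead of .*? stops at the nested <b>, so nothing matches.
no_nested = re.compile(r"<a\b[^>]*>([^<]*)</a>")
m3 = no_nested.search(html)
print(m3)           # None
```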
zx81

Because without the word boundary, it also matches any tag whose name merely starts with TAG, not only TAG itself.

DEMO

You could try the demo. Just play with and without \b in the pattern.

<TAG\b[^>]*>(.*?)</TAG>

Explanation:

  • < Matches < symbol.
  • TAG Tag name
  • \b Matches a word boundary: the position between a word character and a non-word character.
  • [^>]* Matches any character other than >, zero or more times.
  • (.*?) Captures the section between the opening and closing tags. The ? after the * makes the match reluctant (lazy).
  • </TAG> Matches the end tag.

For example:

Input:

<a href="www.foo.com">link</a>
<ahref="www.foo.com">link</a>

Regex:

<a[^>]*>(.*?)<\/a>

The above regex would match both links.

Regex:

<a\b[^>]*>(.*?)<\/a>

But this would match only the first one, because a word boundary exists between a and the space that follows it. In <ahref there is no boundary between a and h, since both are word characters.
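
If you would rather check this locally than in the online demo, the comparison can be sketched in Python. This assumes Python's re module; the input is the same two lines as above, and the variable names are only for illustration.

```python
import re

text = '<a href="www.foo.com">link</a>\n<ahref="www.foo.com">link</a>'

# Without \b, "<a" followed by anything up to ">" also accepts "<ahref...".
without_boundary = [m.group(0) for m in re.finditer(r"<a[^>]*>(.*?)</a>", text)]

# With \b, the tag name must end right after "a".
with_boundary = [m.group(0) for m in re.finditer(r"<a\b[^>]*>(.*?)</a>", text)]

print(len(without_boundary))  # 2
print(with_boundary)          # ['<a href="www.foo.com">link</a>']
```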

Avinash Raj

Some opening tags have attributes, like <img src="asdf.png">. The tag does not end until the > is reached, so the word boundary marks the end of the tag name and [^>]* matches the attributes, here src="asdf.png".

McLovin

The \b[^>]* in

<TAG\b[^>]*>(.*?)</TAG>

allows text (such as attributes: width="30") and whitespace in the opening tag (as long as the tag name is exactly TAG and not TAGX or some other name; that's what the \b word boundary is for). Syntax and spacing in HTML are very loosey-goosey. It's always safer to allow extra attributes and whitespace, as a single HTML tag can span many lines.

The latter regex

<TAG>(.*?)</TAG>

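only allows the opening tag to be exactly <TAG>, then some text, then </TAG>. Note that in most regex flavors the text can span multiple lines only if the dot is set to match newlines (a "single-line" or DOTALL flag); by default . does not match a line break.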

The ? in .*? makes the quantifier reluctant (lazy), meaning the match ends at the first closing </TAG> that follows. Eliminating the ? makes it greedy, meaning the match extends to the last closing </TAG> that .* can reach in the search string.
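
As a rough illustration of the reluctant versus greedy difference, here is a small Python sketch. It assumes Python's re module, with b standing in for TAG and invented sample strings. It also shows that . only crosses line breaks when a flag such as re.DOTALL is set.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Reluctant: each match stops at the first closing tag it finds.
lazy = re.findall(r"<b\b[^>]*>(.*?)</b>", text)

# Greedy: .* runs on to the last closing tag it can reach.
greedy = re.findall(r"<b\b[^>]*>(.*)</b>", text)

print(lazy)    # ['one', 'two']
print(greedy)  # ['one</b> and <b>two']

# By default "." does not match a newline, so content spanning lines needs DOTALL.
multi = "<b>line one\nline two</b>"
default_match = re.search(r"<b\b[^>]*>(.*?)</b>", multi)
dotall_match = re.search(r"<b\b[^>]*>(.*?)</b>", multi, re.DOTALL)
print(default_match)          # None
print(dotall_match.group(1))  # the two lines, including the newline between them
```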


Be sure to check out the Stack Overflow Regular Expressions FAQ :)

Community
aliteralmind