Confused about Grabbing HTML Tags regex pattern

Question

I was reading regular-expressions.info examples to try to learn more regex patterns.

The first example Grabbing HTML Tags talks about a regex for the opening and closing pair of a specific HTML tag.

<TAG\b[^>]*>(.*?)</TAG>

I'm a little confused here. Why is \b[^>]* added to the above regex pattern, when the same thing can apparently be achieved with the pattern below?

<TAG>(.*?)</TAG>

Why is this extra part of the pattern used? Does it help performance in any way?

This link may be relevant (or at least interesting): http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — McLovin, Jul 12 '14 at 02:35
I got it. I was just looking for some examples that explain the use of this extra regex pattern. Thanks to all. – Braj, Jul 12 '14 at 02:37

zx81 · Accepted Answer · 2014-07-12T02:35:52.200

1

That's in order to match tags that carry attributes, like <a href=...> stuff </a>, as opposed to a simple <b> stuff </b>, where your option would work.
The \b boundary is needed to avoid matching tags whose names merely start with the same letters, like <attribute ...> stuff </a>.
The lazy quantifier .*? between the opening and closing tags is needed, as opposed to [^<]*, because between the opening and closing tags you might have another tag (for instance <b>).
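
A minimal Python sketch of the three points above. It assumes TAG is `a`, and the sample strings are made up for illustration (they are not from the tutorial):

```python
import re

simple = re.compile(r"<a>(.*?)</a>")              # the shorter pattern from the question
full = re.compile(r"<a\b[^>]*>(.*?)</a>")          # the pattern from the tutorial
no_nested = re.compile(r"<a\b[^>]*>([^<]*)</a>")   # [^<]* in place of .*?

with_attr = '<a href="x.html">stuff</a>'
nested = '<a href="x.html">some <b>bold</b> stuff</a>'

# The shorter pattern cannot match an opening tag that carries attributes.
simple_hit = simple.search(with_attr)
full_hit = full.search(with_attr)

# [^<]* stops at the first "<", so it fails when another tag is nested inside.
no_nested_hit = no_nested.search(nested)
nested_hit = full.search(nested)

print(simple_hit)            # None
print(full_hit.group(1))     # stuff
print(no_nested_hit)         # None
print(nested_hit.group(1))   # some <b>bold</b> stuff
```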

edited Jul 12 '14 at 02:35

answered Jul 12 '14 at 02:30

zx81


Thanks, I got it. I had completely forgotten about this kind of use. – Braj Jul 12 '14 at 02:31
Thanks for the description. I was just looking for the uses of this extra pattern. That's clear to me now. – Braj Jul 12 '14 at 02:39
Thanks, glad it helped. :) – zx81 Jul 12 '14 at 02:47

Avinash Raj · Answer 2 · 2014-07-12T02:51:52.267

Because without the word boundary, the pattern matches any tag whose name merely begins with TAG, not only the TAG tag itself.

DEMO

You could try the demo. Just play with and without \b in the pattern.

<TAG\b[^>]*>(.*?)</TAG>

Explanation:

< Matches the < symbol.
TAG The tag name.
\b Matches at a word boundary, i.e. between a word character and a non-word character.
[^>]* Matches any character other than >, zero or more times.
(.*?) Captures the content between the opening and closing tags. The ? after the * makes the match reluctant (lazy).
</TAG> Matches the closing tag.
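
The same breakdown can be written as a commented pattern. This is only a sketch, assuming Python's `re.VERBOSE` flag, with TAG replaced by `a` and a made-up input string:

```python
import re

# The tutorial's pattern with TAG replaced by "a", laid out one piece per line.
pattern = re.compile(
    r"""
    <a        # the < symbol followed by the tag name
    \b        # word boundary: the tag name must end here
    [^>]*     # any characters other than >, zero or more times
    >         # end of the opening tag
    (.*?)     # group 1: the content, matched reluctantly
    </a>      # the closing tag
    """,
    re.VERBOSE,
)

match = pattern.search('<a class="nav" href="/home">Home</a>')
print(match.group(0))   # <a class="nav" href="/home">Home</a>
print(match.group(1))   # Home
```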

For example:

Input:

<a href="www.foo.com">link</a>
<ahref="www.foo.com">link</a>

Regex:

<a[^>]*>(.*?)<\/a>

The above regex would match both links.

Regex:

<a\b[^>]*>(.*?)<\/a>

But this would match only the first one, because a word boundary exists between a and the space that follows it. In <ahref there is no boundary between a and h, so the second link is not matched.
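
The comparison can be reproduced in Python with the two input lines from this answer. This is a sketch rather than the original demo; the variable names are made up, and the `/` is left unescaped because Python patterns have no delimiter to escape it from:

```python
import re

text = '<a href="www.foo.com">link</a>\n<ahref="www.foo.com">link</a>'

# Collect the full text of every match, with and without the word boundary.
without_boundary = [m.group(0) for m in re.finditer(r"<a[^>]*>(.*?)</a>", text)]
with_boundary = [m.group(0) for m in re.finditer(r"<a\b[^>]*>(.*?)</a>", text)]

print(without_boundary)
# ['<a href="www.foo.com">link</a>', '<ahref="www.foo.com">link</a>']
print(with_boundary)
# ['<a href="www.foo.com">link</a>']
```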

score 0 · Answer 3 · answered Jul 12 '14 at 02:33

Some opening tags have attributes, like <img src="asdf.png">. The tag does not end until the > is reached, so \b[^>]* (the word boundary followed by any non-> characters) matches the src="asdf.png" part.

McLovin


score 0 · Answer 4 · edited May 23 '17 at 12:20

The \b[^>]* in

<TAG\b[^>]*>(.*?)</TAG>

allows there to be text (such as attributes: width="30") and whitespace in the opening tag (as long as it's only a TAG and not TAGX or some other tag name; that's what the \b word boundary is for). Syntax and spacing in HTML are very loose, and a single HTML tag can span many lines, so it's always safe to allow extra attributes and whitespace.

The latter regex

<TAG>(.*?)</TAG>

only allows the opening tag to be exactly <TAG>, followed by some text (which can span multiple lines only if the dot is set to match newlines), then </TAG>.

The ? in .*? makes the quantifier reluctant, meaning the match ends at the next closing </TAG>. Eliminating the ? makes it greedy, meaning the match extends to the last closing </TAG> that .* can reach in the search string.
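
A short Python sketch of the reluctant and greedy behaviour, assuming TAG is `b` and using made-up sample strings. It also shows that the dot only crosses line breaks when a dot-matches-newline option (`re.DOTALL` in Python) is switched on:

```python
import re

text = "<b>one</b> and <b>two</b>"

# Reluctant: each match stops at the next closing tag.
lazy = re.findall(r"<b\b[^>]*>(.*?)</b>", text)
# Greedy: a single match runs on to the last closing tag.
greedy = re.findall(r"<b\b[^>]*>(.*)</b>", text)

print(lazy)     # ['one', 'two']
print(greedy)   # ['one</b> and <b>two']

multiline = "<b>line one\nline two</b>"

# By default "." does not match a newline, so this content is not matched.
default_hit = re.search(r"<b\b[^>]*>(.*?)</b>", multiline)
# With re.DOTALL the dot also matches newlines.
dotall_hit = re.search(r"<b\b[^>]*>(.*?)</b>", multiline, re.DOTALL)

print(default_hit)                  # None
print(repr(dotall_hit.group(1)))    # 'line one\nline two'
```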

Be sure to check out the Stack Overflow Regular Expressions FAQ :)
