I'm following along with a Ruby tutorial that uses a regex to remove all HTML tags from a string:

product.description.gsub(/<.*?>/,'').

I don't know how to interpret the ?. Does it mean "at least one of the previous"? In that case, wouldn't /<.+>/ have been more appropriate?

AndersTornkvist
Flavius Stef
  • Note that HTML attributes may contain plain `>` characters. Your regular expression doesn’t consider that. – Gumbo Jul 04 '10 at 09:30
  • I was following along with a tutorial, which (as you point out) uses a simple approach to the problem. I was more interested in how *? works. – Flavius Stef Jul 04 '10 at 09:37
  • See also http://stackoverflow.com/questions/3075130/difference-between-and-for-regex/3075532#3075532 - I covered this in detail with illustrative examples. – polygenelubricants Jul 04 '10 at 11:37

4 Answers

In this case, it makes * lazy.

1* - match as many 1s as possible.
1*? - match as few 1s as possible.

Here, when you have <a>text<b>some more text, <.*> will match <a>text<b>.
<.*?>, however, will match <a> and <b>.
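
As a quick check of the two claims above, here is a minimal Ruby sketch using String#scan, which returns every match of a pattern. The variable names are only illustrative.

```ruby
html = "<a>text<b>some more text"

# Greedy: .* runs to the end of the string, then backtracks to the last ">".
greedy = html.scan(/<.*>/)

# Lazy: .*? stops at the first ">" it can reach, so each tag matches separately.
lazy = html.scan(/<.*?>/)

p greedy  # => ["<a>text<b>"]
p lazy    # => ["<a>", "<b>"]
```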

See also: Laziness Instead of Greediness

Another important note: this regex can easily fail on valid HTML. It is better to use an HTML parser and extract the text of your document.
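
To illustrate one way this can go wrong, here is a small Ruby sketch. The sample markup is made up for the example: it puts a plain > inside an attribute value, which the regex mistakes for the end of the tag.

```ruby
# Hypothetical input: the title attribute contains a literal ">".
html = '<a title="1 > 0">link</a>'

stripped = html.gsub(/<.*?>/, '')

# The lazy match stops at the first ">", which sits inside the attribute,
# so the rest of the opening tag leaks into the result.
p stripped  # => " 0\">link"
```

The expected text would have been just "link", which is why a real parser is the safer choice for anything beyond simple markup.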

Kobi

By default, .* is greedy, which means that it matches as much as possible. So with .* the replacement would change:

This <b>is</b> an <i>example</i>.
     ^-------------------------^

to

This .

If you put a question mark after a quantifier, it makes the quantifier non-greedy, so that it matches as little as possible. With .*? the replacement works as follows:

This <b>is</b> an <i>example</i>.
     ^-^  ^--^    ^-^       ^--^

Becomes:

This is an example.
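
The two replacements above can be reproduced in Ruby with gsub, as in the tutorial from the question. This is only a sketch; the variable names are illustrative.

```ruby
text = "This <b>is</b> an <i>example</i>."

# Greedy: one match spans from the first "<" to the last ">".
greedy_result = text.gsub(/<.*>/, '')

# Non-greedy: each tag is matched and removed on its own.
lazy_result = text.gsub(/<.*?>/, '')

puts greedy_result  # prints "This ."
puts lazy_result    # prints "This is an example."
```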

This is different from the more common use of ? as a quantifier where it means 'match zero or one'.
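
To make the distinction concrete, here is a short Ruby sketch contrasting the two roles of ?. It also covers the /<.+>/ pattern suggested in the question, which is still greedy because + has no ? after it. The sample strings are invented for illustration.

```ruby
# "?" after a character: that character is optional (zero or one).
optional = "color colour".scan(/colou?r/)

# "+" alone is greedy, just like "*", so it swallows both tags.
greedy_plus = "<a>text<b>"[/<.+>/]

# "?" after a quantifier: the quantifier becomes non-greedy.
lazy_plus = "<a>text<b>"[/<.+?>/]

p optional     # => ["color", "colour"]
p greedy_plus  # => "<a>text<b>"
p lazy_plus    # => "<a>"
```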

Either way, if your text is HTML, you should use an HTML parser instead of regular expressions.

Mark Byers

Quantifiers such as * are greedy by default, which means they match as much as possible. Adding ? after them makes them lazy, so they stop matching as soon as possible.

Daniel Egeberg

That's the best website I've found about regex, after the regex library:

http://www.wellho.net/regex/java.html

Hope that helps!

Saher Ahwal