3

What do these two regular expressions do?

/<(.*?)>.*?<\/\1>/

/<(.*?)>.*<\/\1>/

What I've learned is .*? means as few characters as possible.

For example:

  my $a = '"helllo"++"world"';

  print "a $1\n" if $a =~/(".*")/;            # "helllo"++"world"

  print "b $1\n" if $a =~/(".*?")/;           # "helllo"

  print "c $1\n" if $a =~/(.*)/;              # "helllo"++"world"

  print "d $1\n" if $a =~/(.*?)/;             # (nothing)

Why does d show nothing, and why is b "helllo" rather than ""?

I couldn't find any good examples that show the difference between regexes 1 and 2 the way the helllo world example does for .* and .*?.

Can someone help by giving some examples to show the difference?

Thanks.

Alan Moore
Patrick

4 Answers

2

.*? will search for a match using as few characters as possible. If you're matching ".*?" against "a" "b", the regex engine will report just "a" as a match. The greedy version ".*" would report the entire string as a match because * wants to consume as much input as possible.

The regex .*? can be satisfied with an empty string, so if you don't put anything around it to prevent a match that's exactly what it'll do (like in your case d).
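As a rough illustration of the two paragraphs above, the following sketch runs the lazy, greedy, and bare lazy patterns against the `"a" "b"` string. The variable names are made up for this example.

```perl
use strict;
use warnings;

my $text = '"a" "b"';

# Lazy: stops at the first closing quote it can reach.
my ($lazy) = $text =~ /(".*?")/;

# Greedy: runs on to the last closing quote in the string.
my ($greedy) = $text =~ /(".*")/;

# With nothing around it, a lazy .*? is satisfied by zero characters.
my ($bare) = $text =~ /(.*?)/;

print "lazy:   $lazy\n";      # "a"
print "greedy: $greedy\n";    # "a" "b"
print "bare:   [$bare]\n";    # []
```

The square brackets around the last value are only there to make the empty string visible in the output.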

Your regexes 1) and 2) are for parsing text that looks like XML. Consider this text:

<tag>text</tag> <tag>more text</tag>

Regex 1) will match only the first tagged substring, <tag>text</tag>, while regex 2) matches the entire string.

Now consider this text:

<tag>what about <tag>nested</tag> tags?</tag>

Regex 1) will match only up to the first closing </tag>, i.e. <tag>what about <tag>nested</tag>, while regex 2) matches the entire string.
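The two sample texts above can be checked with a short script. This is only a sketch: `first_match` is a hypothetical helper written for this example, and it uses the `@-` and `@+` offset arrays to recover the whole matched text without adding another capture group (an extra group would shift what `\1` refers to).

```perl
use strict;
use warnings;

my $lazy_re   = qr/<(.*?)>.*?<\/\1>/;   # regex 1
my $greedy_re = qr/<(.*?)>.*<\/\1>/;    # regex 2

# Hypothetical helper: returns the full text of the first match, or undef.
sub first_match {
    my ($re, $text) = @_;
    return $text =~ $re ? substr($text, $-[0], $+[0] - $-[0]) : undef;
}

my $siblings = '<tag>text</tag> <tag>more text</tag>';
my $nested   = '<tag>what about <tag>nested</tag> tags?</tag>';

for my $text ($siblings, $nested) {
    print "input:   $text\n";
    print "regex 1: ", first_match($lazy_re, $text), "\n";
    print "regex 2: ", first_match($greedy_re, $text), "\n\n";
}
```

For both inputs regex 2 returns the whole string, while regex 1 stops at the first closing </tag> it can reach.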

Joni
2

You really have two questions in one here:

What do these two regular expressions do?

/<(.*?)>.*?<\/\1>/

A dot, '.', is an atom that matches any single character (except a newline, by default). The asterisk after it means "match the previous atom zero or more times, as many times as possible." The combination is referred to as "greedy": the dot matches anything and the '*' says "just keep going", so without any other constraint or anchor the combination 'eats', or matches, the rest of the string. The question mark changes this behaviour from "greedy" to "stingy" (also called non-greedy or lazy): it will try to match as little as possible.

The round brackets - or parentheses - don't stipulate what to match - they are there to indicate that you want to "capture" whatever does match to a special variable called "$1" for the first pair of brackets, "$2" for the second and so on.

So <(.*?)> means: match an opening angle bracket (or "less than"), then match anything (but take up as little as possible), and then match a closing angle bracket (or "greater than"). The round brackets stipulate nothing about what to match; they just mean "Put whatever text is between the angle brackets into $1."

The \1 in the last part, <\/\1>, is what's called a back reference: it means "whatever you captured in the first set of round brackets I want to match again right here". The \/ before it is an escaped forward slash. So what we are looking for here is a "tag" (text surrounded by angle brackets), some text, and then a matching "closing tag", i.e. the same text in angle brackets with a '/' in front.

/<(.*?)>.*<\/\1>/

This does almost the same thing except it tries to take as much text as possible between the opening and closing tags.
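To make the capture and the back reference concrete, here is a small sketch using the first regex. The sample strings and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $re = qr/<(.*?)>.*?<\/\1>/;

# In list context a successful match returns the captured groups.
my ($tag) = '<b>bold</b> and <i>italic</i>' =~ $re;
print "captured tag name: $tag\n";            # b

# The closing tag must repeat the captured text exactly.
my $mismatch = '<b>bold</i>' =~ $re ? 'match' : 'no match';
print "mismatched tags: $mismatch\n";         # no match

# The capture is everything between < and >, attributes included,
# so a closing tag that repeats only the name does not satisfy \1.
my $attrs = '<a href="x">link</a>' =~ $re ? 'match' : 'no match';
print "tag with attribute: $attrs\n";         # no match
```

The last case is one reason patterns like these only suit text that merely looks like XML, with bare tag names.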

my $a = '"helllo"++"world"';
...
print "b $1\n" if $a =~/(".*?")/;            # "helllo"
...
print "d $1\n" if $a =~/(.*?)/;

why does d show nothing, and b is "helllo" not ""?

With (b), you are saying "I insist on a double quote (") at the start, then any text, and then I insist again on a closing double quote (")". Now, if you look at the text in $a, you can see that all of the following start and end with a (") with some text in between:

"helllo" 
"helllo"++" 
"helllo"++"world"

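Here's the main point: .*? (stingy) means "I want the smallest one", i.e. in this case the first, whereas .* (no '?', greedy) means "I want the longest one", i.e. in this case the last. In both cases the engine first settles on the leftmost position where a match can start (here, the first double quote) and only then chooses among the candidates that start there.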

With (d), you stipulate no angle brackets or double quote characters at the start or end of the pattern. You are simply saying "match anything (.*?), but take as little as possible". So the RE gave you nothing at all, as the empty string certainly is the smallest match that satisfies the criteria (i.e. no criteria! :-)
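One way to see what (b) and (d) actually matched is to print the match offsets as well as the text. The sketch below assumes a hypothetical helper, `match_span`, which reads the start and end offsets of the last successful match from Perl's `@-` and `@+` arrays.

```perl
use strict;
use warnings;

my $s = '"helllo"++"world"';

# Hypothetical helper: [matched text, start offset, end offset], or undef.
sub match_span {
    my ($re, $text) = @_;
    return undef unless $text =~ $re;
    return [ substr($text, $-[0], $+[0] - $-[0]), $-[0], $+[0] ];
}

my $quoted = match_span(qr/".*?"/, $s);   # case (b)
my $bare   = match_span(qr/.*?/,   $s);   # case (d)

printf "b: [%s] from %d to %d\n", @$quoted;   # b: ["helllo"] from 0 to 8
printf "d: [%s] from %d to %d\n", @$bare;     # d: [] from 0 to 0
```

So (d) does succeed; it simply matches zero characters at offset 0. Note also that "++" is a shorter quoted substring than "helllo", but it starts later in the string, so it is never considered once a match has been found at offset 0.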

Community
Marty
1

* means it matches zero or more times

*? means it matches zero or more times, but non-greedily

. means it matches any character except new line
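The newline exception for the dot is easy to miss, so here is a minimal sketch of it. The two-line sample string is invented; the /s modifier is the standard Perl switch that lets the dot match a newline as well.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# By default . does not match a newline, so greedy .* stops before it.
my ($default) = $text =~ /(.*)/;

# With /s the dot also matches newlines, so .* runs to the end of the string.
my ($dotall) = $text =~ /(.*)/s;

print "default: [$default]\n";
print "with /s: [$dotall]\n";
```

Without /s only the first line is captured; with /s the capture spans both lines.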

'"helllo"++"world"' This is your string

In the examples below, ^ marks the current position (there is a current position in the string and, roughly speaking, a current position in the pattern).

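In your first case, (".*"):

Step1

"helllo"++"world"      ".*"
^                      ^

Step2

"helllo"++"world"      ".*"
 ^                      ^

Step3

"helllo"++"world"      ".*"
                ^        ^

Step4

The greedy .* has consumed everything up to the end of the string, so nothing is left for the closing " in the pattern. The engine backtracks: .* gives up the last character, so it now ends on the character before the final ", and the closing " in the pattern is checked against that final ".

It matches at this position, so the match succeeds.

"helllo"++"world"      ".*"
               ^          ^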

In your second case, (".*?"): *? matches zero or more times, but not greedily.

Step1

  "helllo"++"world"     ".*?"
  ^                     ^  

Step2

  "helllo"++"world"     ".*?"
   ^                     ^  

Step3

  "helllo"++"world"     ".*?"
         ^                 ^  

Step4 (the non-greedy .*? stops growing as soon as the rest of the pattern, the closing ", can match)

  "helllo"++"world"     ".*?"
         ^                  ^  

In your third case, (.*):

Step 1

"helllo"++"world"        .* 
^                        ^ 

Step 2 (* is greedy, so it runs to the end of the string, and the result is the whole string)

"helllo"++"world"        .* 
                ^         ^ 

In your fourth case,

Your pattern is .*?, which means any character (.), zero or more times (*), but not greedy (?). With nothing after it in the pattern, the non-greedy quantifier is satisfied by zero characters, so the result is the empty string.

Mogsdad
mkHun
0

The reason d) prints nothing is pretty simple: You already know .*? matches as little as possible, so without adding any other criteria, "as little as possible" is nothing at all.

The reason b) (".*?") matches "helllo" is the two quotes in the expression, i.e. the match (if found) must start and end with a quote. The middle part .*? matches as little as possible, so that's helllo.

Bohemian