
I am running the following snippet of code on Perl 5.22:

  DB<41> x "up 34 days, 22:04 and more" =~ m/.*?(?:(\d+) days).*$/
0  34

The above code works as expected and pulls out the 34 from "34 days".

My question comes in when I make the capture group optional by adding a ? at the end of it like this:

  DB<4> x "up 34 days, 22:04 and more" =~ m/.*?(?:(\d+) days)?.*$/
0  undef

Why does it no longer match the 34? I have searched the web, but couldn't find any questions that matched mine (if you do have a link that explains it, that would be fantastic).

Thanks, in advance, for your time.

Adam
  • You are correct, it is Perl 5.22, and I have corrected it in my original post. Thanks. – Adam Jan 21 '20 at 21:39
  • Note that a leading `.*?` only serves to slow down the match (possibly a lot). Remove it (or use `^.*?` in the unlikely event that you want the leading part to be in `$&`). – ikegami Jan 22 '20 at 02:26

2 Answers


Regexes always work from left to right, and quantifiers always try first to match as much as they can, or as little as they can when made non-greedy (like .*?). Only when they reach an unmatchable state will they back up and try a new match (backtracking). The key to regexes is working around what the regex engine will try first.

.*? will first try to match the empty string at the beginning of the string, since that's the least it can match. In the case of the first regex, that does not result in a successful overall match, so the engine backtracks until .*? matches "up " and the following group can match "34 days". But if you make the following group optional, the very first attempt already succeeds: .*? matches the empty string, then (?:(\d+) days)? matches the empty string (it cannot match digits followed by " days" at that position, but it is allowed to match nothing), then .* matches the rest of the string, and $ matches at the end of the string. That is a successful match, so the engine never needs to backtrack and the capture group is never filled.
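As a rough sketch of one possible workaround (not the only one), the lazy prefix can be moved inside the optional group, so the engine has to try the "digits followed by days" path before it is allowed to fall back to matching nothing. The variable names and the `show` helper below are made up for illustration:

```perl
use strict;
use warnings;

my $s = "up 34 days, 22:04 and more";

# Original optional form: every piece can match the empty string at
# position 0, so the match succeeds before "34 days" is ever reached.
my ($lazy) = $s =~ m/.*?(?:(\d+) days)?.*$/;

# Lazy prefix moved inside the optional group: the greedy `?` tries the
# whole group first, and .*? expands until "(\d+) days" can match.
my ($days) = $s =~ m/^(?:.*?(\d+) days)?/;

# With no "days" in the input, the group is skipped and the capture is undef.
my ($none) = "up 22:04 and more" =~ m/^(?:.*?(\d+) days)?/;

# Hypothetical helper: print undef as the word "undef" without a warning.
sub show { my ($v) = @_; return defined $v ? $v : 'undef' }

printf "lazy=%s days=%s none=%s\n", show($lazy), show($days), show($none);
```

Here the regex still matches when there is no "N days" part, but it only leaves the capture undefined when the group genuinely cannot match anywhere in the string.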

Regexp::Debugger can be nice to visualize the behavior, as well as https://regex101.com/ (just beware that PCRE is not exactly the same as Perl regex).

Grinnz

It no longer captures because both .*? and (?:(\d+) days)? can match the empty string, and .*$ then matches whatever remains, which here is the whole input string.

If you check the following

use strict;
use warnings;

my $s = "up 34 days, 22:04 and more";

if ($s =~ m/.*?(?:(\d+) days)(.*)$/) {
  print("first:\n  \$1=\"$1\"\n  \$2=\"$2\"\n");
}
if ($s =~ m/.*?(?:(\d+) days)?(.*)$/) {
  print("second:\n  \$1=\"$1\"\n  \$2=\"$2\"\n");
}

you'll get

first:
  $1="34"
  $2=", 22:04 and more"
second:
  $1=""
  $2="up 34 days, 22:04 and more"

as output (plus a warning about $1 being undefined in the second case, which you can ignore here), which illustrates that.
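Another way to see the same thing, sketched here on the assumption that you only want to inspect where each group matched, is Perl's built-in `@-` array, which holds the start offset of each group from the last successful match. The `offsets` helper and the variable names are invented for this example:

```perl
use strict;
use warnings;

my $s = "up 34 days, 22:04 and more";

my $required = qr/.*?(?:(\d+) days)(.*)$/;
my $optional = qr/.*?(?:(\d+) days)?(.*)$/;

# Hypothetical helper: start offsets of groups 1 and 2 after a match.
# $-[N] is undef when group N did not take part in the match.
sub offsets {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    my ($g1, $g2) = ($-[1], $-[2]);   # copy before leaving the match scope
    return ($g1, $g2);
}

my @req = offsets($s, $required);
my @opt = offsets($s, $optional);

for my $pair ([required => \@req], [optional => \@opt]) {
    my ($name, $off) = @$pair;
    printf "%s: group1 at %s, group2 at %s\n", $name,
        map { defined $_ ? $_ : 'undef' } @$off;
}
```

With the required group, the second capture starts right after " days"; with the optional group, it starts at offset 0 and the first capture has no offset at all.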

sticky bit