9

I have been reading about Perl regular expressions with the modifiers s, m and g. I understand that //g is global matching, where it will be a greedy search.

But I am confused by the modifiers s and m. Can anyone explain the difference between s and m, with a code example to show how they differ? I have tried searching online, and it only gives the explanation in the link http://perldoc.perl.org/perlre.html#Modifiers. On Stack Overflow I have even seen people using s and m together. Isn't s the opposite of m?

//s 
//m 
//g

I am not able to match multiple lines using m.

use warnings;
use strict;
use 5.012;

my $file; 
{ 
 local $/ = undef; 
 $file = <DATA>; 
};
my @strings = $file =~ /".*"/mg; #returns all except the last string across multiple lines
#/"String"/mg; tried with this as well and returns nothing except String
say for @strings;

__DATA__
"This is string"
"1!=2"
"This is \"string\""
"string1"."string2"
"String"
"S
t
r
i
n
g"
Borodin
user2763829

4 Answers

15

The documentation that you link to yourself seems very clear to me. It would help if you would explain what problem you had with understanding it, and how you came to think that /s and /m were opposites.

Very briefly, /s changes the behaviour of the dot metacharacter . so that it matches any character at all. Normally it matches anything except a newline "\n". With /s the string is treated as a single line even if it contains newlines.

/m modifies the caret ^ and dollar $ metacharacters so that they match at newlines within the string, treating it as a multi-line string. Normally they will match only at the beginning and end of the string (or, for $, just before a newline that ends the string).

You shouldn't get confused with the /g modifier being "greedy". It is for global matches which will find all occurrences of the pattern within the string. The term greedy is usually used for the behaviour of quantifiers within the pattern. For instance .* is said to be greedy because it will match as many characters as possible, as opposed to .*? which will match as few characters as possible.
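To make the distinction between /g and greediness concrete, here is a minimal sketch. The sample string and variable names are made up for illustration and are not from the question.

```perl
use strict;
use warnings;

# Hypothetical sample text with two quoted words.
my $text = 'a "one" b "two" c';

# Greedy quantifier: .* runs from the first quote to the last quote.
my ($greedy) = $text =~ /"(.*)"/;

# Non-greedy quantifier: .*? stops at the nearest closing quote.
my ($lazy) = $text =~ /"(.*?)"/;

# /g has nothing to do with greediness: in list context it
# returns the captures from every match in the string.
my @all = $text =~ /"(.*?)"/g;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "all:    @all\n";
```

The first two matches differ only in the quantifier, while the third differs only in the /g modifier, which is why the two ideas are best kept apart.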


Update

In your modified question you are using /".*"/mg, in which the /m is irrelevant because, as noted above, that modifier alters only the behaviour of the $ and ^ metacharacters, and there are none in your pattern.

Changing it to /".*"/sg improves things a little in that the . can now match the newline at the end of each line and so the pattern can match multi-line strings. (Note that it is the object string that is considered to be "single line" here - i.e. the match behaves just as if there were no newlines in it as far as . is concerned.) However, this is where the conventional meaning of greedy comes in, because the pattern now matches everything from the first double-quote in the first line to the last double-quote at the end of the last line. I assume that isn't what you want.
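A small sketch of that difference, using a shortened stand-in for the question's data (the three sample strings here are assumptions chosen for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample data: two one-line strings and one split across lines.
my $file = qq{"first"\n"second"\n"th\nird"\n};

# /m changes nothing here because the pattern has no ^ or $.
my @with_m = $file =~ /".*"/mg;

# /s lets . cross newlines, so the greedy .* runs to the very last quote.
my @with_s = $file =~ /".*"/sg;

printf "/mg: %d matches\n", scalar @with_m;
printf "/sg: %d match spanning %d characters\n",
    scalar @with_s, length $with_s[0];
```

With /mg the string that is split across lines is missed entirely, and with /sg all of the strings are swallowed into a single match.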

There are a few ways to fix this. I recommend changing your pattern so that the string you want is a double-quote, followed by any sequence of characters except double-quotes, followed by another double-quote. This is written /"[^"]*"/g (note that the /s modifier is no longer necessary as there are now no dots in the pattern) and very nearly does what you want except that the escaped double-quotes are seen as ending the string.

Take a look at this program and its output, noting that I have put a chevron >> at the start of each match so that they can be distinguished.

use strict;
use warnings;

my $file = do {
  local $/;
  <DATA>; 
};

my @strings = $file =~ /"[^"]*"/g;

print ">> $_\n\n" for @strings;

__DATA__
"This is string"
"1!=2"
"This is \"string\""
"string1"."string2"
"String"
"S
t
r
i
n
g"

output

>> "This is string"

>> "1!=2"

>> "This is \"

>> ""

>> "string1"

>> "string2"

>> "String"

>> "S
t
r
i
n
g"

As you can see everything is now in order except that in "This is \"string\"" it has found two matches, "This is \", and "". Fixing that may be more complicated than you want to go but it's perfectly possible. Please say so if you need that fixed too.


Update

I may as well finish this off. To ignore escaped double-quotes and treat them as just part of the string, we need to accept either \" or any character except double-quote. That is done using the regex alternation operator | and must be grouped inside non-capturing parentheses (?: ... ). The end result is /"(?:\\"|[^"])*"/g (the backslash itself must be escaped so it is doubled up) which, when put into the above program, produces this output, which I assume is what you wanted.

>> "This is string"

>> "1!=2"

>> "This is \"string\""

>> "string1"

>> "string2"

>> "String"

>> "S
t
r
i
n
g"
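
For readers who want to compare the two patterns side by side, here is a minimal sketch that runs both against just the line containing escaped quotes. The variable names are illustrative only.

```perl
use strict;
use warnings;

# The one line from the sample data that contains escaped quotes.
# q{} keeps the backslashes literally.
my $line = q{"This is \"string\""};

# Simple pattern: any run of non-quote characters between quotes.
my @simple = $line =~ /"[^"]*"/g;

# Pattern that accepts \" inside the string as well.
my @escaped = $line =~ /"(?:\\"|[^"])*"/g;

print "simple:  $_\n" for @simple;
print "escaped: $_\n" for @escaped;
```

The simple pattern splits the line into two fragments, while the alternation keeps it as one string.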
Borodin
  • Thanks for the explanation. However, I am not able to match multiple line using m. Mind to explain the code in the edited question. – user2763829 Apr 09 '14 at 13:17
  • Regexes can always match multiline strings. The `m` modifier just means that it *recognizes* the string as multiline (for purposes of matching the anchors, `^` and `$`). But don't read too much into the names, *multiline* and *single-line*. They're bad names for modes that never should have existed. In Perl 6 they've been eliminated. – Alan Moore Apr 09 '14 at 13:32
  • So when I use `/s` it will change the behavior of `.` so that `.` will include `\n` as part of `.`. Am I right? – user2763829 Apr 09 '14 at 13:38
  • +1 for thoroughly thorough. :) – zx81 Jun 06 '14 at 21:37
  • Hi, Borodin. I think what you say is a bit misleading, at least for a novice like me. I spent a whole afternoon asking questions on this site to figure out that to do multiline substitution we must use the 0777 option, am I right? But you talk about "match multi-line" without mentioning 0777 – user15964 Jul 25 '16 at 11:35
  • @user15964: *"I spent a whole afternoon asking questions on this site"* I don't see any questions from you at all about this. No, you don't need option `-0777`, which is intended for one-liner programs only. My solution was a whole program, and the `do` block `my $file = do {local $/; <DATA>; }` is the proper way to read an entire file. The OP used the same construct. I can't allow for every possible lack of knowledge from anyone reading my solution. If you don't know about [using `$/`](http://perldoc.perl.org/perlvar.html#Variables-related-to-filehandles) then you have some serious work to do. – Borodin Jul 25 '16 at 11:56
  • actually, I remembered you noticed my post http://stackoverflow.com/questions/38561556/cross-line-regex-match-in-a-perl-one-liner : ) So you mean that `local $/` is actually equivalent to -0777 right? – user15964 Jul 25 '16 at 16:39
5

/m and /s both affect how the match operator treats multi-line strings.

With the /m modifier, ^ and $ match the beginning and end of any line within the string. Without the /m modifier, ^ and $ just match the beginning and end of the string.

Example:

$_ = "foo\nbar\n";

/foo$/,  /^bar/       do not match
/foo$/m, /^bar/m      match

With the /s modifier, the special character . matches all characters including newlines. Without the /s modifier, . matches all characters except newlines.

$_ = "cat\ndog\ngoldfish";

/cat.*fish/           does not match
/cat.*fish/s          matches

It is possible to use /sm modifiers together.

$_ = "100\n101\n102\n103\n104\n105\n";

/^102.*104$/          does not match
/^102.*104$/s         does not match
/^102.*104$/m         does not match
/^102.*104$/sm        matches
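
The last table can be checked with a short script. This is only a sketch, and the `%matched` hash and its keys are names invented for the example.

```perl
use strict;
use warnings;

my $lines = "100\n101\n102\n103\n104\n105\n";

# Record whether each modifier combination matches (1) or not (0).
# The ternary forces boolean context, so a failed match gives 0
# rather than an empty list.
my %matched = (
    none => ($lines =~ /^102.*104$/   ? 1 : 0),
    s    => ($lines =~ /^102.*104$/s  ? 1 : 0),
    m    => ($lines =~ /^102.*104$/m  ? 1 : 0),
    sm   => ($lines =~ /^102.*104$/sm ? 1 : 0),
);

for my $mods (qw(none s m sm)) {
    printf "%-4s %s\n", $mods,
        $matched{$mods} ? 'matches' : 'does not match';
}
```

Only the /sm combination succeeds, because the pattern needs ^ and $ to match at line boundaries inside the string and also needs . to cross the newlines between 102 and 104.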
mob
4

With /".*"/mg your match

  1. starts with "
  2. then .*" matches as many characters as possible (except \n) up to the last " on the line
  3. since you use /g, once the match has stopped at that ", the regex engine tries to repeat the first two steps from there
  4. /m makes no difference here as you're not using the ^ or $ anchors

Since you have escaped quotes in your example, regex is not the best tool to do what you want. If that wasn't the case and you wanted everything between two quotes, /".*?"/gs would do the job.
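
As a rough illustration of that non-greedy pattern, here is a sketch on sample data that contains no escaped quotes (the data is made up for the example):

```perl
use strict;
use warnings;

# Hypothetical data without escaped quotes; the last string spans two lines.
my $text = qq{"one"."two"\n"th\nree"\n};

# .*? stops at the nearest closing quote; /s lets . match newlines.
my @strings = $text =~ /".*?"/gs;

print "<$_>\n" for @strings;
```

Each quoted string comes back as its own match, including the one that spans two lines.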

mpapec
1

Borodin's regex will work for the examples from this lab assignment.

However, it's also possible for a backslash to escape itself. This comes up when one includes Windows paths in a string, so the following regex would catch that case:

use warnings;
use strict;
use 5.012;

my $file = do { local $/; <DATA>};

my @strings = $file =~ /"(?:(?>[^"\\]+)|\\.)*"/g;

say "<$_>" for @strings;

__DATA__
"This is string"
"1!=2"
"This is \"string\""
"string1"."string2"
"String"
"S
t
r
i
n
g"
"C:\\windows\\style\\path\\"
"another string"

Outputs:

<"This is string">
<"1!=2">
<"This is \"string\"">
<"string1">
<"string2">
<"String">
<"S
t
r
i
n
g">
<"C:\\windows\\style\\path\\">
<"another string">

For a quick explanation of the pattern:

my @strings = $file =~ m{
    "
        (?:
            (?>            # Independent subexpression (reduces backtracking)
                [^"\\]+    # Gobble everything except double quotes and backslashes
            )
        |
            \\.            # Backslash followed by any character
        )*
    "
    }xg;                   # /x modifier allows whitespace and comments.
Miller