21

I am not clear on the use/need of the \G operator.
I read in the perldoc:

You use the \G anchor to start the next match on the same string where the last match left off.

I don't really understand this statement. When we use /g, we usually move to the character after the last match anyway.
As the example shows:

$_ = "1122a44";  
my @pairs = m/(\d\d)/g;   # qw( 11 22 44 )  

Then it says:

If you use the \G anchor, you force the match after 22 to start with the a:

$_ = "1122a44";
my @pairs = m/\G(\d\d)/g;

The regular expression cannot match there since it does not find a digit, so the next match fails and the match operator returns the pairs it already found

I don't understand this either. "If you use the \G anchor, you force the match after 22 to start with a." But without the \G the matching will be attempted at a anyway right? So what is the meaning of this sentence?
I see that in the example the only pairs printed are 11 and 22. So 44 is not tried.

The example also shows that with the /c option, a further match after the while loop does find 44.
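For reference, the /c behaviour can be sketched roughly as below. This is a minimal illustration modelled on the perldoc example rather than a copy of it; the names @found, $stopped_at and $after are made up for the sketch.

```perl
use strict;
use warnings;

$_ = "1122a44";

my @found;
# \G pins every attempt to the end of the previous match. With /c, a failed
# match leaves pos() where it was instead of resetting it.
while ( m/\G(\d\d)/gc ) {
    push @found, $1;
}
my $stopped_at = pos($_);    # offset 4, just before the "a"

# No \G here, so the engine is free to scan forward from pos() past the "a".
my $after = m/(\d\d)/g ? $1 : undef;

print "found in loop:    @found\n";
print "loop stopped at:  $stopped_at\n";
print "found after loop: $after\n";
```

The loop collects 11 and 22 and stops in front of the a. Because the position is kept, the unanchored match that follows picks up 44.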

To be honest, from all this I cannot see what this operator is useful for or when it should be applied.
Could someone please help me understand this, perhaps with a meaningful example?

Update
I think I did not understand this key sentence:

If you use the \G anchor, you force the match after 22 to start with the a. The regular expression cannot match there since it does not find a digit, so the next match fails and the match operator returns the pairs it already found.

This seems to mean that when the match fails at that position, the regex makes no further attempts later in the string, which is consistent with the examples in the answers.

Also:

After the match fails at the letter a, perl resets pos() and the next match on the same string starts at the beginning.
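A small sketch of that reset, assuming no /c flag is used; the variable names are only for illustration:

```perl
use strict;
use warnings;

my $str = "1122a44";

my @first_pass;
while ( $str =~ m/\G(\d\d)/g ) {    # no /c this time
    push @first_pass, $1;
}

# The failed attempt at the "a" reset the position, so pos() is undef again.
my $pos_after_failure = pos($str);

# A fresh \G match therefore anchors at offset 0 and finds "11" once more.
my $next = $str =~ m/\G(\d\d)/g ? $1 : undef;

print "first pass: @first_pass\n";
print "pos after failure: ",
    ( defined $pos_after_failure ? $pos_after_failure : "undef" ), "\n";
print "next match: $next\n";
```

So 44 is never reached here: the loop ends at the a, and the next match starts over from the beginning of the string.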

Scott Weaver
Jim
  • This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Anchors". – aliteralmind Apr 10 '14 at 00:21

3 Answers

22

\G is an anchor; it indicates where the match is forced to start. When \G is present, it can't start matching at some arbitrary later point in the string; when \G is absent, it can.

It is most useful in parsing a string into discrete parts, where you don't want to skip past other stuff. For instance:

my $string = " a 1 # ";
while () {
    if ( $string =~ /\G\s+/gc ) {
        print "whitespace\n";
    }
    elsif ( $string =~ /\G[0-9]+/gc ) {
        print "integer\n";
    }
    elsif ( $string =~ /\G\w+/gc ) {
        print "word\n";
    }
    else {
        print "done\n";
        last;
    }
}

Output with \G's:

whitespace
word
whitespace
integer
whitespace
done

without:

whitespace
whitespace
whitespace
whitespace
done

Note that I am demonstrating using scalar-context /g matching, but \G applies equally to list context /g matching and in fact the above code is trivially modifiable to use that:

my $string = " a 1 # ";
my @matches = $string =~ /\G(?:(\s+)|([0-9]+)|(\w+))/g;
while ( my ($whitespace, $integer, $word) = splice @matches, 0, 3 ) {
    if ( defined $whitespace ) {
        print "whitespace\n";
    }
    elsif ( defined $integer ) {
        print "integer\n";
    }
    elsif ( defined $word ) {
        print "word\n";
    }
}
ysth
  • Is the match state bound to the string, and how can matching be reset when on, e.g., an integer? – mpapec Feb 23 '14 at 18:29
  • I ran this in my CLI and it behaves as you say, but I don't understand it. Without the `\G`, `$string =~ /\s+/gc` matches whitespace, and since we got a match the regex should move on to `a`. But it seems that it does not and keeps printing "whitespace", meaning it is "stuck" on the first if statement. But why? – Jim Feb 23 '14 at 18:51
  • @mpapec: position is bound to the string and can be changed/set with http://perldoc.perl.org/functions/pos.html – ysth Feb 23 '14 at 18:56
  • @Jim: no, not stuck, matching a different space each time – ysth Feb 23 '14 at 18:57
  • @ysth: But in this case we should have i) a match on ` `, then ii) a match on `a`, iii) a match on ` `, etc. But this prints only whitespace. So why is the match at `a` skipped? – Jim Feb 23 '14 at 19:01
  • because the first match succeeds every time through the loop. with \G, the first match only succeeds when the next character is whitespace – ysth Feb 23 '14 at 19:02
  • you seem to be trying to fit what people say/show into your ideas of how it works; this is keeping you from listening to what people are actually saying: \G tells it where it *must start matching* – ysth Feb 23 '14 at 19:03
  • @ysth: Let me rephrase then. The target string is `" a 1 # "`. The regex matching process is attempted one character at a time. So the first if matches the first space at the start of the string. Then, since we have the `g`, the regex will keep going. So it will move to the next character in the string, since we process one at a time. The next character is `a`. This should be matched by the third branch (the second elsif). This is my understanding of a normal regex match. It is obviously wrong, since the output prints `whitespace` four times. Where is the error in my thought process? I really want to understand this. – Jim Feb 23 '14 at 19:21
  • @Jim: scalar context `/\s+/g` will match at the first whitespace in the string after the previous match. That means each time through the loop, the if condition succeeds and it never gets to any else clause (until the end of the string is reached, when only the final else branch is taken). – ysth Feb 23 '14 at 19:23
  • @ysth: Ah! So it will match the first whitespace. Then in the next iteration it will try to match whitespace at `a`, but it will fail, so it will proceed to the next character, and so on. Hence we have whitespace printed four times. Right? So with `\G` added, the whitespace is matched and printed, then `a` is tried against the first if, it fails, and then the regex does not proceed but the remaining elsif branches are tried? – Jim Feb 23 '14 at 19:43
  • yes. if the current position is on the a, /\G\s+/g will not match but /\s+/g will (later in the string). Does this mean you now understand what \G does? – ysth Feb 23 '14 at 20:53
  • I think what your example indicates is that we use `\G` when we want to match a string against a series of regexes. I think the idea is that `\G` *stops* the engine from trying later characters on the first failure (while keeping the position of the last match so that the next regexes can be attempted). Is my understanding correct? Are there other applications besides matching multiple regexes against the same target text? – Jim Feb 23 '14 at 20:57
  • Also are there any "special" variables that print the current position of `\G`? And last, is there a meaning of the letter G here (used as a mnemonic)? – Jim Feb 23 '14 at 20:59
  • stop thinking about 'first failure' and start thinking about constraint; must match at a given particular position. just like `(?=x)` means must match before an x and `\A` means must match at the very start of the string. – ysth Feb 23 '14 at 21:34
  • `pos($string)` gives the current position (or undef if it should start at the beginning; 0 and undef are slightly different in that 0 means a 0-length match already occurred at the beginning, so a zero-length match isn't allowed there again) – ysth Feb 23 '14 at 21:36
  • also, doesn't have to be a series of regexes; I will add an equivalent list context single regex. – ysth Feb 23 '14 at 21:37
  • `stop thinking about 'first failure' and start thinking about constraint; must match at a given particular position`. This sentence specifically the term "constraint" helped me a lot. Thank you very much! – Jim Feb 23 '14 at 22:26
  • the letter G is presumably because it is useful with `/g` (though it is useful other places too) – ysth Jan 29 '16 at 19:08
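To make the pos() comments above concrete, here is a rough sketch of reading and setting the position that \G anchors to. It reuses the string from the answer; everything else (the variable names, jumping to offset 3) is just for illustration.

```perl
use strict;
use warnings;

my $string = " a 1 # ";

my $before = pos($string);          # undef: no /g match has run yet

$string =~ /\G\s+/gc;               # consumes the leading space
my $after_space = pos($string);     # 1

pos($string) = 3;                   # jump straight to the "1"
my $is_integer = $string =~ /\G[0-9]+/gc ? 1 : 0;
my $after_integer = pos($string);   # 4

pos($string) = undef;               # reset: the next match starts from the beginning
my $after_reset = pos($string);

print "after space: $after_space\n";
print "integer at offset 3? $is_integer, now at $after_integer\n";
print "after reset: ", ( defined $after_reset ? $after_reset : "undef" ), "\n";
```

Assigning to pos() moves the point where \G is allowed to match, and assigning undef is one way to start over on the same string.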
15

But without the \G the matching will be attempted at a anyway right?

Without the \G, it won't be constrained to start matching there. It'll try, but it'll try starting later if required. You can think of every pattern as having an implied \G.*? at the front.

Add the \G, and the meaning becomes obvious.

$_ = "1122a44";
my @anchored   = m/\G     (\d\d)/xg;   # qw( 11 22 )
my @lazy_skip  = m/\G .*? (\d\d)/xg;   # qw( 11 22 44 )
my @unanchored = m/       (\d\d)/xg;   # qw( 11 22 44 )

To be honest, from all these I can not understand what is the usefulness of this operator and when it should be applied.

As you can see, you get different results by adding a \G, so the usefulness is getting the result you want.

ikegami
  • 1) The `\d\d` will match at `11`, then "move" to match `22`, and then move to match `a4`. There it fails and tries to match `44`. So I am still not clear what you mean by `won't be constrained to start matching there. It'll try, but it'll try starting later if required`. Could you please elaborate on this? 2) `usefulness is getting the result you want.` I can't think of an example where `\G` would save the day – Jim Feb 23 '14 at 17:58
  • @Jim : In the `perldoc` example, `\G` enforces that matches should be consecutive/uninterrupted. Once it ceases to match, it doesn't try any further. – Zaid Feb 23 '14 at 18:01
  • @Zaid: `matches should be consecutive/uninterrupted` I don't understand this. It still matches the same way without `\G`. `11-22-a4(fail)-44` – Jim Feb 23 '14 at 18:03
  • @Jim: look at the comments in the example code; the top matches twice, the bottom three times (skipping past the failing a) – ysth Feb 23 '14 at 18:05
  • Re: "I can't think of an example where \G would save the day", Again, if you want "11" and "22" from "1122a44" – ikegami Feb 23 '14 at 22:02
  • Re "I am still not clear what you mean by don't be constrained to start matching there", then why don't you actually repeat your experiment using `\G` to see how it constrains. – ikegami Feb 23 '14 at 22:03
  • Re "It still matches the same way without `\G`", Seriously? No, it doesn't. I showed what it matches: 11 and 22 with `\G`, 11, 22 and 44 with `\G`. – ikegami Feb 23 '14 at 22:05
  • @ikegami I think you meant "without" the second time. Good answer, `\G` actually makes some sense to me now. It's like a zero length assertion that matches the place where the last match left off, right? If I got this right, it would be `\G` is the `pos` of `$+[0]`? – TLP Feb 24 '14 at 08:53
  • @TLP, yes, it's just like "^", except it matches where the last match left off instead of at the start of the string. – ikegami Feb 24 '14 at 13:42
  • @ikegami.. I just get 44 for the second one: `perl -le ' BEGIN { $_ = "1122a44"; @pairs=m/\G .* (\d\d)/xg; print join(",",@pairs) } '` – stack0114106 Dec 23 '20 at 19:38
  • @stack0114106 That's not the code I posted. You changed the pattern. – ikegami Dec 23 '20 at 19:45
  • @ikegami.. sorry, I missed the `?`. By the way, this is very useful – stack0114106 Dec 23 '20 at 19:51
4

Interesting answers, and a lot of them are valid I guess, but I can also guess that they still don't explain a lot.

\G 'forces' the next match to occur at the position the last match ended.

Basically:

my $str = "1122a44";
while ( $str =~ m/\G(\d\d)/g ) {
    # code; $1 holds the next consecutive pair
}

The first match is "11". The second match is FORCED TO START at 22, and yes, that's \d\d, so the result is "22". The third 'try' is FORCED to start at "a", but that's not \d\d, so it fails.

Dubelo