
Do recursive regexes understand named captures? There is a note in the docs for (??{ code }) that it's an independent subpattern with its own set of captures that are discarded when the subpattern is done, and there's a note in (?PARNO) that it's "similar to (??{ code })". Is (?PARNO) discarding its own named captures when it's done?

I'm writing about Perl's recursive regular expressions for Mastering Perl. perlre already has an example with balanced parens (I show it in Matching balanced parenthesis in Perl regex), so I thought I'd try balanced quote marks:

#!/usr/bin/perl
# quotes-nested.pl

use v5.10;

$_ =<<'HERE';
He said 'Amelia said "I am a camel"'
HERE

say "Matched!" if m/
    (
        ['"]
            ( 
                (?: 
                    [^'"]+
                    | 
                    ( (?1) ) 
                )* 
            )
        ['"]
    )
    /xg;

print "
1 => $1
2 => $2
3 => $3
4 => $4
5 => $5
";

This works and the two quotes show up in $1 and $3:

Matched!
1 => 'Amelia said "I am a camel"'
2 => Amelia said "I am a camel"
3 => "I am a camel"
4 => 
5 => 

That's fine. I understand that. However, I don't want to know the numbers. So, I make the first capture group a named capture and look in %-, expecting to see the two substrings I previously saw in $1 and $3:

use v5.10;

$_ =<<'HERE';
He said 'Amelia said "I am a camel"'
HERE

say "Matched [$+{said}]!" if m/
    (?<said>
        ['"]
            ( 
                (?: 
                    [^'"]+
                    | 
                    (?1) 
                )* 
            )
        ['"]
    )
    /xg;

use Data::Dumper;
print Dumper( \%- );

I only see the first:

Matched ['Amelia said "I am a camel"']!
$VAR1 = {
          'said' => [
                      '\'Amelia said "I am a camel"\''
                    ]
        };

I expected that (?1) would repeat everything in the first capture group, including the named capture to said. I can fix that a bit by naming a new capture:

use v5.10;

$_ =<<'HERE';
He said 'Amelia said "I am a camel"'
HERE

say "Matched [$+{said}]!" if m/
    (?<said>
        ['"]
            ( 
                (?: 
                    [^'"]+
                    | 
                    (?<said> (?1) ) 
                )* 
            )
        ['"]
    )
    /xg;

use Data::Dumper;
print Dumper( \%- );

Now I get what I expected:

Matched ['Amelia said "I am a camel"']!
$VAR1 = {
          'said' => [
                      '\'Amelia said "I am a camel"\'',
                      '"I am a camel"'
                    ]
        };
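The second entry shows up because the pattern now contains two groups named said, and %- keeps one slot for every group that shares a name; the recursion itself is not what adds it. A minimal sketch of that behavior, with no recursion involved (the name x and the sample string are made up for illustration):

```perl
use v5.10;

# Two separate groups share the name "x"; no recursion here.
'ab' =~ /(?<x>a)(?<x>b)/ or die "no match";

# %+ reports only the leftmost defined group with that name.
my $leftmost = $+{x};

# %- holds an array ref with one element per same-named group.
my @all = @{ $-{x} };

say "\$+{x} = $leftmost";
say "\$-{x} = @all";
```

Here $+{x} should hold a, while $-{x} should hold both a and b, which mirrors the two-element list in the dump above.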

I thought that I could fix this by wrapping the named capture in a plain capture group, so that (?1) recurses into a group that contains the named capture:

use v5.10;

$_ =<<'HERE';
He said 'Amelia said "I am a camel"'
HERE

say "Matched [$+{said}]!" if m/
    (
        (?<said>
        ['"]
            ( 
                (?: 
                    [^'"]+
                    | 
                    (?1)
                )* 
            )
        ['"]
        )
    )
    /xg;

use Data::Dumper;
print Dumper( \%- );

But, this doesn't catch the smaller substring in said either:

Matched ['Amelia said "I am a camel"']!
$VAR1 = {
          'said' => [
                      '\'Amelia said "I am a camel"\''
                    ]
        };

I think I understand this, but I also know that there are people here who actually touch the C code that makes it happen. :)

And, as I write this, I think I should overload the STORE tie for %- to find out, but then I'd have to find out how to do that.

brian d foy
  • I'm troubled by your $4 and $5; there are only three capturing parens there. Do you somehow think $3 is coming from the PARNO recursion? It's not. – ysth Oct 14 '13 at 01:40
  • Oh, $4 and $5 were left over from other things I was doing. – brian d foy Oct 14 '13 at 02:08
  • try Scala pattern-matching... it might be a bit more predictable. I know it doesn't answer your question, that's why it's only a Comment :-) – Alex R Nov 13 '13 at 01:39
  • Did you already try to play with [Damian Conway's Regexp::Debugger](http://search.cpan.org/~dconway/Regexp-Debugger-0.001016/lib/Regexp/Debugger.pm) and your sophisticated *regexp*? I recommend the trip, it's nice! – F. Hauri Nov 16 '13 at 22:57

1 Answer


After playing around with this, I'm satisfied that what I said in the question is right. Each call to (?PARNO) gets a complete and separate set of the match variables that it discards at the end of its run.
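A small sketch of that behavior, using a deliberately tiny pattern rather than the quote matcher (the sample string and variable names are only for illustration). The first match recurses into group 1 with (?1); the second repeats the group instead, for contrast:

```perl
use v5.10;

# Recursion: (?1) re-runs group 1 against 'b', but whatever group 1
# captured inside the recursion is dropped when the recursion returns.
my ($recursed) = 'ab' =~ /^(\w)(?1)$/;

# Plain repetition: the same group runs twice and keeps its last value.
my ($repeated) = 'ab' =~ /^(?:(\w)){2}$/;

say "recursed: $recursed";
say "repeated: $repeated";
```

The recursive match should leave $1 as a, the value from the outermost level, while the repeated group ends up holding b.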

You can get everything that matched in each subpattern by using an array external to the pattern match operator and pushing onto it at the end of the repeated subpattern, like in this example:

#!/usr/bin/perl
# nested_carat_n.pl

use v5.10;

$_ =<<'HERE';
Outside "Top Level 'Middle Level "Bottom Level" Middle' Outside"
HERE

my @matches;

say "Matched!" if m/
    (?(DEFINE)
        (?<QUOTE_MARK> ['"])
        (?<NOT_QUOTE_MARK> [^'"])
    )
    (
    (?<quote>(?&QUOTE_MARK))
        (?:
            (?&NOT_QUOTE_MARK)++
            |
            (?R)
        )*
    \g{quote}
    )
    (?{ push @matches, $^N })
    /x;

say join "\n", @matches;

I go through it in depth in Chapter 2 of Mastering Perl, which you can read for free (at least for a while).
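If code blocks inside the pattern are not an option, one alternative is to leave the recursion in the regex but do the collecting in ordinary Perl: match the outermost quote, save it, then match again against its contents. This is only a sketch under simplifying assumptions, not the approach used above: it follows just the first quoted string at each level, and the names $text, $quoted, $rest, and @said are made up for illustration.

```perl
use v5.10;

my $text = q(He said 'Amelia said "I am a camel"');

# Group 1: the whole quoted string, including its quote marks.
# Group 2: the opening quote mark, reused as the closing one.
# Group 3: the contents, which may hold nested quotes via (?1).
my $quoted = qr/( (['"]) ( (?: [^'"]++ | (?1) )* ) \2 )/x;

my @said;
my $rest = $text;

# Each pass records one nesting level, then descends into its contents.
while ( $rest =~ $quoted ) {
    push @said, $1;
    $rest = $3;
}

say join "\n", @said;
```

For this input, @said should end up with the outer single-quoted string followed by the inner double-quoted one.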

brian d foy
  • I had hoped someone would magically find an elegant solution to this problem, but alas: Perl regexes have some [annoying limitations](http://stackoverflow.com/q/17039670/1521179) that can't be circumvented without evals. [\*sigh\*](http://www.livememe.com/3nbrsg1). Well, thanks anyway for this truly challenging and (intellectually) entertaining question, and for your efforts towards finding a solution. – amon Nov 21 '13 at 22:18