14

How would you match a^n b^n c^n for n > 0 with PCRE?

The following cases should match:

abc
aabbcc
aaabbbccc

The following cases should not match:

abbc
aabbc
aabbbccc

Here's what I've tried: /^(a(?1)?b)$/gmx. However, this only matches a^n b^n for n > 0:

ab
aabb
aaabbb

Online demo

Note: This question is the same as this one with the change in language.

HamZa
  • using balancing groups http://www.regular-expressions.info/balancing.html – Max Carroll Apr 25 '15 at 14:24
  • @MaxCarroll PCRE doesn't support balancing groups – HamZa Apr 25 '15 at 14:25
  • in that case it's a good question... I will vote it up – Max Carroll Apr 25 '15 at 14:25
  • I actually think that this could also be solved with the [Qtax trick](http://stackoverflow.com/a/17177790). See also [Capturing Quantifiers and Quantifier Arithmetic](http://stackoverflow.com/questions/23001137/capturing-quantifiers-and-quantifier-arithmetic). Kudos to whoever is able to use it or come up with another trick! – HamZa Apr 25 '15 at 14:38
  • Relevant: [Chapter 2 of polygenelubricants's series of educational regex articles: How can we match a^n b^n with Java regex?](http://stackoverflow.com/q/3644266/3622940) – Unihedron Apr 25 '15 at 14:48

2 Answers

16

Qtax trick

(The mighty self-referencing capturing group)

^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1\2$

This solution is also known as "the Qtax trick" because it uses the same technique as Qtax's answer to "vertical" regex matching in an ASCII "image".


The problem boils down to asserting that three groups of the same length are matched. In simplified form, we want to match:

xyz

where x, y and z are the subpatterns a^n, b^n and c^n, all of the same variable length n. Using a lookahead that contains self-referencing capturing groups, each repetition of the lookahead adds one specified character to each group, which can effectively be used to "count":

aaabbbccc
 ^  ^  ^

This is achieved by the following:

  • (?:a…)+ One character of subpattern a is matched. With (?=a*, we skip directly to the first "counter".
  • (\1?+b) Inside capturing group 1, \1 consumes whatever the group matched on the previous iteration, if anything. The match is possessive, so the engine cannot backtrack and drop it, and the lookahead fails if the counter goes out of sync, that is, when there are fewer characters of subpattern b than of subpattern a. On the first iteration the group is still unset, so \1?+ matches nothing. Then one more character of subpattern b is matched and added to the capturing group, effectively "counting" one more b. With b*, we skip directly to the next "counter".
  • (\2?+c) Inside capturing group 2, \2 consumes whatever the group matched previously, just like above, and one more character of subpattern c is added. Because both groups grow by exactly one character per iteration, their lengths stay in sync with the number of a's matched so far. Assuming continuous sequences of a..b..c..:

(Excuse my art.)

First iteration:

| The first 'a' is matched by the 'a' in '^(?:a…)'.
| The pointer is stuck after it as we begin the lookahead.
v,- Matcher pointer
aaaa...bbbbbbbb...cccc...
 ^^^   |^^^       ^
skipped| skipped  Matched by c in (\2?+c);
by a*  | by b*         \2 was "nothing",
       |               now it is "c".
       Matched by b
       in (\1?+b).
     \1 was "nothing", now it is "b".

Second iteration:

 | The second 'a' is matched by the 'a' in '^(?:a…)'.
 | The pointer is stuck after it as we begin the lookahead.
 v,- Matcher pointer
aaaa...bbbbbbbb...cccc...
       /|^^^      |^
eaten by| skipped |Matched by c in (\2?+c);
\1?+    | by b*   |     '\2' was "c",
  ^^    |      \2?+     now it is "cc".
 skipped|
 by a*  \ Matched by b
          in (\1?+b).
          '\1' was "b", now it is "bb".

As the three groups discussed above each "consume" one of a, b and c respectively, they are matched in round-robin style and "counted" by (?:a…)+, (\1?+b) and (\2?+c). With the anchors and the trailing \1\2, which consume what the groups captured, we can assert that we match xyz, where x, y and z are a^n, b^n and c^n respectively.
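The round-robin counting can also be traced outside of a regex engine. The following Python sketch is not PCRE and not part of the original solution: the function name `matches_by_counting` and the variables `group1` and `group2` are hypothetical stand-ins for the pattern and for \1 and \2, and the code only imitates how the groups grow by one character per matched a.

```python
def matches_by_counting(s: str) -> bool:
    # Hand-written imitation of ^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1\2$
    group1 = ""  # stands in for \1, the "b counter"
    group2 = ""  # stands in for \2, the "c counter"
    pos = 0      # the matcher pointer

    while pos < len(s) and s[pos] == "a":
        pos += 1  # the 'a' in (?:a...) is consumed

        # Lookahead: a* skips the remaining a's without consuming them.
        look = pos
        while look < len(s) and s[look] == "a":
            look += 1

        # (\1?+b): the previous capture plus one more b.
        candidate1 = group1 + "b"
        if not s.startswith(candidate1, look):
            # The lookahead fails; the trailing \1 can never match the
            # remaining 'a', so the overall match fails as well.
            return False
        look += len(candidate1)

        # b* skips the remaining b's.
        while look < len(s) and s[look] == "b":
            look += 1

        # (\2?+c): the previous capture plus one more c.
        candidate2 = group2 + "c"
        if not s.startswith(candidate2, look):
            return False

        group1, group2 = candidate1, candidate2

    # At least one iteration is required, then \1\2 must reach the end ($).
    return group1 != "" and s[pos:] == group1 + group2


for text in ["abc", "aabbcc", "aaabbbccc", "abbc", "aabbc", "aabbbccc"]:
    print(text, matches_by_counting(text))
```

Tracing aabbcc by hand, `group1` goes from "" to "b" to "bb" and `group2` from "" to "c" to "cc", mirroring the two iterations drawn above.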


As a bonus, to "count" more, one can do this:

Pattern: ^(?:a(?=a*(\1?+b)b*(\2?+c)))+\1{3}\2$
Matches: abbbc
         aabbbbbbcc
         aaabbbbbbbbbccc

Pattern: ^(?:a(?=a*(\1?+bbb)b*(\2?+c)))+\1\2$
Matches: abbbc
         aabbbbbbcc
         aaabbbbbbbbbccc
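Both bonus patterns describe strings of the shape a^n b^(3n) c^n. As a quick cross-check of the listed matches, here is a small Python sketch; the helper name `scaled` and its parameter `k` are hypothetical and only build the expected strings, they do not run the PCRE patterns.

```python
def scaled(n: int, k: int = 3) -> str:
    # Build a^n b^(k*n) c^n; k = 3 corresponds to \1{3} or to adding "bbb" per step.
    return "a" * n + "b" * (k * n) + "c" * n


expected = [scaled(n) for n in range(1, 4)]
print(expected)
```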
Unihedron
  • [*Trick naming rant*] Is this really called Qtax Trick? Qtax's answer refers to PolygeneLubricants's answer as the source. Either way, I think "Self referencing capturing group" is clearer. [*/Trick naming rant*] [*Acknowledgement and respect*] Respect to all parties involved, including you Unihedron - great answer! [*/Acknowledgement and respect*] – Kobi Apr 26 '15 at 05:23
11

First, let's explain the pattern you have:

^               # Assert begin of line
    (           # Capturing group 1
        a       # Match a
        (?1)?   # Recurse group 1 optionally
        b       # Match b
    )           # End of group 1
$               # Assert end of line

With the following modifiers:

g: global, match all
m: multiline, match start and end of line with ^ and $ respectively
x: extended, whitespace in the pattern is ignored and comments can be added with #

The recursion part is optional in order to exit the "endless" recursion eventually.
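To see why the optional recursion terminates, it may help to write group 1 as an ordinary recursive function. This is only a Python sketch of the idea, not how PCRE implements recursion; the function name `match_group1` and its return convention (end index, or -1 on failure) are assumptions made for illustration.

```python
def match_group1(s: str, pos: int = 0) -> int:
    # Imitates (a(?1)?b): returns the index just past the match, or -1.
    if pos >= len(s) or s[pos] != "a":
        return -1                      # 'a' is mandatory
    pos += 1

    inner = match_group1(s, pos)       # (?1)? : try to recurse
    if inner != -1:
        pos = inner                    # recursion matched a nested a...b
    # If the recursion failed, it is simply skipped: this is the exit.

    if pos < len(s) and s[pos] == "b":
        return pos + 1                 # closing 'b'
    return -1


def matches_whole_line(s: str) -> bool:
    # The ^ and $ anchors: group 1 must span the entire string.
    return match_group1(s) == len(s)


for text in ["ab", "aabb", "aaabbb", "aab", "abb"]:
    print(text, matches_whole_line(text))
```

Each call consumes one a before recursing, so the recursion depth is bounded by the number of leading a's.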

We could build on the above pattern to solve the problem, but we still need some regex to match the c part. The problem is that once aabb has been matched in aabbcc, it has already been consumed, so we cannot go back and compare the b's against the c's.

The solution? Using lookaheads! Lookaheads are zero-width, which means they don't consume any characters or move the match position forward. Check it out:

^                    # Assert begin of line
    (?=              # First zero-width lookahead
        (            # Capturing group 1
            a        # Match a
            (?1)?    # Recurse group 1 optionally
            b        # Match b
        )            # End of group 1
        c+           # Match c one or more times
    )                # End of the first lookahead

    (?=              # Second zero-width lookahead
        a+           # Match a one or more times
        (            # Capturing group 2
            b        # Match b
            (?2)?    # Recurse group 2 optionally
            c        # Match c
        )            # End of group 2
        $            # Assert end of line, so that no extra c can follow
    )                # End of the second lookahead
a+b+c+               # Match each of a,b and c one or more times
$                    # Assert end of line

Online demo

Basically, we first assert that there's a^n b^n, and then we assert b^n c^n, which together give a^n b^n c^n.
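The same "two assertions on one string" idea can be sketched in Python. Python's standard `re` module has no (?1) recursion, so this sketch swaps the recursion for plain length comparisons. The helper names are hypothetical, and anchoring the second check at the end of the string is an assumption made here so that trailing extra c's are rejected.

```python
import re


def a_n_b_n_then_c(s: str) -> bool:
    # Role of the first lookahead: a^n b^n followed by at least one c.
    m = re.match(r"(a+)(b+)c", s)
    return m is not None and len(m.group(1)) == len(m.group(2))


def a_then_b_n_c_n(s: str) -> bool:
    # Role of the second lookahead: some a's, then b^n c^n up to the end.
    m = re.fullmatch(r"a+(b+)(c+)", s)
    return m is not None and len(m.group(1)) == len(m.group(2))


def is_a_n_b_n_c_n(s: str) -> bool:
    # Both assertions inspect the same string from its start, like lookaheads.
    return a_n_b_n_then_c(s) and a_then_b_n_c_n(s)


for text in ["abc", "aabbcc", "aaabbbccc", "abbc", "aabbc", "aabbbccc"]:
    print(text, is_a_n_b_n_c_n(text))
```

Neither check alone is enough: aabbc passes the first assertion but fails the second, while abbcc fails the first and passes the second.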

HamZa