103

This is the second part of a series of educational regex articles. It shows how lookaheads and nested references can be used to match the non-regular languge anbn. Nested references are first introduced in: How does this regex find triangular numbers?

One of the archetypal non-regular languages is:

L = { anbn: n > 0 }

This is the language of all non-empty strings consisting of some number of a's followed by an equal number of b's. Examples of strings in this language are ab, aabb, aaabbb.

This language can be show to be non-regular by the pumping lemma. It is in fact an archetypal context-free language, which can be generated by the context-free grammar S → aSb | ab.

Nonetheless, modern day regex implementations clearly recognize more than just regular languages. That is, they are not "regular" by formal language theory definition. PCRE and Perl supports recursive regex, and .NET supports balancing groups definition. Even less "fancy" features, e.g. backreference matching, means that regex is not regular.

But just how powerful is this "basic" features? Can we recognize L with Java regex, for example? Can we perhaps combine lookarounds and nested references and have a pattern that works with e.g. String.matches to match strings like ab, aabb, aaabbb, etc?

References

Linked questions

Wiktor Stribiżew
  • 484,719
  • 26
  • 302
  • 397
polygenelubricants
  • 348,637
  • 121
  • 546
  • 611
  • 4
    This series was started by permission of some in the community (http://meta.stackexchange.com/questions/62695/permission-to-start-a-series-of-advanced-regex-articles). If the reception is good, I plan to continue to cover other more advanced as well as more basic features of regex. – polygenelubricants Sep 04 '10 at 22:52
  • Part 3: http://stackoverflow.com/questions/3664881/how-does-this-java-regex-detect-palindromes – polygenelubricants Sep 08 '10 at 05:43
  • Wow, I never knew Java's regexs won't restricted to regular expressions. I guess that explains why I've always thought they won't fully implemented. What I mean is that there are no complement, difference, or product operators built into Java Regexs, but that makes sense since they aren't restricted to Regular Languages. – Lan Nov 05 '13 at 13:53
  • This question has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Advanced Regex-Fu". – aliteralmind Apr 10 '14 at 01:38

3 Answers3

147

The answer is, needless to say, YES! You can most certainly write a Java regex pattern to match anbn. It uses a positive lookahead for assertion, and one nested reference for "counting".

Rather than immediately giving out the pattern, this answer will guide readers through the process of deriving it. Various hints are given as the solution is slowly constructed. In this aspect, hopefully this answer will contain much more than just another neat regex pattern. Hopefully readers will also learn how to "think in regex", and how to put various constructs harmoniously together, so they can derive more patterns on their own in the future.

The language used to develop the solution will be PHP for its conciseness. The final test once the pattern is finalized will be done in Java.


Step 1: Lookahead for assertion

Let's start with a simpler problem: we want to match a+ at the beginning of a string, but only if it's followed immediately by b+. We can use ^ to anchor our match, and since we only want to match the a+ without the b+, we can use lookahead assertion (?=…).

Here is our pattern with a simple test harness:

function testAll($r, $tests) {
   foreach ($tests as $test) {
      $isMatch = preg_match($r, $test, $groups);
      $groupsJoined = join('|', $groups);
      print("$test $isMatch $groupsJoined\n");
   }
}
 
$tests = array('aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb');
 
$r1 = '/^a+(?=b+)/';
#          └────┘
#         lookahead

testAll($r1, $tests);

The output is (as seen on ideone.com):

aaa 0
aaab 1 aaa
aaaxb 0
xaaab 0
b 0
abbb 1 a

This is exactly the output we want: we match a+, only if it's at the beginning of the string, and only if it's immediately followed by b+.

Lesson: You can use patterns in lookarounds to make assertions.


Step 2: Capturing in a lookahead (and f r e e - s p a c i n g mode)

Now let's say that even though we don't want the b+ to be part of the match, we do want to capture it anyway into group 1. Also, as we anticipate having a more complicated pattern, let's use x modifier for free-spacing so we can make our regex more readable.

Building on our previous PHP snippet, we now have the following pattern:

$r2 = '/ ^ a+ (?= (b+) ) /x';
#             │   └──┘ │
#             │     1  │
#             └────────┘
#              lookahead
 
testAll($r2, $tests);

The output is now (as seen on ideone.com):

aaa 0
aaab 1 aaa|b
aaaxb 0
xaaab 0
b 0
abbb 1 a|bbb

Note that e.g. aaa|b is the result of join-ing what each group captured with '|'. In this case, group 0 (i.e. what the pattern matched) captured aaa, and group 1 captured b.

Lesson: You can capture inside a lookaround. You can use free-spacing to enhance readability.


Step 3: Refactoring the lookahead into the "loop"

Before we can introduce our counting mechanism, we need to do one modification to our pattern. Currently, the lookahead is outside of the + repetition "loop". This is fine so far because we just wanted to assert that there's a b+ following our a+, but what we really want to do eventually is assert that for each a that we match inside the "loop", there's a corresponding b to go with it.

Let's not worry about the counting mechanism for now and just do the refactoring as follows:

  • First refactor a+ to (?: a )+ (note that (?:…) is a non-capturing group)
  • Then move the lookahead inside this non-capturing group
    • Note that we must now "skip" a* before we can "see" the b+, so modify the pattern accordingly

So we now have the following:

$r3 = '/ ^ (?: a (?= a* (b+) ) )+ /x';
#          │     │      └──┘ │ │
#          │     │        1  │ │
#          │     └───────────┘ │
#          │       lookahead   │
#          └───────────────────┘
#           non-capturing group

The output is the same as before (as seen on ideone.com), so there's no change in that regard. The important thing is that now we are making the assertion at every iteration of the + "loop". With our current pattern, this is not necessary, but next we'll make group 1 "count" for us using self-reference.

Lesson: You can capture inside a non-capturing group. Lookarounds can be repeated.


Step 4: This is the step where we start counting

Here's what we're going to do: we'll rewrite group 1 such that:

  • At the end of the first iteration of the +, when the first a is matched, it should capture b
  • At the end of the second iteration, when another a is matched, it should capture bb
  • At the end of the third iteration, it should capture bbb
  • ...
  • At the end of the n-th iteration, group 1 should capture bn
  • If there aren't enough b to capture into group 1 then the assertion simply fails

So group 1, which is now (b+), will have to be rewritten to something like (\1 b). That is, we try to "add" a b to what group 1 captured in the previous iteration.

There's a slight problem here in that this pattern is missing the "base case", i.e. the case where it can match without the self-reference. A base case is required because group 1 starts "uninitialized"; it hasn't captured anything yet (not even an empty string), so a self-reference attempt will always fail.

There are many ways around this, but for now let's just make the self-reference matching optional, i.e. \1?. This may or may not work perfectly, but let's just see what that does, and if there's any problem then we'll cross that bridge when we come to it. Also, we'll add some more test cases while we're at it.

$tests = array(
  'aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb', 'aaaaabbb'
);
 
$r4 = '/ ^ (?: a (?= a* (\1? b) ) )+ /x';
#          │     │      └─────┘ | │
#          │     │         1    | │
#          │     └──────────────┘ │
#          │         lookahead    │
#          └──────────────────────┘
#             non-capturing group

The output is now (as seen on ideone.com):

aaa 0
aaab 1 aaa|b        # (*gasp!*)
aaaxb 0
xaaab 0
b 0
abbb 1 a|b          # yes!
aabb 1 aa|bb        # YES!!
aaabbbbb 1 aaa|bbb  # YESS!!!
aaaaabbb 1 aaaaa|bb # NOOOOOoooooo....

A-ha! It looks like we're really close to the solution now! We managed to get group 1 to "count" using self-reference! But wait... something is wrong with the second and the last test cases!! There aren't enough bs, and somehow it counted wrong! We'll examine why this happened in the next step.

Lesson: One way to "initialize" a self-referencing group is to make the self-reference matching optional.


Step 4½: Understanding what went wrong

The problem is that since we made the self-reference matching optional, the "counter" can "reset" back to 0 when there aren't enough b's. Let's closely examine what happens at every iteration of our pattern with aaaaabbb as input.

 a a a a a b b b
↑
# Initial state: Group 1 is "uninitialized".
           _
 a a a a a b b b
  ↑
  # 1st iteration: Group 1 couldn't match \1 since it was "uninitialized",
  #                  so it matched and captured just b
           ___
 a a a a a b b b
    ↑
    # 2nd iteration: Group 1 matched \1b and captured bb
           _____
 a a a a a b b b
      ↑
      # 3rd iteration: Group 1 matched \1b and captured bbb
           _
 a a a a a b b b
        ↑
        # 4th iteration: Group 1 could still match \1, but not \1b,
        #  (!!!)           so it matched and captured just b
           ___
 a a a a a b b b
          ↑
          # 5th iteration: Group 1 matched \1b and captured bb
          #
          # No more a, + "loop" terminates

A-ha! On our 4th iteration, we could still match \1, but we couldn't match \1b! Since we allow the self-reference matching to be optional with \1?, the engine backtracks and took the "no thanks" option, which then allows us to match and capture just b!

Do note, however, that except on the very first iteration, you could always match just the self-reference \1. This is obvious, of course, since it's what we just captured on our previous iteration, and in our setup we can always match it again (e.g. if we captured bbb last time, we're guaranteed that there will still be bbb, but there may or may not be bbbb this time).

Lesson: Beware of backtracking. The regex engine will do as much backtracking as you allow until the given pattern matches. This may impact performance (i.e. catastrophic backtracking) and/or correctness.


Step 5: Self-possession to the rescue!

The "fix" should now be obvious: combine optional repetition with possessive quantifier. That is, instead of simply ?, use ?+ instead (remember that a repetition that is quantified as possessive does not backtrack, even if such "cooperation" may result in a match of the overall pattern).

In very informal terms, this is what ?+, ? and ?? says:

?+

  • (optional) "It doesn't have to be there,"
    • (possessive) "but if it is there, you must take it and not let go!"

?

  • (optional) "It doesn't have to be there,"
    • (greedy) "but if it is you can take it for now,"
      • (backtracking) "but you may be asked to let it go later!"

??

  • (optional) "It doesn't have to be there,"
    • (reluctant) "and even if it is you don't have to take it just yet,"
      • (backtracking) "but you may be asked to take it later!"

In our setup, \1 will not be there the very first time, but it will always be there any time after that, and we always want to match it then. Thus, \1?+ would accomplish exactly what we want.

$r5 = '/ ^ (?: a (?= a* (\1?+ b) ) )+ /x';
#          │     │      └──────┘ │ │
#          │     │          1    │ │
#          │     └───────────────┘ │
#          │         lookahead     │
#          └───────────────────────┘
#             non-capturing group

Now the output is (as seen on ideone.com):

aaa 0
aaab 1 a|b          # Yay! Fixed!
aaaxb 0
xaaab 0
b 0
abbb 1 a|b
aabb 1 aa|bb
aaabbbbb 1 aaa|bbb
aaaaabbb 1 aaa|bbb  # Hurrahh!!!

Voilà!!! Problem solved!!! We are now counting properly, exactly the way we want it to!

Lesson: Learn the difference between greedy, reluctant, and possessive repetition. Optional-possessive can be a powerful combination.


Step 6: Finishing touches

So what we have right now is a pattern that matches a repeatedly, and for every a that was matched, there is a corresponding b captured in group 1. The + terminates when there are no more a, or if the assertion failed because there isn't a corresponding b for an a.

To finish the job, we simply need to append to our pattern \1 $. This is now a back reference to what group 1 matched, followed by the end of the line anchor. The anchor ensures that there aren't any extra b's in the string; in other words, that in fact we have anbn.

Here's the finalized pattern, with additional test cases, including one that's 10,000 characters long:

$tests = array(
  'aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb', 'aaaaabbb',
  '', 'ab', 'abb', 'aab', 'aaaabb', 'aaabbb', 'bbbaaa', 'ababab', 'abc',
  str_repeat('a', 5000).str_repeat('b', 5000)
);
 
$r6 = '/ ^ (?: a (?= a* (\1?+ b) ) )+ \1 $ /x';
#          │     │      └──────┘ │ │
#          │     │          1    │ │
#          │     └───────────────┘ │
#          │         lookahead     │
#          └───────────────────────┘
#             non-capturing group

It finds 4 matches: ab, aabb, aaabbb, and the a5000b5000. It takes only 0.06s to run on ideone.com.


Step 7: The Java test

So the pattern works in PHP, but the ultimate goal is to write a pattern that works in Java.

public static void main(String[] args) {
 
        String aNbN = "(?x) (?:  a  (?= a* (\\1?+ b))  )+ \\1";
        String[] tests = {
                "",      // false
                "ab",    // true
                "abb",   // false
                "aab",   // false
                "aabb",  // true
                "abab",  // false
                "abc",   // false
                repeat('a', 5000) + repeat('b', 4999), // false
                repeat('a', 5000) + repeat('b', 5000), // true
                repeat('a', 5000) + repeat('b', 5001), // false
        };
        for (String test : tests) {
                System.out.printf("[%s]%n  %s%n%n", test, test.matches(aNbN));
        }
 
}
 
static String repeat(char ch, int n) {
        return new String(new char[n]).replace('\0', ch);
}

The pattern works as expected (as seen on ideone.com).


And now we come to the conclusion...

It needs to be said that the a* in the lookahead, and indeed the "main + loop", both permit backtracking. Readers are encouraged to confirm why this is not a problem in terms of correctness, and why at the same time making both possessive would also work (though perhaps mixing mandatory and non-mandatory possessive quantifier in the same pattern may lead to misperceptions).

It should also be said that while it's neat that there's a regex pattern that will match anbn, this is in not always the "best" solution in practice. A much better solution is to simply match ^(a+)(b+)$, and then compare the length of the strings captured by groups 1 and 2 in the hosting programming language.

In PHP, it may look something like this (as seen in ideone.com):

function is_anbn($s) {
   return (preg_match('/^(a+)(b+)$/', $s, $groups)) &&
      (strlen($groups[1]) == strlen($groups[2]));
}

The purpose of this article is NOT to convince readers that regex can do almost anything; it clearly can't, and even for the things it can do, at least partial delegation to the hosting language should be considered if it leads to a simpler solution.

As mentioned at the top, while this article is necessarily tagged [regex] for stackoverflow, it is perhaps about more than that. While certainly there's value in learning about assertions, nested references, possessive quantifier, etc, perhaps the bigger lesson here is the creative process by which one can try to solve problems, the determination and hard work that it often requires when you're subjected to various constraints, the systematic composition from various parts to build a working solution, etc.


Bonus material! PCRE recursive pattern!

Since we did bring up PHP, it needs to be said that PCRE supports recursive pattern and subroutines. Thus, following pattern works for preg_match (as seen on ideone.com):

$rRecursive = '/ ^ (a (?1)? b) $ /x';

Currently Java's regex does not support recursive pattern.


Even more bonus material! Matching anbncn !!

So we've seen how to match anbn which is non-regular, but still context-free, but can we also match anbncn, which isn't even context-free?

The answer is, of course, YES! Readers are encouraged to try to solve this on their own, but the solution is provided below (with implementation in Java on ideone.com).

^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $

Community
  • 1
  • 1
polygenelubricants
  • 348,637
  • 121
  • 546
  • 611
  • No doubt there may be mistakes/typos in this long answer, so please leave feedbacks as comments so I can revise them on my own. – polygenelubricants Sep 04 '10 at 22:55
  • Great job. It'll take me a while to read over, but the very last line is basically impossible to read; it is such small font. ------ Oh wait. Is that a `feature`?.... Not sure if it's a good idea. I know what the last symbol is, but it can't be read (aside from copy pasting it). – Peter Ajtai Sep 04 '10 at 23:54
  • 6
    @Peter: Highlight the small text and copy and paste into something else. It's made difficult to read on purpose: it's a spoiler, the solution to the bonus puzzle. – polygenelubricants Sep 04 '10 at 23:56
  • 8
    +1: Fantastic explanation, these "Advanced articles" are brilliant ideas. – Callum Rogers Sep 06 '10 at 16:24
  • great article. You asked for feedback... you mention java regexes in the title, but then use PHP. I'm confused by that. Does PHP's preg_match() use java regexes? – LarsH Sep 28 '10 at 21:26
  • 1
    @LarsH PHP's `preg_match()` is an example of [PCRE](http://www.pcre.org/ "Perl Compatible Regular Expressions"). Java regexes [seem to be based on an older version of Perl regexps](http://stackoverflow.com/a/1448213/1337). Which means that PHP regexes are more powerful than the version in Java. As of [2013-02-21](https://en.wikipedia.org/wiki/ISO_8601), [pcre.txt](http://www.pcre.org/pcre.txt) states that it *corresponds approximately with Perl 5.12*. While Perl is currently at 5.16, with 5.18 a few months off. (There actually hasn't been much added to regexes in that time) – Brad Gilbert Feb 21 '13 at 17:46
  • Fantastic explanation, have 100 rep for me :) Thanks again for your wonderful posts about regular expressions! – ComFreek Nov 05 '13 at 15:14
  • Since lookaround doesn't affect which languages can be matched by regular expressions, could this be written as a regex without lookarounds? – gatinueta Nov 27 '13 at 20:54
  • This is a great answer. Thank you. – Witiko Dec 15 '15 at 23:10
20

Given that no mention has been made of PCRE supporting recursive patterns, I'd just like to point out the simplest and most efficient example of PCRE that describes the language in question:

/^(a(?1)?b)$/
NikiC
  • 95,987
  • 31
  • 182
  • 219
jaytea
  • 1,651
  • 13
  • 19
  • +1 wow, I didn't know PCRE supports recursive pattern (I'm still learning! Every day!). I've revised the article to accommodate this information. I don't think recursive pattern can match `a^n b^n c^n`, though. – polygenelubricants Sep 06 '10 at 14:36
  • It should be noted this option is simpler, but not as good as the posted answer - the recursion overflows on long strings. – Kobi Sep 06 '10 at 18:49
  • @Kobi This depends on your definition of "good". For example the recursive solution is around one order of magnitude faster than the other one (http://codepad.viper-7.com/CWgy7c). And it is far easier to understand. The recursive solution is pretty much the direct transformation of the grammar into a regex (actually you could just write it in the grammarized form, it would work). – NikiC Sep 19 '11 at 16:04
  • 1
    @polygeniclubricants, you can match that pattern with two recursive patterns, one to consume `a`s and `b`s without capturing (and verifies there are the same amount w/recursion), followed by a capturing regex that greedily consumes all a's, and then applies the recursive pattern to consume and verify that there are the same number of `b`s and `c`s. The regex is: `/^(?=(a(?-1)?b)c)a+(b(?-1)?c)$/x`. Credit to: http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html – Josh Reback Aug 13 '17 at 23:39
13

As mentioned in the question — with .NET balancing group, the patterns of the type anbncndn…zn can be matched easily as

^
  (?<A>a)+
  (?<B-A>b)+  (?(A)(?!))
  (?<C-B>c)+  (?(B)(?!))
  ...
  (?<Z-Y>z)+  (?(Y)(?!))
$

For example: http://www.ideone.com/usuOE


Edit:

There is also a PCRE pattern for the generalized language with recursive pattern, but a lookahead is needed. I don't think this is a direct translation of the above.

^
  (?=(a(?-1)?b))  a+
  (?=(b(?-1)?c))  b+
  ...
  (?=(x(?-1)?y))  x+
     (y(?-1)?z)
$

For example: http://www.ideone.com/9gUwF

kennytm
  • 469,458
  • 94
  • 1,022
  • 977
  • 1
    @poly: Thanks :). Actually I am not familiar with .NET patterns, but for this kind of patterns it turns out to be very easy with balancing groups, so I supplement this answer. – kennytm Sep 06 '10 at 16:23
  • can you do this with recursive pattern? Because if you can't, that's an interesting twist that balancing group can do things that recursive pattern can't. (And yes, I very much appreciate the supplement). – polygenelubricants Sep 06 '10 at 16:23
  • by the way, the reason why I omitted .NET solution was because I do have plans for "How can we match `a^n b^n` with .NET regex?" article in the future, but you're more than welcome to write it if you want. I'm not doing these articles just for myself; I do want to encourage others to do it as well to have good content on the site. – polygenelubricants Sep 06 '10 at 16:28
  • Please update if you figure out a way to do it with recursive patterns. I played around with balancing groups to capture words whose lengths make a Fibonacci series, and couldn't get it to work. It may be possible using look-around, similar to what I've done. – Kobi Sep 06 '10 at 18:17
  • Somebody needs to write a tongue-in-cheek RE language with expressions like `a(?!#@&*b)`... which translates to, 'What in tarnation is an `a` doing there? It's supposed to be a `b`!' – LarsH Sep 28 '10 at 21:29
  • 1
    I'd just like to point out that the PCRE version of this pattern is slightly flawed as it matches if the next chunk of characters is longer than the previous. See here: https://regex101.com/r/sdlRTm/1 You need to add `(?!b)`, `(?!c)`, etc. after the capture groups like so: https://regex101.com/r/sdlRTm/2 – jaytea Nov 19 '17 at 12:53