At the outset, let me explain that this question is neither about how to capture groups, nor about how to use quantifiers, two features of regex I am perfectly familiar with. It is more of an advanced question for regex lovers who may be familiar with unusual syntax in exotic engines.

Capturing Quantifiers

Does anyone know if a regex flavor allows you to capture quantifiers? By this, I mean that the number of characters matched by quantifiers such as + and * would be counted, and that this number could be used again in another quantifier.

For instance, suppose you wanted to make sure you have the same number of Ls and Rs in this kind of string: LLLRRRRR

You could imagine a syntax such as

L(+)R{\q1}

where the + quantifier for the L is captured, and where the captured number is referred to in the quantifier for the R as {\q1}.

This would be useful to balance the number of {@,=,-,/} in strings such as @@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"

Relation to Recursion

In some cases quantifier capture would elegantly replace recursion, for instance to match a piece of text framed by the same number of Ls and Rs, as in

L(+) some_content R{\q1} 

The idea is presented in some detail on the following page: Captured Quantifiers

It also discusses a natural extension to captured quantifiers: quantifier arithmetic, for occasions when you want to match (3*x + 1) times the number of characters matched earlier.
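
Until such syntax exists, the intended behaviour can be approximated outside the regex. Here is a minimal sketch in Python, assuming a two-pass approach: a first pattern measures the run of Ls, and a second pattern is built with the computed count. The function name `match_scaled` and its `scale` and `offset` parameters are invented for illustration.

```python
import re

def match_scaled(text, scale=1, offset=0):
    """Return True if text is x Ls followed by exactly scale*x + offset Rs."""
    head = re.match(r"L+", text)
    if head is None:
        return False
    x = len(head.group())            # stands in for the "captured quantifier"
    wanted = scale * x + offset      # quantifier arithmetic, done in Python
    # Second pass: the counts are spliced into ordinary {n} quantifiers.
    return re.fullmatch(rf"L{{{x}}}R{{{wanted}}}", text) is not None
```

With the defaults this plays the role of {\q1}; passing scale=3 and offset=1 plays the role of the (3*x + 1) case.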

I am trying to find out if anything like this exists.

Thanks in advance for your insights!!!


Casimir gave a fantastic answer that shows two methods to validate that various parts of a pattern have the same length. However, I wouldn't want to rely on either of those for everyday work. These are really tricks that demonstrate great showmanship. In my mind, these beautiful but complex methods confirm the premise of the question: a regex feature to capture the number of characters that quantifiers (such as + or *) are able to match would make such balancing patterns very simple and extend the syntax in a pleasingly expressive way.

Update 2 (much later)

I found out that .NET has a feature that comes close to what I was asking about, and I have added an answer to demonstrate it.

  • I'm not aware of any regexp engine that allows you to get the count of a quantifier. In general, you can't do arithmetic with regexps. Some regexp engines support recursion, and you can use that to match balanced expressions. See http://www.regular-expressions.info/refrecurse.html
  • I think your best bet would be to capture it as a group and count the characters in your language of choice. – Sam Apr 11 '14 at 01:32
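
Sam's suggestion can be sketched in a few lines. The example is in Python, and the helper name `same_run_lengths` is an assumption made for illustration. The regex only captures the two runs; the comparison happens in ordinary code.

```python
import re

def same_run_lengths(text):
    """True if text is a run of Ls followed by an equally long run of Rs."""
    m = re.fullmatch(r"(L+)(R+)", text)
    # The regex validates the shape; the counting is left to the host language.
    return m is not None and len(m.group(1)) == len(m.group(2))
```

Called on the question's LLLRRRRR, this returns False, since three Ls are followed by five Rs.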

2 Answers


I don't know of a regex engine that can capture a quantifier. However, with PCRE or Perl it is possible to use some tricks to check whether you have the same number of characters. With your example:

@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"

you can check whether @ = - / are balanced with this pattern, which uses the famous Qtax trick (are you ready?): the "possessive-optional self-referencing group"

pattern details:

~                          # pattern delimiter
(?<!@)                     # negative lookbehind used as an @ boundary
(                          # first capturing group for the @
    (?:                    # open a non-capturing group
        @                  # one @
        (?=                # checks that each @ is followed by the same number
                           # of = - /  
            [^=]*          # all that is not an =
            (\2?+=)        # The possessive optional self-referencing group:
                           # capture group 2: backreference to itself + one = 
            [^-]*(\3?+-)   # the same for -
            [^/]*(\4?+/)   # the same for /
        )                  # close the lookahead
    )+                     # close the non-capturing group and repeat
)                          # close the first capturing group
(?!@)                      # negative lookahead used as an @ boundary too.

# this checks the boundaries for all groups

The main idea

The non-capturing group contains only one @. Each time this group is repeated a new character is added in capture groups 2, 3 and 4.

The possessive-optional self-referencing group

How does it work?

( (?: @ (?= [^=]* (\2?+ = ) .....) )+ )

At the first occurrence of the @ character, capture group 2 is not yet defined, so you cannot write something like (\2 =): the undefined backreference would make the pattern fail. The way around this is to make the backreference optional: \2?

The second aspect of this group is that the number of = characters matched grows by one at each repetition of the non-capturing group, since one = is added each time. To ensure that this number always increases (or the pattern fails), the possessive quantifier forces the backreference to be matched first, before the new = character is added.

Note that this group can be read as: if group 2 exists, then match it, followed by the next =

( (?(2)\2) = )
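
To watch the counter grow, here is a plain-Python simulation of the lookahead, one repetition per @. It is only a sketch of the logic, not the PCRE pattern itself: as far as I know, Python's built-in re module rejects a backreference to a group that is still open, so the trick cannot be written there directly. The function name `grow_capture` and its parameters are invented for illustration.

```python
def grow_capture(text, opener="@", symbol="="):
    # Returns the successive contents of the self-referencing group,
    # or None if some opener cannot be paired with one more symbol.
    captured = ""          # the group is unset before the first repetition
    history = []
    pos = 0
    while pos < len(text) and text[pos] == opener:
        # [^=]* : skip ahead to the first symbol after this opener
        start = text.find(symbol, pos + 1)
        if start == -1:
            return None
        # \2?+= : the previous capture (kept possessively) plus one more symbol
        candidate = captured + symbol
        if not text.startswith(candidate, start):
            return None
        captured = candidate
        history.append(captured)
        pos += 1
    return history
```

Note that two @ followed by three = still succeeds here, with the capture stopping at two characters. This is why the pattern needs a separate boundary check after the repeated group.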

The recursive way


You need to use overlapped matches, since you will use the @ part several times; this is why the whole pattern is inside lookarounds.

pattern details:

(?<!@)                # left @ boundary
(?=                   # open a lookahead (to allow overlapped matches)
    (                 # open a capturing group
        @             # one @
        (?>           # open an atomic group
            [^@=]+    # all that is not an @ or an =, one or more times
          |           # OR
            (?-1)     # recursion: the last defined capturing group (the current here)
        )*            # repeat zero or more the atomic group
        =             #
    )                 # close the capture group
    (?!=)             # checks the = boundary
)                     # close the lookahead
(?=(@(?>[^@-]+|(?-1))*-)(?!-))  # the same for -
(?=(@(?>[^@/]+|(?-1))*/)(?!/))  # the same for /

The main difference from the previous pattern is that this one doesn't care about the order of the = - and / groups. (However, you can easily make some changes to the first pattern to deal with that, using character classes and negative lookaheads.)

Note: For the example string, to be more specific, you can replace the negative lookbehind with an anchor (^ or \A). And if you want to obtain the whole string as the match result, you must add .* at the end (otherwise the match result will be empty, as playful noticed).

  • Thank you very much for your beautiful answer---it must have taken quite some time to compose! I am very impressed and need some time to study the method. By the way this method you generously laid out seems to prove three things: 1. that one can "torture" PCRE into performing this task, 2. that captured quantifiers would be a lovely way to make such tasks trivial in comparison to the olympics-grade gymnastics currently required (would you agree?), and 3. how smart some people are. Thank you for the education. This will probably be the regex highlight of my year.
  To throw something into the mix: for more precision, after matching the @s, instead of the final boundary check with a lookahead, we can match exactly what we want: (\s"[^"]+")\s\2(?5)\s\3(?5)\s\4(?5)$
  • This question has been added to the Stack Overflow Regular Expressions FAQ, under "Advanced Regex-Fu".
  Two small corrections seem needed in the (brilliant) recursive version. First, at the moment the expression does not match anything because after doing the lookaheads no matching is done. At the very least we could add an ugly dot-star at the end, or better, match exactly what we want: ^@+(\s"[^"]+")\s=+(?4)\s-+(?4)\s/+(?4)$ The second tweak is that the atomic group does not need to be repeated (or so it seems to me), so we can drop the *. While we're at it we can start the regex with a ^(?xm) instead of the boundary. What do you think?
  Also, given the original question, if you edit your first paragraph to indicate that even though the task can be done, given the complexity of the task, some form of quantifier capture in the regex syntax would indeed greatly facilitate matters, I believe your answer will really stand out as an example of an amazing, complete answer. :)
  • @playful: The reader can build his own opinion, no need to write this. I am not here to militate, it is your (and Jeff) fight. – Casimir et Hippolyte Apr 12 '14 at 21:05
  • About possible changes to the patterns: I didn't try to be too precise here. This is a description of possible ways (in general) to compare quantities. Indeed, for the example string you don't need the quantifier `*` for atomic groups, but keep in mind that this is only for two reasons: 1) symbols are consecutive, 2) there is at least one character between groups of symbols. I think that anyone who reads this post and understands the pattern is able to adapt it to a specific situation.
  • @Casimir Of course I will pick your brilliant answer, but don't you think the recursive regex would look better if it consumed some characters after the lookarounds? Right now there is nothing to match (just lookarounds). If you replace the (? – zx81 Apr 12 '14 at 21:30
  • In order to reassure everyone, I would like to state that I have never participated to a gloubiboulga party! – Casimir et Hippolyte Apr 12 '14 at 21:35
  • [LMGTFY](http://translate.google.com/translate?sl=auto&tl=en&js=y&prev=_t&hl=en&ie=UTF-8&u=http%3A%2F%2Ffr.wikipedia.org%2Fwiki%2FGloubi-boulga&edit-text=) :) – aliteralmind Apr 12 '14 at 22:02
  I added a note about the empty match. About `@+(\s"[^"]+")\s=+(?4)\s-+(?4)\s/+(?4)$` and "the ugly `.*`". If you don't need to match something else after, using a precise description of the string is useless and will give more work to the regex engine than a simple `.*`. If you need to continue the pattern after this part a full description is obviously more secure.
  • @CasimiretHippolyte "I have never participated to a gloubiboulga party" Friendly monsters? Yes, it's paradise here. Thank you again for your beautiful answer, which is an instant classic. – zx81 Apr 12 '14 at 22:45

Coming back five weeks later because I learned that .NET has something that comes very close to the idea of "quantifier capture" mentioned in the question. The feature is called "balancing groups".

Here is the solution I came up with. It looks long, but it is quite simple.


How does it work?

  1. The first non-capturing group matches the @ characters. In that non-capturing group, we have three named groups c1, c2 and c3 that don't match anything, or rather, that match an empty string. These groups will serve as three counters c1, c2 and c3. Because .NET keeps track of intermediate captures when a group is quantified, every time an @ is matched, a capture is added to the capture collections for Groups c1, c2 and c3.

  2. Next, [^@=]+ eats up all the characters up to the first =.

  3. The second quantified group (?<-c1>=)+ matches the = characters. That group seems to be named -c1, but -c1 is not a group name. -c1 is .NET syntax to pop one capture from the c1 group's capture collection into the ether. In other words, it allows us to decrement c1. If you try to decrement c1 when the capture collection is empty, the match fails. This ensures that we can never have more = than @ characters. (Later, we'll have to make sure that we cannot have more @ than = characters.)

  4. The next steps repeat steps 2 and 3 for the - and / characters, decrementing counters c2 and c3.

  5. The [^/]+ eats up the rest of the string.

  6. The (?(c1)(?!)) is a conditional that says "If group c1 has been set, then fail". You may know that (?!) is a common trick to force a regex to fail. This conditional ensures that c1 has been decremented all the way to zero: in other words, there cannot be more @ than = characters.

  7. Likewise, the (?(c2)(?!)) and (?(c3)(?!)) ensure that there cannot be more @ than - and / characters.

I don't know about you, but even if this is a bit long, I find it really intuitive.
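
The counting logic in the steps above can also be sketched outside .NET. The following Python is a rough simulation, not a translation of the regex: lists stand in for the capture collections of c1, c2 and c3, and a plain pattern (an assumption of this sketch) only splits the string into its runs of symbols. The names `SHAPE` and `balanced` are invented for illustration.

```python
import re

# Assumed overall shape: a run of @, then runs of =, - and /,
# each run followed by at least one character of other text.
SHAPE = re.compile(r"(@+)[^@=]+(=+)[^=-]+(-+)[^-/]+(/+)[^/]+")

def balanced(text):
    m = SHAPE.fullmatch(text)
    if m is None:
        return False
    ats, *runs = m.groups()
    # Step 1: every @ pushes one empty capture onto c1, c2 and c3.
    counters = [[""] * len(ats) for _ in runs]
    for stack, run in zip(counters, runs):
        for _ in run:
            # Step 3: each symbol pops one capture; popping from an
            # empty collection makes the whole match fail.
            if not stack:
                return False
            stack.pop()
    # Steps 6 and 7: a leftover capture means more @ than that symbol.
    return all(not stack for stack in counters)
```

Run on the example string from the question, with four of each symbol, this returns True; removing one = or adding one - makes it return False.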

