I was always under the impression that you couldn't use repetition quantifiers in zero-width assertions (Perl Compatible Regular Expressions [PCRE]). However, I recently discovered that you can use them in lookahead assertions.

How does the PCRE regex engine work when searching with zero-width lookbehinds, and what about that precludes repetition quantifiers from being used?

Here is a simple example using PCRE in R:

# Our string
x <- 'MaaabcccM'

##  Does it contain a 'b', preceded by an 'a' and followed by zero or more 'c',
##  then an 'M'?
grepl( '(?<=a)b(?=c*M)' , x , perl = TRUE )
# [1] TRUE

##  Does it contain a 'b': (1) preceded by an 'M' and then zero or more 'a' and
##                         (2) followed by zero or more 'c' then an 'M'?
grepl( '(?<=Ma*)b(?=c*M)' , x , perl = TRUE )
# Error in grepl("(?<=Ma*)b(?=c*M)", x, perl = TRUE) :
#   invalid regular expression '(?<=Ma*)b(?=c*M)'
# In addition: Warning message:
# In grepl("(?<=Ma*)b(?=c*M)", x, perl = TRUE) : PCRE pattern compilation error
#         'lookbehind assertion is not fixed length'
#         at ')b(?=c*M)'
  • Yes, only lookahead assertions can be variable length. The one exception to this is the special `\K` code which is a special form of a lookbehind assertion that can be variable. So in your second example the following would work in perl: `/a*\Kb(?=c*)/`. *Obviously it's a little meaningless to use an assertion that can be zero width, so perhaps using `+` would make for a better example* – Miller May 30 '14 at 22:45
  • Because variable length look-behind assertions are a pain in the @$$ when a regex engine needs to backtrack. – mob May 30 '14 at 22:48
  • @mob Can you explain **why** they're more of a pain to deal with than variable-length lookahead assertions? From a naive point of view, both operations will involve looking at the same number of characters, right? (I know that must be wrong, but how so?) – Josh O'Brien May 30 '14 at 22:51
  • The paragraph beginning "The bad news" on [this page](http://www.regular-expressions.info/lookaround.html) may hint at the reason. It sounds like regular expression engines can really only work forward, so that look-behind assertions are actually matched by stepping back `n` characters, and examining them from their first character. With a variable-length lookbehind assertion, you can't know `n` in advance, which would mean you'd have to test over and over and over again, once for each possible beginning character in the string. Can some regex wizard plz confirm whether this is +/- correct? – Josh O'Brien May 30 '14 at 22:59
  • @JoshO'Brien that sort of makes sense. Thanks for the link. – Simon O'Hanlon May 30 '14 at 23:45
  • And [it looks like](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075) languages differ in whether or not they allow variable length lookbehinds (though most don't). (Search that linked page for "Lookbehind limits" to quickly find what I'm referring to.) – Josh O'Brien May 30 '14 at 23:51
  • The notion of " 'b' preceded by zero or more 'a' " is rather ridiculous since it will always be satisfied. "b" is either preceded by at least one "a" .. or not, so being preceded by zero "a", means the condition is vacuous. Likewise for zero or more "c" following it. – IRTFM May 31 '14 at 03:57
  • One more link backing up my initial guess: https://mail.mozilla.org/pipermail/es-discuss/2012-March/021373.html – Josh O'Brien Jun 04 '14 at 06:38
  • Has your question been answered? If so, post it as an answer and accept it. – mareoraft Jun 06 '14 at 01:11
  • @JoshO'Brien: I've posted a small bounty on the question. If you feel confident about your guess, do you mind expanding it into an answer? :) Thanks! – Amal Murali Jul 06 '14 at 03:27
  • @AmalMurali -- Thanks for adding that very well-placed bounty. The answers it's elicited are very useful! – Josh O'Brien Jul 08 '14 at 14:43

The ultimate answer to such a question is in the engine's code. At the bottom of this answer you can dive into the section of the PCRE engine's code responsible for ensuring fixed length in lookbehinds, if you're interested in the finest details. In the meantime, let's gradually zoom into the question from higher levels.

Variable-Width Lookbehind vs. Infinite-Width Lookbehind

First off, a quick clarification on terms. A growing number of engines (including PCRE) support some form of variable-width lookbehind, where the variation falls within a determined range, for instance:

  • the engine knows that the width of what precedes must be within 5 to 10 characters (not supported in PCRE)
  • the engine knows that the width of what precedes must be either 5 or 10 characters (supported in PCRE)

In contrast, in infinite-width lookbehind, you can use quantified tokens such as a+.

Engines that Support Infinite-Width Lookbehind

For the record, these engines support infinite lookbehind:

  • .NET (C#, VB.NET etc.)
  • Matthew Barnett's regex module for Python
  • JGSoft (EditPad etc.; not available in a programming language).

As far as I know, they are the only ones.

Variable Lookbehind in PCRE

In PCRE, the most relevant section in the documentation is this:

The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several top-level alternatives, they do not all have to have the same fixed length.

Therefore, the following lookbehind is valid:

(?<=a |big )cat

However, none of these are:

  • (?<=a\s?|big )cat (the sides of the alternation do not have a fixed width)
  • (?<=@{1,10})cat (variable width)
  • (?<=\R)cat (\R does not have a fixed-width as it can match \n, \r\n, etc.)
  • (?<=\X)cat (\X does not have a fixed-width as a Unicode grapheme cluster can contain a variable number of bytes.)
  • (?<=a+)cat (clearly not fixed)

Lookbehind with Zero-Width Match but Infinite Repetition

Now consider this:

(?<=(?=@++))

On the face of it, this is a fixed-width lookbehind, because it can only ever find a zero-width match (defined by the lookahead (?=@++)). Is that a trick to get around the infinite lookbehind limitation?

No. PCRE will choke on this. Even though the content of the lookbehind is zero-width, PCRE will not allow infinite repetition in the lookbehind. Anywhere. When the documentation says all the strings it matches must have a fixed length, it should really be:

All the strings that any of its components matches must have a fixed length.

Workarounds: Life without Infinite Lookbehind

In PCRE, the two main solutions to problems where infinite lookbehinds would help are \K and capture groups.

Workaround #1: \K

The \K assertion tells the engine to drop what was matched so far from the final match it returns.

Suppose you want (?<=@+)cat#+, which is not legal in PCRE. Instead, you can use:

@+\Kcat#+

Workaround #2: Capture Groups

Another way to proceed is to match whatever you would have placed in a lookbehind, and to capture the content of interest in a capture group. You then retrieve the match from the capture group.

For instance, instead of the illegal (?<=@+)cat#+, you would use:

@+(cat#+)

In R, this could look like this:

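subject <- "@@@cat###"   # sample input, made up for illustration
# Match the prefix together with the captured part (perl = TRUE selects PCRE)
matches <- regexpr("@+(cat#+)", subject, perl = TRUE)
# Build a match object that points at capture group 1 rather than the whole match
result <- attr(matches, "capture.start")[, 1]
attr(result, "match.length") <- attr(matches, "capture.length")[, 1]
regmatches(subject, result)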

In languages that don't support \K, this is often the only solution.
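
Python's built-in `re` module is one such case: it also rejects unbounded quantifiers inside a lookbehind and has no `\K`. Here is a minimal sketch of the capture-group approach there. The sample string and variable names are made up for illustration, and `re` is a different engine from PCRE, so this only shows the idea:

```python
import re

subject = "@@@cat###"  # made-up sample input

# A variable-length lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=@+)cat#+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Match the prefix normally and capture only the part of interest.
match = re.search(r"@+(cat#+)", subject)
result = match.group(1) if match else None

print(lookbehind_ok)
print(result)
```

The prefix `@+` is consumed as part of the overall match, so unlike a true lookbehind it cannot overlap with a previous match when you search repeatedly.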

Engine Internals: What Does the PCRE Code Say?

The ultimate answer is to be found in pcre_compile.c. If you examine the code block that starts with this comment:

If lookbehind, check that this branch matches a fixed-length string

You find that the grunt work is done by the find_fixedlength() function.

I reproduce it here for anyone who would like to dive into further details.

static int
find_fixedlength(pcre_uchar *code, BOOL utf, BOOL atend, compile_data *cd)
{
int length = -1;

register int branchlength = 0;
register pcre_uchar *cc = code + 1 + LINK_SIZE;

/* Scan along the opcodes for this branch. If we get to the end of the
branch, check the length against that of the other branches. */

for (;;)
  {
  int d;
  pcre_uchar *ce, *cs;
  register pcre_uchar op = *cc;

  switch (op)
    {
    /* We only need to continue for OP_CBRA (normal capturing bracket) and
    OP_BRA (normal non-capturing bracket) because the other variants of these
    opcodes are all concerned with unlimited repeated groups, which of course
    are not of fixed length. */

    case OP_CBRA:
    case OP_BRA:
    case OP_ONCE:
    case OP_ONCE_NC:
    case OP_COND:
    d = find_fixedlength(cc + ((op == OP_CBRA)? IMM2_SIZE : 0), utf, atend, cd);
    if (d < 0) return d;
    branchlength += d;
    do cc += GET(cc, 1); while (*cc == OP_ALT);
    cc += 1 + LINK_SIZE;
    break;

    /* Reached end of a branch; if it's a ket it is the end of a nested call.
    If it's ALT it is an alternation in a nested call. An ACCEPT is effectively
    an ALT. If it is END it's the end of the outer call. All can be handled by
    the same code. Note that we must not include the OP_KETRxxx opcodes here,
    because they all imply an unlimited repeat. */

    case OP_ALT:
    case OP_KET:
    case OP_END:
    case OP_ACCEPT:
    if (length < 0) length = branchlength;
      else if (length != branchlength) return -1;
    if (*cc != OP_ALT) return length;
    cc += 1 + LINK_SIZE;
    branchlength = 0;
    break;

    /* A true recursion implies not fixed length, but a subroutine call may
    be OK. If the subroutine is a forward reference, we can't deal with
    it until the end of the pattern, so return -3. */

    case OP_RECURSE:
    if (!atend) return -3;
    cs = ce = (pcre_uchar *)cd->start_code + GET(cc, 1);  /* Start subpattern */
    do ce += GET(ce, 1); while (*ce == OP_ALT);           /* End subpattern */
    if (cc > cs && cc < ce) return -1;                    /* Recursion */
    d = find_fixedlength(cs + IMM2_SIZE, utf, atend, cd);
    if (d < 0) return d;
    branchlength += d;
    cc += 1 + LINK_SIZE;
    break;

    /* Skip over assertive subpatterns */

    case OP_ASSERT:
    case OP_ASSERT_NOT:
    do cc += GET(cc, 1); while (*cc == OP_ALT);
    cc += PRIV(OP_lengths)[*cc];
    break;

    /* Skip over things that don't match chars */

    case OP_MARK:
    case OP_PRUNE_ARG:
    case OP_SKIP_ARG:
    case OP_THEN_ARG:
    cc += cc[1] + PRIV(OP_lengths)[*cc];
    break;

    case OP_CALLOUT:
    case OP_CIRC:
    case OP_CIRCM:
    case OP_CLOSE:
    case OP_COMMIT:
    case OP_CREF:
    case OP_DEF:
    case OP_DNCREF:
    case OP_DNRREF:
    case OP_DOLL:
    case OP_DOLLM:
    case OP_EOD:
    case OP_EODN:
    case OP_FAIL:
    case OP_PRUNE:
    case OP_REVERSE:
    case OP_RREF:
    case OP_SET_SOM:
    case OP_SKIP:
    case OP_SOD:
    case OP_SOM:
    case OP_THEN:
    cc += PRIV(OP_lengths)[*cc];
    break;

    /* Handle literal characters */

    case OP_CHAR:
    case OP_CHARI:
    case OP_NOT:
    case OP_NOTI:
    branchlength++;
    cc += 2;
#ifdef SUPPORT_UTF
    if (utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
#endif
    break;

    /* Handle exact repetitions. The count is already in characters, but we
    need to skip over a multibyte character in UTF8 mode.  */

    case OP_EXACT:
    case OP_EXACTI:
    case OP_NOTEXACT:
    case OP_NOTEXACTI:
    branchlength += (int)GET2(cc,1);
    cc += 2 + IMM2_SIZE;
#ifdef SUPPORT_UTF
    if (utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
#endif
    break;

    case OP_TYPEEXACT:
    branchlength += GET2(cc,1);
    if (cc[1 + IMM2_SIZE] == OP_PROP || cc[1 + IMM2_SIZE] == OP_NOTPROP)
      cc += 2;
    cc += 1 + IMM2_SIZE + 1;
    break;

    /* Handle single-char matchers */

    case OP_PROP:
    case OP_NOTPROP:
    cc += 2;
    /* Fall through */

    case OP_HSPACE:
    case OP_VSPACE:
    case OP_NOT_HSPACE:
    case OP_NOT_VSPACE:
    case OP_NOT_DIGIT:
    case OP_DIGIT:
    case OP_WORDCHAR:
    case OP_ANY:
    case OP_ALLANY:
    branchlength++;
    cc++;
    break;

    /* The single-byte matcher isn't allowed. This only happens in UTF-8 mode;
    otherwise \C is coded as OP_ALLANY. */

    case OP_ANYBYTE:
    return -2;

    /* Check a class for variable quantification */

    case OP_CLASS:
    case OP_NCLASS:
#if defined SUPPORT_UTF || defined COMPILE_PCRE16 || defined COMPILE_PCRE32
    case OP_XCLASS:
    /* The original code caused an unsigned overflow in 64 bit systems,
    so now we use a conditional statement. */
    if (op == OP_XCLASS)
      cc += GET(cc, 1);
    else
      cc += PRIV(OP_lengths)[OP_CLASS];
#else
    cc += PRIV(OP_lengths)[OP_CLASS];
#endif

    switch (*cc)
      {
      case OP_CRSTAR:
      case OP_CRMINSTAR:
      case OP_CRPLUS:
      case OP_CRMINPLUS:
      case OP_CRQUERY:
      case OP_CRMINQUERY:
      case OP_CRPOSSTAR:
      case OP_CRPOSPLUS:
      case OP_CRPOSQUERY:
      return -1;

      case OP_CRRANGE:
      case OP_CRMINRANGE:
      case OP_CRPOSRANGE:
      if (GET2(cc,1) != GET2(cc,1+IMM2_SIZE)) return -1;
      branchlength += (int)GET2(cc,1);
      cc += 1 + 2 * IMM2_SIZE;
      break;

      default:
      branchlength++;
      }
    break;

    /* Anything else is variable length */

    case OP_ANYNL:
    case OP_BRAPOS:
    case OP_BRAZERO:
    case OP_CBRAPOS:
    case OP_EXTUNI:
    case OP_KETRMAX:
    case OP_KETRMIN:
    case OP_KETRPOS:
    case OP_MINPLUS:
    case OP_MINPLUSI:
    case OP_MINQUERY:
    case OP_MINQUERYI:
    case OP_MINSTAR:
    case OP_MINSTARI:
    case OP_MINUPTO:
    case OP_MINUPTOI:
    case OP_NOTPLUS:
    case OP_NOTPLUSI:
    case OP_NOTQUERY:
    case OP_NOTQUERYI:
    case OP_NOTSTAR:
    case OP_NOTSTARI:
    case OP_NOTUPTO:
    case OP_NOTUPTOI:
    case OP_PLUS:
    case OP_PLUSI:
    case OP_POSPLUS:
    case OP_POSPLUSI:
    case OP_POSQUERY:
    case OP_POSQUERYI:
    case OP_POSSTAR:
    case OP_POSSTARI:
    case OP_POSUPTO:
    case OP_POSUPTOI:
    case OP_QUERY:
    case OP_QUERYI:
    case OP_REF:
    case OP_REFI:
    case OP_DNREF:
    case OP_DNREFI:
    case OP_SBRA:
    case OP_SBRAPOS:
    case OP_SCBRA:
    case OP_SCBRAPOS:
    case OP_SCOND:
    case OP_SKIPZERO:
    case OP_STAR:
    case OP_STARI:
    case OP_TYPEPLUS:
    case OP_TYPEQUERY:
    case OP_TYPESTAR:
    case OP_TYPEUPTO:
    case OP_UPTO:
    case OP_UPTOI:
    return -1;

    /* Catch unrecognized opcodes so that when new ones are added they
    are not forgotten, as has happened in the past. */

    default:
    return -4;
    }
  }
/* Control never gets here */
}
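
For readers who would rather not wade through the opcodes, here is a much-simplified Python sketch of the same idea. It is not PCRE's algorithm: the tuple-based pattern tree and the names `fixed_length`, `lookbehind_lengths` and `lit` are all hypothetical, invented for illustration. It only models the rule quoted from the documentation: each top-level alternative must have a fixed length, while nested alternatives must all have the same length.

```python
def fixed_length(node):
    """Return the fixed match length of a toy pattern node, or None if variable.

    Hypothetical node shapes:
      ("char",)                   exactly one character
      ("repeat", child, lo, hi)   quantifier; hi=None means unbounded
      ("seq", [children])         concatenation
      ("alt", [branches])         nested alternation
    """
    kind = node[0]
    if kind == "char":
        return 1
    if kind == "repeat":
        _, child, lo, hi = node
        if lo != hi:                      # a*, a+, a?, a{1,10}: not fixed
            return None
        inner = fixed_length(child)
        return None if inner is None else inner * lo
    if kind == "seq":
        total = 0
        for child in node[1]:
            length = fixed_length(child)
            if length is None:
                return None
            total += length
        return total
    if kind == "alt":
        lengths = {fixed_length(branch) for branch in node[1]}
        # Every nested branch must be fixed and of the same length.
        if None in lengths or len(lengths) != 1:
            return None
        return lengths.pop()
    raise ValueError(f"unknown node kind: {kind}")


def lookbehind_lengths(top_level_branches):
    """Per-branch lengths if every top-level branch is fixed, else None."""
    lengths = [fixed_length(branch) for branch in top_level_branches]
    return None if None in lengths else lengths


def lit(text):
    """Helper: a literal string as a sequence of single characters."""
    return ("seq", [("char",) for _ in text])


ok = lookbehind_lengths([lit("a "), lit("big ")])              # (?<=a |big )
plus = lookbehind_lengths([("repeat", ("char",), 1, None)])    # (?<=a+)
ranged = lookbehind_lengths([("repeat", ("char",), 1, 10)])    # (?<=@{1,10})
exact = lookbehind_lengths([("repeat", ("char",), 3, 3)])      # (?<=@{3})
print(ok, plus, ranged, exact)
```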
  • A lot of juicy info but still doesn't answer the question: How does the PCRE regex engine work when searching with zero-width look behinds __which precludes repetition quantifiers from being used__? – HamZa Jul 06 '14 at 01:07
  • @HamZa Great to hear from you. I added a section `Lookbehind with Zero-Width Match but Infinite Repetition`, which considers a lookbehind with a zero-width match that contains infinite repetition: `(?<=(?=@+))` Is this what you mean? Not sure as from Simon's examples I couldn't quite tell if that's the general idea (at first it didn't seem so, which is why I had not considered that interesting case). – zx81 Jul 06 '14 at 03:58
  • @HamZa Also, for anyone interested in looking at engine internals, I tracked down the section of `pcre_compile.c` responsible for checking the fixed length of lookbehind strings. It calls the `find_fixedlength()` function, whose code I pasted in case someone wants the ultimate answer to the finest level of detail. – zx81 Jul 06 '14 at 04:16
  • Thank you, @anubhava! :) – zx81 Jul 10 '14 at 07:15

Regex engines are designed to work from left to right.

For lookaheads, the engine simply matches the text to the right of the current position. For lookbehinds, however, the regex engine first determines how many characters to step back, and then checks for the match from there (again left to right).

So, if you use unbounded quantifiers like * or +, the lookbehind won't work because the engine does not know how many steps to go backward.

I'll give an example of how lookbehind works (the example is pretty silly though).

Suppose you want to match the last name Panta, only if the first name is 5-7 characters long.

Let's take the string:

Full name is Subigya Panta.

Consider the regex:

(?<=\b\w{5,7}\b)\sPanta

How the engine works

The engine acknowledges the existence of a positive lookbehind and so it first searches for the word Panta (with a whitespace character before it). It is a match.

Now, the engine looks to match the regex inside the lookbehind. It steps backward 7 characters (as the quantifier is greedy). The word boundary matches the position between space and S. Then it matches all the 7 characters, and then the next word boundary matches the position between a and the space.

The regex inside the lookbehind is a match and thus the whole regex returns true because the matched string contains Panta. (Note that lookaround assertions are zero-width, and do not consume any characters.)
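
As a rough sketch of that step-back procedure, here is a small Python emulation. It assumes the lookbehind asks for a word of 5 to 7 word characters between word boundaries and that the main pattern is `\sPanta`. The function name `lookbehind_range` and the `%d` body template are made up for illustration, and the order of operations follows the description above rather than what any particular engine really does:

```python
import re

def lookbehind_range(body_template, max_width, min_width, text, pos):
    """Return the width at which the body matches just before pos, else None."""
    # Try the largest width first, mirroring a greedy quantifier.
    for width in range(max_width, min_width - 1, -1):
        start = pos - width               # step back `width` characters
        if start < 0:                     # not enough text before pos
            continue
        body = re.compile(body_template % width)
        # Match forward from `start`; the match must end exactly at `pos`.
        if body.fullmatch(text, start, pos):
            return width
    return None

text = "Full name is Subigya Panta."
main = re.search(r"\sPanta", text)        # the part outside the lookbehind
width = lookbehind_range(r"\b\w{%d}\b", 7, 5, text, main.start())

short = "Full name is Al Panta."
short_main = re.search(r"\sPanta", short)
short_width = lookbehind_range(r"\b\w{%d}\b", 7, 5, short, short_main.start())

print(width, short_width)
```

Because the set of widths to try is known up front (5, 6 or 7), the emulation terminates quickly. With an unbounded quantifier there would be no such list to iterate over.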

  • You've got a syntax error there: `{5-7}` should be `{5,7}`. But your explanation only applies to the Java and ICU flavors, which support variable-length lookbehind if the maximum possible length can be determined when the regex is compiled. Your example will also work in .NET, JGSoft and Perl 6 (which place no restrictions at all), but in most flavors it's fixed-length lookbehinds only. – Alan Moore Jul 06 '14 at 02:08
  • @AlanMoore thanks, I made the edit. Agreed, it also works with PCRE and the question was related to PCRE. – DrGeneral Jul 06 '14 at 02:52
  • The PCRE library (which is what R uses when you specify `perl=TRUE`) mimics the behavior of Perl 5. So it supports lookbehinds consisting of multiple fixed-length alternatives (as mentioned in the other answers), but it doesn't permit quantifiers in lookbehinds. – Alan Moore Jul 06 '14 at 04:21

The pcrepattern man page documents the restriction that lookbehind assertions must either be fixed-width, or be several fixed-width patterns separated by |'s, and then explains that this is because:

The implementation of lookbehind assertions is, for each alternative, to temporarily move the current position back by the fixed length and then try to match. If there are insufficient characters before the current position, the assertion fails.
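
A minimal Python sketch of that description, emulating (?<=a |big )cat by hand. The function name `lookbehind` and the (pattern, length) pairs are hypothetical; a real engine derives the lengths from the compiled pattern rather than being told them:

```python
import re

def lookbehind(alternatives, text, pos):
    """True if any fixed-length alternative matches the text ending at pos.

    alternatives: list of (pattern, length) pairs, where each pattern is
    assumed to match exactly `length` characters.
    """
    for pattern, length in alternatives:
        start = pos - length              # move back by the fixed length
        if start < 0:                     # insufficient characters: fail
            continue
        # Try to match forward from `start`, ending exactly at `pos`.
        if re.compile(pattern).fullmatch(text, start, pos):
            return True
    return False

alternatives = [(r"a ", 2), (r"big ", 4)]
text = "a cat, a big cat, the cat"
hits = [m.start() for m in re.finditer("cat", text)
        if lookbehind(alternatives, text, m.start())]
print(hits)
```

Each alternative needs only one step back and one forward match attempt, which is why the restriction keeps lookbehind cheap.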

I'm not sure why they do it this way, but my guess is that they spent a lot of time writing a good backtracking RE-matching engine that runs forward, and they didn't want to duplicate all that effort to write another that runs backwards. The obvious approach would be to run over the string backwards (that's easy) while matching a "reverse" version of your lookbehind assertion. Reversing a "real" (DFA-matchable) RE is possible, since the reverse of a regular language is a regular language, but PCRE's "extended" REs are IIRC Turing complete, and it may not even be possible to flip one around to run backwards efficiently in general. And even if it were, probably no one has actually cared enough to bother. After all, lookbehind assertions are a pretty minor feature in the grand scheme of things.

