5

I have a regular expression of the form

def parse(self, format_string):
    for m in re.finditer(
        r"""(?: \$ \( ( [^)]+ ) \) )   # the field access specifier
          | (
                (?:
                    \n | . (?= \$ \( ) # any one single character before the '$('
                )
              | (?:
                    \n | . (?! \$ \( ) # any one single character, except the one before the '$('
                )*
            )""",
        format_string,
        re.VERBOSE):
    ...

I would like to replace all the repeating sequences (\$ \() with some custom shorthand "constant", which would look like this:

def parse(self, format_string):
    re.<something>('\BEGIN = \$\(')
    for m in re.finditer(
        r"""(?: \BEGIN ( [^)]+ ) \) )   # the field access specifier
          | (
                (?:
                    \n | . (?= \BEGIN ) # any one single character before the '$('
                )
              | (?:
                    \n | . (?! \BEGIN ) # any one single character, except the one before the '$('
                )*
            )""",
        format_string,
        re.VERBOSE):
    ...

Is there a way to do this with regular expressions themselves (i.e. not using Python's string formatting to substitute \BEGIN with \$\()?

Clarification: the Python source is purely for context and illustration. I'm looking for RE solution, which would be available in some RE dialect (maybe not Python's one), not the solution specifically for Python.

Michael Pankov
  • 3,271
  • 2
  • 20
  • 30
  • No, sorry. What's wrong with string formatting? – Pavel Anossov Aug 09 '13 at 15:04
  • Well, not "portable" in the sense that the resulting regular expression with custom shorthands could be used in different programming languages. – Michael Pankov Aug 09 '13 at 15:05
  • 1
    Many regex dialects do not support verbose regexes. Or assertions. – Pavel Anossov Aug 09 '13 at 15:09
  • 1
    It's possible in PCRE, but I guess you're looking for a Python solution? – Martin Ender Aug 09 '13 at 15:53
  • Well the whole syntax `\BEGIN` is invalid since `\B` is a valid metacharacter. – HamZa Aug 09 '13 at 16:16
  • 2
    @HamZa I think this was just an example for what the OP imagined the reuse syntax could look like. – Martin Ender Aug 09 '13 at 16:19
  • @m.buettner you're right, it's just an example – Michael Pankov Aug 09 '13 at 16:23
  • @m.buettner I'm not looking for Python solution, that's the point. Could you please give more details on PCRE implementation? – Michael Pankov Aug 09 '13 at 16:25
  • The approach that I have used in the past was defining string variables and using string concatenation then to build the string. But that is also not completely portable, at least not between languages using different syntax for string concatenation like `+` or `.`, or `&`. – FrankPl Aug 09 '13 at 16:32
  • @FrankPl the problem with string concatenation is, it doesn't guarantee valid expressions, may cause problems with unescaped meta-characters and most of all, messes up your capturing group count. – Martin Ender Aug 09 '13 at 16:42
  • I agree. It can make complex regexes more readable, however, expressing your intention in the string variable names. For the capturing group count, some regex engines (.net and Java at least) have named capturing groups, thus avoiding all the mess that you have when changing your expression and keeping the capturing group number in sync. – FrankPl Aug 09 '13 at 16:51
  • @FrankPl named groups are orthogonal to the issue at hand. – Michael Pankov Aug 09 '13 at 17:02

1 Answers1

9

I don't think this is possible in Python's regex flavor. You would need recursion (or rather pattern reuse) which is only supported by PCRE. In fact, PCRE even mentions how defining shorthands works in its man page (search for "Defining subpatterns").

In PCRE, you can use the recursion syntax in a similar way to backreferences - except that the pattern is applied again, instead of trying to get the same literal text as from a backreference. Example:

/(\d\d)-(?1)-(?1)/

Matches something like a date (where (?1) will be replaced with with \d\d and evaluated again). This is really powerful, because if you use this construct within the referenced group itself you get recursion - but we don't even need that here. The above also works with named groups:

/(?<my>\d\d)-(?&my)-(?&my)/

Now we're already really close, but the definition is also the first use of the pattern, which somewhat clutters up the expression. The trick is to use the pattern first in a position that is never evaluated. The man pages suggest a conditional that is dependent on a (non-existent) group DEFINE:

/
(?(DEFINE)
  (?<my>\d\d)
)
(?&my)-(?&my)-(?&my)
/x

The construct (?(group)true|false) applies pattern true if the group group was used before, and (the optional) pattern false otherwise. Since there is no group DEFINE, the condition will always be false, and the true pattern will be skipped. Hence, we can put all kinds of definitions there, without being afraid that they will ever be applied and mess up our results. This way we get them into the pattern, without really using them.

And alternative is a negative lookahead that never reaches the point where the expression is defined:

/
(?!
  (?!)     # fail - this makes the surrounding lookahead pass unconditionally
  # the engine never gets here; now we can write down our definitions
  (?<my>\d\d) 
)
(?&my)-(?&my)-(?&my)
/x

However, you only really need this form, if you don't have conditionals, but do have named pattern reuse (and I don't think a flavor like this exists). The other variant has the advantage, that the use of DEFINE makes it obvious what the group is for, while the lookahead approach is a bit obfuscated.

So back to your original example:

/
# Definitions
(?(DEFINE)
  (?<BEGIN>[$][(])
)
# And now your pattern
  (?: (?&BEGIN) ( [^)]+ ) \) ) # the field access specifier
|
  (
    (?: # any one single character before the '$('
      \n | . (?= (?&BEGIN) ) 
    )
  | 
    (?: # any one single character, except the one before the '$('
      \n | . (?! (?&BEGIN) ) 
    )*
  )
/x

There are two major caveats to this approach:

  1. Recursive references are atomic. That is, once the reference has matched something it will never be backtracked into. For certain cases this can mean that you have to be a bit clever in crafting your expression, so that the first match will always be the one you want.
  2. You cannot use capturing inside the defined patterns. If you use something like (?<myPattern>a(b)c) and reuse it, the b will never be captured - when reusing a pattern, all groups are non-capturing.

The most important advantage over any kind of interpolation or concatenation however is, that you can never produce invalid patterns with this, and you cannot mess up your capturing group counts either.

Martin Ender
  • 40,690
  • 9
  • 78
  • 120
  • 2
    This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Control Verbs and Recursion". – aliteralmind Apr 10 '14 at 01:12