I would like to parse codetags in source files. I wrote this regex that works fine with PCRE:

(?<tag>(?&TAG)):\s*
(?<message>.*?)
(
<
   (?<author>(?:\w{3}\s*,\s*)*\w{3})?\s*
   (?<date>(?&DATE))?
   (?<flags>(?&FLAGS))?
>
)?
$

(?(DEFINE)
   (?<TAG>\b(NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG))
   (?<DATE>\d{4}-\d{2}-\d{2})
   (?<FLAGS>[pts]:\w+\b)
)

Unfortunately, it seems Python's `re` module doesn't understand the DEFINE block (https://regex101.com/r/qH1uG3/1#pcre).

What is the best workaround in Python?

Micah Elliott
nowox
  • Given that you don't use any of those definitions more than once, why not just put them in-line? – jonrsharpe Feb 11 '15 at 09:13
  • @TimPietzcker Yes I did. And obviously I understood that Python doesn't support DEFINE statements but maybe there are workarounds like the way of writing it (such as `(?P)` instead of `(?)`) – nowox Feb 11 '15 at 09:16
  • Ah, sorry. I thought you already knew that Python doesn't support subregex definitions. Sorry for the snooty comment. I guess the only workaround would be to make use of the [`regex` module (PyPI)](https://pypi.python.org/pypi/regex). – Tim Pietzcker Feb 11 '15 at 09:18
  • @TimPietzcker I am coming from Perl and I am painfully living the transition shock :) – nowox Feb 11 '15 at 09:20
  • Yeah, Python's `re` module is kind of outdated. I hope that the `regex` module will soon make it into the core language. – Tim Pietzcker Feb 11 '15 at 09:21
  • @TimPietzcker Thank you for the new `regex` module information ! – nowox Feb 11 '15 at 09:23
  • The new regex module supports recursion. I would give that a try. – HamZa Feb 11 '15 at 09:24
  • ...although I'm still searching the docs whether something like `DEFINE` is supported there. Haven't found it yet... – Tim Pietzcker Feb 11 '15 at 09:25
  • @TimPietzcker As [explained here](http://stackoverflow.com/a/18151617), the define part is basically just an IF statement that's always false. I think we could write [something like this](https://regex101.com/r/lA9cX0/1) but I don't have an environment to test this quickly right now. – HamZa Feb 11 '15 at 09:33
  • Another option is string concatenation (or interpolation, if available), which is pretty much the way complex regexes should be built in languages that don't support the "subroutine call" feature in regex. – nhahtdh Feb 11 '15 at 10:10

3 Answers

The way with the regex module:

As explained in the comments, the regex module allows named subpatterns to be reused. Unfortunately there is no (?(DEFINE)...) syntax like in Perl or PCRE.

So the workaround is the same as in Ruby: put a {0} quantifier after each named subpattern you want to define, so that it is declared but never matched at that position:

import regex

s = r'''
// NOTE: A small example
// HACK: Another example <ABC 2014-02-03>
// HACK: Another example <ABC,DEF 2014-02-03>
// HACK: Another example <ABC,DEF p:0>
'''

p = r'''
    # subpattern definitions
    (?<TAG> \b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG) ){0}
    (?<DATE> \d{4}-\d{2}-\d{2} ){0}
    (?<FLAGS> [pts]:\w+ ){0}

    # main pattern
    (?<tag> (?&TAG) ) : \s*
    (?<message> (?>[^\s<]+[^\n\S]+)* [^\s<]+ )? \s* # to trim the message
    (?:                                   # metadata block is optional, as in the question
        <
        (?<author> (?: \w{3} \s* , \s* )*+ \w{3} )? \s*
        (?<date> (?&DATE) )?
        (?<flags> (?&FLAGS) )?
        >
    )?
    $
'''

rgx = regex.compile(p, regex.VERBOSE | regex.MULTILINE)

for m in rgx.finditer(s):
    print(m.group('tag'))

Note: the subpatterns can be defined at the end of the pattern too.

Casimir et Hippolyte
  • One thing I dislike about this solution is that it pollutes `groups/groupdict` with never-matching subpatterns; however, if you're only interested in specific named groups, this is a non-issue. BTW, `regex` supports direct `m['tag']`, no need for `.group('tag')`. – georg Feb 11 '15 at 18:41
  • @georg: Yes, but note that if you use `(?(DEFINE)...)` in PHP, or any capture in a branch that fails or is never tested, you will have the same problem. Another way is to use a C-style format string with placeholders, but that is not handy when you have many different subpatterns, since in that case you don't use names at all. – Casimir et Hippolyte Feb 11 '15 at 20:33
  • For very complex regexes I personally prefer "mini-grammars" (e.g. [here](http://stackoverflow.com/a/28043277/989121)) - this is vendor-independent and has no nasty side effects. – georg Feb 12 '15 at 09:22
(?P<tag>\b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)):\s*
(?P<message>.*?)
(
<
   (?P<author>(?:\w{3}\s*,\s*)*\w{3})?\s*
   (?P<date>\d{4}-\d{2}-\d{2})?
   (?P<flags>[pts]:\w+\b)?
>
)?
$

As a workaround, you can simply inline each subpattern definition at the place where it is used. See the demo:

https://regex101.com/r/qH1uG3/2
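
For reference, here is a minimal sketch of how the inlined pattern might be used with the standard `re` module. The name `CODETAG`, the sample text, and the use of `re.VERBOSE | re.MULTILINE` are assumptions for illustration, not part of the answer above; the outer group is also made non-capturing so that it does not add an unnamed group.

```python
import re

# Inlined pattern; re.VERBOSE lets it keep its multi-line layout,
# re.MULTILINE makes $ match at the end of every line.
CODETAG = re.compile(r'''
    (?P<tag>\b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)):\s*
    (?P<message>.*?)
    (?:
        <
        (?P<author>(?:\w{3}\s*,\s*)*\w{3})?\s*
        (?P<date>\d{4}-\d{2}-\d{2})?
        (?P<flags>[pts]:\w+\b)?
        >
    )?
    $
''', re.VERBOSE | re.MULTILINE)

# Hypothetical sample input, modelled on the examples in this thread.
sample = '''\
// NOTE: A small example
// HACK: Another example <ABC 2014-02-03>
// HACK: Another example <ABC,DEF p:0>
'''

matches = [m.groupdict() for m in CODETAG.finditer(sample)]
for fields in matches:
    print(fields)
```

Note that the lazy `.*?` keeps the whitespace before `<` in `message`, so you may want to call `.strip()` on it.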

vks
  • Sure, it's a working solution, but it doesn't really help. Is there any alternative to the DEFINE statement other than inlining everything? – nowox Feb 11 '15 at 09:17
  • @Coin Yes, you can build your `re` on the fly. Store the definition of `tag` in a variable and build the pattern with something like `re.compile(r"aasdsad"+tag+"asdsad")`. – vks Feb 11 '15 at 09:18
  • I was afraid of this answer. This is the solution to my issue that I didn't want to see. – nowox Feb 11 '15 at 09:21
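
The on-the-fly construction suggested in these comments could look roughly like the sketch below. The variable names `TAG`, `DATE`, `FLAGS`, and `CODETAG`, and the test line, are made up for illustration. Plain concatenation is used rather than `str.format` or f-strings because regex quantifiers such as `{3}` would otherwise need their braces doubled.

```python
import re

# Reusable fragments, playing the role of the (?(DEFINE)...) block.
TAG = r'\b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)'
DATE = r'\d{4}-\d{2}-\d{2}'
FLAGS = r'[pts]:\w+\b'

# Adjacent string literals are joined first, then "+" splices in the fragments.
CODETAG = re.compile(
    r'(?P<tag>' + TAG + r'):\s*'
    r'(?P<message>.*?)'
    r'(?:<'
    r'(?P<author>(?:\w{3}\s*,\s*)*\w{3})?\s*'
    r'(?P<date>' + DATE + r')?'
    r'(?P<flags>' + FLAGS + r')?'
    r'>)?$',
    re.MULTILINE,
)

# Hypothetical input line.
m = CODETAG.search('# TODO: refactor parser <ABC 2015-02-11>')
print(m.group('tag'), m.group('author'), m.group('date'))
```

Each fragment is defined once and can be reused in as many patterns as needed, which is the main thing DEFINE provides.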

As a quick fix, place your defines in a dict:

defines = {
    'TAG': r'\b(NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)',
    'DATE': r'\d{4}-\d{2}-\d{2}',
    'FLAGS': r'[pts]:\w+\b'
}

and replace them in your regex:

regex = re.sub(r'\(\?&(\w+)\)', lambda m: defines[m.group(1)], regex)

If your defines refer to other defines (but never to themselves, which would make the loop run forever), wrap that in a loop:

define = r'\(\?&(\w+)\)'
while re.search(define, regex):
    regex = re.sub(define, lambda m: defines[m.group(1)], regex)

A not-so-quick fix is to write your own re parser-compiler, but that's almost certainly overkill for the task at hand.
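
Putting the pieces above together, a self-contained sketch might look like this. The helper name `expand_defines`, the `max_passes` guard, and the shortened `SOURCE` pattern are assumptions added for illustration. Each substituted fragment is wrapped in `(?:...)` so that a quantifier following `(?&NAME)` applies to the whole fragment.

```python
import re

DEFINES = {
    'TAG': r'\b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)',
    'DATE': r'\d{4}-\d{2}-\d{2}',
    'FLAGS': r'[pts]:\w+\b',
}

CALL = re.compile(r'\(\?&(\w+)\)')  # a PCRE-style call such as (?&TAG)


def expand_defines(pattern, defines, max_passes=10):
    """Replace every (?&NAME) with defines[NAME], repeating for nested calls."""
    passes = 0
    while CALL.search(pattern):
        if passes == max_passes:
            raise ValueError('defines nested too deeply or self-referential')
        # A function replacement is inserted literally (no backslash processing).
        pattern = CALL.sub(lambda m: '(?:' + defines[m.group(1)] + ')', pattern)
        passes += 1
    return pattern


# Shortened, hypothetical version of the question's pattern.
SOURCE = r'(?P<tag>(?&TAG)):\s*(?P<message>.*?)(?:<(?P<date>(?&DATE))?>)?$'
CODETAG = re.compile(expand_defines(SOURCE, DEFINES), re.MULTILINE)

m = CODETAG.search('// FIXME: off by one <2015-02-11>')
print(m.group('tag'), m.group('date'))  # FIXME 2015-02-11
```

An unknown name raises `KeyError`, and the pass limit turns an accidental self-reference into an error instead of an endless loop.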

georg
  • The pattern passed to re.sub, `\(\?&(\w+)\)`, may need to be different for different tags. This will break in those cases, I guess. – vks Feb 11 '15 at 09:44
  • @vks: not sure what you mean here... care to elaborate? – georg Feb 11 '15 at 09:57
  • This model will work if re.sub catches all tags correctly. So all tags have to be of a uniform type, so that one generic function can catch and replace them. But at other times they can be different. – vks Feb 11 '15 at 10:03