Can I ignore zero-length regex match in python when searching for compass directions?

Question

Inform 7 text adventure code can heavily feature directions such as north, south, west, east, northwest, southwest, southeast, and northeast. I am developing a code verifying script, and one of its tasks is to find instances of these words. My first try used brute force:

import re

sample_line = 'The westerly barn is a room. The field is east of the barn. \
  The stable is northeast of the field. The forest is northwest of the field.'

# note: this could be generated by combining north/south/'' with east/west/'', but that's another exercise.
x = [ 'north', 'south', 'east', 'west', 'northwest', 'southwest', 'southeast', 'northeast' ]

regstr = r'\b({0})\b'.format('|'.join(x))

print(re.findall(regstr, sample_line))

This worked and gave me what I wanted: [ 'east', 'northeast', 'northwest' ] while ignoring westerly.
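For reference, the generation mentioned in the code comment could look roughly like this. It is a minimal sketch that uses `itertools.product` rather than `zip`; the names `directions` and `pattern` are illustrative and not part of my script:

```python
import re
from itertools import product

# Combine each north/south/'' prefix with each east/west/'' suffix,
# then drop the one empty combination ('' + '').
directions = [ns + ew
              for ns, ew in product(('north', 'south', ''), ('east', 'west', ''))
              if ns or ew]

# Same shape as the brute-force pattern: \b(alt1|alt2|...)\b
pattern = r'\b({0})\b'.format('|'.join(directions))

sample_line = ('The westerly barn is a room. The field is east of the barn. '
               'The stable is northeast of the field. '
               'The forest is northwest of the field.')

print(re.findall(pattern, sample_line))
```

The order of the alternatives does not matter here, because the trailing `\b` forces the engine to backtrack until a whole word is matched.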

I wanted to use a bit of symmetry to cut down the regex some more. But I noticed my preferred way left open the possibility of a zero-length match. So I came up with this:

regstr2 = r'\b(north|south|(north|south)?(east|west))\b'

print(sample_line)
print([x[0] for x in re.findall(regstr2, sample_line)])

This worked, but it felt inelegant.

My third try, with help from this link, was:

regstr3 = r'(?=.)(\b(north|south)?(east|west)?\b)'

print(sample_line)
print([x[0] for x in re.findall(regstr3, sample_line)])

This got the three directions I want, but it also got a lot of zero-length matches I'd hoped to ignore, even with the recommended (?=.).
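To make the problem concrete: (?=.) only requires that one more character follows, so every word boundary in the line still qualifies as an empty match. A minimal sketch of the filtering workaround I'd like to avoid (`all_matches` and `non_empty` are illustrative names):

```python
import re

regstr3 = r'(?=.)(\b(north|south)?(east|west)?\b)'

sample_line = ('The westerly barn is a room. The field is east of the barn. '
               'The stable is northeast of the field. '
               'The forest is northwest of the field.')

# Collect the full text of every match, including the zero-length ones.
all_matches = [m.group(0) for m in re.finditer(regstr3, sample_line)]

# Workaround: discard the empty strings after the fact.
non_empty = [m for m in all_matches if m]

print(len(all_matches) - len(non_empty), 'zero-length matches discarded')
print(non_empty)
```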

Is there a way Python could get a variant of regstr3 to work? While there are obvious workarounds, it would be pleasing to have a tidy regex without a lot of repetitions and similar words.

I think your second attempt is as far as you can get in terms of repeating as little as possible. If this were PCRE you could have used "recurse sub pattern", but this is python :(. — Sweeper, Aug 06 '19 at 03:53

score 1 · Accepted Answer · answered Aug 06 '19 at 07:30

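You may restrict the word boundaries: let the initial word boundary match only at the start of a word by adding (?<!\w) after it, and let the trailing word boundary match only at the end of a word by adding (?!\w) after it. A zero-length match would need a position that is both the start and the end of a word, which cannot occur, so only non-empty matches remain: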

\b(?<!\w)((?:north|south)?(?:east|west)?)\b(?!\w)

See the regex demo

Pattern details

\b(?<!\w) - a word boundary that has no word char immediately on the left
((?:north|south)?(?:east|west)?) - Capturing group 1:
- (?:north|south)? - an optional substring, either north or south
- (?:east|west)? - an optional substring, either east or west
\b(?!\w) - a word boundary that has no word char immediately on the right.
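The breakdown above can also be written as a commented pattern using `re.VERBOSE`. This is a minimal sketch; the name `DIRECTION_RX` and the extra test strings are illustrative and not part of the original pattern:

```python
import re

DIRECTION_RX = re.compile(r"""
    \b(?<!\w)             # word boundary with no word char on the left (start of word)
    (                     # capturing group 1: the whole direction
        (?:north|south)?  #   optional north or south
        (?:east|west)?    #   optional east or west
    )
    \b(?!\w)              # word boundary with no word char on the right (end of word)
""", re.VERBOSE)

# Whole-word directions are found; punctuation after them is fine.
print(DIRECTION_RX.findall('Go north, then southwest.'))

# Words that merely start with a direction produce no match, not even an empty one.
print(DIRECTION_RX.findall('northeastern westerly'))
```

Since both optional groups can be empty, it is the two restricted boundaries that rule out zero-length results: no position is both the start and the end of a word.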

Python demo:

import re
rx = r"\b(?<!\w)((?:north|south)?(?:east|west)?)\b(?!\w)"
s = "The westerly barn is a room. The field is east of the barn.   The stable is northeast of the field. The forest is northwest of the field."
print(re.findall(rx, s))  # ['east', 'northeast', 'northwest']
