
I am working on a personal project and am stuck on extracting the text that follows each month abbreviation, up to (but not including) the next month abbreviation.

A sample input text is of the form:

text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"

I expect output of the form:

[ ("apr25, 2016\nblah blah\npow\n"), ("may22, 2017\nasdf rtys\nqwer\n"), ("jan9, 2018\npoiu\nlkjhj yertt") ]

I tried a simple regex, but it is incorrect:

import re

# Greedy version
REGEX_MONTHS_TEXT = re.compile(r'(apr[\w\W]*)|(may[\w\W]*)|(jan[\w\W]*)')
REGEX_MONTHS_TEXT.findall(text)
# output: [('apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt', '', '')]

# Non-Greedy version
REGEX_MONTHS_TEXT = re.compile(r'(apr[\w\W]*?)|(may[\w\W]*?)|(jan[\w\W]*?)')
REGEX_MONTHS_TEXT.findall(text)
# output: [('apr', '', ''), ('', 'may', ''), ('', '', 'jan')]

Can you help me produce the desired output with a Python 3 regex?

Or do I need to write custom Python 3 code to produce the desired output?

martineau
user1290793
  • You even don't know the rules, right? – revo Apr 28 '18 at 15:48
  • I have basic knowledge of regular expressions in Python; I took the Google class on Python regular expressions online a few years back. But I did not know how to stop before the next month abbreviation once I had already matched a month abbreviation and the text following it. – user1290793 Apr 28 '18 at 17:20

1 Answer

The problem was that, after matching a month abbreviation, my regex did not stop before the next month abbreviation.

I referred to "Python RegEx Stop before a Word" and used the tempered greedy token solution described there.

import re

REGEX_MONTHS_TEXT = re.compile(r'(apr|may|jan)((?:(?!apr|may|jan)[\w\W])+)')
text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"

arr = REGEX_MONTHS_TEXT.findall(text)
# arr = [ ('apr', '25, 2016\nblah blah\npow\n'),  ('may', '22, 2017\nasdf rtys\nqwer\n'),  ('jan', '9, 2018\npoiu\nlkjhj yertt')]

# The above arr can be combined using list comprehension to form
# list of singleton tuples as expected in the original question
output = [ (x + y,) for (x, y) in arr ]
# output = [('apr25, 2016\nblah blah\npow\n',), ('may22, 2017\nasdf rtys\nqwer\n',), ('jan9, 2018\npoiu\nlkjhj yertt',)]
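
If plain strings are enough, the recombination step can probably be skipped. The following is a minimal sketch, not taken from the linked answer: it uses the same tempered greedy token but with non-capturing groups only, so findall() returns each whole match instead of a tuple of groups. The names REGEX_MONTHS_BLOCKS, blocks and singletons are made up for this illustration.

```python
import re

text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"

# Hypothetical name. Same tempered greedy token as above, but every group is
# non-capturing, so findall() yields the full matched string for each block.
REGEX_MONTHS_BLOCKS = re.compile(r'(?:apr|may|jan)(?:(?!apr|may|jan)[\w\W])+')

blocks = REGEX_MONTHS_BLOCKS.findall(text)
# blocks == ['apr25, 2016\nblah blah\npow\n',
#            'may22, 2017\nasdf rtys\nqwer\n',
#            'jan9, 2018\npoiu\nlkjhj yertt']

# Wrap each string in a one-element tuple only if that shape is really needed.
singletons = [(block,) for block in blocks]
```

This also shows why an extra capturing group inside the lookahead adds an element to each findall() tuple: findall() returns one entry per capturing group, whether or not the group sits inside a lookahead.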

Additional resource on the tempered greedy token: "Tempered Greedy Token - What is different about placing the dot before the negative lookahead"
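
For comparison, a different technique that should give the same result on the sample input is a lazy match bounded by a positive lookahead. This is only a sketch under the assumption that each block ends either right before the next month abbreviation or at the end of the string; MONTH, REGEX_LAZY_BLOCKS and lazy_blocks are names invented for the example.

```python
import re

text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"

# Hypothetical helper pattern listing the month abbreviations to split on.
MONTH = r'(?:apr|may|jan)'

# Match a month, then as little text as possible ([\w\W]*? is lazy), stopping
# where the lookahead sees the next month or the end of the string (\Z).
REGEX_LAZY_BLOCKS = re.compile(MONTH + r'[\w\W]*?(?=' + MONTH + r'|\Z)')

lazy_blocks = REGEX_LAZY_BLOCKS.findall(text)
# lazy_blocks == ['apr25, 2016\nblah blah\npow\n',
#                 'may22, 2017\nasdf rtys\nqwer\n',
#                 'jan9, 2018\npoiu\nlkjhj yertt']
```

Both this and the tempered greedy token will also split on a month abbreviation that appears inside ordinary words in the body text (for example "maybe"). If the dates always look like apr25, requiring a digit after the month, e.g. (?:apr|may|jan)(?=\d), is one possible guard.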

user1290793
  • You may reduce lookaheads into one: `(apr|may|jan)((?:(?!apr|may|jan)[\w\W])+)` – revo Apr 28 '18 at 17:23
  • I had tried `(apr|may|jan)((?:(?!(apr|may|jan))[\w\W])+)`. It produced output similar to the above, but with one extra empty element in each tuple because of the extra capturing group. Thanks for the optimization. – user1290793 Apr 28 '18 at 17:32
  • It shouldn't differ. They are literally equal. – revo Apr 28 '18 at 17:33