I'm learning Python regular expressions and am trying to extract the date strings from the text below.

text= '''CMS Info Systems Pvt. Ltd Ref: CMS/HR/10-11/08/98 30th Aug 2010 Mr. Manohar an P Emp Code Designation DOJ : 46947 : FMS Engineer. : 015' Feb 2008 Chennai Dear Manoharan P, Reg: Acceptance of your Resignation We are in receipt of your resignation letter dated 14h July 2010, tendering your resignation thereof. As requested by you, we have accepted your resignation and you will be relieved of your assignment with us from the close of working hours on 25" Aug 2010'''

I have written the following regex pattern:

matches_text1= re.findall('\d{1,2}[a-z\s\"]*[^0-9a-zA-Z]*((?i)Jan|(?i)Feb|(?i)Mar|(?i)Apr|(?i)May|(?i)Jun|(?i)Jul|(?i)Aug|(?i)Sep|(?i)Oct|(?i)Nov|(?i)Dec|(?i)January|(?i)February|(?i)March|(?i)April|(?i)May|(?i)June|(?i)JULY|(?i)August|(?i)September|(?i)October|(?i)November|(?i)December)[\s]*[^0-9a-zA-Z]*\d{2,4}',text)

When I try the same text with this pattern in the online regex editor https://regex101.com/, it highlights the required text. The four dates below are highlighted correctly:

  1. 30th Aug 2010
  2. 15' Feb 2008
  3. 14h July 2010
  4. 25" Aug 2010

However, when I run the same pattern in Python (in the Spyder IDE), the output is ['Aug', 'Feb', 'July', 'Aug'], i.e., only the month text, without the day and year.

Please tell me what I'm missing.

  • What's with `(?i)` repeated over and over again? It should appear once, at the very start of your regular expression, if you want case insensitivity to apply to the whole expression (did you get a warning about this?). Once it is set, it remains set for the entire expression, and specifying it again is pointless. And if you have it set, `[^0-9a-zA-Z]` should just be `[^0-9a-z]`. – Booboo Aug 17 '20 at 17:26
  • Your text contains both single and double quotes. It needs to be cleaned first. –  Aug 17 '20 at 17:48
  • @Cyber-Tech - This is output from OCR, hence the odd text. – Murugan John Aug 18 '20 at 09:20

2 Answers

Just wrap the whole regex in an enclosing pair of parentheses and take the first element of each match:

import re

text= '''CMS Info Systems Pvt. Ltd Ref: CMS/HR/10-11/08/98 30th Aug 2010 Mr. Manohar an P Emp Code Designation DOJ : 46947 : FMS Engineer. : 015' Feb 2008 Chennai Dear Manoharan P, Reg: Acceptance of your Resignation We are in receipt of your resignation letter dated 14h July 2010, tendering your resignation thereof. As requested by you, we have accepted your resignation and you will be relieved of your assignment with us from the close of working hours on 25" Aug 2010'''

# (?i) is given once at the start; Python 3.11+ raises an error for global flags placed elsewhere in the pattern.
# The outer parentheses are group 1 (the whole date); the month alternation is group 2.
matches= re.findall(r'(?i)(\d{1,2}[a-z\s\"]*[^0-9a-zA-Z]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|JULY|August|September|October|November|December)[\s]*[^0-9a-zA-Z]*\d{2,4})',text)

# Each match is a tuple (whole date, month); print only the whole date.
for match in matches:
    print(match[0])
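To see why indexing with `[0]` works, here is a minimal sketch on a shortened, made-up sample string and a simplified pattern (both are illustrative assumptions, not the original data or regex). With the extra outer parentheses the pattern has two groups, so `re.findall` returns one tuple per match, with the outer group first:

```python
import re

# Shortened sample text, assumed for illustration only.
sample = "30th Aug 2010 and 14h July 2010"

# Simplified pattern: group 1 is the whole date, group 2 is the month.
pattern = r"(?i)(\d{1,2}[a-z]*\s+(Aug|July)\s+\d{4})"

tuples = re.findall(pattern, sample)
print(tuples)  # [('30th Aug 2010', 'Aug'), ('14h July 2010', 'July')]

# Keep only the first element of each tuple, i.e. the whole date.
whole_dates = [t[0] for t in tuples]
print(whole_dates)  # ['30th Aug 2010', '14h July 2010']
```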
Kunal Kukreja

According to the documentation for `re.findall`:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
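The effect of groups on the return value can be illustrated with a small sketch. The sample string and the three simplified patterns below are assumptions made up for this demonstration; only the number of capturing groups differs between them:

```python
import re

# Made-up sample string for illustration.
sample = "12 Aug 2010 and 3 Feb 2008"

# No capturing groups: each result is the whole match.
no_groups = re.findall(r"\d{1,2} (?:Aug|Feb) \d{4}", sample)

# One capturing group: each result is only that group's text.
one_group = re.findall(r"\d{1,2} (Aug|Feb) \d{4}", sample)

# Two capturing groups: each result is a tuple of the groups.
two_groups = re.findall(r"(\d{1,2}) (Aug|Feb) \d{4}", sample)

print(no_groups)   # ['12 Aug 2010', '3 Feb 2008']
print(one_group)   # ['Aug', 'Feb']
print(two_groups)  # [('12', 'Aug'), ('3', 'Feb')]
```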

Your regex is:

\d{1,2}[a-z\s\"]*[^0-9a-zA-Z]*((?i)Jan|(?i)Feb|(?i)Mar|(?i)Apr|(?i)May|(?i)Jun|(?i)Jul|(?i)Aug|(?i)Sep|(?i)Oct|(?i)Nov|(?i)Dec|(?i)January|(?i)February|(?i)March|(?i)April|(?i)May|(?i)June|(?i)JULY|(?i)August|(?i)September|(?i)October|(?i)November|(?i)December)[\s]*[^0-9a-zA-Z]*\d{2,4}

Your Group 1 is just the month (Jan, Feb, ... or Dec). If you make this a non-capturing group by using (?: ... ) rather than ( ... ), the pattern has no capture groups, so findall falls back to Group 0, which is the entire match:

import re

text= '''CMS Info Systems Pvt. Ltd Ref: CMS/HR/10-11/08/98 30th Aug 2010 Mr. Manohar an P Emp Code Designation DOJ : 46947 : FMS Engineer. : 015' Feb 2008 Chennai Dear Manoharan P, Reg: Acceptance of your Resignation We are in receipt of your resignation letter dated 14h July 2010, tendering your resignation thereof. As requested by you, we have accepted your resignation and you will be relieved of your assignment with us from the close of working hours on 25" Aug 2010'''

# Raw string (r'...') so that backslashes such as \d reach the regex engine unchanged.
matches_text1= re.findall(r'(?i)\d{1,2}[a-z\s\"]*[^0-9a-z]*(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|JULY|August|September|October|November|December)[\s]*[^0-9a-z]*\d{2,4}',text)

print(matches_text1)

Prints:

['30th Aug 2010', "15' Feb 2008", '14h July 2010', '25" Aug 2010']

So you don't need to define more capture groups but rather fewer (i.e. none). Note that there is only one occurrence of (?i), at the start of the regex, and [^0-9a-zA-Z] has been replaced with [^0-9a-z] because case-insensitivity is in effect.
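If you want both the whole date and the month, one alternative is `re.finditer`, which yields match objects so that Group 0 stays available even when capture groups are present. The following is only a sketch: the shortened sample string, the `MONTHS` helper, and the `date_re` name are assumptions introduced here, and the flag is passed as `re.IGNORECASE` instead of an inline `(?i)`:

```python
import re

# Shortened sample text, assumed for illustration only.
sample = '''letter dated 14h July 2010, relieved on 25" Aug 2010'''

# Hypothetical helper: month names, full names listed before abbreviations.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December|"
          "Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec")

# The month is a capturing group here; case-insensitivity comes from the flag.
date_re = re.compile(
    r'\d{1,2}[a-z\s"]*[^0-9a-z]*(' + MONTHS + r')\s*[^0-9a-z]*\d{2,4}',
    re.IGNORECASE,
)

# group(0) is the entire match; group(1) is the captured month.
dates = [m.group(0) for m in date_re.finditer(sample)]
months = [m.group(1) for m in date_re.finditer(sample)]

print(dates)   # ['14h July 2010', '25" Aug 2010']
print(months)  # ['July', 'Aug']
```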

Booboo