-3

I have a DataFrame df with a column (textline) that contains text:

df['textline'].iloc[0] = 'This is a test with 2018\n'
df['textline'].iloc[1] = 'This is a test with Jan 2018\n'
df['textline'].iloc[2] = 'This is a test with Feb 2018\n'

I want to use a regex with extractall to run through the entire df['textline'], but extract the year only when it is not preceded by a month name. In the example above, it should extract 2018 from the first line, but not from the second or third line, because there the year follows Jan or Feb (or any other month).

df['textline'].str.extractall(r'<<Regex code>>')
  • Possible duplicate of [Reference - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) – pault Oct 10 '18 at 22:52
  • That is what I am looking for: the regex pattern to do this. – John Wong Oct 10 '18 at 22:58
  • In addition, please extend the regex code to not pick up lines with content such as January 2018, Jan 2018, etc. I only want the lines with pure 2018 (no months) – John Wong Oct 10 '18 at 22:59
  • E.g. here is the attempt: df['textline'].str.extractall(r' ^(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (\d{4})\D') – John Wong Oct 10 '18 at 23:02

2 Answers

0

I figured out the first part of the answer:

df['textline'].str.extractall(r'(?<!Jan|Feb) ([1-2][0-9]{3})')

The second part is how to extend the same line to the full month names (January, February) so that it works for both Feb 2018 and February 2018.
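One possible way to cover the full month names is sketched below. Python's re module only accepts fixed-width lookbehinds, so alternatives of different lengths cannot share one lookbehind; the sketch chains one negative lookbehind per name instead. The MONTHS list, the sample frame, and the variable names are illustrative assumptions, and the match is case-sensitive and does not cover variants such as "Sept".

```python
import re

import pandas as pd

# Sample data modelled on the question, plus one line with a full month name.
df = pd.DataFrame({
    "textline": [
        "This is a test with 2018\n",
        "This is a test with Jan 2018\n",
        "This is a test with Feb 2018\n",
        "This is a test with February 2018\n",
    ]
})

# Illustrative list of English month names (an assumption, not from the post).
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
# Add the three-letter abbreviations and drop duplicates such as "May".
names = sorted(set(MONTHS) | {m[:3] for m in MONTHS})

# One fixed-width negative lookbehind per name, placed before the space.
not_after_month = "".join(f"(?<!{re.escape(name)})" for name in names)
pattern = rf"{not_after_month} ([12][0-9]{{3}})\b"

# extractall returns one row per match, indexed by (original row, match number).
years = df["textline"].str.extractall(pattern)
print(years)
```

With the sample data, only the first row should yield a year; the rows containing Jan, Feb, or February are skipped.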

-1

You could try this:

(?<=(\s))\d{4}(?=\D)

Matches:

This is a test with 2018\n

This is a test with Jan 2018\n

This is a test with Feb 2018\n
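The list above can be checked with a short script. This is only a sketch using the standard re module on the three sample lines from the question; the variable names are illustrative. It collects the full match (group 0), because the capturing group in the pattern sits inside the lookbehind and holds the whitespace rather than the year.

```python
import re

# Pattern from this answer: four digits after whitespace and before a non-digit.
pattern = re.compile(r"(?<=(\s))\d{4}(?=\D)")

lines = [
    "This is a test with 2018\n",
    "This is a test with Jan 2018\n",
    "This is a test with Feb 2018\n",
]

# Collect the whole match for every line.
matched = [m.group(0) for line in lines for m in pattern.finditer(line)]
print(matched)
```

The year is found on every line, including the two with a month name, so this pattern would still need a month check to meet the requirement in the question.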