I have a pandas df column containing both string and date information. I'd like to replace this column with only the date information. I think I can use the df.replace() function along with the regex=True option, but I'm really not sure how to go about doing this. I've tried a few ways, but can't seem to figure out the logic. Any and all help would be greatly appreciated!

Here is an example of the values found in the column:

DATE_COLUMN

Foo 11-2008 Bar

Hip 12-2008 Hop

Some_text 01-2009 some_more_text

I'd like to replace each of the above cells with only the MM-YYYY data.

I also have another df with a similar pattern (but if I can understand how to solve the above, I think I'll be able to figure out this one):

DATE_COLUMN

Foo 12-2007 to 11-2008 Bar

Hip 01-2008 to 12-2008 Hop

Some_text 02-2008 to 01-2009 some_more_text

In this case, I'd want to replace each cell with only the second MM-YYYY element.

Once again, any and all help is greatly appreciated. Thank you!

Wiktor Stribiżew
Jr Analyst

1 Answer

Pull the date text out of each cell with str.extract, then let pd.to_datetime parse it using an explicit format:

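# Keep only the last MM-YYYY token in each cell; expand=False returns a Series.
df["date"] = df["date"].str.extract(r".*(\d\d-\d\d\d\d)", expand=False)
# Parse the extracted text into datetimes (the day defaults to the 1st of the month).
df["date"] = pd.to_datetime(df["date"], format="%m-%Y")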

The .* in the regex is greedy: it consumes as much of the string as it can before the capture group, so when a cell holds two dates only the last MM-YYYY is kept.
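For anyone who wants to try this on the sample values from the question, here is a minimal sketch. The column name "date" follows the code above, and the names `PATTERN` and `last_month_year` are made up for illustration. It also shows one way the replace(..., regex=True) idea from the question could be written, assuming a backreference to the captured group is acceptable.

```python
import pandas as pd

# Sample values copied from the question; the column name "date" is an assumption.
single = pd.DataFrame({"date": [
    "Foo 11-2008 Bar",
    "Hip 12-2008 Hop",
    "Some_text 01-2009 some_more_text",
]})
double = pd.DataFrame({"date": [
    "Foo 12-2007 to 11-2008 Bar",
    "Hip 01-2008 to 12-2008 Hop",
    "Some_text 02-2008 to 01-2009 some_more_text",
]})

# Greedy .* pushes the capture group to the last MM-YYYY token in the cell.
PATTERN = r".*(\d\d-\d\d\d\d)"


def last_month_year(column: pd.Series) -> pd.Series:
    """Return the last MM-YYYY token of each cell as text (hypothetical helper)."""
    # expand=False gives back a Series instead of a one-column DataFrame.
    return column.str.extract(PATTERN, expand=False)


single_text = last_month_year(single["date"])
double_text = last_month_year(double["date"])

# Optional second step: turn the MM-YYYY text into datetimes.
single_dates = pd.to_datetime(single_text, format="%m-%Y")

# Alternative closer to the question's idea: match the whole cell and
# substitute it with the captured group via a backreference.
replaced = double["date"].replace(r"^.*(\d\d-\d\d\d\d).*$", r"\1", regex=True)

print(single_text.tolist())
print(double_text.tolist())
print(replaced.tolist())
```

If you need the column to stay as MM-YYYY text, skip the to_datetime step; once parsed, the values are full timestamps and no longer display in that form.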

noah
I didn't know about the str.extract method! Furthermore, I don't know why you were downvoted. I was able to modify your solution to work perfectly for me. Thank you very much! – Jr Analyst Nov 23 '20 at 16:31