Extract dates from strings that contain names and dates

Question

I need to extract the dates from a series of strings like this:

'MIHAI MĂD2Ă3.07.1958'

or

'CLAUDIU-MIHAI17.12.1999'

How can I do this?

I tried this:

for index,row in DF.iterrows():
    try:
        if math.isnan(row['Data_Nasterii']):
            match = re.search(r'\d{2}.\d{2}.\d{4}', row['Prenume'])
            date = datetime.strptime(match.group(), '%d.%m.%Y').date()
            s = datetime.strftime(datetime.strptime(str(date), '%Y-%m-%d'), '%d-%m-%Y')
            row['Data_Nasterii'] = s
    except TypeError:
        pass

`.` doesn't mean the dot character; it matches any character and needs to be escaped. Try this: `r'\d+\.\d+\.\d+'` – Nullman May 12 '19 at 10:17

Nullman · Accepted Answer · 2019-05-12T12:41:53.077

2

The . (dot) in a regex doesn't mean the dot character; it matches any character and must be escaped (\.) to match a literal dot.
Other than that, your first group is \d{2}, but some of your dates have a single-digit day.
I would use the following:

re.search(r'(\d+\.\d+\.\d+)', row['Prenume'])

which means at least one digit, followed by a dot, followed by at least one digit, followed by a dot, followed by at least one digit.
If you have some stray characters mixed into the day, you can try the following (subpar) solution:

''.join(re.search(r'(\d*)(?:[^0-9\.]*)(\d*\.\d+\.\d+)', row['Prenume']).groups())

This skips up to one block of non-digit characters inside the "day". It's not pretty, but it works (and returns a string).
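To turn the string returned by the pattern above into an actual date, a minimal sketch could look like the following. The helper name extract_date is hypothetical, and it assumes the dates are always in day.month.year order. If a stray digit elsewhere in the name gets glued onto the day, the parse fails and the sketch returns None rather than guessing.

```python
import re
from datetime import datetime

# Pattern from the answer: optional leading digits, one run of
# non-digit/non-dot characters, then the remainder of the date.
DATE_RE = re.compile(r'(\d*)(?:[^0-9.]*)(\d*\.\d+\.\d+)')

def extract_date(text):
    """Hypothetical helper: return a datetime.date, or None if nothing parses."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    # Join the two captured pieces, e.g. '2' + '3.07.1958'.
    candidate = ''.join(match.groups())
    try:
        # Assumes day.month.year order.
        return datetime.strptime(candidate, '%d.%m.%Y').date()
    except ValueError:
        return None

print(extract_date('MIHAI MĂD2Ă3.07.1958'))
print(extract_date('CLAUDIU-MIHAI17.12.1999'))
print(extract_date('NO DATE HERE'))
```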

edited May 12 '19 at 12:41

answered May 12 '19 at 10:20

Nullman

Then, you can check my solution. – sentence May 12 '19 at 10:59
@Nullman Thank you so much! Only one question. For this example : 'MIHAI MĂD2Ă3.07.1958' it takes '3.07.1958' BUT it should be '23.07.1958'. The '2' digit is inside the name – PyRar May 12 '19 at 12:30

score 2 · Answer 2 · answered May 12 '19 at 10:24

2

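You can use the str accessor along with a regex. Note that str.extract requires at least one capture group, so the pattern is wrapped in parentheses:

DF['Prenume'].str.extract(r'(\d{1,2}\.\d{2}\.\d{4})')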
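As a sketch of how this could replace the loop in the question: assigning to row inside iterrows may not write back to the DataFrame, so a column-wise fill is usually safer. The sample frame below is made up for illustration; only the column names come from the question, and the day-month-year output format is assumed from the question's strftime call.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "Prenume": ["CLAUDIU-MIHAI17.12.1999", "ANA", "ION3.07.1958"],
    "Data_Nasterii": [None, "01-02-2000", None],
})

# With one capture group and expand=False, str.extract returns a Series.
extracted = df["Prenume"].str.extract(r"(\d{1,2}\.\d{2}\.\d{4})", expand=False)

# Parse as day.month.year, then format as day-month-year.
parsed = pd.to_datetime(extracted, format="%d.%m.%Y", errors="coerce")
formatted = parsed.dt.strftime("%d-%m-%Y")

# Fill only the rows where the date is missing.
df["Data_Nasterii"] = df["Data_Nasterii"].fillna(formatted)
print(df)
```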

answered May 12 '19 at 10:24

gmds

score 1 · Answer 3 · answered May 12 '19 at 10:32

You need to escape the dot (.) as \. or put it inside a character class, [.]. An unescaped dot is a metacharacter in regex that matches any character. If you need stricter validation, you can refer to this.

For example: r'[0-9]{2}[.][0-9]{2}[.][0-9]{4}' or r'\d{2}\.\d{2}\.\d{4}'

import re

text = 'CLAUDIU-MIHAI17.12.1999'
pattern = r'\d{2}\.\d{2}\.\d{4}'

# re.search returns a match object if the pattern occurs anywhere in text
if re.search(pattern, text):
    print("yes")
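To see why the escaping matters, here is a small illustration with made-up variable names. The unescaped pattern from the question accepts a plain run of ten digits, because each "." is allowed to match a digit.

```python
import re

loose = r'\d{2}.\d{2}.\d{4}'     # unescaped: "." matches any character
strict = r'\d{2}\.\d{2}\.\d{4}'  # escaped: "." must be a literal dot

not_a_date = '1234567890'
real_date = 'CLAUDIU-MIHAI17.12.1999'

print(bool(re.search(loose, not_a_date)))   # the loose pattern matches
print(bool(re.search(strict, not_a_date)))  # the strict pattern does not
print(re.search(strict, real_date).group())
```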

sentence · Answer 4 · 2019-05-12T10:57:41.740

1

Another option is dateutil.parser with fuzzy parsing:

import pandas as pd
import dateutil.parser as dparser

df = pd.DataFrame({'A': ['MIHAI MĂD2Ă3.07.1958',
                         'CLAUDIU-MIHAI17.12.1999']})

df['userdate'] = df['A'].apply(lambda x: dparser.parse(x.encode('ascii',errors='ignore'),fuzzy=True))

output

                         A   userdate
0     MIHAI MĂD2Ă3.07.1958 1958-07-23
1  CLAUDIU-MIHAI17.12.1999 1999-12-17
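One caveat worth checking with this approach: when both of the first two numbers could be a month, dateutil reads the month first by default. The two sample rows are unaffected because 23 and 17 can only be days, but a value such as 03.12.1999 would come out as 12 March. Passing dayfirst=True should give the day.month.year reading. A short sketch, using a made-up input string:

```python
from datetime import datetime
import dateutil.parser as dparser

s1 = 'asd 03.12.1999'

# Default: an ambiguous date is read as month first.
month_first = dparser.parse(s1, fuzzy=True)

# dayfirst=True reads the same string as day.month.year.
day_first = dparser.parse(s1, fuzzy=True, dayfirst=True)

print(month_first.date())
print(day_first.date())
```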

edited May 12 '19 at 10:57

answered May 12 '19 at 10:32

sentence

thank you! Can I apply this on a single value not on a column? – PyRar May 12 '19 at 12:27
Of course. `s1 = 'asd 03.12.1999'`, then `print(dparser.parse(s1,fuzzy=True))` and you get `1999-03-12 00:00:00`. – sentence May 12 '19 at 15:49
