2

I need to extract the dates from a series of strings like this:

'MIHAI MĂD2Ă3.07.1958'

or

'CLAUDIU-MIHAI17.12.1999'

How can I do this?

I tried this:

for index,row in DF.iterrows():
    try:
        if math.isnan(row['Data_Nasterii']):
            match = re.search(r'\d{2}.\d{2}.\d{4}', row['Prenume'])
            date = datetime.strptime(match.group(), '%d.%m.%Y').date()
            s = datetime.strftime(datetime.strptime(str(date), '%Y-%m-%d'), '%d-%m-%Y')
            row['Data_Nasterii'] = s
    except TypeError:
        pass
sentence
PyRar

4 Answers

2

In a regex, an unescaped . (dot) does not match a literal dot; it matches any character, so it has to be escaped as \. to match an actual dot.
Apart from that, your first group is \d{2}, but some of your dates have a single-digit day.
I would use the following:

re.search(r'(\d+\.\d+\.\d+)', row['Prenume'])

This matches one or more digits, a dot, one or more digits, another dot, and one or more digits.
If stray characters are mixed into the day, you can try the following (subpar) solution:

''.join(re.search(r'(\d*)(?:[^0-9\.]*)(\d*\.\d+\.\d+)', row['Prenume']).groups())

This skips at most one block of non-digit characters inside the day. It's not pretty, but it works (and returns a string).
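As a quick check of the points above, here is a minimal sketch using only the standard library. The helper name `extract_date` and the test string '17x12y1999' are made up for illustration; the two sample strings come from the question.

```python
import re

samples = ['MIHAI MĂD2Ă3.07.1958', 'CLAUDIU-MIHAI17.12.1999']

# Unescaped dots match any character, so this pattern accepts '17x12y1999'.
loose = r'\d{2}.\d{2}.\d{4}'
# Escaped dots only match a literal '.'.
strict = r'\d+\.\d+\.\d+'

loose_hit = re.search(loose, '17x12y1999')    # a match object
strict_hit = re.search(strict, '17x12y1999')  # None


def extract_date(text):
    """Hypothetical helper: join the stray leading digit(s) with the rest of the date.

    Returns a string such as '23.07.1958', or None when nothing matches.
    """
    m = re.search(r'(\d*)(?:[^0-9.]*)(\d*\.\d+\.\d+)', text)
    return ''.join(m.groups()) if m else None


for s in samples:
    print(s, '->', extract_date(s))
```

Note that the plain `strict` pattern on the first sample only finds '3.07.1958', because the '2' is separated from the rest of the day by a letter.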

Nullman
  • Then, you can check my solution. – sentence May 12 '19 at 10:59
  • @Nullman Thank you so much! Only one question. For this example : 'MIHAI MĂD2Ă3.07.1958' it takes '3.07.1958' BUT it should be '23.07.1958'. The '2' digit is inside the name – PyRar May 12 '19 at 12:30
2

You can use the str accessor along with a regex:

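# str.extract requires a capture group, hence the parentheses;
# expand=False returns a Series instead of a one-column DataFrame
DF['Prenume'].str.extract(r'(\d{1,2}\.\d{2}\.\d{4})', expand=False)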
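A possible way to combine this with the question's goal (filling Data_Nasterii only where it is missing) is sketched below. The sample rows are invented for illustration, and the output format '%d-%m-%Y' is taken from the question's own code. This assumes the day in Prenume is clean; a string like 'MIHAI MĂD2Ă3.07.1958' would yield '3.07.1958' with this pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame; the column names follow the question.
DF = pd.DataFrame({
    'Prenume': ['CLAUDIU-MIHAI17.12.1999', 'ANA 5.03.2001', 'ION'],
    'Data_Nasterii': [np.nan, np.nan, '01-01-1990'],
})

# Pull the first dd.mm.yyyy-like substring out of each name (NaN if none).
extracted = DF['Prenume'].str.extract(r'(\d{1,2}\.\d{2}\.\d{4})', expand=False)

# Parse as day.month.year; anything that is not a real date becomes NaT.
parsed = pd.to_datetime(extracted, format='%d.%m.%Y', errors='coerce')

# Fill only the rows where Data_Nasterii is missing, keeping existing values.
DF['Data_Nasterii'] = DF['Data_Nasterii'].fillna(parsed.dt.strftime('%d-%m-%Y'))

print(DF)
```

Working on whole columns also avoids assigning to `row` inside `iterrows()`, which modifies a copy of the row rather than the DataFrame.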
gmds
1

You need to escape the dot (.) as \. or put it inside a character class, "[.]". An unescaped dot is a metacharacter in regex that matches any character. If you need stricter validation, you can refer to this!

e.g. r'[0-9]{2}[.][0-9]{2}[.][0-9]{4}' or r'\d{2}\.\d{2}\.\d{4}'

import re

text = 'CLAUDIU-MIHAI17.12.1999'
pattern = r'\d{2}\.\d{2}\.\d{4}'

if re.search(pattern, text):
    print("yes")
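The pattern alone only checks the shape of the text, so something like '99.99.9999' would still pass. One way to validate further is to hand the match to `datetime.strptime`, as in this minimal sketch; the function name `find_valid_date` is made up for illustration.

```python
import re
from datetime import datetime


def find_valid_date(text):
    """Hypothetical helper: return a datetime.date for the first dd.mm.yyyy
    match in text, or None if there is no match or it is not a real date."""
    match = re.search(r'\d{2}\.\d{2}\.\d{4}', text)
    if match is None:
        return None
    try:
        # strptime rejects impossible dates such as 31.02.2000
        return datetime.strptime(match.group(), '%d.%m.%Y').date()
    except ValueError:
        return None


print(find_valid_date('CLAUDIU-MIHAI17.12.1999'))
print(find_valid_date('X99.99.9999'))
```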
Srivastava
1

Another good solution could be using dateutil.parser:

import pandas as pd
import dateutil.parser as dparser

df = pd.DataFrame({'A': ['MIHAI MĂD2Ă3.07.1958',
                         'CLAUDIU-MIHAI17.12.1999']})

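# Drop non-ASCII characters, then let dateutil find the date in the remaining text.
# dayfirst=True keeps ambiguous dates such as 03.07.1958 as day.month.year.
df['userdate'] = df['A'].apply(
    lambda x: dparser.parse(x.encode('ascii', errors='ignore'), fuzzy=True, dayfirst=True))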

Output:

                         A   userdate
0     MIHAI MĂD2Ă3.07.1958 1958-07-23
1  CLAUDIU-MIHAI17.12.1999 1999-12-17
sentence