
I'm using Python to find dates in strings like:

string01='los mantenimientos acontecieron en los dias 3,06,8 ,9, 15 y 29 de diciembre de 2018.Por cada mantenimiento fué cobrado $1,300.00 códigos de mantenimiento: (3)A34,(2)C54,(1)D65'

('the maintenance sessions were on December 3,06,8 ,9, 15 and 29 of 2018')

I'm first trying to use a regex to find and split out just the dates (not the currency amounts), and then transform them into the expected result.

expected result: ['3/12/2018','06/12/2018','08/12/2018','09/12/2018','15/12/2018','29/12/2018']

string02='los mantenimientos sucedieron en: 2,04,05,8,9,10,11,14,15,22,24, y 27 de junio de 2018.Valor de cada uno de los mantenimiento: $1,300.00, códigos de mantenimiento: (1)A35,(6)C54,(5)D65'

('the maintenance sessions happened on June 2,04,05,8,9,10,11,14,15,22,24, and 27 of 2018')

expected result: ['02/06/2018','04/06/2018','05/06/2018','08/06/2018','09/06/2018','10/06/2018','11/06/2018','14/06/2018','15/06/2018','22/06/2018','24/06/2018','27/06/2018']

What I've tried so far:

import re

dias=re.compile(r"((\s?[0-3]?[0-9]\s?\,?\s?){1,9}[0-3][0-9]|\sy\s[0-3][0-9]\sde\s(?:diciembre|junio)\sde\s[2][0][0-2][0-9])")

dias_found=re.findall(dias,string01)

But I'm getting tuples and duplicated values:

[(' 3,06,8,9, 15', '9, '), (' y 29 de diciembre de 2018', '')]

It should be ['3','06','8','9','15','29 de diciembre de 2018']

Any help will be greatly appreciated.

Thanks in advance.

Charles Duffy
user9910379
  • Honestly, parsing human-readable language is fraught with enough difficulty that relying on it for anything critical is probably a bad idea -- it'd be better to get your ops team to share their schedule in iCal format or something else built to be parsed programmatically; that way, if they use slightly different wording next time and it's read incorrectly, that's their problem and not yours. – Charles Duffy Jul 05 '19 at 20:49
  • @abdusco thanks a lot, it's Spanish, actually – user9910379 Jul 05 '19 at 20:57

1 Answer


You can use the re module together with some string manipulation to extract the dates:

import re

if __name__ == "__main__":
    texts = [
        'en los dias 3,06,8 ,9, 15 y 29 de diciembre de 2018.Por c',
        'n en: 2,04,05,8,9,10,11,14,15,22,24, y 27 de junio de 2018.Valor de',
    ]
    # select from the beginning of date-like text till the end of year
    pattern = r'\s*((\d+[\sy\,]*)+[\D\s]+20\d{2})'
    month_names = ['diciembre', 'junio']  # add others
    month_pattern = re.compile(f'({"|".join(month_names)})', flags=re.IGNORECASE)

    all_dates = []
    for item in texts:
        match = re.search(pattern, item)
        if not match:
            continue
        date_region: str = match.group(1)

        # find year
        year = re.search(r'(20\d{2})', date_region).group(1)

        # find month
        month_match = re.search(month_pattern, date_region)
        month = month_match.group(1)
        # remove everything after month
        date_region = date_region[: month_match.start()]
        # find all numbers, we're assuming they represent day of the month
        days = re.findall(r'(\d+)', date_region)

        found_dates = [f'{d}/{month}/{year}' for d in days]
        all_dates.append(found_dates)
    print(all_dates)


I don't know the month names in Spanish (I first took the text for Portuguese), but replacing them with numbers is a trivial task.

Output:

[['3/diciembre/2018',
  '06/diciembre/2018',
  '8/diciembre/2018',
  '9/diciembre/2018',
  '15/diciembre/2018',
  '29/diciembre/2018'],
 ['2/junio/2018',
  '04/junio/2018',
  '05/junio/2018',
  '8/junio/2018',
  '9/junio/2018',
  '10/junio/2018',
  '11/junio/2018',
  '14/junio/2018',
  '15/junio/2018',
  '22/junio/2018',
  '24/junio/2018',
  '27/junio/2018']]
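
To get from this output to the numeric format asked for in the question, one option is a lookup table from month name to month number. The snippet below is only a sketch: `MONTHS` and `to_numeric` are names I made up for illustration, and it assumes the standard Spanish spellings (a variant such as "setiembre" would need its own entry).

```python
# Assumed lookup table (not part of the code above): Spanish month name -> month number.
MONTHS = {
    'enero': 1, 'febrero': 2, 'marzo': 3, 'abril': 4,
    'mayo': 5, 'junio': 6, 'julio': 7, 'agosto': 8,
    'septiembre': 9, 'octubre': 10, 'noviembre': 11, 'diciembre': 12,
}


def to_numeric(date_text):
    """Hypothetical helper: turn '3/diciembre/2018' into '03/12/2018'."""
    day, month_name, year = date_text.split('/')
    # Lower-case the name because the month regex matches case-insensitively.
    month = MONTHS[month_name.lower()]
    # Zero-pad day and month to two digits each.
    return f'{int(day):02d}/{month:02d}/{year}'


found = ['3/diciembre/2018', '06/diciembre/2018', '8/diciembre/2018']
print([to_numeric(d) for d in found])
# ['03/12/2018', '06/12/2018', '08/12/2018']
```

With a table like this, `month_names` in the code above could also be built from `list(MONTHS)` instead of being typed out by hand.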
abdusco