Problem with regex with many empty list python re.findall

Question

I'm using a regex in Python to find dates in documents written in Spanish, and I'm getting the desired output plus lists full of empty strings. I know the problem is the () in the regex when I use re.findall, but if I use re.search I can extract only one date per sentence, and there may be two or more dates per sentence. How do you think I can solve this? The desired output would be: desired_output = ['01 DE FEBRERO DE 2001','15 DE 04 DE 2017','05 DE ENERO DE 2017'] I fixed some of the empty strings with (?:WORD), but I still need to fix a few more and cannot find where the problem is.

Please let me know if you do not understand the question. I'd like to know the most efficient way to get desired_output. If you do not know an efficient way, that's fine; any method will be helpful. I have two ideas: 1) modify the regex; 2) run findall with the unmodified regex and then delete the empty strings and words like 'DE' to get the desired_output object.

Here is the commented code for a fuller explanation:

import re

## Regex to find dates in Spanish. I know re.findall with () causes this problem, but re.search does not extract every single date:

c2 = '([0-3]?[0-9] ((?:DÍAS )?)(?:DE|DEL MES DE) (?:(?:ENE|FEB|MAR|ABR|MAY|JUN|JUL|AGO|OCT|NOV|DIC)(?:[A-Z]{0,7})|(?:[0-3]?[0-9])) (?:DEL AÑO|DE|DEL) [1-2][0-9]{3})'


## Sample text
text = """
Que, el artículo 11 de las Normas de Registro, Control, Comercialización de Productos
Veterinarios de la Decisión del Acuerdo de Cartagena 483, publicada en el Registro Oficial Nro'
257 de 01 de Febrero de 2001.
Fecha de registro: 15 de 04 de 2017 Curp: XXXXXXX25XXXX17 Número: 55569880
Nombre del registrado: José Femando Carranza López
Fecha de nacimiento: 05 de Enero de 2017 Hora: 19:25
Lugar de Nacimiento: Calle 23 No. 345 Col Napoles Pantitlan.
    """

## Split the text into sentences (a rough tokenisation). Note that the text is uppercased first
text_split = text.upper().split(sep=".")
## Apply the regex to each sentence
dates_found = [re.findall(c2, sentence) for sentence in text_split]
dates_found  # see this output: it's a list of lists of tuples, and many entries are empty


## With re.search this is better, but I'm missing a date: 05 DE ENERO DE 2017 (January 5th, 2017)
dates_found2 = [re.search(c2, sentence) for sentence in text_split]
dates_found2

## My desired output would be the following dates. I find them all with re.findall, but I miss "05 de enero de 2017" with re.search
desired_output = ['01 DE FEBRERO DE 2001','15 DE 04 DE 2017','05 DE ENERO DE 2017']

Thank you very much in advance!
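
For reference, the tuples come from how re.findall treats capturing groups: when a pattern has two or more groups, it returns one tuple per match with one slot per group, and the optional ((?:DÍAS )?) group yields an empty string whenever DÍAS is absent. A minimal sketch of idea 2, assuming the regex stays unmodified: re.finditer yields match objects, and group(0) is always the whole match regardless of the groups. The text below is a shortened stand-in for the sample above, and the names upper_text, tuples and dates are hypothetical.

```python
import re

# Unmodified pattern from the question (it contains two capturing groups).
c2 = '([0-3]?[0-9] ((?:DÍAS )?)(?:DE|DEL MES DE) (?:(?:ENE|FEB|MAR|ABR|MAY|JUN|JUL|AGO|OCT|NOV|DIC)(?:[A-Z]{0,7})|(?:[0-3]?[0-9])) (?:DEL AÑO|DE|DEL) [1-2][0-9]{3})'

# Shortened stand-in for the sample text (assumption: same date formats).
text = """
257 de 01 de Febrero de 2001.
Fecha de registro: 15 de 04 de 2017 Curp: XXXXXXX25XXXX17 Número: 55569880
Fecha de nacimiento: 05 de Enero de 2017 Hora: 19:25
"""

upper_text = text.upper()

# findall returns one tuple per match because the pattern has two groups;
# the second slot is '' whenever the optional "DÍAS " part is absent.
tuples = re.findall(c2, upper_text)

# finditer yields match objects; group(0) is the full matched text,
# so there are no tuples and no empty strings to clean up afterwards.
dates = [m.group(0) for m in re.finditer(c2, upper_text)]
print(dates)
```

Running the pattern over the whole text also removes the need to split on ".", since finditer already returns every match in order.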

Turn all capturing groups into non-capturing, and just use `re.findall(rx, text)`. — Wiktor Stribiżew, Jan 23 '20 at 17:41
You mean for instance turning (?:DE|DEL MES DE) into DE|DEL MES DE? This needs to be a captured group because if not the " | " operator would capture more strings than "DEL MES DE". — Tomas -, Jan 23 '20 at 17:45
I did not say "remove". I said "turn into". See https://ideone.com/2Ft5Tv. Ah, `re.I` makes it more flexible, no need to `upper()` the input string. — Wiktor Stribiżew, Jan 23 '20 at 17:52
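
A minimal sketch of what the comments above suggest, assuming the intent is to change every ( into (?: rather than to delete the groups. The linked ideone snippet is not reproduced here, so this is only one way to read the advice; the names rx and dates are hypothetical, and the text is a shortened stand-in for the sample. The month list is copied from the question as-is (note that it has no SEP entry).

```python
import re

# Same pattern as in the question, but with no capturing groups at all:
# the outer ( ) is dropped and the inner ( ) around "DÍAS " becomes (?: ).
rx = (
    r'[0-3]?[0-9] (?:DÍAS )?(?:DE|DEL MES DE) '
    r'(?:(?:ENE|FEB|MAR|ABR|MAY|JUN|JUL|AGO|OCT|NOV|DIC)[A-Z]{0,7}|[0-3]?[0-9]) '
    r'(?:DEL AÑO|DE|DEL) [1-2][0-9]{3}'
)

# Shortened stand-in for the sample text (assumption: same date formats).
text = """
Que, el artículo 11 de las Normas de Registro, publicada en el Registro Oficial Nro'
257 de 01 de Febrero de 2001.
Fecha de registro: 15 de 04 de 2017 Curp: XXXXXXX25XXXX17 Número: 55569880
Fecha de nacimiento: 05 de Enero de 2017 Hora: 19:25
"""

# With no capturing groups, findall returns the full matches as plain strings.
# re.I makes the match case-insensitive, so the input need not be uppercased.
found = re.findall(rx, text, flags=re.I)

# Uppercase only the matches to reproduce the desired_output format.
dates = [d.upper() for d in found]
print(dates)
```

Because (?:DE|DEL MES DE) is still a group, the | alternation stays scoped exactly as before; only the capturing behaviour changes.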
