Python regex to find blocks between tags that contain a word

Question

I have this text:

text = '''
    <TABLE>
        FINDME
        aaaa
    </TABLE>
    <TABLE>
        eeee
    </TABLE>
    <TABLE>
        FINDME
        iiii
    </TABLE>
    <TABLE>
        oooo
    </TABLE>
    <TABLE>
        FINDME
        uuuu
    </TABLE>
'''

How can I find the contents between tags only if the word "FINDME" is in them? I would like to obtain a list of all matches.

Expected output would be (omitting blanks):

>> ['FINDME\naaaa', 'FINDME\niiii', 'FINDME\nuuuu']

This is what I have so far:

pattern = re.compile("<TABLE>(.*FINDME.*)</TABLE>",re.DOTALL)
matches = re.findall(pattern, text)

Which returns:

>> ['\n        FINDME\n        aaaa\n    </TABLE>\n        eeee\n    <TABLE>\n        FINDME\n        iiii\n    </TABLE>\n        oooo\n    <TABLE>\n    <TABLE>\n        FINDME\n        uuuu\n    ']

It does not work as expected as it is greedy and puts everything in the same match.

I have also tried:

pattern = re.compile("<TABLE>(.*?)</TABLE>",re.DOTALL)
matches = re.findall(pattern, text)

This returns:

>> ['\n        FINDME\n        aaaa\n    ',
 '\n        eeee\n    ',
 '\n        FINDME\n        iiii\n    ',
 '\n        oooo\n    ',
 '\n        FINDME\n        uuuu\n    ']

It is non-greedy, but it does not filter out the matches without FINDME.

I have also tried combining the two to make the first pattern non-greedy, but clearly I am doing something wrong:

pattern = re.compile("<TABLE>(.*FINDME.*)</TABLE>?",re.DOTALL)
matches = re.findall(pattern, text)

But it behaves like the first one.

Would you please help me with this?

Thank you very much.

Don't use regular expressions to parse HTML, use Beautiful Soup. — Barmar, Apr 15 '21 at 15:55
Thanks. You are completely right. I am just trying to learn regex and this is just an example. — juancar, Apr 15 '21 at 16:01

Dr. Regex · Accepted Answer · 2021-04-15T17:40:02.497

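The problem with your regex is that .* is greedy and, with re.DOTALL, can also cross newlines. In .*FINDME.* the first .* runs as far as it can and backtracks only to the last FINDME, and the second .* then extends to the last </TABLE>, so all the blocks end up in a single match.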

This is the pattern you're looking for:

pattern = re.compile(r"<TABLE>[\n\s]+FINDME.*?</TABLE>?", re.DOTALL)

Edit1: [\n\s]+ matches one or more whitespace characters. Since \s already includes the newline \n, it is equivalent to \s+. You also don't need the ? after </TABLE> (it only makes the final > optional) if your HTML code is valid.
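
For reference, here is a minimal sketch of how a pattern like this could be applied to the text from the question. The capture group and the whitespace clean-up are my own additions for illustration (assumptions, not part of the pattern above), so that findall returns only the block contents without the tags. Note that re.DOTALL is needed so that . can cross newlines.

```python
import re

text = '''
    <TABLE>
        FINDME
        aaaa
    </TABLE>
    <TABLE>
        eeee
    </TABLE>
    <TABLE>
        FINDME
        iiii
    </TABLE>
    <TABLE>
        oooo
    </TABLE>
    <TABLE>
        FINDME
        uuuu
    </TABLE>
'''

# Capture group added for illustration: findall then returns only the
# contents of each block, not the surrounding tags.
# .*? is lazy, so each match stops at the nearest </TABLE>.
pattern = re.compile(r"<TABLE>\s+(FINDME.*?)</TABLE>", re.DOTALL)


def clean(block):
    """Strip the indentation from every line of a matched block."""
    return "\n".join(line.strip() for line in block.strip().splitlines())


matches = [clean(m) for m in pattern.findall(text)]
print(matches)
```

This sketch assumes FINDME is the first non-blank thing inside the block, as in the sample text. If the word can appear later in a block, the pattern would need to be adjusted.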

Edit2: Maybe another (possibly better) explanation: since you have multiple FINDMEs in the string, the greedy .*FINDME consumes as much as it can, so it matches everything up to the last FINDME. FINDME.*, on the other hand, matches the first FINDME and then continues to match everything after it (i.e. whatever .* and re.DOTALL allow), which is why the non-greedy .*? is needed to stop at the nearest </TABLE>.
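
The difference is easy to see on a small string. A rough sketch, where the sample string s and the short <T> tags are made up for illustration:

```python
import re

# Hypothetical one-line sample: three blocks, two of which contain FINDME.
s = "<T>FINDME a</T><T>b</T><T>FINDME c</T>"

# Greedy: the first .* backtracks only to the LAST FINDME and the second .*
# only to the LAST </T>, so everything collapses into a single match.
greedy = re.findall(r"<T>(.*FINDME.*)</T>", s)

# Lazy: FINDME must come right after <T>, and .*? stops at the nearest </T>.
lazy = re.findall(r"<T>(FINDME.*?)</T>", s)

print(greedy)
print(lazy)
```

The greedy version returns one long match that swallows the middle block, while the lazy version returns one entry per block that starts with FINDME.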

Great! Thank you very much for the solution and for the explanation! — juancar, Apr 15 '21 at 17:46
