I have this text:
text = '''
<TABLE>
FINDME
aaaa
</TABLE>
<TABLE>
eeee
</TABLE>
<TABLE>
FINDME
iiii
</TABLE>
<TABLE>
oooo
</TABLE>
<TABLE>
FINDME
uuuu
</TABLE>
'''
How can I find the contents between the tags only when the word "FINDME" appears inside them? I would like to obtain a list of all matches.
Expected output would be (omitting blanks):
>> ['FINDME\naaaa', 'FINDME\niiii', 'FINDME\nuuuu']
This is what I have so far:
import re
pattern = re.compile("<TABLE>(.*FINDME.*)</TABLE>", re.DOTALL)
matches = re.findall(pattern, text)
Which returns:
>> ['\n FINDME\n aaaa\n </TABLE>\n eeee\n <TABLE>\n FINDME\n iiii\n </TABLE>\n oooo\n <TABLE>\n <TABLE>\n FINDME\n uuuu\n ']
This does not work as expected: the pattern is greedy, so everything ends up in a single match.
I have also tried:
pattern = re.compile("<TABLE>(.*?)</TABLE>",re.DOTALL)
matches = re.findall(pattern, text)
This returns:
>> ['\n FINDME\n aaaa\n ',
'\n eeee\n ',
'\n FINDME\n iiii\n ',
'\n oooo\n ',
'\n FINDME\n uuuu\n ']
This is non-greedy, but it does not filter out the matches without FINDME.
I have also tried combining the two, attempting to make the first pattern non-greedy, but clearly I am doing something wrong:
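One possible way to build on this second attempt, sketched here under assumptions rather than as a definitive fix, is to keep the non-greedy pattern and filter the captured blocks in Python afterwards. The name `tables` is introduced only for this illustration, and `strip()` is used to drop the surrounding blanks as in the expected output.

```python
import re

text = '''
<TABLE>
FINDME
aaaa
</TABLE>
<TABLE>
eeee
</TABLE>
<TABLE>
FINDME
iiii
</TABLE>
<TABLE>
oooo
</TABLE>
<TABLE>
FINDME
uuuu
</TABLE>
'''

# Non-greedy capture: each match stops at the nearest closing tag.
pattern = re.compile(r"<TABLE>(.*?)</TABLE>", re.DOTALL)
tables = pattern.findall(text)  # hypothetical helper name

# Keep only the blocks that contain the keyword, trimming whitespace.
matches = [t.strip() for t in tables if "FINDME" in t]
print(matches)
```

This splits the work into two steps (extract, then filter), which may be easier to read than a single more complex regular expression.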
pattern = re.compile("<TABLE>(.*FINDME.*)</TABLE>?",re.DOTALL)
matches = re.findall(pattern, text)
But it behaves the same as the first one.
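A likely reason the third attempt behaves like the first: the trailing `?` applies only to the final `>` character, making it optional, so both `.*` parts remain greedy. As a sketch of a single-pattern alternative (one option among several, not necessarily the best), a negative lookahead can stop each `.` from stepping over a closing tag, so a match can never span more than one table:

```python
import re

text = '''
<TABLE>
FINDME
aaaa
</TABLE>
<TABLE>
eeee
</TABLE>
<TABLE>
FINDME
iiii
</TABLE>
<TABLE>
oooo
</TABLE>
<TABLE>
FINDME
uuuu
</TABLE>
'''

# (?:(?!</TABLE>).)*? consumes one character at a time, but only while
# the text ahead is not "</TABLE>", so the match stays inside one block.
pattern = re.compile(
    r"<TABLE>((?:(?!</TABLE>).)*?FINDME(?:(?!</TABLE>).)*?)</TABLE>",
    re.DOTALL,
)

# strip() removes the newlines around each captured block.
matches = [m.strip() for m in pattern.findall(text)]
print(matches)
```

Blocks without FINDME fail to match because the lookahead prevents the pattern from reaching into the next table to find the keyword.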
Would you please help me with this?
Thank you very much.