I would like to match several expressions or words in a text, as follows:
import re

patterns = [r'(\bbmw\w*\b)',  # bmw
            r'(\bopel\w?\b)',  # opel
            r'(\btoyota\w?\b\s+(\w+\s+){0,2}(\bcorolla\w?\b\s+\bdiesel\w?\b))'  # toyota corolla
            ]
# assume here that I am dealing with hundreds of regex coming from different coders.
text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'

def checkPatternInText(text, patterns):
    total_matches = []
    for pattern in patterns:
        matches = re.findall(pattern, text)
        if len(matches) > 0:
            print(type(matches))
            # findall yields strings for one group, tuples for several groups
            if isinstance(matches[0], str):
                total_matches.append(matches[0])
            else:
                total_matches.append(matches[0][0])
            print(matches)
    return total_matches

result = checkPatternInText(text, patterns)
The result of this method is:
['bmw', 'opel', 'toyota the nice corolla diesel']
I check the type of the matches because re.findall returns a list of strings when the pattern has a single group, and a list of tuples (one element per group) when the pattern has several groups. From such a tuple I want the longest group, which is the first element because the outer () span the whole match, hence matches[0][0].
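To make the behaviour described above concrete, here is a minimal sketch of the three shapes re.findall can return. The sample string and the variable names are made up for illustration, and the patterns are simplified versions of the ones above.

```python
import re

sample = 'toyota the nice corolla diesel and a bmw'

# No capturing group: findall returns the whole matches as strings.
no_group = re.findall(r'\bbmw\w*\b', sample)

# One capturing group: findall returns that group's text as strings.
one_group = re.findall(r'(\bbmw\w*\b)', sample)

# Several capturing groups: findall returns one tuple of groups per match.
many_groups = re.findall(r'(\btoyota\b\s+(\w+\s+){0,2}(\bcorolla\b))', sample)

print(no_group)     # ['bmw']
print(one_group)    # ['bmw']
print(many_groups)  # [('toyota the nice corolla', 'nice ', 'corolla')]
```

Only the last case produces tuples, which is why my function has to branch on the element type.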
Is there a more elegant way to do this without resorting to checking the variable type of the matches?
As a second question: I had to add () around every pattern so that the whole match (what would otherwise be group 0) is available as the first group. How would you proceed if the patterns do not have () around them?
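One direction I am considering, sketched here under the assumption that only the first match of each pattern is needed (as in my function above), is to work with match objects instead of re.findall. A match object exposes the whole match as group(0) regardless of how many groups the pattern contains. The helper name first_whole_matches is hypothetical, and the patterns below deliberately mix wrapped and unwrapped forms.

```python
import re

def first_whole_matches(text, patterns):
    """Hypothetical helper: collect the first full match of each pattern."""
    found = []
    for pattern in patterns:
        m = re.search(pattern, text)
        if m is not None:
            # group(0) is always the entire match, with or without () in the pattern
            found.append(m.group(0))
    return found

patterns = [
    r'\bbmw\w*\b',      # no group at all
    r'(\bopel\w?\b)',   # wrapped in one group
    # inner groups only, no outer ()
    r'\btoyota\w?\b\s+(\w+\s+){0,2}(\bcorolla\w?\b\s+\bdiesel\w?\b)',
]
text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'

print(first_whole_matches(text, patterns))
# ['bmw', 'opel', 'toyota the nice corolla diesel']
```

If every match were needed rather than just the first, re.finditer could presumably replace re.search in the same way, since it also yields match objects.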
It was suggested that this question has an answer here: re.findall behaves weird
The situation is not quite the same, since here I have a COLLECTION OF PATTERNS: some might be surrounded by (), others not; some might have groups, others might not. I am trying to find a more reliable solution than the one I proposed. When you deal with a single pattern you can always resort to modifying the pattern (as a last resort), but when you are dealing with a collection of patterns a more general solution might be required.
The solution of combining the three cases into one regex is not applicable. The real case has around 100 different regexes, and more are continuously being added.