getting only total match in a regex method checking multiple patterns in python

Question

I would like to match several expressions or words in a text, as follows:

import re

patterns = [r'(\bbmw\w*\b)',  # bmw
            r'(\bopel\w?\b)',  # opel
            r'(\btoyota\w?\b\s+(\w+\s+){0,2}(\bcorolla\w?\b\s+\bdiesel\w?\b))',  # toyota corolla
            ]

# assume here that I am dealing with hundreds of regex coming from different coders.

text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'

def checkPatternInText(text, patterns):

    total_matches = []

    for pattern in patterns:
        matches = re.findall(pattern, text)
        if len(matches) > 0:
            print(type(matches))
            # only index into matches when the pattern actually matched
            if type(matches[0]) == type('astring'):
                total_matches.append(matches[0])
            else:
                total_matches.append(matches[0][0])
        print(matches)

    return total_matches

result = checkPatternInText(text, patterns)

The result of this method is:

['bmw', 'opel', 'toyota the nice corolla diesel']

I check the type of the matches because if the pattern has a single group, each match is a string, and if the pattern has several groups, each match is a tuple with all the groups. From this tuple of groups I want the longest one, which is the first in the tuple because of the outer (), hence matches[0][0].
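The behaviour described above can be reproduced in isolation. This is a minimal sketch with made-up sample patterns and text that are not part of the real pattern list: `re.findall` returns plain strings when a pattern has zero or one capturing group, and tuples when it has two or more.

```python
import re

sample = 'toyota corolla'  # illustrative text, not from the original data

# No capturing group: each item is the whole match as a string.
no_group = re.findall(r'toyota \w+', sample)

# One capturing group: each item is that group's text, still a string.
one_group = re.findall(r'(toyota) \w+', sample)

# Several capturing groups: each item is a tuple with one entry per group.
many_groups = re.findall(r'((toyota) (\w+))', sample)

print(no_group)     # ['toyota corolla']
print(one_group)    # ['toyota']
print(many_groups)  # [('toyota corolla', 'toyota', 'corolla')]
```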

Is there a more elegant way to do this without resorting to checking the variable type of the matches?

As a second question: I had to add () around every pattern so that the first group is the whole match. How would you proceed if the patterns do not have () around them?

It was suggested that this question has an answer here: re.findall behaves weird

The situation is not exactly the same, since here I have a collection of patterns: some might be surrounded by (), others not. Some might have groups, others might not. I am trying to find a more reliable solution than the one I proposed. When you deal with a single pattern you can always resort to modifying the pattern (as a last resort); when you are dealing with a collection of patterns, a more general solution might be required.

The solution of making 1 regex for the three cases is not applicable. The real case has around 100 different regex and more and more are being continuously added.

Why are you checking the type of the values in `matches`? What is that telling you? — khelwood, Nov 09 '20 at 00:41

score 1 · Answer 1 · answered Oct 29 '20 at 09:39


You can achieve this in a single regex in re.findall using alternations:

\b(?:bmw|opel|toyota\s+(?:\w+\s+){0,2}corolla\s+diesel)\b


Code:

>>> import re
>>> text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'
>>> print(re.findall(r'\b(?:bmw|opel|toyota\s+(?:\w+\s+){0,2}corolla\s+diesel)\b', text))
['bmw', 'opel', 'toyota the nice corolla diesel']

RegEx Details:

\b: Word boundary
(?:: Start non-capture group
- bmw: Match bmw
- |: OR
- opel: Match opel
- |: OR
- toyota\s+(?:\w+\s+){0,2}corolla\s+diesel: Match toyota substring
): End non-capture group
\b: Word boundary
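
Where the patterns have to stay separate, as with the list in the question, one option is to keep the loop and read group 0 from a match object instead of using `re.findall`. The following is only a sketch under that assumption: `first_full_matches` is a hypothetical helper name, and it keeps just the first match per pattern, as the question's code does. `Match.group(0)` is always the entire match, whether or not the pattern contains parentheses, so neither the type check nor the extra outer () is needed.

```python
import re

def first_full_matches(text, patterns):
    """Return the full text of the first match of each pattern that matches."""
    results = []
    for pattern in patterns:
        match = re.search(pattern, text)
        if match is not None:
            # group(0) is the whole match, independent of capturing groups
            results.append(match.group(0))
    return results

# The question's patterns, deliberately written with mixed grouping styles.
patterns = [
    r'\bbmw\w*\b',      # no group at all
    r'(\bopel\w?\b)',   # one outer group
    r'\btoyota\w?\b\s+(\w+\s+){0,2}(\bcorolla\w?\b\s+\bdiesel\w?\b)',  # inner groups only
]
text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'

print(first_full_matches(text, patterns))
# ['bmw', 'opel', 'toyota the nice corolla diesel']
```

If every occurrence is wanted rather than the first, `re.finditer` yields the same kind of match objects and `group(0)` can be read from each one.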

answered Oct 29 '20 at 09:39

anubhava


Hi anubhava, thanks for your answer, but I need the loop since I have hundreds of regexes. I cannot make a single one, otherwise I lose control – JFerro Oct 29 '20 at 09:50
Applying dozens of regexes to the same input would be very inefficient when processing a large amount of text – anubhava Oct 29 '20 at 09:52
I know, anubhava, but there is no other way around it because the patterns are gathered from different users almost on the fly. Speed is not important in this case. – JFerro Oct 29 '20 at 10:36
