
I would like to match several expressions or words in a text, as follows:

patterns = [r'(\bbmw\w*\b)',  # bmw
            r'(\bopel\w?\b)',  # opel
            r'(\btoyota\w?\b\s+(\w+\s+){0,2}(\bcorolla\w?\b\s+\bdiesel\w?\b))',  # toyota corolla
            ]

# assume here that I am dealing with hundreds of regex coming from different coders.

text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'

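import re

def checkPatternInText(text, patterns):

    total_matches = []

    for pattern in patterns:
        matches = re.findall(pattern, text)
        if not matches:
            # pattern did not match: skip it instead of failing on matches[0]
            continue
        print(type(matches[0]))
        if isinstance(matches[0], str):
            # one group in the pattern: findall returns a list of strings
            total_matches.append(matches[0])
        else:
            # several groups: findall returns tuples, outer group first
            total_matches.append(matches[0][0])
        print(matches)

    return total_matches

result = checkPatternInText(text, patterns)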

The result of this method is:

['bmw', 'opel', 'toyota the nice corolla diesel']

I check the type of the matches because re.findall returns a list of strings when the pattern contains a single group, and a list of tuples (one element per group) when the pattern contains several groups. From such a tuple I want the longest match, which is the first element because the outer parentheses open first, hence matches[0][0].
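To make the behaviour described above concrete, here is a small sketch that runs re.findall on the same text with zero, one, and several groups. The variable names (no_group, one_group, many_groups) are only illustrative.

```python
import re

text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'

# No group: findall returns the whole matches as strings.
no_group = re.findall(r'\bbmw\w*\b', text)

# Exactly one group: findall returns the content of that group as strings.
one_group = re.findall(r'(\bbmw\w*\b)', text)

# Several groups: findall returns one tuple per match, one element per group.
many_groups = re.findall(
    r'(\btoyota\w?\b\s+(\w+\s+){0,2}(\bcorolla\w?\b\s+\bdiesel\w?\b))', text)

print(no_group)     # ['bmw']
print(one_group)    # ['bmw']
print(many_groups)  # [('toyota the nice corolla diesel', 'nice ', 'corolla diesel')]
```

The shape of the result therefore depends on how many groups each pattern happens to contain, which is why the type check becomes necessary with findall.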

Is there a more elegant way to do this without resorting to checking the variable type of the matches?

As a second question: I had to add () around all the patterns in order to get at the whole match (group 0). How would you proceed if the patterns do not have () around them?

It was suggested that this question has an answer here: re.findall behaves weird

The situation is not quite the same, since here I have a COLLECTION OF PATTERNS: some might be surrounded by (), others not. Some might have groups, others might not. I am trying to find a more reliable solution than the one I proposed. When you deal with a single pattern you can always modify the pattern as a last resort; when you are dealing with a collection of patterns, a more general solution might be required.

The solution of making one regex for the three cases is not applicable. The real case has around 100 different regexes, and more are continuously being added.

SE_net4 the downvoter
JFerro

1 Answer


You can achieve this in a single regex in re.findall using alternations:

\b(?:bmw|opel|toyota\s+(?:\w+\s+){0,2}corolla\s+diesel)\b

RegEx Demo

Code:

>>> import re
>>> text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'
>>> print (re.findall(r'\b(?:bmw|opel|toyota\s+(?:\w+\s+){0,2}corolla\s+diesel)\b', text))
['bmw', 'opel', 'toyota the nice corolla diesel']

RegEx Details:

  • \b: Word boundary
  • (?:: Start non-capture group
    • bmw: Match bmw
    • |: OR
    • opel: Match opel
    • |: OR
    • toyota\s+(?:\w+\s+){0,2}corolla\s+diesel: Match toyota substring
  • ): End non-capture group
  • \b: Word boundary
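If the patterns are kept in a list, the same alternation could also be assembled programmatically rather than written by hand. The following is only a rough sketch: combine_patterns is a hypothetical helper (not part of the re module), and it assumes that none of the patterns rely on numbered backreferences or inline flags, since wrapping and joining them shifts the group numbers.

```python
import re

def combine_patterns(patterns):
    """Join patterns into one alternation (hypothetical helper).

    Each pattern is wrapped in a non-capturing group so that the '|'
    separator cannot interact with the pattern's own contents.
    """
    return re.compile("|".join(f"(?:{p})" for p in patterns))

patterns = [r'(\bbmw\w*\b)',
            r'(\bopel\w?\b)',
            r'(\btoyota\w?\b\s+(\w+\s+){0,2}(\bcorolla\w?\b\s+\bdiesel\w?\b))']

text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'

combined = combine_patterns(patterns)

# finditer yields match objects; group(0) is always the whole match,
# no matter how many groups the individual patterns contain.
found = [m.group(0) for m in combined.finditer(text)]
print(found)  # ['bmw', 'opel', 'toyota the nice corolla diesel']
```

Unlike a per-pattern loop, this reports every non-overlapping occurrence in text order and does not record which pattern produced each hit.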
anubhava
  • Hi anubhava, thanks for your answer, but I need the loop since I have hundreds of regexes. I cannot make a single one, otherwise I lose control – JFerro Oct 29 '20 at 09:50
  • Applying dozens of regexes to the same input would be very inefficient when processing a large amount of text – anubhava Oct 29 '20 at 09:52
  • I know anubhava but there is no other way around it because the patterns are gathered from different users almost on the fly. Speed is not important in this case. – JFerro Oct 29 '20 at 10:36
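For the loop-based approach discussed in these comments, one possible sketch is to use re.search and read group(0) from the match object. Group 0 is always the whole match, so neither the type check nor the extra outer parentheses are needed. The function name find_first_matches and the extra fiat pattern are made up for illustration.

```python
import re

def find_first_matches(text, patterns):
    """Return the whole first match of every pattern that matches the text."""
    results = []
    for pattern in patterns:
        m = re.search(pattern, text)
        if m is not None:
            # group(0) is the entire match, with or without groups in the pattern
            results.append(m.group(0))
    return results

patterns = [r'\bbmw\w*\b',     # no parentheses at all
            r'(\bopel\w?\b)',  # one outer group
            r'\btoyota\w?\b\s+(\w+\s+){0,2}(\bcorolla\w?\b\s+\bdiesel\w?\b)',  # inner groups only
            r'\bfiat\b']       # illustrative pattern that does not match

text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'

result = find_first_matches(text, patterns)
print(result)  # ['bmw', 'opel', 'toyota the nice corolla diesel']
```

If every occurrence is wanted rather than only the first, re.finditer could replace re.search in the same loop, again reading m.group(0) from each match object.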