2

I have a big block of text within which I am trying to look for a phrase. The phrase can be structured in a number of different ways.

  1. First I want to look for a word from a set of words, let's call it set 1.
  2. After that there must be a space or comma (or maybe something else that separates words).
  3. Then there may be 0 or more words from set 2.
  4. These are again followed by the word separation characters from point 2 above.
  5. Finally there should be a word from set 3.

Ideally all of these should be in the same sentence.

set 1 = (Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)

set 2 = (for|to|of|full|a|be|complete|Internal)

set 3 = (renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

So I have this regular expression:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Now this will match a phrase where there are 0 or 1 words from set 2, but not if there are several, e.g. "provides a wonderful opportunity for someone to add their own stamp as the property needs complete renovation throughout."

As soon as I add 'a' before 'complete' it fails. The same happens if I add another 'complete'.

How do I tell the regex to look for 0 or more words from a set?

Charlie Morton

4 Answers

3

Set 1: Matches any of the words in set 1 with 1 separator.

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]

Set 2: Matches any of the words in set 2 with 1 separator, 0 or more times.

((for|to|of|full|a|be|complete|Internal)[ ,])*

Set 3: Matches any of the words in set 3

(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Full:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]((for|to|of|full|a|be|complete|Internal)[ ,])*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
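A minimal Python sketch of the full pattern above, in case it helps to try it out. The `re.IGNORECASE` flag and the helper name `find_phrase` are assumptions added for illustration; without the flag, a lowercase "potential" will not match the capitalised "Potential" in set 1. `re.search` is used because it scans the whole text, whereas `re.match` only tries the very start of the string.

```python
import re

PATTERN = (
    r"(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]"
    r"((for|to|of|full|a|be|complete|Internal)[ ,])*"
    r"(renovate|improve|modernise|modernize|update|renovating|improving|"
    r"modernising|modernizing|updating|potential|project|renovation)"
)

# Assumption: case-insensitive matching is wanted for all three sets.
phrase_re = re.compile(PATTERN, re.IGNORECASE)


def find_phrase(text):
    """Return the first matching phrase in text, or None if there is none."""
    # re.search looks for a match anywhere in the string.
    m = phrase_re.search(text)
    return m.group(0) if m else None


print(find_phrase("potential to modernise"))
print(find_phrase("as the property needs a complete renovation throughout."))
print(find_phrase("a lovely garden"))
```

Checking for `None` before calling `.group()` or `.groups()` avoids an `AttributeError` when the text contains no match.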
Jim Wright
  • `{1}` is useless. – Toto Jan 07 '19 at 14:03
  • I ran this in an online regex tester, which came out fine, but I've just tried to run it in a Python script with the following 'import re text = "potential to modernise" regex = re.match("((Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]((for|to|of|full|a|be|complete|Internal)[ ,])*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation))", text) match = regex.groups() print(match) ' and I get the error ...... – Charlie Morton Jan 07 '19 at 21:42
  • ..... 'Traceback (most recent call last): File "/Users/Charlie/Documents/python/regex_potential.py", line 5, in match = regex.groups() AttributeError: 'NoneType' object has no attribute 'groups' ' – Charlie Morton Jan 07 '19 at 21:42
  • Case sensitivity... oops – Charlie Morton Jan 07 '19 at 21:45
2

Long alternations in regular expressions can be quite slow. I'd suggest taking another approach. First segment the text (split it into words), and then iterate over the array of words, checking whether each run of 3 consecutive words fulfils your requirements.

Something like this (pseudocode rather than real Python):

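def check(text):
    words = segment(text)  # split the text into a list of words
    for i in range(len(words) - 2):
        # three consecutive words must come from set 1, set 2 and set 3
        if check_word1(words[i]) and check_word2(words[i + 1]) and check_word3(words[i + 2]):
            return True
    return False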
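A runnable sketch of the same word-by-word idea, extended to allow zero or more set 2 words between the set 1 and set 3 words, as the question asks. The tokenizer, the lowercase comparison and the names `segment` and `find_phrase` are assumptions made for illustration, not part of the original suggestion.

```python
import re

SET1 = {"potential", "ability", "possibility", "need", "requires", "needs",
        "plenty", "for", "needing", "requiring"}
SET2 = {"for", "to", "of", "full", "a", "be", "complete", "internal"}
SET3 = {"renovate", "improve", "modernise", "modernize", "update", "renovating",
        "improving", "modernising", "modernizing", "updating", "potential",
        "project", "renovation"}


def segment(text):
    """Split text into lowercase words (a simple stand-in tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def find_phrase(text):
    """Return the first run of set 1, set 2 (0 or more), set 3 words, or None."""
    words = segment(text)
    for i, word in enumerate(words):
        if word not in SET1:
            continue
        j = i + 1
        # Skip over zero or more consecutive set 2 words.
        while j < len(words) and words[j] in SET2:
            j += 1
        # The word after the set 2 run must come from set 3.
        if j < len(words) and words[j] in SET3:
            return words[i:j + 1]
    return None


print(find_phrase("the property needs a complete renovation throughout."))
```

Because this tokenizer discards punctuation, a phrase could span a sentence boundary; splitting the text into sentences first would be one way to avoid that.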
mrzasa
1

You have to use this regex:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,](for|to|of|full|a|be|complete|Internal)*[ ,](renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

The reasoning: you have one word from the first set, followed by one space or comma. Next you have 0 or more words from set 2, then another space or comma, and finally one word from the last set.

0

Just in case you didn't know, you can use sites like https://regex101.com/ to test your regular expressions and see why they work or don't.

In this case, you need the "zero or more" (*) operator on your second group. The result would be:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)*[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

However, considering you probably want the words to be separated, just include the separator inside the repeated part (you can use a non-capturing group for that), resulting in:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(?:(for|to|of|full|a|be|complete|Internal)[ ,]*)*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
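A short Python sketch of this last pattern, built from word lists so the three sets stay readable. The list names and the `alternation` helper are illustrative assumptions. It also shows what the capturing group inside the repeated non-capturing group returns: when a group repeats, Python's `re` keeps only the last repetition.

```python
import re

SET1 = ["Potential", "Ability", "Possibility", "need", "requires", "needs",
        "plenty", "for", "Needing", "Requiring"]
SET2 = ["for", "to", "of", "full", "a", "be", "complete", "Internal"]
SET3 = ["renovate", "improve", "modernise", "modernize", "update", "renovating",
        "improving", "modernising", "modernizing", "updating", "potential",
        "project", "renovation"]


def alternation(words):
    """Join words into a regex alternation, escaping any special characters."""
    return "|".join(re.escape(w) for w in words)


# set 1, optional separators, then (set 2 word + optional separators) repeated, then set 3
pattern = re.compile(
    "({})[ ,]*(?:({})[ ,]*)*({})".format(
        alternation(SET1), alternation(SET2), alternation(SET3)
    )
)

text = ("provides a wonderful opportunity for someone to add their own stamp "
        "as the property needs a complete renovation throughout.")
m = pattern.search(text)
if m:
    print(m.group(0))   # the whole matched phrase
    print(m.groups())   # group 2 holds only the last set 2 word matched
```

If every set 2 word is needed, one option is to take `m.group(0)` and split it on the separators afterwards.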
dquijada
  • This will match `Potential,,,,fortooffullabecompleteInternal, ,,, , fortooffullabecompleteInternalfortooffullabecompleteInternalfortooffullabecompleteInternalrenovate` – Toto Jan 07 '19 at 14:03
  • I know, but since that's what he has in his regex, I decided to modify it as little as possible (just in case he wants that behaviour for some reason) – dquijada Jan 07 '19 at 14:04