
I'm trying to write a regex that finds all characters between a starting token ('MS' or 'PhD') and an ending token ('.' or '!'). What makes this tricky is that both starting tokens are often present in my text data, and I'm only interested in the characters bounded by the last starting token and the first ending token (and in all such occurrences).

start = 'MS|PhD'
end = '.|!'

input1 = "Candidate with MS or PhD in Statistics, Computer Science, or similar field."
output1 = "in Statistics, Computer Science, or similar field"

input2 = "Applicant with MS in Biology or Chemistry desired."
output2 = "in Biology or Chemistry desired"

Here's my best attempt, which is currently returning an empty list:

#          start  any char    end
pattern = r'^(MS|PhD) .* (\.|!)$'
re.findall(pattern,"candidate with MS in Chemistry.")

>>>
[]

Could someone point me in the right direction?

jbuddy_13

1 Answer


You could use a capturing group and match MS or PhD and the . or ! outside of the group.

\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*)[.,]
  • \b(?:MS|PhD)\s* A word boundary, then either MS or PhD, followed by 0+ whitespace chars that are matched outside the group so they are not captured
  • ( Capture group 1, which contains the desired value
    • (?: Non-capture group
      • (?!\b(?:MS|PhD)\b). If MS or PhD (as a whole word) does not start at the current position, match any char except a newline
    • )* Close the non-capture group and repeat it 0+ times
  • )[.,] Close group 1 and match either . or ,
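
To see the parts listed above in one place, the same pattern can be written with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch of the pattern from this answer; the names pattern, text and m are illustrative.

```python
import re

pattern = re.compile(r"""
    \b(?:MS|PhD)\s*          # start token, then optional whitespace (not captured)
    (                        # group 1: the text we want
      (?:
        (?!\b(?:MS|PhD)\b)   # fail here if another start token begins at this position
        .                    # otherwise consume one character (not a newline)
      )*
    )
    [.,]                     # closing character, matched outside the group
""", re.VERBOSE)

text = "Candidate with MS or PhD in Statistics, Computer Science, or similar field."
m = pattern.search(text)

print(m.group(0))  # PhD in Statistics, Computer Science, or similar field.
print(m.group(1))  # in Statistics, Computer Science, or similar field
```

The attempt that starts at MS fails: the repeated part stops right before PhD, and the next character is then "P", which [.,] cannot match. The regex engine therefore moves on, and the first successful match starts at PhD, the last starting token before the closing character.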

Regex demo | Python demo

import re

regex = r"\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*)[.,]"
s = ("Candidate with MS or PhD in Statistics, Computer Science, or similar field.\n"
    "Applicant with MS in Biology or Chemistry desired.")

matches = re.findall(regex, s)
print(matches)

Output

['in Statistics, Computer Science, or similar field', 'in Biology or Chemistry desired']
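
The question names '.' and '!' as the ending tokens and asks for the first one. The pattern above closes on [.,] instead, and because the repeated dot is greedy, it ends at the last . or , it can reach on the line. A possible variant, offered as a sketch rather than as part of the original answer, swaps the dot for [^.!\n] and closes on [.!], so the captured text cannot run past an ending token. The name first_end and the sample text are made up for illustration.

```python
import re

# Assumption: same start tokens as above, but the end tokens from the question ('.' or '!').
# [^.!\n] replaces the dot, so the repeated part stops at the first ending token
# (and, like the dot, never crosses a newline).
first_end = re.compile(r"\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b)[^.!\n])*)[.!]")

text = ("Candidate with MS or PhD in Statistics. Must relocate!\n"
        "Applicant with MS in Biology or Chemistry desired!")

print(first_end.findall(text))
# ['in Statistics', 'in Biology or Chemistry desired']
```

On the two inputs from the question this variant returns the same strings as the pattern above, since each of them contains only one ending token.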
The fourth bird
  • This is awesome! (Will checkmark when min waiting period over.) Thanks for the interactive demo link. I don't understand much of it yet, but I think I'll be able to click around enough to infer what's happening. – jbuddy_13 Dec 21 '20 at 18:12
  • Please do not use the negative lookahead after the dot, use it before the dot, in the tempered greedy token. It will work here, but in general, it is safer to use it correctly. See [Tempered Greedy Token - What is different about placing the dot before the negative lookahead](https://stackoverflow.com/a/37343088/3832970). Use [`\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*)[.,]`](https://regex101.com/r/WfiG0Y/2). – Wiktor Stribiżew Dec 21 '20 at 18:23
  • @WiktorStribiżew Thanks, updated and I will read the post again. – The fourth bird Dec 21 '20 at 18:32