To capture only the words that are in capital letters and lie between the words begin and end, use this regex:
.*begin|end.*|[^e]*?\b([A-Z]{2,})\b
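If you replace end with some other word, be sure to also replace e in the [^e]*? part with the first letter of the new word (matching its case). For example, to use Stop instead of end, change [^e]*? to [^S]*?. The negated class is what keeps the lazy skip from running past the end word, so it has to exclude that word's first letter.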
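The substitution rule above can be sketched as a small Python helper. The names build_pattern and capital_words are hypothetical and not part of the original answer; re.escape is used on the assumption that the marker words may contain regex metacharacters, and re.DOTALL is passed so that . also matches newlines.

```python
import re

def build_pattern(begin, end):
    """Hypothetical helper: build the regex for arbitrary begin/end markers."""
    # The negated class must exclude the first character of the end marker.
    first = re.escape(end[0])
    return re.compile(
        rf".*{re.escape(begin)}|{re.escape(end)}.*|[^{first}]*?\b([A-Z]{{2,}})\b",
        re.DOTALL,
    )

def capital_words(text, begin, end):
    """Return the all-caps words (2+ letters) found between begin and end."""
    # findall yields '' for the two boundary matches; drop those.
    return [word for word in build_pattern(begin, end).findall(text) if word]

print(capital_words("AA begin some BB text CC Stop DD", "begin", "Stop"))
```

This is only a sketch: it assumes both markers occur in the text and that begin comes before end.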
For the example in question, this regex becomes:
.*Treatments|Physical examination.*|[^P]*?\b([A-Z]{2,})\b
Note that you need to tell your regex engine to make . (dot) match newline characters:
- In Python, pass the re.DOTALL flag.
- In JavaScript, use the s (dotAll) flag where the engine supports it (ES2018 and later); otherwise replace every . (dot) in the regex with [\s\S]. [source]
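To see why the flag matters, here is a minimal Python sketch on a made-up five-line input (the sample text is an assumption for illustration). Without re.DOTALL, the .* parts cannot cross line breaks, so the boundary alternatives no longer swallow the text outside the markers and capitalized words from those regions leak into the result.

```python
import re

pattern = r".*begin|end.*|[^e]*?\b([A-Z]{2,})\b"
text = "skip AAA\nbegin\nkeep BBB\nend\nskip CCC"

# With DOTALL, ".*begin" and "end.*" consume everything outside the markers.
with_dotall = [w for w in re.findall(pattern, text, re.DOTALL) if w]

# Without DOTALL, "." stops at each newline, so AAA and CCC are captured too.
without_dotall = [w for w in re.findall(pattern, text) if w]

print(with_dotall)
print(without_dotall)
```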
Also note that the first and the last regex matches won't have anything in the first capture group, so you need to ignore those matches (see the filter call in the Python example below).
Python example
import re
text = """Suspendisse potenti:
Not MATCHED here. Por TOG esfet.
Treatments:
Pellentesque eget sollicitudin quam, id venenatis odio. Nam non tortor elit. Pras ultricies est urna, eu feugiat purus tempor a. Donec IBUPROFEN feugiat tristique ante, eget vulputate velit rhoncus ut. Morbi MATCHED HERE elementum leo a vulputate cursus. Sed at purus sit amet sapien COLCHICINE ullamcorper convallis.
Physical examination:
Also NOT MATCHED here at TO pulvinar mi, at vehicula libero. Nunc semper, neque sed tempor iaculis, nunc diam egestas lacus, Peget sodales sapien orci eget leo."""
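# findall returns group 1 for every match; the two boundary matches yield ''.
results = re.findall(r".*Treatments|Physical examination.*|[^P]*?\b([A-Z]{2,})\b", text, re.DOTALL)
# Drop the empty strings left by the boundary matches.
words = list(filter(None, results))
print(words)  # ['IBUPROFEN', 'MATCHED', 'HERE', 'COLCHICINE']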
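As a variation on the example above, re.finditer can be used when the position of each word is also needed. In that case the unused capture group comes back as None rather than an empty string, so the same truthiness check filters out the boundary matches. The sample text here is made up for illustration.

```python
import re

pattern = re.compile(r".*begin|end.*|[^e]*?\b([A-Z]{2,})\b", re.DOTALL)
text = "skip AAA\nbegin\nkeep BBB and CCC\nend\nskip DDD"

# Keep only matches where group 1 took part; pair each word with its offset.
hits = [(m.group(1), m.start(1)) for m in pattern.finditer(text) if m.group(1)]
print(hits)
```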