2

I am creating a parser of a text. The text contains two specific words that I do not want to match and between them I want to capture all the capital words that exist.

For example the text would be:

Treatments: IBUPROFEN\n\xe2\x80\xa2 COLCHICINE .... Physical examination

I have tried with this (?<=Treatments)(?:.*?)(\b[A-Z]+\b)(?:.*?)(?=Physical) but it didn't work.

I would like to capture just the words that are in capital letters between treatments and physical examination

Wiktor Stribiżew
  • 484,719
  • 26
  • 302
  • 397
  • I tried `([A-Z]{2,})` and it worked on provided string. It's not very generic though. – kkotula Mar 31 '20 at 18:55
  • 1
    Note that the string is a part of a bigger string with other words in upper case, I don't want to match all the upper case words, just the ones which are between Treatments and Physical – Adrián Valls Carbó Mar 31 '20 at 20:42
  • Looks like you are looking to create a regex, but do not know where to get started. Please check [Reference - What does this regex mean](https://stackoverflow.com/questions/22937618) resource, it has plenty of hints. Also, refer to [Learning Regular Expressions](https://stackoverflow.com/questions/4736) post for some basic regex info. Once you get some expression ready and still have issues with the solution, please edit the question with the latest details and we'll be glad to help you fix the problem. – Wiktor Stribiżew Mar 31 '20 at 21:25
  • I have tried many tutorials as Wiktor suggest but none of them solve my problem – Adrián Valls Carbó Mar 31 '20 at 21:43
  • *I am creating a parser of a text.* - What language/tool? – Wiktor Stribiżew Apr 01 '20 at 00:29
  • I am using Python – Adrián Valls Carbó Apr 01 '20 at 17:58
  • 1
    I can't post it as answer, because the question has been Closed :/ The regex that you're looking for is: `.*Treatments|Physical examination.*|[^P]*?\b([A-Z]{2,})\b` ([see online demo](https://regex101.com/r/T1auvh/2/)). Note that you need to tell your regex engine to make `.`(dot) match new line character (in python it's `re.DOTALL` flag). [See python demo](https://onlinegdb.com/Hk9p7vzvL). Note also that the first and the last regex matches won't have anything in first capture group, so you need to ignore those matches (see `filter` call in python example above). – TeWu Apr 01 '20 at 19:04
  • @TeWu You may post the answer. – Wiktor Stribiżew Apr 02 '20 at 06:52
  • @WiktorStribiżew Thanks :) - posted. – TeWu Apr 02 '20 at 12:30

2 Answers2

1

To capture only the words that are in capital letters and between words begin and end, use this regex:

.*begin|end.*|[^e]*?\b([A-Z]{2,})\b

See online demo

When you replace end with some other word, be sure to replace e in [^e]*? part with the first letter of this new word, e.g. when you want to replace end with Stop, then also replace [^e]*? with [^S]*?.

For the example in question, this regex becomes:

.*Treatments|Physical examination.*|[^P]*?\b([A-Z]{2,})\b

See online demo

Note that you need to tell your regex engine to make .(dot) match newline character:

  • In Python it's re.DOTALL flag.
  • In JavaScript you must replace all .(dots) in regex with [\s\S]. [source]

Also note that the first and the last regex matches won't have anything in the first capture group, so you need to ignore those matches (see filter call in python example below).

Python example

import re

text = """Suspendisse potenti:
Not MATCHED here. Por TOG esfet.

Treatments:
Pellentesque eget sollicitudin quam, id venenatis odio. Nam non tortor elit. Pras ultricies est urna, eu feugiat purus tempor a. Donec IBUPROFEN feugiat tristique ante, eget vulputate velit rhoncus ut. Morbi MATCHED HERE elementum leo a vulputate cursus. Sed at purus sit amet sapien COLCHICINE ullamcorper convallis.

Physical examination:
Also NOT MATCHED here at TO pulvinar mi, at vehicula libero. Nunc semper, neque sed tempor iaculis, nunc diam egestas lacus, Peget sodales sapien orci eget leo."""
results = re.findall(r".*Treatments|Physical examination.*|[^P]*?\b([A-Z]{2,})\b", text, re.DOTALL)
words = list(filter(None, results))

print(words)

Run it

TeWu
  • 4,006
  • 1
  • 19
  • 34
0

This seems to work in Java. Here's what is used.

  • ?msd multiline mode, dotall mode, and unix newline mode
  • \b word boundary (need to do \\b for Java Strings)
  • (?<=) positive look behind
  • (?=) positive look ahead.
        String str =
                "Treatments: IBUPROFEN\n\\xe2\\x80\\xa2 COLCHICINE .... Physical examination";

         pat = "(?msd:(?<=Treatments:.*)\\b([A-Z]+)\\b(?=.*Physical examination))";
        // iterate until no matches found.
        Matcher m = Pattern.compile(pat).matcher(str);
        while(m.find()) {
            System.out.println(m.group(1));
        }

Prints

IBUPROFEN
COLCHICINE
WJS
  • 22,083
  • 3
  • 14
  • 32
  • For JS it'd be `someString.match(/\b[A-Z]+\b/gm)` – Michael Malinovskij Mar 31 '20 at 20:32
  • 1
    The problem is the string is just a part of a bigger string which also contains words in upper case... I need to capture the strings in upper case which are between Treatments and Physical – Adrián Valls Carbó Mar 31 '20 at 20:39
  • @AdriánVallsCarbó just get a substring first using something like this: `/Treatments(.+)Physical/` and then match in the first group using the pattern above. Probably there is a way to make more complex single step solution, but with this approach you gonna keep patterns simpler, which is good. – Michael Malinovskij Mar 31 '20 at 22:04
  • @AdriánVallsCarbó I modified my answer. Hope it works for you. – WJS Mar 31 '20 at 22:23