I am writing regular expressions to extract dosage instructions from a pharmaceutical catalog. The information comes from many different brands, and the formatting is not consistent even within a brand, so my expression has to be fairly lenient. The regular expressions are being implemented in Ruby (but not by me).

My regex is as follows:

/(take|chew\s|usage:|use:|intake:|dosage:?|dose:|directions:|recommendations:|adults:)\s*(.*take\s+|.*chew\s+|.*mix\s+|.*supplement,\s+)?(?<dosage_amount>\S+(\sto\s\S+)?(\sor\s\S+)?(\s\(\d+\)\s)?\b)[\s,](?<dosage_format>\S+\b(\s\([\w\-\.]+\))?)?[\s,]*?(?<dosage_frequency>[\S\s]*(daily|per day|a day|needed|morning|evening))?[\s,]?\s?(daily\s)?(?<dosage_permutation>(with|on|at|in|before|after|taken)[,\w\s\-]*)?(?=or as|\.)?/

An example of the code working correctly would be with the following description --

"Suggested use: As a dietary supplement, take 1-3 capsules daily,in divided doses, before a meal."

-- where I get dosage_amount= 1-3, dosage_format= capsule, dosage_frequency= once per day, and dosage_permutation= "in divided doses, before a meal".

However, I am getting problems with descriptions like:

"Directions: For adults, take one (1) tablet daily, preferably with a meal or follow the advice of your health care professional. Let tablets dissolve on tongue before swallowing. As a reminder, discuss the supplements and medications you take with your health care providers. "

The problem occurs when the word "take" is used more than once in the description. I get dosage_amount= with, and dosage_format= your. (It is matching the second 'take', and not the first.)

Is there a way to force the regex to match only the first 'take' in the description? I have tried experimenting with making it greedy vs. non-greedy as outlined here, but I cannot make it work.

Thank you.

mudfaerie
  • Please show us your attempt to make it non-greedy, because I think that should do it. We need to see what you tried so we can help you understand where you went wrong. – Barmar Jul 20 '15 at 19:33
  • Sleafar's answer worked - I'd tried to make the 'take' itself non-greedy instead of the characters before it. Thank you. – mudfaerie Jul 20 '15 at 19:38

1 Answer

Try to replace the greedy part here:

.*take

with a non-greedy version:

.*?take

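The first variant consumes as many characters as possible, so it runs to the end of the text and backtracks until it finds the last "take". The second consumes as few as possible, so it stops at the first "take".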
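To see the difference in isolation, here is a minimal Ruby sketch. It uses cut-down patterns rather than the full expression from the question, and the names GREEDY, LAZY and the shortened description are assumptions made only for illustration.

```ruby
# Shortened version of the problem description (illustrative, not the full text).
text = "Directions: For adults, take one (1) tablet daily, preferably with a meal. " \
       "Discuss the supplements and medications you take with your health care providers."

# Cut-down patterns: only the prefix group and the dosage_amount capture are kept.
GREEDY = /directions:\s*(?:.*take\s+)?(?<dosage_amount>\S+)/i
LAZY   = /directions:\s*(?:.*?take\s+)?(?<dosage_amount>\S+)/i

# Greedy .* runs to the end of the line, then backtracks to the LAST "take".
greedy_amount = text.match(GREEDY)[:dosage_amount]

# Lazy .*? grows one character at a time and stops at the FIRST "take".
lazy_amount = text.match(LAZY)[:dosage_amount]

puts greedy_amount # prints "with"
puts lazy_amount   # prints "one"
```

The same one-character change should carry over to the other `.*` prefixes in the full expression (`.*chew\s+`, `.*mix\s+`, `.*supplement,\s+`) if those words can also repeat in a description, though that has not been checked against the real catalog data.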

Sleafar