I have a regex trying to divide questions by speciality. Say I have the following regex:

(?P<speciality>[0-9x]+)

It works fine for this question (correct match: 7)

(7)Which of the following is LEAST to be considered as a risk factor for esophageal cancer?;

And for this (correct match: 8 and 13)

(8,13)30 year old woman with amenorrhea, low serum estrogen and high serum LH/FSH, the most likely diagnosis is:

But not for this one (incorrect match: 20).

First trimester spontaneous abortion (before 20 wk) is most commonly due to:

I only need the numbers in parentheses at the beginning of the question; all other parentheses should be ignored. Is this possible with a regex alone (with a lookahead, perhaps)?

Jan
  • what means the x in your specialities? [0-9x]+ will match strings like xxxxx, x9x9x9 etc. – 1010 Jan 18 '15 at 16:58
  • That's exactly what I want. I have a speciality numbered 1-16 and x for unknown specialities. – Jan Jan 18 '15 at 17:54

3 Answers

If your regex flavor supports \G (continue matching where the previous match ended) and \K (reset the beginning of the reported match), try:

(?:^\(|\G,)\K[\dx]+

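^\( matches an opening parenthesis at the start of the string; the alternative \G, matches a comma directly after the previous match. \K then resets the start of the reported match, and [\dx]+ matches one or more digits or x characters (\d is shorthand for [0-9]). The matches will be in $0.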

Test at regex101.com; Regex FAQ


PHP example

$str = "(1x,2,3x) abc (1,2x,3) d";

preg_match_all('~(?:^\(|\G,)\K[\dx]+~', $str, $out);

print_r($out[0]);

Array
(
    [0] => 1x
    [1] => 2
    [2] => 3x
)

Test at eval.in
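
Not every flavor has \G and \K; Python's built-in re module, for instance, supports neither. The following is a minimal sketch of the same idea for that case, assuming Python (the question does not name a flavor). The helper name leading_specialities is hypothetical. Pattern.match(text, pos) only succeeds at exactly pos, which plays the role of \G here.

```python
import re

# One speciality: digits and/or "x", as in the question's character class.
ITEM = re.compile(r"[\dx]+")


def leading_specialities(text):
    """Return the specialities listed in parentheses at the very start of text.

    Hypothetical helper that imitates (?:^\\(|\\G,)\\K[\\dx]+ without \\G or \\K.
    """
    results = []
    if not text.startswith("("):
        return results  # no opening parenthesis at the start: nothing to collect
    pos = 1  # just past the opening parenthesis
    while True:
        m = ITEM.match(text, pos)  # anchored at pos, like \G
        if m is None:
            break
        results.append(m.group())
        pos = m.end()
        # Continue only if a comma follows directly after the item.
        if pos < len(text) and text[pos] == ",":
            pos += 1
        else:
            break
    return results


print(leading_specialities("(1x,2,3x) abc (1,2x,3) d"))
```

Like the pattern above, this sketch does not check for the closing parenthesis; it simply stops at the first character that is neither part of an item nor a comma.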

Jonny 5

Perhaps something like this will work (you don't mention the regex flavor that you're using, though I am guessing it is PCRE by the use of the named group - and yes, it does use positive lookahead):

/^\((?P<speciality>(?:[0-9x]+,?)+)(?=\))/mg

The caret ^ combined with the multiline modifier m (which makes the anchors ^ and $ match at the beginning and end of each line, instead of only at the beginning and end of the string) ensures that the match starts at the beginning of a line. The specialities are captured in the named capture group speciality. The only caveat is that if more than one speciality is given (as in your example starting with (8,13)), the capture is the whole comma-delimited list (8,13 in that case), so it still has to be split on the commas.

Please see Regex Demo here.
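
As a rough illustration of the caveat, here is a sketch that applies the same pattern and then splits each captured list. It assumes Python's re module, where re.MULTILINE stands in for the m modifier and finditer for the g modifier; the variable names are made up for the example.

```python
import re

# Same pattern as above; flags are passed separately in Python.
PATTERN = re.compile(
    r"^\((?P<speciality>(?:[0-9x]+,?)+)(?=\))",
    re.MULTILINE,  # ^ matches at the start of every line
)

questions = "\n".join([
    "(7)Which of the following is LEAST to be considered as a risk factor for esophageal cancer?;",
    "(8,13)30 year old woman with amenorrhea, low serum estrogen and high serum LH/FSH, the most likely diagnosis is:",
    "First trimester spontaneous abortion (before 20 wk) is most commonly due to:",
])

# Each match holds the comma-delimited list; split it to get single values.
found = [m.group("speciality").split(",") for m in PATTERN.finditer(questions)]
print(found)
```

The third question contributes nothing, because its parentheses are not at the start of a line.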

David Faber

(?P<speciality>[0-9x]+) matches any nonempty sequence of digits and x characters anywhere in the input. the parentheses just delimit the capturing group and are not matched literally.
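
A quick way to see this, assuming Python's re module (the flavor is not stated in the question), is to run the unanchored pattern over two of the sample questions. The variable names are only for illustration.

```python
import re

# The pattern from the question, with no anchor and no literal parentheses.
pattern = re.compile(r"(?P<speciality>[0-9x]+)")

q2 = ("(8,13)30 year old woman with amenorrhea, low serum estrogen "
      "and high serum LH/FSH, the most likely diagnosis is:")
q3 = "First trimester spontaneous abortion (before 20 wk) is most commonly due to:"

# findall returns the captured group for every match in the string.
matches_q2 = pattern.findall(q2)
matches_q3 = pattern.findall(q3)
print(matches_q2)  # the "30" of "30 year old" is picked up as well
print(matches_q3)  # the "20" inside "(before 20 wk)" is picked up
```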

to match a number (or more separated by commas) between parentheses at the beginning of the line you could use something like this

^\((\d+)(,(\d+))*\)

EDIT

it seems repeated capturing groups, as in (,(\d+))*, will only return the last match. so to get the values it'd be necessary to catch the complete list of numbers and parse it afterwards:

^\((?P<specialities>(\d+)(,(\d+))*)\)

will catch one or more numbers separated by commas, between parentheses.

added the start of line anchor so it is at the beginning of the line.
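
The behaviour described in the edit can be checked with a short sketch, assuming Python's re module; other flavors may expose repeated groups differently. The pattern is used as written, so it accepts digits only; \d could be swapped for [0-9x] if x has to be allowed.

```python
import re

text = "(8,13,15)Some question text"

# Repeated group: only the last repetition is kept in groups 2 and 3.
repeated = re.match(r"^\((\d+)(,(\d+))*\)", text)
print(repeated.groups())

# Capture the whole list instead and split it afterwards.
whole = re.match(r"^\((?P<specialities>(\d+)(,(\d+))*)\)", text)
specialities = whole.group("specialities").split(",")
print(specialities)
```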

Demo

1010
  • If you make the comma optional in `(,(\d+))*` it should work to get all the matches: `(,?(\d+))+`. – David Faber Jan 18 '15 at 12:05
  • it doesn't seem to capture all the numbers. @Johnny-5's answer solves the question and captures all the matches. – 1010 Jan 18 '15 at 16:56