1

How can I match the 2 different groups in between the () in the string below?

data foo (drop = DISCOUNT price RENAME = ( PROV_NM1= PROV_NM PROV_ST_NM1 = PROV_ST_NM) where = ( product = 'whizmo' and product < 10 )) bar( drop= DISCOUNT price rename= ( startDate = beginDate ) );

I need to match this to get 2 groups:

  1. foo (drop = DISCOUNT price RENAME = ( PROV_NM1= PROV_NM PROV_ST_NM1 = PROV_ST_NM) where = ( product = 'whizmo' and product < 10 ))
  2. bar( drop= DISCOUNT price rename= ( startDate = beginDate ) )

I have been trying this for quite a few days now and have come up with this regex: (?i)(data)\s+((\w+)(?=(\s*))(?:\4\w+))?\s*(\(((.|\n)*?)\);)? It can be seen here: regex demo

It works for most cases, but for the example above it doesn't give 2 separate groups, because it matches everything inside the brackets as a single group.

I have tried a few recursive patterns too but sadly have been unable to figure it out. Any help or guidance is appreciated. Thank you.

frosty
  • 2
    Looks like you'll need recursion or a stack here. Probably easier to just write a simple parser, or if this is some standard notation, use an existing, industrial-strength parser. It's not entirely clear what the spec is for this notation though. – ggorlen Sep 10 '20 at 16:34
  • Perhaps this page can be helpful https://stackoverflow.com/a/35271017/5424988 – The fourth bird Sep 10 '20 at 16:36
  • I did try using recursion for this, but it's not working as expected: everything we tried either puts everything in one group or doesn't match at all. One of the simple recursive patterns we tried is: `(?i)((\w+)(\s*=\s*)(\((.*?)\))?)(?5)` but it doesn't match as expected. Keeping parser development as a last option here, as a regex would help with the multiple patterns that can be encountered. – frosty Sep 10 '20 at 16:46
  • For specs, we are actually trying to parse SAS Data step statements for some conversions. – frosty Sep 10 '20 at 16:47

3 Answers

2

In PCRE you can use this recursive regex to capture what you want:

~(?: ^data | (?!^)\G ) \h+ ( \w+ \h* ( \( (?: [^()]*+ | (?-1) )* \) ) )~xi

RegEx Demo

Each match is available in capture group #1.

RegEx Details:

  • (?: ^data | (?!^)\G ): Match data at the start of the input, or else continue from the end of the previous match, i.e. \G. The (?!^) stops \G from matching at the very start
  • \h+: Match 1+ horizontal whitespace characters
  • (: Start capture group #1
    • \w+: Match 1+ word characters
    • \h*: Match 0+ horizontal whitespace characters
    • (: Start capture group #2
      • \(: Match literal ( (opening)
      • (?:: Start non-capture group
        • [^()]*+: Match 0 or more of any characters that are not ( and ) (possessive, so no backtracking into it)
        • |: OR
        • (?-1): Recurse into the most recently opened group, i.e. #2
      • )*: End non-capture group. Match 0 or more of this group
      • \): Match literal ) (closing)
    • ): End capture group #2
  • ): End capture group #1

Reference: RegEx Expression Recursion
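
If the target engine has no recursion support (Python's built-in re module, for example, has neither recursion nor \G), the same balanced-bracket idea can be approximated with a depth counter, along the lines of the "recursion or a stack" comment under the question. This is only a minimal sketch, not part of the answer above: the function name split_datasets is hypothetical, it assumes every dataset name is followed by a bracketed option list, and it does not handle brackets inside quoted strings.

```python
import re


def split_datasets(statement):
    """Return each `name( ... )` group that follows the leading DATA keyword."""
    # Drop the leading "data" keyword (case-insensitive) if it is present.
    body = re.sub(r"^\s*data\b", "", statement, flags=re.IGNORECASE)
    groups, depth, start = [], 0, None
    for i, ch in enumerate(body):
        if start is None and (ch.isalnum() or ch == "_"):
            start = i  # first character of the next dataset name
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            # Depth 0 means the outermost bracket of this group just closed.
            if depth == 0 and start is not None:
                groups.append(body[start:i + 1])
                start = None
    return groups


statement = (
    "data foo (drop = DISCOUNT price RENAME = ( PROV_NM1= PROV_NM "
    "PROV_ST_NM1 = PROV_ST_NM) where = ( product = 'whizmo' and product < 10 )) "
    "bar( drop= DISCOUNT price rename= ( startDate = beginDate ) );"
)
groups = split_datasets(statement)
for group in groups:
    print(group)
```

Unlike a fixed pattern, the counter copes with any nesting depth, at the cost of being a small hand-written scanner rather than a single regex.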

anubhava
  • 1
    That's an awesome pattern using `\G` and recurse the last group ++ – The fourth bird Sep 10 '20 at 19:16
  • 1
    Yes, it did! Thanks, anubhava. I also learned many new things from this. Is there any reference or book that you would suggest for getting better at regex? I have been looking for references but couldn't really find good ones. Thanks again. :) – frosty Sep 11 '20 at 04:59
  • Glad it helped. I have added a good reference on regex recursion. You will find tons of very informative pages on the same website. – anubhava Sep 11 '20 at 10:03
0

This handles a maximum of 1 level of nesting of bracketed input:

\w+\s*\((?:\([^)]+\)|[^)])*?\)

See live demo.

It matches a word followed by bracketed input, but contains an alternation that preferentially consumes inner bracketed input within the outer brackets before trying the simpler match of a non-closing bracket.
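
As a quick check, the pattern above can be run with Python's built-in re module, since it needs no recursion. This is only a sketch: the names ONE_LEVEL, statement, matches and too_deep are made up for the illustration, and the second input is an invented string with two levels of nesting to show where the one-level limit cuts the match short.

```python
import re

# Pattern from the answer: a word, optional whitespace, then brackets that may
# contain at most one level of inner brackets.
ONE_LEVEL = re.compile(r"\w+\s*\((?:\([^)]+\)|[^)])*?\)")

statement = (
    "data foo (drop = DISCOUNT price RENAME = ( PROV_NM1= PROV_NM "
    "PROV_ST_NM1 = PROV_ST_NM) where = ( product = 'whizmo' and product < 10 )) "
    "bar( drop= DISCOUNT price rename= ( startDate = beginDate ) );"
)

# No capture groups in the pattern, so findall returns the full matches.
matches = ONE_LEVEL.findall(statement)
for m in matches:
    print(m)

# Two levels of nesting: the match stops at the first bracket that closes
# the outer level as far as the pattern can tell, so " e)" is left out.
too_deep = ONE_LEVEL.findall("baz(a (b (c) d) e)")
print(too_deep)
```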

Bohemian
0

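I am no expert, but I would go for a much simpler option, assuming the dataset names (foo and bar here) are known in advance:

foo = r'(foo.+?)bar'
bar = r'(bar.+);'

# or combine both; the lookahead keeps the first alternative from consuming "bar"

r'(foo.+?)(?=bar)|(bar.+);'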
Assad Ali