How to extract regex query until a specific word?

Question

I'm trying to extract certain data from LookML, a specific markup language. If this is example code:

explore: explore_name {}
explore: explore_name1 {
  label: "name"
  join: view_name {
      relationship: many_to_one
      type: inner
      sql_on: ${activity_type.activity_name}=${activity_type.activity_name} ;;
  }
}
explore: explore_name3 {}

Then I would receive a list looking like:

explore: character_balance {}

label: "name"
join: activity_type {
  relationship: many_to_one
  type: inner
  sql_on: ${activity_type.activity_name}=${activity_type.activity_name} ;;
}

explore: explore_name4 {}

Essentially, I start a match at "explore" and end it when I find another "explore" - which would then begin the next match.

Here's what I had before, which matches across lines until it finds a ';', and it works perfectly fine: 'explore:\s[^;]*'. But it only stops at a ';', assuming there is one.

How would I change this so that it takes out everything between 'explore' and 'explore'? Simply replacing the ';' in my regex with 'explore' instead stops whenever it finds a letter that matches anything in [e,x,p,l,o,r,e] - which is not the behavior I want. Removing the square brackets and the ^ ends up breaking everything so that it can't query across multiple lines.

What should I do here?

Casimir et Hippolyte · Accepted Answer · 2019-07-01T21:30:27.797

A naive approach is to match up to the next "explore" word. But if a string value happens to contain this word, you will get wrong results. The same problem arises if you try to stop at a closing curly bracket when the block contains nested brackets.
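To make the failure concrete, here is a minimal sketch using only the standard re module. The pattern and the sample text (the explore names orders and users, and the label value) are made up for illustration; they are not from the question.

```python
import re

# Hypothetical LookML: the label value happens to contain "explore:".
text = '''explore: orders {
  label: "explore: all orders"
}
explore: users {}'''

# Naive pattern (an assumption for this demo): lazily match up to the
# next "explore:" or the end of the string. re.DOTALL lets "." cross newlines.
naive = re.compile(r'explore:.*?(?=explore:|\Z)', re.DOTALL)

pieces = [m.group(0) for m in naive.finditer(text)]

# There are only two explore blocks, but the text is cut inside the label string.
for piece in pieces:
    print(repr(piece))
```

The first piece ends right after the opening quote of the label, so one block is split in two.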

That's why I suggest a more precise description of the syntax, one that takes quoted strings and nested curly brackets into account. Since the re module doesn't support recursion (needed to deal with nested structures), I will use the regex module from PyPI instead:

import regex

pat = r'''(?xms)
    \b explore:
    [^\S\r\n]* # optional horizontal whitespace
    [^\n{]* # possible content of the same line
    # followed by two possibilities
    (?: # the content stops at the end of the line with a ;
        ; [^\S\r\n]* $
      | # or it contains curly brackets and may span multiple lines
        ( # group 1
            {
                [^{}"]*+ # all that isn't curly brackets nor double quotes
                (?:
                    " [^\\"]*+ (?: \\. [^\\"]* )*+ " # contents between quotes
                    [^{}"]*

                  |
                    (?1) # nested curly brackets, recursion in the group 1
                    [^{}"]*
                )*+
            }
        )
    )'''

results = [x.group(0) for x in regex.finditer(pat, yourstring)]


To be more rigorous, you can add support for single-quoted strings, and also ensure that the "explore:" at the start of the pattern isn't inside a string by using a (*SKIP)(*FAIL) construct.

tzaman · Answer 2 · 2019-07-01T20:35:59.667


You can use a non-greedy match with a lookahead assertion to check for the presence of another explore: or the end of the string. Try:

'explore:.*?(?=explore|$)'
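A minimal sketch of how this pattern might be applied. It assumes the re.DOTALL flag, which the pattern needs so that . can cross line breaks; the sample text uses explore names that do not themselves contain the word "explore".

```python
import re

text = '''explore: character_balance {}
explore: tower_ends {
  label: "Tower Results"
  join: activity_type {
      relationship: many_to_one
      type: inner
      sql_on: ${activity_type.activity_name}=${wba_fact_activity.activity_name} ;;
  }
}
explore: seven11_core_session_start {}'''

# re.DOTALL makes "." match newlines too. Without re.MULTILINE, "$" only
# matches at the end of the string (or just before a final newline).
pattern = re.compile(r'explore:.*?(?=explore|$)', re.DOTALL)

# Each match runs up to (not including) the next "explore"; strip the
# trailing newline left at the end of each piece.
blocks = [m.group(0).strip() for m in pattern.finditer(text)]

for block in blocks:
    print(block)
    print('---')
```

Note that a name which itself contains "explore", such as explore_name in the question, would end the match early; using explore: in the lookahead is a narrower variant.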

edited Jul 01 '19 at 20:35

answered Jul 01 '19 at 20:29


score 0 · Answer 3 · answered Jul 01 '19 at 20:30

While this is doable with a regex, you should use a parser that understands the format, as a regex solution would be pretty fragile.

Having said that, here's a Regex solution with DOTALL mode (where . matches any character including newline) enabled:

re.findall(r'explore:.*?\}', text, re.DOTALL)

explore: matches literally
.*?\} non-greedily matches up to the next }

Example:

In [1253]: text = '''explore: character_balance {}
      ...: explore: tower_ends {
      ...:   label: "Tower Results"
      ...:   join: activity_type {
      ...:       relationship: many_to_one
      ...:       type: inner
      ...:       sql_on: ${activity_type.activity_name}=${wba_fact_activity.activity_name} ;;
      ...:   }
      ...: }
      ...: explore: seven11_core_session_start {}'''

In [1254]: re.findall(r'explore:.*?\}', text, re.DOTALL)
Out[1254]: 
['explore: character_balance {}',
 'explore: tower_ends {\n  label: "Tower Results"\n  join: activity_type {\n      relationship: many_to_one\n      type: inner\n      sql_on: ${activity_type.activity_name}',
 'explore: seven11_core_session_start {}']
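Note that the second match above is cut off at the first } inside sql_on. As a rough sketch of the parser-style approach recommended at the top of this answer, the function below counts curly-bracket depth and skips double-quoted strings. The name extract_explores is hypothetical, and this is not a real LookML parser: it assumes brackets are balanced and ignores comments and single-quoted strings.

```python
def extract_explores(text):
    """Return each top-level 'explore: ... { ... }' block as a string."""
    blocks = []
    pos = 0
    while True:
        start = text.find("explore:", pos)
        if start == -1:
            break
        i = text.find("{", start)
        if i == -1:
            break
        depth = 0
        in_string = False
        end = None
        while i < len(text):
            ch = text[i]
            if in_string:
                if ch == "\\":
                    i += 1              # skip the escaped character
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:          # matching close of the explore block
                    end = i + 1
                    break
            i += 1
        if end is None:
            break                       # unbalanced brackets: stop scanning
        blocks.append(text[start:end])
        pos = end                       # resume the search after this block
    return blocks


sample = '''explore: character_balance {}
explore: tower_ends {
  label: "Tower Results"
  join: activity_type {
      sql_on: ${activity_type.activity_name}=${wba_fact_activity.activity_name} ;;
  }
}
explore: seven11_core_session_start {}'''

blocks = extract_explores(sample)
for block in blocks:
    print(block)
    print('---')
```

Because the scan resumes only after the closing bracket of each block, an "explore:" or a stray } inside a quoted label does not split the block.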
