3

Suppose I am given the following kind of string:

"(this is (haha) a string(()and it's sneaky)) ipsom (lorem) bla"

and I want to extract the substrings contained within the topmost layer of parentheses, i.e. I want to obtain the strings "this is (haha) a string(()and it's sneaky)" and "lorem".

Is there a nice pythonic method to do this? Regular expressions are not obviously up to this task, but maybe there is a way to get an xml parser to do the job? For my application I can assume the parentheses are well formed, i.e. not something like (()(().

NWMT
  • 135
  • 7
  • 3
    I think you should define a function for this. In that function, traverse the string and maintain a flag to check whether you are within a topmost layer of parentheses. Using this method, you can get the index of the start and end, and then you can extract the string and append it to the final answer – dazzieta Jul 05 '16 at 20:06
  • 1
    Would this be considered a "pythonic" method? I would go about it by using a ctr which would increment on hitting '(' and decrement on hitting ')'. When it hits 0 after hitting at least 1 '(' you can take the substring between the initial and final positions and append it to a list. – Vaibhav Bajaj Jul 05 '16 at 20:08
  • Hi utkarsh13. Thanks for that. It's more or less the solution I had in mind, but I was wondering if there was a faster way, some functionality built into Python that did it in a couple of easy-to-read lines. – NWMT Jul 05 '16 at 20:12
  • Hi Vaibhav Bajaj. Thanks. Right, this function sort of needs two levels: one that starts once you enter a parenthesis, and one that outputs once the parenthesis "count" drops to 0. Maybe it's not that bad. – NWMT Jul 05 '16 at 20:14
  • 1
    @user177955 Quick and dirty: `print re.match(string.replace(")",").").replace("(",".("), string).groups()[0::4]`. Sorry I couldn't resist it: the string looked way too much like a regex, that I made it into a regex. :P That being said, you should really write your own stack or follow something like what utkarsh said. – UltraInstinct Jul 05 '16 at 20:16
  • Is the leading `(` always at the start? – Padraic Cunningham Jul 05 '16 at 20:19
  • `(` is not necessarily at the start. – NWMT Jul 05 '16 at 20:23
  • 1
    @SuperSaiyan Sure, for any string there exists an arbitrarily complicated RE that will do the job :-P – NWMT Jul 05 '16 at 20:32
  • this question doesn't appear to be xml related... – vtd-xml-author Jul 05 '16 at 20:42

4 Answers

7

This is a standard use case for a stack: You read the string character-wise and whenever you encounter an opening parenthesis, you push the symbol to the stack; if you encounter a closing parenthesis, you pop the symbol from the stack.

Since you only have a single type of parentheses, you don’t actually need a stack; instead, it’s enough to just remember how many open parentheses there are.

In addition, in order to extract the texts, we also remember where a part starts when a parenthesis on the first level opens and collect the resulting string when we encounter the matching closing parenthesis.

This could look like this:

string = "(this is (haha) a string(()and it's sneaky)) ipsom (lorem) bla"

stack = 0
startIndex = None
results = []

for i, c in enumerate(string):
    if c == '(':
        if stack == 0:
            startIndex = i + 1 # string to extract starts one index later

        # push to stack
        stack += 1
    elif c == ')':
        # pop stack
        stack -= 1

        if stack == 0:
            results.append(string[startIndex:i])

print(results)
# ["this is (haha) a string(()and it's sneaky)", 'lorem']
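The same counting idea can also be packaged as a reusable function. The sketch below is a variant, not part of the answer above: the name `top_level_groups`, the configurable delimiters, and the `ValueError` raised on unbalanced input are all assumptions added for illustration. The question states the parentheses are well formed, so the checks are only a safety net.

```python
def top_level_groups(text, open_char="(", close_char=")"):
    """Yield the substrings enclosed by each outermost pair of delimiters."""
    depth = 0     # number of currently open delimiters
    start = None  # index just after the outermost opening delimiter
    for i, c in enumerate(text):
        if c == open_char:
            if depth == 0:
                start = i + 1
            depth += 1
        elif c == close_char:
            if depth == 0:
                raise ValueError("unmatched closing delimiter at index %d" % i)
            depth -= 1
            if depth == 0:
                yield text[start:i]
    if depth != 0:
        raise ValueError("unclosed opening delimiter")


sample = "(this is (haha) a string(()and it's sneaky)) ipsom (lorem) bla"
print(list(top_level_groups(sample)))
```

Because it is a generator, nothing is scanned (and no error is raised) until the result is iterated, for example with `list(...)`.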
poke
  • 307,619
  • 61
  • 472
  • 533
  • @poke. Thanks for writing up utkarsh13 and Vaibhav Bajaj's comments. I do have a mini question: how does `for i,c in enumerate(string)` work? – NWMT Jul 05 '16 at 20:36
  • 1
    @user177955 Iterating over [`enumerate(x)`](https://docs.python.org/3/library/functions.html#enumerate) will give you a two-tuple on each iteration with the index in addition to the value of the iterable. So instead of getting just every character from the string, we get the character paired with its index in the string. – poke Jul 05 '16 at 20:46
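To make the explanation of `enumerate` concrete, here is a small sketch; the variable names `word` and `pairs` are only illustrative. It shows the index/character pairs and a roughly equivalent loop using explicit indexing.

```python
word = "a(b)"

# enumerate yields (index, item) pairs; in a for loop, tuple unpacking
# splits each pair into two names such as i and c.
pairs = list(enumerate(word))
print(pairs)  # [(0, 'a'), (1, '('), (2, 'b'), (3, ')')]

# Roughly equivalent loop written with explicit indexing.
for i in range(len(word)):
    c = word[i]
    assert (i, c) == pairs[i]
```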
0

Are you sure regex isn't good enough?

>>> import re
>>> x=re.compile(r'\((?:(?:\(.*?\))|(?:[^\(\)]*?))\)')
>>> x.findall("(this is (haha) a string(()and it's sneaky)) ipsom (lorem) bla")
['(haha)', "(()and it's sneaky))", '(lorem)']
>>> x.findall("((((this is (haha) a string((a(s)d)and ((it's sneaky))))))) ipsom (lorem) bla")
["((((this is (haha) a string((a(s)d)and ((it's sneaky))", '(lorem)']
Delioth
  • 1,489
  • 7
  • 16
  • 1
    I didn't downvote. But regex is just not a tool for places where a stack is needed. I should be ashamed for having proposed the same in comments too (but it was just for fun ;)) – UltraInstinct Jul 05 '16 at 20:20
  • afaik there is some builtin regexp package (literally `import regexp` I think) that has extended support for things needing a stack .... afaik ... I still dont approve of regex for this solution imho) – Joran Beasley Jul 05 '16 at 20:23
  • @JoranBeasley this is less of "you should use this blindly since it's regex and it's good" and more proof the statement "regular expressions are **obviously** not up to this task" is completely wrong, as they _can_ do it. – Delioth Jul 05 '16 at 20:25
  • I can give you a string that breaks that regex im pretty sure ... the look ahead look around stuff makes it hard to guess (I certainly didnt downvote and if regex works then great :P) – Joran Beasley Jul 05 '16 at 20:27
  • consider `"((((this is (haha) a string((a(s)d)and ((it's sneaky))))))) ipsom (lorem) bla"` ... unless you 100% know for sure the maximum nesting depth ... and even then the regex gets pretty ugly – Joran Beasley Jul 05 '16 at 20:35
  • It doesn't look around at all, it just finds a start paren, any number of matching parens inside, and a closing paren- there was a floating "find ANYthing" in there that's now fixed- due to regex not matching overlapping strings it won't get any inner groups, only the outermost. Answer edited with the updated regex which does things properly. – Delioth Jul 05 '16 at 20:42
  • @JoranBeasley the question asked to find whatever is inside the outermost parens- if more parens are in there should they be discarded? it's rather ambiguous, and this won't remove any parens. – Delioth Jul 05 '16 at 20:44
  • @Delioth I actually really like regexes, but for one thing that regex string is really hard for _me_ to parse. Although the language of well formed parentheses is not regular it is true that what I'm asking is strictly weaker. Given the assumption that the parens are well-formed, I guess this thing actually works. I upvoted your answer and commend your fresh regex skillz. – NWMT Jul 06 '16 at 14:12
  • @Delioth Hey! wait a minute:`>>> import re, >>> x.findall("(this is (haha) a string(()and it's sneaky)) ipsom (lorem) bla"), ['(haha)', "(()and it's sneaky))", '(lorem)']` This doesn't work! – NWMT Aug 18 '16 at 14:23
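The result reported in the last comment can be reproduced with the pattern exactly as posted. The sketch below is only a quick check (the names `pattern`, `sample` and `wanted` are illustrative). At index 0 neither alternative can match: the first needs a second `(` immediately after the opening one, and the second cannot cross the nested `(`. The scan therefore moves on and picks up inner groups instead.

```python
import re

pattern = re.compile(r'\((?:(?:\(.*?\))|(?:[^\(\)]*?))\)')
sample = "(this is (haha) a string(()and it's sneaky)) ipsom (lorem) bla"

matches = pattern.findall(sample)
print(matches)
# ['(haha)', "(()and it's sneaky))", '(lorem)']

# The result asked for in the question, for comparison.
wanted = ["this is (haha) a string(()and it's sneaky)", "lorem"]
print(matches == wanted)  # False
```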
0

This isn't very "pythonic", but:

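def find_strings_inside(what_open, what_close, s):
    stack = []  # one entry per currently open delimiter
    msg = []    # characters collected for the current top-level group
    for c in s:
        if c == what_open:
            stack.append(c)
            if len(stack) == 1:
                continue  # skip the outermost opening delimiter itself
        elif c == what_close and stack:
            stack.pop()
            if not stack:
                # the outermost group just closed: emit it and start afresh
                yield "".join(msg)
                msg[:] = []
        if stack:
            msg.append(c)

x = list(find_strings_inside("(", ")", "(this is (haha) a string(()and it's sneaky)) ipsom (lorem) bla"))

print(x)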
Joran Beasley
  • 93,863
  • 11
  • 131
  • 160
0

This more or less repeats what's already been said, but might be a bit easier to read:

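def extract(string):
    flag = 0  # current nesting depth; 0 means outside all parentheses
    result, accum = [], []
    for c in string:
        if c == ')':
            flag -= 1  # decrement first so the outermost ')' is not collected
        if flag:
            accum.append(c)
        if c == '(':
            flag += 1  # increment last so the outermost '(' is not collected
        if not flag and accum:
            result.append(''.join(accum))
            accum = []
    return result

test = "(this is (haha) a string(()and it's sneaky)) ipsom (lorem) bla"
print(extract(test))
# ["this is (haha) a string(()and it's sneaky)", 'lorem']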