bounding strings between two characters in regex

Question

I am using <[^<>]+> to extract substrings between < and >, such as the following:

<abc>, <?.sdfs/>, <sdsld\>, etc.

I am not trying to parse HTML tags, or something similar. My only issue is extracting strings between < and >.

But sometimes, there might be substrings like the following:

</</\/\asa></dsdsds><sdsfsa>>

In that case, the whole string should be matched as a single match instead of 3 substrings, because the whole string is enclosed by an outer < and >.

How can I modify my regex to do that?

Such strings aren’t part of a regular language, so a regex is probably the wrong approach. — Sebastian Simon, Mar 08 '17 at 08:55
Depending on the language you are using, there are special regex constructs to do it. C# for example has one (http://stackoverflow.com/questions/17003799/what-are-regular-expression-balancing-groups) — xanatos, Mar 08 '17 at 08:56
There is an old question about this: [Matching Nested Structures With Regular Expressions in Python](http://stackoverflow.com/q/1099178/613130) — xanatos, Mar 08 '17 at 08:57
And there is a newer python regex library that should do it: https://pypi.python.org/pypi/regex (the *Nested sets and set operations are supported.* part) — xanatos, Mar 08 '17 at 09:00
@yusuf: With PyPi `regex` module, you may use this - [`<(?:[^<>]++|(?R))*>`](https://regex101.com/r/qT5FyU/1) — Wiktor Stribiżew, Mar 08 '17 at 09:01
@yusuf: **Only** with the PyPi regex module. Or write your own method using a stack. — Wiktor Stribiżew, Mar 08 '17 at 09:08

Abhishek Jebaraj · Accepted Answer · 2017-03-08T09:28:39.930

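Don't use a regex. Do it the traditional way: keep a stack and a nesting counter. Every '<' after the first one is appended as ordinary content, and the match is only closed and emitted when the '>' that pairs with the outermost '<' is reached.

The doubled backslashes in the output below are only how Python displays a single backslash in a string's repr; the strings themselves are unchanged.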

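def find_tags(your_string):
    ans = []      # completed outermost matches
    stack = []    # characters of the match being built
    tag_no = 0    # current nesting depth

    for c in your_string:
        if c == '<':
            tag_no += 1
            if tag_no > 1:
                # a nested '<' is part of the content
                stack.append(c)
        elif c == '>':
            if tag_no == 1:
                # closes the outermost '<': emit the match
                ans.append(''.join(stack))
                tag_no = 0
                stack = []
            elif tag_no > 1:
                # closes a nested '<': keep it as content
                tag_no -= 1
                stack.append(c)
            # a stray '>' outside any '<' is ignored
        elif tag_no > 0:
            stack.append(c)
    return ans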

Output below

find_tags(r'<abc>, <?.sdfs/>, <sdsld\>')
['abc', '?.sdfs/', 'sdsld\\']
find_tags(r'</</\/\asa></dsdsds><sdsfsa>>')
['/</\\/\\asa></dsdsds><sdsfsa>']

Note: Works in O(n) as well.

abhishek, I want to include < and >. — yusuf, Mar 08 '17 at 09:38
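
To keep the enclosing < and > in each match, as the comment above asks, one option is to track the nesting depth and slice the original string instead of collecting characters. This is only a minimal sketch: the name find_outer_spans is hypothetical, and silently skipping unbalanced brackets is an assumption about the desired behaviour.

```python
def find_outer_spans(text):
    """Return the outermost <...> spans of text, delimiters included."""
    spans = []
    depth = 0      # current nesting level
    start = None   # index of the '<' that opened the current outermost span
    for i, ch in enumerate(text):
        if ch == '<':
            if depth == 0:
                start = i
            depth += 1
        elif ch == '>' and depth > 0:
            depth -= 1
            if depth == 0:
                # the outermost '<' has just been closed
                spans.append(text[start:i + 1])
        # a '>' seen at depth 0 is unbalanced and is skipped
    return spans


print(find_outer_spans(r'<abc>, <?.sdfs/>, <sdsld\>'))
print(find_outer_spans(r'</</\/\asa></dsdsds><sdsfsa>>'))
```

Like the answer's version, this is a single pass over the input, so it stays O(n).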

score 1 · Answer 2 · edited May 23 '17 at 12:00

Refer to the question "Regular Expression to match outer brackets"; I'm trying to implement the same thing using < and >.

Or how about a small method for this:

def recursive_bracket_parser(s, i):
    while i < len(s):
        if s[i] == '<':
            i = recursive_bracket_parser(s, i+1)
        elif s[i] == '>':
            return i+1
        else:
            # process whatever is at s[i]
            i += 1
    return i

Source: How can I match nested brackets using regex?
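
Since the function above returns the index just past the matching >, it can be called at each top-level < to slice out the outermost spans. The following is a rough sketch that assumes the brackets are balanced; outer_spans is a hypothetical helper, not part of the linked answer.

```python
def recursive_bracket_parser(s, i):
    # Same function as above: i is the index just after an opening '<'.
    while i < len(s):
        if s[i] == '<':
            i = recursive_bracket_parser(s, i + 1)
        elif s[i] == '>':
            return i + 1
        else:
            i += 1
    return i


def outer_spans(s):
    """Collect outermost <...> spans, assuming balanced brackets."""
    spans = []
    i = 0
    while i < len(s):
        if s[i] == '<':
            # end is the index just past the '>' that closes this '<'
            end = recursive_bracket_parser(s, i + 1)
            spans.append(s[i:end])
            i = end
        else:
            i += 1
    return spans


print(outer_spans(r'</</\/\asa></dsdsds><sdsfsa>>'))
```

Because the parser recurses once per nesting level, very deeply nested input could hit Python's recursion limit; an explicit depth counter avoids that.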
