1

I am using <[^<>]+> to extract substrings between < and >, such as the following:

<abc>, <?.sdfs/>, <sdsld\>, etc.

I am not trying to parse HTML tags, or something similar. My only issue is extracting strings between < and >.

But sometimes, there might be substrings like the following:

</</\/\asa></dsdsds><sdsfsa>>

In that case, the whole string should be matched, instead of 3 substrings, because the whole string is enclosed by < and >.
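A minimal reproduction of the behaviour described above, assuming Python's built-in re module (the question itself does not name a language; the variable names are only illustrative):

```python
import re

# The pattern from the question: '<', one or more non-bracket characters, '>'.
pattern = re.compile(r'<[^<>]+>')

simple = r'<abc>, <?.sdfs/>, <sdsld\>'
nested = r'</</\/\asa></dsdsds><sdsfsa>>'

simple_matches = pattern.findall(simple)
nested_matches = pattern.findall(nested)

print(simple_matches)  # three separate groups, as intended
print(nested_matches)  # three inner pieces, not the one enclosing group
```

Because [^<>] can never cross a bracket, the pattern only ever finds innermost groups, which is why the nested input is split into 3 substrings.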

How can I modify my regex to do that?

xanatos
yusuf

2 Answers

1

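Don't use regex. Do it the traditional way: keep a counter of open '<' and a stack (buffer) of characters. While more than one '<' is open, keep appending to the buffer; when the outermost '>' is reached, append the whole buffer to the result and start over.

The doubled backslashes in the output below are only how Python's repr displays a single backslash; the strings themselves contain just one.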

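def find_tags(your_string):
    ans = []      # contents of each outermost <...> group
    stack = []    # characters collected for the current group
    tag_no = 0    # number of currently open '<'

    for c in your_string:
        if c == '<':
            tag_no += 1
            if tag_no > 1:
                # a nested '<' is part of the content
                stack.append(c)
        elif c == '>' and tag_no > 0:
            if tag_no == 1:
                # the outermost '>' closes the group
                ans.append(''.join(stack))
                tag_no = 0
                stack = []
            else:
                tag_no -= 1
                stack.append(c)
        elif tag_no > 0:
            stack.append(c)
        # anything outside a group, including a stray '>', is ignored
    return ans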

Output below

find_tags(r'<abc>, <?.sdfs/>, <sdsld\>')
['abc', '?.sdfs/', 'sdsld\\']
find_tags(r'</</\/\asa></dsdsds><sdsfsa>>')
['/</\\/\\asa></dsdsds><sdsfsa>']

Note: this runs in O(n) time, where n is the length of the input string.
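If the brackets themselves should be part of the result, as they are with the original regex, the same counting idea can be expressed with indices instead of a character buffer. This is only a sketch; find_outer_groups is a hypothetical name, and unclosed groups are silently dropped by assumption:

```python
def find_outer_groups(text):
    """Return each outermost <...> group, brackets included."""
    groups = []
    depth = 0      # number of currently open '<'
    start = None   # index of the '<' that opened the current outer group

    for i, c in enumerate(text):
        if c == '<':
            if depth == 0:
                start = i
            depth += 1
        elif c == '>' and depth > 0:
            depth -= 1
            if depth == 0:
                # The outermost group just closed: slice it out whole.
                groups.append(text[start:i + 1])
        # A stray '>' at depth 0 is ignored.
    return groups


print(find_outer_groups(r'<abc>, <?.sdfs/>, <sdsld\>'))
print(find_outer_groups(r'</</\/\asa></dsdsds><sdsfsa>>'))
```

Slicing the original string avoids rebuilding each group character by character, and it is still a single O(n) pass.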

Abhishek Jebaraj
1

See "Regular Expression to match outer brackets"; I'm trying to implement the same idea using < and >.

Or how about a small function for this:

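def recursive_bracket_parser(s, i):
    # Scan from index i; return the index just past the '>' that closes
    # the current group (or len(s) if it is never closed).
    while i < len(s):
        if s[i] == '<':
            # nested group: recurse, then continue after its closing '>'
            i = recursive_bracket_parser(s, i+1)
        elif s[i] == '>':
            return i+1
        else:
            # process whatever is at s[i]
            i += 1
    return i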

Source: How can I match nested brackets using regex?
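The parser above only returns an index, so it needs a small driver to actually collect the outermost groups. The following is one possible sketch; outer_groups is a hypothetical helper, and it assumes the brackets are balanced (an unclosed '<' would make the slice run to the end of the string):

```python
def recursive_bracket_parser(s, i):
    # Return the index just past the '>' closing the group that starts
    # before index i (or len(s) if that group is never closed).
    while i < len(s):
        if s[i] == '<':
            i = recursive_bracket_parser(s, i + 1)
        elif s[i] == '>':
            return i + 1
        else:
            i += 1
    return i


def outer_groups(s):
    # Hypothetical driver: assumes balanced brackets in s.
    groups = []
    i = 0
    while i < len(s):
        if s[i] == '<':
            end = recursive_bracket_parser(s, i + 1)
            groups.append(s[i:end])  # whole outer group, brackets included
            i = end
        else:
            i += 1                   # text outside any group is skipped
    return groups


print(outer_groups(r'<abc>, <?.sdfs/>, <sdsld\>'))
print(outer_groups(r'</</\/\asa></dsdsds><sdsfsa>>'))
```

Each nesting level costs one Python stack frame, so extremely deep nesting could hit the interpreter's recursion limit; an iterative counter avoids that.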

Community
NikhilGoud