Regex pattern recursively - in python

Question

I've been sitting on this for 2 days now, and still didn't find an effective way to do so. Lets say we have the string:

<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>

The output I want:

<562947621914222412421>

Well, recursively there might be data inside <> brackets, it could consist of numbers and letters - but the first level consists only of numbers. I bolded out the data I want to extract.

I want to do this in Python. The naive way is to implement a bracket stack of course (that way I'll know if I'm inside an inner bracket or I'm in the 1st level) - but it's very inefficient going char by char. I believe there is a good regex pattern I could use, but I haven't come up with something that works.

Can some with enough experience with regex give a little help?

Of course, other ideas except running char by char iteratively are welcome as well, run-time is important to me.

How deeply is your real data nested? – Stefan Pochmann Dec 21 '19 at 17:08 — Stefan Pochmann, Dec 21 '19 at 17:08

bobble bubble · Answer 1 · 2019-12-21T22:32:33.617

4

There is an alternative regex module available in Python that allows recursive patterns.

With this you could use such pattern for balanced brackets and replace with empty string.

regex.sub(r'(?!^)<(?:[^><]*|(?R))+>', '', s)

See this regex101 demo or a Python demo, results in

<562947621914222412421>

At (?R) the pattern is pasted from start, same like (?0). Used (?!^) to omit first <

edited Dec 21 '19 at 22:32

answered Dec 21 '19 at 16:36

bobble bubble

11,968
2
22
34

kaya3 · Accepted Answer · 2019-12-21T21:09:45.683

Of course, other ideas except running char by char iteratively are welcome as well, run-time is important to me.

Of course, any regex also has to run through the string character by character, too. Don't rule out the "naive" solution so easily: it turns out the simple way is more efficient than all three of the posted answers so far.

Here's a solution like your "naive" one: but it doesn't require a stack, because there is only one kind of open-bracket. Even with multiple kinds of brackets, you only need a stack if you also want to detect when the brackets are mismatched.

def chars_at_level(s):
    out = ['<']
    nesting_level = 0

    for c in s:
        if c == '<':
            nesting_level += 1
        elif c == '>':
            nesting_level -= 1
        elif nesting_level == 1:
            out.append(c)

    out.append('>')
    return ''.join(out)

Example:

>>> s = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
>>> chars_at_level(s)
'<562947621914222412421>'

Now for the performance comparison. It beats the other three solutions, though Seb's solution is close.

>>> timeit(lambda: chars_at_level(s))
7.594452977000401
>>> timeit(lambda: parse(s)) # Seb's solution using re.sub
7.817124693000096
>>> timeit(lambda: regex_sub(s)) # bobble bubble's recursive regex
9.322779934999744
>>> timeit(lambda: nested_list(s)) # Ajax1234's nested list solution
17.795835303999866

However, Seb's solution is much less efficient in the worst case, on strings like <<<<<<1>>>>>>, because it does O(n) replacements on strings of length O(n), giving a running time of O(n²). The other two posted solutions still seem to be about O(n) on this kind of string, though I had to increase the system recursion limit for Ajax1234's solution to work. The "naive" solution is still fastest.

>>> t = (1000 * '<') + '1' + (1000 * '>')
>>> timeit(lambda: chars_at_level(t), number=1000)
0.1329130509998322
>>> timeit(lambda: parse(t), number=1000) # Seb's solution using re.sub
31.281542531000014
>>> timeit(lambda: regex_sub(t), number=1000) # bobble bubble's recursive regex
0.705901896999876
>>> timeit(lambda: nested_list(t), number=1000) # Ajax1234's nested list solution
1.1296931150000091

By the way, even if you do want to augment the "naive" solution with a stack, it still only takes O(n) time. It's also fairly trivial to change this algorithm to get the characters at any other nesting level, too.

score 3 · Answer 3 · answered Dec 21 '19 at 15:42

This is one way; it recursively removes inner complete tags <[^<>]*> until only the outer level elements remain:

def parse(string):
    while True:
        output = re.sub(r'(?<!^)<([^<>]*)>(?!$)', '', string)
        if output == string:
            break
        string = output
    return output

>>> string = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
>>> parse(string)
'<562947621914222412421>'

>>> %timeit parse(string)
6.57 µs ± 99.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

There should also be a way to do the recursion in regex itself, but I couldn't quite make it work. The built-in module re does not support this, but regex does. Here are some ideas on that.

score 2 · Answer 4 · answered Dec 21 '19 at 15:59

You can use recursion to create a nested list from the structure. Then, to produce the desired result, you can simply find all the top level strings in the list:

import re
data = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
def get_data(d):
  if (val:=next(d, None)) not in {'<', '>'}:
     yield val
  if val == '<':
     yield list(get_data(d))
  if val is not None and val != '>':
     yield from get_data(d)

result = '<'+''.join(i for i in list(get_data(iter(re.findall('[^\<\>]+|[\<\>]', data))))[0] if isinstance(i, str))+'>'

Output:

'<562947621914222412421>'

Regex pattern recursively - in python

4 Answers4