5

I've been sitting on this for 2 days now, and still didn't find an effective way to do so. Lets say we have the string:

<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>

The output I want:

<562947621914222412421>

Well, recursively there might be data inside <> brackets, it could consist of numbers and letters - but the first level consists only of numbers. I bolded out the data I want to extract.

I want to do this in Python. The naive way is to implement a bracket stack of course (that way I'll know if I'm inside an inner bracket or I'm in the 1st level) - but it's very inefficient going char by char. I believe there is a good regex pattern I could use, but I haven't come up with something that works.

Can some with enough experience with regex give a little help?

Of course, other ideas except running char by char iteratively are welcome as well, run-time is important to me.

kaya3
  • 31,244
  • 3
  • 32
  • 61
regexer178
  • 53
  • 3

4 Answers4

4

There is an alternative regex module available in Python that allows recursive patterns.

With this you could use such pattern for balanced brackets and replace with empty string.

regex.sub(r'(?!^)<(?:[^><]*|(?R))+>', '', s)

See this regex101 demo or a Python demo, results in

<562947621914222412421>

At (?R) the pattern is pasted from start, same like (?0). Used (?!^) to omit first <

bobble bubble
  • 11,968
  • 2
  • 22
  • 34
4

Of course, other ideas except running char by char iteratively are welcome as well, run-time is important to me.

Of course, any regex also has to run through the string character by character, too. Don't rule out the "naive" solution so easily: it turns out the simple way is more efficient than all three of the posted answers so far.


Here's a solution like your "naive" one: but it doesn't require a stack, because there is only one kind of open-bracket. Even with multiple kinds of brackets, you only need a stack if you also want to detect when the brackets are mismatched.

def chars_at_level(s):
    out = ['<']
    nesting_level = 0

    for c in s:
        if c == '<':
            nesting_level += 1
        elif c == '>':
            nesting_level -= 1
        elif nesting_level == 1:
            out.append(c)

    out.append('>')
    return ''.join(out)

Example:

>>> s = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
>>> chars_at_level(s)
'<562947621914222412421>'

Now for the performance comparison. It beats the other three solutions, though Seb's solution is close.

>>> timeit(lambda: chars_at_level(s))
7.594452977000401
>>> timeit(lambda: parse(s)) # Seb's solution using re.sub
7.817124693000096
>>> timeit(lambda: regex_sub(s)) # bobble bubble's recursive regex
9.322779934999744
>>> timeit(lambda: nested_list(s)) # Ajax1234's nested list solution
17.795835303999866

However, Seb's solution is much less efficient in the worst case, on strings like <<<<<<1>>>>>>, because it does O(n) replacements on strings of length O(n), giving a running time of O(n²). The other two posted solutions still seem to be about O(n) on this kind of string, though I had to increase the system recursion limit for Ajax1234's solution to work. The "naive" solution is still fastest.

>>> t = (1000 * '<') + '1' + (1000 * '>')
>>> timeit(lambda: chars_at_level(t), number=1000)
0.1329130509998322
>>> timeit(lambda: parse(t), number=1000) # Seb's solution using re.sub
31.281542531000014
>>> timeit(lambda: regex_sub(t), number=1000) # bobble bubble's recursive regex
0.705901896999876
>>> timeit(lambda: nested_list(t), number=1000) # Ajax1234's nested list solution
1.1296931150000091

By the way, even if you do want to augment the "naive" solution with a stack, it still only takes O(n) time. It's also fairly trivial to change this algorithm to get the characters at any other nesting level, too.

kaya3
  • 31,244
  • 3
  • 32
  • 61
3

This is one way; it recursively removes inner complete tags <[^<>]*> until only the outer level elements remain:

def parse(string):
    while True:
        output = re.sub(r'(?<!^)<([^<>]*)>(?!$)', '', string)
        if output == string:
            break
        string = output
    return output
>>> string = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
>>> parse(string)
'<562947621914222412421>'
>>> %timeit parse(string)
6.57 µs ± 99.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

There should also be a way to do the recursion in regex itself, but I couldn't quite make it work. The built-in module re does not support this, but regex does. Here are some ideas on that.

Seb
  • 3,686
  • 8
  • 20
2

You can use recursion to create a nested list from the structure. Then, to produce the desired result, you can simply find all the top level strings in the list:

import re
data = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
def get_data(d):
  if (val:=next(d, None)) not in {'<', '>'}:
     yield val
  if val == '<':
     yield list(get_data(d))
  if val is not None and val != '>':
     yield from get_data(d)

result = '<'+''.join(i for i in list(get_data(iter(re.findall('[^\<\>]+|[\<\>]', data))))[0] if isinstance(i, str))+'>'

Output:

'<562947621914222412421>'
Ajax1234
  • 58,711
  • 7
  • 46
  • 83