Selectively replace specific nested delimiters (brackets) in strings, while respecting nesting

Question

I have many strings where I'm trying to selectively replace all instances of f[--whatever--] with f.__getitem__(--whatever--, x=x). This is the last option left to me to patch some old complicated code using eval calls that I'm unfortunately stuck with. It's easy to replace the f[, but it's hard to know whether instances of ] are associated with this pattern or some other miscellaneous patterns like lists [--whatever--] or indexing .loc[--whatever--]. There are no isolated cases of ] that are not part of a full [] in my strings.

My latest attempt at a solution uses regex: 1) sub ([^f])\[(.+?)\] with \1openbracket\2closebracket to preserve any [] that isn't part of f[], 2) replace the remaining f[...] with f.__getitem__(..., x=x), 3) sub openbracket and closebracket back to [ and ].

The problem is that this doesn't handle many nested cases like the example below. I'm looking for a more comprehensive solution to establish whether a given ] is associated with f[] or some other structure. Is there a way to do this with pyparsing or some other module?

Example

f[r@ndom t3xt] + [some r@ndom t3xt] + [f[more r@ndom t3xt] / f[more t3xt]] + [f[f[more t3xt] + 3]]

should become

f.__getitem__(r@ndom t3xt, x=x) + [some r@ndom t3xt] + [f.__getitem__(more r@ndom t3xt, x=x) / f.__getitem__(more t3xt, x=x)] + [f.__getitem__(f.__getitem__(more t3xt, x=x) + 3, x=x)]

Try this [`f\[([^\]]+)\]`](https://regex101.com/r/KdPP4q/1/) — Code Maniac, Sep 18 '19 at 01:08
Thanks @CodeManiac but this still misses some nested cases. I added to my example text - see [f[f[more t3xt] + 3]] — , Sep 18 '19 at 02:03
Well, in such cases you need a recursive approach. As a suggestion, you can make the regex greedy, and inside your callback check again, recursively, whether the match contains any nested pattern — Code Maniac, Sep 18 '19 at 03:30
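
The recursive idea suggested in the comment above can also be sketched without regex or third-party modules. This is only a minimal illustration, not code from any of the answers: the names `find_matching_bracket` and `rewrite_f_index` are hypothetical, and it assumes brackets are always balanced (as stated in the question) and does not treat brackets inside quoted strings specially.

```python
def find_matching_bracket(text, open_pos):
    """Return the index of the ']' that closes the '[' at open_pos."""
    depth = 0
    for pos in range(open_pos, len(text)):
        if text[pos] == "[":
            depth += 1
        elif text[pos] == "]":
            depth -= 1
            if depth == 0:
                return pos
    raise ValueError("unbalanced '[' at position {}".format(open_pos))


def rewrite_f_index(text, name="f", extra="x=x"):
    """Rewrite name[...] as name.__getitem__(..., extra), respecting nesting."""
    out = []
    i = 0
    while i < len(text):
        # Only treat `f[` as a match when `f` is not the tail of a longer
        # identifier such as `df[`.
        prev_is_word = i > 0 and (text[i - 1].isalnum() or text[i - 1] == "_")
        if text.startswith(name + "[", i) and not prev_is_word:
            open_pos = i + len(name)
            close_pos = find_matching_bracket(text, open_pos)
            # Recurse so that f[...] terms nested inside are rewritten too.
            inner = rewrite_f_index(text[open_pos + 1:close_pos], name, extra)
            out.append("{}.__getitem__({}, {})".format(name, inner, extra))
            i = close_pos + 1
        else:
            # Any other character, including other brackets, is copied as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


sample = ("f[r@ndom t3xt] + [some r@ndom t3xt] + "
          "[f[more r@ndom t3xt] / f[more t3xt]] + [f[f[more t3xt] + 3]]")
print(rewrite_f_index(sample))
```

Because each `f[` is paired with its own closing bracket by counting depth, list brackets and other indexing brackets pass through untouched while the scan continues inside them.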

score 1 · Answer 1 · answered Sep 18 '19 at 02:18

Nested []'s make this a non-trivial problem. pyparsing has a "crutch" expression method called nestedExpr that makes it easy to match nested delimiters like ()'s and []'s. pyparsing also has the transformString method, for converting as-parsed data into a different form. We can use a parse-time callback (or "parse action") to repeatedly convert any nested f[zzz] terms until all have been transformed:

import pyparsing as pp

fname = pp.Keyword('f')
index_expr = pp.nestedExpr('[', ']')
# nestedExpr will give a nested list by default, we just want the original raw text
f_expr = fname + pp.originalTextFor(index_expr)("index_expr")

# define a parse action to convert the f[aaa] format to f.__getitem__(aaa, x=x)
def convert_to_getitem(t):
    # get the contents of the index_expr, minus the leading and trailing []'s
    index_expr = t.index_expr[1:-1]

    # repeatedly call transform string to get further nested f[] expressions, until 
    # transformString stops returning a modified string
    while True:
        transformed = f_expr.transformString(index_expr)
        if transformed == index_expr:
            break
        index_expr = transformed

    # reformat to use getitem
    return "f.__getitem__({}, x=x)".format(transformed)

# add the parse action to f_expr
f_expr.addParseAction(convert_to_getitem)


# use transformString to convert the input string with nested expressions
sample = "f[r@ndom t3xt] + [some r@ndom t3xt] + [f[more r@ndom t3xt] / f[more t3xt]] + [f[f[more t3xt] + 3]]"
print(f_expr.transformString(sample))

Prints:

f.__getitem__(r@ndom t3xt, x=x) + [some r@ndom t3xt] + [f.__getitem__(more r@ndom t3xt, x=x) / f.__getitem__(more t3xt, x=x)] + [f.__getitem__(f.__getitem__(more t3xt, x=x) + 3, x=x)]

This should also handle '[]'s that might occur in quoted strings.

score 0 · Answer 2 · answered Sep 18 '19 at 01:23

Maybe,

f\[([^]]*)\]

and a re.sub with,

f.__getitem__(\1, x=x)

might simply work.

Test

import re

regex = r"f\[([^]]*)\]"

string = """
f[r@ndom t3xt] + [some r@ndom t3xt] + [f[more r@ndom t3xt] / f[more t3xt]]
f[] + [] + [f[] / f[]]

"""

# Python's re.sub uses \1 (not $1) for backreferences in the replacement
subst = r"f.__getitem__(\1, x=x)"

print(re.sub(regex, subst, string))

Output

f.__getitem__(r@ndom t3xt, x=x) + [some r@ndom t3xt] + [f.__getitem__(more r@ndom t3xt, x=x) / f.__getitem__(more t3xt, x=x)]
f.__getitem__(, x=x) + [] + [f.__getitem__(, x=x) / f.__getitem__(, x=x)]

If you wish to simplify, modify, or explore the expression, it is explained in the top right panel of regex101.com, where you can also watch how it matches against some sample inputs.
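
As a quick check of this pattern, the sketch below runs it on a flat input and on the nested case from the question. It assumes Python's `\1` backreference syntax in the replacement string; the variable names are illustrative only.

```python
import re

pattern = r"f\[([^]]*)\]"
replacement = r"f.__getitem__(\1, x=x)"

flat = "f[a] + [b] + [f[c] / f[d]]"
nested = "[f[f[more t3xt] + 3]]"

# Flat f[...] terms are rewritten as intended.
flat_result = re.sub(pattern, replacement, flat)
print(flat_result)

# In the nested case the capture group runs from the outer `f[` to the
# first `]`, so the outer f[ is paired with the inner closing bracket.
nested_result = re.sub(pattern, replacement, nested)
print(nested_result)
```

The first line printed is `f.__getitem__(a, x=x) + [b] + [f.__getitem__(c, x=x) / f.__getitem__(d, x=x)]`, and the second is `[f.__getitem__(f[more t3xt, x=x) + 3]]`, which is why a single pass of this pattern is not enough when `f[...]` terms are nested.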

score 0 · Accepted Answer · Charif DZ · 2019-10-02T09:43:17.857

A solution using Regex:

import re

string1 = "f[r@ndom t3xt] + [some r@ndom t3xt] + 3[f2[more r@ndom t3xt] / f[more t3xt]] + [f[f[more t3xt] + 3]]"
string3 = '''f[text([0,[1,2],3, x["text3"]])]'''


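def get_repl(match):
    if match.groups()[-1]:
        # a plain [...] with no nested brackets: hide its [ and ] behind
        # placeholders so later passes see only the enclosing brackets
        return match.groups()[-1].replace('[', '##1##').replace(']', '##2##')
    else:
        # an f[...] with no nested brackets: rewrite it as a __getitem__ call
        return '{}.__getitem__({}, x=x)'.format(*match.groups()[:-1])

def place_by_getitem(string):
    # raw string so that \w and \[ reach the regex engine unchanged
    pattern = r'(?<!\w)(f)\[([^\[]+?)\]|(\[[^\[]+?\])'
    # each pass handles the innermost brackets; repeat until none are left
    while re.search(pattern, string):
        string = re.sub(pattern, get_repl, string)

    # restore the hidden brackets
    return string.replace('##1##', '[').replace('##2##', ']')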


print(place_by_getitem(string1))
print(place_by_getitem(string3))

Output:

f.__getitem__(r@ndom t3xt, x=x) + [some r@ndom t3xt] + 3[f2[more r@ndom t3xt] / f.__getitem__(more t3xt, x=x)] + [f.__getitem__(f.__getitem__(more t3xt, x=x) + 3, x=x)]
f.__getitem__(text([0,[1,2],3, x["text3"]]), x=x)
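
To see how the repeated substitution works from the inside out, the following sketch records the string after every pass. It reuses the pattern and placeholders from the answer; `trace_passes` is a hypothetical helper added here for illustration, and it assumes the placeholder texts `##1##` and `##2##` never occur in the input.

```python
import re

PATTERN = r'(?<!\w)(f)\[([^\[]+?)\]|(\[[^\[]+?\])'


def get_repl(match):
    name, inner, other = match.groups()
    if other:
        # Innermost non-f brackets are hidden behind placeholders.
        return other.replace('[', '##1##').replace(']', '##2##')
    # Innermost f[...] is rewritten directly.
    return '{}.__getitem__({}, x=x)'.format(name, inner)


def trace_passes(string):
    """Return the intermediate string after each pass and the final result."""
    passes = []
    while re.search(PATTERN, string):
        string = re.sub(PATTERN, get_repl, string)
        passes.append(string)
    return passes, string.replace('##1##', '[').replace('##2##', ']')


passes, result = trace_passes("[f[f[more t3xt] + 3]]")
for number, state in enumerate(passes, start=1):
    print(number, state)
print(result)
```

For this input the loop runs three times: the first pass rewrites the innermost `f[more t3xt]`, the second rewrites the enclosing `f[...]`, and the third hides the outer list brackets so that no further match remains before they are restored.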

edited Oct 02 '19 at 09:43

answered Sep 18 '19 at 08:30


Thanks! This works as well for the cases I've tried and seems faster ```%timeit -n 3 f_expr.transformString(string1) # pyparsing 9.91 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 3 loops each) %timeit -n 3 place_by_getitem(string1) # regex The slowest run took 4.06 times longer than the fastest. This could mean that an intermediate result is being cached. 28.7 µs ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)``` – Sep 18 '19 at 11:32
I didn't understand what you mean, can you explain what is the problem with this solution? – Charif DZ Sep 18 '19 at 11:43
I'm using Regex it should be faster. – Charif DZ Sep 18 '19 at 11:46
On more testing, this method fails in cases with quotes `string3 = '''f["text([0,1,2,3])"]'''` `place_by_getitem(string3) # regex` `'f["text([0,1,2,3])"]'` `f_expr.transformString(string3) # pyparsing` `'f.__getitem__("text([0,1,2,3])", x=x)'` – Sep 18 '19 at 11:59
I think I found a solution for this problem. It has an advantage over the accepted answer: it accepts any valid ASCII identifier that starts with a letter (e.g. `f2`, `x`, `fx`), not just `f`, and it matches faster. Check my edits. – Charif DZ Sep 18 '19 at 13:21
I checked your edits. I only want to match `f[..]`, not `f2[...]` etc. More specifically, I want to match, in regex syntax: `(? – Sep 18 '19 at 13:43
So just : `pattern = '(f)\[([^\[]+?)\]|(\[[^\[]+?\])'` will do the job, Hope this is what you needed, I loved your question and I wanted to find a solution with Regex if there is, I think it should work for you – Charif DZ Sep 18 '19 at 13:47
Please add that to your answer (either as a replacement or at the end, for my specific case) and I will accept it. Thanks for spending time to find a creative solution! I tried it across my 500 examples and it yields the same result but is on average 150x faster than using pyparsing, likely benefiting from re's C optimizations – Sep 18 '19 at 13:56
Yes, I really wanted to know if it works for you. I'm working on my regex skills and your question was very challenging. Glad I helped you, thanks – Charif DZ Sep 18 '19 at 13:59

FYI I ended up using `pattern = '(? – Sep 18 '19 at 14:14
