1

I want to find and extract all the variables in a string that contains Python code. I only want to extract the variables (and variables with subscripts) but not function calls.

For example, from the following string:

code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'

I want to extract: foo, bar[1], baz[1:10:var1[2+1]], var1[2+1], qux[[1,2,int(var2)]], var2, bob[len("foobar")], var3[0]. Please note that some variables may be "nested". For example, from baz[1:10:var1[2+1]] I want to extract baz[1:10:var1[2+1]] and var1[2+1].

The first two ideas that come to mind are to use either a regex or an AST. I have tried both, but with no success.

When using a regex, in order to make things simpler, I thought it would be a good idea to first extract the "top level" variables, and then recursively the nested ones. Unfortunately, I can't even do that.

This is what I have so far:

import re

regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'
for match in re.finditer(regex, code):
    print(match)

Here is a demo: https://regex101.com/r/INPRdN/2

The other solution is to use an AST, extend ast.NodeVisitor, and implement the visit_Name and visit_Subscript methods. However, this doesn't work either because visit_Name is also called for functions.

I would appreciate it if someone could provide me with a solution (regex or AST) to this problem.

Thank you.

AstrOne
  • Ordinary regex definitely can't do this, because it's impossible to write a regex that matches all forms of `foo[[[[[[[[[ [0] ]]]]]]]`. Focusing on an AST-based solution seems like a good idea to me. – Kevin Oct 04 '19 at 13:26
  • *"`visit_Name` is also called for functions"* - function names *are* also variables. What's the goal here? – jonrsharpe Oct 04 '19 at 13:31
  • @Kevin I was thinking about that, so I thought that I could just match the first outer pair of brackets, and then keep applying the regex. For example, if I have this `foo[bar[baz[0]]] + 10`, I would like to extract `foo[bar[baz[0]]]`. Then, I can use some basic string manipulation and isolate the subscript: `bar[baz[0]]`. Then I can apply the regex again, and so on. But if you are saying this is not possible, then I guess I have to go with ASTs! – AstrOne Oct 04 '19 at 13:40
  • @jonrsharpe It is a bit hard to explain fully, but very briefly: I am writing software where the user can input her/his own Python expressions and assign them to a "parameter" of a mathematical model. These parameters can also appear in an expression as variables. So we end up with parameters that depend on other parameters. My aim is to parse all the Python expressions, extract all variables (parameters), and create a dependency tree with all the parameters. Using this information I can then perform topological sorting to evaluate the expressions in the correct order. – AstrOne Oct 04 '19 at 13:54
  • I wonder if you might be better off simply extracting every name from the expression, and distinguishing parameter names from function names by comparing each name against a whitelist of approved function names. When it comes to evaluating user input, it's usually better to restrict what they're allowed to call anyway. (Not that this is sufficient to completely safeguard your environment, mind you, but it helps) – Kevin Oct 04 '19 at 13:59
  • It is not worth it, see https://rextester.com/JXBL26340 – Wiktor Stribiżew Oct 04 '19 at 16:01
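
The whitelist idea and the dependency ordering described in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ALLOWED_FUNCTIONS`, `parameter_names` and `evaluation_order` are hypothetical names that do not appear in the thread, the whitelist contents are made up, and `graphlib` requires Python 3.9 or later.

```python
import ast
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical whitelist of functions users are allowed to call.
ALLOWED_FUNCTIONS = {"int", "len", "abs", "min", "max"}


def parameter_names(expression):
    """Return every name in the expression that is not a whitelisted function."""
    tree = ast.parse(expression, mode="eval")
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return names - ALLOWED_FUNCTIONS


def evaluation_order(parameters):
    """Order parameter names so each comes after the parameters it depends on."""
    # Map each parameter to the set of names its expression refers to.
    graph = {name: parameter_names(expr) for name, expr in parameters.items()}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


params = {"a": "b[0] + len(c)", "b": "[c, 2]", "c": "10"}
print(evaluation_order(params))
```

Note that this treats any non-whitelisted name as a parameter, so a typo in a function name would show up as a dependency rather than as an error.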

2 Answers

1

I find your question an interesting challenge, so here is some code that does what you want. Doing this with a regex alone is impossible because the expressions are nested, so this solution combines regexes with string manipulation to handle the nesting:

# -*- coding: utf-8 -*-
import re
RE_IDENTIFIER = r'\b[a-z]\w*\b(?!\s*[\[\("\'])'
RE_INDEX_ONLY = re.compile(r'(##)(\d+)(##)')
RE_INDEX = re.compile(r'##\d+##')


def extract_expression(string):
    """ extract all identifier and getitem expression in the given order."""

    def remove_brackets(text):
        # 1. handle `[...]` expression replace them with #{#...#}#
        # so we don't confuse them with word[...]
        pattern = '(?<!\w)(\s*)(\[)([^\[]+?)(\])'
        # keep extracting expression until there is no expression
        while re.search(pattern, text):
            text = re.sub(pattern, r'\1#{#\3#}#', string)
        return text

    def get_ordered_subexp(exp):
        """ get index of nested expression."""
        index = int(exp.replace('#', ''))
        subexp = RE_INDEX.findall(expressions[index])
        if not subexp:
            return exp
        return exp + ''.join(get_ordered_subexp(i) for i in subexp)

    def replace_expression(match):
        """ save the expression in the list, replace it with special key and it's index in the list."""
        match_exp = match.group(0)
        current_index = len(expressions)
        expressions.append(None)  # just to make sure the expression is inserted before it's inner identifier
        # if the expression contains identifier extract too.
        if re.search(RE_IDENTIFIER, match_exp) and '[' in match_exp:
            match_exp = re.sub(RE_IDENTIFIER, replace_expression, match_exp)
        expressions[current_index] = match_exp
        return '##{}##'.format(current_index)

    def fix_expression(match):
        """ replace the match by the corresponding expression using the index"""
        return expressions[int(match.group(2))]

    # list that will hold the extracted expressions
    expressions = []

    string = remove_brackets(string)

    # 2. extract all expressions and keep track of their place in the original code
    pattern = r'\w+\s*\[[^\[]+?\]|{}'.format(RE_IDENTIFIER)
    # keep extracting expressions until there are none left
    while re.search(pattern, string):
        # every expression that is extracted is replaced by a special key
        string = re.sub(pattern, replace_expression, string)
        # sometimes the brackets themselves contain a getitem expression,
        # so after extracting that expression we handle the brackets again
        string = remove_brackets(string)

    # 3. build the correct result with extracted expressions
    result = [None] * len(expressions)
    for index, exp in enumerate(expressions):
        # keep replacing special keys with the correct expression
        while RE_INDEX_ONLY.search(exp):
            exp = RE_INDEX_ONLY.sub(fix_expression, exp)
        # finally we don't forget about the brackets
        result[index] = exp.replace('#{#', '[').replace('#}#', ']')

    # 4. order the indexes that were extracted
    ordered_index = ''.join(get_ordered_subexp(exp) for exp in RE_INDEX.findall(string))
    # convert it to integer
    ordered_index = [int(index[1]) for index in RE_INDEX_ONLY.findall(ordered_index)]

    # 5. fix the order of expressions using the ordered indexes
    final_result = []
    for exp_index in ordered_index:
        final_result.append(result[exp_index])

    # for debug:
    # print('final string:', string)
    # print('expression :', expressions)
    # print('order_of_expresion: ', ordered_index)
    return final_result


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
code2 = 'baz[1:10:var1[2+1]]'
code3 = 'baz[[1]:10:var1[2+1]:[var3[3+1*x]]]'
print(extract_expression(code))
print(extract_expression(code2))
print(extract_expression(code3))

OUTPUT:

['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
['baz[1:10:var1[2+1]]', 'var1[2+1]']
['baz[[1]:10:var1[2+1]:[var3[3+1*x]]]', 'var1[2+1]', 'var3[3+1*x]', 'x']

I tested this code on some very complicated examples and it worked. Note that the order of extraction is the same as the one you wanted. I hope this is what you needed.

Charif DZ
  • Wow! I have to say I am impressed! Very interesting solution, and your comments really help to understand what is going on! Thank you my friend! – AstrOne Oct 06 '19 at 01:15
0

Regex is not a powerful enough tool to do this. If the depth of your nesting is bounded, there are hacky workarounds that would let you write a complicated regex to do what you are looking for, but I would not recommend it.
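
To illustrate what such a bounded-depth workaround might look like, here is a rough sketch that hard-codes three levels of bracket nesting. The names `NAME`, `DEPTH_3` and `top_level_subscripts` are made up for this example, and the pattern knows nothing about string literals, so a `]` inside quotes would confuse it.

```python
import re

NAME = r"[A-Za-z_]\w*"
# A name followed by brackets nested at most three levels deep (outer pair included).
# Each extra level of nesting needs one more copy of the inner alternative.
DEPTH_3 = NAME + r"\s*\[(?:[^\[\]]|\[(?:[^\[\]]|\[[^\[\]]*\])*\])*\]"


def top_level_subscripts(code):
    """Return subscripted names whose brackets nest no deeper than three levels."""
    return re.findall(DEPTH_3, code)


print(top_level_subscripts('foo[bar[baz[0]]] + 10'))  # ['foo[bar[baz[0]]]']
# One level deeper and the outermost expression is silently missed:
print(top_level_subscripts('a[b[c[d[0]]]]'))          # ['b[c[d[0]]]']
```

The second call shows why this is fragile: input that exceeds the hard-coded depth does not raise an error, it just produces a wrong answer.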

This question is asked a lot, and the linked response is famous for demonstrating the difficulty of what you are trying to do.

If you really must parse a string of code, an AST would technically work, but I am not aware of a library to help with this. You would be best off building a recursive function to do the parsing.
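
For what it is worth, the AST route can be sketched with Python's standard `ast` module. The following is a minimal sketch, not a definitive implementation: `VariableExtractor` and `extract_variables` are names invented here, and `ast.get_source_segment` assumes Python 3.8 or later. The idea is to override `visit_Call` so that a bare function name is skipped while its arguments are still visited.

```python
import ast


class VariableExtractor(ast.NodeVisitor):
    """Collect names and subscript expressions, skipping called function names."""

    def __init__(self, source):
        self.source = source
        self.found = []

    def visit_Call(self, node):
        # Skip a bare function name such as `func` in `func(...)`,
        # but still look inside anything more complex and inside the arguments.
        if not isinstance(node.func, ast.Name):
            self.visit(node.func)
        for arg in node.args:
            self.visit(arg)
        for keyword in node.keywords:
            self.visit(keyword.value)

    def visit_Name(self, node):
        self.found.append(node.id)

    def visit_Subscript(self, node):
        # Record the full source text of the subscript, e.g. `bar[1]`.
        self.found.append(ast.get_source_segment(self.source, node))
        # A plain name being indexed is already covered by the text above.
        if not isinstance(node.value, ast.Name):
            self.visit(node.value)
        # Descend into the index or slice to find nested variables.
        self.visit(node.slice)


def extract_variables(source):
    extractor = VariableExtractor(source)
    extractor.visit(ast.parse(source, mode="eval"))
    return extractor.found


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
print(extract_variables(code))
```

Because the parser handles bracket matching and string literals itself, this approach does not depend on the nesting depth.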