I am trying to tokenize an expression with the following rules:
The separators are '}}' and '{{'
The strings between separators should be kept intact, except for single spaces, which are discarded (could be done in the parser)
The separators can be nested, and their order should be preserved
Single occurrences of '{' and '}' should be kept untouched and not used as separators (see the last test).
There should be no empty strings in the result (could be done in the parser)
The two exceptions (marked in parentheses) can be handled by post-processing the (otherwise correct) result in the parser. The result will be fed to a recursive descent parser.
Here are a few attempts, none of which passes the unit tests I have included. The findall
version comes closest but still manages to drop some parts. I am not using re.split()
in the code below (it would keep the empty strings), though I tried it with no better luck. I am hoping a regex can work here so I can avoid scanning the string character by character in my code.
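For reference, this is the re.split() behavior I mean: with a capturing group the separators are kept, but empty strings appear wherever a separator touches the start or end of the string, or another separator:

```python
import re

# Splitting on the captured separators keeps them in the result,
# but also yields empty strings around adjacent/edge separators:
print(re.split(r'(\{\{|\}\})', '{{aa}}'))
# → ['', '{{', 'aa', '}}', '']
```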
import re

def tokenize_search(line):
    token_pat = re.compile(r'({{)([^{(?!{)]*|[^}(?!})]*)(}})')
    tokens = re.search(token_pat, line).groups()
    return list(tokens)
def tokenize_findall(line):
    token_pat = re.compile(r'({{)([^{(?!{)]*|[^}(?!})]*)(}})')
    tokens = re.findall(token_pat, line)
    return tokens
def check(a1, a2):
    print(a1 == a2)
def main():
    check(tokenize_search('{{}}'), ['{{', '}}'])
    check(tokenize_search('aaa {{}}'), ['{{', '}}'])
    check(tokenize_search('{{aaa}}'), ['{{', 'aaa', '}}'])
    check(tokenize_search('{{aa}} {{bbb}}'), ['{{', 'aa', '}}', '{{', 'bbb', '}}'])
    check(tokenize_search('{{aaa {{ bb }} }}'), ['{{', 'aaa ', '{{', ' bb ', '}}', '}}'])
    check(tokenize_search('{{aa {{ bbb {{ c }} }} }}'), ['{{', 'aa ', '{{', ' bbb ', '{{', ' c ', '}}', '}}', '}}'])
    check(tokenize_search('{{a{a}{{ b{b}b {{ c }} }} }}'), ['{{', 'a{a}', '{{', ' b{b}b ', '{{', ' c ', '}}', '}}', '}}'])

if __name__ == '__main__':
    main()
UPDATE
Thanks to Olivier for providing a solution that works. I am still hoping a regex solution could work if I could better understand regex lookarounds. If I use the tokenize_finditer
function below, it passes the tests; all it does is fill the gaps between matched
separators with the skipped text (except for spaces, which I could have post-processed to keep the code simpler). So my hope is that I could add an "or"
clause to the '({{)|(}})'
regex that says: "or match any run of characters that does not contain '{{' or '}}'". Unfortunately, I cannot manage to write this matcher. I have seen examples of regexes that can even do recursive matching, and since this is not recursive, it sounds all the more doable.
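For what it's worth, here is a sketch of that "or" clause using a negative lookahead in front of each character (a so-called tempered greedy token). This is only an attempt checked against the tests above, not a vetted solution; the final slice mimics how the tokenize_finditer version below drops text before the first '{{':

```python
import re

def tokenize_regex(line):
    # '{{', '}}', or a run of characters where the lookahead guarantees
    # that no '{{' or '}}' begins (a "tempered greedy token")
    token_pat = re.compile(r'\{\{|\}\}|(?:(?!\{\{|\}\}).)+')
    # drop whitespace-only tokens, per the rules
    tokens = [t for t in token_pat.findall(line) if not t.isspace()]
    # mimic tokenize_finditer: discard anything before the first '{{'
    first = tokens.index('{{') if '{{' in tokens else len(tokens)
    return tokens[first:]
```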
def tokenize_finditer(line):
    token_pat = re.compile(r'({{)|(}})')
    result = []
    if re.search(token_pat, line):
        # starting prev at len(line) means text before the first
        # separator is never appended (see the second test)
        prev = len(line)
        for match in re.finditer(token_pat, line):
            start, end = match.span()
            if start > prev:
                expr = line[prev:start]
                if not expr.isspace():
                    result.append(expr)
            prev = end
            result.append(match.group())
    return result
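To illustrate the span bookkeeping above: each match object reports exactly where a separator sits, so the in-between text is just a slice of the line from the previous match's end to the next match's start:

```python
import re

# Every separator match knows its position; the text between two
# consecutive matches is line[previous_end:next_start].
line = '{{ bb }}'
matches = [(m.group(), m.span()) for m in re.finditer(r'\{\{|\}\}', line)]
print(matches)
# → [('{{', (0, 2)), ('}}', (6, 8))]
```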