Tokenizing a string with a multi-character separator

Question

I am trying to tokenize an expression with the following rules:

The separators are '}}' and '{{'
The strings between separators should be kept intact (excluding single spaces, which are discarded; this could be done in the parser)
The separators can be embedded and the order should be kept
Single occurrences of '{' and '}' should be kept untouched and not used as separators (see last test).
There should be no empty strings in the result (could be done in the parser)

The couple of exceptions (marked in parentheses) can be handled by post-processing the (correct) result in the parser. The result will be fed to a recursive descent parser.

Here are a few attempts, none of which pass the unit tests I have included. The tokenize_findall function comes closest but still drops some parts. I am not using re.split() in the code below (it would keep the empty strings), but I tried it with no better luck. I am hoping a regex will work, so that my code does not have to scan the strings character by character.

import re

def tokenize_search(line):
    token_pat = re.compile(r'({{)([^{(?!{)]*|[^}(?!})]*)(}})')
    tokens = re.search(token_pat, line).groups()
    return list(tokens)

def tokenize_findall(line):
    token_pat = re.compile(r'({{)([^{(?!{)]*|[^}(?!})]*)(}})')
    tokens = re.findall(token_pat, line)
    return tokens

def check(a1, a2):
    print(a1 == a2)

def main():
    check(tokenize_search('{{}}'), ['{{', '}}'])
    check(tokenize_search('aaa {{}}'), ['{{', '}}'])
    check(tokenize_search('{{aaa}}'), ['{{', 'aaa', '}}'])
    check(tokenize_search('{{aa}} {{bbb}}'), ['{{', 'aa', '}}', '{{', 'bbb', '}}'])
    check(tokenize_search('{{aaa {{ bb }} }}'), ['{{', 'aaa ', '{{', ' bb ', '}}', '}}'])
    check(tokenize_search('{{aa {{ bbb {{ c }} }} }}'), ['{{', 'aa ', '{{', ' bbb ', '{{', ' c ', '}}', '}}', '}}'])
    check(tokenize_search('{{a{a}{{ b{b}b {{ c }} }} }}'), ['{{', 'a{a}', '{{', ' b{b}b ', '{{', ' c ', '}}', '}}', '}}'])

UPDATE

Thanks to Olivier for providing a solution that works. I am still hoping a regex solution could work if I could better understand regex lookarounds. If I use the tokenize_finditer method below, it passes the tests, and all it does is fill the skipped groups with what is in between (with the exception of the space, which I could have post-processed to make the code simpler). So my hope is that I could add an "or" clause to the '({{)|(}})' regex that says: "or get any run of characters that does not match '}}' or '{{'". Unfortunately, I have not succeeded in writing this matcher. I have seen examples of regexes that can even do recursive matching, and because this is not recursive, it sounds even more doable.

def tokenize_finditer(line):
    token_pat = re.compile(r'({{)|(}})')
    result = []
    if re.search(token_pat, line):
        prev = len(line)
        for match in re.finditer(token_pat, line):
            start, end = match.span()
            if start > prev:
                expr = line[prev:start]
                if not expr.isspace():
                    result.append(expr)
            prev = end
            result.append(match.group())

    return result

You should be writing a rudimentary parser to handle this. Regex probably isn't the best tool to be using here, I think. — Tim Biegeleisen, Jun 24 '18 at 03:38
Your second rule is contradicting your expected result for test 4. The space is not kept intact. — Olivier Melançon, Jun 24 '18 at 04:01
Thanks for catching this, this is a typo. I updated the rule. — Laurent, Jun 24 '18 at 06:33

score 1 · Accepted Answer · answered Jun 24 '18 at 04:09

This problem is not quite a parentheses matching problem, but it is close enough for me to recommend not trying to solve it with a regex.

Since what you want is to partition your string with the given separators, we can write a solution based on a partition function, with some tweaking to fit all the rules.

import re

def partition(s, sep):
    tokens = s.split(sep)

    # Intersperse the separator between found tokens
    partition = [sep] * (2 * len(tokens) - 1)
    partition[::2] = tokens

    # We remove empty and whitespace-only tokens
    return [tk for tk in partition if tk and not tk.isspace()]


def tokenize_search(line):
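    # Only keep what is inside brackets; fall back to '' when there are none
    # (calling .group() on a failed search would raise AttributeError)
    match = re.search(r'{{.*}}', line)
    line = match.group() if match else ''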

    return [tk for sub in partition(line, '{{') for tk in partition(sub, '}}')]

The above code passes all tests. You will need to feed that result to a parser to check parentheses matching.
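
As a rough illustration of that last step, here is a minimal sketch of a balance check over the token list. The name is_balanced is hypothetical and not part of the code above; it only counts nesting depth and is not a full recursive descent parser.

```python
def is_balanced(tokens):
    """Return True if every '{{' is closed by a later '}}' and vice versa."""
    depth = 0
    for tk in tokens:
        if tk == '{{':
            depth += 1
        elif tk == '}}':
            depth -= 1
            # A closer with no matching opener means the input is invalid
            if depth < 0:
                return False
    # Any opener still pending at the end is unbalanced as well
    return depth == 0


print(is_balanced(['{{', 'aaa ', '{{', ' bb ', '}}', '}}']))  # True
print(is_balanced(['aaa', '}}']))                              # False
```

Tokens other than the two separators are ignored, so single '{' and '}' characters inside a token do not affect the count.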

Nice touch using the third parameter of a slice. I gave up on figuring out the regex because it does not seem possible without using recursion. — Laurent, Jun 25 '18 at 05:16

wp78de · Answer 2 · 2018-06-25T15:29:01.027

I believe Olivier Melançon's partitioning approach is the way to go. However, there is still some use for a regex, e.g. checking whether the pattern in question is properly balanced, or extracting the balanced part from a larger string (as indicated by the 2nd example).

Doing so requires a recursive regex like this:

{{((?>(?:(?!{{|}}).)++|(?R))*+)}}

Demo

Since Python's re module does not support regex recursion, you will need to rely on the alternative regex module to make use of it.

To further process the match result, you would need to look at the inner part in $1 and go deeper one level at a time, e.g. \w+|{{((?>(?:(?!(?:{{|}})).)++|(?R))*+)}} but that's cumbersome.

score 0 · Answer 3 · answered Aug 30 '18 at 09:11

just got your message on twitter :) I know I'm 2 months late to the party, but I have a couple of new ideas in case you're interested.

I looked over the examples and noticed that you could pretty much get away with matching and capturing all "{{" or "}}" or "a token that is in the middle of a {{ }} pair". Fortunately, this is rather simple to express:

/({{|}}|(?:(?!{{|}})[^ ])+(?!({{(?2)*?}}|(?:(?!{{|}}).)*)*$))/g

On regex101 using your examples

"in the middle of a {{ }} pair" is the only tricky part. For this, I used that negative lookahead to make sure we were NOT in a position that is followed by a balanced number of (potentially nested) {{ }} pairs, then the end of the string. This would, for well-balanced input, ensure all matched tokens are inside {{ }} pairs.
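
For readers limited to Python's built-in re module, which has no recursion, the alternation part can still be sketched. This is an approximation under stated assumptions, not the pattern above: the name tokenize_tempered is hypothetical, the recursive "inside a pair" lookahead is replaced by a depth counter in plain Python, and spaces inside a token are kept (as in the question's tests) rather than excluded with [^ ].

```python
import re

# '{{' or '}}' or a run of characters in which no position starts '{{' or '}}'
TOKEN_PAT = re.compile(r'\{\{|\}\}|(?:(?!\{\{|\}\}).)+', re.DOTALL)


def tokenize_tempered(line):
    tokens = []
    depth = 0  # stands in for the recursive lookahead of the PCRE pattern
    for match in TOKEN_PAT.finditer(line):
        tk = match.group()
        if tk == '{{':
            depth += 1
            tokens.append(tk)
        elif tk == '}}':
            depth -= 1
            tokens.append(tk)
        elif depth > 0 and not tk.isspace():
            # Keep text only inside a pair; drop whitespace-only runs
            tokens.append(tk)
    return tokens


print(tokenize_tempered('{{a{a}{{ b{b}b {{ c }} }} }}'))
```

Single '{' and '}' characters stay inside their token because the lookahead only rejects positions where a full '{{' or '}}' begins. Like the simpler pattern above, this does not validate the input, so 'aaa}}' still yields a stray '}}'.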

Now, you ask, what about the "well-balanced input" part? If the input is invalid, then an example such as "aaa}}" would yield ["aaa", "}}"] as a result. Not ideal. You could validate the input separately; or, if you wish to turn this into an untameable monster, then you can go for something like this:

/(?:^(?!({{(?1)*?}}|(?:(?!{{|}}).)*)*+$)(*COMMIT)(*F))?({{|}}|(?:(?!{{|}})[^ ])+(?!({{(?3)*?}}|(?:(?!{{|}}).)*)*+$))/g

Unleashed on regex101

This is really just for show. I agree with the other suggestions recommending a parser or some other more maintainable tool. But if you've seen my blog then you understand I have an affinity for these monsters :)
