Am I reinventing the wheel with this token replacing code?

Question

I have a use case where a line of text contains nesting tokens (like { and }), and I want to transform certain substrings nested at a specific depth.

Example, capitalize the word moo at depth 1:

moo [moo [moo moo]] moo ->

moo [MOO [moo moo]] moo

Achieved by:

replaceNestedTokens(input, 1, "[", "]", "moo", String::toUpperCase);

Or a real-world example: wrap every "--option" that is not already colored in the cyan color sequence:

@|blue --ignoreLog|@ works, but --ignoreOutput silences everything. ->

@|blue --ignoreLog|@ works, but @|cyan --ignoreOutput|@ silences everything.

Achieved by:

replaceNestedTokens(input, 0, "@|", "|@", "--\\w*", s -> format("@|cyan %s|@", s));

I have implemented this logic, and though I feel pretty good about it (except probably for performance), I also feel I have reinvented the wheel. Here's how I implemented it:

set pos and counter to zero

while (input line not fully consumed) {
    take the remaining line

    if the open token is matched, add to output, increase counter and advance pos accordingly
    else if the close token is matched, add to output, decrease counter and advance pos accordingly
    else if the counter matches provided depth and given regex matches, invoke replacer function and advance pos accordingly
    else just record the next character and advance pos by 1
}

Here's the actual implementation:

public static String replaceNestedTokens(String lineWithTokens, int nestingDepth, String tokenOpen, String tokenClose, String tokenRegexToReplace, Function<String, String> tokenReplacer) {
    final Pattern startsWithOpen = compile(quote(tokenOpen));
    final Pattern startsWithClose = compile(quote(tokenClose));
    final Pattern startsWithTokenToReplace = compile(format("(?<token>%s)", tokenRegexToReplace));

    final StringBuilder lineWithTokensReplaced = new StringBuilder();

    int countOpenTokens = 0;
    int pos = 0;

    while (pos < lineWithTokens.length()) {
        final String remainingLine = lineWithTokens.substring(pos);

        if (startsWithOpen.matcher(remainingLine).lookingAt()) {
            countOpenTokens++;
            lineWithTokensReplaced.append(tokenOpen);
            pos += tokenOpen.length();
        } else if (startsWithClose.matcher(remainingLine).lookingAt()) {
            countOpenTokens--;
            lineWithTokensReplaced.append(tokenClose);
            pos += tokenClose.length();
        } else if (countOpenTokens == nestingDepth) {
            Matcher startsWithTokenMatcher = startsWithTokenToReplace.matcher(remainingLine);
            if (startsWithTokenMatcher.lookingAt()) {
                String matchedToken = startsWithTokenMatcher.group("token");
                lineWithTokensReplaced.append(tokenReplacer.apply(matchedToken));
                pos += matchedToken.length();
            } else {
                lineWithTokensReplaced.append(lineWithTokens.charAt(pos++));
            }
        } else {
            lineWithTokensReplaced.append(lineWithTokens.charAt(pos++));
        }
        assumeTrue(countOpenTokens >= 0, "Unbalanced token sets: closed token without open token\n\t" + lineWithTokens);
    }
    assumeTrue(countOpenTokens == 0, "Unbalanced token sets: open token without closed token\n\t" + lineWithTokens);
    return lineWithTokensReplaced.toString();
}

I couldn't make it work with a regex solution like this or this (or with Scanner), but I feel I'm reinventing the wheel and could solve this with out-of-the-box (vanilla Java) classes and less code. Also, I'm pretty sure this is a performance nightmare, given all the inline pattern/matcher instances and substrings.
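For what it's worth, a regex alone can cover the depth-0 case when the colored spans never nest, because an alternation can consume each existing span whole and only rewrite the bare options. The sketch below assumes Java 9+ (for Matcher.replaceAll with a function) and hard-codes the "@|" / "|@" delimiters and the cyan wrapper from the example above; the class and method names are made up for illustration. It does not generalize to arbitrary depths, since java.util.regex has no recursive patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlatTokenReplacer {

    // Alternative 1: an already-colored span "@|...|@" (no capture group).
    // Alternative 2: a bare option such as --ignoreOutput (capture group 1).
    private static final Pattern SPAN_OR_OPTION =
            Pattern.compile("@\\|.*?\\|@|(--\\w+)");

    // Hypothetical helper: colors bare options cyan, leaves existing spans untouched.
    public static String colorOptions(String line) {
        return SPAN_OR_OPTION.matcher(line).replaceAll(m ->
                // quoteReplacement keeps '$' and '\' in the result literal
                Matcher.quoteReplacement(m.group(1) == null
                        ? m.group()                          // existing span: copy as-is
                        : "@|cyan " + m.group(1) + "|@"));   // bare option: wrap it
    }
}
```

Calling colorOptions on the example line should leave "@|blue --ignoreLog|@" alone and wrap only "--ignoreOutput".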

Suggestions?
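On the performance concern specifically, one possible sketch keeps the same loop but avoids the per-character substring and matcher allocations: String.startsWith(prefix, offset) checks the open/close tokens in place, and a single Matcher is reused by moving its region. This is not a drop-in for the code above. It throws IllegalArgumentException instead of calling assumeTrue, uses group() rather than a named group, and skips zero-length regex matches so that a pattern like "--\\w*" cannot stall the loop; the class name is an assumption for illustration.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTokenReplacer {

    public static String replaceNestedTokens(String line, int depth, String open, String close,
                                             String regex, Function<String, String> replacer) {
        // One matcher for the whole line; region() moves its window instead of substring().
        Matcher token = Pattern.compile(regex).matcher(line);
        StringBuilder out = new StringBuilder(line.length());
        int openCount = 0;
        int pos = 0;

        while (pos < line.length()) {
            if (line.startsWith(open, pos)) {
                openCount++;
                out.append(open);
                pos += open.length();
            } else if (line.startsWith(close, pos)) {
                if (--openCount < 0) {
                    throw new IllegalArgumentException(
                            "Unbalanced token sets: closed token without open token\n\t" + line);
                }
                out.append(close);
                pos += close.length();
            } else if (openCount == depth
                    && token.region(pos, line.length()).lookingAt()
                    && token.end() > pos) {          // ignore empty matches
                out.append(replacer.apply(token.group()));
                pos = token.end();
            } else {
                out.append(line.charAt(pos++));     // ordinary character: copy through
            }
        }
        if (openCount != 0) {
            throw new IllegalArgumentException(
                    "Unbalanced token sets: open token without closed token\n\t" + line);
        }
        return out.toString();
    }
}
```

With the default (anchoring, non-transparent) region bounds, the matcher sees the text from pos onward much as it saw the substring before, so both examples from the question should produce the same output as the original implementation.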

jordiburgos · Answer 1 · 2018-08-25T09:35:54.100

0

You could use a parser generator such as ANTLR to write a grammar that describes your language or syntax, then use a listener or visitor to interpret the tokens.

A sample grammar might look like this (based on what I can infer from your code):

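grammar Expr;
prog:   (expr NEWLINE)* ;
expr:   STRING '[' expr ']'
    |   '@|' expr '|@'
    |   '--ignoreLog' expr
    |   '--ignoreOutput' expr
    |   STRING
    ;
// Character sets are only allowed in lexer rules, so STRING is a token.
STRING  : [a-zA-Z0-9]+ ;
NEWLINE : [\r\n]+ ;
WS      : [ \t]+ -> skip ;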

edited Aug 25 '18 at 09:35

answered Aug 25 '18 at 09:15

jordiburgos


That's not enough detail for an answer. Example, please. – rustyx Aug 25 '18 at 09:26
