Tokenizing a String but ignoring delimiters within quotes

Question

I wish to have the following String

!cmd 45 90 "An argument" Another AndAnother "Another one in quotes"

to become an array of the following

{ "!cmd", "45", "90", "An argument", "Another", "AndAnother", "Another one in quotes" }

I tried

new StringTokenizer(cmd, "\"")

but this would return "Another" and "AndAnother" as "Another AndAnother", which is not the desired effect.

Thanks.

EDIT: I have changed the example yet again, this time I believe it explains the situation best although it is no different than the second example.

http://java.sun.com/developer/technicalArticles/Programming/stringtokenizer/ — Bertrand Marron, Jul 29 '10 at 19:40

score 62 · Accepted Answer · edited May 23 '17 at 12:09

It's much easier to use a java.util.regex.Matcher and do a find() rather than any kind of split in this kind of scenario.

That is, instead of defining the pattern for the delimiter between the tokens, you define the pattern for the tokens themselves.

Here's an example:

    String text = "1 2 \"333 4\" 55 6    \"77\" 8 999";
    // 1 2 "333 4" 55 6    "77" 8 999

    String regex = "\"([^\"]*)\"|(\\S+)";

    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find()) {
        if (m.group(1) != null) {
            System.out.println("Quoted [" + m.group(1) + "]");
        } else {
            System.out.println("Plain [" + m.group(2) + "]");
        }
    }

The above prints (as seen on ideone.com):

Plain [1]
Plain [2]
Quoted [333 4]
Plain [55]
Plain [6]
Quoted [77]
Plain [8]
Plain [999]

The pattern is essentially:

"([^"]*)"|(\S+)
 \_____/  \___/
    1       2

There are 2 alternates:

The first alternate matches the opening double quote, a sequence of anything but double quote (captured in group 1), then the closing double quote
The second alternate matches any sequence of non-whitespace characters, captured in group 2
The order of the alternates matters in this pattern

Note that this does not handle escaped double quotes within quoted segments. If you need to do this, then the pattern becomes more complicated, but the Matcher solution still works.
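
As a hedged illustration of the "more complicated" pattern mentioned above, here is a minimal sketch that lets a backslash escape the next character inside a quoted segment. The class name EscapedQuoteTokenizer, the method name tokenize, and the choice of backslash as the escape character are assumptions made for this example; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedQuoteTokenizer {
    // Group 1: quoted body, where a backslash plus any character counts as one unit.
    // Group 2: a plain run of non-whitespace characters.
    // Regex as the engine sees it: "((?:\\.|[^"\\])*)"|(\S+)
    private static final Pattern TOKEN =
            Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"|(\\S+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // Drop the backslash from each escape pair: \" becomes ", \\ becomes \
                tokens.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                tokens.add(m.group(2));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The input text is: say "he said \"hi\"" now
        System.out.println(tokenize("say \"he said \\\"hi\\\"\" now"));
        // prints [say, he said "hi", now]
    }
}
```

As with the simpler pattern, the quoted alternate has to come first; an unterminated quote falls through to the second alternate and is returned as a plain token.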

References

regular-expressions.info/Brackets for Grouping and Capturing, Alternation with Vertical Bar, Character Class, Repetition with Star and Plus

Appendix

Note that StringTokenizer is a legacy class. It's recommended to use java.util.Scanner or String.split, or of course java.util.regex.Matcher for most flexibility.

Related questions

Difference between a Deprecated and Legacy API?
Scanner vs. StringTokenizer vs. String.Split
Validating input using java.util.Scanner - has many examples

We have a winner! :) Thanks so much, works perfectly. Thanks for everyone else's input too, I just find this most suitable. :) — Ploo, Jul 29 '10 at 20:29
If I'd have asked the question that would be the answer I would accept. Thanks for this, I knew there must be some better than old-fashioned way! — Rekin, Jul 29 '10 at 20:30
@Ploo: An example of another pattern that may be of interest: `"([^"]*)"|'([^']*)'|([^"' ]+)` http://www.rubular.com/r/cjzuqus7oa : i.e. double quoted (group 1) or single quoted (group 2) or just plain (group 3). No quote escaping. — polygenelubricants, Jul 29 '10 at 20:58
TBH I was just browsing recent questions, but I had to +1 for such a well written and comprehensive answer! — Adam, Jul 29 '10 at 21:06
@polygenelubricants if you don't mind stripping quotes ad-hoc, you can simplify the regex down to one capture group like so: `("[^"]*"|'[^']*'|[^"' ]+)` -- then you can decide whether or not you want to keep or nix captured quotes, depending on your requirements. — kayleeFrye_onDeck, Sep 21 '18 at 02:01
Great answer. Readers beware: This does not work for the use case ` !cmd var="value string"`, ie if the quote is allowed to start inside a string. OP did not have this condition. — Sid Datta, Nov 16 '18 at 23:36
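
The three-group pattern from the comments above (double quoted, single quoted, or plain) could be used roughly as follows. This is only a sketch: the class name MultiQuoteTokenizer and the method name tokenize are made up for illustration, and, as the comment says, there is no quote escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiQuoteTokenizer {
    // Group 1: double-quoted body, group 2: single-quoted body, group 3: plain token.
    // Regex as the engine sees it: "([^"]*)"|'([^']*)'|([^"' ]+)
    private static final Pattern TOKEN =
            Pattern.compile("\"([^\"]*)\"|'([^']*)'|([^\"' ]+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Exactly one of the three groups participates in each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    tokens.add(m.group(g));
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("!cmd 'it is' \"An argument\" plain"));
        // prints [!cmd, it is, An argument, plain]
    }
}
```

Note that the plain alternate only treats a space as a separator, so a tab would end up inside a token; swapping the space in the character class for \s would be one way to change that.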

score 7 · Answer 2 · answered Jul 29 '10 at 20:11

Do it the old fashioned way. Make a function that looks at each character in a for loop. If the character is a space, take everything up to that (excluding the space) and add it as an entry to the array. Note the position, and do the same again, adding that next part to the array after a space. When a double quote is encountered, mark a boolean named 'inQuote' as true, and ignore spaces when inQuote is true. When you hit quotes when inQuote is true, flag it as false and go back to breaking things up when a space is encountered. You can then extend this as necessary to support escape chars, etc.

Could this be done with a regex? I don't know, I guess. But the whole function would take less to write than this reply did.
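
The loop described above might look something like this. It is a minimal sketch rather than the answerer's own code: the names QuoteAwareSplitter, split and sawQuote are invented for the example, it keeps an empty quoted pair ("") as an empty token, and it does not handle escape characters.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplitter {
    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;
        boolean sawQuote = false; // lets an empty "" still count as a token

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                // Toggle quote mode; the quote character itself is not kept.
                inQuote = !inQuote;
                sawQuote = true;
            } else if (c == ' ' && !inQuote) {
                // A space outside quotes ends the current token, if there is one.
                if (current.length() > 0 || sawQuote) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                sawQuote = false;
            } else {
                // Everything else, including spaces inside quotes, is part of the token.
                current.append(c);
            }
        }
        // Flush the last token, which has no trailing space to trigger it.
        if (current.length() > 0 || sawQuote) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split(
                "!cmd 45 90 \"An argument\" Another AndAnother \"Another one in quotes\""));
        // prints [!cmd, 45, 90, An argument, Another, AndAnother, Another one in quotes]
    }
}
```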

GrandmasterB

Ditto. I've written simple parsers like this a bazillion times. Sure, you can find some open source library to do it, or come up with a clever regex, but then you've added more complexity. Why not solve simple problems with simple tools? When I need to put in a screw, I use a screwdriver, I don't search for a solar-powered fully automated screw-putter-inner robot. – Jay Jul 29 '10 at 20:43
FYI For people reading this in the future. I created an FSM for this in another response here. – deadfire19 Aug 27 '16 at 15:29

mike rodent · Answer 3 · 2018-06-22T18:40:28.533

Apache Commons to the rescue!

import org.apache.commons.text.StringTokenizer
import org.apache.commons.text.matcher.StringMatcher
import org.apache.commons.text.matcher.StringMatcherFactory
@Grab(group='org.apache.commons', module='commons-text', version='1.3')

def str = /is this   'completely "impossible"' or """slightly"" impossible" to parse?/

StringTokenizer st = new StringTokenizer( str )
StringMatcher sm = StringMatcherFactory.INSTANCE.quoteMatcher()
st.setQuoteMatcher( sm )

println st.tokenList

Output:

[is, this, completely "impossible", or, "slightly" impossible, to, parse?]

A few notes:

this is written in Groovy... it is in fact a Groovy script. The @Grab line gives a clue to the sort of dependency line you need (e.g. in build.gradle) ... or just include the .jar in your classpath of course
StringTokenizer here is NOT java.util.StringTokenizer ... as the import line shows it is org.apache.commons.text.StringTokenizer
the def str = ... line is a way to produce a String in Groovy which contains both single quotes and double quotes without having to go in for escaping
StringMatcherFactory in apache commons-text 1.3 can be found here: as you can see, the INSTANCE can provide you with a bunch of different StringMatchers. You could even roll your own: but you'd need to examine the StringMatcherFactory source code to see how it's done.
YES! You can not only include the "other type of quote" and it is correctly interpreted as not being a token boundary ... but you can even escape the actual quote which is being used to turn off tokenising, by doubling the quote within the tokenisation-protected bit of the String! Try implementing that with a few lines of code ... or rather don't!

PS why is it better to use Apache Commons than any other solution? Apart from the fact that there is no point re-inventing the wheel, I can think of at least two reasons:

The Apache engineers can be counted on to have anticipated all the gotchas and developed robust, comprehensively tested, reliable code
It means you don't clutter up your beautiful code with stoopid utility methods - you just have a nice, clean bit of code which does exactly what it says on the tin, leaving you to get on with the, um, interesting stuff...

PPS Nothing obliges you to look on the Apache code as mysterious "black boxes". The source is open, and written in usually perfectly "accessible" Java. Consequently you are free to examine how things are done to your heart's content. It's often quite instructive to do so.

later

Sufficiently intrigued by ArtB's question I had a look at the source:

in StringMatcherFactory.java we see:

private static final AbstractStringMatcher.CharSetMatcher QUOTE_MATCHER = new AbstractStringMatcher.CharSetMatcher(
            "'\"".toCharArray());

... rather dull ...

so that leads one to look at StringTokenizer.java:

public StringTokenizer setQuoteMatcher(final StringMatcher quote) {
        if (quote != null) {
            this.quoteMatcher = quote;
        }
        return this;
}

OK... and then, in the same java file:

private int readWithQuotes(final char[] srcChars ...

which contains the comment:

// If we've found a quote character, see if it's followed by a second quote. If so, then we need to actually put the quote character into the token rather than end the token.

... I can't be bothered to follow the clues any further. You have a choice: either your "hackish" solution, where you systematically pre-process your strings before submitting them for tokenising, turning |\\\"|s into |\"\"|s... (i.e. where you replace each |\"| with |""|)...
Or... you examine org.apache.commons.text.StringTokenizer.java to figure out how to tweak the code. It's a small file. I don't think it would be that difficult. Then you compile, essentially making a fork of the Apache code.

I don't think it can be configured. But if you found a code-tweak solution which made sense you might submit it to Apache and then it might be accepted for the next iteration of the code, and your name would figure at least in the "features request" part of Apache: this could be a form of kleos through which you achieve programming immortality...
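
For the pre-processing route, a minimal sketch of the conversion step in plain Java might look like this. The class and method names are hypothetical, and the tokenizer call itself is left out; the sketch only rewrites each backslash-escaped quote into the doubled-quote form that the source comment quoted above describes.

```java
public class QuoteEscapeConverter {
    // Rewrites each backslash-escaped quote (\") as a doubled quote ("").
    // String.replace works on literal text, so no regex escaping is needed.
    // Limitation: an escaped backslash followed by a quote (\\") is not treated specially.
    static String toDoubledQuotes(String input) {
        return input.replace("\\\"", "\"\"");
    }

    public static void main(String[] args) {
        // The input text is: say "he said \"hi\""
        System.out.println(toDoubledQuotes("say \"he said \\\"hi\\\"\""));
        // prints say "he said ""hi"""
    }
}
```

The converted string would then be handed to the tokenizer in place of the original.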

How would one go about changing the syntax from `"""` to `\"` for quotation marks within strings? Currently my hack is running `s.replaceAll( "\\\\\"", "\"\"\"" )` but seems it should be configurable some how... — Sled, Jun 18 '18 at 15:37

score 2 · Answer 4 · edited Feb 19 '17 at 13:50

In an old fashioned way:

public static String[] split(String str) {
    str += " "; // To detect last token when not quoted...
    ArrayList<String> strings = new ArrayList<String>();
    boolean inQuote = false;
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        if (c == '"' || c == ' ' && !inQuote) {
            if (c == '"')
                inQuote = !inQuote;
            if (!inQuote && sb.length() > 0) {
                strings.add(sb.toString());
                sb.delete(0, sb.length());
            }
        } else
            sb.append(c);
    }
    return strings.toArray(new String[strings.size()]);
}

I assume that nested quotes are illegal, and also that empty tokens can be omitted.

score 1 · Answer 5 · answered Apr 02 '19 at 07:00

I recently faced a similar question, where command line arguments had to be split without breaking up quoted parts.

One possible case:

"/opt/jboss-eap/bin/jboss-cli.sh --connect --controller=localhost:9990 -c command=\"deploy /app/jboss-eap-7.1/standalone/updates/sample.war --force\""

This had to be split to

/opt/jboss-eap/bin/jboss-cli.sh
--connect
--controller=localhost:9990
-c
command="deploy /app/jboss-eap-7.1/standalone/updates/sample.war --force"

Just to add to @polygenelubricants's answer: allowing any non-whitespace characters before and after the quoted part of the pattern handles this case.

"\\S*\"([^\"]*)\"\\S*|(\\S+)"

Example:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tokenizer {

    public static void main(String[] args){

        String a = "/opt/jboss-eap/bin/jboss-cli.sh --connect --controller=localhost:9990 -c command=\"deploy " +
                "/app/jboss-eap-7.1/standalone/updates/sample.war --force\"";
        String b = "Hello \"Stack Overflow\"";
        String c = "cmd=\"abcd efgh ijkl mnop\" \"apple\" banana mango";
        String d = "abcd ef=\"ghij klmn\"op qrst";
        String e = "1 2 \"333 4\" 55 6    \"77\" 8 999";

        List<String> matchList = new ArrayList<String>();
        Pattern regex = Pattern.compile("\\S*\"([^\"]*)\"\\S*|(\\S+)");
        Matcher regexMatcher = regex.matcher(a);
        while (regexMatcher.find()) {
            matchList.add(regexMatcher.group());
        }
        System.out.println("matchList="+matchList);
    }
}

Output:

matchList=[/opt/jboss-eap/bin/jboss-cli.sh, --connect, --controller=localhost:9990, -c, command="deploy /app/jboss-eap-7.1/standalone/updates/sample.war --force"]

score 0 · Answer 6 · answered Jul 29 '10 at 19:39

The example you have here would just have to be split by the double quote character.

Nikolaos

for his example that would work, but that wouldn't solve this scenario: one two three "four five six" seven eight nine "ten" – Andrew Garrison Jul 29 '10 at 19:40

score 0 · Answer 7 · answered Aug 27 '16 at 15:27

This is an old question, however this was my solution as a finite state machine.

Efficient, predictable and no fancy tricks.

100% coverage on tests.

Drag and drop into your code.

/**
 * Splits a command on whitespaces. Preserves whitespace in quotes. Trims excess whitespace between chunks. Supports quote
 * escape within quotes. Failed escape will preserve escape char.
 *
 * @return List of split commands
 */
static List<String> splitCommand(String inputString) {
    List<String> matchList = new LinkedList<>();
    LinkedList<Character> charList = inputString.chars()
            .mapToObj(i -> (char) i)
            .collect(Collectors.toCollection(LinkedList::new));

    // Finite-State Automaton for parsing.

    CommandSplitterState state = CommandSplitterState.BeginningChunk;
    LinkedList<Character> chunkBuffer = new LinkedList<>();

    for (Character currentChar : charList) {
        switch (state) {
            case BeginningChunk:
                switch (currentChar) {
                    case '"':
                        state = CommandSplitterState.ParsingQuote;
                        break;
                    case ' ':
                        break;
                    default:
                        state = CommandSplitterState.ParsingWord;
                        chunkBuffer.add(currentChar);
                }
                break;
            case ParsingWord:
                switch (currentChar) {
                    case ' ':
                        state = CommandSplitterState.BeginningChunk;
                        String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
                        matchList.add(newWord);
                        chunkBuffer = new LinkedList<>();
                        break;
                    default:
                        chunkBuffer.add(currentChar);
                }
                break;
            case ParsingQuote:
                switch (currentChar) {
                    case '"':
                        state = CommandSplitterState.BeginningChunk;
                        String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
                        matchList.add(newWord);
                        chunkBuffer = new LinkedList<>();
                        break;
                    case '\\':
                        state = CommandSplitterState.EscapeChar;
                        break;
                    default:
                        chunkBuffer.add(currentChar);
                }
                break;
            case EscapeChar:
                switch (currentChar) {
                    case '"': // Intentional fall through
                    case '\\':
                        state = CommandSplitterState.ParsingQuote;
                        chunkBuffer.add(currentChar);
                        break;
                    default:
                        state = CommandSplitterState.ParsingQuote;
                        chunkBuffer.add('\\');
                        chunkBuffer.add(currentChar);
                }
        }
    }

    if (state != CommandSplitterState.BeginningChunk) {
        String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
        matchList.add(newWord);
    }
    return matchList;
}

private enum CommandSplitterState {
    BeginningChunk, ParsingWord, ParsingQuote, EscapeChar
}

score 0 · Answer 8 · answered Jun 18 '18 at 18:39

Another old school way is :

public static void main(String[] args) {

    String text = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
    String[] splits = text.split(" ");
    List<String> list = new ArrayList<>();
    String token = null; // non-null while inside a quoted run
    for (String s : splits) {

        if (token == null && s.length() > 1 && s.startsWith("\"") && s.endsWith("\"")) {
            // A single quoted word such as "ten" opens and closes in the same piece.
            list.add(s);
        } else if (token == null && s.startsWith("\"")) {
            token = s;
        } else if (token != null && s.endsWith("\"")) {
            token = token + " " + s;
            list.add(token);
            token = null;
        } else if (token != null) {
            token = token + " " + s;
        } else {
            list.add(s);
        }
    }
    System.out.println(list);
}

Output : - [One, two, "three four", five, "six seven eight", nine, "ten"]

score 0 · Answer 9 · answered Apr 19 '20 at 05:39

private static void findWords(String str) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        if (c != ' ' && c != '"') {
            sb.append(c);
            continue;
        }
        // A space or a quote ends the current plain word, if there is one.
        if (sb.length() > 0) {
            System.out.println(sb.toString());
            sb = new StringBuilder();
        }
        if (c == '"') {
            i++; // skip the opening quote
            // Copy everything up to the closing quote, spaces included.
            while (i < str.length() && str.charAt(i) != '"') {
                sb.append(str.charAt(i));
                i++;
            }
            System.out.println(sb.toString());
            sb = new StringBuilder();
        }
    }
    // Print the last word when the input does not end with a space or a quote.
    if (sb.length() > 0) {
        System.out.println(sb.toString());
    }
}

score 0 · Answer 10 · answered Nov 29 '20 at 07:42

This is what I myself use for splitting arguments in command line and things like that.

It's easily adjustable for multiple delimiters and quotes, it can process quotes within the words (like al' 'pha), it supports escaping (quotes as well as spaces) and it's really lenient.

public final class StringUtilities {
    private static final List<Character> WORD_DELIMITERS = Arrays.asList(' ', '\t');
    private static final List<Character> QUOTE_CHARACTERS = Arrays.asList('"', '\'');
    private static final char ESCAPE_CHARACTER = '\\';

    private StringUtilities() {

    }

    public static String[] splitWords(String string) {
        StringBuilder wordBuilder = new StringBuilder();
        List<String> words = new ArrayList<>();
        char quote = 0;

        for (int i = 0; i < string.length(); i++) {
            char c = string.charAt(i);

            if (c == ESCAPE_CHARACTER && i + 1 < string.length()) {
                wordBuilder.append(string.charAt(++i));
            } else if (WORD_DELIMITERS.contains(c) && quote == 0) {
                words.add(wordBuilder.toString());
                wordBuilder.setLength(0);
            } else if (quote == 0 && QUOTE_CHARACTERS.contains(c)) {
                quote = c;
            } else if (quote == c) {
                quote = 0;
            } else {
                wordBuilder.append(c);
            }
        }

        if (wordBuilder.length() > 0) {
            words.add(wordBuilder.toString());
        }

        return words.toArray(new String[0]);
    }
}

danyim · Answer 11 · 2010-07-29T20:09:10.527

Try this:

String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String strArr[] = str.split("\"|\\s");

It's kind of tricky because you need to escape the double quotes. This regular expression should tokenize the string using either a whitespace (\s) or a double quote.

You should use String's split method because it accepts regular expressions, whereas the constructor argument for delimiter in StringTokenizer doesn't. At the end of what I provided above, you can just add the following:

String s = "";
for(String k : strArr) {
     s += k;
}
StringTokenizer strTok = new StringTokenizer(s);

edited Jul 29 '10 at 20:09

answered Jul 29 '10 at 19:41

Yes, this answer is incorrect. I am working on a new solution. This is an interesting problem. – danyim Jul 29 '10 at 19:58

score -1 · Answer 12 · answered Jul 29 '10 at 19:52

try this:

String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String[] strings = str.split("[ ]?\"[ ]?");

smp7d

It returns "One two" instead of "One" and "two". – Ploo Jul 29 '10 at 20:03

score -1 · Answer 13 · answered Jul 29 '10 at 20:07

I don't know the context of what you're trying to do, but it looks like you're trying to parse command line arguments. In general, this is pretty tricky with all the escaping issues; if this is your goal I'd personally look at something like JCommander.

Kiersten Arnold

This does not provide an answer to the post. It may be a comment but it shouldn't be an answer. – Aurasphere Apr 15 '17 at 11:02
