
My program reads a line from a file. This line contains comma-separated text like:

123,test,444,"don't split, this",more test,1

I would like the result of a split to be this:

123
test
444
"don't split, this"
more test
1

If I use String.split(","), I get this:

123
test
444
"don't split
 this"
more test
1

In other words, the comma in the substring "don't split, this" is not a separator. How do I deal with this?

Dale K
Jakob Mathiasen

5 Answers


You can try out this regex:

str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

This splits the string on every comma that is followed by an even number of double quotes. In other words, it splits on commas outside the double quotes. This works provided the quotes in your string are balanced.
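As a quick check of the behaviour described above, here is a minimal sketch that runs the same pattern on the sample line from the question. The class and method names are only illustrative, and passing a limit of -1 (to keep trailing empty fields) is an assumption of this sketch rather than part of the original answer.

```java
import java.util.regex.Pattern;

class QuoteAwareSplitDemo {
    // A comma followed by an even number of double quotes up to the end of the input.
    private static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    // Assumes balanced quotes; the quotes themselves stay in the returned fields.
    static String[] splitLine(String line) {
        // A negative limit keeps trailing empty fields, e.g. "a,b," gives three fields.
        return COMMA_OUTSIDE_QUOTES.split(line, -1);
    }

    public static void main(String[] args) {
        String line = "123,test,444,\"don't split, this\",more test,1";
        for (String field : splitLine(line)) {
            System.out.println(field);
        }
    }
}
```

Compiling the pattern once also avoids recompiling it on every call, which may matter if you split many lines of a file.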

Explanation:

,           // Split on comma
(?=         // Followed by
   (?:      // Start a non-capture group
     [^"]*  // 0 or more non-quote characters
     "      // 1 quote
     [^"]*  // 0 or more non-quote characters
     "      // 1 quote
   )*       // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
   [^"]*    // Finally 0 or more non-quotes
   $        // Till the end  (This is necessary, else every comma will satisfy the condition)
)

You can even write it like this in your code by adding the (?x) modifier to your regex. The modifier makes the engine ignore whitespace in the pattern, so a regex broken across multiple lines becomes easier to read:

String[] arr = str.split("(?x)   " + 
                     ",          " +   // Split on comma
                     "(?=        " +   // Followed by
                     "  (?:      " +   // Start a non-capture group
                     "    [^\"]* " +   // 0 or more non-quote characters
                     "    \"     " +   // 1 quote
                     "    [^\"]* " +   // 0 or more non-quote characters
                     "    \"     " +   // 1 quote
                     "  )*       " +   // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
                     "  [^\"]*   " +   // Finally 0 or more non-quotes
                     "  $        " +   // Till the end  (This is necessary, else every comma will satisfy the condition)
                     ")          "     // End look-ahead
                         );
Rohit Jain
  • This answer is still valuable after all these years! – Cheeso Oct 12 '16 at 19:22
  • Got my program to work with your explanation. Thanks! Now, is there a way you can add newlines to this too? \n and \r? – Henry Lee Feb 01 '17 at 21:36
  • Hi my string is like this: \"don't split, this\" instead (it has those backslashes in front of ". How to modify the regex for that? – GeneCode Oct 31 '17 at 06:53
  • still the best answer in 2018. if you are using kotlin : str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)".toRegex()) . Remember to add toRegex(). – SajithK May 09 '18 at 11:19
  • You are amazing :) – Iharob Al Asimi Aug 05 '18 at 19:20
  • Hi Rohit, I followed your solution where I had two delimiters, and/or. and used following regex : (\s+and\s+|\s+or\s+)(?=(?:[^\"]*"[^\"]*\")*[^\"]*$). It works great for most of the usecases but fails for input : 'brand == "Kellogg\'\'s" or country == \'UnitedStates and " India\''. Could you please help me? I am very new to regex. – Harsh Bafna Aug 24 '18 at 10:35
  • This does not work for nontrivial cases, e.g. if the text contains quotes itself you'd need to escape, like `\"` – Felk Mar 14 '19 at 12:27
  • Sometimes you'll just have to step back.... Thanks!!!! Weeks I try to separate by comma not between parenthesis... – Mohicane Feb 13 '20 at 10:29

Why Split when you can Match?

Resurrecting this question because for some reason, the easy solution wasn't mentioned. Here is our beautifully compact regex:

"[^"]*"|[^,]+

This will match all the desired fragments (see demo).

Explanation

  • With "[^"]*", we match complete "double-quoted strings"
  • or |
  • with [^,]+, we match one or more characters that are not a comma.

A possible refinement is to improve the string side of the alternation to allow the quoted strings to include escaped quotes.
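One way the refinement could look in Java is sketched below. It assumes quotes inside a quoted string are escaped with a backslash (\"), which is not the only convention (CSV files often double the quote instead), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchFieldsDemo {
    // Quoted alternative: a quote, then any mix of ordinary characters and
    // backslash-escaped characters, then the closing quote.
    // Unquoted alternative: one or more characters that are not a comma.
    private static final Pattern FIELD =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"|[^,]+");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The second field contains escaped quotes and a comma.
        fields("a,\"b \\\"c\\\", d\",e").forEach(System.out::println);
    }
}
```

Note that, like the original pattern, this skips empty fields, because [^,]+ needs at least one character.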

zx81

You can do this very easily without a complex regular expression:

  1. Split on the character ". You get a list of Strings
  2. Process each string in the list: Split every string that is on an even position in the List (starting indexing with zero) on "," (you get a list inside a list), leave every odd positioned string alone (directly putting it in a list inside the list).
  3. Join the list of lists, so you get only a list.

If you want to handle escaping of '"' inside quoted text, you have to adapt the algorithm a little (rejoining parts that were split incorrectly, or replacing the plain split with a simple regexp), but the basic structure stays the same.

So basically it is something like this:

public class SplitTest {
    public static void main(String[] args) {
        final String splitMe="123,test,444,\"don't split, this\",more test,1";
        final String[] splitByQuote=splitMe.split("\"");
        final String[][] splitByComma=new String[splitByQuote.length][];
        for(int i=0;i<splitByQuote.length;i++) {
            String part=splitByQuote[i];
            if (i % 2 == 0){
               splitByComma[i]=part.split(",");
            }else{
                splitByComma[i]=new String[1];
                splitByComma[i][0]=part;
            }
        }
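        for (String[] parts : splitByComma) {
            for (String part : parts) {
                // ",more test,1".split(",") starts with an empty string; skip it.
                // Note: this also drops fields that are genuinely empty.
                if (!part.isEmpty()) {
                    System.out.println(part);
                }
            }
        }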
    }
}

This will be much cleaner with lambdas, promised!
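For what it's worth, a stream-based version of the same three steps might look like the sketch below. The class and the helper expand are hypothetical names, and filtering out empty strings is an assumption made here so that a comma next to a quote does not produce a blank field; it also means genuinely empty unquoted fields are dropped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class SplitByQuoteStreams {
    // Parts at odd positions were inside quotes and are kept whole;
    // parts at even positions were outside quotes and are split on commas.
    private static Stream<String> expand(String part, boolean quoted) {
        if (quoted) {
            return Stream.of(part);
        }
        // Drop the empty strings produced by commas adjacent to a quote.
        return Arrays.stream(part.split(",")).filter(s -> !s.isEmpty());
    }

    static List<String> split(String line) {
        String[] byQuote = line.split("\"");          // step 1: split on the quote
        return IntStream.range(0, byQuote.length)
                .boxed()
                .flatMap(i -> expand(byQuote[i], i % 2 == 1)) // steps 2 and 3
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        split("123,test,444,\"don't split, this\",more test,1")
                .forEach(System.out::println);
    }
}
```

Unlike the regex answers, this approach removes the surrounding quotes from the quoted field.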

stefan.schwetschke

Building upon @zx81's answer, because the matching idea is really nice, I've added the Java 9 results() call, which returns a Stream. Since the OP wanted to use split, I've collected the matches into a String[], as split does.

Be careful if you have spaces after your comma separators (a, b, "c,d"). In that case the pattern below no longer recognizes the quoted field and needs to be changed.

Jshell demo

$ jshell
-> String so = "123,test,444,\"don't split, this\",more test,1";
|  Added variable so of type String with initial value "123,test,444,"don't split, this",more test,1"

-> Pattern.compile("\"[^\"]*\"|[^,]+").matcher(so).results();
|  Expression value is: java.util.stream.ReferencePipeline$Head@2038ae61
|    assigned to temporary variable $68 of type java.util.stream.Stream<MatchResult>

-> $68.map(MatchResult::group).toArray(String[]::new);
|  Expression value is: [Ljava.lang.String;@6b09bb57
|    assigned to temporary variable $69 of type String[]

-> Arrays.stream($69).forEach(System.out::println);
123
test
444
"don't split, this"
more test
1

Code

String so = "123,test,444,\"don't split, this\",more test,1";
Pattern.compile("\"[^\"]*\"|[^,]+")
    .matcher(so)
    .results()
    .map(MatchResult::group)
    .toArray(String[]::new);

Explanation

  1. Regex "[^"]" matches: a quote, one character that is not a quote, a quote.
  2. Regex "[^"]*" matches: a quote, zero or more characters that are not a quote, a quote.
  3. That regex needs to go first to "win", otherwise matching anything but a comma 1 or more times - that is: [^,]+ - would "win".
  4. results() requires Java 9 or higher.
  5. It returns Stream<MatchResult>, which I map using group() call and collect to array of Strings. Parameterless toArray() call would return Object[].
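For the spaces-after-commas caveat mentioned earlier, one possible adjustment is sketched here. It is only an assumption about how the pattern could be changed: the unquoted alternative is required to start with a character that is neither a comma nor whitespace, so the matcher skips the space and finds the opening quote. The class and method names are illustrative, and results() still requires Java 9 or higher.

```java
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class SpacedFieldsDemo {
    // Quoted string, or an unquoted run that starts with a non-space, non-comma character.
    private static final Pattern FIELD =
            Pattern.compile("\"[^\"]*\"|[^,\\s][^,]*");

    static String[] fields(String line) {
        return FIELD.matcher(line)
                .results()                 // Java 9+
                .map(MatchResult::group)
                .map(String::trim)         // remove trailing spaces of unquoted fields
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        for (String field : fields("a, b, \"c,d\"")) {
            System.out.println(field);
        }
    }
}
```

Fields that consist only of whitespace are skipped by this pattern, in the same way that empty fields are skipped by the original one.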

Please see the code snippet below. It only covers the happy path (well-formed input), so change it according to your requirements.

public static String[] splitWithEscape(final String str, char split,
        char escapeCharacter) {
    final List<String> list = new LinkedList<String>();

    char[] cArr = str.toCharArray();

    boolean isEscape = false;
    StringBuilder sb = new StringBuilder();

    for (char c : cArr) {
        if (isEscape && c != escapeCharacter) {
            sb.append(c);
        } else if (c != split && c != escapeCharacter) {
            sb.append(c);
        } else if (c == escapeCharacter) {
            if (!isEscape) {
                isEscape = true;
                if (sb.length() > 0) {
                    list.add(sb.toString());
                    sb = new StringBuilder();
                }
            } else {
                isEscape = false;
            }

        } else if (c == split) {
            list.add(sb.toString());
            sb = new StringBuilder();
        }
    }

    if (sb.length() > 0) {
        list.add(sb.toString());
    }

    String[] strArr = new String[list.size()];

    return list.toArray(strArr);
}
Abhijith Nagarajan