Java: splitting a comma-separated string but ignoring commas in quotes

Question

I have a string vaguely like this:

foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"

that I want to split by commas -- but I need to ignore commas in quotes. How can I do this? Seems like a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I guess I meant libraries that are already part of the JDK or already part of commonly used libraries like Apache Commons.)

the above string should split into:

foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"

note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure

score 464 · Accepted Answer · edited Nov 26 '16 at 05:24

Try:

public class Main { 
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

Output:

> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"

In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.

Or, a bit friendlier for the eyes:

public class Main { 
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

        String otherThanQuote = " [^\"] ";
        String quotedString = String.format(" \" %s* \" ", otherThanQuote);
        String regex = String.format("(?x) "+ // enable comments, ignore white spaces
                ",                         "+ // match a comma
                "(?=                       "+ // start positive look ahead
                "  (?:                     "+ //   start non-capturing group 1
                "    %s*                   "+ //     match 'otherThanQuote' zero or more times
                "    %s                    "+ //     match 'quotedString'
                "  )*                      "+ //   end group 1 and repeat it zero or more times
                "  %s*                     "+ //   match 'otherThanQuote'
                "  $                       "+ // match the end of the string
                ")                         ", // stop positive look ahead
                otherThanQuote, quotedString, otherThanQuote);

        String[] tokens = line.split(regex, -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

which produces the same as the first example.
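To make the "even number of quotes ahead" rule concrete, here is a minimal sketch that applies the same test by hand instead of through a lookahead. It is not part of the original answer: the class and method names (`EvenQuoteSplit`, `splitOnEvenQuotes`) are made up for illustration, and the inner loop is deliberately naive, rescanning the rest of the string for every comma just as the regex does.

```java
import java.util.ArrayList;
import java.util.List;

class EvenQuoteSplit {
    // Hypothetical helper: split at a comma only if an even number
    // (including zero) of double quotes follows it.
    static List<String> splitOnEvenQuotes(String line) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) != ',') {
                continue;
            }
            int quotesAhead = 0;
            for (int j = i + 1; j < line.length(); j++) {
                if (line.charAt(j) == '"') {
                    quotesAhead++;
                }
            }
            if (quotesAhead % 2 == 0) {      // comma is outside any quoted run
                tokens.add(line.substring(start, i));
                start = i + 1;
            }
        }
        tokens.add(line.substring(start));   // last field, possibly empty
        return tokens;
    }

    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        for (String t : splitOnEvenQuotes(line)) {
            System.out.println("> " + t);
        }
    }
}
```

The comma inside `"baz,blurb"` has three quotes after it, so it is skipped; every other comma has four, two, or zero.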

EDIT

As mentioned by @MikeFHay in the comments:

I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split()), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))
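The "empty matches being trimmed" remark refers to the limit argument of `String#split`. As a small sketch using only the JDK (the class name `SplitLimitDemo` and the sample input are assumptions made for this illustration), the same regex gives different array lengths depending on whether `-1` is passed:

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Same regex as in the answer above.
    static final String REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Thin wrapper so both variants can be compared side by side.
    static String[] split(String line, int limit) {
        return line.split(REGEX, limit);
    }

    public static void main(String[] args) {
        String line = "a,\"b,c\",,";   // two empty fields at the end
        // Limit 0 is what split(regex) uses: trailing empty strings are dropped.
        System.out.println(Arrays.toString(split(line, 0)));   // [a, "b,c"]
        // A negative limit keeps them.
        System.out.println(Arrays.toString(split(line, -1)));  // [a, "b,c", , ]
    }
}
```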

edited Nov 26 '16 at 05:24 by Urban Vagabond · answered Nov 18 '09 at 16:10 by Bart Kiers

According to RFC 4180: Sec 2.6: "Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes." Sec 2.7: "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote" So, if `String line = "equals: =,\"quote: \"\"\",\"comma: ,\""`, all you need to do is strip off the extraneous double quote characters. – Paul Hanbury Nov 18 '09 at 17:41
@Bart: my point being that your solution still works, even with embedded quotes – Paul Hanbury Nov 18 '09 at 17:43
@Bart Kiers: Looks like it fails when you have a comma inside the string value: e.g. "op","ID","script","Mike,s","Content-Length" – Misha Narinsky May 05 '12 at 22:53
@MichaelNarinsky, no, it will not split on the comma in `"Mike,s"`. If you run the example code I posted, you'd see that it doesn't. – Bart Kiers May 07 '12 at 06:34
@BartKiers I seem to be having trouble if my last column is empty e.g. `"val1, "val, 2","` I would expect to get an array of size three, but it ignores the last blank column and the size is 2. – Alex Apr 23 '14 at 14:43
@Alex, yeah, the comma *is* matched, but the empty match is not in the result. Add `-1` to the split method param: `line.split(regex, -1)`. See: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#split(java.lang.String,%20int) – Bart Kiers Apr 23 '14 at 14:55
I ran into a test case, which fails: `String line = "\"EXCEPTION\",\"ER-0124\",\"10/09/2013 10:01:37\",814867,-1,\"SYSTEM\",\"ERROR\",\"[[F1:b4vvCaFBsG2Spk5cMCfiTt2dF2hO+f5ORcKWcBuLFZgY1EJg; ]] Message not available\",\"[[F1:b4vvCaFBsG2Spk5cMCfiTt2dF2hO+f5ORcKWcBuLFZgY1EJg; ]] TC6342-4: Test LOG_TAGGED with F1! tag, size as int\",3,\"ER-0121\",\"AB-9876,\"ER-0123\",\"-\",\"-\",\"-\",\"-\",\"-\",\"-\",\"-\",\"-\",\"-\",\"-\",\"-\",1001,\"9.9.9.d\",\"ERXA_log_id_test [22388]\",\"ERXA_log_id_test.c\",\"?.?\",2301,\"801FFFFF\",\";;;;;;;;;\",\"Automatic\",\"00000-00000000-m0000-00\",\"END\"";` – Gerrit Brouwer Mar 16 '15 at 14:11
The first token contains all text until `...F1! tag` – Gerrit Brouwer Mar 16 '15 at 14:17
@GerritBrouwer, that is because `"AB-9876` does not have a closing quote. And it's also a perfect example of why to choose a proper CSV parser in production ;) – Bart Kiers Mar 16 '15 at 14:22
@BartKiers: thanx for the quick response! My confidence in your regex is completely restored :-) It turned out that I was sent a hand-crafted test file, rather than a machine-generated output file. – Gerrit Brouwer Mar 16 '15 at 14:44
:) no problem @GerritBrouwer – Bart Kiers Mar 16 '15 at 14:50
Thanks buddy: you inspire me. – Jonathan Mar 17 '15 at 13:03
I strongly suggest adding -1 to the split method to catch empty strings. `line.split(regex, -1)` – Peter Mar 28 '15 at 10:34
@BartKiers: Shouldn't we be adding ``-1`` to ``line.split()`` in the first example, too? – user1438038 May 29 '15 at 10:30
Works great! I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split), so I did `Splitter.on(Pattern.compile(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"))`. – MikeFHay Jan 04 '16 at 12:23
You don't have to catch the group, you can do like this: `Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))` – Wilt May 26 '16 at 12:15
**WARNING!!!! This regexp is slow!!!** It has O(N^2) behavior in that the lookahead at each comma looks all the way to the end of the string. Using this regexp caused a 4x slowdown in large Spark jobs (e.g. 45 minutes -> 3 hours). The faster alternative is something like `findAllIn("(?s)(?:\".*?\"|[^\",]*)*")` in combination with a postprocessing step to skip the first (always-empty) field following each non-empty field. – Urban Vagabond Nov 26 '16 at 05:30
@BartKiers - I have received a file that contains data of type --- _foo,bar,**""c;qual"="baz,blurb"",d;junk="quux,syzygy"**_ --- where the above regex does not ignore the comma between baz and blurb. I am able to figure out why that is happening, but do not know how to solve it. Would you be able to help? Thanks – adbdkb May 08 '19 at 23:26
Feel free to create a question of your own @adbdkb. These comment boxes are not well suited for Q&A’s. Your input also seems to be unintentionally formatted. – Bart Kiers May 09 '19 at 05:16
@BartKiers I have a question posted at [this link](https://stackoverflow.com/questions/56028130/regex-for-parsing-csv-that-contains-json-in-a-column). I am receiving the data from another system and do not have much control over how they are formatting it. That question also has how the data is received. I have asked if they can change the column separator from comma to tilde ~ or bar | – adbdkb May 09 '19 at 11:00

score 51 · Answer 2 · edited Oct 26 '20 at 17:26

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:

String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
    if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
    else if (input.charAt(current) == ',' && !inQuotes) {
        result.add(input.substring(start, current));
        start = current + 1;
    }
}
result.add(input.substring(start));

If you don't care about preserving the commas inside the quotes you could simplify this approach (no handling of start index, no last character special case) by replacing your commas in quotes by something else and then split at commas:

String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
    char currentChar = builder.charAt(currentIndex);
    if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
    if (currentChar == ',' && inQuotes) {
        builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
    }
}
List<String> result = Arrays.asList(builder.toString().split(","));

Quotes should be removed from parsed tokens, after string is parsed. — Sudhir N, Aug 04 '16 at 09:40
Found via google, nice algorithm bro, simple and easy to adapt, agree. stateful stuff should be done via parser, regex is a mess. — Rudolf Schmidt, Jun 01 '17 at 08:12
Keep in mind that if a comma is the last character it will be in the last item's String value. — Gabe Gates, Jan 03 '19 at 20:44

score 21 · Answer 3 · edited May 23 '17 at 11:47

http://sourceforge.net/projects/javacsv/

https://github.com/pupi1985/JavaCSV-Reloaded (fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)

http://opencsv.sourceforge.net/

CSV API for Java

Can you recommend a Java library for reading (and possibly writing) CSV files?

Java lib or app to convert CSV to XML file?

edited May 23 '17 at 11:47 by Community · answered Nov 18 '09 at 16:11 by Jonathan Feinberg

Good call recognizing that the OP was parsing a CSV file. An external library is extremely appropriate for this task. – Stefan Kendall Nov 18 '09 at 16:14
But the string is a CSV string; you should be able to use a CSV api on that string directly. – Michael Brewer-Davis Nov 18 '09 at 16:29
yes, but this task is simple enough, and a much smaller part of a larger application, that I don't feel like pulling in another external library. – Jason S Nov 18 '09 at 16:33
not necessarily... my skills are often adequate, but they benefit from being honed. – Jason S Nov 18 '09 at 18:10

score 12 · Answer 4 · answered Jun 06 '14 at 09:08

I would not advise the regex answer from Bart; I find a parsing solution better in this particular case (as Fabian proposed). I tried the regex solution and my own parsing implementation, and found that:

Parsing is much faster than splitting with a lookahead regex: ~20 times faster for short strings, ~40 times faster for long strings.
The regex fails to find the empty string after the last comma. That was not in the original question, though; it was my requirement.

My solution and test below.

String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;

start = System.nanoTime(); 
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
    switch (c) {
    case ',':
        if (inQuotes) {
            b.append(c);
        } else {
            tokensList.add(b.toString());
            b = new StringBuilder();
        }
        break;
    case '\"':
        inQuotes = !inQuotes;
        // intentional fall-through: the quote itself is appended below
    default:
        b.append(c);
        break;
    }
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;

System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);

Of course you are free to change the switch to else-ifs in this snippet if you feel uncomfortable with its ugliness. Note the deliberate lack of a break after the quote case: it falls through to default so that the quote itself is appended. StringBuilder was chosen instead of StringBuffer by design to increase speed, since thread safety is irrelevant here.
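For readers who prefer the else-if form mentioned above, a minimal sketch might look like the following. The class and method names (`QuoteAwareSplitter`, `split`) are illustrative rather than taken from the answer, and timing code is left out so that only the state machine remains.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitter {
    // Same single-pass state machine as the switch version, written with if/else.
    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == ',' && !inQuotes) {
                // separator outside quotes: finish the current token
                tokens.add(current.toString());
                current.setLength(0);
            } else {
                if (c == '"') {
                    inQuotes = !inQuotes;   // toggle state, but keep the quote
                }
                current.append(c);
            }
        }
        tokens.add(current.toString());     // last token, empty if input ends with ','
        return tokens;
    }

    public static void main(String[] args) {
        String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
        System.out.println(split(tested));
    }
}
```

Because the final token is always added, the trailing comma in `tested` yields an empty fifth element.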

Interesting point regarding time splitting vs parsing. However, statement #2 is inaccurate. If you add a `-1` to the split method in Bart's answer, you will catch empty strings (including empty strings after the last comma): `line.split(regex, -1)` — Peter, Mar 28 '15 at 10:39
+1 because it is a better solution to the problem for which I was searching for a solution: parsing a complex HTTP POST body parameter string — varontron, Apr 30 '17 at 02:36

score 2 · Answer 5 · answered Nov 18 '09 at 16:15

You're in that annoying boundary area where regexps almost won't do (as has been pointed out by Bart, escaping the quotes would make life hard), and yet a full-blown parser seems like overkill.

If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

Jason S · Answer 6 · 2009-11-18T16:47:49.157

I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):

final static private Pattern splitSearchPattern = Pattern.compile("[\",]"); 
private List<String> splitByCommasNotInQuotes(String s) {
    if (s == null)
        return Collections.emptyList();

    List<String> list = new ArrayList<String>();
    Matcher m = splitSearchPattern.matcher(s);
    int pos = 0;
    boolean quoteMode = false;
    while (m.find())
    {
        String sep = m.group();
        if ("\"".equals(sep))
        {
            quoteMode = !quoteMode;
        }
        else if (!quoteMode && ",".equals(sep))
        {
            int toPos = m.start(); 
            list.add(s.substring(pos, toPos));
            pos = m.end();
        }
    }
    if (pos < s.length())
        list.add(s.substring(pos));
    return list;
}

(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)
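One possible take on that exercise, offered as a sketch rather than as the poster's own solution: let the search pattern also match a backslash followed by any character, and skip such matches so an escaped quote never toggles quote mode. The class name `EscapedQuoteSplit` and the choice of backslash as the only escape character are assumptions; like the original, it drops a trailing empty field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteSplit {
    // Matches a backslash-escaped character, a double quote, or a comma.
    private static final Pattern SEARCH = Pattern.compile("(?s)\\\\.|[\",]");

    static List<String> splitByCommasNotInQuotes(String s) {
        List<String> list = new ArrayList<>();
        if (s == null) {
            return list;
        }
        Matcher m = SEARCH.matcher(s);
        int pos = 0;
        boolean quoteMode = false;
        while (m.find()) {
            String sep = m.group();
            if (sep.charAt(0) == '\\') {
                continue;                    // escaped character: ignore entirely
            }
            if ("\"".equals(sep)) {
                quoteMode = !quoteMode;
            } else if (!quoteMode && ",".equals(sep)) {
                list.add(s.substring(pos, m.start()));
                pos = m.end();
            }
        }
        if (pos < s.length()) {              // same trailing behavior as the original
            list.add(s.substring(pos));
        }
        return list;
    }

    public static void main(String[] args) {
        // The quoted field contains an escaped quote and a comma.
        System.out.println(splitByCommasNotInQuotes("a,\"b\\\"c,d\",e"));
    }
}
```

The escapes are left in the returned tokens; unescaping would be a separate post-processing step.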

score 1 · Answer 7 · answered Nov 18 '09 at 16:14

Try a lookaround like (?!\"),(?!\"). This should match , that are not surrounded by ".

answered Nov 18 '09 at 16:14 by Matthew Sowders

Pretty sure that would break for a list like: "foo",bar,"baz" – Angelo Genovese May 30 '13 at 22:57
I think you meant `(? – Alan Moore May 19 '14 at 15:04
It seems to work perfectly for me. IMHO this is a better answer since it's shorter and more easily comprehensible – Ordiel Jan 12 '17 at 18:25

Holger · Answer 8 · 2019-09-19T14:35:31.040

The simplest approach is not to match delimiters, i.e. commas, with a complex additional logic to match what is actually intended (the data which might be quoted strings), just to exclude false delimiters, but rather match the intended data in the first place.

The pattern consists of two alternatives, a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \\G anchor:

Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");

The pattern also contains two capturing groups to get either, the quoted string’s content or the plain content.

Then, with Java 9, we can get an array as

String[] a = p.matcher(input).results()
    .map(m -> m.group(m.start(1)<0? 2: 1))
    .toArray(String[]::new);

whereas older Java versions need a loop like

for(Matcher m = p.matcher(input); m.find(); ) {
    String token = m.group(m.start(1)<0? 2: 1);
    System.out.println("found: "+token);
}

Adding the items to a List or an array is left as an exercise to the reader.

For Java 8, you can use the results() implementation of this answer, to do it like the Java 9 solution.

For mixed content with embedded strings, like in the question, you can simply use

Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");

But then, the strings are kept in their quoted form.
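A sketch of how the mixed-content pattern could be used to fill a List, run against the question's input. Two things here are additions and not part of the answer: the class and method names (`MatchFieldsDemo`, `fields`), and the `expectField` guard. The guard is there because the pattern can also match the empty string at the very end of the input, which would otherwise add a spurious empty token after the last field; the loop therefore continues only when the previous match consumed a delimiter comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchFieldsDemo {
    // Pattern from the answer: group 1 is the field, quoted parts kept as-is.
    private static final Pattern FIELD =
            Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");

    static List<String> fields(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        boolean expectField = true;
        while (expectField && m.find()) {
            tokens.add(m.group(1));
            // The whole match is longer than group 1 only if a comma was consumed,
            // in which case another (possibly empty) field follows.
            expectField = m.end() > m.end(1);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        for (String token : fields(input)) {
            System.out.println("found: " + token);
        }
    }
}
```

With this guard, an input ending in a comma still produces a final empty field, matching what `split(regex, -1)` would return.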

score 0 · Answer 9 · answered Nov 18 '09 at 16:13

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and map that grouping to a map of string,string.

After you split on comma, replace all mapped identifiers with the original string values.
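The two steps described above could be sketched roughly as follows. Everything specific here is an assumption made for illustration: the class name `PlaceholderSplit`, the use of a list index instead of a string-to-string map, and the private-use character `\uE000` as a marker that is presumed never to occur in real input. An opening quote with no matching close is left untouched.

```java
import java.util.ArrayList;
import java.util.List;

class PlaceholderSplit {
    // Assumed-unused marker character that delimits a placeholder index.
    private static final char MARK = '\uE000';

    static List<String> split(String input) {
        List<String> quoted = new ArrayList<>();   // saved quote groupings
        StringBuilder masked = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int close = (c == '"') ? input.indexOf('"', i + 1) : -1;
            if (close >= 0) {
                // Step 1: swap the whole quoted group for MARK + index + MARK.
                masked.append(MARK).append(quoted.size()).append(MARK);
                quoted.add(input.substring(i, close + 1));
                i = close + 1;
            } else {
                masked.append(c);
                i++;
            }
        }
        // Step 2: plain split, then put the original groups back.
        List<String> tokens = new ArrayList<>();
        for (String part : masked.toString().split(",", -1)) {
            tokens.add(restore(part, quoted));
        }
        return tokens;
    }

    private static String restore(String part, List<String> quoted) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < part.length()) {
            char c = part.charAt(i);
            if (c == MARK) {
                int end = part.indexOf(MARK, i + 1);
                out.append(quoted.get(Integer.parseInt(part.substring(i + 1, end))));
                i = end + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(split("foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\""));
    }
}
```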

answered Nov 18 '09 at 16:13 by Stefan Kendall

and how to find quote groupings without crazy regexS? – Kai Huppmann Nov 18 '09 at 16:22
For each character, if character is quote, find next quote and replace with grouping. If no next quote, done. – Stefan Kendall Nov 18 '09 at 16:48

score 0 · Answer 10 · answered Apr 12 '20 at 09:29

what about a one-liner using String.split()?

String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );

answered Apr 12 '20 at 09:29 by Kaplan

score -1 · Answer 11 · edited Nov 29 '11 at 20:23

I would do something like this:

boolean foundQuote = false;

if (currentString.charAt(currentStringIndex) == '"')
{
   foundQuote = true;
}

if (foundQuote)
{
   // do nothing
}
else
{
   String[] split = currentString.split(",");
}

edited Nov 29 '11 at 20:23 by Jason Plank · answered Nov 18 '09 at 16:11 by Woot4Moo
