I have a CSV file that I would like to parse with the String split() method. I want each element of the array returned by split() to be one of the comma-separated values in the file. However, some values contain commas of their own.

Fortunately, these other commas are escaped as '\,'.

I am having trouble getting the right regex for the split() method. I want to split on commas that are not preceded by the escape character.

My current code is:

String[] columns = new String[CONST];
columns = someString.split("*^\\,*");

To me this says: split by a comma but the character before the comma must not be the escape character. Any number of characters before or after the comma are allowed.

  1. How do I get the correct regular expression?
lmcanavals
CodeKingPlusPlus
  • Why are you writing `= new String[CONST]`, only to replace it immediately afterward? – SLaks Jan 21 '13 at 03:54
  • Also, `CONST` is an _extremely_ poor variable name; it gives no indication of what the variable represents. – SLaks Jan 21 '13 at 03:55
  • [opencsv](http://opencsv.sourceforge.net/) is a very simple CSV (comma-separated values) parser library for Java. It has configurable separator and quote characters (or you can use the sensible defaults). – Paul Vargas Jan 21 '13 at 03:59
  • @Brian Roach I do not know regular expressions very well. – CodeKingPlusPlus Jan 21 '13 at 04:00
  • @BrianRoach - Maybe the regex was written by a rabbit? –  Jan 21 '13 at 04:01
  • Regular expressions *cannot* be trivially used with CSV (common forms of which include [optional] quoting and/or delimiter escapes). As others have said, use a library. –  Jan 21 '13 at 04:01
  • I believe Apache Commons also has a CSV parser. Regular expressions are definitely the wrong way to go. – David Conrad Jan 21 '13 at 04:07
  • @pst: I think regex can be used to parse CSV (although the solution is not trivial), but you have to know the exact format to write one that works correctly... – nhahtdh Jan 21 '13 at 06:58
  • @nhahtdh "Trivial" is the keyword. There are plenty of SO questions that cover matched/balanced quote [ir]regular expressions. Integrating one into a split is no less complicated, nor does it cover the case of n-escaped separators (which is what this question is really about). –  Jan 21 '13 at 18:21
  • I second @PaulVargas's comment. Please just use OpenCSV and be done, and not manually parse CSV with regexes. – Chris Jester-Young Feb 04 '13 at 02:40
  • You can replace "\," with a pattern and then split using comma, re-replace back what you split and that's all. – AndreaTaroni86 Oct 24 '20 at 09:09
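The replace-then-split idea from the last comment could be sketched roughly as below. This is only an illustration under assumptions: the class name `PlaceholderSplit` and the choice of the NUL character as placeholder are made up for the example, and it only works if the placeholder never occurs in the real data.

```java
import java.util.Arrays;

final class PlaceholderSplit {
    // Hypothetical placeholder: assumed never to appear in the input data.
    private static final String PLACEHOLDER = "\u0000";

    static String[] split(String line) {
        // Hide every escaped comma (backslash followed by comma) behind the placeholder.
        String masked = line.replace("\\,", PLACEHOLDER);
        // Split on the remaining, unescaped commas; limit -1 keeps trailing empty fields.
        String[] fields = masked.split(",", -1);
        // Put the escaped commas back into each field.
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].replace(PLACEHOLDER, "\\,");
        }
        return fields;
    }

    public static void main(String[] args) {
        // The Java literal "a\\,b,c" is the text a\,b,c
        System.out.println(Arrays.toString(split("a\\,b,c"))); // prints [a\,b, c]
    }
}
```

If the fields should come back unescaped, the last replace could restore a plain "," instead of "\\,".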

3 Answers

First, a comma has no special meaning in a regex at the position where you are using it, so you can omit the escape.

The biggest problem in your regex is that * on its own has no meaning. * means "zero or more occurrences of the preceding token", and at the start of the pattern there is no preceding token.

So the regex should be

.*,.* (escaping the comma, as in .*\,.*, should still be fine)

As for usage: you are passing the regex to String.split(), which expects a regex that matches the delimiter. Therefore you should pass only , as the regex. Using .*,.* as the "delimiter" will give you unexpected results (try it).
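To make the delimiter point concrete, here is a small sketch (the class and method names are just for illustration). It shows that a greedy pattern treats the whole line as one delimiter, and that a plain comma also splits at the escaped commas, which is the original problem.

```java
import java.util.Arrays;

final class DelimiterDemo {
    // Split on a plain comma: the regex describes only the delimiter.
    static String[] splitOnComma(String s) {
        return s.split(",");
    }

    // Split on ".*,.*": the whole line matches, so the whole line is the delimiter.
    static String[] splitOnGreedy(String s) {
        return s.split(".*,.*");
    }

    public static void main(String[] args) {
        // Nothing is left once the "delimiter" has swallowed the line.
        System.out.println(splitOnGreedy("a,b").length);                // prints 0
        // The Java literal "a\\,b,c" is the text a\,b,c
        System.out.println(Arrays.toString(splitOnComma("a\\,b,c")));   // prints [a\, b, c]
    }
}
```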

Adrian Shum
  • This will bomb the whole string when used with `split`. – nhahtdh Jan 21 '13 at 06:59
  • I was only talking about the validity of his regex, and hadn't paid attention to where he is using it (split). For use in split(), simply a comma should work. – Adrian Shum Jan 21 '13 at 07:25

Since I reached this page from a search, I will answer the question as stated and, for completeness, give the pattern:

columns = someString.split("[^\\\\],");

Note that you need 4 backslashes because it takes 2 backslashes in a Java string literal to produce 1 backslash. In other words, "\\" creates the string \ . So "\\\\" creates the string \\, which the regex reads as an escaped backslash matching the single char \. Therefore you need 4 backslashes in a string literal to match one backslash in a regex. The brackets and the caret are one way to express negation (specifically for a single character). Be aware that a character class matches, and therefore consumes, the character before the comma, so that character is dropped from the preceding field.
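As a rough comparison, the sketch below puts the character-class pattern next to a negative lookbehind, `(?<!\\\\),`, which checks the preceding character without consuming it. The lookbehind is swapped in here as an alternative and is not the pattern from this answer; the class and method names are made up for the example.

```java
import java.util.Arrays;

final class LookbehindSplit {
    // Character class: matches "any char except backslash" plus the comma,
    // so the char before the comma becomes part of the delimiter.
    static String[] withCharClass(String s) {
        return s.split("[^\\\\],");
    }

    // Negative lookbehind: matches only the comma, and only when the
    // char before it is not a backslash.
    static String[] withLookbehind(String s) {
        return s.split("(?<!\\\\),");
    }

    public static void main(String[] args) {
        // Two backslashes in a literal are one char; four are two chars.
        System.out.println("\\".length() + " " + "\\\\".length());       // prints 1 2
        // The Java literal "a\\,b,c" is the text a\,b,c
        System.out.println(Arrays.toString(withCharClass("a\\,b,c")));   // prints [a\,, c]
        System.out.println(Arrays.toString(withLookbehind("a\\,b,c")));  // prints [a\,b, c]
    }
}
```

Neither pattern understands an escaped backslash directly before a real comma (`\\,`); that case needs an actual parser.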

You can also surround CSV entries that you don't want to split with quotes. Then use the following solution: Java: splitting a comma-separated string but ignoring commas in quotes.

My personal preference would be to use split over a 3rd party parser because of the environment I code in.

Community
EngineerWithJava54321

The correct way is to use a parser (to deal with `\\`, `\,` and `,`), but a simple regex with a negative lookbehind can work:

jshell> "a,b".split("(?<!\\\\),")
$2 ==> String[2] { "a", "b" }

Here is how to check the patterns that don't work:

jshell> "a,b".split("[^\\\\],")
$1 ==> String[2] { "", "b" }

and

jshell> "a,b".split("*^\\,*")
|  java.util.regex.PatternSyntaxException thrown: Dangling meta character '*' near index 0
*^\,*
^
|        at Pattern.error (Pattern.java:1997)
|        at Pattern.sequence (Pattern.java:2172)
|        at Pattern.expr (Pattern.java:2038)
|        at Pattern.compile (Pattern.java:1760)
|        at Pattern.<init> (Pattern.java:1409)
|        at Pattern.compile (Pattern.java:1065)
|        at String.split (String.java:2307)
|        at String.split (String.java:2354)
|        at (#6:1)
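For the parser route, a minimal hand-rolled sketch might look like the following. It assumes the format is exactly "backslash escapes the next character, comma separates fields", with no quoting; the class name `EscapeAwareSplitter` is made up for the example, and a real CSV library will cover far more cases.

```java
import java.util.ArrayList;
import java.util.List;

final class EscapeAwareSplitter {
    // Splits on unescaped commas and removes the escaping backslashes.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\\' && i + 1 < line.length()) {
                // Escape sequence: take the next char literally (handles \\ and \,).
                current.append(line.charAt(++i));
            } else if (c == ',') {
                // Unescaped comma: the current field is complete.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                // Ordinary char (a lone trailing backslash also lands here).
                current.append(c);
            }
        }
        // The last field has no comma after it.
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // The Java literal below is the text a\\,b\,c,d
        System.out.println(split("a\\\\,b\\,c,d")); // prints [a\, b,c, d]
    }
}
```

Unlike the lookbehind regex, this treats `\\,` as an escaped backslash followed by a real separator.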
user1133275