
I asked this question earlier and it was closed as a duplicate, which I accept. I found the answer in the question "Java: splitting a comma-separated string but ignoring commas in quotes", so thanks to whoever posted it.

But I've since run into another issue. Apparently what I need to do is use "," as my delimiter when it is followed by zero or an even number of double-quotes, but also ignore any "," contained in brackets, i.e. inside ( ).

So the following:

"Thanks,", "in advance,", "for("the", "help")"

Would tokenize as:

  • Thanks,
  • in advance,
  • for("the", "help")

I'm not sure if there's any way to modify the regex I'm currently using to allow for this, but any guidance would be appreciated.

line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
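To make the problem concrete, here is a minimal sketch that runs the split above on two inputs: a plain quoted line, and the parameter-list form of the example (without the outer quotes around for(...), as clarified in the comments). The class name `LookaheadSplitDemo` and the `split` helper are made up for illustration.

```java
import java.util.Arrays;

public class LookaheadSplitDemo {
    // The regex from the question: a comma that is followed by an even
    // number of double quotes up to the end of the line.
    static final String SPLIT_REGEX = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] split(String line) {
        return line.split(SPLIT_REGEX);
    }

    public static void main(String[] args) {
        // Commas inside quotes are protected, so this splits into 3 pieces.
        System.out.println(Arrays.toString(split("a,\"b,c\",d")));

        // Brackets are not protected: the comma inside for(...) is followed
        // by an even number of quotes, so this splits into 4 pieces.
        String params = "\"Thanks,\", \"in advance,\", for(\"the\", \"help\")";
        System.out.println(Arrays.toString(split(params)));
    }
}
```

The lookahead only counts quotes, so it has no notion of being inside a bracket, which is why the second input is cut in the middle of the for(...) call.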
binarymelon
  • You should be using a real CSV-parser to handle that mess. Not **every** parsing problem is best handled with regexes. – Joachim Sauer Feb 22 '10 at 18:04
  • @Joachim, How many CSV parsers do you know that can handle quotes, inside brackets, inside quotes in the way that he wants? – Mark Byers Feb 22 '10 at 18:25
  • None, because it's invalid CSV format. – BalusC Feb 22 '10 at 19:08
  • It's not CSV. It's a list of parameters for a function call. I also realized my original input was incorrect. There should be no double-quotes surrounding for("the", "help"). – binarymelon Feb 23 '10 at 12:31

2 Answers


Sometimes it is easier to match what you want instead of what you don't want:

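import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "\"Thanks,\", \"in advance,\", \"for(\"the\", \"help\")\"";
// A token is a quote, then any run of bracketed groups or non-quote characters, then a quote.
String regex = "\"(\\([^)]*\\)|[^\"])*\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(s);
while (m.find()) {
    // group() returns the text of the current match
    System.out.println(m.group());
}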

Output:

"Thanks,"
"in advance,"
"for("the", "help")"

If you also need it to ignore closing brackets that appear inside quoted sections which are themselves inside brackets, then you need this:

 String regex = "\"(\\((\"[^\"]*\"|[^)])*\\)|[^\"])*\"";

An example of a string which needs this second, more complex version is:

 "foo","bar","baz(":-)",":-o")"

Output:

"foo"
"bar"
"baz(":-)",":-o")"
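For completeness, here is a minimal runnable sketch that applies the second, more complex regex to the example string and collects the matches in a list. The class name `QuotedTokenMatcher` and the `tokens` method are hypothetical names chosen for this illustration; only the regex itself comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedTokenMatcher {
    // Second regex from the answer: inside brackets, a quoted section is
    // consumed as a unit, so a ')' inside it does not close the bracket.
    private static final Pattern TOKEN =
            Pattern.compile("\"(\\((\"[^\"]*\"|[^)])*\\)|[^\"])*\"");

    // Returns every match of TOKEN in s, in order of appearance.
    static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "\"foo\",\"bar\",\"baz(\":-)\",\":-o\")\"";
        for (String token : tokens(s)) {
            System.out.println(token);
        }
    }
}
```

Note that this still assumes brackets are not nested and that quotes are not escaped; both would need a more elaborate pattern or a real parser.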

However, I'd advise you to change your data format if at all possible. This would be a lot easier if you used a standard format like XML to store your tokens.

Mark Byers

A home-grown parser is easily written.

For example, this ANTLR grammar takes care of your example input without much trouble:

parse
  :  line*
  ;

line
  :  Quoted ( ',' Quoted )* ( '\r'? '\n' | EOF )
  ;

Quoted
  :  '"' ( Atom )* '"'
  ;

fragment
Atom
  :  Parentheses
  |  ~( '"' | '\r' | '\n' | '(' | ')' )
  ;

fragment
Parentheses
  :  '(' ~( '(' | ')' | '\r' | '\n' )* ')'
  ;

Space
  :  ( ' ' | '\t' ) {skip();}
  ;

and it would be easy to extend this to take escaped quotes or parentheses into account.

When feeding the following two lines of input to the parser generated from that grammar:

"Thanks,", "in advance,", "for("the", "help")"
"and(,some,more)","data , here"

it gets parsed like this:

[image of the parse result; not available]

If you are considering ANTLR for this, I can post a short how-to on generating a parser from the grammar above.
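As a hand-written alternative to a generated parser, here is a minimal sketch of a scanner in plain Java. It is not derived from the grammar above. It assumes the input is the parameter-list form described in the question's comments: commas separate tokens unless they sit inside double quotes or inside brackets, brackets are only counted outside quotes, and there are no escaped quotes. The class name `ParamListSplitter` and its `split` method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ParamListSplitter {
    // Splits on commas that are outside double quotes and outside brackets.
    // Assumption: no escaped quotes; brackets inside quotes are plain text.
    static List<String> split(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // true while between an opening and closing quote
        int depth = 0;            // bracket nesting depth outside quotes

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (!inQuotes && c == '(') {
                depth++;
            } else if (!inQuotes && c == ')' && depth > 0) {
                depth--;
            } else if (c == ',' && !inQuotes && depth == 0) {
                // Top-level comma: finish the current token and skip the comma.
                tokens.add(current.toString().trim());
                current.setLength(0);
                continue;
            }
            current.append(c);
        }
        tokens.add(current.toString().trim());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("\"Thanks,\", \"in advance,\", for(\"the\", \"help\")"));
        System.out.println(split("\"and(,some,more)\",\"data , here\""));
    }
}
```

Unlike the regex approaches, the depth counter also copes with nested brackets. Tokens are trimmed, and an empty input line yields a single empty token; adjust that if your data needs different handling.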

Bart Kiers