6

What is the regular expression to split on comma (,) except if surrounded by double quotes? For example:

max,emily,john = ["max", "emily", "john"]

BUT

max,"emily,kate",john = ["max", "emily,kate", "john"]

Looking to use in C#: Regex.Split(string, "PATTERN-HERE");

Thanks.

BoltClock
Justin

4 Answers

14

Situations like this often call for something other than regular expressions. They are nifty, but patterns for handling this kind of thing are more complicated than they are useful.

You might try something like this instead:

public static IEnumerable<string> SplitCSV(string csvString)
{
    var sb = new StringBuilder();
    bool quoted = false;

    foreach (char c in csvString) {
        if (quoted) {
            if (c == '"')
                quoted = false;
            else
                sb.Append(c);
        } else {
            if (c == '"') {
                quoted = true;
            } else if (c == ',') {
                yield return sb.ToString();
                sb.Length = 0;
            } else {
                sb.Append(c);
            }
        }
    }

    if (quoted)
        // ArgumentException takes the message first, then the parameter name.
        throw new ArgumentException("Unterminated quotation mark.", "csvString");

    yield return sb.ToString();
}

It probably needs a few tweaks to follow the CSV spec exactly, but the basic logic is sound.
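One such tweak might be support for doubled quotes (`""`) inside a quoted field, which is the usual CSV way of escaping a literal quote. The following is only a sketch under that assumption; the class and method names (`CsvLineSplitter`, `SplitCsvLine`) are hypothetical and not part of the answer above, and it still handles a single line only (no embedded newlines).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLineSplitter
{
    // Hypothetical helper: splits one CSV line, treating "" inside a
    // quoted field as an escaped literal quote.
    public static List<string> SplitCsvLine(string line)
    {
        var fields = new List<string>();
        var sb = new StringBuilder();
        bool quoted = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (quoted)
            {
                if (c == '"')
                {
                    // A doubled quote inside a quoted field is a literal quote.
                    if (i + 1 < line.Length && line[i + 1] == '"')
                    {
                        sb.Append('"');
                        i++; // skip the second quote of the pair
                    }
                    else
                    {
                        quoted = false; // closing quote
                    }
                }
                else
                {
                    sb.Append(c);
                }
            }
            else if (c == '"')
            {
                quoted = true; // opening quote, not part of the data
            }
            else if (c == ',')
            {
                fields.Add(sb.ToString());
                sb.Clear();
            }
            else
            {
                sb.Append(c);
            }
        }

        if (quoted)
            throw new FormatException("Unterminated quotation mark.");

        fields.Add(sb.ToString());
        return fields;
    }
}
```

Unlike the iterator version, this variant builds the list eagerly, so an unterminated quote is reported as soon as the method is called rather than when the result is enumerated.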

cdhowie
  • You're currently removing quotes from the string. Shouldn't you change the logic to always append `c` except when it's a comma? – Joren Nov 11 '10 at 01:48
  • No. Surrounding quotes are supposed to be removed as per the OP. They escape the comma; they are not data by themselves. Granted, this solution will not work if the data actually does contain meaningful quotation marks somewhere, but this has not been indicated yet. – cdhowie Nov 11 '10 at 01:49
  • Thanks. :) I am certainly open to refining it too if the OP indicates that quotes can be escaped or something like that. – cdhowie Nov 11 '10 at 01:54
  • Can you provide a use case example? For example I am doing: public CSV(string p_csv_full_path) { data = new List(); try { using(StreamReader readFile = new StreamReader(p_csv_full_path)) { string line; string[] row; while((line = readFile.ReadLine()) != null) { IEnumerable row_enumerable = this.SplitCSV(line); //data.Add(row); } } } catch(Exception ex) { throw ex; } } The problem is that I need the result of SplitCSV to be a string array so I can add to the list. – Justin Nov 11 '10 at 22:59
  • You can use the LINQ method `.ToArray()` to convert the enumerable into an array: `string[] row_array = SplitCSV(line).ToArray();` – cdhowie Nov 11 '10 at 23:10
  • cdhowie: The code you provided above isn't doing what I expected. So: 1, 1, "1234, Main St", San Diego, CA, 92101 was returned as: [0]="" [1]="" [2]="1234, Main St" [3]="" [4]="" [5]="" when it should be: [0]="1" [1]="1" [2]="1234, Main St" [3]="San Diego" [4]="CA" [5]="92101" – Justin Nov 11 '10 at 23:32
  • Ah, yes, my bad. I have updated the code in my answer to fix that bug. – cdhowie Nov 11 '10 at 23:36
  • I've had so many problems exporting data from one site to another using a CSV file, and this solution worked perfectly for that. – grimsan55 Sep 15 '15 at 06:55
1

This is a clear-cut case for a CSV parser, so you should be using .NET's own CSV parsing capabilities or cdhowie's solution.
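The answer does not say which built-in facility it has in mind. One candidate that ships with .NET is `TextFieldParser` in the `Microsoft.VisualBasic.FileIO` namespace (usable from C#, but it needs a reference to the Microsoft.VisualBasic assembly). A minimal sketch, assuming that is an acceptable dependency; the wrapper class `BuiltInCsv` and its `SplitLine` method are hypothetical names:

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class BuiltInCsv
{
    // Hypothetical wrapper: parses a single comma-delimited line,
    // honouring double-quoted fields.
    public static string[] SplitLine(string line)
    {
        using (var reader = new StringReader(line))
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            // ReadFields returns null when there is no data left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

For whole files, the same parser can be constructed over a file path or stream and `ReadFields()` called in a loop until `EndOfData` is true.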

Purely for your information and not intended as a workable solution, here's what contortions you'd have to go through using regular expressions with Regex.Split():

You could use the regex (please don't!)

(?<=^(?:[^"]*"[^"]*")*[^"]*)  # assert that there is an even number of quotes before...
\s*,\s*                       # the comma to be split on...
(?=(?:[^"]*"[^"]*")*[^"]*$)   # as well as after the comma.

if your quoted strings never contain escaped quotes, and you don't mind the quotes themselves remaining in the resulting fields.

This is horribly inefficient, a pain to read and debug, works only in .NET, and it fails on escaped quotes (at least if you're not using "" to escape a single quote). Of course the regex could be modified to handle that as well, but then it's going to be perfectly ghastly.
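For illustration only, and with the same caveats, this is roughly how the pattern above could be passed to `Regex.Split()`. The sketch writes the pattern on one line without the `#` comments (those would require `RegexOptions.IgnorePatternWhitespace`) and doubles the quotes because it sits in a C# verbatim string. The class name `LookaroundSplit` is hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class LookaroundSplit
{
    // Same pattern as above: split on a comma only when an even number
    // of quotes precedes it and an even number follows it.
    private const string Pattern =
        @"(?<=^(?:[^""]*""[^""]*"")*[^""]*)\s*,\s*(?=(?:[^""]*""[^""]*"")*[^""]*$)";

    public static string[] Split(string input)
    {
        // The pattern has no capturing groups, so only the fields are returned.
        return Regex.Split(input, Pattern);
    }
}
```

Calling `LookaroundSplit.Split` on the question's second example yields three fields, with the middle one still wrapped in its double quotes.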

Tim Pietzcker
0

A little late maybe but I hope I can help someone else

     String[] cols = Regex.Split("max, emily, john", @"\s*,\s*");
     foreach ( String s in cols ) {
        Console.WriteLine(s);
     }
0

Justin, resurrecting this question because it had a simple regex solution that wasn't mentioned. This situation sounds straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc.

Here's our simple regex:

"[^"]*"|(,)

The left side of the alternation matches complete "quoted strings". We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left. We replace these commas with SplitHere, then we split on SplitHere.

This program shows how to use the regex (see the results at the bottom of the online demo):

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string s1 = @"max,""emily,kate"",john";
        // Left alternative consumes whole quoted strings; Group 1 only
        // captures commas that sit outside quotes.
        var myRegex = new Regex(@"""[^""]*""|(,)");
        string replaced = myRegex.Replace(s1, delegate(Match m) {
            if (m.Groups[1].Value == "") return m.Value;
            else return "SplitHere";
        });
        string[] splits = Regex.Split(replaced, "SplitHere");
        foreach (string split in splits) Console.WriteLine(split);
        Console.WriteLine("\nPress Any Key to Exit.");
        Console.ReadKey();
    } // END Main
} // END Program
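The program above relies on the placeholder text SplitHere never occurring in the data. A possible variation, sketched here as an assumption rather than as part of the original answer, keeps the same regex but cuts the input at the positions of the Group 1 matches directly. The names `QuoteAwareSplitter` and `Split` are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteAwareSplitter
{
    // Same regex as above: quoted strings are matched and skipped,
    // commas outside quotes are captured in Group 1.
    private static readonly Regex QuotedOrComma = new Regex(@"""[^""]*""|(,)");

    public static List<string> Split(string input)
    {
        var fields = new List<string>();
        int start = 0;

        foreach (Match m in QuotedOrComma.Matches(input))
        {
            // Skip matches of the left alternative (whole quoted strings).
            if (!m.Groups[1].Success) continue;

            fields.Add(input.Substring(start, m.Index - start));
            start = m.Index + 1; // continue after the comma
        }

        fields.Add(input.Substring(start)); // trailing field
        return fields;
    }
}
```

As with the original program, the surrounding quotes stay in the returned fields.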

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

Community
zx81