2

I'm new to VB, C#, and am struggling with regex. I think I've got the following code format to replace the regex match with blank space in my file.

EDIT: Per comments this code block has been changed.

var fileContents = System.IO.File.ReadAllText(@"C:\path\to\file.csv");

// pattern is the regex string I still need to work out (see below)
var regex = new Regex(pattern);
fileContents = regex.Replace(fileContents, "");
System.IO.File.WriteAllText(@"C:\path\to\file.csv", fileContents);

My files are formatted like this:

"1111111","22222222222","Text that may, have a comma, or two","2014-09-01",,,,,,

So far, I have regex finding any string between ," and ", that contains a comma (there are never commas in the first or last cell, so I'm not worried about excluding those two). I'm testing the regex in Expresso.

(?<=,")([^"]+,[^"]+)(?=",)

I'm just not sure how to isolate that comma as what needs to be replaced. What would be the best way to do this?

SOLVED: Combined [^"]+ with look behind/ahead:

(?<=,"[^"]+)(,)(?=[^"]+",)

FINAL EDIT: Here's my final complete solution:

//read file contents
var fileContents = System.IO.File.ReadAllText(@"C:\path\to\file.csv");

//find all commas between double quotes
var regex = new Regex("(?<=,\")([^\"]+,[^\"]+)(?=\",)");

//replace all commas with ""
fileContents = regex.Replace(fileContents, m => m.ToString().Replace(",", ""));

//write result back to file
System.IO.File.WriteAllText(@"C:\path\to\file.csv", fileContents);
KingRichard
  • Same question for Java: http://stackoverflow.com/questions/1757065/java-splitting-a-comma-separated-string-but-ignoring-commas-in-quotes – D Stanley Dec 10 '14 at 17:30
  • Filecontents.Replace does not regex replace for starters. You create a Regex regex = new Regex(pattern); then you do regex.Replace(filecontents, replacement); – Florian Schmidinger Dec 10 '14 at 17:30
  • @DStanley I'm not trying to split the string – KingRichard Dec 10 '14 at 17:33
  • @FlorianSchmidinger thanks for that explanation, I'll try it that way, but I still need to figure out the correct regex – KingRichard Dec 10 '14 at 17:33
  • You want the contents of the 3rd column replaced? Can't see the requirement clearly – Florian Schmidinger Dec 10 '14 at 17:34
  • I just want to replace commas between `,"` and `",` with `nothing`. There are 2-3 columns that may contain commas, which is why I wrote the regex the way I did. I just can't figure out how to isolate it down to just the comma. – KingRichard Dec 10 '14 at 17:39
  • This may not be an issue, but your solution does not work if the comma is at the start or end of the field, e.g. ",this has a comma" – Martin Brown Dec 10 '14 at 18:30
  • On your 'final solution', it makes no sense to use a delegate with the regex `(?<=,"[^"]+),(?=[^"]+",)`, since that regex uses a variable-length lookbehind to match the _next_ single comma as the engine bumps along. Either use the regex in _your_ posted answer _without_ the delegate, or use the one posted by @MarkPeters with the delegate. –  Dec 10 '14 at 20:35
  • I'm not sure if you're understanding my regex: `?<=,"` (look behind for `,"`) followed by `[^"]+` (one or more chars that are not a `"`) THEN `?=` (look ahead for) `[^"]+` (one or more chars not a `"`) followed by `",` – KingRichard Dec 10 '14 at 21:15
  • @RichardN - When you use that regex, it only finds a single character, which it replaces. The match evaluator delegate is an expensive _callback_ whose primary purpose is to do a sub-replacement within a main general replacement. Using the same regex, try this `Console.WriteLine(Regex.Replace(@",""one, two"",", "(?<=,\"[^\"]+),(?=[^\"]+\",)", ""));` then this `Console.WriteLine(Regex.Replace(@",""one, two"",", "(?<=,\"[^\"]+),(?=[^\"]+\",)", m => m.ToString().Replace(",", "")));` –  Dec 10 '14 at 22:34
  • OK, I think I see what you're saying. I didn't notice Mark had used my initial regex. So basically, instead of finding each comma and then replacing it, it would take the whole string in the quotes and replace any and all commas at once. Thanks for the explanation. I realized I don't know what you mean by 'a delegate', so maybe that's what I wasn't understanding. Like I said initially, I'm new to C# (as in, this is my first C# script). Thanks again! – KingRichard Dec 10 '14 at 23:44

4 Answers

1

Try to parse out all your columns with this:

 Regex regex = new Regex("(?<=\").*?(?=\")");

Then you can just do:

 foreach (Match match in regex.Matches(fileContents))
 {
      fileContents = fileContents.Replace(match.ToString(), match.ToString().Replace(",", string.Empty));
 }

Might not be as fast but should work.

Orilux
Florian Schmidinger
1

Figured it out by combining the [^"]+ with the look ahead ?= and look behind ?<= so that it finds strings beginning with ,"[anything that's not double quotes, one or more times] then has a comma, then ends with [anything that's not double quotes, one or more times]",

(?<=,"[^"]+)(,)(?=[^"]+",)
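For anyone who wants to try this outside Expresso, here is a minimal sketch of the pattern applied to the sample line from the question. The class and method names (`QuotedCommaDemo`, `StripQuotedCommas`) are made up for illustration, and it assumes the .NET regex engine, which allows the variable-length lookbehind used here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedCommaDemo
{
    // Pattern from this answer (without the capture group, which a plain
    // replace does not need): a comma preceded by ,"<non-quotes> and
    // followed by <non-quotes>",
    public const string Pattern = @"(?<=,""[^""]+),(?=[^""]+"",)";

    // Hypothetical helper: removes only the commas the pattern matches.
    public static string StripQuotedCommas(string text)
    {
        return Regex.Replace(text, Pattern, "");
    }

    public static void Main()
    {
        // Sample line from the question, written as a verbatim string.
        string line = @"""1111111"",""22222222222"",""Text that may, have a comma, or two"",""2014-09-01"",,,,,,";
        Console.WriteLine(StripQuotedCommas(line));
    }
}
```

Because each match is a single comma, a plain string replacement is enough here; no match evaluator is needed.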

KingRichard
  • This works ok. You could even use `(?<=,"[^"]*),(?=[^"]*",)` to handle edge cases like `delimiter",middle,"delimiter`. +1 –  Dec 10 '14 at 18:30
  • Yeah, I guess that would work too. It will never happen, as the files I'm dealing with are auto-generated in a specific format; the `,` inside the field only appears in numbers such as `10,000` or `1,000,000`. I guess I could even use `(?<=[0-9]+),(?=[0-9]+)` – KingRichard Dec 10 '14 at 18:42
  • There you go, that makes sense. –  Dec 10 '14 at 18:45
0

I would probably use the overload of Regex.Replace that takes a delegate to return the replaced text. This is useful when you have a simple regex to identify the pattern but you need to do something less straightforward (complex logic) for the replace.

I find keeping your regexes simple will pay benefits when you're trying to maintain them later.

Note: this is similar to the answer by @Florian, but this replace restricts itself to replacement only in the matched text.

string exp = "(?<=,\")([^\"]+,[^\"]+)(?=\",)";
var regex = new Regex(exp);
string replacedtext = regex.Replace(fileContents, m => m.ToString().Replace(",", ""));
Mark Peters
0

What you have there is an irregular language, because a comma can mean different things depending on where it is in the text stream. Strangely, regular expressions are designed to parse regular languages, where a comma would mean the same thing regardless of where it is in the text stream. What you need for an irregular language is a parser. In fact, regular expressions are mostly used for tokenizing strings before they are passed to a parser.

While what you are trying to do can be done using regular expressions, it is likely to be very slow. For example, you can use the following (which will work even if the comma is the first or last character in the field). However, every time it finds a comma it will have to scan backwards and forwards to check whether it is between two quotation characters.

 (?<=,"[^"]*),(?=[^"]*",)
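To make the difference between `*` and `+` concrete, here is a small sketch comparing the two patterns on a field that starts with a comma. The test line and the names `EdgeCommaDemo` and `Strip` are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EdgeCommaDemo
{
    // '+' needs at least one non-quote character on each side of the comma.
    public const string PlusPattern = @"(?<=,""[^""]+),(?=[^""]+"",)";

    // '*' also accepts a comma that is the first or last character of the field.
    public const string StarPattern = @"(?<=,""[^""]*),(?=[^""]*"",)";

    // Hypothetical helper: removes whatever commas the given pattern matches.
    public static string Strip(string text, string pattern)
    {
        return Regex.Replace(text, pattern, "");
    }

    public static void Main()
    {
        // Made-up line whose middle field begins with a comma.
        string line = @"""a"","",this has a comma"",""b"",";
        Console.WriteLine(Strip(line, PlusPattern)); // leading comma is kept
        Console.WriteLine(Strip(line, StarPattern)); // leading comma is removed
    }
}
```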

Note also that there may be a flaw in this approach that you have not yet spotted. I don't know if you have this issue, but CSV files often have quotation characters in the middle of fields that may also contain a comma. In these cases applications like MS Excel will typically double the quote to show that it is not the end of the field. Like this:

"1111111","22222222222","Text that may, have a comma, Quote"" or two","2014-09-01",,,,,,

In this case you are going to be out of luck with a regular expression.

Thankfully the code to deal with CSV files is very simple:

    public static IList<string> ParseCSVLine(string csvLine)
    {
        List<string> result = new List<string>();
        StringBuilder buffer = new StringBuilder();

        bool inQuotes = false;
        char lastChar = '\0';

        foreach (char c in csvLine)
        {
            switch (c)
            {
                case '"':
                    if (inQuotes)
                    {
                        inQuotes = false;
                    }
                    else
                    {
                        if (lastChar == '"')
                        {
                            buffer.Append('"');
                        }
                        inQuotes = true;
                    }
                    break;

                case ',':
                    if (inQuotes)
                    {
                        buffer.Append(',');
                    }
                    else
                    {
                        result.Add(buffer.ToString());
                        buffer.Clear();
                    }
                    break;

                default:
                    buffer.Append(c);
                    break;
            }

            lastChar = c;
        }
        result.Add(buffer.ToString());
        buffer.Clear();

        return result;
    }

PS. There are another couple of issues often run into with CSV files which the code I have given doesn't solve. First is what happens if a field has an end of line character in the middle of it? Second is how do you know what character encoding a CSV file is in? The former of these two issues is easy to solve by modifying my code slightly. The second however is near impossible to do without coming to some agreement with the person supplying the file to you.
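If the goal is only to drop commas inside quoted fields, rather than to split the line into columns, the same quote-tracking idea can be applied directly to the text. This is a rough sketch under that assumption, not a full CSV parser; `QuoteAwareStripper` and `StripCommasInsideQuotes` are made-up names. A doubled quote toggles the flag twice, so the scan stays inside the field, and because it works character by character it can be run over the whole file contents.

```csharp
using System;
using System.Text;

public static class QuoteAwareStripper
{
    // Hypothetical helper: copies the text, skipping commas that fall
    // between an opening and a closing quotation character.
    public static string StripCommasInsideQuotes(string text)
    {
        var result = new StringBuilder(text.Length);
        bool inQuotes = false;

        foreach (char c in text)
        {
            if (c == '"')
            {
                // A doubled quote ("") flips the flag twice, leaving it unchanged.
                inQuotes = !inQuotes;
            }
            else if (c == ',' && inQuotes)
            {
                continue; // drop commas that belong to a quoted field
            }
            result.Append(c);
        }

        return result.ToString();
    }

    public static void Main()
    {
        // The doubled-quote example from above, as a verbatim string.
        string line = @"""1111111"",""22222222222"",""Text that may, have a comma, Quote"""" or two"",""2014-09-01"",,,,,,";
        Console.WriteLine(StripCommasInsideQuotes(line));
    }
}
```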

Martin Brown
  • Thanks for all the details here. It was very informational. Just to clarify, my regex is `(?<=,"[^"]+),(?=[^"]+",)` using `+` instead of `*` so that it requires one or more chars between the `,"` and `,` – KingRichard Dec 10 '14 at 18:37