I tried the following regex to split data in a text file, but I found a strange bug during testing: a fairly simple line was split clearly incorrectly. Sample code to illustrate the behavior:

        const string line = "511525,3122,9,39,2007,9,39,3127,9,39,\" -49,368.11 \",\"-32,724.16\",2,1,\" 2,347.91 \", -   ,\" 2,234.17 \", -   ,2.2,1.143,2,1.24,FALSE,1,2,0,311,511625";
        const string pattern = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

        Console.WriteLine();
        Console.WriteLine("SPLIT");
        var splitted = Regex.Split(line, pattern, RegexOptions.Compiled);
        foreach (var s in splitted)
        {
            Console.WriteLine(s);
        }

        Console.WriteLine();
        Console.WriteLine("REPLACE");
        var replaced = Regex.Replace(line, pattern, "!" , RegexOptions.Compiled);
        Console.WriteLine(replaced);

        Console.WriteLine();
        Console.WriteLine("MATCH");
        var matches = Regex.Matches(line, pattern);
        foreach (Match match in matches)
        {
            Console.WriteLine(match.Index);
        }

So, as you can see, Split is the only method that produces unexpected results (it splits at invalid positions!). Both Matches and Replace give absolutely correct results. I even tested the regex in RegexBuddy, and it showed the same matches as Regex.Matches. Am I missing something, or does this look like a bug in the Split method?

Console output:

SPLIT
511525
, -   ," 2,234.17 "
3122
, -   ," 2,234.17 "
9
, -   ," 2,234.17 "
39
, -   ," 2,234.17 "
2007
, -   ," 2,234.17 "
9
, -   ," 2,234.17 "
39
, -   ," 2,234.17 "
3127
, -   ," 2,234.17 "
9
, -   ," 2,234.17 "
39
, -   ," 2,234.17 "
" -49,368.11 "
, -   ," 2,234.17 "
"-32,724.16"
, -   ," 2,234.17 "
2
, -   ," 2,234.17 "
1
, -   ," 2,234.17 "
" 2,347.91 "
 -   ," 2,234.17 "
 -
" 2,234.17 "
" 2,234.17 "
 -
2.2
1.143
2
1.24
FALSE
1
2
0
311
511625

REPLACE
511525!3122!9!39!2007!9!39!3127!9!39!" -49,368.11 "!"-32,724.16"!2!1!" 2,347.91 "! -   !" 2,234.17 "! -   !2.2!1.143!2!1.24!FALSE!1!2!0!311!511625

MATCH
6
11
13
16
21
23
26
31
33
36
51
64
66
68
81
87
100
106
110
116
118
123
129
131
133
135
139
illegal-immigrant
  • I'll try to include them. They are quite big. – illegal-immigrant Jan 17 '12 at 14:27
  • I think your pattern is wrong (don't ask me how to get it right), because your code seems to skip a part of your string and then split. This is also the case for your Replace and Matches calls, but there it is not visible (when replacing, the code might jump through your string replacing stuff, but the output is good). – Moonlight Jan 17 '12 at 14:52
  • Have you read the remarks in the [documentation](http://msdn.microsoft.com/en-us/library/byy2946e.aspx) where it discusses capturing parentheses - and especially, the behaviour where multiple capturing parentheses are present? – Damien_The_Unbeliever Jan 17 '12 at 15:02
  • @Marnix van Valen As I mentioned, I've checked the regex with RegexBuddy. – illegal-immigrant Jan 17 '12 at 15:04
  • @Damien_The_Unbeliever Thanks, I'll read the remarks. – illegal-immigrant Jan 17 '12 at 15:05
  • @Damien_The_Unbeliever I am not sure. Do you think it's the reason? – illegal-immigrant Jan 17 '12 at 15:09
  • @taras.roshko - I think it might be. I set a breakpoint in the final foreach loop (for the `Match`) and examined the `match` object. What you're seeing as the second element of `splitted` is `match.Groups[1].Captures[3].Value`, so I'm thinking that it is to do with inserting captures in the output. – Damien_The_Unbeliever Jan 17 '12 at 15:11
  • @Damien_The_Unbeliever Take a look at the link I've posted. – illegal-immigrant Jan 17 '12 at 15:14
  • It also strikes me that this regex won't work as advertised if the *first* item in the list is a quoted string containing a comma. – Damien_The_Unbeliever Jan 17 '12 at 15:14

2 Answers

Solution from MS

(Adding ExplicitCapture regex option)
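
As a rough illustration of that suggestion, here is a minimal sketch that keeps the pattern from the question unchanged and only passes `RegexOptions.ExplicitCapture` to `Regex.Split`. The class name, the `SplitFields` helper, and the short sample line are made up for this example; they are not part of the original code or of the reply from Microsoft.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExplicitCaptureDemo
{
    // Same pattern as in the question: a comma that is followed by an
    // even number of double quotes up to the end of the line.
    private const string Pattern = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper: splits one line with ExplicitCapture enabled.
    public static string[] SplitFields(string line)
    {
        // With ExplicitCapture, the unnamed (...) group no longer captures,
        // so Regex.Split has no captured text to add to its result.
        return Regex.Split(line, Pattern, RegexOptions.ExplicitCapture);
    }

    public static void Main()
    {
        // Assumed sample input, much shorter than the line in the question.
        const string sample = "1,\" 2,347.91 \",FALSE";

        foreach (var field in SplitFields(sample))
        {
            Console.WriteLine(field);
        }
    }
}
```

For this sample the quoted amount should stay in one piece, because the comma inside the quotes is followed by an odd number of quotes and therefore does not match.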

illegal-immigrant

Based on your response from Microsoft (add ExplicitCapture), it seems the problem is the capturing group. The ExplicitCapture option turns that capturing group into a non-capturing group.

You can do the same without the option by making the group explicitly non-capturing:

const string pattern = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

which, testing with LINQPad, seems to produce the results you are looking for.

Whether there are any capturing groups makes a difference, as described in the docs for Regex.Split:

If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, splitting the string " plum-pear" on a hyphen placed within capturing parentheses adds a string element that contains the hyphen to the returned array.
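
The documented behavior can be seen with the same "plum-pear" input. The sketch below is only an illustration: the class name and the `Joined` helper are invented here, and the `|` separator is used just to make the array elements visible on one line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitCaptureDemo
{
    // Hypothetical helper: splits "plum-pear" with the given pattern and
    // joins the resulting elements with '|' so they are easy to compare.
    public static string Joined(string pattern)
    {
        return string.Join("|", Regex.Split("plum-pear", pattern));
    }

    public static void Main()
    {
        // Capturing parentheses: the hyphen itself becomes an array element.
        Console.WriteLine(Joined("(-)"));   // plum|-|pear

        // No group at all: only the text between the hyphens is returned.
        Console.WriteLine(Joined("-"));     // plum|pear

        // Non-capturing group: behaves like the version without a group.
        Console.WriteLine(Joined("(?:-)")); // plum|pear
    }
}
```

The pattern in the question has the same shape as the first case: its capturing group sits inside the lookahead, so the text captured there ends up interleaved with the real fields.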

Mark Peters