0

I have a (ton of) string(s) like this:

HKR,Drivers,SubClasses,,"wave,midi,mixer,aux"

I'm basically looking to split it into multiple strings at the , character.

However, if the , character is inside " or inside %, it should be ignored.

In other words from the line above I'd expect the strings:

HKR
Drivers
SubClasses

"wave,midi,mixer,aux"

(with one empty string represented by the blank line above).

If the line was HKR,Drivers,SubClasses,,%wave,midi,mixer,aux% then basically the same as above, only of course the last string to be returned should be %wave,midi,mixer,aux%.

I have some working code, but it's incredibly slow to process all of the lines and I very much need to find a faster way to do this.

private static IEnumerable<string> GetValues(string line)
{
    var insideQuotes = false;
    var insidePercent = false;
    var startValueIndex = 0;

    for (var i = 0; i < line.Length; i++)
    {
        if (line[i] == '%' && !insideQuotes)
        {
            insidePercent = !insidePercent;
        }

        if (line[i] == '"')
        {
            insideQuotes = !insideQuotes;
        }

        if (line[i] != ',' || insideQuotes || insidePercent)
        {
            continue;
        }

        yield return line.Substring(startValueIndex, i - startValueIndex);
        startValueIndex = i + 1;
    }
}

Any help would be appreciated.

cogumel0
  • 2,126
  • 3
  • 21
  • 38
  • If you read the question, I don't want to split it at *every* `,`. – cogumel0 Jul 18 '17 at 15:20
  • 4
    Use `VisualBasic.TextFieldParser` and set [`HasFieldsEnclosedInQuotes`](https://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.hasfieldsenclosedinquotes(v=vs.110).aspx) to `true`. – Tim Schmelter Jul 18 '17 at 15:20
  • @cogumel0 He's talking about using a Regex in the Split – Yahya Jul 18 '17 at 15:21
  • @Yahya I understand the question was initially marked as java by mistake and java might be different, but in .NET there's no `string.Split` that accepts a regex. – cogumel0 Jul 18 '17 at 15:22
  • 1
    It looks like a CSV. Why not using a CSV library like [CsvHelper](https://www.nuget.org/packages/CsvHelper)? – aloisdg Jul 18 '17 at 15:23
  • @Yahya and if a regex is the way to go, then post a regex that works and I'll gladly accept it as a question. I'm sure there's a better way, there's no denying it, that's why I asked here. – cogumel0 Jul 18 '17 at 15:23
  • How slow is "incredibly slow" exactly, and how would you suppose this could be sped up? – CodeCaster Jul 18 '17 at 15:24
  • @aloisdg it's not a CSV, it's very different from a CSV unfortunately. Only looks like one in these examples. – cogumel0 Jul 18 '17 at 15:24
  • @cogumel0 No YOU need to understand that *mistake* gave you many unsolicited advice!! For using Regex, you can have a look at [this](https://stackoverflow.com/questions/1757065/java-splitting-a-comma-separated-string-but-ignoring-commas-in-quotes) question which is very much similar to what you want. – Yahya Jul 18 '17 at 15:25
  • 4
    I'm voting to close this question as off-topic because this is working code that needs a review. Thus it should go to codereview.stackexchange.com. Aprt from this you should use a profiler to determine if this is really the performance-bottleneck within your code. – HimBromBeere Jul 18 '17 at 15:26
  • @Yahya forgot an *I* in my comment above. – cogumel0 Jul 18 '17 at 15:26
  • @CodeCaster taking several seconds over a few thousand lines to process. I imagine that at the very minimum getting rid of the `string.Substring` call would make it faster. But hoping a regex can make it incredibly faster. – cogumel0 Jul 18 '17 at 15:28
  • 2
    My guess is that a regex is going to make it incredibly slower, but why don't you try? Anyway it really doesn't look to me like this actually is the bottleneck of your code. The CPU can do this at lightning speed. Are you measuring the time to read the file from disk as well? – CodeCaster Jul 18 '17 at 15:28
  • 1
    One way or another you'll need to be splitting the string up, and regex is likely to be *slower* not faster – Jamiec Jul 18 '17 at 15:29
  • @Jamiec with regex you can use capture groups though so you get the result directly on the `Matches` – cogumel0 Jul 18 '17 at 15:30
  • 1
    @cogumel0 that isnt a magic wand which will magicaly make it faster! – Jamiec Jul 18 '17 at 15:33
  • Must tell you ive been trying to improve your code and I cannot get the average runtime of your method above 0ms (0.00765ms to be exact). Even for a much longer string than your given input. Are you *sure* this is your bottleneck? (This is why premature optimization is the work of the devil BTW!) – Jamiec Jul 18 '17 at 15:41
  • Oh you have a bug also - the last item is not returned - you need a `yield return line.Substring(startValueIndex, line.Length-startValueIndex);` after the loop. – Jamiec Jul 18 '17 at 15:48
  • I doubt that you can make a huge performance-improvement by modifying this few lines of code unless either your string has millions of lines. You should focus on your *actual* problems, not assuming there *might* be one without measuring if this is true (which is done by a profiler, or at least a `StopWatch`). – HimBromBeere Jul 18 '17 at 15:52
  • Ok, turns out everyone saying this wasn't the bottleneck was right. There was a rogue `Console.WriteLine` hidden in the caller... – cogumel0 Jul 19 '17 at 08:24

2 Answers2

1

Use VisualBasic.TextFieldParser and set HasFieldsEnclosedInQuotes to true.

I would use a method like this which processes all lines at once:

public static IEnumerable<string[]> GetValues(string allLines)
{
    using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(new StringReader(allLines)))
    {
        parser.HasFieldsEnclosedInQuotes = true;
        parser.Delimiters = new[] { "," };
        while (!parser.EndOfData)
        {
            string[] nextLineFields = parser.ReadFields();
            yield return nextLineFields;
        }
    }
}

Your sample:

var allLinesFields = GetValues("HKR,Drivers,SubClasses,,\"wave, midi, mixer, aux\"");
foreach (string[] lineFields in allLinesFields)
    Console.WriteLine(string.Join(Environment.NewLine, lineFields));

It will be more efficient than String.Split and also supports other things which you might not even have thought of. You can also handle special exceptions if the format was invalid.

Tim Schmelter
  • 411,418
  • 61
  • 614
  • 859
  • 1
    Not sure how well this will deal with the fields enclosed in `%` though – Jamiec Jul 18 '17 at 15:31
  • 2
    The question claims the code shown is too slow. This answer contains no evidence this approach is any faster. – CodeCaster Jul 18 '17 at 15:32
  • @Jamiec: that will be a problem, haven't seen it yet :) So OP has no idea in what format his csv comes? However, he could use other csv parsers like [this](http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader) which support it – Tim Schmelter Jul 18 '17 at 15:32
  • @CodeCaster: i am not responsible for evidences, i just want to help. I know this is more efficient though – Tim Schmelter Jul 18 '17 at 15:37
0

I've just reordered some of your query statements to avoid string operations. This should be more efficient.

   private static IEnumerable<string> GetValues2(string line)
    {
        bool insideQuotes = false;
        bool insidePercent = false;
        int startValueIndex = 0;

        for (int i = 0; i < line.Length; i++)
        {
            if (!insideQuotes && line[i] == '%')
            {
                insidePercent = !insidePercent;
            }

            if (line[i] == '"')
            {
                insideQuotes = !insideQuotes;
            }

            if (insideQuotes || insidePercent || line[i] != ',')
            {
                continue;
            }

            yield return line.Substring(startValueIndex, i - startValueIndex);
            startValueIndex = i + 1;
        }
    }
Roey Liba
  • 17
  • 5
Ben
  • 738
  • 1
  • 5
  • 14