I'm trying to figure out how to write a simple regex for Visual Basic .NET, but I'm not getting anywhere.

I'm parsing CSV files into a list of arrays, but the source CSVs are anything but pristine. There are extra/rogue quotes in just enough places to crash the program, and enough sets of quotes to make fixing the data manually cumbersome.

I've written in a bunch of error-checking, and it works about 99.99% of the time. However, with 10,000 lines to parse for each folder, that averages one error per set of CSV files. Crash. To get that last 0.01% parsed properly, I've created an If statement that pulls out lines that have an odd number of quotes and removes ALL of them, which triggers a manual error-check. If there are zero quotes, the line processes as usual. If there's an even number of quotes, the standard Split function cannot ignore delimiters between quotes without a regex.

Could someone help me figure out a regex string that will ignore commas inside fields enclosed in quotes?
Here's the code I've come up with so far.

Thank you in advance

Using filereader1 As New Microsoft.VisualBasic.FileIO.TextFieldParser(files_(i),
              System.Text.Encoding.Default) 'system text decoding adds odd characters

    filereader1.TextFieldType = FieldType.Delimited
    'filereader1.Delimiters = New String() {","}
    filereader1.SetDelimiters(",") 
    filereader1.HasFieldsEnclosedInQuotes = True 


    For Each c As Char In whole_string
        If c = """" Then cnt = cnt + 1
    Next
    If cnt = 0 Then 'no quotes
        split_string = Split(whole_string, ",") 'split by commas
    ElseIf cnt Mod 2 = 0 Then 'even number of quotes

         split_string = Regex.Split(whole_string, "(?=(([^""]|.)*""([^""]|.)*"")*([^""]|.)*$)")
    ElseIf cnt <> 0 Then 'odd number of quotes
        whole_string = whole_string.Replace("""", " ") 'delete all quotes
        split_string = Split(whole_string, ",") 'split by commas
    End If
user3697824
  • Please give us a concrete example of what you want to ignore, and in what context (in a line, etc.) If you want me to see your message, reply with `@zx81` at the front. – zx81 Jun 18 '14 at 00:10
  • @zx81
    Input line:
    LIST,410210,2-4,"PUMP, HYDRAULIC PISTON - MAIN",1,,,
    Desired output line (delimited at pipes):
    LIST|410210|2-4|"PUMP, HYDRAULIC PISTON - MAIN"|1|||
    Current output line (delimited at pipes):
    LIST|410210|2-4|"PUMP| HYDRAULIC PISTON - MAIN"|1|||
    – user2175620 Jun 18 '14 at 12:49
  • Thanks for clarifying. Posted two options, let me know how they work. :) – zx81 Jun 18 '14 at 21:14

1 Answer

In VB.NET, there are several ways to proceed.

Option 1

You can use this regex: ,(?![^",]*")

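It is meant to match commas that are not inside quotes: a comma , that is not followed (as asserted by the negative lookahead (?![^",]*") ) by a run of characters that are neither a comma nor a quote, and then a quote. Note that this is a heuristic. It also skips a comma that sits directly before an opening quote (as in 2-4,"PUMP), and in a quoted field that contains several commas it only protects the last one.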

In VB.NET, something like:

Dim MyRegex As New Regex(",(?![^"",]*"")")
ResultString = MyRegex.Replace(Subject, "|")
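
If a quoted field can contain more than one comma, or can directly follow a delimiter, a stricter lookahead may be safer. The following is only a sketch and not part of the original answer: it splits on a comma only when an even number of quotes remains between that comma and the end of the line. The name SplitOutsideQuotes is made up for illustration, and the pattern assumes the quotes in the line are balanced, which corresponds to the even-quote branch in the question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SplitSketch
    ' Hypothetical helper (name chosen for illustration).
    ' Splits on commas that are followed by an even number of quotes,
    ' i.e. commas that sit outside any quoted field.
    ' Assumes the quotes in the line are balanced.
    Function SplitOutsideQuotes(line As String) As String()
        Return Regex.Split(line, ",(?=(?:[^""]*""[^""]*"")*[^""]*$)")
    End Function

    Sub Main()
        Dim subject As String = "LIST,410210,2-4,""PUMP, HYDRAULIC PISTON - MAIN"",1,,,"
        ' Join with pipes to compare against the desired output in the comments.
        Console.WriteLine(String.Join("|", SplitOutsideQuotes(subject)))
    End Sub
End Module
```

For the sample line from the comments, this should yield eight fields, with the quoted field kept whole and the trailing empty fields preserved. The groups inside the lookahead are non-capturing on purpose, because Regex.Split adds the text of any capturing group to the result array.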

Option 2

This uses this beautifully simple regex: "[^"]*"|(,)

This is a more general solution that is easy to tweak. For a full description, I recommend you have a look at this question about Regex-matching or replacing... except when.... It can make a very tidy solution that is easy to maintain if you find other cases to handle.

The left side of the alternation | matches complete "quotes". We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right ones because they were not matched by the expression on the left.

This code should work:

Imports System
Imports System.Text.RegularExpressions

Module Module1
Sub Main()
' Left of the |: a complete quoted string. Right of the |: a comma, captured to Group 1.
Dim MyRegex As New Regex("""[^""]*""|(,)")
Dim Subject As String = "LIST,410210,2-4,""PUMP, HYDRAULIC PISTON - MAIN"",1,,,"
Dim Replaced As String = MyRegex.Replace(Subject,
                     Function(m As Match)
                        If m.Groups(1).Success Then
                            ' Group 1 is set: a comma outside quotes, so replace it.
                            Return "|"
                        Else
                            ' Group 1 is not set: a quoted string, so return it unchanged.
                            Return m.Value
                        End If
                     End Function)
Console.WriteLine(Replaced)

Console.WriteLine(vbCrLf & "Press Any Key to Exit.")
Console.ReadKey()
End Sub
End Module
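
As an illustration of how this pattern can be tweaked, here is a sketch that also tolerates doubled quotes ("") inside a quoted field, which is the usual CSV way of escaping a quote. It is not from the original answer. The helper name CommasToPipes is hypothetical, and whether the source files actually use doubled quotes is an assumption.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TweakSketch
    ' Left side: a quoted field in which "" counts as an escaped quote.
    ' Right side: a comma captured to Group 1, reached only outside quotes.
    Private ReadOnly QuotedOrComma As New Regex("""(?:[^""]|"""")*""|(,)")

    ' Hypothetical helper (name chosen for illustration).
    ' Replaces commas outside quoted fields with pipes and
    ' returns quoted fields unchanged.
    Function CommasToPipes(line As String) As String
        Return QuotedOrComma.Replace(line,
            Function(m As Match) If(m.Groups(1).Success, "|", m.Value))
    End Function

    Sub Main()
        ' The input line is: x,"a ""b"", c",d
        Console.WriteLine(CommasToPipes("x,""a """"b"""", c"",d"))
    End Sub
End Module
```

Only the left side of the alternation changed. The replacement logic stays the same, which is the main appeal of this approach when new special cases turn up.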

zx81