Trouble Parsing Unicode CSV File

Question

I'm trying to load a CSV into a DataTable using this:

class CSVReader
{
    public System.Data.DataTable GetDataTable(string strFileName)
    {
        System.Data.OleDb.OleDbConnection conn = new System.Data.OleDb.OleDbConnection
        (
        "Provider=Microsoft.Jet.OleDb.4.0; Data Source = " + System.IO.Path.GetDirectoryName(strFileName) + 
        "; Extended Properties = \"Text;HDR=YES;FMT=Delimited\""
        );
        conn.Open();
        string strQuery = "SELECT * FROM [" + System.IO.Path.GetFileName(strFileName) + "]";
        System.Data.OleDb.OleDbDataAdapter adapter = new System.Data.OleDb.OleDbDataAdapter(strQuery, conn);
        System.Data.DataSet ds = new System.Data.DataSet("CSV File");
        adapter.Fill(ds);
        return ds.Tables[0];
    }
}

It works fine one one CSV, but not another. Here is the snippet of the file that doesn't load properly: enter image description here

It simply loads "T" as the first column name, and everything else is blank/null. I tried manually looking at the first line with

Using (StreamReader x = new StreamReader(fileName) { string firstline x = x.ReadLine(); }

and the equivalent File.ReadAllLines and referenced the array's "0" entry ( file[0] ).

Both simply return "T" as the "first line" and anything beyond that is blank. Any ideas why it only sees the first character in the CSV and nothing else?

EDIT: First line looks like this:

TERM(s),OBJECTID,FILE,PATH,HIT COUNT

The second line looks like this:

"(test)","172911","16369318","Item001.E01/Partition 1/NONAME [NTFS]/[unallocated space]/13621367/16369318","4"

EDIT 2: I switched over the the lib somebody linked (CVSReader) and it seems to have taken a lot of the pain out. I tried encoding the file in UTF-8 with Notepad++ and it gets farther along, until it gets to:

LumenWorks.Framework.IO.Csv.MalformedCsvException was unhandled Message=The CSV appears to be corrupt near record '1373' field '3 at position '2601'. Current raw data : '32/System.ServiceModel/06d6eab93282d2b136a377bd50b7c5a9/System.ServiceModel.ni.dll","11" "(vc)","40656","Adobe AIR Application Installer.swf","Item001.E01/Partition 1/NONAME [NTFS]/[root]/Program Files/Common Files/Adobe AIR/Versions/1.0/Adobe AIR Application Installer.swf","11" "(vc)","503322","䄳䆷䞫䄦䠥","Item001.E01/Partition 1/NONAME [NTFS]/[root]/WINDOWS/Installer/520ae67.msp/䄳䆷䞫䄦䠥","11"

I'm guessing that it is taking issue with the foreign characters in UTF-8 encoding. If I leave the file the way it was, original encoding, it processes poorly/incorrectly. I don't want to make the user have to open the file and save it as ASCII/UTF-16 since it is ~90mb. I've been trying to google around, but most people say .NET can handle any encoding.

It seems the file is outputted as UCS-2 LE (which I think is UTF-16, right?). I"m confused why CVSReader/StreamReader are taking issue.

Diagnosed but not fully solved

When I pass a "characterset=Unicode" appended to the string on my OldeDB function it seems to work for USC-2LE/Unicode encoding. I would prefer to use the CSVReader custom lib, but it seems to use TextReader (which as far as I can tell can't handle Unicode).

http://www.codeproject.com/KB/database/CsvReader.aspx

THe following code will not work. It doesn't throw an error, but it seems to stall out even on its own thread:

Bad Code for USC2/Unicode:

                using (CsvReader csv = new CsvReader(
                           new StreamReader(kwfile, Encoding.Unicode), true))
            {
                csv.MissingFieldAction = MissingFieldAction.ReplaceByEmpty;
                keywordHits.Load(csv);
            }

Working, but not preferred solution:

        public System.Data.DataTable GetDataTable(string strFileName)
    {
        System.Data.OleDb.OleDbConnection conn = new System.Data.OleDb.OleDbConnection
        (
        "Provider=Microsoft.Jet.OleDb.4.0; Data Source = " + System.IO.Path.GetDirectoryName(strFileName) +
        "; Extended Properties = \"Text;characterset=Unicode;HDR=YES;FMT=Delimited\""
        );
        conn.Open();
        string strQuery = "SELECT * FROM [" + System.IO.Path.GetFileName(strFileName) + "]";
        System.Data.OleDb.OleDbDataAdapter adapter = new System.Data.OleDb.OleDbDataAdapter(strQuery, conn);
        System.Data.DataSet ds = new System.Data.DataSet("CSV File");
        adapter.Fill(ds);
        return ds.Tables[0];
    }

I've tried it with the optional third parameter in the CSVReader as well, and no difference. The program "works" by loading one CSV using the CSVReader class but has to use the OldeDB on the Unicode CSV. Obviously, using StreamReader with the Encoding.Unicode parameter works, but I'd have to re-invent the wheel in parsing out possibly malformed entries. Any thoughts? Or is this the best I can do without rewriting the CSVReader?

@jonathan: it's zero. Period. Zero upvotes and zero 'accepted'. What are you looking at? — p.campbell, Mar 21 '11 at 16:56
the second line looks like this "(test)","172911","16369318","Item001.E01/Partition 1/NONAME [NTFS]/[unallocated space]/13621367/16369318","4" — Jai, Mar 21 '11 at 16:56
You should try not to reinvent the wheel - this csv reader is fast and lightweight. Have a look http://www.codeproject.com/KB/database/CsvReader.aspx — Bob, Mar 21 '11 at 16:59
@Jai: the second line is unimportant. The **file** is important, as it likely has problems in its first line. There may be a hidden character after the "T" that's breaking your parser. — p.campbell, Mar 21 '11 at 17:03
No other app has the CSV open? The CSV file is on a local hard disk? — typo.pl, Mar 21 '11 at 17:05

score 4 · Accepted Answer · answered Mar 21 '11 at 17:07

4

A far shot, but perhaps your file is encoded using UTF-16. The bytes in the file would like this:

0x54 0x00 0x45 0x00 ...

Reading these bytes using UTF-8 encoding (the default for StreamReader) will yield the following characters:

T <NUL> E <NUL> ...

Try to open your file using a binary editor. If the encoding is unexpected then open it in a text editor and save it using an encoding that will work for you (ASCII or UTF-8 are good candidates).

answered Mar 21 '11 at 17:07

Martin Liversage

96,855
20
193
238

That seems to be it. I edited my post. Thanks for getting me on the right track, though I'm still a bit confused but at least I know what's going wrong. – Jai Mar 21 '11 at 18:39

score 0 · Answer 2 · edited May 23 '17 at 12:30

0

Is this snippet from the beginning? I found with the OleDb reader, it would base the schema on the first row, so that if subsequent rows had, for example, more columns, the additional columns would be ignored.

I stopped using it for that reason. If you want something lightweight, I have a fully-functional CsvReader in this answer..

edited May 23 '17 at 12:30

Community

1
1

answered Mar 21 '11 at 17:01

harpo

37,686
13
90
126

score 0 · Answer 3 · answered Mar 21 '11 at 18:42

0

When your file is UTF-16 encoded, try to specify the Encoding in the StreamReader and/or CVS reader code. The default encoding in .NET is UTF-8 which will result in the -Characters stated by Martin Liversage. The StreamReader has an overload new StreamReader(path, Encoding.UTF16), for the CVS classes I don't know.

answered Mar 21 '11 at 18:42

eFloh

2,018
20
24

It looks like the CVS Reader uses TextReader which gets a StreamReameder passed to it. I tried passing a UTF-16 override to the streamreader, but from what I can tell/read, the TextWriter only takes the default encoding of the system (UTF-8?). Am I out of look if I wanted to use this CVSReader? – Jai Mar 22 '11 at 13:35
sorry, as I stated already, I don't know the CSVReader you are using. But teh TextWriter class also can use a different statement, look at the StreamWriter subclass! – eFloh Mar 23 '11 at 12:31

Trouble Parsing Unicode CSV File

3 Answers3