I'm trying to load a CSV into a DataTable using this:
    class CSVReader
    {
        public System.Data.DataTable GetDataTable(string strFileName)
        {
            // Jet text driver: Data Source is the directory, the file name goes in the query.
            using (System.Data.OleDb.OleDbConnection conn = new System.Data.OleDb.OleDbConnection(
                "Provider=Microsoft.Jet.OleDb.4.0; Data Source=" + System.IO.Path.GetDirectoryName(strFileName) +
                "; Extended Properties=\"Text;HDR=YES;FMT=Delimited\""))
            {
                conn.Open();
                string strQuery = "SELECT * FROM [" + System.IO.Path.GetFileName(strFileName) + "]";
                System.Data.OleDb.OleDbDataAdapter adapter = new System.Data.OleDb.OleDbDataAdapter(strQuery, conn);
                System.Data.DataSet ds = new System.Data.DataSet("CSV File");
                adapter.Fill(ds);
                return ds.Tables[0];
            }
        }
    }
It works fine on one CSV, but not another. On the problem file it simply loads "T" as the first column name, and everything else is blank/null. I tried manually reading the first line with

    using (StreamReader reader = new StreamReader(fileName)) { string firstLine = reader.ReadLine(); }

and with the equivalent File.ReadAllLines, referencing the array's first entry (file[0]). Both return just "T" as the "first line", and anything beyond that is blank. Any ideas why it only sees the first character of the CSV and nothing else?
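That symptom is consistent with a UTF-16 file being read as a single-byte encoding (every other byte is a NUL, so strings appear to end after the first character). One quick check is to inspect the file's leading bytes for a byte-order mark. A minimal sketch, assuming nothing beyond the standard library; the class and temp-file names are invented, and it writes its own UTF-16 LE sample so it runs standalone:

```csharp
using System;
using System.IO;
using System.Text;

class BomProbe
{
    // Map a file's leading bytes to a readable encoding guess.
    public static string DescribeBom(byte[] head)
    {
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return "UTF-16 LE (UCS-2 LE) BOM";
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return "UTF-8 BOM";
        return "no BOM; first bytes: " + BitConverter.ToString(head);
    }

    static void Main()
    {
        // Hypothetical path; substitute the real CSV. The sample is written
        // as UTF-16 LE with a BOM, like a UCS-2 LE export.
        string path = Path.Combine(Path.GetTempPath(), "bom_demo.csv");
        File.WriteAllText(path, "TERM(s),OBJECTID", new UnicodeEncoding(false, true));

        byte[] head = new byte[4];
        using (FileStream fs = File.OpenRead(path))
            fs.Read(head, 0, head.Length);

        Console.WriteLine(DescribeBom(head)); // prints "UTF-16 LE (UCS-2 LE) BOM"
    }
}
```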
EDIT: First line looks like this:
TERM(s),OBJECTID,FILE,PATH,HIT COUNT
The second line looks like this:
"(test)","172911","16369318","Item001.E01/Partition 1/NONAME [NTFS]/[unallocated space]/13621367/16369318","4"
EDIT 2: I switched over to the lib somebody linked (CsvReader) and it seems to have taken a lot of the pain out. I tried encoding the file as UTF-8 with Notepad++ and it gets farther along, until it hits:
    LumenWorks.Framework.IO.Csv.MalformedCsvException was unhandled
    Message=The CSV appears to be corrupt near record '1373' field '3 at position '2601'.
    Current raw data : '32/System.ServiceModel/06d6eab93282d2b136a377bd50b7c5a9/System.ServiceModel.ni.dll","11" "(vc)","40656","Adobe AIR Application Installer.swf","Item001.E01/Partition 1/NONAME [NTFS]/[root]/Program Files/Common Files/Adobe AIR/Versions/1.0/Adobe AIR Application Installer.swf","11" "(vc)","503322","䄳䆷䞫䄦䠥","Item001.E01/Partition 1/NONAME [NTFS]/[root]/WINDOWS/Installer/520ae67.msp/䄳䆷䞫䄦䠥","11"
I'm guessing that it is taking issue with the foreign characters in the UTF-8 encoding. If I leave the file in its original encoding, it processes incorrectly. I don't want to force the user to open the ~90 MB file and re-save it as ASCII/UTF-16. I've been trying to google around, but most people say .NET can handle any encoding.
It seems the file is output as UCS-2 LE (which I think is the same as UTF-16 LE, right?). I'm confused why CsvReader/StreamReader are taking issue with it.
Diagnosed but not fully solved
When I append "characterset=Unicode" to the connection string in my OleDb function, it seems to work for the UCS-2 LE/Unicode encoding. I would prefer to use the custom CsvReader lib, but it uses TextReader (which, as far as I can tell, can't handle Unicode):
http://www.codeproject.com/KB/database/CsvReader.aspx
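As a sanity check on the TextReader point: StreamReader derives from TextReader, and it decodes UTF-16 fine once it knows the encoding; the BOM-detection constructor overload can even pick it up automatically. A self-contained sketch (class and file names are invented for the demo):

```csharp
using System;
using System.IO;
using System.Text;

class BomAwareRead
{
    // StreamReader is a TextReader; the third constructor argument tells it
    // to trust a byte-order mark, so the ASCII fallback below is only used
    // when no BOM is present.
    public static string ReadFirstLine(string path)
    {
        using (StreamReader reader = new StreamReader(path, Encoding.ASCII, true))
            return reader.ReadLine();
    }

    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "utf16_demo.csv");
        // UTF-16 LE with BOM, like the UCS-2 LE export described above.
        File.WriteAllText(path, "\"(test)\",\"172911\"\r\nnext", new UnicodeEncoding(false, true));
        Console.WriteLine(ReadFirstLine(path)); // prints "(test)","172911" (with the quotes)
    }
}
```

The same BOM-aware StreamReader can then be handed to any API that takes a TextReader.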
The following code will not work. It doesn't throw an error, but it seems to stall out, even on its own thread.
Bad code for UCS-2/Unicode:
    using (CsvReader csv = new CsvReader(
        new StreamReader(kwfile, Encoding.Unicode), true))
    {
        csv.MissingFieldAction = MissingFieldAction.ReplaceByEmpty;
        keywordHits.Load(csv);
    }
Working, but not preferred solution:
    public System.Data.DataTable GetDataTable(string strFileName)
    {
        // Same as before, but with CharacterSet=Unicode so the Jet text
        // driver reads the UCS-2 LE file correctly.
        using (System.Data.OleDb.OleDbConnection conn = new System.Data.OleDb.OleDbConnection(
            "Provider=Microsoft.Jet.OleDb.4.0; Data Source=" + System.IO.Path.GetDirectoryName(strFileName) +
            "; Extended Properties=\"Text;CharacterSet=Unicode;HDR=YES;FMT=Delimited\""))
        {
            conn.Open();
            string strQuery = "SELECT * FROM [" + System.IO.Path.GetFileName(strFileName) + "]";
            System.Data.OleDb.OleDbDataAdapter adapter = new System.Data.OleDb.OleDbDataAdapter(strQuery, conn);
            System.Data.DataSet ds = new System.Data.DataSet("CSV File");
            adapter.Fill(ds);
            return ds.Tables[0];
        }
    }
I've tried it with the optional third parameter to CsvReader as well, and it makes no difference. The program "works" by loading one CSV with the CsvReader class, but it has to fall back to OleDb for the Unicode CSV. Obviously, using StreamReader with the Encoding.Unicode parameter works, but then I'd have to reinvent the wheel to parse out possibly malformed entries. Any thoughts? Or is this the best I can do without rewriting CsvReader?
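Since StreamReader with Encoding.Unicode does read the file correctly, one more avenue is to re-encode the UTF-16 CSV to UTF-8 in a single streaming pass (fixed-size buffers, so the ~90 MB file never has to fit in memory and the user never re-saves it by hand) and then feed the UTF-8 copy to CsvReader as before. A sketch; the Reencode helper and the file names are made up:

```csharp
using System;
using System.IO;
using System.Text;

class Utf16ToUtf8
{
    // Copy the file through fixed-size char buffers, decoding UTF-16 LE on
    // the way in and writing UTF-8 (no BOM) on the way out.
    public static void Reencode(string source, string destination)
    {
        using (StreamReader reader = new StreamReader(source, Encoding.Unicode))
        using (StreamWriter writer = new StreamWriter(destination, false, new UTF8Encoding(false)))
        {
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
                writer.Write(buffer, 0, read);
        }
    }

    static void Main()
    {
        string src = Path.Combine(Path.GetTempPath(), "hits_utf16.csv");
        string dst = Path.Combine(Path.GetTempPath(), "hits_utf8.csv");
        File.WriteAllText(src, "\"(vc)\",\"503322\"", new UnicodeEncoding(false, true));
        Reencode(src, dst);
        Console.WriteLine(File.ReadAllText(dst)); // prints "(vc)","503322"
    }
}
```

The one-time conversion cost is a single sequential read/write of the file, after which CsvReader sees plain UTF-8 input.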