I have a tab-separated file with a header row and columns of different types, which I would like to convert into a dictionary of vectors. Eventually I would like to convert the dictionary into numpy arrays and store them in some binary format for fast retrieval by different scripts. This is a large file with approximately 700k records and 16 columns. The following is a sample:
"answer_option" "value" "fcast_date" "expertise"
"a" 0.8 "2013-07-08" 3
"b" 0.2 "2013-07-08" 3
I have started implementing this with the DictReader class, which I'm just learning about.
    import csv

    with open("filename.tab", "r") as records:
        reader = csv.DictReader(records, dialect='excel-tab')
        row = list(reader)

    n = len(row)
    d = {}
    keys = list(row[0])
    for key in keys:
        a = []
        for i in range(n):
            a.append(row[i][key])
        d[key] = a
which gives the result
{'answer_option': ['a', 'b'],
'value': ['0.8', '0.2'],
'fcast_date': ['2013-07-08', '2013-07-08'],
'expertise': ['3', '3']}
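To make the goal concrete, here is an untested sketch of what I imagine doing with that resulting dictionary: convert each list to a typed numpy array and dump everything into a single .npz file that other scripts can load with one command (the file name records.npz and the dtype choices are just my guesses):

```python
import numpy as np

# the dictionary produced by the DictReader loop above (sample rows only)
d = {
    "answer_option": ["a", "b"],
    "value": ["0.8", "0.2"],
    "fcast_date": ["2013-07-08", "2013-07-08"],
    "expertise": ["3", "3"],
}

# convert each column to a typed numpy array
arrays = {
    "answer_option": np.array(d["answer_option"]),
    "value": np.array(d["value"], dtype=float),              # strings -> float64
    "fcast_date": np.array(d["fcast_date"], dtype="datetime64[D]"),
    "expertise": np.array(d["expertise"], dtype=int),
}

# one binary file holding all columns, keyed by name
np.savez("records.npz", **arrays)

# any other script can then get everything back with one command
data = np.load("records.npz")
data["value"]  # array([0.8, 0.2])
```

This would at least give the one-command loading I'm after, though I don't know if it is the recommended approach for 700k records.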
Besides the small nuisance of having to strip the enclosing quotation characters from the values and convert the numeric strings to numbers, I thought that perhaps there is something ready-made for this. I'm also wondering whether anything extracts directly from the file into numpy vectors, since I do not necessarily need to turn my data into dictionaries first.
I took a look at SciPy.org, and a search for CSV there also points to HDF5 and genfromtxt, but I haven't dug into those suggestions yet. Ideally I would like to store the data in a fast-to-load format, so that it would be simple to load from other scripts with only one command, with all vectors made available directly, the way it is possible in Matlab/Octave. Suggestions are appreciated.
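For instance, I imagine genfromtxt could be used along these lines (an untested sketch against the sample above; as far as I can tell it does not unquote fields, so I pass the column names explicitly rather than rely on the quoted header line):

```python
import numpy as np

# recreate the sample file shown above, so the sketch is self-contained
sample = (
    '"answer_option"\t"value"\t"fcast_date"\t"expertise"\n'
    '"a"\t0.8\t"2013-07-08"\t3\n'
    '"b"\t0.2\t"2013-07-08"\t3\n'
)
with open("filename.tab", "w") as f:
    f.write(sample)

# dtype=None lets genfromtxt infer a per-column type; string columns
# keep their surrounding quotation marks, since genfromtxt has no quoting
data = np.genfromtxt(
    "filename.tab",
    delimiter="\t",
    skip_header=1,
    names=["answer_option", "value", "fcast_date", "expertise"],
    dtype=None,
    encoding="utf-8",
)

data["value"]          # float64 column, no cleaning needed
data["answer_option"]  # still quoted: '"a"', '"b"'
```

So the numeric columns would come out typed for free, but the quoted strings would still need stripping, which is part of why I'm asking for other suggestions.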
EDIT: the data are tab separated with strings enclosed by quotation marks.