1

I have a CSV file of tab-separated data with headers and data of different types which I would like to convert into a dictionary of vectors. Eventually I would like to convert the dictionary into numpy arrays, and store them in some binary format for fast retrieval by different scripts. This is a large file with approximately 700k records and 16 columns. The following is a sample:

"answer_option" "value" "fcast_date"    "expertise"
"a" 0.8 "2013-07-08"    3
"b" 0.2 "2013-07-08"    3

I have started implementing this with the DictReader class, which I'm just learning about.

import csv
with open( "filename.tab", 'r') as records:
    reader = csv.DictReader( records, dialect='excel-tab' )
    row = list( reader )
    n = len( row )
    d = {}
    keys = list( row[0] )
    for key in keys :
        a = []
        for i in range(n):
            a.append( row[i][key] )
        d [key] = a

which gives the result

{'answer_option': ['a', 'b'],
'value': ['0.8', '0.2'],
'fcast_date': ['2013-07-08', '2013-07-08'],
'expertise': ['3', '3']}

Besides the small nuisance of having to clean from the numerical values the quotation characters that are enclosing them, I thought that perhaps there is something ready made. I'm also wondering if there is anything that extracts directly from the file into numpy vectors, since I do not need to necessarily transform my data in dictionaries.

I took a look at SciPy.org and a search of CSV also refers to HDF5 and genfromtxt, but I haven't dived into those suggestions yet. Ideally I would like to be able to store the data in a fast-to-load format, so that it would be simple to load from other scripts with only one command, where all vectors are made available the same way it is possible in Matlab/Octave. Suggestions are appreciated


EDIT: the data are tab separated with strings enclosed by quotation marks.

gciriani
  • 454
  • 2
  • 6
  • 14

1 Answers1

2

This will read the csv into a Pandas data frame and remove the quotes:

import pandas as pd
import csv
import io

with open('data_with_quotes.csv') as f_input:
    data = [next(csv.reader(io.StringIO(line.replace('"', '')))) for line in f_input]

df = pd.DataFrame(data[1:], columns=data[0])
print(df)
  answer_option value  fcast_date expertise
0              a   0.8  2013-07-08         3
1              b   0.2  2013-07-08         3

You can easily convert the data to a numpy array using df.values:

array([['a', '0.8', '2013-07-08', '3'],
       ['b', '0.2', '2013-07-08', '3']], dtype=object)

To save the data in a binary format, I recommend using Hdf5:

import h5py

with h5py.File('file.hdf5', 'w') as f:
    dset = f.create_dataset('default', data=df)

To load the data, use the following:

with h5py.File('file.hdf5', 'r') as f:
   data = f['default']

You can also use Pandas to save and load the data in binary format:

# Save the data
df.to_hdf('data.h5', key='df', mode='w')

# Load the data
df = pd.read_hdf('data.h5', 'df')
Nathaniel
  • 2,825
  • 6
  • 14
  • Thanks! I had forgotten to highlight that the data are tab separated (I just made an edit). Thus I added the parameter dialect='excel-tab' to your csv.reader code and it seems to be working. – gciriani Mar 20 '19 at 16:08
  • Found out that line.replace('"', '') is superfluous with dialect='excel-tab', so the csv.reader statement is simplified to csv.reader( io.StringIO(line), dialect='excel-tab' ) – gciriani Mar 20 '19 at 17:24
  • When attempting to save in HDF5 format I get error: "TypeError: Object dtype dtype('O') has no native HDF5 equivalent". Is it possible it is thrown off by the unusual date format? The message is preceded by 3 calls to h5py.h5t.py_create, with the final one: "File "h5py\h5t.pyx", line 1688, in h5py.h5t.py_create" – gciriani Mar 20 '19 at 18:12
  • From [this issue](https://github.com/adamreeve/npTDMS/issues/100), it looks like ``dtype('O')`` is used for Python objects that are not any of the built-in Numpy types. If that is incompatible with HDF5, you might try setting your data types to something other than object (string, float, int). You can check the data type of each column in your data frame using ``df.dtypes``, and you can set them using ``df['column'] = df['column'].astype(str)``. – Nathaniel Mar 20 '19 at 19:32
  • df.dtypes returns all Pandas of dtype object, which according to the Pandas documentation are strings. I'm able to change the numbers into dtype "int" or dtype "float64". However, because the literals need to remain strings, I will have to do some HDF5 conversion, still obscure to me. In a separate thread the conversion is handled thusly: https://stackoverflow.com/questions/23220513/storing-a-list-of-strings-to-a-hdf5-dataset-from-python – gciriani Mar 21 '19 at 20:06
  • I experimented with the pandas function csv_read, and it goes twice as fast, without the need for the with statement, nor for the open, nor for declaring the header. The code is data = pd.read_csv( f, header=0, sep='\t') – gciriani Mar 28 '19 at 14:43