35

I am trying to read in a csv file with numpy.genfromtxt but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':

2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0

the code

np.genfromtxt('t.csv', delimiter=',')

produces the error:

ValueError: Some errors were detected ! Line #2 (got 4 columns instead of 3)

The data structure I am looking for is:

array([['2012', 'Louisville KY', '3.5'],
       ['2011', 'Lexington, KY', '4.0']], 
      dtype='|S13')

Looking over the documentation, I don't see any options to deal with this. Is there a way do to it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?

joris
  • 106,362
  • 32
  • 216
  • 184
CraigO
  • 515
  • 1
  • 5
  • 6
  • can you post more of the full csv, I think I know how to fix your problem ;) – Eiyrioü von Kauyf Jul 29 '13 at 20:13
  • You should change your csv replacing the separating character from `,` to `;` for example... – Saullo G. P. Castro Jul 29 '13 at 20:42
  • @SaulloCastro: I can't do that because my real data is much messier and has strings with `;` and/or any other character you can think of - this is only a toy example. What I am looking for is a more general solution. – CraigO Jul 30 '13 at 12:43
  • @EiyrioüvonKauyf: As I mentioned, this is just a toy example to illustrate the problem, but I think it has all the essential features. What exactly are you looking for? – CraigO Jul 30 '13 at 12:45

4 Answers4

24

You can use pandas (the becoming default library for working with dataframes (heterogeneous data) in scientific python) for this. It's read_csv can handle this. From the docs:

quotechar : string

The character to used to denote the start and end of a quoted item. Quoted items 
can include the delimiter and it will be ignored.

The default value is ". An example:

In [1]: import pandas as pd

In [2]: from StringIO import StringIO

In [3]: s="""year, city, value
   ...: 2012, "Louisville KY", 3.5
   ...: 2011, "Lexington, KY", 4.0"""

In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0

The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma-delimiter.

Apart from a powerful csv reader, I can also strongly advice to use pandas with the heterogeneous data you have (the example output in numpy you give are all strings, although you could use structured arrays).

joris
  • 106,362
  • 32
  • 216
  • 184
  • Awesome, works like a charm. I have heard a bit about `pandas` but have never gotten around to trying before - this seems like a great opportunity. And BTW, I was trying to keep my initial example simple, but I was actually planning to use `np.recfromcsv` to get structured numpy arrays. – CraigO Jul 30 '13 at 13:41
13

The problem with the additional comma, np.genfromtxt does not deal with that.

One simple solution is to read the file with csv.reader() from python's csv module into a list and then dump it into a numpy array if you like.

If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So, you can wrap a csv.reader in an iterator and give it to np.genfromtxt.

That would go something like this:

import csv
import numpy as np

np.genfromtxt(("\t".join(i) for i in csv.reader(open('myfile.csv'))), delimiter="\t")

This essentially replaces on-the-fly only the appropriate commas with tabs.

askewchan
  • 39,694
  • 13
  • 105
  • 128
Bitwise
  • 7,043
  • 4
  • 30
  • 48
  • Well, I see what you're getting at here, but I played around a bit but still couldn't get it to work quite right. Anyway, I think today I'm going with the `pandas` route. Thanks, anyway. – CraigO Jul 30 '13 at 13:42
5

If you are using a numpy you probably want to work with numpy.ndarray. This will give you a numpy.ndarray:

import pandas
data = pandas.read_csv('file.csv').as_matrix()

Pandas will handle the "Lexington, KY" case correctly

Michael Yurin
  • 703
  • 10
  • 14
2

Make a better function that combines the power of the standard csv module and Numpy's recfromcsv. For instance, the csv module has good control and customization of dialects, quotes, escape characters, etc., which you can add to the example below.

The example genfromcsv_mod function below reads in a complicated CSV file similar to what Microsoft Excel sees, which may contain commas within quoted fields. Internally, the function has a generator function that rewrites each row with tab delimiters.

import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        with open(fname, newline='') as fp:
            dialect = csv.Sniffer().sniff(fp.read(1024))
            fp.seek(0)
            for row in csv.reader(fp, dialect):
                yield "\t".join(row)
    return np.recfromcsv(
        rewrite_csv_as_tab(fname), delimiter="\t", encoding=None, **kwargs)

# Use it to read a CSV file into a record array
x = recfromcsv_mod("t.csv", case_sensitive=True)
Mike T
  • 34,456
  • 15
  • 128
  • 169