
Here's my use case: it's my job to clean CSV files that are often scraped from web pages (most are English, but some German and other odd non-Unicode characters sneak in there). Python 3 defaults to "utf-8", and the usual

import csv

# open the file, decoding as UTF-8
with open('input.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)

fails with UnicodeEncodeError even with try/except blocks everywhere.

I can't figure out how to clean the input if I can't even open it. My end goal is simply to read each line into a list I call text.
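To make the end goal concrete, here is a minimal sketch of what "read each line into a list I call text" would look like if the encoding *were* known (the crux of the question is that it isn't); the file path and contents are stand-ins:

```python
import csv
import os
import tempfile

# stand-in input file, written as UTF-8 for illustration only
path = os.path.join(tempfile.mkdtemp(), 'input.csv')
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write('a,b\nc,d\n')

# the end goal: every row collected into a list called text
text = []
with open(path, 'r', encoding='utf-8', newline='') as f:
    for row in csv.reader(f):
        text.append(row)

print(text)  # → [['a', 'b'], ['c', 'd']]
```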

I'm out of ideas; I've even tried the following:

 for encoding in ('utf-8', 'latin-1', ...):
     try:
         # open the file with this encoding

I can't make any assumptions about the encoding, since the files may be written on a Unix machine in another part of the world and I'm on a Windows machine. The inputs are otherwise just simple strings, for example:

test case: "This is an example of a test case and the test may wrap around to a new line when opened in a text processor"

Paul Rooney
MrL
  • could you read it in bytes then try to `.decode` it with various methods? – Tadhg McDonald-Jensen May 11 '16 at 04:09
  • see http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file – Tadhg McDonald-Jensen May 11 '16 at 04:13
  • Tadhg, I thought of that, but in python 3, reading a csv as 'wb' throws another error. I'm sure there's a way to do it I'm not sure what that is. – MrL May 11 '16 at 04:14
  • 1
    You said you tried `latin1`, and it can read anything (but not accurately if not really `latin1`) without a "Unicode**De**codeError", so where exactly are you getting the error? Actual, reproducible examples with exact tracebacks help. My guess is a `print` is really getting the exception if you have "Unicode**En**codeError". If you can't make any assumptions about the encoding, you have a bigger problem. Maybe the `chardet` module can help. – Mark Tolonen May 11 '16 at 05:48
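Mark Tolonen's point about `latin-1` is worth illustrating with a standalone snippet (the sample string is made up): `latin-1` decodes any byte sequence without error, which is exactly why a bare try-everything loop can silently produce the wrong text:

```python
# latin-1 maps every single byte 0x00-0xFF to a code point, so decoding
# with it never raises, but bytes that were really UTF-8 come out garbled:
raw = 'Füße'.encode('utf-8')       # b'F\xc3\xbc\xc3\x9fe'
as_latin1 = raw.decode('latin-1')  # decodes fine, but is mojibake
as_utf8 = raw.decode('utf-8')      # round-trips correctly

print(as_latin1 == 'Füße')  # False
print(as_utf8 == 'Füße')    # True
```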

1 Answer


Maybe try reading in the contents entirely, then using bytes.decode() in much the same way you mentioned:

#!python3
import csv
from io import StringIO

# slurp the raw bytes first, deferring any decoding
with open('input.csv', 'rb') as binfile:
    csv_bytes = binfile.read()

# try candidate encodings until one decodes cleanly
for enc in ('utf-8', 'utf-16', 'latin-1'):
    try:
        csv_string = csv_bytes.decode(encoding=enc, errors='strict')
        break
    except UnicodeError as e:
        last_err = e
else:  # none of the encodings worked
    raise last_err

# parse the decoded text with the csv module
with StringIO(csv_string) as csvfile:
    reader = csv.reader(csvfile)  # don't shadow the csv module itself
    for row in reader:
        print(row[0])
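For what it's worth, the decode chain above can be exercised without touching the filesystem. Here it's wrapped in a hypothetical `sniff_decode` helper with made-up bytes; note that the candidate order matters, since `utf-16` and `latin-1` will happily accept many byte sequences that aren't really in those encodings:

```python
def sniff_decode(data, candidates=('utf-8', 'utf-16', 'latin-1')):
    """Return data decoded with the first candidate encoding that succeeds."""
    last_err = None
    for enc in candidates:
        try:
            return data.decode(enc)
        except UnicodeError as e:
            last_err = e
    raise last_err  # none of the candidates worked

# 0xE9 is 'é' in latin-1 but an invalid sequence in utf-8, and the odd
# byte count rules out utf-16, so latin-1 wins here:
print(sniff_decode(b'caf\xe9,'))  # → café,
```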
Tadhg McDonald-Jensen
aghast