
Here's my use case: it's my job to clean CSV files that are often scraped from web pages (most are English, but some German and other odd non-Unicode characters sneak in there). Python 3 defaults to "utf-8", and the usual

import csv

# open the file, decoding as UTF-8
with open('input.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)

fails with UnicodeEncodeError even with try/except blocks everywhere.

I can't figure out how to clean the input if I can't even open it. My end goal is simply to read each line into a list I call text.
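To make the end goal concrete, here is a minimal sketch of what "read each line into a list I call text" would look like if the encoding *were* known (the crux of the question is that it isn't); the file path and contents are stand-ins:

```python
import csv
import os
import tempfile

# stand-in input file, written as UTF-8 for illustration only
path = os.path.join(tempfile.mkdtemp(), 'input.csv')
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write('a,b\nc,d\n')

# the end goal: every row collected into a list called text
text = []
with open(path, 'r', encoding='utf-8', newline='') as f:
    for row in csv.reader(f):
        text.append(row)

print(text)  # → [['a', 'b'], ['c', 'd']]
```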

I'm out of ideas; I've even tried the following:

 for encoding in ('utf-8', 'latin-1', ...):
     try:
         # open the file with this encoding

I can't make any assumptions about the encoding, since the files may be written on a Unix machine in another part of the world and I'm on a Windows machine. The inputs are otherwise just simple strings, for example:

test case: "This is an example of a test case and the test may wrap around to a new line when opened in a text processor"

Paul Rooney
MrL
  • could you read it in bytes then try to `.decode` it with various methods? – Tadhg McDonald-Jensen May 11 '16 at 04:09
  • see http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file – Tadhg McDonald-Jensen May 11 '16 at 04:13
  • Tadhg, I thought of that, but in python 3, reading a csv as 'wb' throws another error. I'm sure there's a way to do it I'm not sure what that is. – MrL May 11 '16 at 04:14
  • 1
    You said you tried `latin1`, and it can read anything (but not accurately if not really `latin1`) without a "Unicode**De**codeError", so where exactly are you getting the error? Actual, reproducible examples with exact tracebacks help. My guess is a `print` is really getting the exception if you have "Unicode**En**codeError". If you can't make any assumptions about the encoding, you have a bigger problem. Maybe the `chardet` module can help. – Mark Tolonen May 11 '16 at 05:48
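Mark Tolonen's point about `latin-1` is worth illustrating with a standalone snippet (the sample string is made up): `latin-1` decodes any byte sequence without error, which is exactly why a bare try-everything loop can silently produce the wrong text:

```python
# latin-1 maps every single byte 0x00-0xFF to a code point, so decoding
# with it never raises, but bytes that were really UTF-8 come out garbled:
raw = 'Füße'.encode('utf-8')       # b'F\xc3\xbc\xc3\x9fe'
as_latin1 = raw.decode('latin-1')  # decodes fine, but is mojibake
as_utf8 = raw.decode('utf-8')      # round-trips correctly

print(as_latin1 == 'Füße')  # False
print(as_utf8 == 'Füße')    # True
```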

1 Answer


Maybe try reading in the contents entirely, then using bytes.decode() in much the same way you mentioned:

#!python3
import csv
from io import StringIO

# slurp the raw bytes first, deferring any decoding
with open('input.csv', 'rb') as binfile:
    csv_bytes = binfile.read()

# try candidate encodings until one decodes cleanly
for enc in ('utf-8', 'utf-16', 'latin-1'):
    try:
        csv_string = csv_bytes.decode(encoding=enc, errors='strict')
        break
    except UnicodeError as e:
        last_err = e
else:  # none of the encodings worked
    raise last_err

# parse the decoded text with the csv module
with StringIO(csv_string) as csvfile:
    reader = csv.reader(csvfile)  # don't shadow the csv module itself
    for row in reader:
        print(row[0])
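For what it's worth, the decode chain above can be exercised without touching the filesystem. Here it's wrapped in a hypothetical `sniff_decode` helper with made-up bytes; note that the candidate order matters, since `utf-16` and `latin-1` will happily accept many byte sequences that aren't really in those encodings:

```python
def sniff_decode(data, candidates=('utf-8', 'utf-16', 'latin-1')):
    """Return data decoded with the first candidate encoding that succeeds."""
    last_err = None
    for enc in candidates:
        try:
            return data.decode(enc)
        except UnicodeError as e:
            last_err = e
    raise last_err  # none of the candidates worked

# 0xE9 is 'é' in latin-1 but an invalid sequence in utf-8, and the odd
# byte count rules out utf-16, so latin-1 wins here:
print(sniff_decode(b'caf\xe9,'))  # → café,
```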
Tadhg McDonald-Jensen
aghast