7

I'm dealing with some encoding problems in a few files. We receive files from another company and have to read them (the files are in CSV format).

Strangely, the files appear to be encoded in UTF-16. I manage to read them, but I have to open them with the codecs module and specify the encoding explicitly, like this:

import codecs
import csv

ENCODING = 'utf-16'

with codecs.open(test_file, encoding=ENCODING) as csv_file:
    # Autodetect the CSV dialect from a sample of the file
    dialect = csv.Sniffer().sniff(csv_file.read(1024))
    csv_file.seek(0)
    input_file = csv.reader(csv_file, dialect=dialect)

    for line in input_file:
        do_funny_things()

But, just as I can detect the dialect in an agnostic way, I'm thinking it would be great to have a way of automatically opening files with their proper encoding, at least for text files. Other programs, like vim, manage to do that.

Does anyone know a way of doing that in Python 2.6?

PS: I hope this will be solved in Python 3, since all strings are Unicode there...

Khelben

4 Answers

10

chardet can help you.

Character encoding auto-detection in Python 2 and 3. As smart as your browser. Open source.
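
For instance, a minimal sketch (the helper name is made up here, and reading the whole file into memory is only for brevity):

import codecs

import chardet

def open_with_detected_encoding(path):
    # Let chardet inspect the raw bytes and guess the encoding.
    with open(path, 'rb') as raw:
        guess = chardet.detect(raw.read())
    # Reopen the file, decoding with the guessed encoding.
    return codecs.open(path, encoding=guess['encoding'])

chardet.detect() returns a dict with the guessed 'encoding' and a 'confidence' value, so you can fall back to a sensible default when the confidence is low.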

chrisaycock
Desintegr
5

It won't be "fixed" in Python 3, because it's not a fixable problem. Many documents are valid in several encodings, so the only way to determine the proper encoding is to know something about the document. Fortunately, in most cases we do know something about the document: for instance, most characters will come clustered into distinct Unicode blocks. A document in English will mostly contain characters within the first 128 code points. A document in Russian will contain mostly Cyrillic code points. Most documents will contain spaces and newlines. These clues can be used to make educated guesses about which encoding is being used. Better yet, use a library written by someone who has already done the work (like chardet, mentioned in another answer by Desintegr).
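
Purely as an illustration of that kind of guessing (the BOM checks and candidate list below are my own assumptions; a real detector such as chardet builds statistical models over exactly these sorts of clues):

import codecs

def naive_guess_encoding(path):
    with open(path, 'rb') as raw:
        data = raw.read()
    # A byte-order mark is the strongest clue available.
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    # Otherwise fall back to trial decoding.  Many byte strings are valid in
    # several encodings, which is exactly why this can only ever be a guess;
    # latin-1 accepts any byte sequence, so it acts as a catch-all.
    for encoding in ('utf-8', 'latin-1'):
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue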

jcdyer
0

csv.reader cannot handle Unicode strings in 2.x. See the bottom of the csv documentation and this question for ways to handle it.
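
A minimal sketch of that workaround, following the recipe at the bottom of the Python 2 csv docs (test_file and do_funny_things are the names from the question):

import codecs
import csv

def utf_8_encoder(unicode_lines):
    # csv.reader wants byte strings in 2.x, so re-encode the decoded lines.
    for line in unicode_lines:
        yield line.encode('utf-8')

def unicode_csv_reader(unicode_file, dialect=csv.excel, **kwargs):
    reader = csv.reader(utf_8_encoder(unicode_file), dialect=dialect, **kwargs)
    for row in reader:
        # Decode each cell back to unicode.
        yield [cell.decode('utf-8') for cell in row]

with codecs.open(test_file, encoding='utf-16') as csv_file:
    for row in unicode_csv_reader(csv_file):
        do_funny_things()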

Mark Tolonen
-3

If it is going to be fixed in Python 3, it should also be fixable by using

from __future__ import unicode_literals
RdV
    Apparently, that only means that your string literals are unicode, not that you can read unicode directly from a file... Also, it's utf-8 – Khelben Feb 26 '10 at 15:51