
What I am trying to do is: read a CSV into a DataFrame, make changes in one column, write the changed values back to the same CSV with to_csv, and then read that CSV into another DataFrame. On that second read I am getting an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte

My code is:

import pandas as pd
df = pd.read_csv("D:\ss.csv")
df.columns                      # Index(['CUSTOMER_MAILID', 'False', 'True'], dtype='object')
df['True'] = df['True'] + 2     # making changes to one column of type float
df.to_csv("D:\ss.csv")          # updating that .csv
df1 = pd.read_csv("D:\ss.csv")  # again trying to read that csv

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte

So please suggest how I can avoid the error and read that CSV into a DataFrame again.

I know I am missing an "encoding = some codec type" or "decoding = some type" somewhere while reading and writing the CSV, but I don't know exactly what should be changed, so I need help.

Mangu Singh Rajpurohit
Satya

6 Answers


Known encoding

If you know the encoding of the file you want to read in, you can use

pd.read_csv('filename.txt', encoding='encoding')

These are the possible encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
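As a quick self-contained sketch of that call (the Latin-1 file below is created just for illustration; the filename is made up):

```python
import pandas as pd

# Create a small CSV containing a non-ASCII character ("ç"),
# saved in Latin-1 rather than UTF-8.
with open("demo.csv", "w", encoding="latin-1") as f:
    f.write("CUSTOMER_MAILID,value\nfrançois@example.com,1\n")

# Reading this as UTF-8 would raise UnicodeDecodeError;
# stating the actual encoding works.
df = pd.read_csv("demo.csv", encoding="latin-1")
print(df["CUSTOMER_MAILID"][0])  # françois@example.com
```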

Unknown encoding

If you do not know the encoding, you can try to use chardet; however, this is not guaranteed to work. It is more of an educated guess.

import chardet
import pandas as pd

with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or readline if the file is large


pd.read_csv('filename.csv', encoding=result['encoding'])
MaxNoe
  • To import chardet, will I have to install any package? Because I am getting an error importing this module. Package name please. Thanks – Satya Nov 20 '15 at 06:08
  • Yes, `pip install chardet`. – MaxNoe Nov 20 '15 at 06:11
  • chardet.detect() is very slow. I use Python 3.6, and `pd.read_csv('filename.csv', encoding='Latin-1')` works perfectly for me. – Jun Wang Jan 03 '18 at 16:52
  • **If** you **know** the encoding it's always better to state it. If you do not know and your file is large, you can try giving chardet not the full file `f.read()` but a smaller part, e.g. `f.read(1024**2)` for the first megabyte. – MaxNoe Jan 03 '18 at 17:20
  • Yes, that also solved my similar problem. – bruce Mar 15 '19 at 02:56

Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.

Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).

Looking at the documentation for the Pandas read_csv() function, I see it has an encoding parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252" to your read_csv() call, as follows:

df = pd.read_csv(r"D:\ss.csv", encoding="cp1252")

Note that I added the character r in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv to new-ss.csv, where the string D:\new-ss.csv would be read as D, :, newline character, e, w, etc.

Anyway, try that encoding parameter on your first read_csv() call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)
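To make that round trip concrete, here is a minimal sketch of the fix. The cp1252-encoded file is created here just to stand in for the original D:\ss.csv, and the column layout is copied from the question; treat the filenames as placeholders:

```python
import pandas as pd

# Simulate a CSV that was saved in Windows-1252: the "ç" in the
# mail address becomes byte 0xE7, which is invalid as UTF-8.
with open("ss.csv", "w", encoding="cp1252") as f:
    f.write("CUSTOMER_MAILID,False,True\nfrançois@example.com,0,1.5\n")

# First read: state the encoding the file is actually in.
df = pd.read_csv("ss.csv", encoding="cp1252")
df["True"] = df["True"] + 2

# Write back as UTF-8 (the Python 3 default), so the second
# read needs no special handling.
df.to_csv("ss.csv", index=False, encoding="utf-8")
df1 = pd.read_csv("ss.csv")
print(df1["True"][0])  # 3.5
```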

rmunn
  • Thanks, as you can see my first read does not give me any error, and to_csv is a success. But the error arises on the second read. Maybe while saving that csv with to_csv I should give some encode or decode type so that in my second read I can read with the same encoding type. Please correct me. – Satya Nov 20 '15 at 05:47
  • @Satya - The `to_csv` function also takes an `encoding` parameter, so you could also try specifying `to_csv(filename, encoding="utf-8")` (I highly recommend using UTF-8 as your encoding everywhere, if you have the choice) before reading it with `read_csv(filename, encoding="utf-8")`. But since UTF-8 is already the default, I don't know if that will make much difference. – rmunn Nov 20 '15 at 05:58
  • @Satya - Actually, I was wrong just now. If you're using Python 3, UTF-8 is the default for `to_csv`. But if you're using Python 2, it's NOT the default -- so adding the `encoding="utf-8"` parameter to all your `to_csv()` calls is definitely a good idea. – rmunn Nov 20 '15 at 05:59
  • @rmunn - I am using python 3.4.1 and tried the 1st read with encoding='utf-8', then to_csv with encoding='utf-8', and in the 2nd read encoding='utf-8', and am still getting the same error on the 2nd read. Please correct me if I should use UTF-8. – Satya Nov 20 '15 at 06:13
  • @Satya - Are you really, REALLY sure that it's the 2nd read that's failing? Because there's no reason for it to fail, and if it's really the 1st read that's failing, you'd get the same error message. It might be that this whole time, you've had a bad input file and not known it. I'd suggest that you put a `print("First read was successful")` line in your code after the first `read_csv` call, then make sure you actually see those words in the output. Just to be really, **REALLY** sure. – rmunn Nov 20 '15 at 09:46
  • @rmunn - Yeah, that was a bad file actually, and I have corrected it using Notepad++. Thanks, your advice helped me a lot. And one thing: is decoding = 'utf-8' used while working with csv files, or is it used with strings only? – Satya Nov 20 '15 at 09:59

One simple solution is to open the csv file in an editor like Sublime Text and save it with 'utf-8' encoding. Then pandas can read the file easily.

Krishnaa

Yes, you'll get this error. I worked around this problem by opening the csv file in Notepad++ and changing the encoding through the Encoding menu -> Convert to UTF-8, then saving the file and running the python program over it again.

Another solution is to use the codecs module in python for encoding and decoding files. I haven't used that myself.
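For what it's worth, that codecs idea can be sketched like this (the filenames are made up for the example, and the source encoding is assumed to be cp1252):

```python
import codecs

# Create a sample cp1252-encoded file, standing in for the bad input.
with open("input_cp1252.csv", "wb") as f:
    f.write("name\nçelik\n".encode("cp1252"))

# Decode with the source encoding, re-encode as UTF-8; after this,
# pandas (or anything else) can read the new file with defaults.
with codecs.open("input_cp1252.csv", "r", encoding="cp1252") as src:
    text = src.read()
with codecs.open("output_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)
```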

Mangu Singh Rajpurohit

The approach above, importing chardet and detecting the file's encoding, works:

import pandas as pd
import chardet
with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or readline if the file is large


pd.read_csv('filename.csv', encoding=result['encoding'])
Mona Jalal
Abhishek
  • I used the code above and got this error "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 5356: character maps to " – Mona Jalal Apr 01 '18 at 18:57

I am new to python. I ran into this exact issue when I manually changed the extension of my excel file to .csv and tried to read it with read_csv. However, if I opened the excel file and saved it as a csv file instead, it seemed to work.

Matt