I was trying to convert all the values in a pandas column to strings, like this:

df_sample1['county'] = df_sample1['county'].astype(str)

While doing so, I encountered the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 38: ordinal not in range(128)

A similar problem has been discussed on Stack Overflow, and the suggested solution included the advice: "You have to discover in which encoding is this character at the source."

I don't know what encoding my column is in. I was expecting only ASCII characters, given that these are county names. Is there a way to find out which characters are the truant ones and, if so, can I convert them all to UTF-8? Or, more generally, how do I find out what the encoding of the characters is?
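To make the "which characters are the truant ones" part concrete, here is a minimal sketch of one way to list them. It assumes Python 3 and a column that already holds decoded text (`str`); the helper name `find_non_ascii` and the sample list are hypothetical stand-ins for `df_sample1['county']`.

```python
def find_non_ascii(values):
    """Return (position, value, offending_chars) for every text value
    that contains at least one character outside the ASCII range."""
    hits = []
    for pos, value in enumerate(values):
        # Skip anything that is not text (for example missing values).
        if not isinstance(value, str):
            continue
        bad = [ch for ch in value if ord(ch) > 127]
        if bad:
            hits.append((pos, value, bad))
    return hits


# Hypothetical sample data; \u00f1 is "ñ" and \u00c9 is "É".
counties = ["Kings", "Do\u00f1a Ana", "\u00c9lan", None]

for pos, value, bad in find_non_ascii(counties):
    print(pos, repr(value), bad)
```

Iterating over a pandas Series yields its values, so the real column could probably be passed in directly in place of the sample list.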

user2762934
  • https://pypi.python.org/pypi/chardet – Alex Hall May 10 '16 at 20:01
  • See http://stackoverflow.com/questions/6707657/python-detect-charset-and-convert-to-utf-8 or http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file. I don't know if the inclusion of `Pandas` makes this question unique or not. – Mark Ransom May 10 '16 at 20:06
  • There is no foolproof way of telling the character encoding. You will have to track the encoding. Anyway, in this case it's probably ISO-8859-1, or Windows-1252 (which is essentially a superset of ISO-8859-1), because \xc9 is the character É in these encodings. – Walter Tross May 10 '16 at 20:18
  • "Which characters are the truant ones": The characters are not the problem. The problem is the loss of the knowledge of which encoding you are being given. If you want one that works, whether it is right or not, just guess one of the many that have 256 characters and allow any sequence of values 0-255. If that sounds good, guess CP437. [UTF-8 and Windows-1252 do not qualify.] – Tom Blodget May 10 '16 at 23:18
  • @TomBlodget I doubt the character ╔ (\xc9 in CP437) can be part of a county name. – Walter Tross May 11 '16 at 07:08
  • @WalterTross CP437 is just one of those encodings that will never error on decode. – Tom Blodget May 11 '16 at 16:26
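The disagreement in the comments above can be checked directly. The following is a minimal sketch, assuming Python 3 and assuming that the value `\xc9` from the error message is treated as a single raw byte, as the commenters do. The helper name `decode_candidates` is hypothetical; the codec names are the standard Python ones.

```python
def decode_candidates(raw, codecs=("latin-1", "cp1252", "cp437", "utf-8")):
    """Decode the same bytes with several codecs.

    Returns a dict mapping codec name to the decoded text, or to None
    when the bytes are not valid in that codec.
    """
    results = {}
    for codec in codecs:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results


raw = b"\xc9"  # the value from the error message, taken as one byte
for codec, text in decode_candidates(raw).items():
    print(codec, "->", repr(text))
```

ISO-8859-1 (`latin-1`) and Windows-1252 both give É, CP437 gives ╔, and UTF-8 rejects the lone byte. A decode that succeeds therefore only shows that the bytes are valid in that codec, not that the guess is right.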

0 Answers