0

I want to write data from a Pandas dataframe to an Excel file. I am using pyExcelerate rather than the built in to_excel() method to get better performance.

While Pandas handles utf-8 encoded special characters, the encoding seems to be lost when converting to python list. I read through https://docs.python.org/2/howto/unicode.html, however I could not find out how to preserve the encoding. Or is there some way to restore encoding at the time of writing?

# -*- coding: utf-8 -*-
import pandas as pd
from pyexcelerate import Workbook 

df = pd.DataFrame({'val': ['ä']})
print df
#   val
# 0   ä  <- non ascii character a with dots

value = df.values.tolist()
print value
# [['\xc3\xa4']]

wb = Workbook()
ws = wb.new_sheet("sheet name", data=value)
wb.save("output.xlsx")

# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 31: ordinal not in range(128)
malte
  • 529
  • 6
  • 23
  • Do you by chance have `# -*- coding: utf-8 -*-` at the top of your file? from [here](http://stackoverflow.com/questions/10589620/syntaxerror-non-ascii-character-xa3-in-file-when-function-returns-%C2%A3) – Phillip Martin Feb 16 '16 at 14:07
  • 1
    FWIW, `'\xc3\xa4'` _is_ the UTF-8 encoding of `'ä'`; if we decode it to Unicode (eg `'\xc3\xa4'.decode('utf8')`) we get `u'\xe4'`. You may find this article helpful: [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html), which was written by SO veteran Ned Batchelder. – PM 2Ring Feb 16 '16 at 14:10
  • `# -*- coding: utf-8 -*-` was in the code (edited my example). If it's not there, I already get an error creating the DataFrame. – malte Feb 16 '16 at 15:15

1 Answers1

1

Simple, feed your Dataframe with Unicodes by using Unicode literals:

df = pd.DataFrame({u'val': [u'ä']})

note the u prefixes

Ensure your file is encoded with UTF-8 (to match the # -*- coding: utf-8 -*- header) (It does looks UTF-8 encoded as 0xc3 is the first byte of ä)

The error is caused by Python not knowing the encoding of the byte string, when the string is marshalled into a Unicode object later.

Alastair McCormack
  • 23,069
  • 7
  • 60
  • 87
  • Adding the `u` prefix solved the problem. To "add the `u`" on an existing column I used `df['val'] = df['val'].str.decode('utf8')` – malte Feb 17 '16 at 10:34