UTF8 Python BOM

Question

Possible Duplicate:
Write to utf-8 file in python

I have Unicode strings (with Japanese characters) that I want to write to a CSV file. However, the BOM does not seem to be written correctly: it shows up as the string "ï»¿" at the start of the first line. As a result, Excel does not display the Japanese characters correctly. When I open the CSV with Notepad++, the characters are displayed correctly.

import codecs

fileObj = codecs.open(filename, "w", "utf-8")
fileObj.write(codecs.BOM_UTF8)  # the BOM write in question
c = u';'  # separator (not used in this excerpt)
for s in stringsToWrite:
    line = s.someUnicodeString
    fileObj.write(line)
fileObj.close()

"ï»¿" *is* the BOM, when wrongly interpreted as Latin-1. How are you checking the result? Also, Excel notoriously sucks with encodings. — deceze, Aug 29 '12 at 14:36
@InternetSeriousBusiness well I do discourage it, but Microsoft won't listen to me. — Adrian Ratnapala, Oct 27 '14 at 10:56
Excel is a pain. You are right, you do need to specify the BOM. However, by default Excel will load the file in whatever the default encoding is for your machine (almost certainly NOT UTF-8). You must import it and manually select the correct encoding, UTF-8, with the BOM in place. — Matthew Wilcoxson, Mar 17 '16 at 11:28

score 8 · Answer 1 · answered Aug 30 '12 at 09:47

fileObj = codecs.open(filename,"w",'utf-8')

OK, you have a Unicode output stream.

fileObj.write(codecs.BOM_UTF8)

codecs.BOM_UTF8 is a sequence of bytes, not a Unicode string as you would expect to write to a Unicode stream. Python will automatically convert the bytes to Unicode using some encoding, which may not be the correct one. If the default encoding is Windows code page 1252 rather than UTF-8, you are effectively double-encoding the BOM, and it comes out as the UTF-8 encoding of ï»¿.
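To make the double-encoding concrete, here is a minimal sketch. It is written for Python 3 and does the conversion explicitly, assuming code page 1252 is the encoding that gets picked (what the implicit conversion actually uses depends on your interpreter's default encoding). The variable names are mine, for illustration only.

```python
import codecs

# The three bytes of the UTF-8 BOM: EF BB BF.
bom_bytes = codecs.BOM_UTF8

# Assumption: the bytes are misread as code page 1252.
# Each byte then becomes its own character.
misread = bom_bytes.decode("cp1252")

# Encoding those three characters as UTF-8 gives six bytes,
# which is what ends up at the start of the file.
double_encoded = misread.encode("utf-8")

print(repr(misread))
print(len(bom_bytes), len(double_encoded))
```

A reader that correctly decodes the file as UTF-8 then sees three ordinary characters at the start instead of a BOM.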

Suggest writing the BOM as the Unicode character it is instead:

fileObj.write(u'\uFEFF')
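As a rough check of this suggestion, the sketch below writes U+FEFF through a UTF-8 text stream and then reads the raw bytes back. It uses io.open rather than codecs.open, and the helper name write_with_bom, the temporary path and the sample rows are all made up for the example.

```python
import codecs
import io
import os
import tempfile


def write_with_bom(path, lines):
    """Write text lines as UTF-8, preceded by a single BOM character."""
    # newline="" keeps the explicit "\r\n" endings untouched.
    with io.open(path, "w", encoding="utf-8", newline="") as f:
        f.write(u"\ufeff")  # the stream encodes U+FEFF as EF BB BF
        for line in lines:
            f.write(line + u"\r\n")


# Hypothetical output path and sample data.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
write_with_bom(path, [u"名前;値", u"東京;1"])

# Read the file back as raw bytes to inspect what was written.
with open(path, "rb") as f:
    raw = f.read()

print(raw.startswith(codecs.BOM_UTF8))
```

Because the BOM goes through the same encoder as the rest of the text, it is encoded exactly once.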

InternetSeriousBusiness wrote:

Isn't the UTF-8 BOM discouraged, anyway?

Yes, the UTF-8 faux-BOM is largely a disaster in most contexts, but it is needed to get Excel's charset guessing to pick up UTF-8. Unfortunately it doesn't work in Excel for Mac. Another possible approach might be to use UTF-16.
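For comparison, here is a small sketch of what Python's built-in codecs produce for the two options. The "utf-8-sig" codec is not mentioned above; it is a standard codec that prepends the UTF-8 BOM for you, while "utf-16" prepends a UTF-16 BOM in the machine's native byte order. The sample text is made up, and whether a given Excel version opens either result cleanly is something you would have to test yourself.

```python
import codecs

# Hypothetical tab-separated row with Japanese text.
text = u"東京\t1\r\n"

# "utf-8-sig" writes EF BB BF, then the UTF-8 encoding of the text.
utf8_sig = text.encode("utf-8-sig")

# "utf-16" writes a two-byte BOM (FF FE or FE FF, depending on
# the platform's byte order), then the UTF-16 encoding of the text.
utf16 = text.encode("utf-16")

print(utf8_sig[:3] == codecs.BOM_UTF8)
print(utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

Either codec name can also be passed as the encoding when opening the file, so the BOM does not have to be written by hand.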

score 0 · Answer 2 · answered Aug 29 '12 at 14:36

The string you copied is the UTF-8 BOM, so your problem is not in your Python code but somewhere else.

ThiefMaster
