0

I have the task of converting utf-8 csv file to excel file, but it is not read properly in excel. Because there was no byte order mark (BOM) at the beginning of the file
I see how: https://stackoverflow.com/a/38025106/6102332

    with open('test.csv', 'w', newline='', encoding='utf-8-sig') as f:
        w = csv.writer(f)
        # Write Unicode strings.
        w.writerow([u'English', u'Chinese'])
        w.writerow([u'American', u'美国人'])
        w.writerow([u'Chinese', u'中国人'])


But it seems like that only works with brand new files.
But not work for my file already has data.
Are there any easy ways to share?

Is there any other way than this? : https://stackoverflow.com/a/6488070/6102332

  1. Save the exported file as a csv
  2. Open Excel
  3. Import the data using Data-->Import External Data --> Import Data
  4. Select the file type of "csv" and browse to your file
  5. In the import wizard change the File_Origin to "65001 UTF" (or choose correct language character identifier)
  6. Change the Delimiter to comma
  7. Select where to import to and Finish
  • If you know the current character set for sure, you can probably automate the conversion to UTF-8-sig with Python. – lenz Jan 13 '20 at 14:21
  • If you do the above in Power Query, it will "remember" the encoding for future runs of the query. – Ron Rosenfeld Jan 13 '20 at 19:53

2 Answers2

1

Read the file in and write it back out with the encoding desired:

with open('input.csv','r',encoding='utf-8-sig') as fin:
    with open('output.csv','w',encoding='utf-8-sig') as fout:
        fout.write(fin.read())

utf-8-sig codec will remove BOM if present on read, and will add BOM on write, so the above can safely run on files with or without BOM originally.

You can convert in place by doing:

file = 'test.csv'

with open(file,'r',encoding='utf-8-sig') as f:
    data = f.read() 

with open(file,'w',encoding='utf-8-sig') as f:
    f.write(data)

Note also that utf16 works as well. Some older Excels don't handle UTF-8 correctly.

Mark Tolonen
  • 132,868
  • 21
  • 152
  • 208
0

Thank You!
I have found a way to automatically handle the missing BOM utf-8 signature. In addition to the lack of BOM signature, there is another problem is that duplicate BOM signature is mixed in the file data. Excel does not show clearly and transparently. and make a mistake other data when compared, calculated. eg :
data -> Excel
Chinese -> Chinese
12 -> 12

If you compare it, obviously ChineseBOM will not be equal to Chinese. Code python to solve the problem:

import codecs
bom_utf8 = codecs.BOM_UTF8

def fix_duplicate_bom_utf8(file, bom=bom_utf8):
    with open(file, 'rb') as f:
        data_f = f.read()

    data_finish = bom + data_f.replace(bom, b'')

    with open(file, 'wb') as f:
        f.write(data_finish)
    return

# Use:
file_csv = r"D:\data\d20200114.csv" # American, 美国人

fix_duplicate_bom_utf8(file_csv)
 # file_csv  -> American, 美国人