What is the simplest way to fix an existing csv unicode utf-8 without BOM file not displaying correctly in excel?

Question

I have the task of converting utf-8 csv file to excel file, but it is not read properly in excel. Because there was no byte order mark (BOM) at the beginning of the file
I see how: https://stackoverflow.com/a/38025106/6102332

    with open('test.csv', 'w', newline='', encoding='utf-8-sig') as f:
        w = csv.writer(f)
        # Write Unicode strings.
        w.writerow([u'English', u'Chinese'])
        w.writerow([u'American', u'美国人'])
        w.writerow([u'Chinese', u'中国人'])

But it seems like that only works with brand new files.
But not work for my file already has data.
Are there any easy ways to share?

Is there any other way than this? : https://stackoverflow.com/a/6488070/6102332

Save the exported file as a csv

Open Excel

Import the data using Data-->Import External Data --> Import Data

Select the file type of "csv" and browse to your file

In the import wizard change the File_Origin to "65001 UTF" (or choose correct language character identifier)

Change the Delimiter to comma

Select where to import to and Finish

If you know the current character set for sure, you can probably automate the conversion to UTF-8-sig with Python. — lenz, Jan 13 '20 at 14:21
If you do the above in Power Query, it will "remember" the encoding for future runs of the query. — Ron Rosenfeld, Jan 13 '20 at 19:53

Mark Tolonen · Answer 1 · 2020-01-13T19:00:23.290

Read the file in and write it back out with the encoding desired:

with open('input.csv','r',encoding='utf-8-sig') as fin:
    with open('output.csv','w',encoding='utf-8-sig') as fout:
        fout.write(fin.read())

utf-8-sig codec will remove BOM if present on read, and will add BOM on write, so the above can safely run on files with or without BOM originally.

You can convert in place by doing:

file = 'test.csv'

with open(file,'r',encoding='utf-8-sig') as f:
    data = f.read() 

with open(file,'w',encoding='utf-8-sig') as f:
    f.write(data)

Note also that utf16 works as well. Some older Excels don't handle UTF-8 correctly.

Sơn Lê Trường · Answer 2 · 2020-01-14T11:14:49.607

Thank You!
I have found a way to automatically handle the missing BOM utf-8 signature. In addition to the lack of BOM signature, there is another problem is that duplicate BOM signature is mixed in the file data. Excel does not show clearly and transparently. and make a mistake other data when compared, calculated. eg :
data -> Excel
Chineseï»¿ -> Chinese
1ï»¿2 -> 12

If you compare it, obviously ChineseBOM will not be equal to Chinese. Code python to solve the problem:

import codecs
bom_utf8 = codecs.BOM_UTF8

def fix_duplicate_bom_utf8(file, bom=bom_utf8):
    with open(file, 'rb') as f:
        data_f = f.read()

    data_finish = bom + data_f.replace(bom, b'')

    with open(file, 'wb') as f:
        f.write(data_finish)
    return

# Use:
file_csv = r"D:\data\d20200114.csv" # Americanï»¿, 美国人ï»¿

fix_duplicate_bom_utf8(file_csv)
 # file_csv  -> ï»¿American, 美国人

What is the simplest way to fix an existing csv unicode utf-8 without BOM file not displaying correctly in excel?

2 Answers2