My current solution is to read all the bytes of a file and try to decode them; if any exception is raised, I conclude the file is not properly encoded. Is there a more elegant way? Thanks.
utfbytes.decode('utf-8')
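For context, the approach described above might look like the following sketch (the function name and the UTF-8 choice are illustrative, not from the original post):

```python
def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

This is the "try to decode, catch the exception" strategy in its simplest form: it validates UTF-8 exactly, but it cannot tell you *which* encoding a non-UTF-8 file uses.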
regards, Lin
No. From that answer:
Correctly detecting the encoding every time is impossible.
(From the chardet FAQ:)
However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
However, there are libraries, such as chardet, that make a best effort to guess the encoding.
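If a statistical detector like chardet is not an option, a common standard-library-only fallback is to try a short list of candidate encodings in priority order. This is a sketch, not from the original thread; the candidate list is an assumption, and note that `latin-1` acts as a catch-all because it accepts every byte sequence:

```python
def guess_encoding(data, candidates=("utf-8", "utf-16", "latin-1")):
    """Return the first candidate encoding that decodes `data`, or None.

    Candidates are tried in order, so put stricter encodings first:
    latin-1 maps every byte to a character and therefore never fails.
    """
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except (UnicodeDecodeError, UnicodeError):
            continue
    return None
```

With the third-party chardet package installed, `chardet.detect(raw_bytes)` instead returns a dictionary with a guessed `'encoding'` and a `'confidence'` score, using the statistical approach described in the FAQ excerpt above.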