My current solution is to read all the bytes of a file and try to decode them; if any exception is raised, I conclude the file is not properly encoded. Is there a more elegant way? Thanks.
utfbytes.decode('utf-8')
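For context, the approach described above might look like the following sketch (the function name and the UTF-8 choice are illustrative, not from the original post):

```python
def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

This is the "try to decode, catch the exception" strategy in its simplest form: it validates UTF-8 exactly, but it cannot tell you *which* encoding a non-UTF-8 file uses.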
regards, Lin
No. From that answer:
Correctly detecting the encoding every time is impossible.
(From the chardet FAQ:)
However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
However, there are libraries, such as chardet, that make a best effort to guess the encoding.
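If a statistical detector like chardet is not an option, a common standard-library-only fallback is to try a short list of candidate encodings in priority order. This is a sketch, not from the original thread; the candidate list is an assumption, and note that `latin-1` acts as a catch-all because it accepts every byte sequence:

```python
def guess_encoding(data, candidates=("utf-8", "utf-16", "latin-1")):
    """Return the first candidate encoding that decodes `data`, or None.

    Candidates are tried in order, so put stricter encodings first:
    latin-1 maps every byte to a character and therefore never fails.
    """
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except (UnicodeDecodeError, UnicodeError):
            continue
    return None
```

With the third-party chardet package installed, `chardet.detect(raw_bytes)` instead returns a dictionary with a guessed `'encoding'` and a `'confidence'` score, using the statistical approach described in the FAQ excerpt above.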