246

I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? (The related question How can I detect the encoding/codepage of a text file deals with C#.)

martineau
Nope

12 Answers

245

EDIT: chardet seems to be unmaintained, but most of this answer still applies. See https://pypi.org/project/charset-normalizer/ for an alternative.

Correctly detecting the encoding all the time is impossible.

(From the chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
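To make the ambiguity concrete, here is a small standard-library-only sketch (my illustration, not from chardet) showing that the same raw bytes decode cleanly under several legacy 8-bit encodings, so the bytes alone cannot identify the encoding:

```python
# "café" encoded as latin-1 / cp1252 / cp1250 all produce these bytes.
raw = b'caf\xe9'

candidates = ['latin-1', 'cp1252', 'cp1250', 'utf-8']
readable = {}
for enc in candidates:
    try:
        readable[enc] = raw.decode(enc)
    except UnicodeDecodeError:
        pass

print(readable)
# utf-8 fails (0xe9 is an invalid start of a sequence), but every
# single-byte codec "succeeds" -- the bytes cannot tell you which is right.
```

This is why detectors fall back on statistics about which byte sequences are plausible in which languages.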

There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.

You can also use UnicodeDammit. It will try the following methods:

  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252
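The BOM-sniffing step in the list above can be sketched with the standard library alone; this helper is my own illustration (not Beautiful Soup's actual code), built on the BOM constants in `codecs`:

```python
import codecs

def sniff_bom(raw):
    """Return an encoding name if raw starts with a known BOM, else None."""
    # Check utf-32 before utf-16: the utf-32 LE BOM starts with the
    # utf-16 LE BOM (FF FE), so the order of checks matters.
    boms = [(codecs.BOM_UTF32_LE, 'utf-32-le'),
            (codecs.BOM_UTF32_BE, 'utf-32-be'),
            (codecs.BOM_UTF16_LE, 'utf-16-le'),
            (codecs.BOM_UTF16_BE, 'utf-16-be'),
            (codecs.BOM_UTF8, 'utf-8-sig')]
    for bom, name in boms:
        if raw.startswith(bom):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF8 + b'hello'))  # utf-8-sig
print(sniff_bom(b'no bom here'))              # None
```

A BOM is the only case where detection is essentially certain; everything after this step is guesswork.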
nosklo
  • 1
    Thanks for the `chardet` reference. Seems good, although a bit slow. – Craig McQueen Jan 28 '10 at 05:15
  • Not being able to detect the encoding all the time... isn't this a flaw in the encoding standard? Shouldn't this always be predictable? – Geomorillo Dec 01 '13 at 21:40
  • 21
    @Geomorillo: There's no such thing as "the encoding standard". Text encoding is something as old as computing, it grew organically with time and needs, it wasn't planned. "Unicode" is an attempt to fix this. – nosklo Dec 02 '13 at 14:34
  • 1
    And not a bad one, all things considered. What I would like to know is, how do I find out what encoding an open text file was opened with? – holdenweb Mar 14 '14 at 06:27
  • I am confused by this. I have a text file I was struggling to read in Python, so I opened it in Visual Studio Code. In the bottom gutter of the resulting file window it says "UTF-16 LE". When you note that it is impossible does that means that tools like VSCode would fail too? – dumbledad Apr 20 '18 at 10:28
  • 2
    @dumbledad what I said is that correctly detecting it **all the time** is impossible. All you can do is guess; it can fail sometimes, and it won't work every time, because encodings are not really detectable. To make the guess, you can use one of the tools I suggested in the answer – nosklo Apr 20 '18 at 15:41
  • `chardet` has a really nice command line interface. I'm not sure about your use case; for me, I was just trying to guess a file's charset on the fly, not to use it in a script. To use the CLI (after `pip install chardet`): `$ chardet filename`, and you can then re-encode the file using the guessed encoding with tools like `iconv`. – adonese Aug 31 '18 at 19:40
  • I'd love `chardet` but since Turkish support was added it skewed numbers in a way that now it guesses Turkish for way too many files I came across. So much so I had to get rid of `chardet`. – Csaba Toth Jan 09 '19 at 21:49
  • 1
    Apparently `cchardet` is faster, but requires `cython`. – Superdooperhero May 26 '19 at 19:48
  • It is amazing that no pre-existing solutions get it right, but a simple function will do for most typical cases (and can be customized for your local needs): https://paste.zi.fi/p/decode.py/view – L. Kärkkäinen Jul 12 '19 at 12:46
  • 1
    @LasseKärkkäinen the point of that answer is to show that correctly detecting encoding is **impossible**; the function you provide can guess right for your case, but is wrong for many cases. – nosklo Jul 12 '19 at 13:58
  • @nosklo Quite true; that's why the comment says 8-bit guesswork. However, sensible priorities should be used and now chardet favours Turkish way too much. The paste demonstrates clear problems with chardet that cannot be justified by the ambiguity (because \x81 does not exist in the "detected" encoding and because UTF-8 should always be the first choice whenever it fits). – L. Kärkkäinen Jul 15 '19 at 04:29
  • You could add the more recent [charset-normalizer](https://pypi.org/project/charset-normalizer/) – snakecharmerb Aug 31 '20 at 13:42
81

Another option for working out the encoding is to use libmagic (which is the code behind the file command). There is a profusion of Python bindings available.

The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. They can determine the encoding of a file by doing:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc

There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also get the encoding, by doing:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)
Hamish Downer
  • 5
    `libmagic` is indeed a viable alternative to `chardet`. And great info on the distinct packages named `python-magic`! I'm sure this ambiguity bites many people – MestreLion Oct 22 '13 at 16:42
  • 2
    `file` isn't particularly good at identifying human language in text files. It is excellent for identifying various container formats, though you sometimes have to know what it means ("Microsoft Office document" could mean an Outlook message, etc). – tripleee Mar 06 '15 at 07:15
  • Looking for a way to manage file encoding mystery I found this post. Unfortunately, using the example code, I can't get past `open()`: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 169799: invalid start byte`. The file encoding according to vim's `:set fileencoding` is `latin1`. – xtian Aug 19 '17 at 15:51
  • If I use the optional argument `errors='ignore'`, the output of the example code is the less helpful `binary`. – xtian Aug 19 '17 at 15:54
  • 2
    @xtian You need to open in binary mode, i.e. open("filename.txt", "rb"). – L. Kärkkäinen Jul 15 '19 at 04:34
  • chardet didn't work for `iso-8859-1` (a common Windows file format), but this works perfectly! libmagic seems like the best solution to this problem. – cdignam Oct 23 '19 at 17:11
  • the second one solved my problem. And I have to use blob = open("unknown-file", "rb").read() like L. Kärkkäinen said. – Xiaoqi Mar 19 '20 at 07:03
  • There's also a python-magic-bin package which has binaries and works on Windows: https://github.com/julian-r/python-magic That package returned 'binary' for the encoding for a .docx file, while UnicodeDammit returned 'utf-8' and chardet returned None. In this case, bs4 (UnicodeDammit) looks best. – wordsforthewise Jan 31 '21 at 23:05
35

Some encoding strategies; uncomment to taste:

#!/bin/bash
#
tmpfile=$1
echo '-- info about the file ........'
file -i "$tmpfile"
enca -g "$tmpfile"
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
#enca -x utf-8 "$tmpfile"
#enca -g "$tmpfile"
recode CP1250..UTF-8 "$tmpfile"

You might like to check the encoding by opening and reading the file in a loop... but you might need to check the file size first:

# PYTHON
import codecs

encodings = ['utf-8', 'windows-1250', 'windows-1252']  # add more
for e in encodings:
    try:
        fh = codecs.open('file.txt', 'r', encoding=e)
        fh.readlines()
        fh.seek(0)
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding:  %s ' % e)
        break
zzart
  • 1
    You can also use `io`, like `io.open(filepath, 'r', encoding='utf-8')`, which is more convenient, because `codecs` doesn't convert `\n` automatically on reading and writing. More on [HERE](https://docs.python.org/2/library/codecs.html#codecs.open) – Searene May 01 '16 at 06:57
31

Here is an example of reading, and taking at face value, a chardet encoding prediction, reading only n_lines from the file in case it is large.

chardet also gives you a probability (i.e. confidence) for its encoding prediction (I haven't looked at how they come up with that), which is returned along with the prediction from `chardet.detect()`, so you could work that in somehow if you like.

import chardet

def predict_encoding(file_path, n_lines=20):
    '''Predict a file's encoding using chardet'''
    # Open the file as binary data and join the first n_lines lines
    with open(file_path, 'rb') as f:
        rawdata = b''.join([f.readline() for _ in range(n_lines)])

    return chardet.detect(rawdata)['encoding']
ryanjdillon
  • Looking at this after getting an up-vote and now see that this solution could slow down if there were a lot of data on the first line. In some cases it would be better to read the data in differently. – ryanjdillon Jan 22 '18 at 11:55
  • 3
    I have modified this function this way: `def predict_encoding(file_path, n=20): ... skip ... and then rawdata = b''.join([f.read() for _ in range(n)])`. I tried this function on Python 3.6; it worked perfectly with "ascii", "cp1252", "utf-8", "unicode" encodings. So this is definitely an upvote. – n158 Oct 18 '18 at 11:59
  • 2
    this is very good for handling small datasets with a variety of formats. Tested this recursively on my root dir and it worked like a treat. Thanks buddy. – Umar.H Nov 25 '19 at 12:48
  • I'm not very familiar with reading data at the byte level. @n158, is there a chance one might stop reading bytes in the middle of a character and confuse `chardet`? – kuzzooroo Mar 28 '21 at 17:24
8

This might be helpful:

from bs4 import UnicodeDammit

with open('automate_data/billboard.csv', 'rb') as file:
    content = file.read()

suggestion = UnicodeDammit(content)
print(suggestion.original_encoding)
# 'iso-8859-1'
kgf3JfUtW
richinex
6
# Function: OpenRead(file)

# A text file can be encoded using:
#   (1) The default operating system code page, Or
#   (2) utf8 with a BOM header
#
#  If a text file is encoded with utf8, and does not have a BOM header,
#  the user can manually add a BOM header to the text file
#  using a text editor such as notepad++, and rerun the python script,
#  otherwise the file is read as a codepage file with the 
#  invalid codepage characters removed

import sys
if sys.version_info[0] != 3:
    print('Aborted: Python 3.x required')
    sys.exit(1)

def bomType(file):
    """
    returns file encoding string for open() function

    EXAMPLE:
        bom = bomtype(file)
        open(file, encoding=bom, errors='ignore')
    """

    f = open(file, 'rb')
    b = f.read(4)
    f.close()

    if b[0:3] == b'\xef\xbb\xbf':
        return "utf8"

    # Check utf-32 before utf-16: the utf-32 LE BOM (FF FE 00 00)
    # begins with the utf-16 LE BOM (FF FE)
    if (b[0:4] == b'\x00\x00\xfe\xff') \
            or (b[0:4] == b'\xff\xfe\x00\x00'):
        return "utf32"

    # Python automatically detects endianness if a utf-16 BOM is present;
    # write endianness is generally determined by the endianness of the CPU
    if (b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe'):
        return "utf16"

    # If no BOM is provided, assume the codepage
    # used by your operating system (for the United States: cp1252)
    return "cp1252"


def OpenRead(file):
    bom = bomType(file)
    return open(file, 'r', encoding=bom, errors='ignore')


#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()

fout = open("myfile2.txt", "w", encoding="utf8")
fout.write("\u2022 hi there (utf8)")
fout.close()

# this case is still treated like codepage cp1252
#   (User responsible for making sure that all utf8 files
#   have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
fout.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L = fin.readline()
print(L) #requires QtConsole to view, Cmd.exe is cp1252
fin.close()

# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L = fin.readline()
print(L)
fin.close()

# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
fin.read(20)
fin.close()
Bimo
2

It is, in principle, impossible to determine the encoding of a text file, in the general case. So no, there is no standard Python library to do that for you.

If you have more specific knowledge about the text file (e.g. that it is XML), there might be library functions.
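For the XML case, for instance, the encoding is usually declared in the document's own prolog, so a rough sketch like the following (the helper is hypothetical, not a standard library function) can often recover it:

```python
import re

def xml_declared_encoding(raw):
    """Return the encoding named in an XML declaration, or None."""
    # The prolog, if present, is pure ASCII, so it can be matched
    # directly against the raw bytes before any decoding.
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return m.group(1).decode('ascii') if m else None

print(xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><r/>'))
# ISO-8859-1
```

Note this trusts the document's self-description; a file whose declaration lies is still undetectable in general.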

Martin v. Löwis
2

Depending on your platform, I just opt to use the Linux shell `file` command. This works for me since I am using it in a script that exclusively runs on one of our Linux machines.

Obviously this isn't an ideal solution or answer, but it could be modified to fit your needs. In my case I just need to determine whether a file is UTF-8 or not.

import subprocess

def is_utf8(path):
    file_cmd = ['file', path]
    p = subprocess.Popen(file_cmd, stdout=subprocess.PIPE)
    cmd_output = p.stdout.readlines()
    # the output begins with "filename: " followed by the file type
    x = cmd_output[0].decode().split(": ")[1]
    return x.startswith('UTF-8')
MikeD
  • Forking a new process is not needed. Python code already runs inside a process, and can call the proper system functions itself without the overhead of loading a new process. – vdboor Jul 18 '17 at 10:15
1

If you know some of the content of the file, you can try to decode it with several encodings and see which one fails. In general there is no way, since a text file is just a text file and those are stupid ;)
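That trial-and-error idea can be sketched like this (the candidate list and function name are illustrative choices, not a standard recipe):

```python
def guess_by_trial(raw, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate encoding that decodes raw without error."""
    # Order matters: utf-8 is strict, so try it first; latin-1 accepts
    # every byte sequence, so it only makes sense as a last resort.
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(guess_by_trial('café'.encode('utf-8')))   # utf-8
print(guess_by_trial('café'.encode('cp1252')))  # cp1252
```

A successful decode only proves the bytes are valid in that encoding, not that it is the one the author used.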

Martin Thurau
1

This site has Python code for recognizing ASCII, encodings with BOMs, and UTF-8 without a BOM: https://unicodebook.readthedocs.io/guess_encoding.html. Read the file into a byte array (data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array. Here's an example. I'm on OS X.

#!/usr/bin/python                                                                                                  

import sys

def isUTF8(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True

def get_bytes_from_file(filename):
    return open(filename, "rb").read()

filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)


PS /Users/js> ./isutf8.py hi.txt                                                                                     
True
js2010
  • A link to a solution is welcome, but please ensure your answer is useful without it: [add context around the link](//meta.stackexchange.com/a/8259) so your fellow users will have some idea what it is and why it’s there, then quote the most relevant part of the page you're linking to in case the target page is unavailable. [Answers that are little more than a link may be deleted.](//stackoverflow.com/help/deleted-answers) – double-beep Apr 12 '19 at 14:11
0

Using the Linux `file -i` command:

import re
import subprocess

file = "path/to/file/file.txt"

encoding = subprocess.Popen("file -bi " + file, shell=True, stdout=subprocess.PIPE).stdout

encoding = re.sub(r"(\\n)[^a-z0-9\-]", "", str(encoding.read()).split("=")[1], flags=re.IGNORECASE)

print(encoding)
Emeeus
0

You can use the `python-magic` package, which does not load the whole file into memory:

import magic


def detect(file_path):
    return magic.Magic(mime_encoding=True).from_file(file_path)

The output is the encoding name for example:

  • iso-8859-1
  • us-ascii
  • utf-8
Alon Barad