Decoding Cyrillic in Python - character maps to <undefined>

Question

I receive a server response, bytes:

\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91

This is for sure Cyrillic, but I'm not sure which encoding. Every attempt to decode it in Python fails:

>>> b = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'
>>> b.decode('utf-8')
'\u0420\u0443\u0431\u043b\u0438 \u0420\u0424 \u041a\u0426\u0411'
>>> print(b.decode('utf-8'))
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4:
character maps to <undefined>

>>> b.decode('cp1251')
'\u0420\xa0\u0421\u0453\u0420±\u0420»\u0420\u0451 \u0420\xa0\u0420¤
\u0420\u0459\u0420¦\u0420\u2018'
>>> print(b.decode('cp1251'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u0420' in
position 0: character maps to <undefined>

Both results somewhat resemble Unicode-escape, but this does not work either:

>>> codecs.decode('\u0420\u0443\u0431\u043b\u0438 \u0420\u0424 \u041a\u0426\u0411',
'unicode-escape')
'Ð\xa0Ñ\x83Ð±Ð»Ð¸ Ð\xa0Ð¤ Ð\x9aÐ¦Ð\x91'

There's a web service for recovering Cyrillic texts, and it is able to decode my bytes using Windows-1251:

Output (source encoding : WINDOWS-1251)

Рубли РФ КЦБ

But I don't have any more ideas as to how to approach this.

I think I'm missing something about how encoding works, so if the problem seems trivial to you, I would greatly appreciate a bit of explanation, a link to a tutorial, or some keywords for further googling.

Solution:

Windows PowerShell uses code page 850 (cp850) by default, which cannot represent Cyrillic characters. One fix is to switch the code page to UTF-8 (65001) each time the shell starts:

chcp 65001

How to make it the new default is explained here.
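
As a quick way to confirm that the console encoding is the culprit, here is a minimal sketch for Python 3. The helper name `console_can_show` is made up for illustration; it only checks whether a string survives encoding with a given codec, defaulting to whatever `sys.stdout.encoding` reports.

```python
import sys


def console_can_show(text, encoding=None):
    """Return True if `text` can be encoded with `encoding`.

    `console_can_show` is a hypothetical helper, not a standard function.
    When no encoding is given, the one reported by sys.stdout is used.
    """
    encoding = encoding or sys.stdout.encoding or 'ascii'
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The decoded server response, written with escapes so that this file
# itself contains only ASCII.
text = '\u0420\u0443\u0431\u043b\u0438 \u0420\u0424 \u041a\u0426\u0411'

print(sys.stdout.encoding)              # the encoding print() will use
print(console_can_show(text, 'cp850'))  # cp850 has no Cyrillic letters
print(console_can_show(text, 'utf-8'))  # UTF-8 covers all of Unicode
```

If the first line prints `cp850` or another legacy code page without Cyrillic, `print` will raise the same `UnicodeEncodeError` for this text.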

(1) don't use `chcp 65001` [there could be hard to workaround bugs e.g., `print u'\xc1\xc1'`](http://stackoverflow.com/q/31846091/4279), [use `win-unicode-console` instead](http://stackoverflow.com/a/30551552/4279) (2) don't put solution (the answer) into your question, [post it as your own answer instead (to allow commenting, voting, etc)](http://stackoverflow.com/help/self-answer) — jfs, Sep 11 '15 at 20:39

wolendranh · Answer 1 · 2015-09-11T14:21:32.203

Try this:

 In [1]: s = "\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91"

 In [11]: print s.decode('utf-8')
    Рубли РФ КЦБ

To print or display byte strings properly, they first need to be decoded into Unicode strings.

There is a lot of information, with examples, in the standard Python library documentation.

Python 3:

>>> import sys
>>> print (sys.version)
3.4.0 (default, Jun 19 2015, 14:20:21) 
[GCC 4.8.2]
>>> b = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'
>>> b.decode('utf-8')
'Рубли РФ КЦБ'
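
The escaped output in the question is not a failed decode: it is the same string, shown with escapes because the console could not display the letters. A small sketch that checks this, assuming Python 3 (the variable names are only for illustration):

```python
raw = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'

# bytes -> str: decode() turns the UTF-8 bytes into Unicode text.
text = raw.decode('utf-8')

# The escaped literal from the question denotes exactly the same string.
escaped = '\u0420\u0443\u0431\u043b\u0438 \u0420\u0424 \u041a\u0426\u0411'
print(text == escaped)

# ascii() shows the string with escapes, which is safe on any console.
print(ascii(text))

# Each Cyrillic letter takes two bytes in UTF-8, each space takes one.
print(len(raw), len(text))

# str -> bytes: encode() is the inverse operation.
print(text.encode('utf-8') == raw)
```

So `decode` worked in the question; only the later step of writing the text to the console failed.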

edited Sep 11 '15 at 14:21

answered Sep 11 '15 at 12:07

wolendranh


I don't think this works in Python 3. AttributeError: 'str' object has no attribute 'decode' – Artem Sep 11 '15 at 13:17

score 1 · Accepted Answer · answered Sep 11 '15 at 12:10

"This is for sure Cyrillic, but I'm not sure which encoding."

This is UTF-8 (100%).
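
One way to check this yourself, sketched for Python 3: a strict UTF-8 decode succeeds and every letter lands in the Unicode Cyrillic block, while a Windows-1251 decode also "succeeds" but yields mojibake. That second result may be why the web service in the question reported WINDOWS-1251, though that is only a guess about how the service works.

```python
raw = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'

# A strict decode raises UnicodeDecodeError on invalid UTF-8, so
# success here is already strong evidence.
text = raw.decode('utf-8')

# Every non-space character falls in the Cyrillic block U+0400..U+04FF.
letters = [ch for ch in text if not ch.isspace()]
all_cyrillic = all('\u0400' <= ch <= '\u04ff' for ch in letters)
print(all_cyrillic)

# Decoding the same bytes as Windows-1251 does not fail, but each
# two-byte UTF-8 sequence becomes two unrelated characters.
mojibake = raw.decode('cp1251')
print(ascii(mojibake))

# Mojibake of this kind can be undone by reversing the wrong step.
repaired = mojibake.encode('cp1251').decode('utf-8')
print(repaired == text)
```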

Python 3.4.3 (default, Mar 25 2015, 17:13:50) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: b = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'

In [2]: s = b.decode('utf-8')

In [3]: print(s)
Рубли РФ КЦБ

Works fine for me. Maybe you have a problem with your terminal or REPL?
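
To reproduce the terminal problem without changing any system settings, the sketch below prints to an in-memory stream that stands in for a cp850 console. It assumes Python 3, and `print_to_fake_terminal` is a hypothetical helper written for this example.

```python
import io

raw = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'
text = raw.decode('utf-8')


def print_to_fake_terminal(s, encoding, errors='strict'):
    """Print `s` to an in-memory text stream and return the bytes written.

    The stream imitates a console that uses `encoding`. This is an
    illustrative helper, not part of any library.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding,
                              errors=errors, newline='\n')
    print(s, file=stream)
    stream.flush()
    return buffer.getvalue()


# A cp850 "terminal" cannot encode Cyrillic, so print() itself fails.
failure = None
try:
    print_to_fake_terminal(text, 'cp850')
except UnicodeEncodeError as exc:
    failure = exc
print(type(failure).__name__)

# A UTF-8 "terminal" receives the original bytes plus a newline.
print(print_to_fake_terminal(text, 'utf-8') == raw + b'\n')

# With errors='backslashreplace' the cp850 stream gets escapes
# instead of raising.
print(print_to_fake_terminal(text, 'cp850', errors='backslashreplace'))
```

The `backslashreplace` handler trades readability for not crashing; a terminal that can actually display UTF-8 is the real fix.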
