
I am playing around with Unicode in Python.

Here is a simple script:

# -*- coding: cp1251 -*-

print 'юникод'.decode('cp1251')
print unicode('юникод', 'cp1251')
print unicode('юникод', 'utf-8')

In cmd I've switched the active code page to 1251.

And this is the output:

СЋРЅРёРєРѕРґ
СЋРЅРёРєРѕРґ
юникод

I am a little confused.

Since I specified cp1251 as the encoding, I expected the string to be decoded correctly.

But instead some garbage code points were produced. I understand that 'юникод' is just bytes like: '\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'.

But is there a way to get correct output in a terminal with cp1251? Should I build the byte string manually?

It seems I have misunderstood something.

xiº

3 Answers


I think I understand what happened to you. The last line gave me the hint, and your garbage code points confirmed it: you are trying to display cp1251 characters, but your editor is configured to use UTF-8.
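As a rough sketch of that diagnosis (not the asker's actual file), the snippet below reproduces the effect without depending on how the source is saved. It assumes the editor wrote UTF-8 bytes; the variable names are only illustrative, and the word is spelled with \u escapes so the file stays pure ASCII. It should run unchanged on Python 2.7 and Python 3.

```python
# The word as Unicode code points (escapes keep this file pure ASCII).
word = u'\u044e\u043d\u0438\u043a\u043e\u0434'

# Assumption: this is what an editor set to UTF-8 writes to disk.
utf8_bytes = word.encode('utf-8')

# Decoding those bytes as cp1251 succeeds, but yields the wrong characters.
mojibake = utf8_bytes.decode('cp1251')

# Decoding with the encoding that was actually used restores the word.
restored = utf8_bytes.decode('utf-8')

print(repr(utf8_bytes))
print(mojibake == word)   # False
print(restored == word)   # True
```

The decode with cp1251 does not fail because each of these bytes happens to map to some cp1251 character, which is why the result is garbage rather than an exception.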

The # -*- coding: cp1251 -*- line is only used by the Python interpreter to convert characters from the source file that are outside the ASCII range. And even then it is only used for unicode literals, because bytes from the source end up as exactly the same bytes in byte strings. Some text editors are kind enough to use this line automatically (the IDLE editor does), but I have little confidence in that and always switch manually to the proper encoding when I use gvim, for example. Short story: # -*- coding: cp1251 -*- is unused in your code and can only mislead a reader, since it is not the actual encoding.

If you want to be sure of what is in your source, you'd better use explicit escapes. In code page 1251, the word юникод is made up of these bytes: '\xfe\xed\xe8\xea\xee\xe4'

If you write this source:

txt = '\xfe\xed\xe8\xea\xee\xe4'
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')

and execute it in a console configured to use the CP1251 charset, the first three prints will output юникод, and the last one will raise a UnicodeDecodeError because the input is not valid UTF-8.

Alternatively, if you are comfortable with your current editor, you could write:
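For readers who want to check the decoding behaviour without a CP1251 console, here is a minimal sketch that only inspects the results instead of printing the text. The variable names are made up for the example, and it is written to run on both Python 2.7 and Python 3.

```python
# cp1251 bytes for the word, written as explicit escapes.
cp1251_bytes = b'\xfe\xed\xe8\xea\xee\xe4'

# Decoding with the matching code page gives the intended Unicode string.
word = cp1251_bytes.decode('cp1251')

# The same bytes are not valid UTF-8 (0xfe can never appear in UTF-8),
# so this decode raises instead of silently producing garbage.
try:
    cp1251_bytes.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(repr(word))
print(utf8_error)
```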

# -*- coding: utf8 -*-

txt = 'юникод'.decode('utf8').encode('cp1251') # or simply txt = u'юникод'.encode('cp1251')
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')

which should give the same results, but now the declared source encoding matches the actual encoding of your Python source.

BTW, a Python 3.5 IDLE, which natively uses Unicode, confirmed that:

>>> 'СЋРЅРёРєРѕРґ'.encode('cp1251').decode('utf8')
'юникод'
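The same round trip can be wrapped in a small helper. This is only a sketch: repair_mojibake is a hypothetical name, not a library function, and it assumes you already know which two encodings were mixed up. The garbled text is written with \u escapes so the example does not depend on the source encoding.

```python
def repair_mojibake(garbled, wrong_encoding='cp1251', actual_encoding='utf-8'):
    """Undo a decode that used wrong_encoding on actual_encoding bytes."""
    # Re-encoding with the wrong codec recovers the original bytes,
    # which can then be decoded with the codec that really produced them.
    return garbled.encode(wrong_encoding).decode(actual_encoding)


# The garbled output from the question, as Unicode escapes.
garbled = u'\u0421\u040b\u0420\u0405\u0420\u0451\u0420\u0454\u0420\u0455\u0420\u0491'
repaired = repair_mojibake(garbled)

print(repaired == u'\u044e\u043d\u0438\u043a\u043e\u0434')  # True
```

This only works when the wrong decode was lossless; if bytes were replaced or dropped along the way, the original text cannot be recovered like this.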
Serge Ballesta
  • I could not exactly execute this code because my system use CP1252 so I used `'éè'` instead of `'юникод'` and 1252 instead of 1251... – Serge Ballesta Mar 04 '16 at 16:02
  • Good catch! You are exactly right. That unicode is really painful :D – xiº Mar 04 '16 at 16:07
  • @xi_: the issue is simpler than this answer suggests. You typed `'юникод'`, your editor saved these bytes `'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'` (юникод in utf-8), `python` obediently decoded these bytes using cp1251 encoding (that you've specified explicitly) and printed the result (wrong Unicode string) correctly. Note: the encoding declaration (`coding: cp1251`) is ignored by your editor and *it is not used* in your code (it is used for Unicode literals: `u'юникод'` -- it works only if your encoding declaration is correct -- if bytes on disk use the same encoding as declared). – jfs Mar 05 '16 at 18:47
  • @SergeBallesta: [use `win-unicode-console` package, to be able to display Unicode characters (such as юникод) in a Windows console whatever `chcp` returns](http://stackoverflow.com/a/32176732/4279) – jfs Mar 05 '16 at 18:52
  • @J.F.Sebastian: Thanks for the reference, I did not know this module. But OP's problem was just lack of coherence between declared (`-*- coding:`) and real encoding, not a problem of *displaying* anything. – Serge Ballesta Mar 09 '16 at 22:14
  • @SergeBallesta: 1- my comment is prompted by your comment: *"I could not exactly execute this code because my system use CP1252 so I used 'éè' instead of 'юникод'"* -- if you use `win-unicode-console` then you should be able to `print u'юникод'` on your system. 2- wrong. `coding:` declaration is not used in the code in the question (reread my comment above) -- the issue is the explicit `.decode('cp1251')` in the code. – jfs Mar 09 '16 at 22:51
  • @J.F.Sebastian: of course you are right... I've edited my post accordingly – Serge Ballesta Mar 10 '16 at 08:25
  • don't call `decode('utf8').encode('cp1251')`; it is completely unnecessary here (i.e., `.decode('cp1251')`, `unicode(txt, 'cp1251')` should be removed too). `print u'юникод'` is enough. I've included the [complete code example in my answer for clarity.](http://stackoverflow.com/a/35818111/4279) – jfs Mar 10 '16 at 09:16

Your issue is that the encoding declaration is wrong: your editor uses utf-8 character encoding to save the source code. Use # -*- coding: utf-8 -*- to fix it.

>>> u'юникод'
u'\u044e\u043d\u0438\u043a\u043e\u0434'
>>> u'юникод'.encode('utf-8')
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'
>>> print _.decode('cp1251') # mojibake due to the wrong encoding
СЋРЅРёРєРѕРґ
>>> print u'юникод'
юникод

Do not use bytestrings ('' literals create a bytes object on Python 2) to represent text; use Unicode strings (u'' literals, the unicode type) instead. If your code uses Unicode strings, then the code page that your Windows console uses doesn't matter, as long as the chosen font can display the corresponding (BMP) characters. See Python, Unicode, and the Windows console

Here's complete code, for reference:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u'юникод')

Note: there is no .decode() or unicode() call. If you are using a literal to create a string, you should use a Unicode literal if the string contains text. It is the only option on Python 3, where you can't put non-ASCII characters inside a bytes literal, and it is good practice (using Unicode for text instead of bytestrings) on Python 2 too.

If you are given a bytestring as an input (not a literal) by some API, then its encoding has nothing to do with the encoding declaration. What specific encoding to use depends on the source of the data.
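To illustrate the point about input data, here is a hedged sketch in which a temporary file stands in for "some API". The assumption, stated only for the example, is that the producer of the file is known to write cp1251; in real code that knowledge has to come from the data source, not from the coding declaration. It uses io.open so it behaves the same on Python 2.7 and Python 3.

```python
import io
import os
import tempfile

word = u'\u044e\u043d\u0438\u043a\u043e\u0434'

# Stand-in for an external producer that is known to write cp1251.
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, 'wb') as f:
    f.write(word.encode('cp1251'))

# Decode at the boundary with the encoding of the data source,
# then work with Unicode strings everywhere else.
with io.open(path, 'r', encoding='cp1251') as f:
    loaded = f.read()

os.remove(path)
print(loaded == word)  # True
```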

jfs

Just use the following, but ensure you save the source code in the declared encoding. It can be any encoding that supports the characters you want to print. The terminal can be in a different encoding, as long as it also supports the characters you want to print:

#coding:utf8
print u'юникод'

The advantage is that you don't need to know the terminal's encoding. Python will normally[1] detect the terminal encoding and encode the print output correctly.

[1] Unless your terminal is misconfigured.
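If you want to see which encoding Python picked, a rough sketch like the following can help. It approximates what print does with a Unicode string rather than reproducing it exactly; the fallback to 'ascii' and the 'replace' error handler are choices made for this example so it never raises, for instance when output is piped and no encoding is reported.

```python
import sys

word = u'\u044e\u043d\u0438\u043a\u043e\u0434'

# The encoding Python associates with stdout; it may be None or missing
# when output is redirected, so fall back to ASCII for this sketch.
encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'

# Roughly the bytes that would be sent to the terminal for this text.
# 'replace' substitutes '?' for characters the encoding cannot represent.
encoded = word.encode(encoding, 'replace')

print(encoding)
print(repr(encoded))
```

In a console using code page 1251 the second line would be expected to show the cp1251 bytes, while a UTF-8 terminal would show the UTF-8 bytes for the same string.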

Mark Tolonen