
I'm observing that in the program

# -*- coding: utf-8 -*-
words = ['artists', 'Künstler', '艺术家', 'Митець']
for word in words:
    print word, type(word)

it is not strictly necessary to mark the strings as unicode literals:

words = [u'artists', u'Künstler', u'艺术家', u'Митець']

The different alphabets are handled just fine without the 'u' prefix.

And so it appears that once coding: utf-8 is specified, all strings are encoded in Unicode. Is that true?

  • Or is unicode used only if the string contains characters outside range(128)?
  • Why does type(word) report <type 'str'> in all cases? Isn't unicode a distinct datatype?
Calaf

2 Answers


And so it appears that once coding: utf-8 is specified, all strings are encoded in Unicode. Is that true?

No. It means that byte sequences within the source code are interpreted as UTF-8. You have created bytestrings and the system is interpreting their contents naively (versus creating text with u'...').
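A minimal sketch of the difference (assuming a Python 2 interpreter and a UTF-8-encoded source file):

# -*- coding: utf-8 -*-
s = 'Künstler'                # bytestring: the raw UTF-8 bytes copied from the source file
u = u'Künstler'               # unicode string: decoded at parse time using the declared coding
print len(s), len(u)          # 9 bytes vs. 8 characters
print s.decode('utf-8') == u  # True: decoding the bytes as UTF-8 recovers the text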

Ignacio Vazquez-Abrams

Perhaps this will make it more clear:

# -*- coding: utf-8 -*-
words = ['artists', 'Künstler', '艺术家', 'Митець']
for word in words:
    print word, type(word), repr(word)
words = [u'artists', u'Künstler', u'艺术家', u'Митець']
for word in words:
    print word, type(word), repr(word)

Output:

artists <type 'str'> 'artists'
Künstler <type 'str'> 'K\xc3\xbcnstler'
艺术家 <type 'str'> '\xe8\x89\xba\xe6\x9c\xaf\xe5\xae\xb6'
Митець <type 'str'> '\xd0\x9c\xd0\xb8\xd1\x82\xd0\xb5\xd1\x86\xd1\x8c'
artists <type 'unicode'> u'artists'
Künstler <type 'unicode'> u'K\xfcnstler'
艺术家 <type 'unicode'> u'\u827a\u672f\u5bb6'
Митець <type 'unicode'> u'\u041c\u0438\u0442\u0435\u0446\u044c'

In the first case you get byte strings encoded in the declared source encoding of UTF-8. They will only display correctly on a UTF-8 terminal.

In the second case you get Unicode strings. They will display correctly on any terminal whose encoding supports the characters.
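To see that the two lists hold the same text in different representations, you can decode the byte strings with the declared encoding (a sketch, again assuming Python 2):

# -*- coding: utf-8 -*-
byte_words = ['artists', 'Künstler', '艺术家', 'Митець']
uni_words = [u'artists', u'Künstler', u'艺术家', u'Митець']
for b, u in zip(byte_words, uni_words):
    # decoding the UTF-8 bytes yields the corresponding unicode object
    print b.decode('utf-8') == u, type(b.decode('utf-8'))
# prints "True <type 'unicode'>" for every word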

Here's how the strings display on a Windows code page 437 console, using the PYTHONIOENCODING environment variable to make Python replace unsupported characters instead of raising the default UnicodeEncodeError exception:

c:\>set PYTHONIOENCODING=cp437:replace
c:\>py -2 x.py
artists <type 'str'> 'artists'
K├╝nstler <type 'str'> 'K\xc3\xbcnstler'
艺术家 <type 'str'> '\xe8\x89\xba\xe6\x9c\xaf\xe5\xae\xb6'
Митець <type 'str'> '\xd0\x9c\xd0\xb8\xd1\x82\xd0\xb5\xd1\x86\xd1\x8c'
artists <type 'unicode'> u'artists'
Künstler <type 'unicode'> u'K\xfcnstler'
??? <type 'unicode'> u'\u827a\u672f\u5bb6'
?????? <type 'unicode'> u'\u041c\u0438\u0442\u0435\u0446\u044c'

The byte strings are mostly garbage (mojibake), while the Unicode strings degrade sensibly to ? replacement characters, since Chinese and Russian aren't supported by that code page.
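Roughly the same replacement behaviour can be reproduced with an explicit encode (a sketch; cp437 is a standard-library codec):

# -*- coding: utf-8 -*-
u = u'Митець'
print repr(u.encode('cp437', 'replace'))  # '??????' -- unsupported characters become '?'
b = 'Митець'                              # the bytestring is written to the console byte-for-byte,
print repr(b)                             # so a cp437 console renders its UTF-8 bytes as mojibake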

Mark Tolonen
  • Unrelated: Python may use the Unicode API to [print Unicode strings to the Windows console](http://stackoverflow.com/a/32176732/4279), i.e., it might be irrelevant what `chcp` returns. – jfs Mar 02 '16 at 11:53