Python general unicode

Question

I'm having problems understanding unicode in Python 2.7.2, so I tried some tests in IDLE. Two things are marked 'not sure'. Please tell me why they fail. As for the other items, please tell me if my comment is accurate.

>>> s
'Don\x92t '  # s is a string
>>> u
u'Don\u2019t '  # u is a unicode object
>>> type(u)     # confirm u is unicode
<type 'unicode'>
>>> type(s)     # confirm s is string
<type 'str'>
>>> type(s) == 'str' # wrong way to test
False
>>> isinstance(s, str)  # right way to test
True
>>> print s
Don’t       # works because idle can handle strings
>>> print u
Don’t       # works because idle can handle unicode
>>> open('9', 'w').write(s.encode('utf8')) #encode takes unicode, but s is a string,
                                            # so this fails
Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    open('9', 'w').write(s.encode('utf8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 3: ordinal not in range(128)
>>> open('9', 'w').write(s) # write can write strings
>>> open('9', 'w').write(u) # write can't write unicode

Traceback (most recent call last):
  File "<pyshell#30>", line 1, in <module>
    open('9', 'w').write(u)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
>>> open('9', 'w').write(u.encode('utf8'))  # encode turns unicode to string, which write can handle
>>> open('9', 'w').write(s.decode('utf8'))  # decode turns string to unicode, which write can't handle

Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    open('9', 'w').write(s.decode('utf8'))
  File "C:\program files\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3: invalid start byte
>>> e = '{}, {}'.format(s, u) # fails because ''.format is string, while u is unicode

Traceback (most recent call last):
  File "<pyshell#33>", line 1, in <module>
    e = '{}, {}'.format(s, u)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
>>> e = '{}, {}'.format(s, u.encode('utf8')) # works because u.encode is a string
>>> e = u'{}, {}'.format(s, u) # not sure

Traceback (most recent call last):
  File "<pyshell#36>", line 1, in <module>
    e = u'{}, {}'.format(s, u)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 3: ordinal not in range(128)
>>> e = u'{}, {}'.format(s.decode('utf8'), u) # not sure

Traceback (most recent call last):
  File "<pyshell#55>", line 1, in <module>
    e = u'{}, {}'.format(s.decode('utf8'), u)
  File "C:\program files\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3: invalid start byte

>>> e = '\n'.join([s, u]) # wants strings, but u is unicode

Traceback (most recent call last):
  File "<pyshell#37>", line 1, in <module>
    e = '\n'.join([s, u])
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 3: ordinal not in range(128)
>>> e = '\n'.join([s, u.encode('utf8')]) # u.encode is now a string

This is wrong: `>>> type(s) == 'str' # wrong way to test` `type()` is returning a type, not a type name in a string, so your test should be: `type(s) == str`, but you're right, it is better to use `isinstance()` — Paco, Aug 14 '13 at 14:11
Please ask your question(s) in text, and use a question title that summarizes the question(s). Normally, you should ask separate questions separately. — Jukka K. Korpela, Aug 14 '13 at 14:13
If you have problems understanding Unicode in Python 2 and you do not really need python 2, then go ahead and stop trying to understand and install Python 3 instead. In Python 3 all this confusion is gone. — Antti Haapala, Aug 15 '13 at 01:17
@AnttiHaapala The concepts seem plain. It's implementation details, such as the u'{}'.format() matter. I'm dealing with music playlists and a playlist that works in one player may not work in another, which makes things harder. I was having problems with a tag reader, posted here and discovered it was a bug rather than anything I was doing. — foosion, Aug 15 '13 at 01:37

Viktor Kerkez · Accepted Answer · 2013-08-15T00:46:27.277

First, s is not a UTF-8 encoded string; it's probably a cp1250 encoded string. So decoding it as UTF-8 always fails.

>>> e = u'{}, {}'.format(s, u) # not sure

The first "not sure" fails because u'{}, {}' is unicode, so format tries to convert every argument to a unicode string. Because it doesn't know how s is encoded, it assumes ASCII and tries to decode it that way (basically doing s.decode('ascii')). That fails because s is a cp1250 encoded string containing the non-ASCII byte 0x92.
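A minimal sketch of that implicit step, written with explicit b'' and u'' literals so it also runs on Python 3, where the conversion is never done implicitly. The helper name implicit_ascii_decode is made up for illustration; it only spells out what Python 2 does behind the scenes when a str meets a unicode template.

```python
s = b'Don\x92t '        # byte string from the question
u = u'Don\u2019t '      # unicode object from the question


def implicit_ascii_decode(byte_string):
    """Mimic what Python 2 does when a str is used in a unicode context."""
    return byte_string.decode('ascii')


try:
    implicit_ascii_decode(s)
    outcome = 'decoded'
except UnicodeDecodeError as exc:
    # Same failure as in the question: byte 0x92 is outside the ASCII range.
    outcome = 'failed at position {}'.format(exc.start)

print(outcome)  # failed at position 3
```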

>>> e = u'{}, {}'.format(s.decode('utf8'), u) # not sure

The second one fails because you tried to decode s as UTF-8 but, as said earlier, it is actually in some other encoding that is not compatible with UTF-8.
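The "invalid start byte" wording in that traceback can be checked by hand. The sketch below (variable names are illustrative) encodes the unicode value to UTF-8 to show what the bytes would have looked like, then inspects the bit pattern of 0x92.

```python
u = u'Don\u2019t '

# What s would look like if it really were UTF-8.
utf8_bytes = u.encode('utf-8')

# The right single quotation mark takes three bytes in UTF-8.
quote_bytes = bytearray(u'\u2019'.encode('utf-8'))
print([hex(b) for b in quote_bytes])      # ['0xe2', '0x80', '0x99']

# 0x92 is 0b10010010: the leading bits "10" mark a continuation byte,
# which can never begin a UTF-8 sequence.
is_continuation = (0x92 & 0b11000000) == 0b10000000
print(is_continuation)                    # True
```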

edited Aug 15 '13 at 00:46

answered Aug 14 '13 at 14:17

Viktor Kerkez


The second one failed with: UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3: invalid start byte – foosion Aug 14 '13 at 14:27
Oh wait I missed it since I haven't tested the code. `s` is not a valid `utf-8` string. `s` is not encoded as `utf-8`. That is the problem. Valid `utf-8` version of `s` would be `'Don\xe2\x80\x99t '` So the decode fails. – Viktor Kerkez Aug 14 '13 at 14:32
Tested it, `s` is `cp1250` or `cp1251`. So you should use `decode('cp1250')` – Viktor Kerkez Aug 14 '13 at 14:36
Is there code to test which encoding to use? – foosion Aug 14 '13 at 14:38
Unfortunately no. :-/ http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file/436299#436299 – Viktor Kerkez Aug 14 '13 at 14:39
There is also a [libguess](https://pypi.python.org/pypi/python-libguess) but it's a little bit out of date (not that encodings change much :D) but it may have some bugs... – Viktor Kerkez Aug 14 '13 at 14:42
Alas. That link includes "An encoding discovered in the document itself" The document the string came from (an xml playlist from itunes) starts with "" which led me to believe utf8 would work – foosion Aug 14 '13 at 14:43
How did you load the xml? – Viktor Kerkez Aug 14 '13 at 14:50
I loaded it with plistlib – foosion Aug 14 '13 at 14:53
Yes probably someone incorrectly encoded the file. Expat parser (plistlib uses it under the hood) should correctly decode values depending on the xml header. – Viktor Kerkez Aug 14 '13 at 15:11

Martijn Pieters · Answer 2 · 2013-08-14T14:49:52.230

Python 2 will automatically encode Unicode values, or decode string values when mixing string and unicode operations. This is where your confusion stems from.

When writing a Unicode value to a file, for example, Python 2 will try to encode that value to a string. Because no encoding has been specified, the default encoding is used instead, which on Python 2 is ASCII. The same goes for using a str value in a unicode context: Python 2 will decode it using the ASCII codec.
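One way to sidestep the implicit ASCII encode when writing is to open the file with an explicit encoding. This is only a sketch: the scratch file name playlist.txt is hypothetical, and UTF-8 is an assumed target codec. io.open is available on Python 2.6+ and is the same function as the built-in open on Python 3.

```python
import io
import os
import tempfile

u = u'Don\u2019t '

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'playlist.txt')

# io.open encodes on write with the codec we name, so no implicit
# ASCII conversion is attempted.
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(u)

# Reading the raw bytes back shows the UTF-8 encoding of the text.
with io.open(path, 'rb') as handle:
    raw = handle.read()

print(repr(raw))
```

The file object returned by io.open in text mode accepts only unicode, so passing it an undecoded byte string raises a TypeError instead of silently guessing a codec.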

Your sample values, however, contain a codepoint or byte that is not representable as an ASCII character, so the automatic conversions fail. The UnicodeEncodeError or UnicodeDecodeError exceptions you see are the result of the automatic conversions.

Specifically, e = u'{}, {}'.format(s, u) tries to decode s to Unicode to interpolate it into the unicode u'{}, {}' template string.

To avoid automatic conversions, you thus need to use explicit conversions instead. And to use explicit conversions, you need to know the right encoding used for your byte string, or what codec you are targeting when encoding unicode.
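A common way to apply this is to decode byte strings as soon as they enter the program, work with unicode throughout, and encode once on output. The sketch below assumes s is cp1252 (as discussed in this thread) and that UTF-8 is the output codec; to_text is a made-up helper name.

```python
s = b'Don\x92t '        # bytes; assumed to be cp1252
u = u'Don\u2019t '      # already unicode


def to_text(value, encoding):
    """Return unicode text, decoding byte strings with the given codec."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Decode at the boundary, then work with unicode only.
parts = [to_text(s, 'cp1252'), to_text(u, 'cp1252')]
line = u'{}, {}'.format(*parts)
joined = u'\n'.join(parts)

# Encode once, explicitly, when bytes are needed for output.
payload = line.encode('utf-8')
```

With both values as unicode, the format and join calls from the question no longer trigger any implicit conversion.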

Your computer is a Windows machine configured to use a Latin-1-like codepage, either the 1250 or 1252 codepage. That is why printing the \x92 byte prints a ’ when you write that byte to the terminal directly.

Python knows your computer is configured with that codepage; if you print sys.stdout.encoding, you'll see cp1250 or cp1252 or similar. That is why Python knows how to print a Unicode value, and why you see the ’ character when printing the \u2019 codepoint.
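A short sketch for checking this on your own machine. It prints the console encoding Python detected, then decodes the 0x92 byte under a few candidate codepages. The list of codecs is just an illustrative selection.

```python
import sys

# May be None when output is piped, hence the fallback label.
console_encoding = getattr(sys.stdout, 'encoding', None) or 'unknown'
print(console_encoding)

byte = b'\x92'

# Several Windows codepages map 0x92 to the same character, so the byte
# alone cannot tell them apart.
decoded = dict((codec, byte.decode(codec))
               for codec in ('cp1250', 'cp1251', 'cp1252'))

# Latin-1 proper maps 0x92 to a C1 control character instead.
latin1_char = byte.decode('latin-1')
```

All three Windows codepages yield u'\u2019' here, which is part of why guessing the encoding from the bytes alone is unreliable.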

Your s value is not encoded in UTF-8, however. Trying to decode that value as UTF-8 will thus fail. You need to decode from cp1252 instead:

>>> '\x92'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte
>>> '\x92'.decode('cp1252')
u'\u2019'

If you use u'{}, {}'.format(s.decode('cp1252'), u), then no exception will be thrown as s can be decoded to Unicode correctly.

FWIW, the s value came from an itunes xml playlist starting with "" I suppose I should take it up with Apple :( — foosion, Aug 14 '13 at 14:52
@foosion: are you certain? A proper XML parser gives you *unicode* values from such a file.. — Martijn Pieters, Aug 14 '13 at 14:53
I'm certain the xml file was produced by itunes and that the header started with I cut and paste s from the file into this question. I posted after reading the xml with plistlib and doing some processing failed unicode errors. I'll run some more tests. — foosion, Aug 14 '13 at 14:56
See http://stackoverflow.com/questions/18275020/python-2-7-2-plistlib-with-itunes-xml/ It seems I may have found a bug in plistlib — foosion, Aug 16 '13 at 20:10
