
(This question is related to this one)

Take a look at the following session:

Python 2.7.3 (default, Jan  2 2013, 16:53:07) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> import simplejson as json
>>> 
>>> my_json = '''[
...   {
...     "id" : "normal",
...     "txt" : "This is a normal entry"
...   },
...   {
...     "id" : "αβγδ",
...     "txt" : "This is a unicode entry"
...   }
... ]'''
>>> 
>>> cache = json.loads(my_json, encoding='utf-8')
>>> 
>>> cache
[{'txt': 'This is a normal entry', 'id': 'normal'}, {'txt': 'This is a unicode entry', 'id': u'\u03b1\u03b2\u03b3\u03b4'}]

Why does the JSON decoder sometimes produce unicode and sometimes plain strings? Isn't it supposed to always produce unicode?

blueFast
  • That's the behavior I'd expect in Python 2. Why is this a problem? – Aaron Digulla Oct 31 '13 at 08:57
    Well, I assume it could cause a dozen bazillion problems. My specific problem is `UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal` because up until now I was assuming json load produced strings. Now I see it should produce unicode. And, after seeing that, I realize simplejson does not follow the spec and produces *sometimes* unicode and *sometimes* strings. That means I have to **undo** the *optimization* that simplejson is implementing, since I either need all strings unicode or all strings strings, but never a mixture. – blueFast Oct 31 '13 at 09:53

1 Answer


This appears to be an optimization in simplejson. From the simplejson docs:

If s is a str then decoded JSON strings that contain only ASCII characters may be parsed as str for performance and memory reasons. If your code expects only unicode the appropriate solution is decode s to unicode prior to calling decode.

Note: Every ASCII character is encoded identically in ASCII and UTF-8, so ASCII is a subset of UTF-8.
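The fix the docs suggest can be sketched roughly as follows: decode the input to unicode first, then hand it to `loads`. This is only a minimal sketch. The helper name `loads_unicode` is made up for illustration, and the fallback to the standard library `json` is an assumption added so the snippet still runs where simplejson is not installed.

```python
# -*- coding: utf-8 -*-
try:
    import simplejson as json  # the library discussed in the question
except ImportError:
    import json  # assumption: fall back to the stdlib module if simplejson is missing

# UTF-8 encoded input, as it would arrive from a file or a socket.
# The second id is the Greek string from the question, written as escaped bytes.
raw = b'[{"id": "normal"}, {"id": "\xce\xb1\xce\xb2\xce\xb3\xce\xb4"}]'


def loads_unicode(data, encoding='utf-8'):
    """Hypothetical helper: decode byte input to unicode, then parse it."""
    if isinstance(data, bytes):  # `bytes` is an alias of `str` on Python 2
        data = data.decode(encoding)
    return json.loads(data)


cache = loads_unicode(raw)
ids = [entry['id'] for entry in cache]
```

With unicode input the decoder has no byte string to hand back, so both ids should come out as unicode, including the pure-ASCII one.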

Lycha
  • Yes but the type of the objects in the dict is str, not unicode – jbat100 Oct 31 '13 at 09:08
  • Ahhh! So I just need to decode the input json to unicode and then pass it to load? That looks doable! – blueFast Oct 31 '13 at 09:58
    @gonvaled Note that [the standard library `json` module is documented](http://docs.python.org/2/library/json.html#json.JSONDecoder) to convert JSON `string` type to Python `unicode`. – Janne Karila Oct 31 '13 at 10:05
  • @JanneKarila: and I assume that library is not `simplejson`? Any reason (apart from this one) to switch to the stock library? I have been relying on simplejson for ages. – blueFast Oct 31 '13 at 10:07