291

I'm using Python 2 to parse JSON from ASCII-encoded text files.

When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some libraries that only accept string objects, and I can't change or update those libraries.

Is it possible to get string objects instead of Unicode ones?

Example

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b']  # I want these to be of type `str`, not `unicode`

Update

This question was asked a long time ago, when I was stuck with Python 2. One easy and clean solution today is to use a recent version of Python, i.e. Python 3 and onward.
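
For comparison, here is the round trip from the example above run under Python 3 (a quick sketch):

>>> import json
>>> json.loads(json.dumps(['a', 'b']))
['a', 'b']  # items are plain `str` in Python 3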

Brutus
  • There is no problem under Python 3; the type of items in new_list is `str` – GoingMyWay Jun 13 '17 at 03:29
  • Python 3k is not a 'recent version of Python'; it is just an alternative branch. – user2589273 Dec 19 '17 at 01:48
  • It's strange to see such a comment in Dec 2017 - Python 2 is deprecated and no maintenance will happen after Jan 1 2020, which is less than 2 years away: https://pythonclock.org/ – Zaar Hai Apr 19 '18 at 04:52
  • @ZaarHai A LOT of people are stuck with Python 2 against their will. Many applications embed their own Python version for automation and scripting, so people have to use it until the vendor updates (I'm looking at you, Maya, Houdini, Nuke..) – Geordie Jun 21 '18 at 01:28
  • @Geordie I surely know and understand that. My comment was about terminology - Python 3 is not an "alternative branch", but rather an unfortunate lack of an alternative (pun intended) for those who are stuck with it. – Zaar Hai Jun 22 '18 at 03:24
  • Thanks a lot for the update! It saved me a lot of time – user1993 Jun 13 '19 at 18:15
  • Python 3 removed the u'' prefix in my case while writing code in AWS Lambda – Narendra Maru Jun 17 '19 at 07:05

21 Answers

185

While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str-type strings instead of the unicode type. Because JSON is a subset of YAML, it works nicely:

>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']

Notes

Some things to note though:

  • I get string objects because all my entries are ASCII-encoded. If I used Unicode-encoded entries, I would get them back as unicode objects — there is no conversion!

  • You should (probably always) use PyYAML's safe_load function; if you use it to load JSON files, you don't need the "additional power" of the load function anyway.

  • If you want a YAML parser that has more support for the 1.2 version of the spec (and correctly parses very low numbers), try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.

Conversion

As stated, there is no conversion! If you can't be sure you're only dealing with ASCII values (and most of the time you can't), you'd better use a conversion function:

I've used the one from Mark Amery a couple of times now; it works great and is very easy to use. You can also use a similar function as an object_hook instead, as it might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.

Brutus
  • Take a little care if you decide to use this answer. It works perfectly for Brutus's case, but only because he knows that his data only contains ASCII-encodable characters. If you don't have that guarantee, this answer won't work. For example, try executing `yaml.load(json.dumps([u'a', u'£', u'É']))` at the Python shell and observe that you get back `['a', u'\xa3', u'\xc9']` (which contains `unicode` strings). If you can't be sure that your data only contains characters from the ASCII character set, you should use a different approach instead (I recommend my own answer). – Mark Amery Sep 06 '14 at 16:49
  • YAML can also give you `[u'a', u'b']`; be careful. – Carlos Calla Sep 15 '14 at 22:31
  • This is nice, but it does not work with low numbers; look here: http://stackoverflow.com/questions/30458977/yaml-loads-5e-6-as-string-and-not-a-number – Oren May 26 '15 at 12:46
  • @Oren: This is not an error in the [YAML spec](http://yaml.org/spec/1.2/spec.html#id2804092) but in the PyYAML parser. The [YAML parser from ruamel](https://bitbucket.org/ruamel/yaml) works. – Brutus Jun 04 '15 at 10:21
  • I want to have output like ["a", "b"], not like ['a', 'b'] @Brutus – user60679 Nov 21 '15 at 09:18
  • Awesome and easiest solution; must be the selected answer! You will need PyYAML if you are using pip: use sudo pip install pyyaml, or on Ubuntu sudo apt-get install python-yaml – Zohair May 17 '16 at 12:33
  • I've found YAML is quite slow so you may not want to do this if performance is important. See also: https://stackoverflow.com/questions/27743711/can-i-speedup-yaml – szmoore May 26 '17 at 05:32
145

There's no built-in option to make the json module functions return byte strings instead of unicode strings. However, this short and simple recursive function will convert any decoded JSON object from using unicode strings to UTF-8-encoded byte strings:

def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

Just call this on the output you get from a json.load or json.loads call.
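
For example, continuing the question's example (a quick sketch):

>>> import json
>>> byteify(json.loads('["a", "b"]'))
['a', 'b']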

A couple of notes:

  • To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren't supported until Python 2.7.
  • Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that manages to pull this off correctly, although as a consequence, it's significantly more complicated than my approach.
Mark Amery
  • I like this - it is not ignoring the issue - it is recognizing that when people say "strings" and "ascii" they mostly naively meant they wanted bytes, not theoretical unicode characters (and not ascii, as they still want pound signs at the other end). – Danny Staple Feb 16 '15 at 17:26
  • I like this; it works almost the same way as my pretty printer does. Since I know that JSON doesn't produce tuples, you should add an exception for tuples too. – y.petremann Feb 26 '15 at 23:10
  • This is horribly inefficient, requiring you to recursively traverse nodes that you may not need to. The json module gives you hooks to do this much more efficiently. The answer below using `object_hook` is actually far worse than this one, but using `object_pairs_hook` you can come up with a [reasonably efficient method](http://stackoverflow.com/a/34796078/62660) that requires no recursion or revisiting of nodes that do not contain strings. – Travis Jensen Jan 14 '16 at 18:38
  • @TravisJensen Interesting. The `object_pairs_hook` method is perhaps very slightly harder to understand than this one (you need to understand how the parameter works and why lists and dicts require different handling), and the performance benefit won't matter to most people... but I'd expect it to exist, especially for anybody dealing with an unusually deeply nested JSON object. – Mark Amery Jan 14 '16 at 18:58
  • +1 This is the most concise answer; besides, PyYAML is a pain to install. The only thing better would be to somehow micro-stream the conversion so it doesn't use 4X memory. – personal_cloud Sep 27 '17 at 17:40
113

A solution with object_hook

[edit]: Updated for Python 2.7 and 3.x compatibility.

import json

def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )

def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )

def _byteify(data, ignore_dicts = False):
    if isinstance(data, str):
        return data

    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item, ignore_dicts=True) for item in data ]
    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.items() # changed to .items() for python 2.7/3
        }

    # python 3 compatible duck-typing
    # if this is a unicode string, return its string representation
    if str(type(data)) == "<type 'unicode'>":
        return data.encode('utf-8')

    # if it's anything else, return it in its original form
    return data

Example usage:

>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}

How does this work and why would I use it?

Mark Amery's function is shorter and clearer than these ones, so what's the point of them? Why would you want to use them?

Purely for performance. Mark's answer decodes the JSON text fully first with unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:

  • A copy of the entire decoded structure gets created in memory
  • If your JSON object is really deeply nested (500 levels or more) then you'll hit Python's maximum recursion depth

This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:

object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders

Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they're decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.

Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which gets passed to it at all times except when object_hook passes it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts, since they have already been byteified.

Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads to handle the case where the JSON text being decoded doesn't have a dict at the top level.

Mirec Miskuf
  • +1 for the approach here; I didn't really grasp it when I first read it, but finally understood when rereading it in light of Travis Jensen's answer. I've made a pretty aggressive edit in the hopes of clarifying how it works and what its advantages over my answer are. The core idea of the code remains untouched, but I've modified pretty much everything else. Feel free to roll back my edit if you object to this - it's your answer! – Mark Amery Jan 23 '16 at 11:28
  • No problem Mark, many thanks. I like your edit, it is much more explanatory than my original. Maybe, one day, I'll learn to give more concise answers. – Mirec Miskuf Jan 23 '16 at 21:08
  • This is a great solution; efficient and elegant. However, if you're stuck in the realm of Python < 2.7, as I am, you will need to replace the line `return { _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True) for key, value in data.iteritems() }` with `return dict((_byteify(key, ignore_dicts=True), _byteify(value, ignore_dicts=True)) for key, value in data.iteritems())` for it to work. – Richard Dunn Apr 19 '16 at 10:27
  • I think you're wrong about the recursion depth issue. With yours, I can go up to 990: `json_loads_byteified('[' * 990 + ']' * 990)`. With 991 it crashes. Mark's still works with 991: `byteify(json.loads('[' * 991 + ']' * 991))`. It crashes at 992. So at least in this test, Mark's can go deeper, contrary to what you said. – Stefan Pochmann May 03 '17 at 21:26
  • @MarkAmery What do you think about my above comment? (I just saw in the edit history that it was actually you who added that claim). – Stefan Pochmann May 03 '17 at 21:32
  • @StefanPochmann Hmm. I don't perfectly recall what I tested 16 months ago, but something that immediately jumps out at me is that you're testing deeply-nested lists/arrays rather than deeply-nested dicts/objects. `object_hook` only gets invoked on each dict/object, not on each list/array, so it's unsurprising that your *particular* test doesn't show the avoidance of the recursion limit that my edit described. Retry with deeply-nested dicts/objects instead, and I expect you'll see a different result! – Mark Amery May 03 '17 at 21:57
  • @MarkAmery Ah, yes, with `'{"a": ' * n + '1' + '}' * n` yours can only go up to n=496. Mirec's isn't safe, either, but works up to n=984. Yours however works up to 990 if I replace its dictionary comprehension with a for loop that builds a dictionary. For some reason, dictionary comprehensions are turned into functions, doubling the recursion depth. – Stefan Pochmann May 03 '17 at 22:30
  • Your code is already generic; remove ignore_dicts and it will still work. One need not worry about ignore_dicts, it is already taken care of, like: `def _byteify(data): if isinstance(data, unicode): return data.encode('utf-8'); if isinstance(data, list): return [ _byteify(item) for item in data ]; if isinstance(data, dict): return { _byteify(key): _byteify(value) for key, value in data.iteritems() }; return data` – Ram Sharan Mittal May 11 '17 at 06:58
  • @RichardDunn unfortunately the world has shifted the other way, and now you'll have to be content with being incompatible with `data.items()` instead, since that is the python 2.7 and 3.x compatible method – Orwellophile Apr 02 '21 at 14:49
74

You can use the object_hook parameter for json.loads to pass in a converter. You don't have to do the conversion after the fact. The json module will only ever pass dicts to object_hook, and it will pass nested dicts in recursively, so you don't have to recurse into nested dicts yourself. I don't think I would convert unicode strings to numbers like Wells shows. If it's a unicode string, it was quoted as a string in the JSON file, so it is supposed to be a string (or the file is bad).

Also, I'd try to avoid doing something like str(val) on a unicode object. You should use value.encode(encoding) with a valid encoding, depending on what your external lib expects.
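
To see the difference in a Python 2 shell (a quick illustration; str() implicitly uses the ASCII codec):

>>> str(u'£')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
>>> u'£'.encode('utf-8')
'\xc2\xa3'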

So, for example:

def _decode_list(data):
    rv = []
    for item in data:
        if isinstance(item, unicode):
            item = item.encode('utf-8')
        elif isinstance(item, list):
            item = _decode_list(item)
        elif isinstance(item, dict):
            item = _decode_dict(item)
        rv.append(item)
    return rv

def _decode_dict(data):
    rv = {}
    for key, value in data.iteritems():
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        elif isinstance(value, list):
            value = _decode_list(value)
        elif isinstance(value, dict):
            value = _decode_dict(value)
        rv[key] = value
    return rv

obj = json.loads(s, object_hook=_decode_dict)
Mike Brennan
  • This is fine if the object in `s` is a JSON `Object` (an unordered collection of key:value pairs with the ':' character separating the key and the value, comma-separated and enclosed in curly braces), but not if it's, say, a JSON `Array`. So if given a JSON `Array` like `["a", "b"]`, the result will still be `[u'a', u'b']`. None of the other currently available customizing hook-type parameters for `json.loads()` can do the job either. – martineau Dec 27 '12 at 16:32
  • Since, as you mentioned, the `json` module will recursively pass in nested `dict`s, it's unnecessary to check for them in the two functions -- so the two `elif` clauses that check for them should be removed. – martineau Dec 27 '12 at 17:40
  • Note that starting function names with an underscore has a special meaning for import statements. If you put these functions in a file called Utility.py and in another file do `from Utility import *`, the functions will *not* be seen because of that underscore. – M Katz Jan 17 '13 at 20:34
  • This is a really bad idea. `object_hook` gets called for every json object parsed, so if you recurse into what is given you, you are re-"byteifying" things that you have already "byteified". Performance is going to grow geometrically with the size of the object. I've included an answer [here](http://stackoverflow.com/a/34796078/62660) that uses `object_pairs_hook` and doesn't suffer from that problem. – Travis Jensen Jan 14 '16 at 18:35
38

That's because JSON makes no distinction between string objects and unicode objects; they're all strings in JavaScript.

I think JSON is right to return unicode objects. In fact, I wouldn't accept anything less, since JavaScript strings are in fact unicode objects (i.e. JSON (JavaScript) strings can store any kind of Unicode character), so it makes sense to create unicode objects when translating strings from JSON. Plain strings just wouldn't fit, since the library would have to guess the encoding you want.

It's better to use unicode string objects everywhere. So your best option is to update your libraries so they can deal with unicode objects.

But if you really want bytestrings, just encode the results to the encoding of your choice:

>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']
nosklo
  • Thanks nosklo, that is what I did first. But as I said, the real data I used is pretty nested and all, so this introduced quite some overhead. I'm still looking for an automatic solution... There's at least one bug report out there where people complain about simplejson returning string objects instead of unicode. – Brutus Jun 05 '09 at 17:23
  • @Brutus: I think json is right to return unicode objects. In fact, I wouldn't accept anything less, since javascript strings are in fact unicode objects. What I mean is that json (javascript) strings can store any kind of unicode character, so it makes sense to create unicode objects when translating from json. You should really fix your libraries instead. – nosklo Jun 05 '09 at 18:27
  • Unless you have a Python library which passes to a C lib under the hood that expects an ASCII string. I have that situation, and the bound C lib is raising `argument of type 'std::string const &'`. – MikeyE Dec 31 '20 at 02:30
16

There exists an easy work-around.

TL;DR - Use ast.literal_eval() instead of json.loads(). Both ast and json are in the standard library.

While not a 'perfect' answer, it gets one pretty far if your plan is to ignore Unicode altogether. In Python 2.7:

import json, ast
d = { 'field' : 'value' }
print "JSON Fail: ", json.loads(json.dumps(d))
print "AST Win:", ast.literal_eval(json.dumps(d))

gives:

JSON Fail:  {u'field': u'value'}
AST Win: {'field': 'value'}

This gets more hairy when some objects are really Unicode strings. The full answer gets hairy quickly.
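
For example, JSON's null, true and false aren't valid Python literals, so this raises an exception (a quick illustration in Python 2.7):

>>> import ast
>>> ast.literal_eval('{"field": null}')
Traceback (most recent call last):
  ...
ValueError: malformed string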

Charles Merriam
  • Better be sure your json doesn't contain any `null`, `true`, or `false` values, because they aren't valid in python and will cause `literal_eval()` to fail. – ʇsәɹoɈ Oct 10 '14 at 08:34
  • @ʇsәɹoɈ Also better hope your JSON doesn't contain an escaped solidus (`\/`) inside a string, or a unicode escape sequence (like `"\u0061"`, which is another way of writing `"a"`). Python's literal syntax is incompatible with JSON in several ways, and I wouldn't trust this answer for any script that I wasn't going to throw away. – Mark Amery Feb 16 '15 at 21:19
  • People are right that if the string really is unicode then this answer fails, but if that were the case we wouldn't be able to cast to a string anyways. +1 for an answer that works only when it works and throws an exception otherwise – Stefan Sullivan Jul 13 '16 at 22:38
  • if possible don't use `json` to dump the data, just use `print` if running python. Then `ast.literal_eval` works – Jean-François Fabre Mar 14 '18 at 16:43
12

Mike Brennan's answer is close, but there is no reason to re-traverse the entire structure. If you use the object_pairs_hook (Python 2.7+) parameter:

object_pairs_hook is an optional function that will be called with the result of any object literal decoded with an ordered list of pairs. The return value of object_pairs_hook will be used instead of the dict. This feature can be used to implement custom decoders that rely on the order that the key and value pairs are decoded (for example, collections.OrderedDict will remember the order of insertion). If object_hook is also defined, the object_pairs_hook takes priority.

With it, you get each JSON object handed to you, so you can do the decoding with no need for recursion:

def deunicodify_hook(pairs):
    new_pairs = []
    for key, value in pairs:
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        new_pairs.append((key, value))
    return dict(new_pairs)

In [52]: open('test.json').read()
Out[52]: '{"1": "hello", "abc": [1, 2, 3], "def": {"hi": "mom"}, "boo": [1, "hi", "moo", {"5": "some"}]}'                                        

In [53]: json.load(open('test.json'))
Out[53]: 
{u'1': u'hello',
 u'abc': [1, 2, 3],
 u'boo': [1, u'hi', u'moo', {u'5': u'some'}],
 u'def': {u'hi': u'mom'}}

In [54]: json.load(open('test.json'), object_pairs_hook=deunicodify_hook)
Out[54]: 
{'1': 'hello',
 'abc': [1, 2, 3],
 'boo': [1, 'hi', 'moo', {'5': 'some'}],
 'def': {'hi': 'mom'}}

Notice that I never have to call the hook recursively since every object will get handed to the hook when you use the object_pairs_hook. You do have to care about lists, but as you can see, an object within a list will be properly converted, and you don't have to recurse to make it happen.

EDIT: A coworker pointed out that Python 2.6 doesn't have object_pairs_hook. You can still use this with Python 2.6 by making a very small change. In the hook above, change:

for key, value in pairs:

to

for key, value in pairs.iteritems():

Then use object_hook instead of object_pairs_hook:

In [66]: json.load(open('test.json'), object_hook=deunicodify_hook)
Out[66]: 
{'1': 'hello',
 'abc': [1, 2, 3],
 'boo': [1, 'hi', 'moo', {'5': 'some'}],
 'def': {'hi': 'mom'}}

Using object_pairs_hook results in one less dictionary being instantiated for each object in the JSON document, which, if you were parsing a huge document, might be worthwhile.

Travis Jensen
  • This is neat and seems very close to deserving the green checkmark (which Brutus has, admirably, already passed around liberally as better answers have come in). But... why not actually handle lists properly in the `deunicodify_hook` that you exhibit in this answer? At the moment, you have an implementation of `deunicodify_hook` that doesn't iterate over lists and deunicodify the strings and lists within them, and thus the output you're exhibiting does *not* match the output your hook will actually produce. Fix that, and this answer will be superior to mine. – Mark Amery Jan 14 '16 at 19:04
  • Frivolous: I'd also suggest demonstrating the function with the ordinary CPython interpreter rather than the one you're using here (which I think is IronPython)? The CPython interpreter is more familiar to most Python users and is, in my opinion, prettier. – Mark Amery Jan 14 '16 at 19:06
  • This doesn't work for me but I'm sure it's some quirk of what I'm doing ... I'm storing one list from a larger json doc to a file. Whether I load it with or without this object_pairs_hook, every item comes up unicode. Darn. – rsaw Jan 17 '16 at 07:28
  • @rsaw Good point! Since the `object_pairs_hook` only gets called for *objects*, if your JSON text has a list of strings at the top level, this solution will fail. There's no way to fix this without calling some function on the thing returned from `json.load`; none of the `json.load` hooks can guarantee you'll be able to deal with every string. I think this is a big enough flaw for me to keep recommending my solution over using the hooks. – Mark Amery Jan 17 '16 at 13:56
  • -1 because I just realised that Mirec Miskuf [already posted](http://stackoverflow.com/a/33571117/1709587) an object-hook answer that neither has the disadvantages of Mike Brennan's approach (re-byteifies the same dictionaries multiple times) nor of this one (fails to byteify nested lists or top-level lists or strings). I'm not sure why his answer has languished with almost no attention while this one - which is inferior - has rapidly gained votes. – Mark Amery Jan 23 '16 at 10:42
9

I'm afraid there's no way to achieve this automatically within the simplejson library.

The scanner and decoder in simplejson are designed to produce unicode text. To do this, the library uses a function called c_scanstring (if it's available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You'd have to either monkeypatch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.

The reason that simplejson outputs unicode, however, is that the json spec specifically mentions that "A string is a collection of zero or more Unicode characters"... support for unicode is assumed as part of the format itself. Simplejson's scanstring implementation goes so far as to scan and interpret unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as unicode.

If you have an aged library that needs a str, I recommend you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid... sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.
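
As an illustration of the facade approach, here is a minimal sketch (legacy_func stands in for a hypothetical str-only function; UTF-8 is an assumed encoding choice):

def call_with_bytes(func, *args, **kwargs):
    """Encode any unicode arguments to UTF-8 byte strings, then call func."""
    def to_bytes(value):
        if isinstance(value, unicode):
            return value.encode('utf-8')
        return value
    args = [to_bytes(arg) for arg in args]
    kwargs = dict((key, to_bytes(value)) for key, value in kwargs.iteritems())
    return func(*args, **kwargs)

# usage: result = call_with_bytes(legacy_func, parsed_json[u'some key'])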

Jarret Hardie
4

As Mark (Amery) correctly notes: using PyYAML's deserializer on a JSON dump works only if you have ASCII-only data. At least out of the box.

Two quick comments on the PyYaml approach:

  1. NEVER use yaml.load on data from the field. It's a feature(!) of YAML to execute arbitrary code hidden within the structure.

  2. You can make it work for non-ASCII data too, via this:

    def to_utf8(loader, node):
        return loader.construct_scalar(node).encode('utf-8')
    yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)
    

But performance-wise it's no match for Mark Amery's answer:

Throwing some deeply nested sample dicts onto the two methods, I get this (with dt[j] = time delta of json.loads(json.dumps(m))):

     dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
     dt[byteify recursion(Mark Amery)] =~   5 * dt[j]

So deserialization, including fully walking the tree and encoding, is well within an order of magnitude of json's C-based implementation. I find this remarkably fast, and it's also more robust than the YAML load with deeply nested structures. And it's less security-error-prone, looking at yaml.load.

=> While I would appreciate a pointer to a C-only-based converter, the byteify function should be the default answer.

This holds especially true if your JSON structure is from the field, containing user input, because then you probably need to walk over your structure anyway - independent of your desired internal data structures ('unicode sandwich' or byte strings only).

Why?

Unicode normalisation. For the unaware: Take a painkiller and read this.

So using the byteify recursion you kill two birds with one stone:

  1. get your bytestrings from nested json dumps
  2. get user input values normalised, so that you find the stuff in your storage.

In my tests it turned out that replacing input.encode('utf-8') with unicodedata.normalize('NFC', input).encode('utf-8') was even faster than without NFC - but that's heavily dependent on the sample data, I guess.
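
For illustration, a minimal sketch of the byteify recursion with that normalization step folded in (Python 2; NFC is assumed to be the normalization form you want):

import unicodedata

def byteify_nfc(input):
    if isinstance(input, dict):
        return {byteify_nfc(key): byteify_nfc(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify_nfc(element) for element in input]
    elif isinstance(input, unicode):
        # normalize first so visually identical strings get identical bytes
        return unicodedata.normalize('NFC', input).encode('utf-8')
    else:
        return input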

Red Pill
3

The gotcha is that simplejson and json are two different modules, at least in the manner in which they deal with unicode. You have json in Python 2.6+, and this gives you unicode values, whereas simplejson returns string objects. Just try easy_install-ing simplejson in your environment and see if that works. It did for me.
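
Which behavior you get depends on the simplejson version installed, so a quick check is worthwhile (a sketch):

>>> import simplejson
>>> type(simplejson.loads('["a"]')[0])  # str or unicode, depending on the version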

ducu
2

Just use pickle instead of json for dump and load, like so:

    import json
    import pickle

    d = { 'field1': 'value1', 'field2': 2, }

    json.dump(d,open("testjson.txt","w"))

    print json.load(open("testjson.txt","r"))

    pickle.dump(d,open("testpickle.txt","w"))

    print pickle.load(open("testpickle.txt","r"))

The output it produces is (strings and integers are handled correctly):

    {u'field2': 2, u'field1': u'value1'}
    {'field2': 2, 'field1': 'value1'}
Stefan Gruenwald
  • +1 for a solution that doesn't require additional packages (like *yaml*). But sometimes - like in my original case - I need to have the data in JSON, so *pickle* is not always the best option. Besides, you have `safe_load` in YAML; I don't know if something similar exists for *pickle*. – Brutus Apr 29 '14 at 10:24
1

I had a JSON dict as a string. The keys and values were unicode objects like in the following example:

myStringDict = "{u'key':u'value'}"

I could use the byteify function suggested above by converting the string to a dict object using ast.literal_eval(myStringDict).
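
Putting the two steps together (a sketch; byteify as defined in Mark Amery's answer above):

>>> import ast
>>> myStringDict = "{u'key':u'value'}"
>>> byteify(ast.literal_eval(myStringDict))
{'key': 'value'}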

narko
  • The example you have given is not an example of JSON. `{u'key':u'value'}` is not JSON. – Mark Amery Feb 16 '15 at 21:54
  • I know perfectly well it is not JSON. That's how it was parsed from an external source in my Python script. If it were JSON directly, like in the following example, I wouldn't need the byteify function marked as the solution: {"firstName":"John", "lastName":"Doe"}. It would be just great if you read the answers before voting. Thanks. – narko Feb 18 '15 at 21:33
1

So, I've run into the same problem. Guess what was the first Google result.

Because I need to pass all data to PyGTK, unicode strings aren't very useful to me either. So I have another recursive conversion method. It's actually also needed for typesafe JSON conversion - json.dump() would bail on any non-literals, like Python objects. It doesn't convert dict keys, though.

# removes any objects, turns unicode back into str
def filter_data(obj):
    if type(obj) in (int, float, str, bool):
        return obj
    elif type(obj) == unicode:
        return str(obj)
    elif type(obj) in (list, tuple, set):
        obj = list(obj)
        for i, v in enumerate(obj):
            obj[i] = filter_data(v)
    elif type(obj) == dict:
        for i, v in obj.iteritems():
            obj[i] = filter_data(v)
    else:
        print "invalid object in data, converting to string"
        obj = str(obj)
    return obj
mario
  • The only problem that might come up here is if you need the keys in a dictionary converted from unicode. Though this implementation will convert the values, it maintains the unicode keys. If you create a 'newobj', use newobj[str(i)] = ..., and assign obj = newobj when you're done, the keys will be converted as well. – Neal Stublen Sep 30 '10 at 14:43
  • This could be prettier with comprehensions or better by converting keys. It's also unidiomatic; it both mutates objects in place (in the case of dictionaries) and returns the new value, which is inconsistent with Python's built-in collection methods which either mutate the current object or return a new one, but not both. – Mark Amery Feb 16 '15 at 21:24
1

Support Python 2 & 3 using a hook (from https://stackoverflow.com/a/33571117/558397):

import requests
import six
from six import iteritems

requests.packages.urllib3.disable_warnings()  # @UndefinedVariable
r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)

def _byteify(data):
    # if this is a unicode string, return its string representation
    if isinstance(data, six.string_types):
        return str(data.encode('utf-8').decode())

    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item) for item in data ]

    # if this is a dictionary, return dictionary of byteified keys and values
    if isinstance(data, dict):
        return {
            _byteify(key): _byteify(value) for key, value in iteritems(data)
        }
    # if it's anything else, return it in its original form
    return data

w = r.json(object_hook=_byteify)
print(w)

Returns:

 {'three': '', 'key': 'value', 'one': 'two'}
abarik
0

This is late to the game, but I built this recursive caster. It works for my needs and I think it's relatively complete. It may help you.

def _parseJSON(self, obj):
    newobj = {}

    for key, value in obj.iteritems():
        key = str(key)

        if isinstance(value, dict):
            newobj[key] = self._parseJSON(value)
        elif isinstance(value, list):
            if key not in newobj:
                newobj[key] = []
                for i in value:
                    newobj[key].append(self._parseJSON(i))
        elif isinstance(value, unicode):
            val = str(value)
            if val.isdigit():
                val = int(val)
            else:
                try:
                    val = float(val)
                except ValueError:
                    val = str(val)
            newobj[key] = val

    return newobj

Just pass it a JSON object like so:

obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)

I have it as a private member of a class, but you can repurpose the method as you see fit.

Wells
  • I've run into a problem where I'm trying to parse JSON and pass the resulting mapping to a function as **kwargs. It looks like function parameter names cannot be unicode, so your _parseJSON function is great. If there's an easier way, someone can let me know. – Neal Stublen Sep 30 '10 at 14:35
  • This code has a problem - you make a recursive call in the List piece, which is going to fail if the elements of the list are not themselves dictionaries. – I82Much Oct 14 '10 at 14:00
  • Besides the bug described by @I82Much, this is also badly named (it doesn't actually parse the JSON; a `json.loads` call is needed first), arbitrarily tries to convert strings to ints for no explained reason, and isn't copy-and-paste ready. – Mark Amery Feb 16 '15 at 21:29
0

I rewrote Wells's _parseJSON() to handle cases where the JSON object itself is an array (my use case).

def _parseJSON(self, obj):
    if isinstance(obj, dict):
        newobj = {}
        for key, value in obj.iteritems():
            key = str(key)
            newobj[key] = self._parseJSON(value)
    elif isinstance(obj, list):
        newobj = []
        for value in obj:
            newobj.append(self._parseJSON(value))
    elif isinstance(obj, unicode):
        newobj = str(obj)
    else:
        newobj = obj
    return newobj
0

Here is a recursive encoder written in C: https://github.com/axiros/nested_encode

The performance overhead for "average" structures is around 10% compared to json.loads:

python speed.py                                                                                            
  json loads            [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster..
  json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster.
  time overhead in percent: 9%

using this test structure:

import json, nested_encode, time

s = """
{
  "firstName": "Jos\\u0301",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "\\u00d6sterreich",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null,
  "a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
}
"""


t1 = time.time()
for i in xrange(10000):
    u = json.loads(s)
dt_json = time.time() - t1

t1 = time.time()
for i in xrange(10000):
    b = nested_encode.encode_nested(json.loads(s))
dt_json_enc = time.time() - t1

print "json loads            [%.2fsec]: %s..." % (dt_json, str(u)[:20])
print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])

print "time overhead in percent: %i%%"  % (100 * (dt_json_enc - dt_json)/dt_json)
Red Pill
0

With Python 3.6, sometimes I still run into this problem. For example, when getting a response from a REST API and loading the response text into JSON, I still get unicode strings. I found a simple solution using json.dumps():

response_message = json.loads(json.dumps(response.text))
print(response_message)
Yuelin
-1

I've adapted the code from Mark Amery's answer, particularly in order to get rid of isinstance for the pros of duck typing.

The encoding is done manually and ensure_ascii is disabled. The Python docs for json.dump say that:

If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences

Disclaimer: in the doctest I used the Hungarian language. Some notable Hungarian-related character encodings are: cp852, the IBM/OEM encoding used e.g. in DOS (sometimes referred to as ascii - incorrectly, I think, since it depends on the codepage setting); cp1250, used e.g. in Windows (sometimes referred to as ansi, dependent on the locale settings); and iso-8859-2, sometimes used on HTTP servers. The test text Tüskéshátú kígyóbűvölő is attributed to Koltai László (native personal name form) and is from Wikipedia.

# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json

def encode_items(input, encoding='utf-8'):
    u"""original from: https://stackoverflow.com/a/13101776/611007
    adapted by SO/u/611007 (20150623)
    >>> 
    >>> ## run this with `python -m doctest <this file>.py` from command line
    >>> 
    >>> txt = u"Tüskéshátú kígyóbűvölő"
    >>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
    >>> txt3 = u"uúuutifu"
    >>> txt4 = b'u\\xfauutifu'
    >>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
    >>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
    >>> txt4u = txt4.decode('cp1250')
    >>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
    >>> txt5 = b"u\\xc3\\xbauutifu"
    >>> txt5u = txt5.decode('utf-8')
    >>> txt6 = u"u\\u251c\\u2551uutifu"
    >>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
    >>> assert txt == there_and_back_again(txt)
    >>> assert txt == there_and_back_again(txt2)
    >>> assert txt3 == there_and_back_again(txt3)
    >>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
    >>> assert txt3 == txt4u,(txt3,txt4u)
    >>> assert txt3 == there_and_back_again(txt5)
    >>> assert txt3 == there_and_back_again(txt5u)
    >>> assert txt3 == there_and_back_again(txt4u)
    >>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
    >>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
    >>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
    >>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
    >>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
    >>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
    >>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
    >>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
    """
    try:
        input.iteritems
        return {encode_items(k, encoding): encode_items(v, encoding) for (k, v) in input.iteritems()}
    except AttributeError:
        if isinstance(input, unicode):
            return input.encode(encoding)
        elif isinstance(input, str):
            return input
        try:
            iter(input)
            return [encode_items(e, encoding) for e in input]
        except TypeError:
            return input

def alt_dumps(obj, **kwargs):
    """
    >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
    '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
    """
    if 'ensure_ascii' in kwargs:
        del kwargs['ensure_ascii']
    return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)

I'd also like to highlight the answer of Jarret Hardie which references the JSON spec, quoting:

A string is a collection of zero or more Unicode characters

In my use case I had UTF-8 encoded files containing JSON. ensure_ascii results in properly escaped but not very readable JSON files; that is why I've adapted Mark Amery's answer to fit my needs.

The doctest is not particularly thoughtful, but I share the code in the hope that it will be useful to someone.

n611x007
  • I'm not sure I see the benefits of using duck typing here? We know that collections returned from `json.loads` are going to be lists or dicts, not some user-defined or library-defined type that implements their methods and magic methods, so why not just do an `isinstance` check? Isn't that easier to understand than checking for the existence of `iteritems` or whether `iter` will accept the object as an argument? – Mark Amery Nov 02 '15 at 13:52
  • @MarkAmery this is about dumps, not loads. if you *create* data to dump - as opposed to *loading* it - you cannot be sure what it is. the idea was to let it come from anywhere in the code. – n611x007 Nov 03 '15 at 07:53
-1

I ran into this problem too, and having to deal with JSON, I came up with a small loop that converts the unicode keys to strings. (simplejson on GAE does not return string keys.)

obj is the object decoded from JSON:

if NAME_CLASS_MAP.has_key(cls):
    kwargs = {}
    for i in obj.keys():
        kwargs[str(i)] = obj[i]
    o = NAME_CLASS_MAP[cls](**kwargs)
    o.save()

kwargs is what I pass to the constructor of the GAE application (which does not like unicode keys in **kwargs)

Not as robust as the solution from Wells, but much smaller.

boatcoder
-2

Check out this answer to a similar question, which states that:

The u- prefix just means that you have a Unicode string. When you really use the string, it won't appear in your data. Don't be thrown by the printed output.

For example, try this:

print mail_accounts[0]["i"]

You won't see a u.

kunal
  • Not true if e.g. you want to format something containing a unicode string, in Py2. e.g. `'{}'.format({u'x' : u'y'})` still includes the u's. – Ponkadoodle Apr 15 '20 at 00:26