Python: remove ^A in a string

Question

I get a string from a database that contains strange characters, and these characters break the JSON string.

Here is the json string:

{"id":13,"code":"cflw`2B2[h1s`lNzF@sPC1FtaCiK0VF@","label":"Anonymous lifestyle App cflw`2B2[h1s`lNzF@sPC1FtaCiK0VF@"}

It looks OK, but there is a '^A' between i and K (2B2[h1s`lNzF@sPC1FtaCiK0VF@). You cannot see it here, but it shows up if you copy the string into a text editor.

My question is how to make this json string parsable? or how to use Python to remove '^A'?

It is not a JSON string: a JSON string can't contain a literal [U+0001 codepoint](http://codepoints.net/U+0001). – jfs, May 08 '14 at 00:57

score 2 · Accepted Answer · edited Sep 16 '19 at 01:13

how to use Python to remove '^A'?

If you open your terminal and type the following using the real ASCII ^A character (to enter it, press C-v C-a):

>>> print ord('^A')
1

So you know you have to remove the ASCII control character 1 from the string:

>>> json_string = '{"id":13,"code":"cflw`2B2[h1s`lNzF@sPC1FtaCiK0VF@","label":"Anonymous lifestyle App cflw`2B2[h1s`lNzF@sPC1FtaCiK0VF@"}'
>>> json_string.replace(chr(1), '')
'{"id":13,"code":"cflw`2B2[h1s`lNzF@sPC1FtaCiK0VF@","label":"Anonymous lifestyle App cflw`2B2[h1s`lNzF@sPC1FtaCiK0VF@"}'

According to the ASCII table (source: asciitable.com), it's the "start of heading" code. In a shell, Ctrl+A is commonly used to move to the beginning of the line.

N.B.:

the function ord() gives the integer code of a character (which in Python is a one-character string);
the function chr() gives the one-character string for an integer ASCII code;
in ASCII, printable characters are between 32 and 126.

My question is how to make this json string parsable?

In the end, there's no way to make the exact string you're giving parseable as JSON, because a JSON string must not contain raw (unescaped) control characters (which may have undesirable side effects when sent through sockets or tty ports); they can only appear in escaped form, such as \u0001. In other words, a string that looks like JSON but contains a raw ASCII control character is not valid JSON.
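To see how this plays out in practice, here is a minimal Python 3 sketch (the shortened `raw` string is made up for illustration). It assumes the standard `json` module, which rejects a raw control character inside a string by default. Its `strict=False` option is a lenient, non-standard mode that accepts such input anyway, so the result still would not be valid JSON for other parsers.

```python
import json

# Shortened, hypothetical version of the question's data, with a raw U+0001.
raw = '{"id": 13, "code": "CiK\x010VF@"}'

# Default (strict) parsing rejects the raw control character.
try:
    json.loads(raw)
    strict_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    strict_ok = False

# Lenient parsing keeps the control character in the decoded value.
lenient = json.loads(raw, strict=False)

# Removing the character first gives input that strict parsing accepts.
cleaned = json.loads(raw.replace(chr(1), ""))
```

Which of the two is appropriate depends on whether the control character carries meaning; if it is just noise from the database, removing it is probably the simpler route.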

Not knowing the context, if you want your JSON data to work only one way (injective function), you can remove the control character from the field (and the name) before building the JSON string. You might as well use a hashing function, which will make it smaller and look nicer.

Though if you want it to be symmetrical (bijective), you'd better either transform code into a list of integers, or code it using something like base64:

with base64:

>>> import base64
>>> bcode = base64.encodestring(code)
>>> bcode
'Y2Zsd2AyQjJbaDFzYGxOekZAc1BDMUZ0YUNpSwEwVkZA\n'
>>> base64.decodestring(bcode)
'cflw`2B2[h1s`lNzF@sPC1FtaCiK\x010VF@'

or as a list of integers:

>>> lcode = [ord(c) for c in code]
>>> lcode
[99, 102, 108, 119, 96, 50, 66, 50, 91, 104, 49, 115, 96, 108, 78, 122, 70, 64, 115, 80, 67, 49, 70, 116, 97, 67, 105, 75, 1, 48, 86, 70, 64]
>>> "".join([chr(c) for c in lcode])
'cflw`2B2[h1s`lNzF@sPC1FtaCiK\x010VF@'

making your json string:

{"id":13,"code":"Y2Zsd2AyQjJbaDFzYGxOekZAc1BDMUZ0YUNpSwEwVkZA\n","label":"Anonymous lifestyle App cflw`2B2[h1s`lNzF@sPC1FtaCiK0VF@"}

or

{"id":13,"code":[99, 102, 108, 119, 96, 50, 66, 50, 91, 104, 49, 115, 96, 108, 78, 122, 70, 64, 115, 80, 67, 49, 70, 116, 97, 67, 105, 75, 1, 48, 86, 70, 64],"label":"Anonymous lifestyle App cflw`2B2[h1s`lNzF@sPC1FtaCiK0VF@"}

But in the end, you need to have the ^A control character removed from the string before the JSON is being built, at encoding time, not at decoding time…
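The snippets above are Python 2. As a rough Python 3 sketch of the same base64 round trip (assuming the code is plain ASCII, and using `base64.b64encode`, which works on bytes and adds no trailing newline):

```python
import base64
import json

# The code value from the question, with the control character written as \x01.
code = "cflw`2B2[h1s`lNzF@sPC1FtaCiK\x010VF@"

# Encode at build time: bytes in, ASCII text out.
encoded = base64.b64encode(code.encode("ascii")).decode("ascii")
payload = json.dumps({"id": 13, "code": encoded})

# Decode on the receiving side to get the original value back.
received = json.loads(payload)
decoded = base64.b64decode(received["code"]).decode("ascii")
```

The receiver has to know that the field is base64-encoded, so this only works when both sides agree on the convention.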

score 1 · Answer 2 · answered May 07 '14 at 22:38


json_str = '{"id":13,"code":"cflw`2B2[h1s`lNzF@sPC1FtaCiK0VF@","label":"Anonymous lifestyle App cflw`2B2[h1s`lNzF@sPC1FtaCiK0VF@"}'
print map(ord, json_str)

This will get you an array of integer unicode codes. Find the integer unicode code you want to get rid of, then do find/replace using Python's built-in str.replace(old, new)
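A possible Python 3 take on the same idea, as a sketch only: instead of scanning the whole list of codes by eye, a small helper (the name `find_control_chars` is made up here) can report just the positions and codes of the ASCII control characters.

```python
def find_control_chars(text):
    """Return (index, code) pairs for ASCII control characters in text."""
    return [
        (index, ord(char))
        for index, char in enumerate(text)
        if ord(char) < 32 or ord(char) == 127
    ]


# Shortened, hypothetical sample with a ^A between "K" and "0".
sample = "CiK\x010VF@"
hits = find_control_chars(sample)

# Remove every character code that was reported.
cleaned = sample
for _, code in hits:
    cleaned = cleaned.replace(chr(code), "")
```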


shadowfox


Bingo. Exactly what I was thinking but you connected the dots. – deweyredman May 07 '14 at 22:39
though, ^A is not unicode, it's an ascii control char. – zmo May 07 '14 at 22:41
ah, you're actually right – deweyredman May 07 '14 at 22:42

score 1 · Answer 3 · answered May 07 '14 at 22:41

^A is UNIX speak for ASCII code 0x01 (in a tty, equivalent to Ctrl+A).

You really should not be getting raw binary data in JSON strings, and you should fix the producer (a typical approach is to base64 encode binary data).

However, given your data, you can remove this particular character with:

yourstring.replace(chr(1), "")

or remove all control characters with:

import re
re.sub("[\x00-\x1F]", "", yourstring)
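Note that the range \x00-\x1F also covers tab, line feed and carriage return. If those should survive, a narrower pattern can be used. This is a sketch under that assumption, and the helper name `strip_control_chars` is hypothetical:

```python
import re

# C0 control characters except \t (0x09), \n (0x0A) and \r (0x0D), plus DEL.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text):
    """Remove control characters but keep common whitespace."""
    return CONTROL_CHARS.sub("", text)


result = strip_control_chars("a\x01b\tc\n")
```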

deweyredman · Answer 4 · 2014-05-07T22:38:29.747

0

Hrm, I'd probably find the unicode character code and do a find and replace on it... I'm sure there's a better solution, but this should work.

Can you do me a favor... is there any way you can do this for the string in question:

# print every character of the string on its own line, wrapped in quotes
for i in range(len(yourstring)):
  print str.format("\"{0}\"", yourstring[i])

edited May 07 '14 at 22:38

answered May 07 '14 at 22:32

deweyredman


this is not unicode, it's an ascii control char – zmo May 07 '14 at 22:42
good catch..thanks for that – deweyredman May 07 '14 at 22:43

V13 · Answer 5 · 2014-05-07T23:26:00.247

0

You can do something like this on the string:

import string
newstr=''.join([x for x in oldstr if x in string.printable])

Instead of string.printable, you can use any other subset of characters. oldstr is the original string.

Having said that, unless you're storing binary data in the database, you should make sure that the encoding used when communicating with the database is correct.

Finally, I get the feeling that you're trying to create the JSON by hand instead of using a library. The json module should be able to handle these cases without a problem. For example:

>>> import json
>>> a={'a': "a\x01b", 'b': 20}
>>> a
{'a': 'a\x01b', 'b': 20}
>>> print a['a']
ab
>>> json.dumps(a)
'{"a": "a\\u0001b", "b": 20}'
>>> json.loads(json.dumps(a))
{u'a': u'a\x01b', u'b': 20}
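The session above is Python 2. A short Python 3 sketch of the same round trip, using a shortened, made-up record, shows that `json.dumps` writes the control character as the escape \u0001, so the serialized text contains no raw control character and loads back unchanged:

```python
import json

# Hypothetical record with a ^A (U+0001) inside the value.
record = {"id": 13, "code": "CiK\x010VF@"}

# Serialising escapes the control character instead of emitting it raw.
text = json.dumps(record)

# Parsing the escaped form restores the original character.
restored = json.loads(text)
```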

edited May 07 '14 at 23:26

answered May 07 '14 at 22:42

V13


but loads fails loudly with a non printable character. – zmo May 07 '14 at 23:11
@zmo it doesn't fail for me. I updated the answer to show that loads works well. loads should fail only when the json is not correct. The normal json library will escape any unprintable characters (see above) so there's no reason for the load to fail. – V13 May 07 '14 at 23:27
my bad, I made a mistake in my tests, you're right it's converting it correctly! – zmo May 07 '14 at 23:31
