
I have a string that looks like so:

6 918 417 712

The clear-cut way to trim this string (as I understand Python) is simply this: with the string in a variable called s, we get:

s.replace('Â ', '')

That should do the trick. But of course it complains that the non-ASCII character '\xc2' in file blabla.py is not encoded.

I never quite could understand how to switch between different encodings.

Here's the code, it really is just the same as above, but now it's in context. The file is saved as UTF-8 in notepad and has the following header:

#!/usr/bin/python2.4
# -*- coding: utf-8 -*-

The code:

f = urllib.urlopen(url)

soup = BeautifulSoup(f)

s = soup.find('div', {'id':'main_count'})

#making a print 's' here goes well. it shows 6Â 918Â 417Â 712

s.replace('Â ','')

save_main_count(s)

It gets no further than s.replace...

Spontifixus
adergaard
    Tried all of the 4 answers so far. No go. Still getting the UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128) – adergaard Aug 27 '09 at 16:09
  • your unicode string **must** be prepended with `u` – SilentGhost Aug 27 '09 at 16:25
  • @SilentGhost: as you can see, there's no way of being sure it is a unicode string. I get a string that has the content shown above, but it contains non ascii strings. That's the real problem. I'm guessing it is unicode since it is not in the first 128. – adergaard Aug 27 '09 at 16:28
  • The error has nothing to do with incoming string. It is a string in your code that raises this error! – SilentGhost Aug 27 '09 at 16:34
  • should be: `s.replace(u'Â ','')` – SilentGhost Aug 27 '09 at 16:35
  • @SilentGhost: I really appreciate the effort but believe me, it stops on that row saying: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128) I can even add: test = u" " oms.replace(test, '') and it still gives the same error. – adergaard Aug 27 '09 at 16:38
  • 2
    I'll bet this is why Python 3 is so strict about the difference between strings and byte sequences, just to avoid this kind of confusion. – Mark Ransom Aug 27 '09 at 16:42

11 Answers


Throw out all characters that can't be interpreted as ASCII:

def remove_non_ascii(s):
    return "".join(c for c in s if ord(c)<128)

Keep in mind that this is guaranteed to work with the UTF-8 encoding (because all bytes in multi-byte characters have the highest bit set to 1).
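As a quick check (Python 3 syntax here, though the idea is the same in Python 2), running the function on the asker's mojibake string drops both non-ASCII characters of each badly decoded NO-BREAK SPACE:

```python
def remove_non_ascii(s):
    # Keep only code points below 128 (plain ASCII)
    return "".join(c for c in s if ord(c) < 128)

# The asker's string as it looks after a wrong latin-1 decode:
# each separator is '\xc2\xa0', two non-ASCII characters
s = "6\xc2\xa0918\xc2\xa0417\xc2\xa0712"
print(remove_non_ascii(s))  # 6918417712
```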

Boris
fortran
  • I get: TypeError: ord() expected a character, but string of length 2 found – Ivelin May 07 '13 at 19:59
  • @Ivelin that's because the "character" is not being interpreted as proper unicode... check that your source string is prefixed with `u` if it's a literal. – fortran May 07 '13 at 23:07

Python 2 uses ASCII as the default encoding for source files, which means you must specify another encoding at the top of the file to use non-ASCII Unicode characters in literals. Python 3 uses UTF-8 as the default encoding for source files, so this is less of an issue.

See: http://docs.python.org/tutorial/interpreter.html#source-code-encoding

To enable utf-8 source encoding, this would go in one of the top two lines:

# -*- coding: utf-8 -*-

The above is in the docs, but this also works:

# coding: utf-8

Additional considerations:

  • The source file must be saved using the correct encoding in your text editor as well.

  • In Python 2, the unicode literal must have a u before it, as in s.replace(u"Â ", u""). In Python 3, just use quotes. In Python 2, you can from __future__ import unicode_literals to obtain the Python 3 behavior, but be aware this affects the entire current module.

  • s.replace(u"Â ", u"") will also fail if s is not a unicode string.

  • string.replace returns a new string and does not edit in place, so make sure you're using the return value as well.
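To illustrate that last point, a minimal sketch (Python 3 syntax; in Python 2 the literals would need u'' prefixes):

```python
s = "6\xa0918\xa0417\xa0712"  # digits separated by NO-BREAK SPACE
s.replace("\xa0", "")         # returns a new string; the result is lost here
s = s.replace("\xa0", "")     # rebind the name to keep the cleaned result
print(s)  # 6918417712
```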

Jason S
  • You actually only need `# coding: utf-8`. `-*-` is not for decoration, but you are unlikely to ever need it. I think it was there for old shells. – fmalina May 09 '13 at 13:40
>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'
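If s is still a byte string (as it apparently is in the question), it has to be decoded before encode can apply. A sketch in Python 3 terms, combining this answer with the decode step raised in the comments on this answer:

```python
raw = b"6\xc2\xa0918\xc2\xa0417\xc2\xa0712"  # UTF-8 bytes from the page
text = raw.decode("utf-8")                   # bytes -> str first
ascii_text = text.encode("ascii", "ignore").decode("ascii")
print(ascii_text)  # 6918417712
```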
truppo
  • I see the votes you get but when I try it it says: Nope. UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128). Could it be that my orignal string is not in unicode? Well in any case. it needs – adergaard Aug 27 '09 at 16:11
  • Nice, thanks. May I suggest to use .decode() on the result to get it in the original coding? – AkiRoss Nov 17 '12 at 17:29
  • If you are getting UnicodeDecodeError: 'ascii', then try to convert string into ''UTF-8' format before applying encoding function. – Sateesh May 14 '20 at 08:06

The following code will replace all non ASCII characters with question marks.

"".join([x if ord(x) < 128 else '?' for x in s])
Maehler
VisioN
  • Out of curiosity, I wanted to know: is there any specific reason to replace it with a question mark? – Mohsin Nov 21 '17 at 15:13

Using Regex:

import re

strip_unicode = re.compile("([^-_a-zA-Z0-9!@#%&=,/'\";:~`\$\^\*\(\)\+\[\]\.\{\}\|\?\<\>\\]+|[^\s]+)")
print strip_unicode.sub('', u'6Â 918Â 417Â 712')
Akoi Meexx

Way too late for an answer, but the original string was in UTF-8, and '\xc2\xa0' is UTF-8 for NO-BREAK SPACE. Simply decode the original string with s.decode('utf-8'). ('\xc2\xa0' displays as 'Â ' when decoded incorrectly as Windows-1252 or latin-1.)

Example (Python 3)

s = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(s.decode('latin-1')) # incorrectly decoded
u = s.decode('utf8') # correctly decoded
print(u)
print(u.replace('\N{NO-BREAK SPACE}','_'))
print(u.replace('\xa0','-')) # \xa0 is Unicode for NO-BREAK SPACE

Output

6 918 417 712
6 918 417 712
6_918_417_712
6-918-417-712
Mark Tolonen
#!/usr/bin/env python
# -*- coding: utf-8 -*-

s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â", "") 
print s

This will print out 6 918 417 712

Isaiah
  • Nope. UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128). Could it be that my orignal string is not in unicode? Well in any case. I'm probably doing something wrong. – adergaard Aug 27 '09 at 16:11
  • @adergaard, did you add # -*- coding: utf-8 -*- at the top of the source file? – Nadia Alramli Aug 27 '09 at 16:17
  • Yes, see the top of this page again, I've edited the question and put in the code and the header comments. Thanks for your assistance. – adergaard Aug 27 '09 at 16:19
  • I think you will have to figure out how to get the strings from the html or xml document in unicode. More info on that here: http://diveintopython.org/xml_processing/unicode.html – Isaiah Aug 27 '09 at 16:38

I know it's an old thread, but I felt compelled to mention the translate method, which is always a good way to replace all character codes above 128 (or other if necessary).

Usage : str.translate(table[, deletechars])

>>> trans_table = ''.join( [chr(i) for i in range(128)] + [' '] * 128 )

>>> 'Résultat'.translate(trans_table)
'R sultat'
>>> '6Â 918Â 417Â 712'.translate(trans_table)
'6  918  417  712'

Starting with Python 2.6, you can also set the table to None, and use deletechars to delete the characters you don't want as in the examples shown in the standard docs at http://docs.python.org/library/stdtypes.html.

With unicode strings, the translation table is not a 256-character string but a dict with the ord() of relevant characters as keys. But anyway getting a proper ascii string from a unicode string is simple enough, using the method mentioned by truppo above, namely : unicode_string.encode("ascii", "ignore")
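A short sketch of that dict form (Python 3 syntax, where it is the only form str.translate accepts; the mapping below is an illustrative choice, not from the answer):

```python
# Map NO-BREAK SPACE to a plain space; delete the stray 'Â' outright
table = {0x00A0: " ", 0x00C2: None}
print("6\xc2\xa0918\xc2\xa0417\xc2\xa0712".translate(table))  # 6 918 417 712
```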

As a summary, if for some reason you absolutely need to get an ascii string (for instance, when you raise a standard exception with raise Exception, ascii_message ), you can use the following function:

trans_table = ''.join( [chr(i) for i in range(128)] + ['?'] * 128 )
def ascii(s):
    if isinstance(s, unicode):
        return s.encode('ascii', 'replace')
    else:
        return s.translate(trans_table)

The good thing with translate is that you can actually convert accented characters to relevant non-accented ascii characters instead of simply deleting them or replacing them by '?'. This is often useful, for instance for indexing purposes.
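One common way to do that accent folding (not shown in this answer; it uses the standard unicodedata module rather than translate) is to decompose the characters first, so the base letter survives the ASCII conversion:

```python
import unicodedata

def ascii_fold(s):
    # NFKD splits 'é' into 'e' plus a combining accent; the accent
    # is then dropped by the ASCII encode, keeping the base letter
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

print(ascii_fold("Résultat"))  # Resultat
```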

Louis LC
s.replace(u'Â ', '')              # u before string is important

and make your .py file unicode.

SilentGhost

This is a dirty hack, but may work.

s2 = ""
for i in s:
    if ord(i) < 128:
        s2 += i
Corey D
  • constantly appending to a string is usually not as efficient as building a list then joining https://stackoverflow.com/questions/3055477/how-slow-is-pythons-string-concatenation-vs-str-join – Boris Mar 13 '21 at 13:26

For what it was worth, my character set was utf-8 and I had included the classic "# -*- coding: utf-8 -*-" line.

However, I discovered that I didn't have Universal Newlines when reading this data from a webpage.

My text had two words, separated by "\r\n". I was only splitting on the \n and replacing the "\n".

Once I looped through and saw the character set in question, I realized the mistake.

So, it could also be within the ASCII character set, but a character that you didn't expect.
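In that situation, splitting with splitlines instead of split("\n") avoids the stray carriage returns. A small sketch (the sample text is an assumption, not the original data):

```python
text = "6 918\r\n417 712"
print(text.split("\n"))   # ['6 918\r', '417 712'] -- '\r' left behind
print(text.splitlines())  # ['6 918', '417 712']   -- handles \r\n cleanly
```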

Nayan
Glen