Replace non-ASCII characters with a single space

Question

I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:

def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i)<128)

And this one replaces each non-ASCII character with as many spaces as there are bytes in the character's encoded form (i.e. the – character is replaced with 3 spaces):

import re

def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]',' ', text)

How can I replace all non-ASCII characters with a single space?

Of the myriad similar SO questions, none addresses character replacement as opposed to stripping, and none addresses all non-ASCII characters rather than one specific character.

wow, you really took good efforts to show so many links. +1 as soon as the day renews! — shad0w_wa1k3r, Nov 19 '13 at 18:20
You seem to have missed this one http://stackoverflow.com/questions/1342000/how-to-replace-non-ascii-characters-in-string — Stuart, Nov 19 '13 at 18:35
I'm interested in seeing an example input that has problems. — dstromberg, Nov 19 '13 at 18:42
@Stuart: Thanks, but that is the very first one that I mention. — dotancohen, Nov 20 '13 at 09:08
@dstromberg: I mention a problematic example character in the question: `–`. It's [this guy](http://www.fileformat.info/info/unicode/char/2013/index.htm). — dotancohen, Nov 20 '13 at 11:52
This helped me out a lot. I was having trouble while parsing HTML, but the characters that were causing a `UnicodeEncodeError` weren't needed, so your code just replaced them with something more readable and feasible. Thanks — Crispy, Dec 12 '14 at 07:36
If you want various somewhat better representations of the string in question, see the answers at [Python - Unicode to ASCII conversion](https://stackoverflow.com/a/19527434/507544) which use various useful options and charsets with `string.encode()`. — nealmcb, Sep 09 '17 at 04:20
... Or, to get `?` instead of spaces, use something like `print s.encode('ascii', 'replace')` => `ABRA?O JOS?` for `ABRAÃO JOSÉ` — nealmcb, Sep 09 '17 at 04:31

score 269 · Accepted Answer · answered Nov 19 '13 at 18:11

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
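
To make the difference between the two variants concrete, here is a minimal sketch, assuming Python 3 (where `str` is already Unicode). The function names are hypothetical, and the bytes helper assumes the input is UTF-8 encoded:

```python
import re

def replace_per_char(text):
    # One space for every non-ASCII character.
    return ''.join([c if ord(c) < 128 else ' ' for c in text])

def replace_per_run(text):
    # One space for each run of consecutive non-ASCII characters.
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

def replace_per_run_bytes(data, encoding='utf-8'):
    # Assumption: byte input is UTF-8 unless the caller says otherwise.
    # Decoding first makes the regex see code points, not individual bytes.
    return replace_per_run(data.decode(encoding))

print(repr(replace_per_char('ABC马克def')))  # 'ABC  def'
print(repr(replace_per_run('ABC马克def')))   # 'ABC def'
```

If the input is a byte string, decoding it first avoids getting one space per byte.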

answered Nov 19 '13 at 18:11

Martijn Pieters


@dstromberg: slower; `str.join()` *needs* a list (it'll pass over the values twice), and a generator expression will first be converted to one. Giving it a list comprehension is simply faster. See [this post](http://stackoverflow.com/a/9061024). – Martijn Pieters Nov 19 '13 at 18:42
The first piece of code will insert multiple blanks per character if you feed it a UTF-8 byte string. – Mark Ransom Nov 19 '13 at 19:13
@MarkRansom: I was assuming this to be Python 3. – Martijn Pieters Nov 19 '13 at 19:15
*"`–` character is replaced with 3 spaces"* in the question implies that the input is a bytestring (not Unicode) and therefore Python 2 is used (otherwise `''.join` would fail). If OP wants a single space per Unicode codepoint then the input should be decoded into Unicode first. – jfs Feb 19 '16 at 17:01

score 64 · Answer 2 · edited Feb 22 '18 at 17:58

To get the closest ASCII representation of your original string, I recommend the unidecode module:

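from unidecode import unidecode

def remove_non_ascii(text):
    # Python 2: decode the UTF-8 byte string to unicode first.
    # On Python 3, where str is already Unicode, call unidecode(text) directly.
    return unidecode(unicode(text, encoding="utf-8"))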

Then you can call it on a string:

remove_non_ascii("Ceñía")
Cenia

edited Feb 22 '18 at 17:58

idbrii


answered Feb 18 '16 at 20:50

Alvaro Fuentes


interesting suggestion, but it assumes the user wishes non ascii to become what the rules for unidecode are. This however poses a follow up question to the asker about why they insist on spaces, to perhaps replace with another character? – jxramos Feb 18 '16 at 21:15
Thank you, this is a good answer. It doesn't work for the purpose of _this question_ because most of the data that I'm dealing with does not have an ASCII-like representation. Such as `דותן`. However, in the general sense this is great, thank you! – dotancohen Feb 20 '16 at 20:16
Yes, I know this does not work for _this_ question, but I landed here trying to solve that problem, so I thought I’d just share my solution to my own problem, which I think is very common for people as @dotancohen who deal with non-ascii characters all the time. – Alvaro Fuentes Feb 24 '16 at 19:13
There have been some security vulnerabilities with stuff like this in the past. Just be careful how you implement this! – deweydb Nov 07 '16 at 18:44
Does not seem to work with UTF-16 encoded text strings – user5359531 Dec 14 '16 at 20:58
@AlvaroFuentes, how to handle/rewrite your wonderful code for Python 3 since [this](http://stackoverflow.com/questions/19877306/nameerror-global-name-unicode-is-not-defined-in-python-3)? Error: **NameError: global name 'unicode' is not defined** – Igor Savinkin Jan 25 '17 at 10:16
This works for Python3 - if you use `unidecode(text)`. I got some quotation marks from funny unicode characters during a crawl this way. – rjurney Dec 26 '20 at 11:24

Mark Tolonen · Answer 3 · 2013-11-19T21:26:15.297

24

For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'
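
One possible way to treat composed and decomposed input the same, sketched under the assumption that dropping the accent and keeping the base letter is acceptable. Note that this changes the result for precomposed characters too: `ñ` becomes `n` rather than a space. The function name is hypothetical.

```python
import re
import unicodedata as ud

def replace_non_ascii_keep_base(text):
    # Decompose so that accents become separate combining marks.
    decomposed = ud.normalize('NFD', text)
    # Drop combining marks (general categories Mn, Mc and Me).
    base = ''.join(c for c in decomposed if not ud.category(c).startswith('M'))
    # Replace each remaining run of non-ASCII characters with one space.
    return re.sub(r'[^\x00-\x7f]+', ' ', base)

print(replace_non_ascii_keep_base('ma\u00f1ana'))   # manana (precomposed input)
print(replace_non_ascii_keep_base('man\u0303ana'))  # manana (decomposed input)
```

Characters without an ASCII base letter, such as `马克`, are still replaced by a space.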

edited Nov 19 '13 at 21:26

answered Nov 19 '13 at 18:29

Mark Tolonen


Thank you, this is an important observation. If you do find a logical way to handle the case of combining-marks, I would happily add a bounty to the question. I suppose that simply removing the combining mark yet leaving the uncombined character alone would be best. – dotancohen Nov 20 '13 at 10:50
A partial solution is to use `ud.normalize('NFC',s)` to combine marks, but not all combining combinations are represented by single codepoints. You'd need a smarter solution looking at the `ud.category()` of the character. – Mark Tolonen Nov 20 '13 at 10:55
@dotancohen: there is a notion of "user-perceived character" in Unicode that may span several Unicode codepoints. `\X` (eXtended grapheme cluster) regex (supported by `regex` module) allows to iterate over such characters (note: [*"graphemes are not necessarily combining character sequences, and combining character sequences are not necessarily graphemes"*](http://unicode.org/faq/char_combmark.html)). – jfs Feb 19 '16 at 17:08

AXO · Answer 4 · 2017-01-03T11:12:33.457

If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode():

"""Test the performance of different non-ASCII replacement methods."""


import re
from timeit import timeit


# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000


print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
    """,
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
    """,
    number=1000,
    globals=globals(),
))

Results:

0.7208260721400134
0.009975979187503592
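
If a space is required rather than `?`, the same encode-based idea can be kept by registering a custom error handler with `codecs.register_error`. This is only a sketch: the handler name `space_run` and the function names are made up for illustration, and collapsing a whole run into one space relies on the encoder reporting consecutive unencodable characters in a single call, which is how CPython's ASCII encoder behaves as far as I know.

```python
import codecs

def _space_for_run(exc):
    # Only handle encoding errors; re-raise anything else.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Emit one space and resume after the unencodable span.
    return (' ', exc.end)

# 'space_run' is an arbitrary, hypothetical handler name.
codecs.register_error('space_run', _space_for_run)

def replace_non_ascii(text):
    return text.encode('ascii', 'space_run').decode('ascii')

print(repr(replace_non_ascii('a\u2013b')))  # 'a b'
```

Unlike replacing `?` afterwards, this leaves question marks that were already in the text untouched.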

Replace the ? with another character or a space afterwards if needed, and you'd still be faster. — Moritz, Jan 18 '18 at 11:23

score 8 · Answer 5 · answered Aug 20 '16 at 22:35

What about this one?

def replace_trash(unicode_string):
    chars = list(unicode_string)
    for i, char in enumerate(chars):
        try:
            char.encode("ascii")
        except UnicodeEncodeError:
            # means it's non-ASCII
            chars[i] = " "  # replacing it with a single space
    return "".join(chars)

answered Aug 20 '16 at 22:35

parsecer


Though this is rather inelegant, it is very readable. Thank you. – dotancohen Aug 21 '16 at 06:28
+1 for unicode handling... @dotancohen IMNSHO "readable" implies "practical" which adds to "elegant", so i'd say "a bit inelegant" – qneill Oct 01 '16 at 00:00

score 5 · Answer 6 · answered Jan 23 '18 at 14:39

As a native and efficient approach, you don't need to use ord or any loop over the characters. Just encode with ascii and ignore the errors.

The following will just remove the non-ascii characters:

new_string = old_string.encode('ascii',errors='ignore')

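Now if you want to pad the result with one space per deleted character, do the following (note that the spaces are appended at the end, not placed where the characters were):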

final_string = new_string + b' ' * (len(old_string) - len(new_string))

answered Jan 23 '18 at 14:39

kasravnd


In python3, this `encode` will return a bytestring, so keep that in mind. Also, this method won't strip out characters such as newline. – Kyle Gibson Jan 25 '19 at 16:25
new_string = old_string.encode('ascii', errors='ignore').decode() – Hamid Fadishei Jul 31 '20 at 09:50

score 1 · Answer 7 · edited Dec 23 '20 at 08:54

ascii() escapes non-ASCII characters and leaves ordinary ASCII characters as they are. So I iterate through the string and check whether ascii() changes each character. If it does, I replace the character with the replacer you supply, for example ' ' (a single space) or '?' (a question mark).

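def remove(x, replacer):
    # Caveat: ascii() also escapes or re-quotes a few ASCII characters
    # (single quote, backslash, control characters), so those get replaced too.
    for i in x:
        if f"'{i}'" != ascii(i):
            x = x.replace(i, replacer)
    return x

remove('hái',' ')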

Result: "h i" (with single space between).

Syntax: remove(str, non_ascii_replacer)
str: the string you want to work with.
non_ascii_replacer: the string that replaces every non-ASCII character.

Nice edit, adding an explanation. :-) And now that I get the idea of your code I like the approach. (And as promised I did my best with formatting it for you; I hope you like it.) — Yunnosch, Dec 23 '20 at 08:55

score -1 · Answer 8 · answered Apr 08 '19 at 15:03

Potentially for a different question, but I'm providing my version of @Alvaro's answer (using unidecode). I want to do a "regular" strip on my strings, i.e. remove whitespace characters from the beginning and end of the string, and then replace any remaining whitespace-like characters with a "regular" space, i.e.

"Ceñíaㅤmañanaㅤㅤㅤㅤ"

to

"Ceñía mañana"

from unidecode import unidecode

def safely_stripped(s: str):
    return ' '.join(
        stripped for stripped in
        (bit.strip() for bit in
         ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
        if stripped)

We first replace every character that unidecode maps to an empty string with a regular space (and join the result back together),

''.join((c if unidecode(c) else ' ') for c in s)

And then we split that again, with python's normal split, and strip each "bit",

(bit.strip() for bit in s.split())

And lastly join those back again, but only if the string passes an if test,

' '.join(stripped for stripped in s if stripped)

And with that, safely_stripped('ㅤㅤㅤㅤCeñíaㅤmañanaㅤㅤㅤㅤ') correctly returns 'Ceñía mañana'.
