Stripping non-printable characters from a string in Python

Question

I used to run

$s =~ s/[^[:print:]]//g;

in Perl to get rid of non-printable characters.

In Python there are no POSIX regex classes, so I can't write [:print:] and have it mean what I want. I know of no way in Python to detect whether a character is printable or not.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output. curses.ascii.isprint will return false for any unicode character.

score 89 · Accepted Answer · edited Jun 24 '20 at 21:54

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

import unicodedata, re, itertools, sys

# sys.maxunicode is the largest valid code point, so add 1 to include it
all_chars = (chr(i) for i in range(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)
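The same kind of compiled pattern can also count control characters instead of stripping them, via findall. A minimal sketch, assuming the explicit range form of the Cc class ('\x00-\x1f\x7f-\x9f'); count_control_chars and strip_control_chars are hypothetical helper names, not part of the answer:

```python
import re

# Character class for the Cc (control) category: U+0000-U+001F and U+007F-U+009F.
CONTROL_CHAR_RE = re.compile(r'[\x00-\x1f\x7f-\x9f]')


def count_control_chars(s):
    """Return how many control characters occur in s."""
    return len(CONTROL_CHAR_RE.findall(s))


def strip_control_chars(s):
    """Return s with all control characters removed."""
    return CONTROL_CHAR_RE.sub('', s)
```

Writing the ranges out by hand avoids scanning the whole code space at import time, at the cost of hard-coding the Cc category.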

For Python 2:

import unicodedata, re, sys

# sys.maxunicode is the largest valid code point, so add 1 to include it
all_chars = (unichr(i) for i in xrange(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For some use cases, additional categories (e.g. all from the control group) might be preferable, although this might slow down processing and increase memory usage significantly. Number of characters per category:

Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601
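When more categories are included, listing every character individually makes the character class very large. One possible approach is to collapse consecutive code points into ranges first. The following is a sketch under that assumption; build_category_pattern is a hypothetical helper, and it expects at least one matching category:

```python
import re
import sys
import unicodedata


def build_category_pattern(categories):
    """Compile a character class matching every code point in `categories`."""
    ranges = []   # (first, last) runs of consecutive matching code points
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in categories:
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    # \UXXXXXXXX escapes keep raw control characters out of the pattern text.
    body = ''.join(
        '\\U%08x' % lo if lo == hi else '\\U%08x-\\U%08x' % (lo, hi)
        for lo, hi in ranges
    )
    return re.compile('[%s]' % body)


# Example: strip control (Cc) and format (Cf) characters.
cc_cf_re = build_category_pattern({'Cc', 'Cf'})
cleaned = cc_cf_re.sub('', 'a\x00b\u200bc')
```

Passing all five 'C' categories should work the same way; the pattern just contains more ranges.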

Edit: Added suggestions from the comments.

edited Jun 24 '20 at 21:54

darkdragon

answered Sep 18 '08 at 14:28

Ants Aasma

Is 'Cc' enough here? I don't know, I'm just asking -- it seems to me that some of the other 'C' categories may be candidates for this filter as well. – Patrick Johnmeyer Sep 18 '08 at 17:10
This code doesn't work in 2.6 or 3.2, which version does it run in? – Seth Aug 09 '11 at 03:41
This function, as published, removes half of the Hebrew characters. I get the same effect for both of the methods given. – dotancohen Dec 11 '12 at 15:32
From a performance perspective, wouldn't string.translate() work faster in this case? See http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python – Kashyap Oct 03 '13 at 20:19
This fails for a "narrow" build of python (16-bit unicode). That's the standard build for Mac. http://stackoverflow.com/questions/7105874/ – Edward Falk Dec 14 '14 at 17:03
@ants aasma: pls tell me, how can your approach of building a character class be used to count the control chars in the string (not strip them)? I don't see any suitable method in re. – chrisinmtown Apr 02 '15 at 11:46
@Edward Falk: For the narrow build, put all_chars = (unichr(i) for i in xrange(0x110000)) in a try clause, then the same with xrange(0x10000) in the except clause -- allows it to work with a "Narrow" build (like OSX) – Dave May 23 '15 at 01:01
@PatrickJohnmeyer You've got a good point, and this bit me. I fixed it by checking if the unicodedata.category(c) is in a set of any of the 'Other' unicode categories (see: http://www.fileformat.info/info/unicode/category/index.htm ), ie set(['Cc','Cf','Cn','Co','Cs']). Note that I'm using English fonts, so ymmv using other fonts. – Dave May 23 '15 at 01:05
Use `all_chars = (unichr(i) for i in xrange(sys.maxunicode))` to avoid the narrow build error. – danmichaelo Nov 24 '15 at 21:01
For me `control_chars == '\x00-\x1f\x7f-\x9f'` (tested on Python 3.5.2) – AXO Sep 09 '16 at 16:03
Can I apply this to a pandas DataFrame? If yes, please explain how. – Wcan Oct 19 '17 at 19:56
On Python 3 use `chr()` instead of `unichr()` and `range()` instead of `xrange()`. Furthermore, to combine the two iterators returned by `range()` one should use `itertools.chain()`: `itertools.chain(range(), range())`. For readability, I suggest using hex numbers (thanks @AXO) in the static ranges: `range(0x00,0x20)` and `range(0x7f,0xa0)`. – darkdragon Jun 24 '20 at 06:43
It still keeps codes like `\xa0` – Dima Lituiev Dec 29 '20 at 02:10

score 77 · Answer 2 · edited May 07 '14 at 22:48

As far as I know, the most pythonic/efficient method would be:

import string

filtered_string = filter(lambda x: x in string.printable, myStr)
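In Python 3, filter returns an iterator rather than a string, so the result has to be joined. A minimal sketch of the same idea with a generator expression and a set for faster membership tests (keep_ascii_printable is a hypothetical name). Note that string.printable is ASCII-only, so this also drops characters such as é:

```python
import string

# A frozenset gives constant-time membership checks.
ASCII_PRINTABLE = frozenset(string.printable)


def keep_ascii_printable(text):
    """Return text with every character outside string.printable removed."""
    return ''.join(ch for ch in text if ch in ASCII_PRINTABLE)
```

Because string.printable includes tab, newline, and the other ASCII whitespace characters, those are kept as well.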

edited May 07 '14 at 22:48

zmo

answered Sep 18 '08 at 13:23

William Keller

You probably want filtered_string = ''.join(filter(lambda x:x in string.printable, myStr)) so that you get back a string. – Nathan Shively-Sanders Sep 18 '08 at 13:27
Sadly string.printable does not contain unicode characters, and thus ü or ó will not be in the output... maybe there is something else? – Vinko Vrsalovic Sep 18 '08 at 13:29
You should be using a list comprehension or generator expressions, not filter + lambda. One of these will 99.9% of the time be faster. ''.join(s for s in myStr if s in string.printable) – habnabit Sep 18 '08 at 22:49
The lot of you are correct, of course. I should stop trying to help people while sleep-deprived! – William Keller Sep 19 '08 at 03:20
@AaronGallagher: 99.9% faster? From whence do you pluck that figure? The performance comparison is nowhere near that bad. – Chris Morgan Jan 14 '12 at 04:01
It's perhaps worth turning `string.printable` into a `set` before doing the filter. – Gareth Rees Sep 12 '12 at 12:25
Hi William. This method seems to remove all non-ASCII characters. There are many printable non-ASCII characters in Unicode! – dotancohen Dec 11 '12 at 15:28
@ChrisMorgan: Late response, but the claim is it will almost always be faster, not that it will be much, much faster. – Oddthinking Jul 30 '14 at 20:10
Be aware: In Python 3, filter returns an iterator, so use Nathan's `''.join(...)`. Calling `str(filter(...))` only returns the repr of the filter object, not the filtered text. – marsl Apr 06 '18 at 14:11
Here's my version that gives a clue about what was eliminated: ''.join( (s if s in string.printable else 'X') for s in s_string_to_print ) – TaiwanGrapefruitTea Oct 18 '18 at 11:00
Note that tab, newline and a few more are part of the printable characters. So if you don't want to include those, you should use `string.printable[:-5]` – LoMaPh Nov 12 '20 at 19:38

score 20 · Answer 3 · edited Jun 04 '19 at 09:57

You could try setting up a filter using the unicodedata.category() function:

import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(s):
    # keep only characters whose Unicode category is in the allowed set
    return ''.join(c for c in s if unicodedata.category(c) in printable)

See Table 4-9 on page 175 in the Unicode database character properties for the available categories
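Keeping only 'Lu' and 'Ll' also drops digits, punctuation, and spaces. One way to widen the filter is to test the first letter of the category (the major class) instead of listing every two-letter code. This is a sketch; the choice of classes to keep and the name filter_by_major_class are assumptions for illustration:

```python
import unicodedata

# Major classes to keep: Letter, Mark, Number, Punctuation, Symbol.
KEEP_MAJOR_CLASSES = {'L', 'M', 'N', 'P', 'S'}


def filter_by_major_class(s):
    """Keep letters, marks, numbers, punctuation, symbols, and space separators."""
    def keep(c):
        cat = unicodedata.category(c)
        # 'Zs' is the space-separator category (it includes U+00A0).
        return cat[0] in KEEP_MAJOR_CLASSES or cat == 'Zs'
    return ''.join(c for c in s if keep(c))
```

Since the non-breaking space U+00A0 is in 'Zs', it survives this filter; it would need to be excluded explicitly if that is not wanted.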

edited Jun 04 '19 at 09:57

answered Sep 18 '08 at 15:25

Ber

you started a list comprehension which did not end in your final line. I suggest you remove the opening bracket completely. – tzot Sep 19 '08 at 12:13
Thank you for pointing this out. I edited the post accordingly. – Ber Oct 05 '08 at 15:32
This seems the most direct, straightforward method. Thanks. – dotancohen Jul 21 '13 at 05:34
it should be `printable = set(['Lu', 'Ll'])` shouldn't it ? – Fabrizio Miano Apr 04 '19 at 14:27
@FabrizioMiano You are right. Or set(('Lu', 'Ll')) Thanx – Ber Apr 05 '19 at 13:10
@Ber You meant to say `printable = {'Lu', 'Ll'}` ? – Csaba Toth May 31 '19 at 07:04
@CsabaToth All three are valid and yield the same set. Yours is maybe the nicest way to specify a set literal. – Ber Jun 04 '19 at 09:56
@Ber All of them result in the same set; certain linters advise you to use the one I advised. – Csaba Toth Jun 04 '19 at 18:15
but this removes the space in the string. How to maintain the space in the string? – Anudocs Nov 06 '19 at 09:51
@AnubhavJhalani You can add more Unicode categories to the filter. To preserve spaces and digits in addition to letters use `printable = {'Lu', 'Ll', 'Zs', 'Nd'}` – Ber Nov 07 '19 at 19:41
I suggest removing only control characters. See my answer for an example. – darkdragon Jun 23 '20 at 08:27
I found that after adding `'Zs'` to include spaces this method did not strip the `'\xa0'` character which Python does not seem to print. It is a 'non-breaking space' apparently. According to [this post](https://stackoverflow.com/questions/10993612/how-to-remove-xa0-from-string-in-python) you need to remove this manually which is a pain. – Bill Jan 18 '21 at 23:12

score 13 · Answer 4 · edited Jun 24 '20 at 21:59

In Python 3,

def filter_nonprintable(text):
    import itertools
    # Use characters of control category
    nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
    # Use translate to remove all non-printable characters
    return text.translate({character:None for character in nonprintable})

See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()

The ranges can be generated via nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc') using the Unicode character database categories as shown by @Ants Aasma.
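Spelled out as runnable code, generating the table from the Unicode database might look like the sketch below. CONTROL_TABLE and filter_control_chars are hypothetical names, and the table is built once at import time rather than on every call:

```python
import sys
import unicodedata

# Map every code point in category Cc (control) to None so translate() drops it.
CONTROL_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == 'Cc'
}


def filter_control_chars(text):
    """Remove all Cc (control) characters from text."""
    return text.translate(CONTROL_TABLE)
```

The resulting keys are the same code points as the two hard-coded ranges, range(0x00,0x20) and range(0x7f,0xa0).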

edited Jun 24 '20 at 21:59

darkdragon

answered Sep 14 '14 at 02:20

shawnrad

It would be better to use Unicode ranges (see @Ants Aasma's answer). The result would be `text.translate({c:None for c in itertools.chain(range(0x00,0x20),range(0x7f,0xa0))})`. – darkdragon Jun 23 '20 at 08:56

score 12 · Answer 5 · answered Jan 31 '19 at 00:58

The following will work with Unicode input and is rather fast...

import sys

# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}

def make_printable(s):
    """Replace non-printable characters in a string."""

    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)


assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''

My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.
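str.isprintable() treats newlines, tabs, and non-ASCII spaces such as U+00A0 as non-printable, so the table above removes them too. If line breaks and tabs should survive, they can be excluded when the table is built. A sketch under that assumption; KEEP and make_printable_keeping_line_breaks are hypothetical names:

```python
import sys

# Characters to keep even though str.isprintable() reports them as non-printable.
KEEP = frozenset('\n\r\t')

NOPRINT_KEEP_BREAKS_TABLE = {
    i: None
    for i in range(sys.maxunicode + 1)
    if not chr(i).isprintable() and chr(i) not in KEEP
}


def make_printable_keeping_line_breaks(s):
    """Remove non-printable characters but keep newlines, carriage returns, and tabs."""
    return s.translate(NOPRINT_KEEP_BREAKS_TABLE)
```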

answered Jan 31 '19 at 00:58

ChrisP

This is the only answer that works for me with unicode characters. Awesome that you provided test cases! – pir Sep 12 '19 at 02:02
If you want to allow for line breaks, add `LINE_BREAK_CHARACTERS = set(["\n", "\r"])` and `and not chr(i) in LINE_BREAK_CHARACTERS` when building the table. – pir Sep 12 '19 at 02:03
That should be the accepted answer. – Philippe Remy Apr 08 '21 at 02:07

score 7 · Answer 6 · edited Sep 27 '18 at 17:22

Yet another option in python 3:

import re, string

re.sub(f'[^{re.escape(string.printable)}]', '', my_string)

edited Sep 27 '18 at 17:22

Alex Myers

answered Sep 27 '18 at 15:16

c6401

This worked super great for me and its 1 line. thanks – Chop Labalagun Jun 27 '19 at 21:47
for some reason this works great on windows but cant use it on linux, i had to change the f for an r but i am not sure that is the solution. – Chop Labalagun Jul 13 '19 at 00:02
Sounds like your Linux Python was too old to support f-strings then. r-strings are quite different, though you could say `r'[^' + re.escape(string.printable) + r']'`. (I don't think `re.escape()` is entirely correct here, but if it works...) – tripleee Nov 29 '19 at 15:14
Sadly string.printable does not contain unicode characters, and thus ü or ó will not be in the output... – the_economist Feb 12 '21 at 13:00

score 6 · Answer 7 · edited Jan 14 '12 at 03:52

This function uses a generator expression and str.join, so it runs in linear time instead of O(n^2):

from curses.ascii import isprint

def printable(s):
    # `s` instead of `input` to avoid shadowing the built-in
    return ''.join(char for char in s if isprint(char))

edited Jan 14 '12 at 03:52

rmmh

answered Sep 18 '08 at 13:26

Kirk Strauser

score 4 · Answer 8 · edited Jun 24 '20 at 07:09

Based on @Ber's answer, I suggest removing only the characters whose Unicode category starts with 'C' (control, format, surrogate, private-use, and unassigned), as defined in the Unicode character database categories:

import unicodedata
def filter_non_printable(s):
    return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))
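The startswith('C') test covers more than the Cc control characters. A small sketch showing a few sample characters and their categories; the samples dictionary is illustrative only:

```python
import unicodedata


def filter_non_printable(s):
    """Drop every character whose Unicode category starts with 'C'."""
    return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))


# Sample characters from three different 'C' categories.
samples = {
    '\x00': 'Cc',    # NULL, a control character
    '\u200b': 'Cf',  # ZERO WIDTH SPACE, a format character
    '\ue000': 'Co',  # first code point of the private-use area
}

for ch, expected in samples.items():
    print('U+%04X' % ord(ch), unicodedata.category(ch), expected)
```

Letters, digits, punctuation, and all separators (including U+00A0) are left in place by this filter.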

edited Jun 24 '20 at 07:09

answered Jun 23 '20 at 08:25

darkdragon

This is a great answer! – tdc Jun 23 '20 at 22:42
You may be on to something with `startswith('C')` but this was far less performant in my testing than any other solution. – Big McLargeHuge Oct 08 '20 at 19:13
big-mclargehuge: The goal of my solution was the combination of completeness and simplicity/readability. You could try to use `if unicodedata.category(c)[0] != 'C'` instead. Does it perform better? If you prefer execution speed over memory requirements, one can pre-compute the table as shown in https://stackoverflow.com/a/93029/3779655 – darkdragon Oct 11 '20 at 15:25

score 2 · Answer 9 · edited Jan 07 '18 at 02:45

The one below performs faster than the others above. Take a look:

import string

''.join([x if x in string.printable else '' for x in Str])

edited Jan 07 '18 at 02:45

answered Jan 07 '18 at 02:13

Nilav Baran Ghosh

`"".join([c if 0x21<=ord(c) and ord(c)<=0x7e else "" for c in ss])` – evandrix Jun 24 '19 at 20:55

score 2 · Answer 10 · answered Jul 05 '18 at 07:04

In Python there's no POSIX regex classes

There are when using the regex library: https://pypi.org/project/regex/

It is well maintained and supports Unicode regex, POSIX regex and many more. The usage (method signatures) is very similar to Python's re.

From the documentation:

[[:alpha:]]; [[:^alpha:]]

POSIX character classes are supported. These are normally treated as an alternative form of \p{...}.

(I'm not affiliated, just a user.)

score 2 · Answer 11 · edited Sep 18 '08 at 13:47

The best I've come up with now is (thanks to the python-izers above):

def filter_non_printable(s):
    # keep everything above the ASCII control range, plus tab (9)
    return ''.join([c for c in s if ord(c) > 31 or ord(c) == 9])

This is the only way I've found that works with Unicode characters/strings.

Any better options?

edited Sep 18 '08 at 13:47

answered Sep 18 '08 at 13:17

Vinko Vrsalovic

Unless you're on python 2.3, the inner []s are redundant. "return ''.join(c for c ...)" – habnabit Sep 19 '08 at 04:08
Not quite redundant—they have different meanings (and performance characteristics), though the end result is the same. – Miles Jun 03 '09 at 23:31
Should the other end of the range not be protected too?: "ord(c) <= 126" – Gearoid Murphy Mar 16 '11 at 17:48
But there are Unicode characters which are not printable, too. – tripleee Aug 14 '12 at 08:02

score 0 · Answer 12 · answered Sep 11 '17 at 05:22

To remove whitespace such as tabs and newlines:

import re
t = """
\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>
"""
pat = re.compile(r'[\t\n]')
print(pat.sub("", t))

answered Sep 11 '17 at 05:22

knowingpark

Actually you don't need the square brackets either then. – tripleee Nov 29 '19 at 15:16

score 0 · Answer 13 · answered Jun 17 '20 at 19:42

Adapted from answers by Ants Aasma and shawnrad:

nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
ord_dict = {ord(character):None for character in nonprintable}
def filter_nonprintable(text):
    return text.translate(ord_dict)

# usage (named `text` to avoid shadowing the built-in str)
text = "this is my string"
text = filter_nonprintable(text)
print(text)

Tested on Python 3.7.7.
