The character "\x3a" is not a multi-byte Unicode character. It is the ASCII character ":". Once you write the string literal "\x3a", Python stores it internally as the single character ":", so by the time your code runs there is no backslash left to act on. You can't strip out "\x3a" as a multi-byte Unicode character because Python only sees the single-byte ASCII character ":".
$ python
>>> '\x3a' == ':'
True
>>> "10\x3a00\x3a00" == "10:00:00"
True
Check out the description section of the Wikipedia article on UTF-8. Characters in the range U+0000-U+007F are each encoded as a single byte, identical to their ASCII encoding.
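To make the byte counts concrete, here is a small sketch that measures the UTF-8 length of a few characters. It assumes Python 3 (where every str is Unicode), unlike the Python 2 sessions in this answer, and the helper name utf8_length is just an illustrative choice.

```python
# Python 3 sketch: count the UTF-8 bytes needed for a single character.
def utf8_length(ch):
    """Return the number of bytes in the UTF-8 encoding of ch."""
    return len(ch.encode('utf-8'))

# ':' is ASCII; the others sit in progressively higher code point ranges.
for ch in (':', '\xe9', '\u20ac', '\U0001f600'):
    print('U+{:04X} takes {} byte(s)'.format(ord(ch), utf8_length(ch)))
```

Only the first character, ":" (U+003A), falls in the single-byte range; it is the same character that "\x3a" produces.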
If you want to strip non-ASCII characters, do the following:
>>> print u'R\xe9n\xe9'
Réné
>>> ''.join([x for x in u'R\xe9n\xe9' if ord(x) < 128])
u'Rn'
>>> ''.join([x for x in 'Réné' if ord(x) < 128])
'Rn'
If you want to retain European characters but discard Unicode characters with higher code points, change the 128 in ord(x) < 128 to some higher value (on a unicode string, 256 keeps everything through the Latin-1 Supplement block, U+0080-U+00FF).
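The same filter could be wrapped in a function with an adjustable cutoff. This is a minimal sketch assuming Python 3; keep_below, ascii_only and the limit parameter are hypothetical names, not anything from the standard library. A limit of 128 keeps every ASCII code point (0 through 127).

```python
def keep_below(text, limit=128):
    """Keep only characters whose code point is below limit."""
    return ''.join(ch for ch in text if ord(ch) < limit)

def ascii_only(text):
    """Alternative: let the ASCII codec silently drop what it cannot encode."""
    return text.encode('ascii', 'ignore').decode('ascii')

print(keep_below('R\xe9n\xe9'))             # accented letters are dropped
print(keep_below('R\xe9n\xe9\u4e2d', 256))  # Latin-1 kept, U+4E2D dropped
print(ascii_only('R\xe9n\xe9'))
```

The encode/decode variant is shorter but only works for the ASCII cutoff, whereas the comprehension lets you pick any limit.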
The post "replace 3 byte unicode" has another approach. You can also strip out code point ranges with:
>>> s = u'[\uE000-\uFFFF]'
>>> len(s)
5
>>> import re
>>> pattern = re.compile(u'[\uE000-\uFFFF]', re.UNICODE)
>>> pattern.sub('?', u'ab\uFFFDcd')
u'ab?cd'
Notice that working with \u may be easier than working with \x for specifying characters.
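If the range needs to vary, the character class can be built from integer code points. The following is a sketch assuming Python 3; replace_code_point_range and its parameters are hypothetical names chosen for illustration.

```python
import re

def replace_code_point_range(text, first, last, replacement='?'):
    """Replace every character with a code point in [first, last]."""
    # re.escape guards against range endpoints that are special inside
    # a character class, such as ']' or '\'.
    pattern = re.compile(
        '[{}-{}]'.format(re.escape(chr(first)), re.escape(chr(last))))
    return pattern.sub(replacement, text)

print(replace_code_point_range('ab\ufffdcd', 0xE000, 0xFFFF))
```

Passing an empty string as replacement strips the characters instead of substituting a placeholder.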
On the other hand, you could have the string "\\x3a", which you could strip out. Of course, that string isn't a multi-byte Unicode character but rather the 4 ASCII characters "\", "x", "3" and "a".
$ python
>>> print '\\x3a'
\x3a
>>> '\\x3a' == ':'
False
>>> '\\x3a' == '\\' + 'x3a'
True
>>> (len('\x3a'), len('\\x3a'))
(1, 4)
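If the data really does contain such literal four-character sequences (for example, text that was escaped before it reached you), a regular expression can remove them. This is a sketch assuming Python 3 and assuming the sequences always have the form backslash, "x", two hex digits; the names are hypothetical.

```python
import re

# A literal backslash, the letter x, then exactly two hex digits.
LITERAL_HEX_ESCAPE = re.compile(r'\\x[0-9a-fA-F]{2}')

def strip_literal_hex_escapes(text):
    """Remove literal \\xNN sequences (four ASCII characters each)."""
    return LITERAL_HEX_ESCAPE.sub('', text)

raw = r'10\x3a00\x3a00'  # raw string: the backslashes are kept as-is
print(strip_literal_hex_escapes(raw))
```

A string written as "10\x3a00\x3a00" without the r prefix passes through unchanged, because it contains ":" characters rather than backslashes.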
You can also strip out the ASCII character ":":
>>> "10:00:00".replace(":", "")
'100000'
>>> "10\x3a00\x3a00".replace(":", "")
'100000'
>>> "10\x3a00\x3a00".replace("\x3a", "")
'100000'