4

This is my first time posting on Stack. I would really appreciate it if someone could assist me with this.

I’m trying to remove Unicode characters (\x3a in my case) from a text file containing the following:

10\x3a00\x3a00

The final output is supposed to be:

100000

Basically, we are being instructed to delete all traces of \xXX where X can be any of the following: 0123456789ABCDEF. I tried using regular expressions as follows to delete any \xXX.

re.sub('\\\x[a-fA-F0-9]{2}', "", a)

Where “a” is a line of a text file.

When I try that, I get an error saying “invalid \x escape”.

I’ve been struggling with this for hours. What’s wrong with my regular expression?

Aqueous Carlos
Kamran
  • Possible duplicate of [Removing unicode \u2026 like characters in a string in python2.7](https://stackoverflow.com/questions/15321138/removing-unicode-u2026-like-characters-in-a-string-in-python2-7) – Rick Majpruz Oct 09 '17 at 22:54
  • @RickMajpruz I tried the solution mentioned in that question but the output gives me 10:00:00 (not 100000). Don't know why that is. – Kamran Oct 09 '17 at 23:14

2 Answers

3

The character "\x3a" is not a multi-byte Unicode character; it is the ASCII character ":". When Python parses the string literal "\x3a", it converts the escape and stores the single character ":", so no backslash remains in the string. You therefore can't strip out "\x3a" as if it were a multi-byte Unicode sequence, because Python sees only the single-byte ASCII character ":".

$ python
>>> '\x3a' == ':'
True
>>> "10\x3a00\x3a00" == "10:00:00"
True

Check out the description section of the Wikipedia article on UTF-8. Note that characters in the range U+0000-U+007F are each encoded as a single byte, identical to their ASCII encoding.

If you want to strip non-ASCII characters, do the following:

>>> print u'R\xe9n\xe9'
Réné
>>> ''.join([x for x in u'R\xe9n\xe9' if ord(x) < 128])
u'Rn'
>>> ''.join([x for x in 'Réné' if ord(x) < 128])
'Rn'

If you want to retain European characters but discard Unicode characters with higher code points, raise the 128 in ord(x) < 128 to some higher value (for example, 256 keeps the Latin-1 range).
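The same filter can be wrapped in a small function so the cut-off is a parameter. This is only a sketch, written for Python 3 (where every str is Unicode); keep_below and its limit argument are names made up for illustration, not part of any library.

```python
def keep_below(text, limit=128):
    """Return text without the characters whose code point is >= limit."""
    return ''.join(ch for ch in text if ord(ch) < limit)


# Default limit of 128 keeps ASCII only.
ascii_only = keep_below('R\xe9n\xe9')

# A limit of 256 keeps the Latin-1 range but drops U+2026 (horizontal ellipsis).
latin1_only = keep_below('R\xe9n\xe9\u2026', 256)

print(ascii_only)
print(latin1_only)
```

Choosing the limit is a judgment call: 128 removes everything outside ASCII, while a higher bound keeps accented letters at the cost of letting more characters through.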

The post "replace 3 byte unicode" has another approach. You can also strip out code point ranges with:

>>> s = u'[\uE000-\uFFFF]'
>>> len(s)
5
>>> import re
>>> pattern = re.compile(u'[\uE000-\uFFFF]', re.UNICODE)
>>> pattern.sub('?', u'ab\uFFFDcd')
u'ab?cd'

Notice that working with \u may be easier than working with \x for specifying characters.

On the other hand, you could have the string "\\x3a" which you could strip out. Of course, that string isn't actually a multi-byte Unicode character but rather 4 ASCII characters.

$ python
>>> print '\\x3a'
\x3a
>>> '\\x3a' == ':'
False
>>> '\\x3a' == '\\' + 'x3a'
True
>>> (len('\x3a'), len('\\x3a'))
(1, 4)
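If the text file really does hold those four literal characters, which is what the sample in the question suggests, a regular expression can remove them as long as the pattern is written as a raw string. In a non-raw literal such as '\\\x[a-fA-F0-9]{2}', Python reads an escaped backslash followed by an incomplete \x escape and fails before re ever sees the pattern. A minimal sketch, where strip_hex_escapes is a hypothetical helper name:

```python
import re

# Raw string: the regex engine receives \\x, meaning a literal backslash then "x",
# followed by exactly two hexadecimal digits.
HEX_ESCAPE = re.compile(r'\\x[0-9A-Fa-f]{2}')


def strip_hex_escapes(line):
    """Delete every literal backslash-x-hex-hex sequence from line."""
    return HEX_ESCAPE.sub('', line)


# A raw literal stands in for a line read from the file: 14 plain characters.
line = r'10\x3a00\x3a00'
print(strip_hex_escapes(line))
```

A line read from a file with open(...) arrives in this literal form already, so no raw prefix is needed on the data itself, only on the pattern.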

You can also strip out the ASCII character ":":

>>> "10:00:00".replace(":", "")
'100000'
>>> "10\x3a00\x3a00".replace(":", "")
'100000'
>>> "10\x3a00\x3a00".replace("\x3a", "")
'100000'
Rick Majpruz
-1

Try this:

import re
# match a literal backslash, "x", and exactly two hex digits
tagRe = re.compile(r'\\x[0-9a-fA-F]{2}')
normalText = tagRe.sub('', myText)

Replace myText with your string.

Yilmazam