I am trying to remove all non-alphanumeric characters except emojis, so I wrote the following code:

>>> import re
>>> re.sub(r"[^a-zA-Z0-9_#@\s\U00010000-\U0010ffff]", '', "THAT ASc776 ^ .? + #> _")

It works fine and returns:

'THAT ASc776  ?  #> _'

But if I put emoji in the text, I still get the same result:

>>> re.sub(r"[^a-zA-Z0-9_#@\s\U00010000-\U0010ffff]", '', "THAT ASc776 ^ .? + #> _")
'THAT ASc776  ?  #> _'

I realize that emojis are Unicode characters, so I also tried the following:

>>> RE_EMOJI = re.compile('[^\U00010000-\U0010ffffa-zA-Z0-9_#@\s]', flags=re.UNICODE)
>>> RE_EMOJI.sub('','AHAT ASc776 ^ .? + #> _')
'AHAT ASc776  ?  #> _'

But it still doesn't recognize the emoji. So what's the correct way to remove all non-alphanumeric characters except emojis from a text?

EDIT:

With Python 3.5 the code works and produces the expected output. However, I am using Python 2.7, and it doesn't work there.

KarateKid
  • Well, your emoji is a Unicode character with [code U+1F60B](https://emojipedia.org/face-savouring-delicious-food/). Isn't it out of the range you defined: `\U00010000-\U0010ffff`? – running.t Jul 05 '18 at 12:12
  • [`print(re.sub(r"[^a-zA-Z0-9_#@\s\U00010000-\U0010ffff]", '', "THAT ASc776 ^ .? + #> _"))`](https://ideone.com/h6eCiZ) returns `THAT ASc776 # _` – Wiktor Stribiżew Jul 05 '18 at 12:16
  • @WiktorStribiżew Not for me, at least. I tried it with Python 2.7 on Mac and Ubuntu. – KarateKid Jul 05 '18 at 12:19
  • Aha, then you are missing `u` prefix with the Unicode strings (at least) – Wiktor Stribiżew Jul 05 '18 at 12:20
  • @running.t I think that the difference in encoding: https://stackoverflow.com/questions/30186631/what-is-the-difference-between-utf-32-and-ucs-4 – KarateKid Jul 05 '18 at 12:20
  • @glibdud: Thanks for the link. But I would say emoji are an important subset of Unicode characters, and if there is a solution particularly suited to emojis then it should be mentioned. It's also good if not everyone has to construct their own solution from the general Unicode character set. – KarateKid Jul 05 '18 at 12:22
  • @WiktorStribiżew Okay, I just tried it with python3.5 and you are right it works fine. However, it doesn't work for python2.7 on the same machine. This is very annoying. – KarateKid Jul 05 '18 at 12:25
  • `print(re.sub(ur"[^a-zA-Z0-9_#@\s\U00010000-\U0010ffff]", '', u"THAT ASc776 ^ .? + #> _"))` returns `THAT ASc776 # _` in my Python 2.7.12 (default, Dec 4 2017, 14:50:18) on Ubuntu 16.04. – Wiktor Stribiżew Jul 05 '18 at 12:46
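
The approach discussed in the comments can be sketched as follows. This is a minimal illustration, not a definitive answer: the names `NON_KEPT` and `strip_symbols` are hypothetical, the sample string appends U+1F60B (the emoji mentioned in the comments) as an escape, and the code is written for Python 3. Unicode literals (`u"..."`) are used for both the pattern and the input, following the suggestion above about the missing `u` prefix; whether that is sufficient on a given Python 2.7 installation is an assumption that has to be checked there.

```python
import re

# Keep ASCII letters, digits, "_", "#", "@", whitespace, and every code point
# from U+10000 to U+10FFFF (the range used in the question).
# NON_KEPT is a hypothetical name chosen for this sketch.
NON_KEPT = re.compile(u"[^a-zA-Z0-9_#@\\s\U00010000-\U0010ffff]")


def strip_symbols(text):
    """Delete every character that is not in the keep-list above."""
    return NON_KEPT.sub(u"", text)


# Sample input from the question, with U+1F60B appended as an escape so the
# source file itself stays pure ASCII.
sample = u"THAT ASc776 ^ .? + #> _ \U0001F60B"
cleaned = strip_symbols(sample)
print(cleaned)
```

Note that the range `\U00010000-\U0010ffff` is broader and narrower than "emoji" at the same time: it keeps every supplementary-plane character, not only emoji, and it removes emoji that live below U+10000, such as U+263A or U+2764.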
