Replace all non alphanumeric characters except emojis

Question

I am trying to remove all non-alphanumeric characters except emojis, so I wrote the following code:

>>> import re
>>> re.sub(r"[^a-zA-Z0-9_#@\s\U00010000-\U0010ffff]", '', "THAT ASc776 ^ .? + #> _")

It works fine and returns:

'THAT ASc776  ?  #> _'

But if I put emoji in the text, I still get the same result:

>>> re.sub(r"[^a-zA-Z0-9_#@\s\U00010000-\U0010ffff]", '', "THAT ASc776 ^ .? + #> _")
'THAT ASc776  ?  #> _'

I realize that emojis are Unicode characters, so I also tried the following:

>>> RE_EMOJI = re.compile('[^\U00010000-\U0010ffffa-zA-Z0-9_#@\s]', flags=re.UNICODE)
>>> RE_EMOJI.sub('','AHAT ASc776 ^ .? + #> _')
'AHAT ASc776  ?  #> _'

But it still doesn't recognize the emoji. So what's the correct way to remove all non-alphanumeric characters except emojis from a text?

EDIT:

With Python 3.5 the code works and produces the correct output. However, I am using Python 2.7, and it doesn't work there.

Well, your emoji is a Unicode character with [code U+1F60B](https://emojipedia.org/face-savouring-delicious-food/). Isn't it outside the range you defined: `\U00010000-\U0010ffff`? — running.t, Jul 05 '18 at 12:12
[`print(re.sub(r"[^a-zA-Z0-9_#@\s\U00010000-\U0010ffff]", '', "THAT ASc776 ^ .? + #> _"))`](https://ideone.com/h6eCiZ) returns `THAT ASc776 # _` — Wiktor Stribiżew, Jul 05 '18 at 12:16
@WiktorStribiżew Not for me, at least. Tried it with Python 2.7 on Mac and Ubuntu — KarateKid, Jul 05 '18 at 12:19
Aha, then you are missing `u` prefix with the Unicode strings (at least) — Wiktor Stribiżew, Jul 05 '18 at 12:20
@running.t I think that's the difference in encoding: https://stackoverflow.com/questions/30186631/what-is-the-difference-between-utf-32-and-ucs-4 — KarateKid, Jul 05 '18 at 12:20
@glibdud: Thanks for the link. But I would say emoji are an important subset of Unicode characters, and if there is a solution particularly suited to emojis then it should be mentioned. Also, it's good if not everyone has to construct their own solution from the general Unicode character set. — KarateKid, Jul 05 '18 at 12:22
@WiktorStribiżew Okay, I just tried it with Python 3.5 and you are right, it works fine. However, it doesn't work with Python 2.7 on the same machine. This is very annoying. — KarateKid, Jul 05 '18 at 12:25
`print(re.sub(ur"[^a-zA-Z0-9_#@\s\U00010000-\U0010ffff]", '', u"THAT ASc776 ^ .? + #> _"))` returns `THAT ASc776 # _` in my Python 2.7.12 (default, Dec 4 2017, 14:50:18) on Ubuntu 16.04. — Wiktor Stribiżew, Jul 05 '18 at 12:46
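The comments above point to the missing `u` prefix as the difference between Python 2.7 and Python 3.5. Here is a minimal sketch of that idea, assuming both the pattern and the input are Unicode strings. The names `KEEP_PATTERN` and `strip_symbols` are made up for illustration, and the sample string appends the emoji as an escape, since its original position in the question's input is not known. It also shows that U+1F60B does fall inside the `\U00010000-\U0010ffff` range.

```python
from __future__ import unicode_literals  # makes plain literals Unicode on Python 2.7; no effect on Python 3

import re

# Remove everything that is NOT an ASCII letter, digit, "_", "#", "@",
# whitespace, or a code point above the Basic Multilingual Plane
# (where most emoji, including U+1F60B, live).
# Assumption: on Python 2.7 this needs a "wide" build; a "narrow" build
# stores such characters as surrogate pairs, so the range may not behave
# as intended there.
KEEP_PATTERN = re.compile("[^a-zA-Z0-9_#@\\s\U00010000-\U0010ffff]")


def strip_symbols(text):
    """Return `text` with the unwanted characters removed (hypothetical helper)."""
    return KEEP_PATTERN.sub("", text)


# U+1F60B lies inside the range used in the character class.
emoji_in_range = 0x10000 <= 0x1F60B <= 0x10FFFF

sample = "THAT ASc776 ^ .? + #> _ \U0001F60B"
cleaned = strip_symbols(sample)
print(emoji_in_range)
print(repr(cleaned))
```

Under these assumptions the punctuation is dropped while the spaces, `#`, `_`, and the emoji survive. If the input arrives as a byte string on Python 2.7, it would need to be decoded first (for example with `.decode('utf-8')`) before being passed in.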
